4  Linear mixed model I

Author

Alejandro Morales & Joost van Heerwaarden

4.1 Nested data

4.1.1 Simulating the data

Let’s simulate a dataset that resembles the population of frogs in buckets. We will assume 30 buckets with 30 frogs each. As in the lectures, each bucket will be a different color:

n_buckets <- 30
n_within <- 30
bucket <- as.factor(rep(1:n_buckets, each = n_within))
color <- as.factor(rep(rainbow(n_buckets), each = n_within))
id <- factor(rep(1:n_within, length.out = n_within * n_buckets)) # frog id within a buckets
dat <- data.frame(id, bucket, color)
head(dat)

Let’s now simulate the weight of the frogs by defining some parameters:

mu <- 60 # Mean frog weight across all buckets
sigma_bucket <- 25 # Standard deviation of mean frog weight across buckets
sigma_within <- 10 # Standard deviation of frog weight within a bucket

First we sample from the distribution of buckets:

bucket_mu <- rnorm(n_buckets, mean = mu, sd = sigma_bucket)
dat$bucket_mu <- bucket_mu[dat$bucket]

We create the linear model that describes the weight of the frogs. Note that we create a design matrix without an intercept term, because we are using the mean of each bucket rather than the effect of each bucket on frog weight (we can drop the intercept by adding - 1 to the formula):

X <- model.matrix(~ bucket - 1, data = dat)
beta <- bucket_mu
frog_effect <- rnorm(nrow(X), mean = 0, sd = sigma_within)
dat$weight <- X %*% beta + frog_effect

Note that frog_effect includes the variation in weight across frogs as well as observation errors. Let’s take a look at the data:

head(dat)
Exercise 4.1
  1. Count how many buckets there are per color. Can you estimate the effect of color on frog weight?

  2. Visualize the data. What pattern do you see?

Solution 4.1
  1. We can check how many buckets there are per color by using table() to create a contingency table and dividing by the number of frogs per bucket:
bucket_per_color <- with(dat, table(bucket, color))/n_within
bucket_per_color

Then we can add up each column of this table to get the number of buckets per color:

colSums(bucket_per_color)

We can see that there is only one bucket for each colour. That is, we have no replicates for color and therefore no variation (i.e., no replication of the effect of each color). A common misconception in this scenario is to treat the 30 frogs per bucket as replicates for the effect of color, but this is not the case because of the nested nature of the data. We often refer to these as pseudo-replicates.

  1. We can visualize the data using a boxplot (against bucket and paint the boxes with the corresponding color):
boxplot(weight ~ bucket, data = dat, col = levels(color))

We can see that the varaition across buckets is substantial, in fact it is larger than the variation within buckets. That makes sense given that sigma_bucket > sigma_within.

4.1.2 Building the covariance matrix

Let’s build the covariance matrix for this dataset based on what we learnt in the lecture. First, the variances across buckets and within buckets are:

Vb  <- sigma_bucket^2
Ve  <- sigma_within^2

Now the total variance can be expressed as the sum of the variance across buckets and the variance within buckets:

Vbe <- Vb + Ve # variance (diagonal)

The covariance between the weights of frogs within a bucket is equal to the variance across buckets, (while the covariance between the weights of frogs in different buckets is zero). The correlation can be calculated from the covariance and the variances:

cov <- Vb # covariance (off-diagonal)
rho <- Vb/(Vb+Ve) # correlation
c(Vb, Ve, Vbe, cov, rho)

Note that we use the suffix e for the latter as it correspond to the residual or error term in statistical models. How can we combine these values to create the covariance matrix? First we now that this matrix should be a square matrix with the same number of rows as our dataset:

V <- matrix(0, nrow = nrow(dat), ncol = nrow(dat)) # Initialize the matrix

The diagonal elements of the covariance matrix are the total variance:

diag(V) <- Vbe

The off-diagonal elements of the covariance matrix can be either 0 or cov depending on whether the frogs are in the same bucket or not. If we look at the data above we can see that they are sorted by bucket (i.e., the first 30 rows are all frogs from bucket 1, the next 30 rows are all frogs from bucket 2, and so on). We can use this to create a matrix that indicates which frogs are in the same bucket. The outer() function can be used to create a matrix of logical values that indicates whether the frogs are in the same bucket (i.e., this matrix has the same dimensions as V and has TRUE for the combinations of frogs that are in the same bucket and FALSE for those that are not):

STM <- outer(dat$bucket, dat$bucket, "==") # Combinations with the same bucket

Now we can use this matrix to fill the off-diagonal elements of the covariance matrix:

V[STM] <- cov

Finally, we can visualize the covariance matrix using the image() function:

image(V)

If you want to read the numbers in the matrix you should only print a subset of it (otherwise it will be too large, sometimes you also need to do this for the visualization as it can take very long to render a very large matrix):

V[1:10, 1:10]

If we print a row in the matrix we can see that the first 30 elements are equal to the variance or covariance, while the rest are equal to zero, as expected:

V[1, ]

In this case the data structure is simple enough that we could manually fill the covariance matrix, but in practice this is not the case. In practice we will create these matrix from fitted models (see exercise below).

4.1.3 Estimating the covariance matrix from the sample

The values above are calculated from the true variances in the population, but we can estimate them from the sample data. First we calculate the average of the variance within buckets:

Ve_emp <- with(dat, tapply(weight, bucket, var))
mean_Ve_emp <- mean(Ve_emp)
c(Ve, mean_Ve_emp)

That is indeed pretty close. The total variance can be estimated from the data by ignoring the buckets (i.e., just the variance of all the data pooled together):

Vbe_emp <- var(dat$weight)
c(Vbe, Vbe_emp)

From this we can estimate the variance of means across buckets:

Vb_emp <- Vbe_emp - mean_Ve_emp
c(Vb, Vb_emp)

Finally, we can estimate the induced correlation from the empirical estimates:

rho_emp <- Vb_emp/Vbe_emp
c(rho, rho_emp)
Exercise 4.2
  1. Build the covariance matrix for the dataset using the empirical estimates of the variances and covariances (for the GLS model weight ~ color - 1).
  2. Visualize the covariance matrix.
  3. Apply the linear algebra of generalized least squares to estimate the mean frog weight of each bucket and its standard error.
  4. Fit a linear model using color as a fixed effect while ignoring the nested structure of the data. Compare these estimates to the ones using the linear algebra of GLS.
Solution 4.2
  1. We can build the covariance matrix using the empirical estimates of the variances and covariances. First we create the design matrix X from the data (in this case we could also just look at the number of rows in the dataset of course):
X <- model.matrix(weight ~ color - 1, data = dat)

Then we build the covariance matrix as show above, but using the empirical estimates:

V <- matrix(0, nrow = nrow(X), ncol = nrow(X)) # Initialize the matrix
STM <- outer(dat$color, dat$color,"==") # Combinations with the same color
V[STM] <- Vb_emp # Sample-based covariance
diag(V) <- Vbe_emp # Sample-based variance
  1. We can visualize the covariance matrix using the image() function. We will use only the first 300 rows and columns:
image(V[1:300, 1:300])
  1. We can estimate the mean frog weight of each bucket and its standard error using the linear algebra of generalized least squares. First we calculate the inverse of the covariance:
Vi <- solve(V)

Then we calculate the estimate of the mean frog weight of each bucket:

coef <- solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi %*% dat$weight
coef

And the standard error of the estimate:

se <- sqrt(diag(solve(t(X) %*% Vi %*% X)))
se
  1. We can compare these estimates to the ones from a linear model that assumes independence of the residuals. For simplicity I just use the lm() function directly:
fit <- lm(weight ~ color - 1, data = dat)
summary(fit)

Let’s compare the results

cbind(EstimateGLS = as.numeric(coef), Std.ErrorGLS = se, coef(summary(fit))[, c("Estimate", "Std. Error")])

We can see that the estimates of the mean frog weight for each color are the same, as expected. The standard errors are larger for GLS because it accounts for the nested structure of the data (i.e., the effective sample size is smaller).

4.2 Mixed effect models

We are going to simulate another population of frogs in buckets but this time we will have replicates for each color. We will assume 8 buckets with 5 frogs per bucket, but only four bucket colors.

n_buckets <- 8
n_within <- 5
bucket <- as.factor(rep(1:n_buckets, each = n_within))
color <- factor(sort(rep(c("black","blue","green","red"),
                 length.out = n_within*n_buckets)))
id <- factor(rep(1:n_within, length.out = n_within * n_buckets))
dat <- data.frame(id, bucket, color)

Let’s simulate the weight of the frogs by defining some parameters:

mu <- 60 # Mean frog weight across all buckets
sigma_bucket <- 30 # Standard deviation of mean frog weight across buckets
sigma_within <- 15 # Standard deviation of frog weight within a bucket

First we sample from the distribution of buckets:

bucket_mu <- rnorm(n_buckets, mean = mu, sd = sigma_bucket)
dat$bucket_mu <- bucket_mu[dat$bucket]

And now generate the data:

X <- model.matrix(~ bucket - 1, data = dat)
beta <- bucket_mu
frog_effect <- rnorm(nrow(X), mean = 0, sd = sigma_within)
dat$weight <- X %*% beta + frog_effect
Exercise 4.3
  1. Count how many buckets the are per color.

  2. Visualize the data. What do you see?

  3. Calculate the correlation between observations within a bucket from the population parameters and compare it to the empirical correlation.

Solution 4.3
  1. We can check how many buckets there are per color by using table() to create a contingency table and dividing by the number of frogs per bucket:
bucket_per_color <- with(dat, table(bucket, color))/n_within

Then we can add up for each column to get the number of buckets per color:

colSums(bucket_per_color)

We can see that there are now two buckets for each colour.

  1. We can visualize the data using a boxplot:
boxplot(weight ~ bucket, data = dat, col = levels(color))

As before, we can see that the variation across buckets is substantial, in fact it is larger than the variation within buckets. That makes sense given that sigma_bucket > sigma_within.

  1. We can calculate the correlation among observations within a bucket from the population parameters using the relevant formula:
rho_pop <- sigma_bucket^2/(sigma_bucket^2 + sigma_within^2)
rho_pop

We can also estimate the correlation from the data:

with(dat, cor(weight[id == 3], weight[id == 4]))
with(dat, cor(weight[id == 1], weight[id == 2]))
with(dat, cor(weight[id == 2], weight[id == 3]))
with(dat, cor(weight[id == 1], weight[id == 3]))

We can see that the empirical correlation values are not close to the population value. This is because of the relative small sample size which increases the variability of the estimates.

Let’s compare a linear model that ignores the nested structure of the data with a mixed effect model that accounts for it. First we fit the linear model:

lm2 <- lm(weight ~ color, data = dat)
summary(fit)
anova(fit)

And now we fit the mixed effect model using the package lme4 with and without color:

library(lme4) # Load the package. If you don't have it, install it first
library(lmerTest) # Load the package. If you don't have it, install it first
lmm1 <- lmer(weight ~ 1 + (1|bucket), data=dat)
lmm2 <- lmer(weight ~ color + (1|bucket), data=dat)

We can perform the F-test on a mixed effect model using the anova() function from the lmerTest package. There are different methods to calculate the degrees of freedom for mixed effect models, but we will always use the Kenward-Roger method (the reason is that for simple, balanced designs, the Kenward-Roger method will give the same results as a classical anova):

anova(lmm2, ddf="Kenward-Roger")

We can see that accounting for the nested structure of the data changes the statistical significance of the fixed effect of color. This is because in the F-test, the linear model assumes 36 degrees of freedom for the residuals which ignores the fact that color is a property of buckets and the frogs are nested within buckets. The linear mixed model does this correctly (numDF = n_colors - 1 and DenDF = n_buckets - n_colors).

Exercise 4.4
  1. Compare the results of the linear model and the mixed effect model. What do you see?
  2. Compare the results of the mixed effect model with and without color. What do you see?
  3. Calculate the covariance matrix of the mixed effect model with color using the linear algebra of generalized least squares and visualize it.
  4. Calculate the standard errors of the estimates of the mixed effect model with color using the linear algebra of generalized least squares (tip: use as.data.frame(VarCorr()) to extract the estimated variances).
Solution 4.4
  1. We can compare the results of the linear model and the mixed effect model:
summary(lm2)
summary(lmm2)

We can see that the estimates of the fixed effect are the same, but the standard errors are larger for the mixed effect model because it accounts for the nested structure of the data.

  1. We can compare the results of the mixed effect model with and without color:
summary(lmm1)
summary(lmm2)

We can see that the estimates of the fixed effects are different, as expected, but also the model without color as a fixed effect has a larger variance for the random effect of bucket. This is a reflection of the fact that the fixed effect of color explains some of the variation across buckets and that is missing in the first model. As the variance explained by the fixed effects decreases, the model will try to compensate by increasing the variance of the random effects.

  1. First we extract the variance components from the mixed effect model:
VC <- as.data.frame(VarCorr(lmm2))
Var_b <- VC$vcov[1]
Var_e <- VC$vcov[2]

Then we construct the covariance matrix from the design matrix and the variance components. Remember that the total variance is the sum of the variance across buckets and the variance within buckets, and that the covariance between the weights of frogs within a bucket is equal to the variance across buckets:

X<-model.matrix(lmm2)
V <- matrix(0,nrow(X),nrow(X))
STM <- with(dat, outer(bucket, bucket, "=="))
V[STM] <- Var_b
diag(V) <- Var_b + Var_e

We can visualize the covariance matrix using the image() function:

image(V)
  1. We can calculate the estimates and their standard errors with the usual linear algebra:
Vi <- solve(V)
coef <- solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi %*% dat$weight
se <- sqrt(diag(solve(t(X) %*% Vi %*% X)))
cbind(coef, se)
summary(lmm2)$coefficients[,1:2]