4  Linear mixed model I

Author

Alejandro Morales & Joost van Heerwaarden

4.1 Nested data

4.1.1 Simulating the data

Let’s simulate a dataset that resembles the population of frogs in buckets. We will assume 30 buckets with 30 frogs each. As in the lectures, each bucket will be a different color:

n_buckets <- 30
n_within <- 30
bucket <- as.factor(rep(1:n_buckets, each = n_within))
color <- as.factor(rep(rainbow(n_buckets), each = n_within))
id <- factor(rep(1:n_within, length.out = n_within * n_buckets)) # frog id within a buckets
dat <- data.frame(id, bucket, color)

Let’s now simulate the weight of the frogs by defining some parameters:

mu <- 60 # Mean frog weight across all buckets
sigma_bucket <- 25 # Standard deviation of mean frog weight across buckets
sigma_within <- 10 # Standard deviation of frog weight within a bucket

First we sample from the distribution of buckets:

bucket_mu <- rnorm(n_buckets, mean = mu, sd = sigma_bucket)
dat$bucket_mu <- bucket_mu[dat$bucket]

We create the linear model that describes the weight of the frogs. Note that we create a design matrix without an intercept term, because we are using the mean of each bucket rather than the effect of each bucket on frog weight:

X <- model.matrix(~ bucket - 1, data = dat)
beta <- bucket_mu
frog_effect <- rnorm(nrow(X), mean = 0, sd = sigma_within)
dat$weight <- X %*% beta + frog_effect

Note that frog_effect includes the variation in weight across frogs as well as observation errors. Let’s take a look at the data:

head(dat)
Exercise 4.1
  1. Count how many buckets the are per color. Can you estimate the effect of color on frog weight?

  2. Visualize the data. What do you see?

Solution 4.1
  1. We can check how many buckets there are per color by using table() to create a contingency table and dividing by the number of frogs per bucket:
bucket_per_color <- with(dat, table(bucket, color))/n_frogs

Then we can add up for each column to get the number of buckets per color:

colSums(bucket_per_color)

We can see that there is only one bucket for each colour. That is, we have no replicates for color and therefore no variation. This means that we cannot estimate the effect of color on frog weight using this dataset. In statistical inference, one must always measure variation in the effect of interest to estimate it. A common misconception in this scenario is to treat the 30 frogs per bucket as replicates for the effect of color, but this is not the case because of the hierarchical nature of the data. We often refer to these as pseudo-replicates.

  1. We can visualize the data using a boxplot (against bucket and paint the boxes with the corresponding color):
boxplot(weight ~ bucket, data = dat, col = levels(color))

We can see that the varaition across buckets is substantial, in fact it is larger than the variation within buckets. That makes sense given that sigma_bucket > sigma_within.

4.1.2 Building the covariance matrix

Let’s build the covariance matrix for this dataset based on what we observed in the lecture. First, the variances across buckets and within buckets are:

Vb  <- sigma_bucket^2
Ve  <- sigma_within^2
Vbe <- Vb + Ve # variance (diagonal)
cov <- Vb # covariance (off-diagonal)
rho <- Vb/(Vb+Ve) # correlation
c(Vb, Ve, Vbe, cov, rho)

Note that we use the suffix e for the latter as it correspond to the residual or error term in statistical models. Also, we are switch to the notation of the lecture (i.e., using V for variance). We can compare these to the empirical values derived from the data. First we calculate the average variance within each bucket:

Ve_emp <- with(dat, tapply(weight, bucket, var))
mean_Ve_emp <- mean(Ve_emp)
c(Ve, mean_Ve_emp)

The total variance can be estimate from just taking the variance of the weight of the frogs:

Vbe_emp <- var(dat$weight)
c(Vbe, mean_Vbe_emp)

From this we can estimate the variance of means across buckets:

Vb_emp <- Vbe_emp - mean_Ve_emp
c(Vb, Vb_emp)

Finally, we can estimate the induced correlation from the empirical estimates:

rho_emp <- Vb_emp/Vbe_emp
c(rho, rho_emp)
Exercise 4.2
  1. Build the covariance matrix for the dataset using the empirical estimates of the variances and covariances (for the GLS model weight ~ color - 1).
  2. Visualize the covariance matrix (tip: do a subset of the matrix).
  3. Apply the linear algebra of generalized least squares to estimate the mean frog weight of each bucket and its standard error.
  4. Fit a linear model using color as a fixed effect while ignoring the nested structure of the data. Compare these estimates to the ones using the linear algebra of GLS.
Solution 4.2
  1. We can build the covariance matrix using the empirical estimates of the variances and covariances. First we create the design matrix X from the data:
X <- model.matrix(weight ~ color - 1, data = dat)

Then we build the covariance matrix:

V <- matrix(0, nrow = nrow(X), ncol = nrow(X)) # Initialize the matrix
STM <- outer(dat$colour, dat$colour,"==") # Combinations with the same color
V[STM] <- Vb_emp
diag(V) <- Vbe_emp
  1. We can visualize the covariance matrix using the image() function. We will use only the first 300 rows and columns:
image(V[1:300, 1:300])
  1. We can estimate the mean frog weight of each bucket and its standard error using the linear algebra of generalized least squares. First we calculate the inverse of the covariance:
Vi <- solve(V)

Then we calculate the estimate of the mean frog weight of each bucket:

coef <- solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi %*% dat$weight
coef

And the standard error of the estimate:

se <- sqrt(diag(solve(t(X) %*% Vi %*% X)))
se
  1. We can compare these estimates to the ones from a linear model that assumes independence of the residuals. For simplicity I just use the lm() function directly:
fit <- lm(weight ~ color - 1, data = dat)
summary(fit)

Let’s compare the results

cbind(coef, se, coef(summary(fit))[, c("Estimate", "Std. Error")])

We can see that the estimates of the mean frog weight for each color are the same, as expected. The standard errors are larger for GLS because it accounts for the correlation between the residuals.

4.2 Mixed effect models

We are going to simulate another population of frogs in buckets but this time we will have replicates for each color. We will assume 8 buckets with 5 frogs per bucket, but only four bucket colors.

n_buckets <- 8
n_within <- 5
bucket <- as.factor(rep(1:n_buckets, each = n_within))
color <- factor(sort(rep(c("black","blue","green","red"),
                 length.out = n_within*n_buckets)))
id <- factor(rep(1:n_within, length.out = n_within * n_buckets))
dat <- data.frame(id, bucket, color)

Let’s simulate the weight of the frogs by defining some parameters:

mu <- 60 # Mean frog weight across all buckets
sigma_bucket <- 30 # Standard deviation of mean frog weight across buckets
sigma_within <- 15 # Standard deviation of frog weight within a bucket

First we sample from the distribution of buckets:

bucket_mu <- rnorm(n_buckets, mean = mu, sd = sigma_bucket)
dat$bucket_mu <- bucket_mu[dat$bucket]

And now generate the data:

X <- model.matrix(~ bucket - 1, data = dat)
beta <- bucket_mu
frog_effect <- rnorm(nrow(X), mean = 0, sd = sigma_within)
dat$weight <- X %*% beta + frog_effect
Exercise 4.3
  1. Count how many buckets the are per color. Can you estimate the effect of color on frog weight?

  2. Visualize the data. What do you see?

  3. Calculate the correlation between observations within a bucket from the population parameters and compare it to the empirical correlation.

Solution 4.3
  1. We can check how many buckets there are per color by using table() to create a contingency table and dividing by the number of frogs per bucket:
bucket_per_color <- with(dat, table(bucket, color))/n_within

Then we can add up for each column to get the number of buckets per color:

colSums(bucket_per_color)

We can see that there are now two buckets for each colour. Since the effect of each bucket color is replicated twice, it is possible to estimate the effect of color on frog weight.

  1. We can visualize the data using a boxplot (against bucket and paint the boxes with the corresponding color):
boxplot(weight ~ bucket, data = dat, col = levels(color))

We can see that the variation across buckets is substantial, in fact it is larger than the variation within buckets. That makes sense given that sigma_bucket > sigma_within.

  1. We can calculate the correlation between observations within a bucket from the population parameters:
rho_pop <- sigma_bucket^2/(sigma_bucket^2 + sigma_within^2)
rho_pop

We can also estimate the correlation from the data:

with(dat, cor(weight[id == 3], weight[id == 4]))
with(dat, cor(weight[id == 1], weight[id == 2]))
with(dat, cor(weight[id == 2], weight[id == 3]))
with(dat, cor(weight[id == 1], weight[id == 3]))

We can see that the empirical correlation values are not close to the population value. This is because of the relative small sample size which increases the variability of the estimates.

Let’s compare a linear model that ignores the nested structure of the data with a mixed effect model that accounts for it. First we fit the linear model:

lm2 <- lm(weight ~ color, data = dat)
summary(fit)
anova(fit)

And now we fit the mixed effect model using the package lme4 with and without color:

library(lme4) # Load the package. If you don't have it, install it first
library(lmerTest) # Load the package. If you don't have it, install it first
lmm1 <- lmer(weight ~ 1 + (1|bucket), data=dat)
lmm2 <- lmer(weight ~ colour + (1|bucket), data=dat)
anova(lmm2)

We can see that accounting for the nested structure of the data changes the statistical significance of the fixed effect of color. This is because in the F-test, the linear model assumes 36 degrees of freedom for the residuals which ignores the fact that color is a property of buckets and the frogs are nested within buckets. The linear mixed model does this correctly (degrees of freedom of 4 = n_buckets - n_colors).

Exercise 4.4
  1. Compare the results of the linear model and the mixed effect model. What do you see?
  2. Compare the results of the mixed effect model with and without color. What do you see?
  3. Calculate the covariance matrix of the mixed effect model with color using the linear algebra of generalized least squares and visualize it.
  4. Calculate the standard errors of the estimates of the mixed effect model with color using the linear algebra of generalized least squares (tip: use as.data.frame(VarCorr()) to extract the estimated variances).
Solution 4.4
  1. We can compare the results of the linear model and the mixed effect model:
summary(lm2)
summary(lmm2)

We can see that the estimates of the fixed effect are the same, but the standard errors are larger for the mixed effect model because it accounts for the correlation between the residuals.

  1. We can compare the results of the mixed effect model with and without color:
summary(lmm1)
summary(lmm2)

We can see that the estimates of the fixed effect are different, as expected, but also the model without color as a fixed effect has a larger variance for the random effect of bucket. This is a reflection of the fact that the fixed effect of color explains some of the variation across buckets and that is missing in the first model.

  1. First we extract the variance components:
VC <- as.data.frame(VarCorr(lmm2))
Var_b <- VC$vcov[1]
Var_e <- VC$vcov[2]

Then we construct the covariance matrix from the design matrix and the variance components:

X<-model.matrix(lmm2)
V <- matrix(0,nrow(X),nrow(X))
STM <- with(dat, outer(bucket, bucket, "=="))
V[STM] <- Var_b
diag(V) <- Var_b + Var_e

We can visualize the covariance matrix using the image() function:

image(V)
  1. We can calculate the estimates and their standard errors of the estimates of the mixed effect model with color
Vi <- solve(V)
coef <- solve(t(X) %*% Vi %*% X) %*% t(X) %*% Vi %*% dat$weight
se <- sqrt(diag(solve(t(X) %*% Vi %*% X)))
cbind(coef, se)
summary(lmm2)$coefficients[,1:2]