9  Generalized linear models

Author

Alejandro Morales & Joost van Heerwaarden

9.1 Using binomial distribution

Let’s simulate data similar in nature to the data obtained from the case study on banana in Uganda. We will assume a number of locations (n) and a number of observations per location (r). A covariate x is generated with one value per location.

set.seed(123)
n <- 20
r <- 10
x <- runif(n,-10,10)
names(x) <- 1:n
sim.data<-expand.grid(location = factor(1:n), observation = factor(1:r))
sim.data$x <- x[match(sim.data$location, names(x))]
sim.data <- sim.data[order(sim.data$location, sim.data$observation),]
head(sim.data)
    location observation        x
1          1           1 -4.24845
21         1           2 -4.24845
41         1           3 -4.24845
61         1           4 -4.24845
81         1           5 -4.24845
101        1           6 -4.24845
tail(sim.data)
    location observation        x
100       20           5 9.090073
120       20           6 9.090073
140       20           7 9.090073
160       20           8 9.090073
180       20           9 9.090073
200       20          10 9.090073

Let’s build the model. The design matrices for the fixed and random effects are as follows:

X <- matrix(sim.data$x, ncol = 1)
Z <- model.matrix(~location-1, data = sim.data)

The vectors of coefficients and random effects are:

# Fixed effects
mu <- 0 # Overall mean in logit scale
Beta <- 0.15 # Slope with respect to X in logit scale
# Random effects
sigma.b <- 1 # Standard deviation of random effects on mean in logit scale
b <- rnorm(n,mean = 0,sd = sigma.b) # Random effects on mean in logit scale

Now we generate the response variable y:

library(boot) # for inverse logit
# Mean in logit scale
sim.data$y.mu <- mu+X%*%Beta+Z%*%b
# Proportion in original scale
sim.data$p <- inv.logit(sim.data$y.mu)
# Simulate individual observations by sampling from a Binomial distribution with size = 1 and probability p
sim.data$y <- rbinom(n*r, size = 1, prob = sim.data$p)
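
As a quick, optional sanity check (not part of the original workflow), we can compare the observed proportion of successes per location with the simulated probabilities; with only r = 10 observations per location the agreement will be rough:

# Observed proportion of successes per location
obs.p <- tapply(sim.data$y, sim.data$location, mean)
# Simulated probability per location (constant within a location)
true.p <- tapply(as.numeric(sim.data$p), sim.data$location, mean)
plot(true.p, obs.p, xlab = "Simulated probability", ylab = "Observed proportion")
abline(a = 0, b = 1)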

Now we can fit a generalized linear mixed model to these data. As shown in the lectures, we will use the function glmer from the lme4 package and specify a binomial distribution:

library(lme4)
Loading required package: Matrix
library(lmerTest)

Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':

    lmer
The following object is masked from 'package:stats':

    step
glmm1 <- glmer(y ~ x + (1|location), data = sim.data, family = "binomial")
Exercise 9.1
  1. Check the estimates of the fixed effects and the random effects in the model and discuss the results considering the values of the parameters used to simulate the data.
  2. Calculate the predicted intercept for each location in the original scale (tip: use predict() or the functions ranef() and fixef()).
  3. Add the effect of x to the predictions and plot the predictions against the true values.
Solution 9.1
  1. We can retrieve the estimates using summary():
summary(glmm1)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: y ~ x + (1 | location)
   Data: sim.data

     AIC      BIC   logLik deviance df.resid 
   254.0    263.9   -124.0    248.0      197 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.1599 -0.8229  0.3687  0.7546  1.8467 

Random effects:
 Groups   Name        Variance Std.Dev.
 location (Intercept) 0.6393   0.7996  
Number of obs: 200, groups:  location, 20

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.09139    0.24305   0.376  0.70691   
x            0.12527    0.04058   3.087  0.00202 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
  (Intr)
x -0.118

The true value for the intercept was mu = 0, so the estimate of 0.09 is quite good. The slope with respect to x was Beta = 0.15, and the estimate is 0.13, so the slope was also well estimated. The standard deviation of the random effects was sigma.b = 1, and the estimate is 0.80, which is not too far from the true value although somewhat underestimated.
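
To make this comparison explicit, a short optional check (not part of the original solution) can put the estimates next to the values used in the simulation; VarCorr() from lme4 returns the estimated random-effect standard deviation:

# Compare estimated parameters with the values used in the simulation
est <- c(intercept = unname(fixef(glmm1)[1]),
         slope     = unname(fixef(glmm1)[2]),
         sigma.b   = as.data.frame(VarCorr(glmm1))$sdcor[1])
sim.par <- c(intercept = mu, slope = Beta, sigma.b = sigma.b)
round(rbind(estimated = est, simulated = sim.par), 2)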

  2. We can calculate the predictions of the mean for each location using predict() by setting x = 0 and location as a factor with levels 1 to n:
pred <- predict(glmm1, newdata = data.frame(x = 0, location = factor(1:n)))
pred.p <- inv.logit(pred)

We could also use ranef() and fixef() to calculate the predictions:

ranef.glmm1 <- ranef(glmm1)
fixef.glmm1 <- fixef(glmm1)
pred.p2 <- inv.logit(fixef.glmm1[1] + ranef.glmm1$location[,1])
plot(pred.p, pred.p2)
abline(a = 0, b = 1)

  3. We could just use predict() to include predictions for every observation in the original scale:
sim.data$p.pred <- predict(glmm1, type = "response")
plot(p.pred~p, xlab = "Simulated proportion", ylab = "Predicted proportion",
     xlim = c(0,1), ylim = c(0,1), data = sim.data)
abline(a = 0, b = 1)

We can also reconstruct the predictions from the fixed and random effects:

x_sorted <- unique(sim.data$x)
pred.p <- rep(inv.logit(fixef.glmm1[1] + ranef.glmm1$location[,1] +
                        x_sorted*fixef.glmm1[2]), each = r)
plot(pred.p, sim.data$p.pred)
abline(a = 0, b = 1)

9.2 Using Poisson distribution

Let’s now simulate data using a Poisson distribution and a log link function but following the same steps as before. In this case n = 30 and r = 5:

set.seed(123)
n <- 30
r <- 5
x <- runif(n,-10,10)
names(x) <- 1:n
sim.data<-expand.grid(location = factor(1:n), observation = factor(1:r))
sim.data$x <- x[match(sim.data$location, names(x))]
sim.data <- sim.data[order(sim.data$location, sim.data$observation),]
head(sim.data)
    location observation         x
1          1           1 -4.248450
31         1           2 -4.248450
61         1           3 -4.248450
91         1           4 -4.248450
121        1           5 -4.248450
2          2           1  5.766103
tail(sim.data)
    location observation         x
149       29           5 -4.216805
30        30           1 -7.057727
60        30           2 -7.057727
90        30           3 -7.057727
120       30           4 -7.057727
150       30           5 -7.057727

The design matrices for the fixed and random effects are constructed in the same way:

X <- matrix(sim.data$x, ncol = 1)
Z <- model.matrix(~location-1, data = sim.data)

The vectors of coefficients and random effects are:

# Fixed effects
mu <- 0 # Overall mean in log scale
Beta <- 0.15 # Slope with respect to X in log scale
# Random effects
sigma.b <- 1 # Standard deviation of random effects on mean in log scale
b <- rnorm(n,mean = 0,sd = sigma.b) # Random effects on mean in log scale

Now we generate the response variable y using rpois():

# Mean in log scale
sim.data$y.mu <- mu+X%*%Beta+Z%*%b
# Expected count (lambda) in original scale
sim.data$lambda <- exp(sim.data$y.mu)
# Simulate individual observations by sampling from a Poisson distribution with mean lambda
sim.data$y <- rpois(n*r, lambda = sim.data$lambda)
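
As before, an optional sanity check compares the mean simulated count per location with the true value of lambda; with r = 5 observations per location the scatter will be noticeable:

# Mean simulated count per location vs. the true lambda
obs.count <- tapply(sim.data$y, sim.data$location, mean)
true.lambda <- tapply(as.numeric(sim.data$lambda), sim.data$location, mean)
plot(true.lambda, obs.count, xlab = "Simulated lambda", ylab = "Mean observed count")
abline(a = 0, b = 1)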

Now we can fit a generalized linear mixed model to these data, specifying a Poisson distribution:

glmm1 <- glmer(y ~ x + (1|location), data = sim.data, family = "poisson")
Exercise 9.2
  1. Check the estimates of the fixed effects and the random effects in the model and discuss the results considering the values of the parameters used to simulate the data.
  2. Calculate the predicted intercept for each location in the original scale (tip: use predict() or the functions ranef() and fixef()).
  3. Add the effect of x to the predictions and plot the predictions against the true values.
Solution 9.2
  1. We can retrieve the estimates using summary():
summary(glmm1)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: poisson  ( log )
Formula: y ~ x + (1 | location)
   Data: sim.data

     AIC      BIC   logLik deviance df.resid 
   494.5    503.5   -244.3    488.5      147 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.6669 -0.6848 -0.1970  0.5163  2.2707 

Random effects:
 Groups   Name        Variance Std.Dev.
 location (Intercept) 0.7104   0.8428  
Number of obs: 150, groups:  location, 30

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.14720    0.18447   0.798  0.42490   
x            0.09060    0.03013   3.007  0.00263 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
  (Intr)
x -0.310

The true value for the intercept was mu = 0, so the estimate of 0.15 is reasonably close (and not significantly different from zero). The slope with respect to x was Beta = 0.15, and the estimate is 0.09, so the slope was somewhat underestimated. The standard deviation of the random effects was sigma.b = 1, and the estimate is 0.84, which is not too far from the true value although somewhat underestimated.

  2. We can calculate the predictions of the mean for each location using predict() by setting x = 0 and location as a factor with levels 1 to n:
pred <- predict(glmm1, newdata = data.frame(x = 0, location = factor(1:n)))
pred.lambda <- exp(pred)

We could also use ranef() and fixef() to calculate the predictions:

ranef.glmm1 <- ranef(glmm1)
fixef.glmm1 <- fixef(glmm1)
pred.lambda2 <- exp(fixef.glmm1[1] + ranef.glmm1$location[,1])
plot(pred.lambda, pred.lambda2)
abline(a = 0, b = 1)

  3. We could just use predict() to include predictions for every observation in the original scale:
sim.data$y.pred <- predict(glmm1, type = "response")
plot(y.pred~y, xlab = "Observed counts", ylab = "Predicted counts",
     xlim = c(0,14), ylim = c(0,14), data = sim.data)
abline(a = 0, b = 1)

We can also reconstruct the predictions from the fixed and random effects:

x_sorted <- unique(sim.data$x)
pred.y <- rep(exp(fixef.glmm1[1] + ranef.glmm1$location[,1] +
                  x_sorted*fixef.glmm1[2]), each = r)
plot(pred.y, sim.data$y.pred)
abline(a = 0, b = 1)
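
Because the data were simulated, we can go one step beyond the exercise and compare the per-location predictions with the true values of lambda used in the simulation (using pred.lambda computed above):

# True lambda per location from the simulation parameters
true.lambda <- exp(mu + Beta * x + b)
plot(true.lambda, pred.lambda, xlab = "True lambda", ylab = "Predicted lambda")
abline(a = 0, b = 1)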