Last time we created a lattice graph with 29 panels. This approaches the practical limit on the number of panels that can be reasonably displayed on a single page. We can spread a panel graph across multiple pages by specifying a value for the layout argument that corresponds to a subset of the panels. The code below recreates the goodness of fit panel graph from last time but specifies that only 16 panels should be displayed on the same page.
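The original plotting code isn't reproduced here; the following is a minimal sketch of the idea with hypothetical data and variable names. The key point is the layout argument: a 4 × 4 layout displays 16 panels per page, and lattice spills the remaining panels onto a second page.

library(lattice)
# Hypothetical names: a conditioning plot with 29 panels, restricted to a
# 4 x 4 layout so 16 panels appear per page and the rest go to page 2
xyplot(y ~ x | site, data = gof.dat, layout = c(4, 4))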
When a multi-page graph is created in R only the last page of the graph will be displayed in the graphics window. In Windows OS the graphics window has a History menu. If the Recording option of the History menu has previously been checked (Fig. 1a), each graph is stored as a separate page of the graphics window, and pressing the PgUp or PgDown keys on the keyboard lets you navigate to the different pages. Note: for this to work the Recording option must be selected before you create the graph. In Mac OS X the Quartz menu has Back and Forward options (with the keyboard shortcuts shown in Fig. 1b) for navigating to previous pages of a multi-page graph.
Fig. 1 (a) Menu selection in Windows OS to stack multiple graphs and pages of graphs in the same window. (b) Menu selection in Mac OS X to move "Back" to previous pages of graphs produced in a Quartz window.
It's also possible to pause plotting before each page is displayed by entering par(ask=T) at the R prompt. R displays the message "Hit <Return> to see next plot:", and pressing the <Enter> key causes the next page of the graph to be displayed. Continue pressing <Enter> until all pages have been displayed. To disable this option enter par(ask=F).
I refit the Poisson model for the species-area relationship. Highlighted in the output from the summary function below is a statistic called the residual deviance.
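A sketch of the call that would produce output like that shown below, assuming the species-area data sit in a data frame called sa.dat with columns Species and Area (both names are hypothetical).

# Poisson regression of species richness on log area
mod.pois <- glm(Species ~ log(Area), data = sa.dat, family = poisson)
summary(mod.pois)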
Deviance Residuals:
Min 1Q Median 3Q Max
-10.4912 -3.6827 -0.9111 2.0992 10.0887
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.240970 0.043050 75.28 <2e-16 ***
log(Area) 0.342991 0.007346 46.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3447.59 on 28 degrees of freedom
Residual deviance: 640.22 on 27 degrees of freedom
AIC: 800.03
Number of Fisher Scoring iterations: 5
Historically a quantity called the deviance has played an important role in assessing the fit of a Poisson regression model, although in truth it's only sometimes appropriate. The deviance is reported in the summary table of the Poisson model where it is called the residual deviance. The deviance can be derived from the usual likelihood ratio statistic as follows.
Suppose we have two nested models with estimated parameter sets $\hat{\theta}_1$ and $\hat{\theta}_2$ and corresponding likelihoods $L_1$ and $L_2$. If model 1 is a special case of model 2 then we can compare the models using a likelihood ratio test.

$$LR = -2\log\frac{L_1}{L_2} = 2\left(\log L_2 - \log L_1\right)$$
Recall that LR has an asymptotic chi-squared distribution with degrees of freedom equal to the difference in the number of parameters estimated by the two models. Alternatively, the degrees of freedom is the number of parameters that are fixed to have a specific value in one model but are freely estimated in the other model.
Now suppose the model with likelihood is the saturated model. A saturated model is one in which one parameter is estimated for each observation. For a saturated Poisson model λi = yi, i.e., the Poisson mean for each observation is the observed count. A likelihood ratio statistic that compares the current model with a saturated model is the residual deviance of that model. We can demonstrate this by calculating the likelihood ratio statistic by hand.
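A sketch of the hand calculation, reusing the model object mod.pois and data frame sa.dat assumed above.

# log-likelihood of the saturated model: one parameter per observation, lambda_i = y_i
y <- sa.dat$Species
loglik.sat <- sum(dpois(y, lambda = y, log = TRUE))
# log-likelihood of the fitted Poisson regression model
loglik.mod <- as.numeric(logLik(mod.pois))
# likelihood ratio statistic: matches the residual deviance, 640.22
2 * (loglik.sat - loglik.mod)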
For the saturated model there are n parameters, one for each observation. The likelihood ratio statistic has an asymptotic chi-squared distribution with degrees of freedom equal to the difference in the number of estimated parameters of the two nested models. I calculate this difference by hand.
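Continuing the sketch: the saturated model estimates n parameters while our model estimates two (intercept and slope).

# difference in the number of estimated parameters
length(y) - length(coef(mod.pois))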
Notice that 27 is exactly what is reported as the degrees of freedom for the residual deviance in the summary table. The saturated model fits the data perfectly, i.e., it fits it as well as a Poisson model can. If our model is not significantly different from the saturated model we conclude that our model fits as well as a perfect model. Given that our model does so with fewer parameters we should prefer our model because it is more parsimonious.
If we carry out the residual deviance test here we obtain a significant result. Assuming that the test is valid we conclude that our model is significantly different from the saturated model and hence that we have a significant lack of fit.
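A sketch of the test, again assuming the fitted model object is named mod.pois.

# goodness of fit test: compare the residual deviance to a chi-squared distribution
pchisq(deviance(mod.pois), df.residual(mod.pois), lower.tail = FALSE)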
The probability mass function for a Poisson random variable is

$$P(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$

If $\hat{\lambda}_i$ is the maximum likelihood estimate of λ for observation i, i = 1, … , n, then the log-likelihood of a Poisson regression model with n observations is the following.

$$\log L = \sum_{i=1}^{n}\left(-\hat{\lambda}_i + y_i\log\hat{\lambda}_i - \log y_i!\right)$$

For the saturated model the maximum likelihood estimate of λ is $y_i$ and the corresponding log-likelihood is the following.

$$\log L_S = \sum_{i=1}^{n}\left(-y_i + y_i\log y_i - \log y_i!\right)$$

Subtracting these quantities and doubling the difference yields the residual deviance for the Poisson model.

$$D = 2\left(\log L_S - \log L\right) = 2\sum_{i=1}^{n}\left[y_i\log\frac{y_i}{\hat{\lambda}_i} - \left(y_i - \hat{\lambda}_i\right)\right]$$
Define

$$d_i = \operatorname{sign}\!\left(y_i - \hat{\lambda}_i\right)\sqrt{2\left[y_i\log\frac{y_i}{\hat{\lambda}_i} - \left(y_i - \hat{\lambda}_i\right)\right]}$$

then $d_i$ is called a deviance residual. Here sign denotes the sign function, which is defined as follows.

$$\operatorname{sign}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}$$
The deviance residual measures the ith observation's contribution to the deviance. Since under ideal conditions the deviance can be used as a measure of lack of fit, the deviance residual measures the ith observation's contribution to model lack of fit. If we square the deviance residuals and sum them we obtain the residual deviance. We can extract the deviance residuals from a Poisson regression model with the residuals function by specifying type = 'deviance'.
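For example, with the assumed model object mod.pois:

# deviance residuals; their sum of squares reproduces the residual deviance
dev.res <- residuals(mod.pois, type = "deviance")
sum(dev.res^2)
deviance(mod.pois)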
The chi-squared distribution of the likelihood ratio test is an asymptotic result. The likelihood ratio statistic has a chi-squared distribution only as the sample size gets large. Implicit in this statement is that the sample size should get large while the difference in the number of estimated parameters in the two models remains fixed. This is not the case when one of the models is the saturated model because as the sample size gets large, so does the number of parameters (because there is one parameter per observation). This can invalidate the assumed chi-squared distribution of the residual deviance.
It turns out that the residual deviance for a Poisson regression model is equal to the G-statistic (in the notation of Sokal and Rohlf, 1995). The G-test, often called the likelihood ratio chi-squared test, is an alternative to the Pearson chi-squared test, to which it is asymptotically equivalent. The G-statistic is defined as follows.

$$G = 2\sum_{i=1}^{n} O_i \log\frac{O_i}{E_i}$$

where $O_i$ and $E_i$ denote the observed and expected counts.
If we calculate this statistic for the current model we see that we obtain the residual deviance.
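A sketch of the calculation using the observed counts and the fitted values from the assumed model mod.pois.

# G-statistic computed from observed and expected (fitted) counts
obs <- sa.dat$Species
exp.counts <- fitted(mod.pois)
2 * sum(obs * log(obs / exp.counts))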
Like the Pearson chi-squared statistic, the G-statistic has an asymptotic chi-squared distribution with n – p degrees of freedom where p is the number of estimated parameters. Also like the Pearson statistic the validity of the chi-squared distribution of the G-statistic depends on some minimal sample size restrictions on the counts. If none of the expected counts are less than 1 and no more than 20% are less than 5, then the chi-squared distribution is reasonable. If we check this condition here we find that all of the expected counts exceed 5.
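For example:

# fraction of expected counts that are at least 5
mean(fitted(mod.pois) >= 5)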
Therefore the goodness of fit test based on the residual deviance is probably valid for these data and we can conclude that the Poisson regression model does not adequately fit the data.
If we use the formula for the Pearson goodness of fit statistic on the individual observations of the Poisson regression model we obtain what's called the Pearson deviance.

$$X^2 = \sum_{i=1}^{n}\frac{\left(y_i - \hat{\lambda}_i\right)^2}{\hat{\lambda}_i}$$
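Using the obs and exp.counts objects defined in the sketch above:

# Pearson statistic ("Pearson deviance") from observed and fitted values
sum((obs - exp.counts)^2 / exp.counts)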
The Pearson deviance is generally considered superior to the residual deviance as a goodness of fit statistic for Poisson regression models (if the minimal sample size criteria are met). It is also compared to a chi-squared distribution with n – p degrees of freedom where p is the number of estimated parameters and n is the sample size. Typically there is little substantive difference between using the residual deviance and the Pearson deviance as a goodness of fit statistic.
The individual terms of the Pearson statistic are called the Pearson residuals, which for a Poisson regression model are the following.

$$r_i = \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}$$
The Pearson residual when squared can be viewed as the contribution the ith observation makes to the Pearson statistic. Because the mean and variance of the Poisson distribution are equal each Pearson residual has the form of a z-score. The Pearson residuals from a glm object can be extracted with the residuals function by specifying type='pearson'. If we square the Pearson residuals and sum them we get the Pearson deviance.
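For example:

# Pearson residuals; their sum of squares is the Pearson statistic
pear.res <- residuals(mod.pois, type = "pearson")
sum(pear.res^2)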
I need to comment on an often spurious seat-of-the-pants rule about the residual deviance that has appeared in the literature and is perpetuated in textbooks and articles about statistics written for ecologists. The rule requires that we calculate the ratio of the residual deviance to its degrees of freedom.

$$\hat{\varphi} = \frac{\text{residual deviance}}{n - p}$$
If φ > 1 we say the data are overdispersed relative to a Poisson distribution. If φ < 1 we say the data are underdispersed relative to a Poisson distribution. For the current data set and model we find φ = 640.22/27 ≈ 23.7, so using the fitted model we would say that the data are overdispersed relative to a Poisson distribution, meaning that the data show more variability than is predicted by a Poisson distribution.
Technically speaking it's not the deviance based on the G-statistic that is usually used for assessing over- or underdispersion, but instead it is the "deviance" based on the Pearson goodness of fit statistic. If we calculate the ratio of the Pearson deviance to its degrees of freedom we obtain the following.
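A sketch of the calculation for the assumed model mod.pois.

# ratio of the Pearson statistic to the residual degrees of freedom
sum(residuals(mod.pois, type = "pearson")^2) / df.residual(mod.pois)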
Once again φ > 1 so we would say the data are overdispersed relative to a Poisson distribution.
The logic behind this rule is that the mean of a chi-squared distribution is equal to its degrees of freedom, so the ratio compares a chi-squared statistic to its mean. But as we noted above, the residual deviance has a chi-squared distribution only if the individual expected counts are sufficiently large (none less than 1 and no more than 20% less than 5). So the seat-of-the-pants rule is inappropriate if there are a large number of small expected counts.
Suppose we wish to compare two nested models with likelihoods $L_1$ and $L_2$, where $L_2$ is the "larger" (more estimated parameters) of the two likelihoods. Let $L_S$ denote the likelihood of the saturated model. The likelihood ratio statistic for comparing these two models is the following.

$$LR = 2\log\frac{L_2}{L_1} = 2\left(\log L_S - \log L_1\right) - 2\left(\log L_S - \log L_2\right) = D_1 - D_2$$

where $D_1$ and $D_2$ are the deviances of models 1 and 2, respectively. Thus to carry out a likelihood ratio test comparing two models we can equivalently just compute the difference in their deviances. Because of the parallel to analysis of variance (ANOVA) in ordinary linear models, carrying out an LR test using the deviances of the models is often called analysis of deviance. When we apply the anova function to a glm model the table that is reported is labeled "analysis of deviance", but as we've seen the individual tests that are reported are just likelihood ratio tests comparing the sequentially nested models.
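A sketch of the call that produces a table like the one below (mod.pois as assumed above).

# analysis of deviance table for the species-area model
anova(mod.pois, test = "Chisq")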
Model: poisson, link: log
Response: Species
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 28 3447.6
log(Area) 1 2807.4 27 640.2 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As counterintuitive as it may seem, it is possible to use Poisson and negative binomial regression to fit models in which the response variable is a rate, the number of events that occur per unit time or per unit area, even when the denominator of the rate is different for different observations. To illustrate how this can be done we consider a data set in which counts of bird species were obtained on various sized tracts of land all over the island of Jamaica. We load the data set and examine the available variables.
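A sketch of the setup; the file name and data frame name are hypothetical, but the variables used below (S, landscape, year, area, and a patch identifier) are the ones that appear in the model output.

birds <- read.csv("jamaica_birds.csv")   # hypothetical file name
names(birds)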
The variable S records the bird species richness in a patch in a given year. Each patch was visited up to three times in three consecutive years: 2005 through 2007. The area of each patch was recorded and each patch was assigned to one of four landscape categories. The objective was to determine how aluminum mining on the island (landscape = "Bauxite") was affecting bird populations. So we have repeated measures data in which an important covariate, area, was recorded that may modify the nature of the relationship between richness and landscape type. Previously for normally-distributed response variables we used analysis of covariance to adjust for a covariate such as area (lecture 9) and we used a mixed effects (random intercepts) model to account for the repeated measures (lecture 11). The exact same approaches will work when the response variable has a Poisson distribution.
An examination of the gross distribution of bird richness values reveals nothing unusual.
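A sketch of one way to produce Fig. 2; the choice of a lattice histogram is an assumption.

library(lattice)
# overall distribution of richness (Fig. 2)
histogram(~S, data = birds)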
Fig. 2 Gross distribution of bird richness values.
Based on what we see, either a Poisson or a normal distribution might be a suitable probability model. Keep in mind that Fig. 2 need not resemble any particular distribution. The goal is to model mean richness using landscape, year, and perhaps other predictors, and it's only after the observations have been carved up into subsets by the predictors that the choice of probability distribution becomes relevant. For instance, if all of the individual subsets are normally distributed it doesn't follow that the overall distribution will also be normally distributed. What we do learn from Fig. 2 is that the distribution is shifted away from zero, so we probably don't have to worry about a normal distribution for richness yielding negative predictions. Furthermore the overall distribution is roughly symmetric, so a negative binomial model is probably unnecessary, nor will we need to log-transform richness.
I examine the mean bird richness values by landscape and year. Because there are a number of missing values in the response variable I include the additional argument na.rm=T to remove those values. (The tapply function allows additional arguments to the function, in this case mean, to be listed after the function.)
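A sketch of the calculation, assuming the data frame birds defined above.

# mean richness by landscape and year; na.rm = TRUE drops missing S values
tapply(birds$S, list(birds$landscape, birds$year), mean, na.rm = TRUE)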
There appear to be important differences in bird richness across the different landscapes. Richness is lowest in urban areas and around bauxite mines. Forested and agricultural areas have much higher richness values. These patterns maintain themselves across all three years but in year 2005 all four landscapes experienced much higher richness values than in subsequent years. These observations suggest that a simple additive model in landscape and year might suffice, the classic main effects two-factor analysis of variance model. Because we're dealing with count data I assume a Poisson distribution for the response with a log link for the mean. With only three years of data it probably makes sense to treat year as a factor.
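A sketch of the model fit and the analysis of deviance table shown below; the model object name is hypothetical.

# additive two-factor Poisson model; year entered as a factor
mod1 <- glm(S ~ landscape + factor(year), data = birds, family = poisson)
anova(mod1, test = "Chisq")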
Model: poisson, link: log
Response: S
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 256 691.67
landscape 3 138.918 253 552.75 < 2.2e-16 ***
factor(year) 2 20.495 251 532.25 3.545e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The analysis of deviance table indicates that both landscape and year are significant predictors. I examine the summary table of the model.
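Assuming the model object mod1 from the sketch above:

summary(mod1)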
Deviance Residuals:
Min 1Q Median 3Q Max
-3.80085 -1.06031 0.05366 0.98796 2.96943
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.41323 0.02835 120.416 < 2e-16 ***
landscapeBauxite -0.33365 0.03633 -9.183 < 2e-16 ***
landscapeForest -0.12738 0.03401 -3.745 0.000180 ***
landscapeUrban -0.35814 0.03858 -9.284 < 2e-16 ***
factor(year)2006 -0.13397 0.03317 -4.039 5.36e-05 ***
factor(year)2007 -0.10764 0.03005 -3.581 0.000342 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 691.67 on 256 degrees of freedom
Residual deviance: 532.25 on 251 degrees of freedom
(49 observations deleted due to missingness)
AIC: 1811.3
Number of Fisher Scoring iterations: 4
The summary table confirms the pattern we observed in the gross summary statistics. Agricultural areas have the highest bird richness followed by forested areas. Urban and bauxite areas bring up the rear and are essentially tied. We also see that richness in 2005 was significantly higher than in the two subsequent years.
We're not done yet because we haven't checked fit. It's unwise to draw inferences from a model that doesn't fit the data. The Poisson distribution is one for which the residual deviance can potentially be used as a goodness of fit statistic. I check to see if most of the observations have predicted cell counts that are larger than five.
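A sketch of the check, using the assumed model object mod1.

# proportion of fitted Poisson means exceeding 5
mean(fitted(mod1) > 5)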
100% of the expected cell counts are greater than five, so the residual deviance can be used as a goodness of fit statistic. I examine the ratio of the residual deviance to its degrees of freedom.
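For example:

# dispersion estimate and the corresponding goodness of fit p-value
deviance(mod1) / df.residual(mod1)
pchisq(deviance(mod1), df.residual(mod1), lower.tail = FALSE)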
The data are overdispersed relative to a Poisson distribution and based on the p-value of the chi-squared test, significantly so. This should not be surprising. The smallest predicted Poisson mean is 18.6 yet in Fig. 2 we saw counts as low as six. Even assuming that these low counts correspond to the patch with the smallest mean, probabilities of obtaining counts this small are extremely low using this Poisson model. This indicates that it is very unlikely that the current model could have generated these data.
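A quick check of that probability statement:

# probability of a count of 6 or fewer when the Poisson mean is 18.6
ppois(6, lambda = 18.6)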
An obvious plot characteristic that we'd expect to have a big impact on bird richness is the size of the patch. It turns out the patches vary quite a bit with respect to size.
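For example:

# spread of patch areas
range(birds$area, na.rm = TRUE)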
So, there's a 100-fold difference in plot size in these data. If we examine the mean patch area by landscape type we see marked differences there also.
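A sketch of the calculation:

# mean patch area by landscape type
tapply(birds$area, birds$landscape, mean, na.rm = TRUE)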
Agricultural patches were on average more than twice the size of any of the other patch types. Given that our goal is to relate richness to landscape, area is an obvious confounding variable. So, we should replace analysis of variance with analysis of covariance, treating area or log(area) as a covariate. A plot of richness versus area and richness versus log(area) suggests that log-transforming area yields a better linear relationship with species richness.
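A sketch of one way to produce Fig. 3; the original may have used lattice rather than base graphics.

# richness against area and against log(area) (Fig. 3)
par(mfrow = c(1, 2))
plot(S ~ area, data = birds)
plot(S ~ log(area), data = birds)
par(mfrow = c(1, 1))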
Fig. 3 Checking the linear relationship between species richness (S) and area.
Because we're using a log link for the mean, using log(area) will also yield a more interpretable model. If we use log-likelihood and AIC to compare a model that includes area to a model that uses log(area) as a predictor, the log(area) model has the larger log-likelihood and the smaller AIC.
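A sketch of the comparison and of the model whose summary appears below; mod2a and mod2 are hypothetical object names.

# analysis of covariance models with area vs log(area) as the covariate
mod2a <- glm(S ~ landscape + factor(year) + area, data = birds, family = poisson)
mod2  <- glm(S ~ landscape + factor(year) + log(area), data = birds, family = poisson)
logLik(mod2a)
logLik(mod2)
AIC(mod2a, mod2)
summary(mod2)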
Deviance Residuals:
Min 1Q Median 3Q Max
-3.79706 -0.75487 0.04366 0.74034 2.24672
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.08891 0.03662 84.348 < 2e-16 ***
landscapeBauxite -0.22688 0.03728 -6.086 1.16e-09 ***
landscapeForest -0.06504 0.03441 -1.890 0.058744 .
landscapeUrban -0.27087 0.03931 -6.890 5.57e-12 ***
factor(year)2006 -0.11130 0.03321 -3.351 0.000805 ***
factor(year)2007 -0.10660 0.03005 -3.547 0.000389 ***
log(area) 0.21003 0.01353 15.518 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 691.67 on 256 degrees of freedom
Residual deviance: 297.53 on 250 degrees of freedom
(49 observations deleted due to missingness)
AIC: 1578.6
Number of Fisher Scoring iterations: 4
The predictor log(area) is highly significant. So, having controlled for differences in area we find that the richness in forested areas is no longer significantly different from the richness in agricultural areas. Apparently the difference we observed in the previous model and in the raw data was due to the much larger size of agricultural patches.
The final model we've obtained has a simple theoretical basis. The fitted model for mean richness in agricultural areas in 2005 is the following.

$$\hat{\mu} = e^{3.09 + 0.21\log(\text{area})} = e^{3.09}\,\text{area}^{0.21} \approx 22\,\text{area}^{0.21} = k\,\text{area}^{0.21}$$
This is the classic species-area model in which richness is proportional to a power of area. The area exponent is 0.21 so we see that richness scales by the fifth root of area. The role that landscape and year play in this model is to modify the multiplier k but not the basic species-area relationship.
Has adding log(area) to the regression model solved the lack of fit problem? I check again that the expected counts still meet the minimum cell size criterion.
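A sketch of the check, assuming the covariate model is named mod2 as above.

# minimum cell size check for the analysis of covariance model
mean(fitted(mod2) > 5)
min(fitted(mod2))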
100% of the expected cell counts are still larger than 5, so the residual deviance can be used as a measure of fit. The smallest expected mean count has decreased somewhat but probably not enough to account for some of the smallest counts in the data set.
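For example:

# dispersion estimate and goodness of fit p-value for the covariate model
deviance(mod2) / df.residual(mod2)
pchisq(deviance(mod2), df.residual(mod2), lower.tail = FALSE)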
So, the data are still overdispersed but now only barely so. The chi-squared distribution for the residual deviance indicates that the overdispersion is still statistically significant. So, we may still have a fit problem. This is not surprising because we've been ignoring an important feature of these data, the fact that the same patch is contributing multiple observations to the data set. If a given patch has individual characteristics that cause it to deviate systematically from other patches with the same covariate pattern, that could account for the small amount of overdispersion that remains. We'll consider this possibility below.
When I submitted the above analysis (with a further adjustment for the repeated measures structure that we'll discuss below) to the researcher, she was not satisfied. She explained that she didn't want to model richness; she wanted to model density, the number of species per unit area. Density is generally a difficult quantity to model statistically. Density is a ratio and ratios can be extremely unstable particularly when both the numerator and denominator are measured with error. We can expect to see both very small and very large ratios in the same data set. A further complication is that density is a positive quantity, further constraining our choice of probability models. We previously discussed some of the problems with analyzing ratios in lecture 9.
There are a number of reasons why count observations in a data set might not be equivalent: the counts may have been accumulated over observation periods of different lengths, collected from sampling units of different sizes (areas), or drawn from populations of different sizes.
In each of these cases it would be more natural to work with the rate of occurrence: the number of observations per unit time, per unit area, or the per capita rate. It is possible to analyze a rate while fitting a Poisson model to the counts if you treat the log of the denominator of the rate as an offset. An offset is the name we use for a regressor whose coefficient is not estimated.
An offset is typically a term of the form $\log(\text{time})$, $\log(\text{area})$, or $\log(\text{population size})$ that is included in the model with its coefficient constrained to be equal to 1.
To understand what using an offset is supposed to accomplish, consider the Jamaican birds example where the counts are obtained from patches with different sized areas. Suppose that our model has p additional regressors, $x_1, x_2, \ldots, x_p$. We fit a count regression model for the mean (either as a Poisson or negative binomial regression) using a log link and an offset of $\log(\text{area})$.

$$\log\mu = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \log(\text{area})$$

The term $\log(\text{area})$ is an offset because it's not multiplied by a parameter. Its coefficient is 1, although any numeric multiplier could in principle be used. An equivalent way to write this equation is the following.

$$\log\left(\frac{\mu}{\text{area}}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
Thus by including an offset we end up fitting a model for the rate of occurrence, counts per unit area, as was desired. The use of an offset then is just a trick that allows us to use Poisson or negative binomial regression, methods that are appropriate for count data, to fit a rate model.
The glm function has an offset argument for specifying the variable to use as an offset. For regression functions that lack an explicit offset argument, such as glm.nb, an offset can still be included by wrapping the variable in the offset function inside the model formula (this approach also works with lm and glm).
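A sketch of the first form; mod3 is a hypothetical object name.

# offset supplied through glm's offset argument
mod3 <- glm(S ~ landscape + factor(year), data = birds,
            family = poisson, offset = log(area))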
The following is completely equivalent.
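A sketch of the equivalent formula-based specification, followed by the summary shown below.

# same model with the offset specified inside the formula
mod3 <- glm(S ~ landscape + factor(year) + offset(log(area)),
            data = birds, family = poisson)
summary(mod3)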
If we examine the summary table we see that the offset doesn't appear anywhere except in the listed formula. Because the coefficient of log(area) is not estimated, no coefficient is reported in the summary table.
Deviance Residuals:
Min 1Q Median 3Q Max
-13.5597 -0.1101 1.5889 3.3103 9.7338
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.19264 0.02786 42.802 < 2e-16 ***
landscapeBauxite 0.57236 0.03672 15.586 < 2e-16 ***
landscapeForest 0.55630 0.03401 16.355 < 2e-16 ***
landscapeUrban 0.57255 0.03903 14.668 < 2e-16 ***
factor(year)2006 0.02139 0.03351 0.639 0.523
factor(year)2007 -0.11795 0.03001 -3.930 8.49e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4250.4 on 256 degrees of freedom
Residual deviance: 3798.6 on 251 degrees of freedom
(49 observations deleted due to missingness)
AIC: 5077.7
Number of Fisher Scoring iterations: 5
When we include log(area) as an offset in a regression model we constrain its regression coefficient to be one. One way to determine if the use of an offset is justified is by fitting the analysis of covariance model and testing whether the regression coefficient of log(area) is equal to 1, i.e., test H0: β = 1. We previously fit the analysis of covariance model so I use the confint function to obtain 95% confidence intervals for the regression parameters.
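A sketch of the call, assuming the analysis of covariance model is named mod2.

# profile likelihood confidence intervals for the covariate model
confint(mod2)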
The 95% confidence interval for the coefficient of log(area) is (0.18, 0.24). Because this interval does not include 1 we can reject a coefficient of 1 as being a reasonable value at the α = .05 level. We can also assess the appropriateness of the offset model with AIC. If we compare the AICs of the various Poisson models we've fit, we see that the Poisson model with a log area offset appears to be a terrible model. The analysis of covariance model ranks best.
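A sketch of the comparison, using the hypothetical model names assumed above.

# AIC comparison of the Poisson models fit so far
AIC(mod1, mod2, mod3)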
The inclusion of an offset in a Poisson regression model is done primarily for purposes of interpretation. If interpreting the response as a rate is not interesting, but controlling for the lack of equivalence of the observations is, it is perfectly legitimate to include time, area, or population size in the model as a covariate. Covariates generally are variables that are of no interest by themselves but are included solely because it is believed that to omit them would have deleterious effects. When a predictor is included as a covariate rather than as an offset, its regression coefficient is estimated instead of being set to one. When a variable is included as a covariate it may not be necessary or desirable to log-transform it first as was done when it was entered as an offset. My personal choice is to always include a potential offset as a covariate rather than as an offset. If it does turn out that the estimated coefficient of the covariate is not significantly different from 1, then and only then might I replace it with an offset and thus save a degree of freedom. I prefer to let the data tell me if treating the response as a density or a rate is the correct choice to make.
We've already noted that the Jamaican birds data set is a repeated measures design with measurements made over three years. Bird counts were obtained from the same patch in multiple years although not every patch was visited every year, so the data are unbalanced.
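A sketch of the tabulation; "patch" is a hypothetical name for the patch identifier column.

# number of years each patch contributes a non-missing richness value
table(table(birds$patch[!is.na(birds$S)]))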
We see that 12 patches were visited once, 22 patches were visited twice, and 67 patches were visited for all three years. There was no interest in tracking bird richness over time. The concern was that using only a single year of data could be misleading if that year happened to be particularly good or particularly bad for birds. As was discussed in lecture 11 repeated measures designs yield heterogeneous data. Observations from the same patch in different years might be expected to be more similar to each other than they are to observations from different patches in the same or different years.
One way to account for this heterogeneity is to introduce a term that is shared by all observations from the same patch. Because we're not particularly interested in the value of this term we make it a random effect, a parameter whose values are drawn from a probability distribution. This specific kind of mixed effects model is called a random intercepts model. Let i denote the patch and j the observation made on that patch. The model we're fitting is the following.

$$\log\mu_{ij} = \beta_0 + u_{0i} + \beta_1 I(\text{year}_{ij}=2006) + \beta_2 I(\text{year}_{ij}=2007) + \beta_3 I(\text{Bauxite}_i) + \beta_4 I(\text{Forest}_i) + \beta_5 I(\text{Urban}_i) + \beta_6\log(\text{area}_i)$$

with $y_{ij} \sim \text{Poisson}(\mu_{ij})$ and $u_{0i} \sim N\!\left(0, \tau^2\right)$.
So, in this model we assume that the population log mean richness values in 2005 for patches in agricultural areas with area = 1 are drawn from a common population with mean β0. To obtain the mean in a particular patch we add to β0 the patch-level random effect u0i, which varies from patch to patch. These patch means then all shift by an amount β1 in 2006 and by an amount β2 in 2007 relative to their values in 2005. The patch mean is further affected by the landscape type and the area of the patch.
The Poisson random intercepts model is fit in R with the lmer function from the lme4 package as follows.
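A sketch of the call; in current versions of lme4 a Poisson mixed model is fit with glmer rather than lmer, and the patch identifier and model name are hypothetical.

library(lme4)
mod4 <- glmer(S ~ landscape + factor(year) + log(area) + (1 | patch),
              data = birds, family = poisson)
summary(mod4)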
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.06613 0.05170 59.31 < 2e-16 ***
landscapeBauxite -0.22858 0.05531 -4.13 3.58e-05 ***
landscapeForest -0.06744 0.05205 -1.30 0.195084
landscapeUrban -0.26663 0.05855 -4.55 5.26e-06 ***
factor(year)2006 -0.13039 0.03389 -3.85 0.000119 ***
factor(year)2007 -0.10911 0.03044 -3.58 0.000339 ***
log(area) 0.22222 0.02100 10.58 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the summary output we see that the between patch variance τ2 is estimated to be 0.019. When we compare the summary table of the random intercept model to the summary table of a model without random effects the changes in the point estimates are rather small.
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7971 -0.7549 0.0437 0.7403 2.2467
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.08891 0.03662 84.348 < 2e-16 ***
landscapeBauxite -0.22688 0.03728 -6.086 1.16e-09 ***
landscapeForest -0.06504 0.03441 -1.890 0.058744 .
landscapeUrban -0.27087 0.03931 -6.890 5.57e-12 ***
factor(year)2006 -0.11130 0.03321 -3.351 0.000805 ***
factor(year)2007 -0.10660 0.03005 -3.547 0.000389 ***
log(area) 0.21003 0.01353 15.518 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 691.67 on 256 degrees of freedom
Residual deviance: 297.53 on 250 degrees of freedom
(49 observations deleted due to missingness)
AIC: 1578.6
Number of Fisher Scoring iterations: 4
More striking though is the fairly large increase in the standard errors of the landscape effects in the random intercepts model. Such an increase should be expected because without the random effects we are treating the individual measurements made on the same patch but in different years as independent replicates of the landscape effect. This is pseudo-replication and by including a random effect that links these measurements together and recognizes their correlation we've obtained a more realistic estimate of the precision of the landscape effect.
It would be nice to be able to use AIC to compare this model to the unstructured Poisson model, but it turns out that the log-likelihood values reported by the lmer function aren't comparable to those generated from other distributions or using other R functions. (The AIC from lmer can be used to compare other Poisson mixed effects models.)
A naive interpretation of the output would suggest that the mixed effects model is a dramatic improvement over the ordinary Poisson regression model, but this isn't necessarily correct. Apparently a portion of the log-likelihood not involving the parameters has been left out of the log-likelihood that is reported by lmer. To correct for this we need to calculate the log-likelihood ourselves. To this end the next section presents some basic information on the likelihood of random effects models.
The switch from a pure fixed effects model to a random intercepts model leads to a fundamental change in the likelihood that is being maximized. Because both the response variable and the random effects have a probability distribution, together they have a joint probability distribution. From our standpoint the random effects are nuisance parameters. The likelihood we desire is the likelihood of the data without the random effects. We can obtain this likelihood by integrating the joint likelihood with respect to the random effects. The trick in all this is to recognize that a joint probability function can be written as the product of a conditional distribution and a marginal distribution, each of which is a known distribution. It is this product that we then integrate to obtain the desired likelihood.
In much of what follows I use generic notation for the distributions involved because the ideas carry over to distributions other than the Poisson distribution that we're focused on here. I assume that the observations come in clusters and that the clusters are a random sample from a population of clusters. The clusters for the Jamaican birds example are the patches.
Methods employing maximum likelihood all start with the likelihood function. Recall that if we have a random sample of observations, $y_1, y_2, \ldots, y_n$, that are independent and identically distributed with probability density (mass) function $f(y_i; \theta)$, where θ is potentially a vector of parameters of interest, then the likelihood of our sample is the product of the individual density functions. For the Poisson regression model θ consists solely of the set of regression parameters β, so the likelihood can be written as follows.

$$L(\beta) = \prod_{i=1}^{n} f(y_i; \beta)$$

With correlated data, i.e., structured data, the situation is more complicated. The data consist of observations $y_{ij}$ where i indicates the level-2 unit (cluster) and j the level-1 observation within that unit. Typically the level-2 units will be independent but the level-1 units will not be. If we let

$$\mathbf{y}_i = \left(y_{i1}, y_{i2}, \ldots, y_{in_i}\right)$$

so that $\mathbf{y}_i$ collects the observations from cluster i, then we can write the likelihood in terms of the m level-2 units as follows.

$$L(\theta) = \prod_{i=1}^{m} f(\mathbf{y}_i; \theta)$$

If we choose to model the hierarchical structure using a mixed effects model then the $y_{ij}$ are not the only random quantities to consider. The random effects $u_i$ also have a distribution. Thus the natural starting place in constructing the likelihood is with the joint density function of the data and random effect in cluster i, $f(\mathbf{y}_i, u_i)$. From elementary probability theory we can write

$$f\left(\mathbf{y}_i, u_i\right) = f\left(\mathbf{y}_i \mid u_i\right) g\left(u_i\right).$$

Let ψ denote the full set of parameters: the regression parameters β and the random effects variance τ2. If we include the parameters in the likelihood then the joint distribution of the data and the random effects in cluster i would be written as follows.

$$f\left(\mathbf{y}_i, u_i; \psi\right) = k\left(\mathbf{y}_i \mid u_i; \beta\right) h\left(u_i; \tau^2\right)$$

Conditional on the value of the random effect, the individual $y_{ij}$ in cluster i are assumed to be independent. So we have

$$k\left(\mathbf{y}_i \mid u_i; \beta\right) = \prod_{j=1}^{n_i} k\left(y_{ij} \mid u_i; \beta\right).$$

Here I use $k\left(y_{ij} \mid u_i; \beta\right)$ to denote the distribution of the individual $y_{ij}$ conditional on the random effect and the regression and variance parameters. To go from the joint probability of $\mathbf{y}_i$ and $u_i$ to the marginal probability for $\mathbf{y}_i$ we integrate over the possible values of $u_i$. Thus the marginal likelihood for cluster i is the following.

$$L_i\left(\beta, \tau^2\right) = \int_{-\infty}^{\infty} \prod_{j=1}^{n_i} k\left(y_{ij} \mid u_i; \beta\right) h\left(u_i; \tau^2\right)\, du_i$$

Putting all these pieces together we can write the marginal likelihood of our data as follows.

$$L\left(\beta, \tau^2\right) = \prod_{i=1}^{m} \int_{-\infty}^{\infty} \prod_{j=1}^{n_i} k\left(y_{ij} \mid u_i; \beta\right) h\left(u_i; \tau^2\right)\, du_i$$
For the Jamaican birds random intercepts model the distribution of the random effects is normal and the distribution of the response given the random effects is Poisson. Thus the functions k and h in the last expression are a Poisson probability and a normal density, respectively.

$$L_i\left(\beta, \tau^2\right) = \int_{-\infty}^{\infty} \prod_{j=1}^{n_i} \frac{e^{-\mu_{ij}}\mu_{ij}^{\,y_{ij}}}{y_{ij}!}\cdot\frac{1}{\sqrt{2\pi\tau^2}}\,e^{-u_i^2/(2\tau^2)}\, du_i, \qquad \log\mu_{ij} = \mathbf{x}_{ij}'\beta + u_i$$
This expression doesn't simplify further (unlike the case when both k and h are normal densities) and so the integration needs to be carried out numerically in order to obtain the marginal likelihood. This is what the lmer function of the lme4 package does.
To obtain the correct marginal log-likelihood for a Poisson random intercepts model that we can then use to compare with ordinary Poisson regression models, I've written a function that inserts the maximum likelihood estimates returned by lmer into the above expression for the joint likelihood, which I then integrate numerically to obtain the marginal log-likelihood. This function is shown below.
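The author's original function is not reproduced here; what follows is a minimal sketch of the same idea, assuming the model was fit with glmer (mod4 above) and contains a single random intercept. It omits the extra convergence-control argument mentioned below.

library(lme4)
pois.ri.logLik <- function(model) {
  # marginal log-likelihood of a Poisson random intercepts model, obtained by
  # plugging in the estimates and integrating out the random effect numerically
  y   <- getME(model, "y")                               # response
  eta <- as.vector(getME(model, "X") %*% fixef(model))   # fixed-effect linear predictor
  grp <- getME(model, "flist")[[1]]                      # cluster (patch) identifier
  tau <- sqrt(as.numeric(VarCorr(model)[[1]]))           # random intercept SD
  # contribution of one cluster: integrate Poisson probabilities x normal density
  cluster.loglik <- function(ind) {
    f <- function(u) sapply(u, function(u1)
      prod(dpois(y[ind], lambda = exp(eta[ind] + u1))) * dnorm(u1, 0, tau))
    log(integrate(f, -Inf, Inf)$value)
  }
  # sum the per-cluster log-likelihoods
  sum(tapply(seq_along(y), grp, cluster.loglik))
}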
To use it here we just provide it with the name of the Poisson random intercepts model that was fit by lmer. There is an additional argument that may be needed to get the integral to converge, but it is not needed for this problem.
If we compare this to the log-likelihood of the ordinary Poisson regression model without random intercepts we see that the log-likelihood of the random effects model is larger.
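For example, with the hypothetical model names used above:

pois.ri.logLik(mod4)    # marginal log-likelihood of the random intercepts model
logLik(mod2)            # log-likelihood of the ordinary Poisson model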
To compare the AICs we need the number of estimated parameters in the random intercepts model. This should be one more than it is for the ordinary Poisson regression model, the extra parameter being τ2, the variance of the random effects. Alternatively we can extract this information using the attr function applied to the output from logLik.
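For example:

# number of estimated parameters in each model
attr(logLik(mod2), "df")
attr(logLik(mod4), "df")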
Comparing the AICs we obtain the following.
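A sketch of the comparison, again using the hypothetical object names from above.

# AIC = -2 * log-likelihood + 2 * (number of parameters)
c(glm = AIC(mod2),
  glmm = -2 * pois.ri.logLik(mod4) + 2 * attr(logLik(mod4), "df"))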
So the AIC of the random intercepts model, 1540.97, is appreciably lower than the AIC of the ordinary Poisson regression model. To check the fit of the random intercepts model we would next need to carry out a predictive simulation using the fitted means to generate Poisson distributions like we've done previously for other examples and assess how closely the pseudo-data that are generated resemble the data that were actually obtained.
Sokal, Robert R. and F. J. Rohlf. 1995. Biometry. W. H. Freeman and Company: New York.
A compact collection of all the R code displayed in this document appears here.
Jack Weiss. E-mail: jack_weiss@unc.edu. Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, NC 27599. Copyright © 2012. Last revised November 11, 2012. URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture22.htm