Assignment 10—Solution

Question 1

bycatch <- read.csv('ecol 563/bycatch.csv')

bycatch[1:6,]

   Season Area Gear.Type Time Tows Bycatch
1 1989-90 North    Bottom   Day   48       0
2 1989-90 North    Bottom Night    6       0
3 1989-90 North Mid-Water Night   1       0
4 1989-90 South    Bottom   Day 139       0
5 1989-90 South Mid-Water   Day    6       0
6 1989-90 South    Bottom Night    6       0

out.manly <- glm(Bycatch~Season+Area+Gear.Type+Time, data=bycatch, offset=log(Tows), family=poisson)

Question 2

Let y be the Poisson response, . Let x be a continuous predictor, z be a dichotomous variable coded 0 and 1, and w be the offset variable, then a generic Poisson regression model with an offset is the following.

offset

Exponentiating both sides and using the fact that exp(a + b + c) = exp(a + b) exp(c) we obtain the following.

exponentiated

Now z is a dummy variable coded 0 and 1. Thus we have two predicted rates depending on the value of z.

dichotomous

So with a dummy regressor in a Poisson regression with a log link, exponentiating the coefficient of the dummy regressor tells us the amount by which the mean rate is multiplied when we switch from the reference level (z = 0) to the other level (z = 1). Area, gear type, and time of day are dummy variables in the current model. So we just need to exponentiate their estimated coefficients to determine the multiplying factor. The regression coefficients and their exponentiated values are summarized in the table below.

round(coef(out.manly),3)

       (Intercept)      Season1990-91      Season1991-92      Season1992-93
            -7.328            -19.665              1.819              0.074
     Season1993-94      Season1994-95          AreaSouth Gear.TypeMid-Water
            -0.932             -0.135              1.822              1.918
         TimeNight
             2.177

exp(coef(out.manly)[7:9])

AreaSouth Gear.TypeMid-Water TimeNight
6.187000 6.805199 8.815959

The table summarizes the effect of area, gear type, and time of day on the estimated mean bycatch rate.

Predictor	Reference level (z = 0)	Effect level (z = 1)	β	exp(β)	Interpretation
Area	North	South	1.822	6.187	The bycatch rate is 6.2 times greater in the South than the North.
Gear type	Bottom	Midwater	1.918	6.805	The bycatch rate is 6.8 times greater with midwater gear than with bottom gear.
Time of day	Day	Night	2.177	8.816	The bycatch rate is 8.8 times greater at night than during the day.

Question 3

The numbers reported by Manly are the same numbers shown at the bottom of the summary table of the model.

Null deviance: 335.096 on 42 degrees of freedom
Residual deviance: 42.078 on 34 degrees of freedom
AIC: 97.461

The residual deviance is an appropriate goodness of fit statistic for a Poisson regression and has a chi-squared distribution only if the individual predicted counts are large. A seat-of-the-pants rule is that no more than 20% of the predicted counts should be less than 5 and none should be less than 1. I tabulate the predicted means obtained from the estimated model.

table(fitted(out.manly)<5)/sum(table(fitted(out.manly)<5))

FALSE TRUE
0.1162791 0.8837209

table(fitted(out.manly)<1)/sum(table(fitted(out.manly)<1))

FALSE TRUE
0.2325581 0.7674419

So 88.4% of the predicted counts are less than 5 and 76.7% of the mean counts are less than 1. Using the residual deviance as a goodness of fit statistic here is highly suspect.

Question 4

The summary table of coefficient estimates is shown below.

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          -7.32782    0.59038 -12.412 < 2e-16 ***
Season1990-91       -19.66524 2439.00216 -0.008   0.9936
Season1991-92         1.81895    0.28503   6.382 1.75e-10 ***
Season1992-93         0.07391    0.39557   0.187   0.8518
Season1993-94        -0.93191    0.41132 -2.266   0.0235 *
Season1994-95        -0.13503    0.30923 -0.437   0.6623
AreaSouth             1.82245    0.41063   4.438 9.07e-06 ***
Gear.TypeMid-Water    1.91769    0.44308   4.328 1.50e-05 ***
TimeNight             2.17656    0.45138   4.822 1.42e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The obvious problem is that the standard error for the season 1990-1991 effect estimate is 2439, over 100 times larger than the point estimate. Notice that all the other reported standard errors in the model are less 1, so this particular standard error is clearly anomalous. If we tabulate the bycatch by season we can see why this is happening.

table(bycatch$Bycatch, bycatch$Season)

     1989-90 1990-91 1991-92 1992-93 1993-94 1994-95
0        6       5       3       6       7       6
1        0       0       1       0       0       0
2        0       0       1       0       0       0
5        0       0       2       0       0       0
6        0       0       0       0       0       1
8        0       0       0       0       1       0
9        0       0       0       1       0       0
15       0       0       0       0       0       1
16       0       0       1       0       0       0
23       1       0       0       0       0       0

In the 1990-1991 season there was no bycatch at all. This apparently is causing the software problems in estimating the variability of this effect. Given that we can't estimate the precision, we should probably not try to estimate the individual season effects.

Question 5

To fit a random effects model we need to use the lme4 package and let the intercepts vary by season.

library(lme4)

out.manly2 <- lmer(Bycatch~Area + Gear.Type + Time + (1|Season), data=bycatch, offset=log(Tows), family=poisson)

Question 6

I modify the code we've used previously to display the predicted Poisson distributions for individual bycatch observations.

bycatch$mu <- fitted(out.manly2)

bycatch$z <- dpois(bycatch$Bycatch, fitted(out.manly2))

bycatch$obs <- 1:nrow(bycatch)

out.p <- lapply(1:nrow(bycatch), function(x) dpois(0:45, lambda=fitted(out.manly2)[x]))

bycatch$ux <- sapply(1:nrow(bycatch), function(x) max((0:45)[out.p[[x]]>1e-4]))

bycatch$lx <- sapply(1:nrow(bycatch), function(x) min((0:45)[out.p[[x]]>1e-4]))

bycatch$uy <- sapply(out.p, max)

bycatch$ly <- rep(0, nrow(bycatch))

prepanel.ci2 <- function(x, y, ly, lx, uy, ux, subscripts, ...) {

list(ylim=range(y, uy[subscripts], ly[subscripts], finite=T), xlim=range(x, ux[subscripts], lx[subscripts], finite=T))}

library(lattice)

xyplot(z~Bycatch|factor(obs), data=bycatch, ux=bycatch$ux, lx=bycatch$lx, uy=bycatch$uy, ly=bycatch$ly, layout=c(5,5), prepanel=prepanel.ci2, ylab='Probability', xlab='Bycatch', panel=function(x, y, subscripts){

panel.xyplot(0:45, dpois(0:45, lambda=bycatch$mu[subscripts]), type='h', col='grey', lwd=2, lineend=1)

panel.abline(v=x, col=2, lty=2)},

scale=list(x=list(relation='free',at=c(0:4,seq(5,45,5))), y='free'), strip=strip.custom(par.strip.text=list(cex=0.8)))

Fig. 1(a) Predicted Poisson distributions with observed bycatch values superimposed. Observations 1–25.

Fig. 1(b) Predicted Poisson distributions with observed bycatch values superimposed. Observations 26–43.

From the graphs we can see that observations 16, 19, and 41 are poorly described by the model. We can test this formally by calculating upper and lower-tailed p-values. I carry out a two-tailed test and compare each p-value against α = 0.025

upper.p <- 1-ppois(bycatch$Bycatch-1, lambda=bycatch$mu)

lower.p <- ppois(bycatch$Bycatch, lambda=bycatch$mu)

sum(upper.p<=.025)

[1] 3

sum(lower.p<=.025)

[1] 0

3/43

[1] 0.06976744

There are three significant results. Because 3 out of 43 = 6.9%, this is a little bit larger than we'd expect by chance. Fig. 2 displays the upper and lower-tailed p-values separately. (Note: it's not possible here to display them in a common plot because for most of the observations both the lower- and upper-tailed p-values are large, greater than 0.9.)

pval.dat <- data.frame(pvalue=c(lower.p, upper.p), obs=rep(bycatch$obs,2), label=rep(c('lower', 'upper'), each=nrow(bycatch)))

dotplot(obs~pvalue|factor(label, levels=levels(label), labels=c('Lower-tailed', 'Upper-tailed')), data=pval.dat, xlab='p-value', ylab='Observation', panel=function(x,y) {

panel.dotplot(x,y)

panel.abline(v=.05, col=2,lty=2)}, scales=list(y=list(at=seq(1,43,2), cex=.7)))

Fig. 2 Upper- and lower-tailed p-values for individual observations with respect to their predicted Poisson distributions.

Question 7

I first identify the observations with significant p-values.

(1:43)[upper.p<.05]

[1] 16 19 41

I extract the residuals, label them, and display them in decreasing order.

out.r <- residuals(out.manly2)

names(out.r) <- 1:43

round(sort(out.r, decreasing=TRUE),4)

     16      41      19      27      15      35       7      8       9      36      22
5.0873 3.2824 2.7921 0.6848 0.6225 0.3864 0.2637 -0.0082 -0.0979 -0.0993 -0.1178
     28      11      43      30       1      29       2       3      38      21      14
-0.1448 -0.1488 -0.1713 -0.1779 -0.1782 -0.1836 -0.1896 -0.1988 -0.2173 -0.3075 -0.3236
     23      40      10       5      34      33       6      26      32      37      17
-0.3734 -0.3813 -0.3847 -0.4005 -0.4566 -0.4661 -0.4689 -0.4824 -0.5050 -0.5532 -0.5941
     13      42      31      25      18       4      24      12      20      39
-0.6329 -0.6460 -0.6707 -0.6927 -0.6933 -0.7502 -0.8181 -0.9427 -1.2328 -1.6007

So the significant p-values correspond to the observations with the largest Pearson residuals. Thus the residuals can serve as goodness of fit statistics in the sense that we're using it here.

Question 8

I refit the random intercepts model but this time treating log(Tows) as a covariate rather than an offset.

out.manly3 <- lmer(Bycatch~Area+Gear.Type+Time+log(Tows)+(1|Season), data=bycatch, family=poisson)

There are three ways to determine whether a model with log(Tows) as a covariate is superior to a model in which log(Tows) is treated as an offset: likelihood ratio test, Wald test, and AIC.

Likelihood ratio test

The Poisson random intercepts model with log(Tows) as an offset is nested in the Poisson random intercepts model with log(Tows) treated as a covariate. In the offset model we're setting the coefficient equal to 1. Thus we can formally compare the models using a likelihood ratio test. The hypothesis test this yields is the following.

H₀: β = 1
H₁: β ≠ 1

We can carry out the test with the anova function (or by hand by using twice the difference in the loglikelihoods).

anova(out.manly2, out.manly3)

Data: bycatch
Models:
out.manly2: Bycatch ~ Area + Gear.Type + Time + (1 | Season)
out.manly3: Bycatch ~ Area + Gear.Type + Time + log(Tows) + (1 | Season)
           Df    AIC    BIC logLik Chisq Chi Df Pr(>Chisq)
out.manly2 5 78.928 87.734 -34.464
out.manly3 6 77.327 87.894 -32.663 3.6008      1    0.05775 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So the likelihood ratio test is not significant. Thus 1 is a legitimate value for the coefficient suggesting that treating log(Tows) as an offset is reasonable.

Wald test

A second test we can carry out is a Wald test. The Wald test shown in the summary table is not useful here because it tests H₀: β = 0, whereas we want to test whether H₀: β = 1. Testing that a parameter has a specific value is equivalent to determining whether that value lies in a confidence interval for that parameter. We can construct the confidence interval by hand using the output from the summary table. The confidence interval takes the form .

summary(out.manly3)@coefs

                     Estimate Std. Error   z value     Pr(>|z|)
(Intercept)        -6.2572205 0.9865600 -6.342463 2.261201e-10
AreaSouth           1.9094421 0.4093076 4.665055 3.085350e-06
Gear.TypeMid-Water 1.8989800 0.3975042 4.777258 1.777018e-06
TimeNight           2.1012796 0.4143827 5.070867 3.960082e-07
log(Tows)           0.6188474 0.1847997 3.348747 8.117790e-04

summary(out.manly3)@coefs[5,1] + c(qnorm(.025), qnorm(.975)) * summary(out.manly3)@coefs[5,2]

[1] 0.2566467 0.9810482

Because 1 does not lie in the 95% confidence interval a Wald test rejects the null hypothesis that the coefficient of log(Tows) is equal to one. Therefore according to the Wald test log(Tows) should not be treated as an offset. Alternatively to carry out the test formally we need to calculate the following test statistic.

Wald

The two-tailed p-value is calculated as follows.

1-pnorm(abs((summary(out.manly3)@coefs[5,1]-1) / summary(out.manly3)@coefs[5,2]),.025)

[1] 0.02079911

Using AIC

I compare the models using AIC.

AIC(out.manly2, out.manly3)

df AIC
out.manly2 5 78.92763
out.manly3 6 77.32684

Although the AIC values are close, the model in which the coefficient of log(Tows) is estimated has a lower AIC than a model in which log(Tows) is treated as an offset, even though we have to estimate one more parameter. Thus our results are ambiguous. Both the Wald test and AIC favor treating log(Tows) as a covariate while the likelihood ratio test favors treating log(Tows as an offset. This is not surprising given the arbitrariness of the α = .05 cut-off.

Goodness of fit

I calculate the upper and lower-tailed p-values for the new model.

bycatch$mu2 <- fitted(out.manly3)

upper.p2 <- 1-ppois(bycatch$Bycatch-1, lambda=bycatch$mu2)

lower.p2 <- ppois(bycatch$Bycatch, lambda=bycatch$mu2)

sum(upper.p2<=.025)

[1] 2

sum(lower.p2<=.025)

[1] 0

So, the number of significant p-values has been reduced from three to two. I examine the largest residuals from this model.

out.r2 <- residuals(out.manly3)

names(out.r2) <- 1:43

round(sort(out.r2, decreasing=TRUE),4)[1:3]

16 41 19
3.5137 3.0470 2.1011

#offset model

round(sort(out.r, decreasing=TRUE),4)[1:3]

16 41 19
5.0873 3.2824 2.7921

So the three largest positive residuals are all reduced from what they were in the offset model.

HW 10 scores

Course Home Page

Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for Ecology and the Environment, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--November 25, 2012
URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/solutions/assign10.htm