Lecture 21—Monday, November 5, 2012

Topics

Discussion of Assignment 8
Testing the fit of a negative binomial regression with a continuous predictor
- Graphical assessment of fit
- Analytical assessment of fit
R code

R functions and commands demonstrated

fitted returns the predicted mean from a regression model for each observation inverting the link function. It is equivalent to using predict with the type="response" argument.
lapply for applying a function to a list of elements. The output of lapply is also a list.

R packages used

lattice for lattice graphics.
MASS for the glm.nb function.

Discussion of Assignment 8

See HW 8 solutions.

Testing the fit of a negative binomial regression with a continuous predictor

Last time we fit various regression models of the species-area relationship for data on the vascular flora of the Galapagos islands. The best model according to AIC was a negative binomial regression model with log(Area) as the predictor.

# read in galapagos flora data

gala <- read.table('ecol 563/galapagos.txt', header=T)

# fit NB-2 model

library(MASS)

out.NB2 <- glm.nb(Species~log(Area), data=gala)

In the count regression models we've considered previously our predictors have all been categorical variables. These categorical variables provided a natural way to group the data values and made carrying out a goodness of fit test easy. We compared the observed distribution of counts in each group against what the negative binomial regression model predicted using either a formal chi-squared test or its simulation-based version (nonparametric bootstrap). In the Galapagos regression model log(Area) is a continuous predictor so this is not possible. Each observation has a different value of area and hence its own individual predicted negative binomial distribution so there are no natural groups.The negative binomial regression model can still be treated as a data-generating mechanism but now we have a different mechanism for each observation.

Given an island's size our negative binomial regression model defines a probability distribution of likely species richness values for that island. This predicted distribution can form the basis for a goodness of fit test. For each observation we can ask whether the observed richness value looks like a typical realization from the estimated probability distribution for that observation. This can be assessed graphically by plotting the predicted probability distributions with the observed richness values superimposed. We can then follow this up analytically by calculating a p-value in which we treat the observed richness as the value of our test statistic and the predicted probability distribution as our null model.

Graphical assessment of fit

With 29 different distributions to examine for the 29 different islands the most parsimonious way to display things is with a panel graph. The thing that makes producing this graph complicated is that there is a wide range in observed richness values among the different islands ranging from 2 to 444 species.

range(gala$Species)

[1] 2 444

To obtain a sensible picture we will to need to plot the probability distribution for each island using different settings for the x- and y-axes, cutting off the distributions when the probabilities are too small to be displayed. The fitted function of R returns the predicted mean on the raw scale and is equivalent to using the predict function with type='response'. I add the predicted means as a column to the gala data frame.

fitted(out.NB2)

         1          2          3          4          5          6          7
76.112866 25.242555 13.155450 10.019655   7.769198 15.700120 31.817486
         8          9         10         11         12         13         14
6.441081 12.431872 103.695024 249.071798 18.978209 21.293508 66.476019
        15         16         17         18         19         20         21
518.139404 139.004430   4.303854 104.531677 67.310634 13.602074 41.765825
        22         23         24         25         26         27         28
236.601139 239.823172 283.606385 74.973770 153.911956 29.176613 25.242555
        29
34.258938

gala$mu <- fitted(out.NB2)

I add the predicted probabilities of the observed richness values as another column in the gala data frame.

gala$z <- dnbinom(gala$Species, mu=fitted(out.NB2), size=out.NB2$theta)

min(gala$z)

[1] 0.001036964

max(gala$z)

[1] 0.1436656

To get a sense of the range of count values that need to be displayed I calculate the probability of the smallest richness value, 0, and the probability of a fairly large richness value, 1000.

max(dnbinom(1000, mu=fitted(out.NB2), size=out.NB2$theta))

[1] 0.0003065981

min(dnbinom(0, mu=fitted(out.NB2), size=out.NB2$theta))

[1] 8.946816e-07

So for some islands the probability of zero counts will be too small to show up on the graph whereas for others counts as large as 1000 will still have a non-negligible probability. With these rough guidelines for the data range I calculate the probability of obtaining richness values between 0 and 1000 for each of the 29 islands. I use the lapply function for this to force the output to take the form of a list. (The lapply function is identical to sapply except that sapply will try to format list output as a vector or a matrix when it can while lapply will always produce list output.)

out.p <- lapply(1:29, function(x) dnbinom(0:1000, mu=fitted(out.NB2)[x], size=out.NB2$theta))

At least for some of the islands a lot of the calculated probabilities will be too small to show up on the graph. For each island I determine the largest count for which the calculated probability exceeds 1e-5.

sapply(1:29, function(x) max((0:1000)[out.p[[x]]>1e-5]))

[1] 344 129 73 58 47 85 159 40 70 454 987 101 111 305 1000 589
[17] 29 457 309 75 202 943 955 1000 340 644 147 129 169

For each island the expression out.p[[x]]>1e-5 will evaluate to TRUE or FALSE depending on whether the calculated probability exceeds the given threshold. The expression (0:1000)[out.p[[x]]>1e-5] will then return only those count categories whose probabilities exceed the threshold. Finally max finds the largest of these counts. I store the result as a column ux in the gala data frame.

gala$ux <- sapply(1:29, function(x) max((0:1000)[out.p[[x]]>1e-5]))

We can do a similar calculation for each island at the low end, determining the smallest count category that still has a non-negligible probability.

sapply(1:29, function(x) min((0:1000)[out.p[[x]]>1e-5]))

[1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 5 0 0 0 0 0 0 1 1 1 0 0 0 0 0

Because these values are barely different from zero it seems reasonable to start all the panel graphs at zero. I create a variable lx that records this minimum value.

gala$lx <- rep(0, 29)

Next I determine the maximum calculated probability on each island and store the result. I record the minimum probability as zero.

gala$uy <- sapply(out.p, max)

gala$ly <- rep(0, 29)

I write a prepanel function that calculates the x-limits (xlim) and y-limits (ylim) for each panel using the variables lx, ux, ly, and uy.

prepanel.ci2 <- function(x, y, ly, lx, uy, ux, subscripts, ...) {

list(ylim=range(y, uy[subscripts], ly[subscripts], finite=T), xlim=range(x, ux[subscripts], lx[subscripts], finite=T))}

We're now ready to create the graph.

I plot the observed probability against the observed count separately by island.
I specify the variables containing the values of ux, lx, uy, and ly as arguments to be passed to the prepanel function that is also specified.
The panel function panel.xyplot draws the predicted negative binomial distribution for each island.
The panel.abline function draws a vertical line at the location that corresponds to the observed species richness value for that island.
The additional argument: scale=list(x="free", y="free") causes the x-limits and y-limits to be set separately in each panel.
To reduce the size of the text appearing in the panel strips I use the strip argument as shown.

# plot distributions for each island

library(lattice)

xyplot(z~Species|Island, data=gala, ux=gala$ux, lx=gala$lx, uy=gala$uy, ly=gala$ly, prepanel=prepanel.ci2, ylab='Probability', xlab='Species richness', panel=function(x, y, subscripts){

panel.xyplot(0:1000, dnbinom(0:1000, mu=gala$mu[subscripts], size=out.NB2$theta), type='h', col='grey')

panel.abline(v=x, col=2, lty=2)},

scale=list(x='free', y='free'), strip=strip.custom(par.strip.text=list(cex=0.8)))

Fig. 1 Predicted negative binomial distributions for individual islands with the location of the observed species richness value superimposed.

Based on the graph it would appear that the observed richness on Champion and Gardner.a islands may be unusually large at least compared to the model prediction. At the other extreme the richness on Gardner.b may be unusually low.

Analytical assessment of fit

To get a better sense of the quality of the fit shown in Fig. 1 we can calculate p-values for the observed richness values. A p-value is the probability of obtaining a predicted value that is as extreme or more extreme than the observed richness value. We can calculate both a lower-tailed p-value (for observed richness values that are too low) and an upper-tailed p-value (for observed richness values that are too big). The upper-tailed p-value is of the form P(X ≥ k). The pnbinom function returns P(X ≤ k), so 1 – pnbinom(k) returns P(X > k). Because the negative binomial function is discrete, to obtain the desired upper-tailed p-value using pnbinom we have to subtract 1 from the observed richness value: P(X ≥ k) = P(X > k–1) = 1 – pnbinom(k – 1).

upper.p <- 1-pnbinom(gala$Species-1, mu=gala$mu, size=out.NB2$theta)

upper.p

[1] 0.58535291 0.30618234 0.94174006 0.04041012 0.92039206 0.35479054 0.90561894
[8] 0.33598154 0.96789984 0.46247618 0.87505733 0.01116176 0.94354653 0.70718942
[15] 0.65445700 0.87859926 0.79455247 0.42418245 0.15644491 0.51231540 0.14058454
[22] 0.31532278 0.42674999 0.16357795 0.53807249 0.09706292 0.19253194 0.68307008
[29] 0.69856167

sum(upper.p<.05)

[1] 2

gala$Island[upper.p<.05]

[1] Champion Gardner.a

Two of the islands have upper-tailed p-values less than α = .05: Gardner.a and Champion. Having carried out 29 tests (one for each island) we'd expect to obtain 29 × .05 = 1.45 significant results by chance alone, so obtaining two "significant" results here is probably nothing to be alarmed about. In a similar fashion we can calculate lower-tailed p-values: P(X ≤ k) = pnbinom(k). The smallest observed p-value is not less than .05.

lower.p <- pnbinom(gala$Species, mu=gala$mu, size=out.NB2$theta)

lower.p

[1] 0.42435700 0.71221773 0.09648845 0.96626903 0.15123079 0.67703197 0.11293472
[8] 0.72867434 0.06506618 0.54387039 0.12756742 0.98995874 0.07951873 0.30434031
[15] 0.34704017 0.12605349 0.34911315 0.58177527 0.84765877 0.53419809 0.86531981
[22] 0.68682199 0.57588233 0.83745902 0.47142078 0.90412548 0.81839213 0.34634258
[29] 0.32342568

sum(lower.p<.05)

[1] 0

min(lower.p)

[1] 0.06506618

We can summarize things using a dot plot with the panels representing the lower-tailed and upper-tailed results.

pval.dat <- data.frame(pvalue=c(lower.p, upper.p), island=rep(gala$Island,2), label=rep(c('lower', 'upper'), each=nrow(gala)))

dotplot(island~pvalue|factor(label, levels=levels(label), labels=c('Lower-tailed', 'Upper-tailed')), data=pval.dat, xlab='p-value', panel=function(x,y) {

panel.dotplot(x,y)

panel.abline(v=.05, col=2,lty=2)})

Fig. 2 Upper-tailed and lower-tailed p-values measuring the probability of obtaining a richness value as extreme or more extreme than what was observed. Dashed vertical lines in each panel are located at α = .05.

Much of what's displayed in Fig. 2 is redundant. The left panel is nearly a mirror image of the right half. (It's not a perfect mirror image because P(X = k) is counted in both the lower-tailed and the upper-tailed calculations.) Given that we're interested in observed richness values that are both too small and too large it probably makes sense to carry out a two-tailed test here. For each island I calculate either a lower-tailed or an upper-tailed p-value (but not both) depending on whether P(X ≤ k) is less than or greater than 0.5.

p.to.plot <- ifelse(pnbinom(gala$Species, mu=gala$mu, size=out.NB2$theta) < .5, pnbinom(gala$Species, mu=gala$mu, size=out.NB2$theta), pnbinom(gala$Species-1, mu=gala$mu, size=out.NB2$theta))

I next place everything in a single graph with lower-tailed p-values displayed on the left and the upper-tailed p-values displayed on the right. I add vertical lines at α = .025 and .975 to identify significant results. The overall rejection region thus corresponds to α = .05. I use the scales argument to relabel the tick marks for the upper-tailed p-values. The use of the page function is just window dressing that allows me to add the identifying text at the bottom of the graph using the grid graphics system.

library(grid)

dotplot(gala$Island~p.to.plot, xlab=' ', xlim=c(-0.03,1.03), panel=function(x,y){

panel.dotplot(x,y)

panel.abline(v=c(.025,.975), col=2, lty=2)

panel.abline(v=0.5, col='grey70', lwd=2)},

scales=list(x=list(at=seq(0,1,.2), labels=c(0, 0.2, 0.4, 0.4, 0.2, 0))), page = function(n) {

grid.text("upper-tailed p-value", x = .85, y = 0.025, default.units = "npc", just = c("right", "bottom"))

grid.text("lower-tailed p-value", x = .45, y = 0.025, default.units = "npc", just = c("right", "bottom"))})

Fig. 3 Probabilities of obtaining values more extreme than the observed richness value. Dashed vertical lines are placed at α = .025.

R Code

A compact collection of all the R code displayed in this document appears here.

Course Home Page

Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--November 6, 2012
URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture21.htm