Lecture 16—Wednesday, October 17, 2012

Topics

Graphs of various negative binomial distributions
Negative binomial model for the aphid count data
Assessing the fit of the negative binomial model graphically
Testing the fit of the negative binomial model to the aphid count data
- Parametric goodness of fit test
- Simulation-based goodness of fit test
The Crawley slug data set
Cited references
R code

R functions and commands demonstrated

as.character when applied to a factor extracts the character values of the levels.
as.numeric when applied to a factor extracts the numerical values of the levels (the numbers 1 up to the number of levels).
barchart is the bar plot function of lattice.
barplot is the bar plot function of base graphics.
diag extracts the diagonal entries of a matrix.
dnbinom is the probability mass function of the negative binomial distribution.
glm.nb (from MASS) extends the glm function to fit regression models with negative binomial response variables.
panel.barchart is the panel function of lattice that produces bar plots.
pnbinom is the cumulative distribution function of the negative binomial distribution.
solve is used to compute the inverse of a square matrix
t is the transpose function. It swaps the rows and columns of a matrix.

R function options

beside= (argument to barplot) when TRUE creates side-by-side bar plots rather than stacked bar plots.
cex.names= (argument to barplot) controls the size of the bar labels.
horizontal= (argument to barplot and barchart) takes on values TRUE (the default) or FALSE. It controls whether the bars in the bar plot are displayed horizontal or vertical.
legend.text= (argument to barplot) when TRUE adds a legend to a bar plot when multiple groups are plotted.
lineend= (argument to xyplot of lattice) is used to change the shape of the ends of line segments.
names.arg= (argument to barplot) specifies a vector containing the names to be used to label the bars.
origin= (argument to barchart of lattice) sets the value at which to place the bottom of the bars of the bar plots.
test= (argument to anova) is used to specify a parametric test for model comparison. We used test='Chisq' to obtain the likelihood ratio test.
type= (argument to predict) can be used to obtain different kinds of model predictions for glm models. The default is a prediction on the scale of the linear predictors; the alternative type="response" is a prediction on the scale of the response variable.

R packages used

lattice for lattice graphics functions.
MASS for the glm.nb function.

Graphs of various negative binomial distributions

If you examine the help screen for the dnbinom function, ?dnbinom, the following syntax for the function is displayed.

dnbinom(x, size, prob, mu, log = FALSE)

The dnbinom function can be used to calculate probabilities using the classical definition of the negative binomial (Pascal) or the ecological definition of the negative binomial (Polya). For the ecological parameterization the relevant parameters are size and mu, corresponding to the dispersion parameter and the mean, respectively. Notice that mu is the 4th listed argument and appears after an argument we will not be using, prob. Thus we'll always need to specify the value of mu by name when we use the dnbinom function. For consistency I also specify the size argument by name so that I don't have to remember to list it second.

As a first illustration of various negative binomial distributions I fix the mean of the negative binomial distribution to be 1 (μ = 1) and I vary the dispersion parameter from small (θ = 0.1) to relatively large (θ = 100).

par(mfrow=c(1,4))

par(lend=2)

#fix the mean, vary the dispersion parameter

plot(0:20, dnbinom(0:20, mu=1, size=.1), type='h', lwd=4, xlab='count category', ylab='probability', ylim=c(0,.8))

mtext(side=3, line=.5, expression(list(mu==1, theta==.1)), cex=.9)

plot(0:20, dnbinom(0:20, mu=1, size=1), type='h', lwd=4, xlab='count category', ylab='probability', ylim=c(0,.8))

mtext(side=3, line=.5, expression(list(mu==1, theta==1)), cex=.9)

plot(0:20, dnbinom(0:20, mu=1, size=10),type='h', lwd=4, xlab='count category', ylab='probability', ylim=c(0,.8))

mtext(side=3, line=.5, expression(list(mu==1, theta==10)), cex=.9)

plot(0:20, dnbinom(0:20,mu=1,size=100), type='h', lwd=4, xlab='count category', ylab='probability', ylim=c(0,.8))

mtext(side=3, line=.5, expression(list(mu==1, theta==100)), cex=.9)

par(mfrow=c(1,1))


Fig. 1 Negative binomial distributions with the same mean but different dispersions

What we see is that as θ gets small the fraction of zeros increases and the distribution becomes more spread out. Observe that with θ = 0.1 almost 80% of the observations are zero yet there is a non-negligible probability of obtaining a count of 20 or more. On the other hand when θ = 100 less than 40% of the observations are zero and the probability of obtaining a count beyond 10 is negligible.

When θ = 100 in Fig. 1 the distribution looks very Poisson-like. We can investigate this by comparing it to a Poisson distribution with the same mean.

#compare Poisson and NB

par(mfrow=c(1,2))

plot(0:12, dnbinom(0:12, mu=1, size=100), type='h', lwd=6, xlab='count category', ylab='probability', ylim=c(0,.5))

mtext(side=3, line=.5, expression(paste('NB: ', list(mu==1, theta==100))))

plot(0:12, dpois(0:12,1), type='h', lwd=6, xlab='count category', ylab='probability', ylim=c(0,.5))

mtext(side=3, line=.5, expression(paste('Poisson: ', lambda==1)))

par(mfrow=c(1,1))


Fig. 2 Comparing a negative binomial and Poisson distribution with the same mean (μ = 1) and θ relatively large (θ = 100)

The distributions are nearly identical. In retrospect this should be expected. As θ → ∞, the negative binomial distribution with mean μ becomes a Poisson distribution with mean μ.

Finally I fix the dispersion parameter of the negative binomial distribution at a small value and vary the mean.

#fix the dispersion parameter, θ = 0.1, vary the mean

par(mfrow=c(1,3))

plot(0:40, dnbinom(0:40, mu=1, size=.1), type='h', lwd=3, xlab='count category', ylab='probability', ylim=c(0,.8))

mtext(side=3, line=.5, expression(paste('NB: ', list(mu==1, theta==.1))), cex=.9)

plot(0:40, dnbinom(0:40, mu=10, size=.1), type='h', lwd=3, xlab='count category', ylab='probability', ylim=c(0,.8))

mtext(side=3, line=.5, expression(paste('NB: ', list(mu==10, theta==.1))), cex=.9)

plot(0:40, dnbinom(0:40, mu=20, size=.1), type='h', lwd=3, xlab='count category', ylab='probability', ylim=c(0,.8))

mtext(side=3, line=.5, expression(paste('NB: ', list(mu==20, theta==.1))), cex=.9)

par(mfrow=c(1,1))


Fig. 3 Negative binomial distributions with the same dispersion but different means

What's striking about Fig. 3 is that other than a slight shrinkage in the zero fraction, the distributions look about the same. In truth the right tails of the distribution also become more prominent as the mean increases. Still the primary consequence of increasing the mean is to reduce the number of zeros. A negative binomial distribution with a small dispersion parameter will still have a sizeable zero fraction even if it has a very large mean.

Negative binomial model for the aphid count data

I reload the aphid count data, fit the Poisson model, and display the data and the fit of the Poisson model using the code from last time.

num.stems <- c(6,8,9,6,6,2,5,3,1,4)

# data frame of tabulated data

aphids <- data.frame(aphids=0:9, counts=num.stems)

aphid.data <- rep(0:9,num.stems)

poisson.LL <- function(lambda) sum(log(dpois(aphid.data, lambda)))

poisson.negloglik <- function(lambda) -poisson.LL(lambda)

out.pois <- nlm(poisson.negloglik,3)

exp.pois <- c((dpois(0:8, out.pois$estimate)), 1-ppois(8,out.pois$estimate))*50

out.bar <- barplot(num.stems, ylim=c(0,11), names.arg=c(0:8,'9+'))

points(out.bar, exp.pois, pch=16, cex=.9, type='o')

legend('topright', 'Poisson model', pch=16, col=1, lty=1, cex=.9, bty='n')

fig. 0

Fig. 4 Fit of the Poisson model

Constructing the negative binomial log-likelihood

Following an argument that is completely analogous to the one presented for the Poisson distribution, the log-likelihood for a negative binomial distribution applied to these data is the following.

dnbinom

Translating this into an R function yields the following.

#negative binomial log-likelihood

NB.LL <- function(mu,theta) sum(log(dnbinom(aphid.data, mu=mu, size=theta)))

As you can see the only difference from the Poisson function is that we need to specify two arguments, the mean and the dispersion parameter. Instead of specifying the two arguments as separate variables, we can specify them as components of a vector.

#alternative vector version

NB.LL2 <- function(p) sum(log(dnbinom(aphid.data, mu=p[1], size=p[2])))

The next few lines contrast how these two versions of the negative binomial log-likelihood function differ.

#first function

NB.LL(3,4)

[1] -116.1379

#incorrect use of 2nd function with two separate arguments

NB.LL2(3,4)

Error in NBvec.pos(3, 4) : unused argument(s) ( ...)

#correct usage in which arguments are entered as a vector

NB.LL2(c(3,4))

[1] -116.1379

The nlm function does minimization, not maximization, so our function needs to return the negative of the log-likelihood. The nlm function also requires that the function have a single vector argument of parameters, i.e., it requires the second version of the log-likelihood function given above. I use the function NB.LL2 created above to define a negative log-likelihood negative binomial function.

#negative log-likelihood for nlm

negNB.LL <- function(p) -NB.LL2(p)

Obtaining the MLE

I use the nlm function and the negative log-likelihood function created above to numerically approximate the MLEs. I supply initial guesses for and .

out.NB <- nlm(negNB.LL, c(3,4), hessian=TRUE)

out.NB

$minimum
[1] 114.7009

$estimate
[1] 3.459995 2.645024

$gradient
[1] -2.251151e-05 1.317917e-05

$hessian
[,1] [,2]
[1,] 6.2589597977 0.0002248452
[2,] 0.0002248452 0.9535296142

$code
[1] 1

$iterations
[1] 8

From the output we see the algorithm converged (code = 1) and both components of the gradient are close to zero. The estimated value of μ is 3.460 while that of θ is 2.645.

Negative binomial regression functions

A simpler way to fit the negative binomial distribution to the aphids data is as regression model using the glm.nb function from the the MASS package. (MASS is part of the standard installation of R and does not need to be downloaded first.) To estimate the two parameters we fit a regression model with only an intercept. The glm.nb function only fits negative binomial regressions so there is no family argument. There is a link argument and like Poisson regression the default link is a log link.

library(MASS)

out.nb <- glm.nb(aphid.data~1)

summary(out.nb)

Call:
glm.nb(formula = aphid.data ~ 1, init.theta = 2.645011574, link = log)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.1035 -1.1302 -0.1688 0.6997 1.4722

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2413 0.1155 10.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(2.645) family taken to be 1)

Null deviance: 57.073 on 49 degrees of freedom
Residual deviance: 57.073 on 49 degrees of freedom
AIC: 233.4

Number of Fisher Scoring iterations: 1

Theta: 2.65
Std. Err.: 1.02

2 x log-likelihood: -229.402

The reported intercept is for the mean model, log μ = β₀. Therefore to obtain the estimate of the mean we need to exponentiate this value.

exp(coef(out.nb))

(Intercept)
3.46

The dispersion parameter is returned as the model component called theta.

out.nb$theta

[1] 2.645012

The estimates of μ and θ are identical to what we obtained using the nlm function.

Assessing the fit of the negative binomial model graphically

I produce a graph that compares the fit of the negative binomial and Poisson models. First I examine the predicted probabilities and counts (by summing over all 50 observations).

dnbinom(0:9, mu=out.NB$estimate[1], size=out.NB$estimate[2])

[1] 0.10943985 0.16405655 0.16945422 0.14869883 0.11893284 0.08958118
[7] 0.06468935 0.04527822 0.03093792 0.02073880

# expected counts

dnbinom(0:9, mu=out.NB$estimate[1], size=out.NB$estimate[2])*50

[1] 5.471993 8.202827 8.472711 7.434941 5.946642 4.479059 3.234468 2.263911
[9] 1.546896 1.036940

# anything left over?

sum(dnbinom(0:9, mu=out.NB$estimate[1], size=out.NB$estimate[2])*50)

[1] 48.09039

I combine the counts 9 and above into the last category using the pnbinom function.

NB.p <- c (dnbinom(0:8, mu=out.NB$estimate[1], size=out.NB$estimate[2]), 1-pnbinom(8, mu=out.NB$estimate[1], size=out.NB$estimate[2]))

exp.NB <- NB.p*50

I rename the last category so that it correctly indicates 9 and above.

names(num.stems) <- c(0:8, "9+")

I next produce a bar graph that compares the fit of the Poisson and negative binomial models.

out.bar <- barplot(num.stems, ylim=c(0, max(c(exp.NB, exp.pois, num.stems))))

#add negative binomial

points(out.bar, exp.NB, col=2, pch=22, cex=.9, type='o')

#add poisson

points(out.bar, exp.pois, col=1, pch=16, cex=.9, type='o')

legend('topright', c('negative binomial', 'Poisson'), col=c(2,1), lty=1, pch=c(22,16), bty='n', cex=.8)

fig. 4

Fig. 5 Fit of the Poisson and negative binomial models

Based on the plot the fit of the negative binomial model appears to be a big improvement over that of the Poisson.

Testing the fit of the negative binomial model

Parametric goodness of fit test

If we examine the expected counts we see that a number of them are below 5.

exp.NB

[1] 5.471993 8.202827 8.472711 7.434941 5.946642 4.479059 3.234468
[8] 2.263911 1.546896 2.946552

As was noted in lecture 15 if too many of the expected counts are small (5 being a guideline for smallness) then the chi-squared distribution of the Pearson statistic becomes questionable. To deal with this I combine categories. One way to do this would be to leave the first five categories alone, combine the next two, and combine the last three. A good strategy is to start with the tail and work backwards.

NB.p.new <- c(NB.p[1:5] ,sum(NB.p[6:7]), sum(NB.p[8:10]))

NB.merge <- 50*NB.p.new

NB.merge

[1] 5.471993 8.202827 8.472711 7.434941 5.946642 7.713526 6.757360

I merge these same categories for the observed counts and carry out the test. The degrees of freedom for the chi-squared statistic is m – 1 – p where m is the number of categories (after merging) and p is the number of estimated parameters. We have two estimated parameters: μ and θ so p = 2.

Oi.merge <- c(num.stems[1:5], sum(num.stems[6:7]), sum(num.stems[8:10]))

Pearson <- sum((Oi.merge-NB.merge)^2/NB.merge)

Pearson

[1] 0.6607193

Pearson.p <- 1-pchisq(Pearson, length(Oi.merge)-1-2)

Pearson.p

[1] 0.9560832

We can also use the output from the chisq.test function to obtain the Pearson statistic and then calculate the p-value ourselves (to properly account for the two parameters we estimated).

chisq.NB <- chisq.test(Oi.merge, p=NB.p.new)

names(chisq.NB)

[1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
[7] "expected" "residuals"

1-pchisq(chisq.NB$statistic, df=chisq.NB$parameter-2)

X-squared
0.9560832

The p-value is large so we fail to reject the fit of the negative binomial model.

Simulation-based goodness of fit test

By pooling categories we have distorted the negative binomial distribution. Failing to find a significant lack of fit with the pooled categories does not guarantee that the model would still fit if we didn't pool the categories. We can obtain a better measure of fit by using a randomization test.

chisq.test(num.stems, p=NB.p, simulate.p.value=TRUE, B=9999)

Chi-squared test for given probabilities with simulated p-value
(based on 9999 replicates)

data: num.stems
X-squared = 3.5113, df = NA, p-value = 0.9459

The result is nearly identical to what we obtained with the parametric test using the combined categories.

A skeptic might argue that the way we're handling the last category is overly favorable to our model. The empirical distribution of the raw data does not decrease to zero; the last observed category has four observations. By lumping the tail of the negative binomial distribution into the last observed category we force the expected results to more closely resemble the observed data. To satisfy such a skeptic I create a new category, X = 10+, for both the observed and expected counts and redo the randomization test. This category represents stems with aphid counts of 10 or more. Of course, the observed frequency of this new category is 0.

new.numstems <- c(num.stems,0)

new.NB.p <- c(dnbinom(0:9, mu=out.NB$estimate[1], size=out.NB$estimate[2]), 1-pnbinom(9, mu=out.NB$estimate[1], size=out.NB$estimate[2]))

new.NB.p

[1] 0.10943985 0.16405655 0.16945422 0.14869883 0.11893284 0.08958118
[7] 0.06468935 0.04527822 0.03093792 0.02073880 0.03819224

new.NB.p*50

[1] 5.471993 8.202827 8.472711 7.434941 5.946642 4.479059 3.234468
[8] 2.263911 1.546896 1.036940 1.909612

chisq.test(new.numstems, p=new.NB.p, simulate.p.value=TRUE, B=9999)

Chi-squared test for given probabilities with simulated p-value
(based on 9999 replicates)

data: new.numstems
X-squared = 13.5113, df = NA, p-value = 0.1923

So even after creating a separate final category in which the observed count is zero, we still fail to find a significant lack of fit (although the p-value of our test has been reduced substantially).

Warning. I should point out one deficiency of the simulation-based test. There is no penalty for overfitting. In the Pearson chi-squared test the degrees of freedom of the chi-squared distribution used for assessing the significance of the test statistic gets reduced by the number of parameters that were estimated. This has the effect of reducing the threshold (critical value) for lack of fit. As a result, models that use more parameters are forced to meet a higher goodness of fit standard than are models using fewer parameters. This makes it harder for complicated models to pass the test. Thus the parametric version of the chi-squared test has a built-in protection against over-fitting. With the simulation-based version of the test there is no such protection.

The Crawley slug data set

We use a data set that appears on Michael Crawley's web site for his text Statistical Computing (Crawley 2002) and is also used in The R Book (2007).

#obtain data

slugs <- read.table( 'http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/slugsurvey.txt', header=TRUE)

names(slugs)

[1] "slugs" "field"

dim(slugs)

[1] 80 2

slugs[1:8,]

  slugs   field
1     3 Nursery
2     0 Nursery
3     0 Nursery
4     0 Nursery
5     0 Nursery
6     1 Nursery
7     0 Nursery
8     3 Nursery

Crawley describes these data as follows (Crawley 2002, p. 542). "Count data were obtained on the number of slugs found beneath 40 tiles placed in a stratified random grid over each of two permanent grasslands. This was part of a pilot study in which the question was simply whether the mean slug density differed significantly between the two grasslands." My interpretation of this is that the experiment was extraordinarily simple. Rocks of standard shape and size were laid out in two fields and at some point later in time the fields were revisited and the number of slugs under each rock was recorded. The data we are given are the raw counts, i.e., the individual slug counts for each rock. Thus we have a total of 80 observations, 40 from each field type. A value of 0 means no slugs were found under that rock.We can verify this by tabulating the data by field type.

table(slugs$field)

Nursery Rookery
40 40

Crawley then spends a chapter of his textbook trying one statistical test after another to test the hypothesis of no mean difference in slugs between the two field types. In the end his conclusion is ambiguous. Some tests say there is a difference, some say there isn't.

I submit that the problem posed by Crawley is essentially an uninteresting one. Any two populations in nature will typically differ with respect to whatever characteristic we care to measure. Whether that difference can be deemed statistically significant is not a statement about nature, but instead is a statement about the sample size used in the study. With enough data any difference, no matter how small, will achieve "statistical significance". A far more useful approach is to determine some way to characterize the differences in slug distribution that exist and then assess whether that characterization tells us anything interesting about slugs and/or nature in general.

First we summarize the distribution of slugs under rocks in the two field types.

slug.table <- table(slugs$slugs, slugs$field)

slug.table

  Nursery Rookery
0      25       9
1       5       9
2       2       8
3       2       5
4       2       2
5       1       4
6       1       1
7       1       0
8       0       1
9       0       1
10      1       0

A bar plot of the two distributions would be useful. One problem with using a bar plot with ordinal data such as count categories is that it treats the categories as nominal, so if there are any gaps (missing categories) they won't show up in the plot unless we insert the zero categories ourselves beforehand. This is not a problem here because each of the eleven count categories are present in at least one of the field types, so the table function has inserted the zero values when needed.

Bar plots using base graphics

The output of the table function can be used as input to the barplot function to produce a bar plot. When we generate separate frequency distributions for one variable (slugs) by the levels of a second variable (field), the barplot default is to generate a stacked bar plot (Fig. 6a). If we add the beside=T argument we get two separate bar plots, one for each category of field.

#stacked bar plot

barplot(table(slugs$slugs, slugs$field))

#separate bar plots

barplot(table(slugs$slugs, slugs$field), beside=T)

(a)	(b)
Fig. 6 Distribution of slugs counts using (a) a stacked bar plot and (b) separate bar plots

Because our goal is to superimpose predictions from a model onto the observed distribution of slug counts, the stacked bar plot is of no use to us. Although it appears that there are two separate graphs in Fig 1b in fact this is a single graph with a single set of coordinates on the x-axis. We can see this by assigning a name to the output of the barplot function and printing the object.

coord1 <- barplot(table(slugs$slugs, slugs$field), beside=T)

coord1

[,1] [,2]
[1,] 1.5 13.5
[2,] 2.5 14.5
[3,] 3.5 15.5
[4,] 4.5 16.5
[5,] 5.5 17.5
[6,] 6.5 18.5
[7,] 7.5 19.5
[8,] 8.5 20.5
[9,] 9.5 21.5
[10,] 10.5 22.5
[11,] 11.5 23.5

The columns contain the x-coordinates of the bars with each column corresponding to a different field type. Notice that the top of the second column continues where the bottom of the first column left off.

Notice that the categories aren't labeled in the separate bar plots. We can add labels with the names.arg argument in which we specify a single vector of names for both graphs. When we do so the field type labels under the graphs disappear so we need to add those with the mtext function. To position the labels at the center of each display I use the values in row 6 of the matrix of bar coordinates contained in the object coord1. The cex.names argument can be used to reduce the size of the bar labels so that more are displayed.

coord1 <- barplot(table(slugs$slugs, slugs$field), beside=T, names.arg=c(0:10,0:10), cex.names=.75)

mtext(side=1, line=2.5, at=coord1[6,], text=c('Nursery', 'Rookery'), cex=.9)

fig 2

Fig. 7 Separate bar plots of the two slug distributions with labels for the bars

When there are only two groups to display in a bar plot, another option is to use a side by side bar plot in which the distributions of the two groups are interwoven. This can be accomplished with barplot and the beside=T option if we first transpose the table of counts with the t function so that the rows of the matrix correspond to the groups and the columns correspond to the separate count categories. There is a legend.text argument which if set to TRUE will add a legend indicating which bar corresponds to which group.

t(table(slugs$slugs, slugs$field))

0 1 2 3 4 5 6 7 8 9 10
Nursery 25 5 2 2 2 1 1 1 0 0 1
Rookery 9 9 8 5 2 4 1 0 1 1 0

barplot(t(table(slugs$slugs, slugs$field)), beside=T, legend.text=T)

fig 3

Fig. 8 Side by side bar plots of the two slug distributions with labels

The coordinates of the bars now are contained in a 2 × 11 matrix.

coord2 <- barplot(t(table(slugs$slugs, slugs$field)), beside=T, legend=T)

coord2

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] 1.5 4.5 7.5 10.5 13.5 16.5 19.5 22.5 25.5 28.5 31.5
[2,] 2.5 5.5 8.5 11.5 14.5 17.5 20.5 23.5 26.5 29.5 32.5

Bar plots using lattice

When the goal is to produce separate graphs according to the levels of another variable (such as the field variable here) lattice is usually a better choice than base graphics. In order to use lattice we need to organize the table of counts into a data frame. This can be accomplished with the data.frame function applied directly to the output from the table function.

slugtable <- data.frame(table(slugs$slugs, slugs$field))

slugtable

   Var1    Var2 Freq
1     0 Nursery   25
2     1 Nursery    5
3     2 Nursery    2
4     3 Nursery    2
5     4 Nursery    2
6     5 Nursery    1
7     6 Nursery    1
8     7 Nursery    1
9     8 Nursery    0
10    9 Nursery    0
11   10 Nursery    1
12    0 Rookery    9
13    1 Rookery    9
14    2 Rookery    8
15    3 Rookery    5
16    4 Rookery    2
17    5 Rookery    4
18    6 Rookery    1
19    7 Rookery    0
20    8 Rookery    1
21    9 Rookery    1
22   10 Rookery    0

slugtable$Var1

[1] 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Levels: 0 1 2 3 4 5 6 7 8 9 10

The count categories have been converted to a factor. It will be useful to have the actual numerical values when we try to add model predictions. If we try to convert them to numbers with the as.numeric function we see that we don't get what we want.

as.numeric(slugtable$Var1)

[1] 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11

The problem is that factor levels are always numbered starting at 1. The trick to getting the actual numeric labels back is to first convert the factor levels to character data with the as.character function and then convert the character data to numbers.

as.numeric(as.character(slugtable$Var1))

[1] 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

slugtable$Var1.num <- as.numeric(as.character(slugtable$Var1))

The bar plot function in lattice is called barchart. The default arrangement is a horizontal bar plot if the x-variable is numeric and a vertical bar chart if the x-variable is a factor. To get a vertical bar plot under all circumstances we can add the horizontal = F argument. The graph is shown in Fig. 9a.

library(lattice)

barchart(Freq~Var1|Var2, data=slugtable, xlab='Count category')

(a)	(b)
Fig. 9 Bar plots using the lattice barchart function with (a) default settings and (b) with the origin=0 argument.

We see that the graph is not ideal.

The bars start at the bottom of the graph rather than at zero. We can fix that by specifying origin=0.
The default bar color is hard on the eyes. I change the color to gray.

The following produces the graph shown in Fig. 9b.

barchart(Freq~Var1|Var2, data=slugtable, xlab='Count category', origin=0, col='grey')

We can get almost the same display with the xyplot function by using the type='h' option to draw vertical lines from the plotted points to the x-axis with lwd (line width) set to a large value to produce bars. To get square ends on the bars I use lineend=1. For xyplot we need the numerical version of Var1 rather than the factor version. Fig. 10 shows the result.

xyplot(Freq~Var1.num|Var2, data=slugtable, xlab='Count category', type='h', lineend=1, lwd=10, col='grey')

fig. 10
Fig. 10 Bar plots using the lattice xyplot function

The differences from barchart are that we don't get the dark borders around the bars and the missing categories are just missing and not indicated by horizontal lines as they are in Fig. 4. More importantly, if there had been gaps in the data (count categories with a frequency of zero that were not recorded) then xyplot would display the gaps whereas barchart would not. The barchart function uses the levels of the factor function to construct the categories.

Maximum likelihood estimation of a Poisson regression model

I start by fitting a single Poisson distribution to the data that ignores the field type from the which the slug data were obtained. The negative log-likelihood function we need is the same one we used in lecture 15 except modified to use the slug counts rather than the aphid counts we analyzed there.

#negative LL for a Poisson model--common mean

negpois1.LL<-function(p){

LL<-sum(log(dpois(slugs$slugs,lambda=p)))

-LL

}

This is a single mean model, one mean for both field types. The parameter λ in the Poisson distribution is the mean and its maximum likelihood estimate is the sample mean. For the initial estimate required by the nlm function I choose a value close to the sample mean.

mean(slugs$slugs)

[1] 1.775

out.pois1 <- nlm(negpois1.LL,2)

Warning messages:
1: In dpois(x, lambda, log) : NaNs produced
2: In nlm(negpois1.LL, 2) : NA/Inf replaced by maximum positive value

out.pois1

$minimum
[1] 176.8383

$estimate
[1] 1.774999

$gradient
[1] 1.056808e-06

$code
[1] 1

$iterations
[1] 5

The iterations converged (the gradient is approximately zero and code = 1) and as expected the function returns the sample mean as the point estimate of λ. The warning messages indicate that during the iterations we obtained an illegal value for dpois, probably because lambda had become negative. The function did recover and it appears we have sensible estimates.

We can easily generalize this function to fit a separate mean to each field. I begin by defining a 0-1 dummy variable that indicates the field type. For this we can use a Boolean expression that tests whether the field type is equal to "Rookery".

slugs$field=='Rookery'

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

as.numeric(slugs$field=="Rookery")

[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[37] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[73] 1 1 1 1 1 1 1 1

I next rewrite the negative Poisson log-likelihood function using this so that the mean varies by field type. We don't need to explicitly convert the Boolean expression to numeric because using it in an arithmetic expression will cause the conversion to happen automatically.

#negative LL for a Poisson model--separate means model

negpois2.LL <- function(p){

mu <- p[1] + p[2]*(slugs$field=="Rookery")

LL <- sum(log(dpois(slugs$slugs, lambda=mu)))

-LL

}

The first line of the function creates a variable called mu that is defined in terms of the Boolean expression. Since the Boolean expression returns a vector, mu will also be a vector. As the value of the Boolean expression changes from observation to observation, the value of mu will also change. For nursery slugs the Boolean expression returns 0. Thus for nursery slugs mu has the value

mu = p[1] + p[2]*0 = p[1]

For rookery slugs the Boolean expression returns 1. Thus for rookery slugs mu has the value

mu = p[1] + p[2]*1 = p[1] + p[2]

In summary we have the following.

So we see that p[1] is the value of the mean λ for nursery slugs. Since the value of λ for rookery slugs is p[1] + p[2] we see that p[2] represents the difference in the value of λ between nursery and rookery slugs. So, p[2] is the mean slug count difference between nursery and rookery slugs.

To obtain starting values for the elements of p I calculate the mean in each field type.

tapply(slugs$slugs, slugs$field,mean)

Nursery Rookery
1.275 2.275

So the sample mean for nursery slugs is 1.275 and the mean difference is 1. I use p = c(1.2,1) as initial estimates.

out.pois2 <- nlm(negpois2.LL, c(1.2,1))

out.pois2

$minimum
[1] 171.1275

$estimate
[1] 1.2749997 0.9999997

$gradient
[1] 1.125723e-05 -1.506351e-06

$code
[1] 1

$iterations
[1] 8

The iterations converged (both elements of the gradient vector are near zero and code = 1). The estimates nlm returns are the values we predicted they would be.

Testing for a mean difference: likelihood ratio and Wald tests

Likelihood ratio test

The two models we've fit for the mean are the following.

Poisson models

where z is the dummy variable defined above that indicates the rookery slugs. We wish to test

null hypothesis

The use of the likelihood ratio test in comparing models was discussed in lecture 13. The $minimum component of the nlm object is the negative log-likelihood evaluated at the maximum likelihood estimate. The likelihood ratio statistic is twice the difference in these values for the two models ordered in such a way that the difference is positive.

LRstat <- 2*(out.pois1$minimum-out.pois2$minimum)

LRstat

[1] 11.42156

This statistic has an asymptotic chi-squared distribution with degrees of freedom equal to the difference in the number of parameters estimated in the two models, which is 1 for these two models.

length(out.pois2$estimate)-length(out.pois1$estimate)

[1] 1

The likelihood ratio test is a one-tailed upper-tailed test. We reject the null hypothesis at α = .05 if . I obtain the p-value for this test next.

#Likelihood ratio test of whether the Poisson means are the same in the field types

1-pchisq(LRstat, df=1)

[1] 0.0007259673

So using the likelihood ratio test I reject the null hypothesis and conclude that the Poisson rate constants (means) of slug occurrence are different in the two field types.

Wald test

The Wald test is an alternative to the likelihood ratio test and was also discussed in lecture 13. In order to use it we only need to fit the second of the two models, but we do need to compute the Hessian matrix to obtain the standard errors of the parameter estimates. For that we need to include hessian=T as an argument of the nlm function.

#Wald test: need Hessian to calculate standard errors

out.pois2 <- nlm(negpois2.LL, c(2,1), hessian=T)

out.pois2

$minimum
[1] 171.1275

$estimate
[1] 1.274999 1.000001

$gradient
[1] 1.337493e-07 5.996976e-06

$hessian
[,1] [,2]
[1,] 48.94677 17.58066
[2,] 17.58066 17.58087

$code
[1] 1

$iterations
[1] 9

From lecture 8 we learned that the inverse of the negative of the Hessian yields the variance-covariance matrix of the parameters. Because we're working with the negative log-likelihood, the negative sign has already been taken care of. To carry out matrix inversion we use the solve function of R. The variances occur on the diagonal of the inverted matrix and can be extracted with the diag function. Finally we take their square root to obtain the standard errors of the parameter estimates.

#invert Hessian to get standard errors

se2 <- sqrt(diag(solve(out.pois2$hessian)))

se2

[1] 0.1785534 0.2979271

Individual 95% Wald confidence intervals are obtained from the formula

Using R we obtain the following.

#95% confidence intervals for both parameters

rbind(out.pois2$estimate + qnorm(.025)*se2, out.pois2$estimate + qnorm(.975)*se2)

[,1] [,2]
[1,] 0.9250408 0.4160743
[2,] 1.6249574 1.5839271

The 95% Wald confidence interval for β₀ is (0.95, 1.63) while that for β₁ is (0.42, 1.58). Because 0 is not contained in the confidence interval for β₁, we reject the null hypothesis and conclude that β₁ ≠ 0.

As an alternative to constructing confidence intervals we can carry out the Wald test formally. The Wald statistic for our null hypothesis is the following.

Wald formula

W <- out.pois2$estimate[2]/se2[2]

[1] 3.356528

qnorm(.975)

[1] 1.959964

We reject H₀ if . Because 3.356 > 1.96, we reject the null hypothesis and conclude that the Poisson rate constants are different in the two field types. The p-value of the test statistic is just the probability under the null model that . We obtain this by calculating this probability for one tail and then doubling the result.

2*(1-pnorm(abs(out.pois2$estimate[2]/se2[2])))

[1] 0.0007892766

Observe that this p-value is very close to what we obtained from the likelihood ratio test above.

Fitting the models using the glm function

As we've seen regression models can be fit to a response variable that has a Poisson distribution by using the glm function with family=poisson. The default link function is the log link thus using glm we actually fit the following two models:

Model 1: log μ = β₀
Model 2: log μ = β₀ + β₁*(field=='Rookery')

out1 <- glm(slugs~1, data=slugs , family=poisson)

out2 <- glm(slugs~field, data=slugs, family=poisson)

To obtain the likelihood ratio test of H₀: β₁ = 0, we use the anova function and list the reduced model followed by the full model. Adding the argument test='Chisq' produces the test.

anova(out1, out2, test='Chisq')

Analysis of Deviance Table

Model 1: slugs ~ 1
Model 2: slugs ~ field
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        79     224.86
2        78     213.44 1   11.422 0.000726 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can also get the likelihood ratio test by specifying only the full model in the anova function along with test='Chisq'. This time anova returns sequential likelihood ratio tests, much like the sequential F-tests it returns with normal models.

anova(out2, test='Chisq')

Analysis of Deviance Table

Model: poisson, link: log

Response: slugs

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                     79     224.86
field 1   11.422        78     213.44 0.000726 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The summary table of the full model reports the Wald test of H₀: β₁ = 0.

summary(out2)

Call:
glm(formula = slugs ~ field, family = poisson, data = slugs)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.1331 -1.5969 -0.9519 0.4580 4.8727

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.2429     0.1400   1.735 0.082744 .
fieldRookery   0.5790     0.1749   3.310 0.000932 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 224.86 on 79 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: 346.26

Number of Fisher Scoring iterations: 6

Notice that the reported p-values for the Wald test (p = 0.000932) and the likelihood ratio test (p = 0.000726) are similar but different. This will always be the case. The likelihood ratio test is based on the full log-likelihood while the Wald test uses only the curvature of the log-likelihood surface.

If we use the predict function on a glm model we get the predictions on the scale of the link function. For Poisson regression that means the predictions are of the log mean.

predict(out2)

        1         2         3         4         5         6         7
0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462
        8         9        10        11        12        13        14
0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462
       15        16        17        18        19        20        21
0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462
       22        23        24        25        26        27        28
0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462
       29        30        31        32        33        34        35
0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.2429462
       36        37        38        39        40        41        42
0.2429462 0.2429462 0.2429462 0.2429462 0.2429462 0.8219801 0.8219801
       43        44        45        46        47        48        49
0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801
       50        51        52        53        54        55        56
0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801
       57        58        59        60        61        62        63
0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801
       64        65        66        67        68        69        70
0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801
       71        72        73        74        75        76        77
0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801 0.8219801
       78        79        80
0.8219801 0.8219801 0.8219801

To get predictions on the scale of the response variable we need to add the type='response' argument.

predict(out2, type='response')

    1     2     3     4     5     6     7     8     9    10    11    12
1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275
   13    14    15    16    17    18    19    20    21    22    23    24
1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275
   25    26    27    28    29    30    31    32    33    34    35    36
1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275 1.275
   37    38    39    40    41    42    43    44    45    46    47    48
1.275 1.275 1.275 1.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275
   49    50    51    52    53    54    55    56    57    58    59    60
2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275
   61    62    63    64    65    66    67    68    69    70    71    72
2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275
   73    74    75    76    77    78    79    80
2.275 2.275 2.275 2.275 2.275 2.275 2.275 2.275

Graphing the fitted Poisson model

With only two groups it's easy to calculate the probabilities we need. I do things a bit more systematically in order to be able to generalize to more complicated situations. We need the Poisson probabilities for the categories 0 through 10. The Poisson mean varies depending on the field type the slugs inhabit so we'll need to account for this. I elect to add the tail probability, P(X > 10) to the last category, P(X = 10). The slugtable data frame variable Var1 contains the different count categories while Var2 contains the field variable. I begin by using the predict function with the newdata argument to obtain the predicted means for each observation in the slugtable data frame.

slugtable$mu <- predict(out2, type='response', newdata=data.frame(field=slugtable$Var2))

slugtable

   Var1    Var2 Freq Var1.num    mu
1     0 Nursery   25        0 1.275
2     1 Nursery    5        1 1.275
3     2 Nursery    2        2 1.275
4     3 Nursery    2        3 1.275
5     4 Nursery    2        4 1.275
6     5 Nursery    1        5 1.275
7     6 Nursery    1        6 1.275
8     7 Nursery    1        7 1.275
9     8 Nursery    0        8 1.275
10    9 Nursery    0        9 1.275
11   10 Nursery    1       10 1.275
12    0 Rookery    9        0 2.275
13    1 Rookery    9        1 2.275
14    2 Rookery    8        2 2.275
15    3 Rookery    5        3 2.275
16    4 Rookery    2        4 2.275
17    5 Rookery    4        5 2.275
18    6 Rookery    1        6 2.275
19    7 Rookery    0        7 2.275
20    8 Rookery    1        8 2.275
21   9 Rookery    1        9 2.275
22   10 Rookery    0       10 2.275

Next I calculate the Poisson probabilities for categories 0 through 10 (Var1.num) using the mean recorded in each row.

slugtable$p <- dpois(slugtable$Var1.num, lambda=slugtable$mu)

slugtable

   Var1    Var2 Freq Var1.num    mu            p
1     0 Nursery   25        0 1.275 2.794310e-01
2     1 Nursery    5        1 1.275 3.562745e-01
3     2 Nursery    2        2 1.275 2.271250e-01
4     3 Nursery    2        3 1.275 9.652812e-02
5     4 Nursery    2        4 1.275 3.076834e-02
6     5 Nursery    1        5 1.275 7.845926e-03
7     6 Nursery    1        6 1.275 1.667259e-03
8     7 Nursery    1        7 1.275 3.036794e-04
9     8 Nursery    0        8 1.275 4.839890e-05
10    9 Nursery    0        9 1.275 6.856511e-06
11   10 Nursery    1       10 1.275 8.742051e-07
12    0 Rookery    9        0 2.275 1.027969e-01
13    1 Rookery    9        1 2.275 2.338630e-01
14    2 Rookery    8        2 2.275 2.660191e-01
15    3 Rookery    5        3 2.275 2.017312e-01
16    4 Rookery    2        4 2.275 1.147346e-01
17    5 Rookery    4        5 2.275 5.220424e-02
18    6 Rookery    1        6 2.275 1.979411e-02
19    7 Rookery    0        7 2.275 6.433086e-03
20    8 Rookery    1        8 2.275 1.829409e-03
21    9 Rookery    1        9 2.275 4.624339e-04
22   10 Rookery    0       10 2.275 1.052037e-04

Finally I add the tail probability, P(X > 10) to the X = 10 category. For this I use a Boolean expression to select only the X = 10 category: slugtable$Var1.num==10. Recall that P(X > 10) = ppois(10, lambda, lower.tail=F). To make sure I've done this correctly I create a new variable and calculate all of the probabilities again adding the tail probability only to the X = 10 category.

slugtable$p2 <- dpois(as.numeric(as.character(slugtable$Var1)), lambda=slugtable$mu) + (slugtable$Var1==10) * ppois(as.numeric(as.character(slugtable$Var1)), lambda=slugtable$mu, lower.tail=F)

slugtable

   Var1    Var2 Freq Var1.num    mu            p           p2
1     0 Nursery   25        0 1.275 2.794310e-01 2.794310e-01
2     1 Nursery    5        1 1.275 3.562745e-01 3.562745e-01
3     2 Nursery    2        2 1.275 2.271250e-01 2.271250e-01
4     3 Nursery    2        3 1.275 9.652812e-02 9.652812e-02
5     4 Nursery    2        4 1.275 3.076834e-02 3.076834e-02
6     5 Nursery    1        5 1.275 7.845926e-03 7.845926e-03
7     6 Nursery    1        6 1.275 1.667259e-03 1.667259e-03
8     7 Nursery    1        7 1.275 3.036794e-04 3.036794e-04
9     8 Nursery    0        8 1.275 4.839890e-05 4.839890e-05
10    9 Nursery    0        9 1.275 6.856511e-06 6.856511e-06
11   10 Nursery    1       10 1.275 8.742051e-07 9.874605e-07
12    0 Rookery    9        0 2.275 1.027969e-01 1.027969e-01
13    1 Rookery    9        1 2.275 2.338630e-01 2.338630e-01
14    2 Rookery    8        2 2.275 2.660191e-01 2.660191e-01
15    3 Rookery    5        3 2.275 2.017312e-01 2.017312e-01
16    4 Rookery    2        4 2.275 1.147346e-01 1.147346e-01
17    5 Rookery    4        5 2.275 5.220424e-02 5.220424e-02
18    6 Rookery    1        6 2.275 1.979411e-02 1.979411e-02
19    7 Rookery    0        7 2.275 6.433086e-03 6.433086e-03
20    8 Rookery    1        8 2.275 1.829409e-03 1.829409e-03
21    9 Rookery    1        9 2.275 4.624339e-04 4.624339e-04
22   10 Rookery    0       10 2.275 1.052037e-04 1.319466e-04

Checking we see that the Nursery and Rookery probabilities now sum to 1.

sum(slugtable$p2[slugtable$Var2=='Nursery'])

[1] 1

sum(slugtable$p2[slugtable$Var2=='Rookery'])

[1] 1

If we sum the probabilities separately for each count category over all rocks of that field type, we get the predicted counts for that field. For our simple model the predicted probabilities are the same for each rock that comes from the same field type. So a simpler way to get the predicted counts is to multiply the probability distribution for one rock by the total number of rocks, which is 40 for each field type.

In general the situation could be more complicated so I handle this more generally. I tabulate the counts by field type and then use the Var2 variable to select which entry of this table I should use to multiply the individual probabilities.

n <- table(slugs$field)

Nursery Rookery
40 40

as.numeric(slugtable$Var2)

[1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2

#calculate predicted counts under Poisson model

slugtable$exp <- slugtable$p2 * n[as.numeric(slugtable$Var2)]

slugtable

   Var1    Var2 Freq Var1.num    mu            p           p2          exp
1     0 Nursery   25        0 1.275 2.794310e-01 2.794310e-01 1.117724e+01
2     1 Nursery    5        1 1.275 3.562745e-01 3.562745e-01 1.425098e+01
3     2 Nursery    2        2 1.275 2.271250e-01 2.271250e-01 9.084999e+00
4     3 Nursery    2        3 1.275 9.652812e-02 9.652812e-02 3.861125e+00
5     4 Nursery    2        4 1.275 3.076834e-02 3.076834e-02 1.230734e+00
6     5 Nursery    1        5 1.275 7.845926e-03 7.845926e-03 3.138370e-01
7     6 Nursery    1        6 1.275 1.667259e-03 1.667259e-03 6.669037e-02
8     7 Nursery    1        7 1.275 3.036794e-04 3.036794e-04 1.214717e-02
9     8 Nursery    0        8 1.275 4.839890e-05 4.839890e-05 1.935956e-03
10    9 Nursery    0        9 1.275 6.856511e-06 6.856511e-06 2.742604e-04
11   10 Nursery    1       10 1.275 8.742051e-07 9.874605e-07 3.949842e-05
12    0 Rookery   9        0 2.275 1.027969e-01 1.027969e-01 4.111876e+00
13    1 Rookery    9        1 2.275 2.338630e-01 2.338630e-01 9.354519e+00
14    2 Rookery    8        2 2.275 2.660191e-01 2.660191e-01 1.064076e+01
15    3 Rookery    5        3 2.275 2.017312e-01 2.017312e-01 8.069247e+00
16    4 Rookery    2        4 2.275 1.147346e-01 1.147346e-01 4.589384e+00
17    5 Rookery    4        5 2.275 5.220424e-02 5.220424e-02 2.088170e+00
18    6 Rookery    1        6 2.275 1.979411e-02 1.979411e-02 7.917644e-01
19   7 Rookery    0        7 2.275 6.433086e-03 6.433086e-03 2.573234e-01
20    8 Rookery    1        8 2.275 1.829409e-03 1.829409e-03 7.317635e-02
21    9 Rookery    1        9 2.275 4.624339e-04 4.624339e-04 1.849735e-02
22   10 Rookery    0       10 2.275 1.052037e-04 1.319466e-04 5.277863e-03

To add the predicted counts to a lattice bar plot of the observed counts we need to use the subscripts variable in a panel function to select the appropriate predicted counts for each panel. Here's an example where I use the panel.xyplot function to draw the bar chart and then use a panel.points function to superimpose the predicted counts on top of the bars.

#add predicted counts to the bar plot of the observed counts

xyplot(Freq~Var1.num|Var2, data=slugtable, xlab='Count category', panel=function(x, y, subscripts) {

panel.xyplot(x, y, type='h', lineend=1, col='grey', lwd=10)

panel.points(x, slugtable$exp[subscripts], pch=16, cex=.6, col=1, type='o')

})

In a lattice graph the formula expression Freq~Var1.num|Var2 defines the x- and y-variables for the panel function. In the panel function x and y correspond to the values of Var1.num and Freq respectively for the subset of observations whose value of variable Var2 matches the the value for the panel currently being drawn. For all other variables not included in the formula we have to do the subsetting by panel explicitly ourselves. That's what the subscripts variable is for. It acts like a dummy variable that selects the current panel observations. So in the panel.points function slugtable$exp[subscripts] selects the predicted counts for the current panel while x represents the already subsetted values of Var1.num.

There's a barchart panel function, panel.barchart, that can be used instead of panel.xyplot to generate the observed counts. This time I use the factor version of Var1.

#add predicted counts to the bar plot of the observed counts

xyplot(Freq~Var1|Var2, data=slugtable, xlab='Count category', panel=function(x, y,subscripts) {

panel.barchart(x, y, horizontal=F, origin=0, col='grey')

panel.points(x, slugtable$exp[subscripts], pch=16, cex=.6, col=1, type='o')

})

The two versions of the lattice bar plot are shown in Fig. 11.

(a)	(b)
Fig. 11 Bar plots with superimposed predictions from a Poisson model. (a) uses the panel.xyplot function to draw the bars while (b) uses the panel.barchart function to draw the bars.

Cited references

Crawley, Michael J. 2002. Statistical Computing: An Introduction to Data Analysis Using S-Plus. Wiley, New York.

R Code

A compact collection of all the R code displayed in this document appears here.

Course Home Page

Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--October 19, 2012
URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture16.htm