Lecture 15—Monday, October 15, 2012

Topics

    Assessing the fit of a discrete probability model: graphical comparison of observed and expected counts, the Pearson chi-squared goodness of fit test, and a simulation-based alternative (parametric bootstrap)
    The negative binomial distribution: the classical and ecological parameterizations, the gamma function, and the mean-variance relationship

R functions and commands demonstrated

    barplot, chisq.test, glm, legend, nlm, points

R function options

    simulate.p.value and B arguments of chisq.test, lower.tail argument of ppois, link argument of the poisson family in glm, names.arg and ylim arguments of barplot, type argument of points

Poisson model for aphid counts

We continue with the analysis of the aphid count data set. Last time we fit a Poisson distribution to these data using maximum likelihood. Today we examine the fit of the model.

R code from last time

num.stems <- c(6,8,9,6,6,2,5,3,1,4)
# data frame of tabulated data
aphids <- data.frame(aphids=0:9, counts=num.stems)
aphids
   aphids counts
1       0      6
2       1      8
3       2      9
4       3      6
5       4      6
6       5      2
7       6      5
8       7      3
9       8      1
10      9      4
aphid.data <- rep(0:9,num.stems)
aphid.data
 [1] 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
[36] 5 5 6 6 6 6 6 7 7 7 8 9 9 9 9

Examine the distribution of the tabulated data with a bar chart using the barplot function.

names(num.stems) <- 0:9
barplot(num.stems)

Fig. 1  Bar plot of the tabulated data

Construct a function to calculate the log-likelihood of the data for a specific value of the parameter λ.

poisson.LL <- function(lambda) sum(log(dpois(aphid.data, lambda)))
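
An aside: dpois has a log argument that returns the log probabilities directly, which avoids numerical underflow when the individual probabilities are extremely small. An equivalent, slightly safer version of the same function (the name poisson.LL2 is mine) is the following.

poisson.LL2 <- function(lambda) sum(dpois(aphid.data, lambda, log=TRUE))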

To use nlm to obtain the maximum likelihood estimate of λ, we need to change the way we're formulating the problem. nlm is a minimizer, and since maximizing f is equivalent to minimizing –f, we reformulate our objective function so that it returns the negative log-likelihood rather than the log-likelihood.

poisson.negloglik <- function(lambda) -poisson.LL(lambda)

I use λ = 3 as an initial guess in the call to nlm.

out.pois <- nlm(poisson.negloglik,3)
out.pois
$minimum
[1] 124.1764

$estimate
[1] 3.459998

$gradient
[1] 6.571497e-08

$code
[1] 1

$iterations
[1] 4

Fitting the Poisson distribution using the glm function

The Poisson model that we've fit using maximum likelihood is an example of Poisson regression, which in turn is a type of generalized linear model. Generalized linear models can be fit in R using the glm function. We'll discuss generalized linear models in a more formal fashion later. For the moment think of a generalized linear model as an extension of ordinary regression to response variables that have non-normal distributions. Many of the models we've considered can be fit very simply as a generalized linear model using notation that is identical to what we used when fitting ordinary regression models. The only difference is that in a generalized linear model we need to specify a probability distribution and optionally a link function.

A link function is a function g that is applied to the mean μ of the response distribution. The generalized linear model then models g(μ) as a function of the predictors. If g(μ) = μ we call g the identity link. In ordinary regression, when we model the mean as a linear combination of regressors, we are implicitly using an identity link. While an identity link might seem the natural choice in all cases, it is not ideal when the response variable is bounded in any way. For the Poisson distribution a log link is a better choice than an identity link, and it is the default link when family=poisson is specified in glm. We can fit the Poisson distribution with the glm function as follows.

out1 <- glm(aphid.data~1, family=poisson(link=identity))
summary(out1)
Call:
glm(formula = aphid.data ~ 1, family = poisson(link = identity))

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-2.6306  -1.5612  -0.2531   1.1204   2.4753 

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.4600     0.2631   13.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 114.46  on 49  degrees of freedom
Residual deviance: 114.46  on 49  degrees of freedom
AIC: 250.35

Number of Fisher Scoring iterations: 3

The estimate of the intercept is the mean of the Poisson distribution and it's the same value we obtained above using the nlm function. If we fit the model without the link argument and accept the default link we are fitting the model log μ = β0. To obtain the mean of the Poisson distribution we need to exponentiate the result.

out1a <- glm(aphid.data~1, family=poisson)
coef(out1)
(Intercept)
       3.46
coef(out1a)
(Intercept)
   1.241269
exp(coef(out1a))
(Intercept)
       3.46
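
There's also a quick sanity check available here: for the Poisson distribution the maximum likelihood estimate of λ has a closed form, namely the sample mean, so all of these fits can be verified in one line.

mean(aphid.data)
[1] 3.46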

Assessing the fit of the Poisson model graphically

Having estimated the Poisson model we next examine how well the model fits the observed data. A good place to start is by calculating the count frequencies predicted by the model and comparing these to the observed frequencies graphically. This can be accomplished with a bar plot of the observed frequencies on which we superimpose the Poisson predictions as points connected by line segments.

I begin by calculating the expected values under a Poisson model using the estimated value of λ. I generate the Poisson probabilities for category counts 0 to 9 (matching the range of observed values we have). To obtain the expected counts I multiply these numbers by 50, the number of observations. (Note: What I'm really doing is adding up the predicted probability distributions for all 50 observations. Because each observation has the same predicted probability distribution, this is the same as multiplying that single probability distribution by 50.)

exp <- dpois(0:9, out.pois$estimate)*50
max(exp)
[1] 10.84896
max(table(aphid.data))
[1] 9

Notice that the expected values have a larger maximum than the observed values. If we sum the expected counts we find that they don't add up to 50. Equivalently, if we sum the predicted probability distribution, the probabilities don't sum to one.

sum(dpois(0:9, out.pois$estimate))
[1] 0.9969393

That's because we're leaving off the tail probability. The Poisson distribution is unbounded above, so there is a nonzero probability of obtaining an observation with 10 or more counts. We can obtain the tail probability with the ppois function.

# P(X>9) = 1 - P(X <= 9)
1-ppois(9, out.pois$estimate)
[1] 0.003060710

Alternatively we can use the lower.tail argument of ppois to calculate P(X > 9) directly.

ppois(9, out.pois$estimate, lower.tail=F)
[1] 0.00306071

So, we have a choice. We can create an additional category for both the observed and predicted counts, labeling it '10+' perhaps, or we can just lump P(X > 9) into the last category, effectively making it represent '9 or more'.

# Choice 1: modify observed and expected counts so that there are 11 categories
obs.counts <- c(num.stems,0)
exp.counts <- c((dpois(0:9, out.pois$estimate)), 1-ppois(9,out.pois$estimate))*50
rbind(obs.counts, exp.counts)
                  0        1       2        3        4        5        6        7
obs.counts 6.000000 8.000000 9.00000  6.00000 6.000000 2.000000 5.000000 3.000000
exp.counts 1.571491 5.437355 9.40662 10.84896 9.384349 6.493966 3.744852 1.851026
                   8         9         
obs.counts 1.0000000 4.0000000 0.0000000
exp.counts 0.8005683 0.3077739 0.1530355
# Choice 2: add the Poisson tail probability to the last observed category
exp.pois <- c((dpois(0:8, out.pois$estimate)), 1-ppois(8,out.pois$estimate))*50
rbind(num.stems, exp.pois)
                 0        1       2        3        4        5        6        7
num.stems 6.000000 8.000000 9.00000  6.00000 6.000000 2.000000 5.000000 3.000000
exp.pois  1.571491 5.437355 9.40662 10.84896 9.384349 6.493966 3.744852 1.851026
                  8         9
num.stems 1.0000000 4.0000000
exp.pois  0.8005683 0.4608094

Notice that in the second choice I calculate dpois(0:8) and then use 1-ppois(8) for the last observed category. I verify that the expected counts now sum to 50.

sum(exp.pois)
[1] 50

Either of these choices is reasonable. In what follows I use the second option. In order to add the expected counts to the bar chart I need to know the x-coordinates of the centers of the bars. You might think these are just the labels that appear below the bars, but in fact the labels shown need not correspond to the coordinate system R has used. Fortunately the coordinates of the bars are returned by R when you assign the result of the barplot function to an object. To ensure that the expected values (which will be added to the graph with the points function) are displayed completely, I use the ylim argument of barplot to extend the y-axis to include the maximum expected count.

out.bar <- barplot(num.stems, ylim=c(0,11))
out.bar
      [,1]
 [1,]  0.7
 [2,]  1.9
 [3,]  3.1
 [4,]  4.3
 [5,]  5.5
 [6,]  6.7
 [7,]  7.9
 [8,]  9.1
 [9,] 10.3
[10,] 11.5

I use the barplot coordinates as the x argument of the points function. The type='o' option specifies that point symbols should be overlaid on top of line segments connecting the points.

points(out.bar, exp.pois, pch=16, cex=.9, type='o')

We probably should relabel the last category to indicate that it is '9 or more' rather than just 9. The barplot function has an argument names.arg that can be used to modify the labels appearing under the bars. I also add a legend that identifies the model being fit.

#relabel bars
out.bar <- barplot(num.stems, ylim=c(0,11), names.arg=c(0:8,'9+'))
points(out.bar, exp.pois, pch=16, cex=.9, type='o')
legend('topright', 'Poisson model', pch=16, col=1, lty=1, cex=.9, bty='n')

Fig. 2  Fit of the Poisson model

Based on the plot the fit doesn't look very good.

Pearson chi-squared goodness of fit test for discrete data

We can test the fit of the Poisson model formally with the Pearson chi-squared goodness of fit test. The Pearson chi-squared test compares the observed category frequencies to the frequencies predicted by a model (referred to as the expected frequencies) using the following formula.

$$X^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$$

where m is the number of categories, $O_i$ is the observed frequency of category i, and $E_i$ is its expected frequency under the model. It turns out the Pearson chi-squared statistic $X^2$ has an asymptotic chi-squared distribution.

The degrees of freedom are the number of categories, m, minus one minus the number of estimated parameters, p, used in obtaining the expected frequencies. The null hypothesis of this test is that the fit is adequate.

H0: model fits the data
H1: model does not fit the data

We should reject the null hypothesis at level α if the observed value of our test statistic exceeds the 1 – α quantile of a chi-squared distribution with m – 1 – p degrees of freedom.

Reject $H_0$ if $X^2 > \chi^2_{1-\alpha,\,m-1-p}$

The chi-squared distribution of $X^2$ is an asymptotic result. For it to hold the fitted values, the expected frequencies obtained using the model, should be large. A general rule is that the expected cell counts should be 5 or larger, although with many categories Agresti (2002) notes that an expected cell frequency as small as 1 is okay as long as no more than 20% of the expected counts are less than 5. The difficulty in applying this rule when fitting a model such as the Poisson model to data is that the theoretical count distribution is infinite (although the probabilities after a certain point are essentially zero), so the 20% guideline is not well-defined.

For our data quite a few of the expected counts are small. In the code below I calculate the expected frequencies, $E_k = 50 \cdot P(X = k)$, for Poisson counts ranging from 0 to 12. From the output we see that the expected frequencies exceed 5 only for k = 1, 2, 3, 4, and 5.

dpois(0:12, out.pois$estimate)*50
 [1]  1.571490812  5.437355498  9.406620322 10.848963362  9.384348629
 [6]  6.493966014  3.744851868  1.851025857  0.800568284  0.307773876
[11]  0.106489708  0.033495837  0.009657961

What to do in such a situation is not entirely clear. The standard recommendation, cited for example in Sokal and Rohlf (1995), is to pool categories. They write, p. 702, "Whenever classes with expected frequencies of less than five occur, expected and observed frequencies for those classes are generally pooled with an adjacent class to obtain a joint class with an expected frequency ≥ 5." The number of categories m that enters the degrees of freedom of the chi-squared distribution is thereby reduced. Not everyone agrees with this. Here's a sampling from the literature.

  1. Cameron & Trivedi (1998), p. 157. These authors recommend pooling but don't give an absolute criterion: "…it is insightful to compare predicted relative frequencies with actual relative frequencies. These are given in Table 5.6 where counts of five or more are grouped into the one cell to prevent cell sizes from getting too small."
  2. Le (1998), p. 209. This author uses a frequency of 1 as the cutoff: "It is also recommended that adjacent groups at the bottom of the table be combined in order to avoid having any expected frequencies less than 1."
  3. Morgan (2000), p. 17. This author says never pool: "We note here that several of the fitted values are small (say, less than 5), and it is a common practice to pool over categories, to obtain larger expected values. This is somewhat arbitrary and potentially dangerous, and should be avoided if possible."

My comments on this are the following.

  1. Pooling is unavoidable. Because the tail probabilities of discrete count models run on forever, it is always necessary to at least pool the observations in the tail. It is not clear where this pooling should begin without additional guidelines, but a sensible approach is to add the expected tail probabilities to the last category for which the observed counts are nonzero. My rationale for this is the following. Every additional cell beyond the last one with a nonzero observed count adds the value of its expected count to the chi-squared statistic, no matter how small that expected count, while at the same time adding 1 to the degrees of freedom. To see this, consider the generic term of the Pearson chi-squared statistic when the observed count is 0.

$$\frac{(O_i - E_i)^2}{E_i} = \frac{(0 - E_i)^2}{E_i} = E_i$$

If there are many such cells the effective result will be to dilute a significant lack of fit that may be occurring in the nonzero cells. Generally speaking, for a chi-squared random variable to be significant it must at least exceed its degrees of freedom. Each of these tail cells adds one degree of freedom to the test statistic, but if its expected frequency is less than 1 it adds less than that to the statistic itself, so such cells reduce rather than increase the lack of fit as measured by the test. So I think pooling these cells in the tail makes sense. A small numeric illustration appears after this list.

  2. There is no doubt that when the expected cell counts are small the asymptotic chi-squared distribution is a bad approximation to the true distribution of the test statistic.
  3. The reason you carry out the Pearson test is to provide evidence that your model fits the data. If you fail to find a significant lack of fit and you chose to pool (or not), then you need to investigate whether pooling had an effect on the result you obtained. This is the fundamental difference between model falsification and model confirmation. We can falsify models conclusively, but we can't unequivocally confirm them. If the manner in which the pooling was done was arbitrary, then pooling in some other way is a good check on your results. The bottom line is that you need to convince a skeptic that you tried your best to falsify your model but were unable to do so.
  4. Randomization tests (parametric bootstrap) may be an option when there are small expected cell counts. I discuss this approach below.
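
To make the dilution argument in point 1 concrete, here is a small numeric illustration using hypothetical counts (not the aphid data): five informative cells that show a real discrepancy, to which we then append four empty tail cells with tiny expected counts.

# five cells with a real discrepancy
obs <- c(14, 9, 6, 6, 15); e <- c(7, 9.4, 10.8, 9.4, 13.4)
1 - pchisq(sum((obs-e)^2/e), df=length(obs)-1-1)     # about 0.014: significant
# append four zero-count tail cells: each adds only its tiny expected count
# to the statistic but a full degree of freedom to the test
obs2 <- c(obs, 0, 0, 0, 0); e2 <- c(e, .3, .1, .05, .01)
1 - pchisq(sum((obs2-e2)^2/e2), df=length(obs2)-1-1) # about 0.14: diluted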

Testing the fit of the Poisson model analytically

Parametric goodness of fit test

Using 5 as a cut-off we see that we should combine the first two expected counts, corresponding to k = 0 and k = 1. Turning to the right hand tail of the distribution, k = 6 is the first expected count to drop under 5. If we calculate the sum of the expected counts for k > 6 we see that these don't exceed 5 either. So we should combine all category counts for k ≥ 6.

# count X = 6
dpois(6, out.pois$estimate)*50
[1] 3.744852
# counts X > 6
(1-ppois(6, out.pois$estimate))*50
[1] 3.112403

Note: ppois(6, out.pois$estimate) = P(X ≤ 6). Therefore 1 – ppois(6, out.pois$estimate) = P(X > 6). So I add X = 6 to the tail too.

expected.p <- c(sum(dpois(0:1, out.pois$estimate)), dpois(2:5, out.pois$estimate), 1-ppois(5, out.pois$estimate))
expected <- expected.p*50
expected
[1] 7.008846 9.406620 10.848963 9.384349 6.493966 6.857255

I next group the observed counts in exactly the same way: pool the first two, keep the next four separate, and pool the remainder. I use length(num.stems) to locate the position of the last value.

observed <- c(sum(num.stems[1:2]), num.stems[3:6], sum(num.stems[7:length(num.stems)]))
observed
    2  3  4  5   
14  9  6  6  2 13

Finally I carry out the Pearson chi-squared test. It can be calculated by hand or extracted from the output of the chisq.test function. The hand calculations are as follows.

pearson <- sum((observed-expected)^2/expected)
pearson
[1] 18.99147
df <- length(observed)-1-1
df
[1] 4
qchisq(.95,df)
[1] 9.487729
p.val <- 1-pchisq(pearson,df)
p.val
[1] 0.0007889844

The p-value is quite small indicating a significant lack of fit. Alternatively, the critical value for our test is 9.49 and the observed value of our test statistic is 18.99, so the value of the test statistic exceeds the critical value. Therefore we reject the null hypothesis that the fit is adequate.

We can also use the chisq.test function to perform the Pearson chi-squared goodness of fit test. Its help screen is shown below (Fig. 3).

?chisq.test

Fig. 3   Help screen for chisq.test

The entries relevant to us are x and p. The x= argument should contain the observed counts. The p= argument should contain the expected probabilities (not the expected counts). I try it for our grouped data.

chisq.test(observed, p=expected.p)
Chi-squared test for given probabilities

data: observed
X-squared = 18.9915, df = 5, p-value = 0.001929

Observe that while the value of the test statistic is correct, the p-value is wrong as are the degrees of freedom. That's because the function doesn't know we estimated a parameter in order to obtain the expected probabilities. We can correct this by using chisq.test to obtain the Pearson statistic, but then calculate the p-value ourselves. I rerun the chi-squared test and save the result as an object in R.

out.chisq <- chisq.test(observed, p=expected.p)
names(out.chisq)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"
[7] "expected"  "residuals"
out.chisq$statistic
X-squared
 18.99147
out.chisq$parameter
df
 5

The $statistic component contains the Pearson statistic. The $parameter component contains the number of categories minus one, m – 1. Because we estimated one parameter the degrees of freedom should be decreased by one more.

1-pchisq(out.chisq$statistic, df=out.chisq$parameter-1)
   X-squared
0.0007889844

This matches the hand-calculation carried out above. So, we reject the null hypothesis and conclude that there is a significant lack of fit. If we supply the original unpooled data and expected probabilities to chisq.test, it warns us that the assumptions of the test are being violated.

poisson.p <- c(dpois(0:8, out.pois$estimate), 1-ppois(8,out.pois$estimate))
chisq.test(num.stems, p=poisson.p)
            Chi-squared test for given probabilities

data:  num.stems
X-squared = 48.5686, df = 9, p-value = 1.999e-07

Warning message:
In chisq.test(num.stems, p = poisson.p) :
  Chi-squared approximation may be incorrect

Simulation-based goodness of fit test (parametric bootstrap)

The chisq.test function has an additional argument called simulate.p.value. Setting this argument to TRUE causes R to carry out a Monte Carlo simulation (parametric bootstrap) to obtain the p-value of the observed value of the test statistic $X^2$. This works as follows (a do-it-yourself version in R appears after the list).

  1. The expected probability vector p is used to randomly generate new data. In the current problem p has ten components. Label the ten components of p as p0, p1, … , p9. At each step of the simulation we generate a new observation. The new observation has probability p0 of being a 0, probability p1 of being a 1, etc. Thus we are essentially generating a random observation from a specified multinomial distribution. This step can be easily programmed using a uniform random number generator.
  2. Repeat this as many times as there are observations. That means with the current data set the procedure should be carried out 50 times, each time obtaining one of the values 0 to 9. These 50 simulated observations become the new set of raw counts.
  3. Next calculate $X^2$ using these simulated data as if they were the observed values. The expected values are obtained, as usual, by multiplying the components of p by 50. Denote this chi-squared statistic $X^2_{\text{sim}}$.
  4. Step 3 is carried out a total of B times (by default B = 2000). Each time a new set of 50 simulated observations is generated and the corresponding statistic is calculated. So in the end there will be B = 2000 values of $X^2_{\text{sim}}$.
  5. Calculate $X^2$ using the actual observations. Denote this value by $X^2_{\text{obs}}$ and add it to the vector of $X^2_{\text{sim}}$ values.
  6. Finally count up the number of values that are greater than or equal to $X^2_{\text{obs}}$ (there will always be at least one such value because $X^2_{\text{obs}}$ is included among the values in step 5), and divide the result by B + 1. This is the simulation-based p-value.
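
The whole procedure is easy to carry out by hand. Here is a minimal sketch (the variable names are mine): sample draws the multinomial data, and table with factor tabulates them so that categories with zero counts are retained. It reuses the expected counts exp.pois and the probability vector poisson.p computed earlier.

# observed value of the test statistic
X2.obs <- sum((num.stems - exp.pois)^2/exp.pois)   # 48.5686, as before
# B simulated data sets, each of size 50, tabulated into the 10 categories
B <- 2000
X2.sim <- replicate(B, {
   new.data <- sample(0:9, size=50, replace=TRUE, prob=poisson.p)
   new.obs <- table(factor(new.data, levels=0:9))
   sum((new.obs - exp.pois)^2/exp.pois)
})
# simulation-based p-value: the observed statistic counts as one simulation
sum(c(X2.sim, X2.obs) >= X2.obs)/(B + 1)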

The beauty of this approach is that it avoids the whole issue of whether the asymptotic chi-squared distribution is appropriate or not, because it doesn't use it! The set of simulated values defines the distribution of the statistic under the null hypothesis that the model fits the data. The issue of pooling doesn't arise, because it's not necessary. Having said that, it is still necessary to combine the tail probabilities of the probability model in order to create the vector p used in the simulation. In addition there is no penalty for overfitting like there is in the parametric Pearson test, where the degrees of freedom of the test statistic get reduced by p, the number of estimated parameters.

I redo the goodness of fit test, this time setting simulate.p.value=TRUE. I calculate the predicted probabilities using the Poisson model, adding the tail probability to the last observed category, X = 9.

poisson.p <- c(dpois(0:8, out.pois$estimate), 1-ppois(8, out.pois$estimate))
chisq.test(num.stems, p=poisson.p, simulate.p.value=TRUE)

            Chi-squared test for given probabilities with simulated p-value
            (based on 2000 replicates)

data:  num.stems
X-squared = 48.5686, df = NA, p-value = 0.0004998

Observe that the reported p-value is equal to 1/2001, meaning that the observed value of the test statistic was larger than any of the simulated values.

1/2001
[1] 0.0004997501

With 2000 simulations our p-value is accurate to 2 or 3 decimal places. Thus the best we can say is that the true p-value is probably less than 0.001. To get a more accurate result we can increase the number of simulations by specifying our own value for the argument B. I try 9,999 simulations. I choose 9,999 so that when we include the observed value of our test statistic as one of the simulations we get the nice round number 10,000.

chisq.test(num.stems, p=poisson.p, simulate.p.value=TRUE, B=9999)

            Chi-squared test for given probabilities with simulated p-value
            (based on 9999 replicates)

data:  num.stems
X-squared = 48.5686, df = NA, p-value = 3e-04

The reported p-value is equal to 3/10000 = 0.0003, telling us that there were two additional simulated values of $X^2_{\text{sim}}$ as extreme as $X^2_{\text{obs}}$. We now have roughly 3-decimal place accuracy for our p-value and can report p < .001.

Negative Binomial Distribution

It is clear both graphically and analytically that the Poisson model does not provide an adequate fit to the aphid count data. One obvious explanation is that our model is too simple—we're assuming that a single value of λ is appropriate for all 50 observations. So a next logical step would be to allow λ to vary depending upon the value of measured predictors. Often, though, even after including significant predictors, a Poisson regression model may still exhibit a significant lack of fit. In that case a possible solution is to switch from a Poisson distribution to a negative binomial distribution as a model for the response.

Some background on the binomial distribution

Before presenting the definition of the negative binomial distribution we need to briefly discuss the binomial distribution to which it is related. (We will explore the binomial distribution in greater detail later in the course.) In order for an experiment to be considered a binomial experiment four basic assumptions must hold.

  1. The experiment must consist of a sequence of trials in which each trial is a Bernoulli trial, meaning only one of two outcomes with probabilities p and 1 – p can occur.
  2. The number of trials is fixed ahead of time at n, a value that is known to us.
  3. The probability p is the same on each Bernoulli trial.
  4. The Bernoulli trials are independent. Recall that for independent events A and B, P(A ∩ B) = P(A)·P(B).

A simple illustration of a binomial random variable is the response of a seed germination experiment. Suppose an experiment is carried out in which 100 seeds are planted in a pot and the number of seeds that germinate is recorded. Suppose this is done repeatedly for different pots that are subjected to various light regimes, burial depths, etc. Clearly the first two assumptions of the binomial model hold here.

  1. The outcome on individual trials (the fate of an individual seed in the pot) is dichotomous (germinated or not).
  2. The number of trials (number of seeds per pot) was fixed ahead of time at 100.

Derivation of the formula of the binomial probability mass function

Suppose we have five independent Bernoulli trials with the same probability p of success on each trial. Suppose we observe the event

S F S S F

i.e., three successes (S) and two failures (F) in the order shown. Then by independence this event has probability

$$P(SFSSF) = p\,(1-p)\,p\,p\,(1-p) = p^3(1-p)^2$$

But in a binomial experiment we don’t observe the actual sequence of outcomes, just the number of successes, in this case 3. There are many other ways to get 3 successes, just rearrange the order of S and F in the sequence SFSSF. So, the probability we have calculated here is too small. How many other distinct arrangements (permutations) of three Ss and two Fs are there?

Each distinct arrangement corresponds to a choice of which 3 of the 5 trial positions hold the Ss; the remaining 2 positions must then hold the Fs. Thus, to answer the original question, the number of distinct arrangements of three Ss and two Fs is

$$\frac{5!}{3!\,2!} = \binom{5}{3} = {}_{5}C_{3}$$

where the last two symbols are two common notations for this quantity. Carrying out the arithmetic of this calculation we find that there are ten distinct arrangements of three Ss and two Fs. The first notation, $\binom{5}{3}$, is called a binomial coefficient and is read "5 choose 3". The C in the second notation denotes "combination", and thus ${}_{5}C_{3}$ is the number of combinations of five things taken three at a time. Putting this all together, if $X \sim \text{Bin}(5, p)$ then

$$P(X = 3) = \binom{5}{3} p^3 (1-p)^2 = 10\,p^3(1-p)^2.$$

For a generic binomial random variable, $X \sim \text{Bin}(n, p)$, in which the total number of trials is denoted by n, we have

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n.$$
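
As a quick sanity check (the value of p here is an arbitrary choice), the formula agrees with R's built-in binomial probability function.

p <- 0.4
choose(5, 3) * p^3 * (1-p)^2
[1] 0.2304
dbinom(3, size=5, prob=p)
[1] 0.2304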

Characteristics of the negative binomial distribution

The negative binomial distribution is a popular alternative to the Poisson distribution as a model for count data (White and Bennetts, 1996; Ver Hoef and Boveng, 1997; O'Hara and Kotze, 2010; Hilbe, 2011; Lindén and Mäntyniemi, 2011). I list some of the characteristics of the negative binomial distribution below.

  1. Like the binomial distribution, it arises from a sequence of independent Bernoulli trials with the same probability p of success on each trial.
  2. Unlike the binomial distribution, the number of trials is not fixed in advance. Instead, the trials continue until a predetermined number of successes, r, has been obtained.
  3. The negative binomial random variable is the number of failures that occur before the rth success is observed.
  4. In its ecological parameterization (derived below) it serves as a model for count data whose variance exceeds the mean, i.e., counts that are overdispersed relative to the Poisson distribution.

Derivation of the formula of the negative binomial probability mass function

The probability mass function of the negative binomial distribution comes in two distinct versions. The first one is the one that appears in every introductory probability textbook; the second is the one that appears in books and articles in ecology. Although the ecological definition is just a reparameterization of the mathematical definition, the reparameterization has a profound impact on the way the negative binomial distribution gets used. We'll begin with the mathematical definition. From an ecological standpoint the mathematical definition is rather bizarre and except for perhaps modeling the number of rejections one has to suffer before getting a manuscript submission accepted for publication, it's hard to see how this distribution is useful.

Let Xr be a negative binomial random variable with parameters r and p. Using the definition given above let's calculate $P(X_r = x)$, the probability of experiencing x failures before r successes are observed. Note: The change in notation from k to x is deliberate. Unfortunately in a number of ecological textbooks the symbol k means something very specific for the negative binomial distribution, so I don't want to use it in a generic sense here.

If we experience x failures and r successes, then it must be the case that we had a total of x + r Bernoulli trials. Furthermore, we know that the last Bernoulli trial resulted in a success, the rth success, because that's when the experiment stops.

$$\underbrace{\;x \text{ failures and } r-1 \text{ successes in some order}\;}_{\text{first } x+r-1 \text{ trials}} \;\; \underbrace{S}_{\text{trial } x+r}$$

What we don't know is where in the first x + r – 1 Bernoulli trials the x failures and r – 1 successes occurred. Since the probability of a success is a constant p on each of these trials, we have a binomial experiment in which the number of trials is x + r – 1. Thus we have the following.

$$P(X_r = x) = \binom{x+r-1}{x}\, p^{\,r-1}(1-p)^{x} \cdot p = \binom{x+r-1}{x}\, p^{r}(1-p)^{x}, \quad x = 0, 1, 2, \ldots$$

So we're done. Note: it's a nontrivial exercise to show that this is a true probability distribution, i.e.,

$$\sum_{x=0}^{\infty} \binom{x+r-1}{x}\, p^{r}(1-p)^{x} = 1$$

Mean and variance

I'll just state the results.

$$E(X_r) = \frac{r(1-p)}{p}, \qquad Var(X_r) = \frac{r(1-p)}{p^2}$$
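
These formulas are easy to verify numerically because R's dnbinom and rnbinom functions use exactly this failures-before-the-rth-success definition, with size = r and prob = p. A quick check with the arbitrary choices r = 3 and p = 0.4:

r <- 3; p <- 0.4; x <- 0:5
# the pmf derived above versus R's built-in version: identical
choose(x+r-1, x) * p^r * (1-p)^x
dnbinom(x, size=r, prob=p)
# theoretical mean and variance: 4.5 and 11.25
r*(1-p)/p; r*(1-p)/p^2
# simulation agrees approximately
sim <- rnbinom(100000, size=r, prob=p)
mean(sim); var(sim)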

Ecological parameterization of the negative binomial distribution

The ecological definition of the negative binomial is essentially a reparameterization of the formula for the probability mass function that we just derived.

Step 1: The first step in the reparameterization is to express p in terms of the mean μ and use this expression to replace p. Using the formula for the mean of the negative binomial distribution above, I solve for p.

$$\mu = E(X) = \frac{r(1-p)}{p} \;\;\Rightarrow\;\; \mu p = r - rp \;\;\Rightarrow\;\; p = \frac{r}{r+\mu}$$

From which it immediately follows that

$$1 - p = \frac{\mu}{r+\mu}$$

Plugging these two expressions into the formula for the probability mass function yields the following.

$$P(X = x) = \binom{x+r-1}{x}\left(\frac{r}{r+\mu}\right)^{r}\left(\frac{\mu}{r+\mu}\right)^{x}$$

Step 2: This step is purely cosmetic. Replace the symbol r. There is no universal convention as to what symbol should be used as the replacement. Venables and Ripley (2002) use θ. Krebs (1999) uses k. SAS uses α (actually the reciprocal of this quantity; see the note on the SAS parameterization at the end of this section). I will use the symbol θ.

$$P(X = x) = \binom{x+\theta-1}{x}\left(\frac{\theta}{\theta+\mu}\right)^{\theta}\left(\frac{\mu}{\theta+\mu}\right)^{x}$$

Step 3: Write the binomial coefficient using factorials.

$$P(X = x) = \frac{(x+\theta-1)!}{x!\,(\theta-1)!}\left(\frac{\theta}{\theta+\mu}\right)^{\theta}\left(\frac{\mu}{\theta+\mu}\right)^{x}$$

Step 4: Rewrite the factorials using gamma functions. This step requires a little bit of explanation.

Gamma Function

The gamma function is defined as follows.

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx, \qquad \alpha > 0$$

Although the integrand contains two variables, x and α, x is the variable of integration and will disappear once the integral is evaluated. So the gamma function is solely a function of α. The integral defining the gamma function is called an improper integral because infinity appears as an endpoint of integration. Formally such an improper integral is defined as a limit.

Now if α > 1, but still an integer, the integrand of the gamma function will be a polynomial times an exponential function. The standard approach for integrating such integrands is to use integration by parts. Integration by parts is essentially a reduction of order technique—after a finite number of steps the degree of the polynomial is reduced to 0 and the integral that remains to be computed turns out to be Γ(1) (multiplied by a number of constants). It is a simple exercise in calculus to show that Γ(1) = 1.

After the first round of integration by parts is applied to the gamma function we obtain the following.

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx = \Big[-x^{\alpha-1}e^{-x}\Big]_0^{\infty} + (\alpha-1)\int_0^{\infty} x^{\alpha-2} e^{-x}\,dx = (\alpha-1)\,\Gamma(\alpha-1)$$

where in the last step I recognize that the remaining integral is just the gamma function in which α has been replaced by α – 1. This is an example of a recurrence relation; it allows us to calculate one term in a sequence using the value of a previous term. We can use this recurrence relation to build up a catalog of values for the gamma function.

$$\Gamma(1) = 1, \quad \Gamma(2) = 1\cdot\Gamma(1) = 1!, \quad \Gamma(3) = 2\cdot\Gamma(2) = 2!, \quad \Gamma(4) = 3\cdot\Gamma(3) = 3!, \quad \ldots, \quad \Gamma(n) = (n-1)!$$

So when α is a positive integer, the gamma function is just the factorial function. But Γ(α) is defined for all positive α. For example, it can be shown that

$$\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}$$

and then using our recurrence relation we can evaluate others, such as

$$\Gamma\!\left(\tfrac{3}{2}\right) = \tfrac{1}{2}\,\Gamma\!\left(\tfrac{1}{2}\right) = \frac{\sqrt{\pi}}{2}$$
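
These facts are easy to check numerically with R's gamma function.

gamma(5); factorial(4)        # both 24, since gamma(5) = 4!
gamma(0.5)^2                  # pi, since gamma(1/2) = sqrt(pi)
gamma(1.5); 0.5*gamma(0.5)    # both 0.8862269: the recurrence at work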

Step 4 (continued): So using the gamma function we can rewrite the negative binomial probability mass function as follows.

$$P(X = x) = \frac{\Gamma(x+\theta)}{x!\;\Gamma(\theta)}\left(\frac{\theta}{\theta+\mu}\right)^{\theta}\left(\frac{\mu}{\theta+\mu}\right)^{x}$$

where I've chosen to leave x! alone just to remind us that x is the value whose probability we are computing.

So what's been accomplished in all this? It would seem not very much, but that's not true. The formula we're left with bears little resemblance to the one with which we started. In particular, all reference to r, the number of successes, has been lost, having been replaced by the symbol θ. Having come this far, ecologists then take the next logical step. Since the gamma function does not require integer arguments, why not let θ be any positive number? And so θ is treated solely as a fitting parameter, its original meaning having been lost (but see below).

So, what we're left with is a pure, two-parameter distribution, i.e., $X \sim \text{NB}(\mu, \theta)$, where the only restriction on μ and θ is that they are positive. With this last change, the original interpretation of the negative binomial distribution has more or less been lost and it is perhaps best to think of the negative binomial as a probability distribution that can be flexibly used to model discrete data.
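
Conveniently, R's dnbinom function supports this ecological parameterization directly through its mu argument, with size playing the role of θ. A quick check that it matches the gamma-function formula above (the values μ = 3.46 and θ = 2 are arbitrary choices):

mu <- 3.46; theta <- 2; x <- 0:4
dnbinom(x, size=theta, mu=mu)
# the same probabilities from the gamma-function formula
gamma(x+theta)/(factorial(x)*gamma(theta)) *
    (theta/(theta+mu))^theta * (mu/(theta+mu))^x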

The variance of the negative binomial distribution in terms of μ and θ

It turns out that a Poisson random variable can be viewed as a special limiting case of a negative binomial random variable in which the parameter θ is allowed to become infinite. Given that there are infinitely many other choices for θ the negative binomial distribution is clearly more adaptable than the Poisson distribution for fitting data. In a sense θ is a measure of deviation from a Poisson distribution. For that reason θ is sometimes called the inverse index of aggregation (Krebs 1999)—inverse because small values of θ correspond to more clumping than is typically seen in the Poisson distribution. It is also called the size parameter (documentation for R), but most commonly of all, it is called the dispersion parameter (or overdispersion parameter).

The relationship between the negative binomial distribution and the Poisson can also be described in terms of the variances of the two distributions. To see this I express the variance of the negative binomial distribution using the parameters of the ecologist's parameterization.

$$Var(X) = \mu + \frac{\mu^2}{\theta}$$

Observe that the variance is quadratic in the mean. Since $\mu + \frac{\mu^2}{\theta} = \frac{1}{\theta}\,\mu(\mu+\theta)$, this represents a parabola opening up that crosses the μ-axis at the origin and at the point μ = –θ. θ controls how fast the parabola climbs.

Fig. 4  Negative binomial mean-variance relationship as a function of θ

As $\theta \to \infty$, $\mu + \frac{\mu^2}{\theta} \to \mu$, and we have the variance of a Poisson random variable. For large θ the parabola is very flat, while for small θ the parabola is narrow. Thus θ can be used to describe a whole range of heteroscedastic behavior. Note: In the parameterization of the negative binomial distribution used by SAS, $\alpha = \frac{1}{\theta}$, so that $Var(X) = \mu + \alpha\mu^2$. Thus the Poisson distribution corresponds to α = 0 and values of α > 0 correspond to overdispersion. This is perhaps the more natural parameterization.
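
A plot along the lines of Fig. 4 can be reproduced with the curve function; a minimal sketch (the values of θ shown are arbitrary choices):

# negative binomial variance as a function of the mean, for several thetas
curve(x + x^2/0.5, from=0, to=10, lty=2, xlab=expression(mu), ylab='Variance')
curve(x + x^2/2, add=TRUE)
curve(x + x^2/10, add=TRUE, lty=3)
abline(0, 1, col='grey')    # the Poisson limit: variance = mean
legend('topleft', expression(theta==0.5, theta==2, theta==10, Poisson),
    lty=c(2,1,3,1), col=c('black','black','black','grey'), bty='n')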



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--October 16, 2012
URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture15.htm