Lecture 19—Monday, October 29, 2012

Topics

Model selection criteria: log-likelihood, AIC, AICc, and BIC
The theory behind AIC: Kullback-Leibler information
Caveats in using AIC for model selection
Fitting species-area models to the Galapagos Islands plant richness data

R functions and commands demonstrated

AIC, anova, cut, curve, glm, glm.nb, gamlss, logLik, lowess, LR.test, quantile, tapply

R function options

include.lowest (cut function), sigma.formula and family (gamlss function), what (coef method for gamlss models)

R packages used

gamlss, MASS

Model selection criteria

Choosing a best model from a collection of candidates is a nontrivial task for which statisticians can only offer rough guidelines. The first step is to choose a suitable collection of candidate models. This is fundamentally a biological problem rather than a statistical one (although statistics can offer guidelines in, for example, the choice of probability generating mechanisms). In what follows I will assume a manageable set of models based on clear biological principles has already been assembled.

Having chosen the field of candidates the next problem is choosing a criterion on which to evaluate them. Methods based on significance testing have limited utility here.

We can compare non-nested models directly using log-likelihood as long as all the models in question estimate the same number of parameters, use the same response variable, and use exactly the same set of observations in fitting the model. The rule is simple. Models with larger log-likelihoods are better models, although there is no real way to quantify how much better. When two models use the same response variable but in one model the response variable is transformed (log, square root, or the like) and in the other it is not, then these models cannot be compared directly using the log-likelihoods of the fitted models. Fortunately it is possible to reformulate the log-likelihood of the transformed model and use the reformulated version for model comparisons. The details for doing this are discussed in lecture 20.

Information-theoretic approaches offer an alternative to significance testing and provide a way to extend log-likelihood comparisons to models with different numbers of parameters. There are a number of information-based criteria. I will mention three: AIC, AICc, and BIC. In what follows I use the following notation.

logL = the log-likelihood of the model evaluated at the maximum likelihood estimates of its parameters
K = the number of parameters estimated in fitting the model
n = the number of observations used in fitting the model

Using this notation the three information-theoretic criteria I mentioned are defined as follows.

AIC = –2·logL + 2K
AICc = –2·logL + 2K + 2K(K + 1)/(n – K – 1)
BIC = –2·logL + K·log(n)
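As a quick illustration, here is a minimal sketch of these formulas in R, computing the three criteria by hand for an arbitrary fitted model (the built-in cars data set is used purely for illustration) and checking the results against R's built-in AIC and BIC functions.

m <- lm(dist ~ speed, data=cars)
LL <- as.numeric(logLik(m))    # maximized log-likelihood
K <- attr(logLik(m), 'df')     # number of estimated parameters (here 3: two coefficients plus the variance)
n <- nobs(m)                   # number of observations
c(AIC = -2*LL + 2*K, AICc = -2*LL + 2*K + 2*K*(K+1)/(n-K-1), BIC = -2*LL + K*log(n))
AIC(m)                         # built-in versions for comparison
BIC(m)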

Better models are those that yield smaller values of these statistics. Observe that all of these statistics involve the log-likelihood. Since the likelihood is the probability of obtaining the data under the given model, it makes sense to choose a model that makes this probability as large as possible. Taking the log doesn't change the location of the maximum, but the minus sign in front flips things around: instead of maximizing you now need to minimize. The model with the smallest value of one of these statistics is best.

All three statistics have the same first term, –2·logL, but they differ in the remaining terms. Notice that in each case the terms being added to –2·logL are positive. Thus these terms make the value of the corresponding statistic bigger, which in turn makes it harder for that model to be chosen best. The added terms are called penalty terms and it's often stated that the reason they are included is to prevent overfitting. Thus you'll often see it stated that AIC and BIC are more or less equivalent statistics differing only in that they use different "penalty" terms, 2K for AIC versus K·log(n) for BIC. Because the BIC penalty term grows in magnitude more quickly than the AIC penalty term as K increases (whenever n ≥ 8, since then log(n) > 2), BIC favors smaller models than does AIC.

While it's useful to think of model building as a balance between improved fit (increased log-likelihood) and parsimony (fewer parameters), treating the additional term in the formula for AIC as just a penalty term is really incorrect and reflects a complete lack of understanding of how AIC is derived. As we'll see the 2K term is an essential part of the definition of AIC and is more than just a penalty term.

The theory behind AIC

Among the information-theoretic criteria AIC is the most popular and certainly the most used among biologists. AIC has had two primary champions, David Anderson and Ken Burnham, both of Colorado State University. Anderson is a wildlife biologist and Burnham is a statistician. These men have been promoting AIC as part of a comprehensive program of model selection in the biological sciences. They have done so through

  1. a series of articles (a partial list can be found at the end of this document),
  2. a book, Model Selection and Multimodel Inference, now in its second edition, currently (as of October 2012) with over 11,000 citations in Science Citation Index, mostly in ecology journals, and
  3. workshops and courses.

Anderson and Burnham have been fairly successful in their mission. As an example, the Journal of Wildlife Management now actively promotes the use of AIC instead of significance testing for model selection in the manuscripts it solicits. Most of the editors of the journal are well-versed in information-theoretic approaches to model selection and actively promote their use.

The primary attraction of AIC over other methods of model selection is that it is not an ad hoc method but instead has a strong theoretical basis. Although the 2K in the AIC formula does serve as a penalty term by penalizing bigger models when the additional parameters don't sufficiently increase the log-likelihood, it is actually much more than a mere penalty term. In what follows I outline a bit of the theory that lies behind AIC. Additional details can be found in Burnham and Anderson (2002).

Kullback-Leibler information and AIC

The Kullback-Leibler (K-L) information between models f and g is defined for continuous distributions as follows.

I(f, g) = ∫ f(x) · log( f(x) / g(x|θ) ) dx

Here f and g are probability distributions. The verbal description of I(f, g) is that it represents the distance from model g to model f. Alternatively it is the information lost when using g to approximate f. Typically, g is taken to be an approximating model while f is taken to be the true state of nature. The quantity θ denotes the various parameters used in the specification of g. As a measure of distance, I(f, g) is a strange beast. Because of the asymmetry in the way f and g are treated in the integral, I(f, g) ≠ I(g, f).

The form the K-L information takes for discrete probability models may be a bit more enlightening. Let the true state of nature be

f:  p_i = Pr(X = x_i),  i = 1, 2, … , m

and the approximating model be

g:  π_i = Pr(X = x_i | θ),  i = 1, 2, … , m

The K-L information between models f and g is defined for discrete distributions to be

I(f, g) = Σ p_i · log(p_i / π_i), where the sum runs over i = 1, … , m.

Because the log of a quotient is the difference of logs, the K-L information is also given by

I(f, g) = Σ p_i · log(p_i) – Σ p_i · log(π_i)

You may recognize that the first term in this difference, Σ p_i·log(p_i), has the same form (apart from sign) as H = –Σ p_i·log(p_i), the Shannon-Wiener diversity index, another information-based measure.
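To make this concrete, here is a small numerical illustration of my own, with an arbitrary choice of distributions: the K-L information between a binomial "truth" f and a Poisson approximation g renormalized to the same support.

# true state of nature: Binomial(5, 0.3); approximating model: Poisson(1.5)
p <- dbinom(0:5, size=5, prob=0.3)
q <- dpois(0:5, lambda=1.5)
q <- q/sum(q)        # renormalize over the common support 0, 1, ..., 5
sum(p*log(p/q))      # I(f, g): information lost using g to approximate f
sum(q*log(q/p))      # I(g, f) is different: K-L information is asymmetric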

The logarithm function plays a key role in any formulation of information. This is the case because while independent bits of information multiply on a probability scale, they need to add on an information scale. Recall that for independent events A and B, the probability of observing both is the product of the individual probabilities.

Pr(A and B) = Pr(A) · Pr(B)

But if the events are independent then the information accrued in observing both should be additive.

Because the logarithm turns products into sums, log(ab) = log(a) + log(b), it is the natural candidate for transforming probabilities into information.

The true state of nature drops out as a constant

Using properties of logarithms the integral form for I(f, g) can be written as a difference of integrals.

I(f, g) = ∫ f(x)·log(f(x)) dx – ∫ f(x)·log(g(x|θ)) dx = E_f[log(f(x))] – E_f[log(g(x|θ))]

where in the last step I use the fact that the form of each integral is that of an expectation. Now suppose we have a second approximating model h for the true state of nature f. The information lost in using h to approximate f is given by the following.

I(f, h) = E_f[log(f(x))] – E_f[log(h(x|θ))]

Observe that expected truth, E_f[log(f(x))], is a common term in each expression. If we want to compare model g to model h it makes sense to consider the difference I(f, g) – I(f, h). If we do so then we find

I(f, g) – I(f, h) = E_f[log(h(x|θ))] – E_f[log(g(x|θ))]

Observe that expected truth has canceled out. Thus if our goal is to compare models, truth drops out! The two terms that remain in the difference are exactly of the same form and we are just comparing their relative magnitudes when we calculate their difference. Thus for a generic model g all we need to estimate is the following.

E_f[log(g(x|θ))]

This last expression is called relative Kullback-Leibler information. Its absolute magnitude has no meaning. It is only useful for measuring how far apart two approximating models are. Relative K-L information is measured on an interval scale, a scale without an absolute zero. The true absolute zero corresponds to truth and is no longer part of the expression. This interval-scale property will later carry over to AIC.

Akaike's contribution

Hirotugu Akaike (Ah-KAH-ee-key), who died in 2009, observed in a classic paper in 1973 that there is one further problem in trying to estimate relative K-L information. In our approximating model we typically won't know the exact value of θ. Instead we will have to use an estimate θ̂. Since this estimate is likely to be in error, another layer of uncertainty is added to the mix. So Akaike suggested that we should calculate the average value of relative Kullback-Leibler information over all possible values of θ̂. In terms of expectation we might call this quantity expected relative K-L information and write it (suppressing the reference to f) as

E_θ̂ [ E_x [ log( g(x|θ̂) ) ] ]

The notation makes this expression look fairly intimidating but in fact Akaike found an unbiased estimator of it. The estimator is

logL(θ̂ | data) – K

Here logL(θ̂ | data) is the log-likelihood function for model g evaluated at the maximum likelihood estimate of the parameter set θ, and K is the number of parameters that are estimated in maximizing the likelihood. Strictly for historical reasons, Akaike chose to multiply this quantity by –2. The resulting quantity has come to be known as Akaike's information criterion or AIC.

AIC = –2( logL(θ̂ | data) – K ) = –2·logL(θ̂ | data) + 2K

Models with smaller values of AIC are better models than models with larger values of AIC. Because AIC estimates expected relative K-L information and, as explained above, the magnitude of relative K-L information is not interpretable, the magnitude of AIC also has no meaning. Fig. 1 illustrates this point. AIC is measured relative to numerical zero rather than to the absolute zero, which in the figure corresponds to the true state of nature. Because its reference point is arbitrary, the absolute magnitude of AIC is not interpretable.

Fig. 1  While the magnitude of K-L information is interpretable, the magnitude of AIC tells us nothing about the absolute quality of a model. AIC only provides relative information about models. (R code)

AIC is typically positive, although it can be negative. If the likelihood is constructed from a discrete probability model, the individual terms in the log-likelihood will be log probabilities. These are negative because probabilities are numbers between 0 and 1. AIC multiplies these terms by –2 and hence AIC will be positive for discrete probability models. In continuous probability models the terms of the log-likelihood are log densities. Because a probability density can exceed one AIC can be negative or positive.
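A quick way to see this (my own toy example): simulate a continuous response with a small standard deviation so that the densities, and hence the log-likelihood, are large and positive.

set.seed(1)
y <- rnorm(50, mean=0, sd=0.1)    # densities here are well above 1
AIC(lm(y ~ 1))                    # AIC comes out negative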

AIC is only an approximate estimate of expected relative K-L information, an approximation that is good only when the number of observations is much greater than the number of estimated parameters. When this is not the case a corrected version of AIC, called AICc, is needed.

AICc = AIC + 2K(K + 1)/(n – K – 1)

Here n is the number of observations. The basic recommendation is to use AICc whenever n/K < 40. Although the expression for AICc was derived for normal models, it is routinely applied to other distributions as well.
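R has no built-in AICc function for ordinary models, so a small convenience function is useful. This is my own sketch; it works for any fitted model that supports the logLik and nobs extractor functions.

AICc <- function(model) {
  LL <- logLik(model)    # log-likelihood with a df attribute
  K <- attr(LL, 'df')    # number of estimated parameters
  n <- nobs(model)       # number of observations
  -2*as.numeric(LL) + 2*K + 2*K*(K + 1)/(n - K - 1)
}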

Caveats with using AIC for model selection

Caveat 1. AIC can only be used to compare models that are fit using exactly the same set of observations. This fact can cause problems when fitting regression models to different sets of variables if there are missing values in the data. Consider the following data set consisting of five observations and three variables where one of the variables has a missing value (NA) for observation 3.

Observation    y    x1    x2
1              *    *     *
2              *    *     *
3              *    *     NA
4              *    *     *
5              *    *     *

(The asterisks stand for observed values, whose particular magnitudes are immaterial here.)

Most regression packages carry out case-wise deletion so that observations with missing values in any of the variables that are used in the model are deleted automatically. The lm and glm functions of R do this without telling you. Thus in fitting the following three models to these data, the number of observations used will vary.

Model    Formula         n
1        y ~ x1          5
2        y ~ x2          4
3        y ~ x1 + x2     4

Models 2 and 3, being fit to the same four observations, are comparable using AIC, but neither can be compared to Model 1, whose AIC is based on one additional observation. The solution is to delete observation 3 and fit all three models using only the four complete observations 1, 2, 4, and 5, as in the sketch below. If Model 1 ranks best among these three models it can then be refit with observation 3 included.
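A minimal sketch in R, assuming the five observations above live in a data frame called dat (a hypothetical name) with columns y, x1, and x2:

# keep only the rows with no missing values in any candidate variable
dat.cc <- dat[complete.cases(dat[, c('y', 'x1', 'x2')]), ]
m1 <- lm(y ~ x1, data=dat.cc)
m2 <- lm(y ~ x2, data=dat.cc)
m3 <- lm(y ~ x1 + x2, data=dat.cc)
AIC(m1, m2, m3)    # all three models now use the same four observations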

Caveat 2. The response variable must be the same in each model being compared. Monotonic transformations of the response alter the probability densities at individual points even though the probabilities over the corresponding intervals remain the same. The likelihood, however, is built from the densities at individual points. This means that the log-likelihoods and AICs of models with a transformed response, e.g. log y, cannot be directly compared to the log-likelihoods and AICs of models fit to the raw response y. Fortunately there are ways around this problem, as we'll see.
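A quick demonstration of the problem using the built-in cars data set (my example, not part of the original notes): the two AIC values below are not comparable as reported because the responses are on different scales.

AIC(lm(dist ~ speed, data=cars), lm(log(dist) ~ speed, data=cars))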

Using AIC on models for the slugs data

I reload the slugs data set and fit the six models we considered last time.

slugs <- read.table('http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/slugsurvey.txt', header=TRUE)
# Poisson models: common mean and separate means by field type
glm.pois1 <- glm(slugs~1, data=slugs, family=poisson)
glm.pois2 <- glm(slugs~field, data=slugs, family=poisson)
# negative binomial models with a common dispersion parameter
library(MASS)
glm.NB1 <- glm.nb(slugs~1, data=slugs)
glm.NB2 <- glm.nb(slugs~field, data=slugs)
# negative binomial models with separate dispersions by field type
library(gamlss)
glm.NB3 <- gamlss(slugs ~ 1, sigma.formula=~field, data=slugs, family=NBI)
glm.NB4 <- gamlss(slugs ~ field, sigma.formula=~field, data=slugs, family=NBI)
# log-likelihoods and parameter counts for the six models
LL.glm <- sapply(list(glm.pois1, glm.pois2, glm.NB1, glm.NB2, glm.NB3, glm.NB4), logLik)
k.glm <- sapply(list(coef(glm.pois1), coef(glm.pois2), c(coef(glm.NB1), glm.NB1$theta), c(coef(glm.NB2), glm.NB2$theta), c(coef(glm.NB3), coef(glm.NB3, what='sigma')), c(coef(glm.NB4), coef(glm.NB4, what='sigma'))), length)
labels.glm <- c('glm.pois1', 'glm.pois2', 'glm.NB1', 'glm.NB2', 'glm.NB3', 'glm.NB4')
output.glm <- data.frame(model = labels.glm, LL = LL.glm, k = k.glm)
output.glm
      model        LL k
1 glm.pois1 -176.8383 1
2 glm.pois2 -171.1275 2
3   glm.NB1 -144.3980 2
4   glm.NB2 -142.6750 3
5   glm.NB3 -138.2398 3
6   glm.NB4 -137.1253 4
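Given the log-likelihoods and parameter counts in this table we could compute AIC by hand; the one-liner below is equivalent to the AIC function used next.

transform(output.glm, AIC = -2*LL + 2*k)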

To calculate AIC we can use the AIC function of R. It accepts multiple model objects at once, so we can just give it the names of the models as arguments.

AIC(glm.pois1, glm.pois2, glm.NB1, glm.NB2, glm.NB3, glm.NB4)
          df      AIC
glm.pois1  1 355.6766
glm.pois2  2 346.2551
glm.NB1    2 292.7961
glm.NB2    3 291.3500
glm.NB3    3 282.4796
glm.NB4    4 282.2505

Notice that the model with the lowest AIC is NB4. Recall that this is not the model we chose using significance testing. There we chose NB3, which to be fair has an AIC only slightly larger than that of NB4. If we redo the likelihood ratio test comparing models NB3 and NB4 to test H0: β1 = 0 (the field effect on the mean), we fail to reject the null hypothesis (p = 0.135).

LR.test(glm.NB3, glm.NB4)
 Likelihood Ratio Test for nested GAMLSS models.
 (No check whether the models are nested is performed).
 
       Null model: deviance= 276.4796 with  3 deg. of freedom
 Altenative model: deviance= 274.2505 with  4 deg. of freedom
 
 LRT = 2.229016 with 1 deg. of freedom and p-value= 0.1354401

AIC also prefers model NB2 over NB1. If we compare NB1 and NB2 using a likelihood ratio test to test H0: β1 = 0, we fail to reject the null hypothesis (p = 0.06) and thus should prefer the simpler model NB1.

anova(glm.NB1, glm.NB2, test='Chisq')
Likelihood ratio tests of Negative Binomial Models

Response: slugs
  Model     theta Resid. df    2 x log-lik.   Test    df LR stat.    Pr(Chi)
1     1 0.7155672        79       -288.7961                                
2 field 0.7859313        78       -285.3500 1 vs 2     1 3.446124 0.06340028

This is not unusual. Because of the arbitrariness of the α = 0.05 cut-off, AIC can prefer the more complicated of two nested models even when a likelihood ratio test fails to find them significantly different. As a result the AIC-best model can contain some coefficients that are not significantly different from zero. The significance level at which significance testing will choose the same model that AIC does varies with the number of additional parameters being tested. A general statement we can make is that AIC will always prefer the more complicated of two nested models when the additional terms are significantly different from zero, but AIC can also prefer the more complicated model even when they're not (although in this case typically 0.05 < p < 0.15).
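To make the cutoff precise in the simplest case: for two nested models differing by a single parameter, AIC prefers the larger model exactly when the likelihood ratio statistic exceeds 2, and the corresponding p-value boundary is easily computed.

1 - pchisq(2, df=1)    # approximately 0.157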

Fitting species area curves to the Galapagos Islands plant richness data set

Johnson and Raven (1973) provide data on plant species richness for 29 islands in the Galapagos Islands archipelago. Their paper was presented as a rebuttal to Hamilton et al. (1963) who claimed that a species-area effect is absent in the Galapagos Islands. Using an expanded data base, Johnson and Raven (1973) argued that the Galapagos Islands do exhibit a species-area effect. In their paper they considered only two forms for the species-area relationship:

  1. a form in which species richness is linearly related to area, with normally distributed errors (Model 1 below), and
  2. a form in which log-transformed species richness is linearly related to log-transformed area, with normally-distributed errors (Model 7 below).

We revisit the analysis of Johnson and Raven but consider an expanded set of possible species-area relations using statistical methodology that was not available in 1973. We consider eight models. In the model descriptions that follow S denotes species richness on an island and A denotes the area of that island.

Normal models with the raw response

  1. Linear model: S ~ Normal(μ, σ²) with identity link such that μ = β0 + β1·A.
  2. Gleason model: S ~ Normal(μ, σ²) with identity link such that μ = β0 + β1·log(A).
  3. Nonlinear Arrhenius model: S ~ Normal(μ, σ²) with identity link such that μ = β0·A^β1.

Count models

  4. Poisson GLM: S ~ Poisson(μ) with log link such that log(μ) = β0 + β1·log(A).
  5. Negative binomial GLM (NB-2): S ~ NB-2(μ, θ) with log link such that log(μ) = β0 + β1·log(A).
  6. Negative binomial GLM (NB-1): S ~ NB-1(μ, σ) with log link such that log(μ) = β0 + β1·log(A).

Normal models with a transformed response

  7. Arrhenius model: log(S) ~ Normal(μ, σ²) with identity link such that μ = β0 + β1·log(A).
  8. Square root model: √S ~ Normal(μ, σ²) with identity link such that μ = β0 + β1·log(A).

The data set included in Johnson and Raven (1973) was missing some elevation data. I found a corrected version of the data set with these values supplied at

http://www.ibiblio.org/pub/academic/biology/ecology+evolution/teaching/weisberg/galapago.dat.

The cleaned up version of the data set, galapagos.txt, is at our class web site. The file is space-delimited with the variable names appearing in the first row.

gala <- read.table( 'ecol 563/galapagos.txt', header=T)
gala[1:4,]
     Island Species Endemics  Area Elevation Nearest.dist Santacruz.dist Adjacent.isl.area
1    Baltra      58       23 25.09       100          0.6            0.6              1.84
2 Bartolome      31       21  1.24       109          0.6           26.3            572.33
3  Caldwell       3        3  0.21       114          2.8           58.7              0.78
4  Champion      25        9  0.10        46          1.9           47.4              0.18

Normal models with the raw response

Model 1: Linear Model

I fit a linear model with Area on the raw scale as the predictor using the lm function. We can extract the log-likelihood of the fitted model with the logLik function and the AIC with the AIC function.

#linear model
out.norm1 <- lm(Species~Area, data=gala)
AIC(out.norm1)
[1] 349.1721

I superimpose the fitted model on a scatter plot of the data.

plot(Species~Area, data=gala)
abline(out.norm1, lty=2, col=2)


Fig. 2   Linear model for mean species richness as a function of area

From the scatter plot alone we would conclude that this is a terrible fit. This is the typical hot dog on a stick plot where a regression line emerges from a cloud of points and runs out to a single influential point that almost single-handedly determines the slope.

Model 2: Gleason Model

The second model, the Gleason model, uses log-transformed Area rather than Area as the predictor. Since the response in models 1 and 2 is the same, we can use the reported log-likelihood or AIC to compare the Gleason and linear models.

out.norm2 <- lm(Species~log(Area), data=gala)
sapply(list(out.norm1, out.norm2), AIC)
[1] 349.1721 335.1547

The AIC of the Gleason model is less than that of the linear Area model. I superimpose the fitted Gleason model on a scatter plot of the data.

plot(Species~log(Area), data=gala)
abline(out.norm2, lty=2)


Fig. 3  Scatter plot of raw data showing the Gleason regression model

This too is clearly a bad model. Based on Fig. 3 the model underestimates species richness at low and high areas, and overestimates it at intermediate areas. Perhaps even more troubling the Gleason model predicts negative values for species richness even within the range of the data that were used to fit the model. When log(Area) drops below –2 species richness is predicted to be negative. This is obviously a nonsensical prediction.
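We can verify where the fitted line crosses zero from the estimated coefficients.

# log(Area) value at which predicted species richness hits zero
-coef(out.norm2)[1]/coef(out.norm2)[2]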

Fig. 4 shows the residual plot with a lowess curve (a locally weighted smooth that reflects local patterns) superimposed. The residuals function is used to extract the residuals from a model object. As expected the lowess curve shows that there is still a pattern in the residuals unaccounted for by the model. The residuals are first positive, then negative, then positive which matches our interpretation of model fit based on Fig. 3.

plot(residuals(out.norm2)~log(Area), data=gala)
lines(lowess(residuals(out.norm2)~log(gala$Area)), col=2)
abline(h=0, lty=2, col=4)


Fig. 4  Residuals from the Gleason model plotted against the predictor log(Area) with a lowess curve superimposed

Count models

Model 4: Poisson model

I carry out Poisson regression with a log link using log(Area) as the predictor. Poisson regression can be carried out with the glm function by specifying family=poisson.

#Poisson and negative binomial models
out.pois <- glm(Species~log(Area), data=gala, family=poisson)
sapply(list(out.norm1, out.norm2, out.pois), AIC)
[1] 349.1721 335.1547 800.0266

Because the Poisson model has one fewer parameter than the other models, we should use AIC rather than log-likelihood to compare the models. As we can see from the AIC values the Poisson model is ranked last.

Model 5: Negative binomial model (NB-2)

I carry out negative binomial regression with a log link using log(Area) as the predictor. The function for this is glm.nb from the MASS package.

library(MASS)
out.NB <- glm.nb(Species~log(Area), data=gala)
out.NB$theta
[1] 2.634311
sapply(list(out.norm1, out.norm2, out.pois, out.NB), AIC)
[1] 349.1721 335.1547 800.0266 273.9926

The negative binomial model has yielded the lowest AIC obtained thus far; the Poisson model has yielded the worst. The value of θ indicates a modest amount of overdispersion, perhaps explaining why the Poisson model ranks so low. I plot the predicted means from the negative binomial model and the Poisson model on the same graph. Because both models use a log link by default, we need to exponentiate the regression equation in order to recover the mean.

plot(Species~log(Area), data=gala)
coef(out.pois)
(Intercept)   log(Area)
  3.2409698   0.3429906
poisson.func <- function(x) exp(coef(out.pois)[1]+coef(out.pois)[2]*x)
coef(out.NB)
(Intercept)   log(Area)
  3.1495864   0.3669952
NB.func <- function(x) exp(coef(out.NB)[1]+coef(out.NB)[2]*x)
curve(poisson.func, add=T, col=2)
curve(NB.func, add=T, col=4, lty=2)
legend('topleft', c('Poisson', 'negative binomial'), col=c(2,4), lty=c(1,2), cex=.9, bty='n')


Fig. 5  Predicted means of regression models using Poisson and negative binomial distributions for the response

The two models displayed in Fig. 5 look very similar. How is it possible that the model ranked best using AIC looks nearly indistinguishable from the model ranked worst using AIC? Keep in mind that the difference in AIC here (800 versus 274) is gigantic.

This apparent non sequitur illustrates the difference between assessing fit in the least squares approach and assessing fit from a maximum likelihood perspective. When we fit a probability model to data we are not trying to find some sort of best-fit curve in the same sense that least squares does. Instead we're trying to determine a plausible probability generating mechanism for the data. From this perspective a good model is one that could have generated the observed data. One way to assess model fit would be to use the model to generate simulated data sets and ask if the observed data set looks like the simulated data sets. This approach to model assessment is called predictive simulation and has been used previously in this course.
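As a quick sketch of this idea (my own, not part of the original analysis), we can generate one simulated richness data set from each fitted model and compare the spread of the simulated values to that of the observed data.

set.seed(10)
sim.pois <- rpois(nrow(gala), lambda=fitted(out.pois))
sim.NB <- rnbinom(nrow(gala), mu=fitted(out.NB), size=out.NB$theta)
c(observed=var(gala$Species), Poisson=var(sim.pois), NB=var(sim.NB))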

The reason the Poisson model is rejected by AIC in favor of the negative binomial model is obviously not due to a difference in the predicted means. Rather it's because given the same means, a Poisson model cannot generate the data we observed while the negative binomial model can. A major difference between the Poisson and negative binomial model is in their mean-variance relationships. A Poisson model assumes the mean and variance are equal, while a negative binomial model assumes that the variance is a quadratic function of the mean. Apparently the increasing spread in the data around the mean trajectory shown in Fig. 5 is more typical of a negative binomial distribution than it is of a Poisson distribution.

Model 6: Negative binomial model (NB-1)

There is a second type of negative binomial model called the NB-1 model by Cameron and Trivedi (1998). (They refer to the model that is fit by the glm.nb function as the NB-2 model.) These two models differ in how the mean and variance of the negative binomial distribution are related. In the NB-1 model the variance and mean are linearly related while in the NB-2 model they are quadratically related.

NB-1: Var(S) = μ + α·μ = (1 + α)·μ
NB-2: Var(S) = μ + μ²/θ

Why there are two different negative binomial models will be explained at a later time. Both the NB-1 model and the NB-2 model can be fit using the gamlss function from the gamlss package. Unfortunately the gamlss notation for these distributions is the opposite of everyone else's. The gamlss package reverses the numbering and refers to the NB-1 and NB-2 models as the NBII and NBI models respectively.

# fit NB models with gamlss
library(gamlss)
# quadratic mean variance relationship--same as glm.nb
out.gamliss1 <- gamlss(Species~log(Area), data=gala, family=NBI)
AIC(out.NB, out.gamliss1)
             df      AIC
out.NB        3 273.9926
out.gamliss1  3 273.9926
# linear mean variance relationship
out.gamliss2 <- gamlss(Species~log(Area), data=gala, family=NBII)
AIC(out.NB, out.gamliss1, out.gamliss2)
             df      AIC
out.NB        3 273.9926
out.gamliss1  3 273.9926
out.gamliss2  3 277.8913

So for these data the negative binomial with a quadratic variance-mean relationship has a lower AIC.

Assessing the variance-mean relationship in data

Although using AIC we've rejected a linear variance-mean relationship in favor of a quadratic variance-mean relationship, it would be nice to have an independent verification of this result. We can obtain one by grouping the data using the values of a predictor, calculating the mean and variance of the response in each group, and plotting the result. The regressor we've been using in the model is log(Area). Because it's a continuous variable a sensible way to group its values is to use quantiles. We need enough groups to detect the relationship between the variance and the mean while leaving enough observations within each group to accurately calculate a variance. With 29 observations choosing quintiles will give us approximately six observations per group. The first argument to the quantile function is the variable and the second argument is a vector of the desired quantiles.

quantile(log(gala$Area), seq(0,1,.2))
        0%        20%        40%        60%        80%       100%
-4.6051702 -1.5060647  0.2940422  2.8807906  4.9746404  8.4487687

Next we classify observations by the quintile interval they fall into. The cut function can be used for this purpose. The first argument to cut is the variable to be cut and the second argument is a vector of cut points (in this case the quintiles including the minimum and maximum). By default the cut function constructs half-open intervals, (a, b], open on the left but closed on the right so that a < x ≤ b. To change the first interval so that it's closed on both ends I add the argument include.lowest=TRUE. This forces the minimum value of log(Area) to be included in one of the intervals.

out.cut <- cut(log(gala$Area), quantile(log(gala$Area), seq(0,1,.2)), include.lowest=T)
out.cut
 [1] (2.88,4.97]   (-1.51,0.294] [-4.61,-1.51] [-4.61,-1.51] [-4.61,-1.51] (-1.51,0.294]
 [7] (0.294,2.88]  [-4.61,-1.51] [-4.61,-1.51] (2.88,4.97]   (4.97,8.45]   (-1.51,0.294]
[13] (-1.51,0.294] (0.294,2.88]  (4.97,8.45]   (2.88,4.97]   [-4.61,-1.51] (2.88,4.97] 
[19] (2.88,4.97]   (-1.51,0.294] (0.294,2.88]  (4.97,8.45]   (4.97,8.45]   (4.97,8.45] 
[25] (2.88,4.97]   (4.97,8.45]   (0.294,2.88]  (-1.51,0.294] (0.294,2.88]
Levels: [-4.61,-1.51] (-1.51,0.294] (0.294,2.88] (2.88,4.97] (4.97,8.45]
table(out.cut)
out.cut
[-4.61,-1.51] (-1.51,0.294]  (0.294,2.88]   (2.88,4.97]   (4.97,8.45]
            6             6             5             6             6

As hoped we have roughly equal numbers of observations in each group. Next I calculate the mean and variance of species richness in each group using tapply.

# use area groupings to calculate mean and variance of richness
out.m <- tapply(gala$Species, out.cut, mean)
out.v <- tapply(gala$Species, out.cut, var)

Finally I plot the variance against the mean, estimate a regression model, and superimpose the regression model on a scatter plot of the data. We can square the predictor beforehand or do it within the regression formula by enclosing the arithmetic operation in the I function. Without the I function the ^ operator would be interpreted as a formula operator rather than as arithmetic, and the quadratic term would be ignored.

# plot variance versus mean and superimpose quadratic fit
plot(out.v~out.m, xlab='Mean', ylab='Variance')
out.r <- lm(out.v~out.m + I(out.m^2))
coef(out.r)
(Intercept)       out.m  I(out.m^2)
381.7404970 -10.9289622   0.2067509
curve(coef(out.r)[1]+coef(out.r)[2]*x+coef(out.r)[3]*x^2, add=T, col=2)

For comparison I add the linear relationship to the plot. As the plot indicates, the quadratic model appears to do a good job of approximating the observed pattern, better than the linear model. Of course with only five observations and four estimated parameters (three regression parameters plus the variance) the AICc for the quadratic model is infinite, suggesting that we are overfitting these data.

out.r.linear <- lm(out.v~out.m)
abline(out.r.linear, col=4, lty=2)
legend('topleft', c('linear','quadratic'), col=c(4,2), lty=c(2,1), bty='n', cex=.9)


Fig. 6  Variance-mean relationship of the response variable species richness.
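The AICc function sketched earlier confirms the infinite AICc noted above: with n = 5 and K = 4 the correction term divides by n – K – 1 = 0.

AICc(out.r)    # Inf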

Cited References

Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. Pp. 267–281 in B. N. Petrov and F. Csáki, editors. Second International Symposium on Information Theory. Akadémiai Kiadó, Budapest.
Burnham, K. P. and D. R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd edition. Springer-Verlag, New York.
Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge University Press, Cambridge.
Hamilton, T. H., I. Rubinoff, R. H. Barth, Jr., and G. L. Bush. 1963. Species abundance: natural regulation of insular variation. Science 142: 1575–1577.
Johnson, M. P. and P. H. Raven. 1973. Species number and endemism: the Galápagos Archipelago revisited. Science 179: 893–895.

References on AIC

Note: a number of David Anderson's papers can be downloaded from his web site.

…and a few words from the critics

R Code

A compact collection of all the R code displayed in this document appears here.



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--October 31, 2012
URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture19.htm