Lecture 11—Friday, February 3, 2012

Outline of lecture

R functions and commands demonstrated

R function options

R packages used

Data

https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/data/corals.csv

Overview

This is a portion of the data set analyzed by Bruno et al. (2007). The abstract of their paper is given below.

Very little is known about how environmental changes such as increasing temperature affect disease dynamics in the ocean, especially at large spatial scales. We asked whether the frequency of warm temperature anomalies is positively related to the frequency of coral disease across 1,500 km of Australia's Great Barrier Reef. We used a new high-resolution satellite data set of ocean temperature and 6 yr of coral disease and coral cover data from annual surveys of 48 reefs to answer this question. We found a highly significant relationship between the frequencies of warm temperature anomalies and of white syndrome, an emergent disease, or potentially, a group of diseases, of Pacific reef-building corals. The effect of temperature was highly dependent on coral cover because white syndrome outbreaks followed warm years, but only on high (> 50%) cover reefs, suggesting an important role of host density as a threshold for outbreaks. Our results indicate that the frequency of temperature anomalies, which is predicted to increase in most tropical oceans, can increase the susceptibility of corals to disease, leading to outbreaks where corals are abundant.

The variables of interest in the data set are the following.

  1. PREV_1: the number of coral colonies infected with white syndrome
  2. CORAL_COVE: the percentage of the bottom covered by living coral
  3. WSSTA: a thermal stress metric developed by the authors

The data arise from repeated surveys of 48 reefs using 50-meter permanent transects. Although the selected reefs exhibit a marked spatial structure with a high degree of clustering and the data were collected repeatedly over time, we will ignore the issue of spatial and temporal correlation in our analysis.

Examining the data

The data file was saved from Excel as a comma-delimited text file. Such a file can be read into R using the read.table function by specifying the argument sep=',' to indicate that variable fields are separated by commas. I stored the file in the 'ecol 562' folder in my Documents folder. The Documents folder is my default R working directory, so I only need to specify 'ecol 562/corals.csv' to reference the file.

#using read.table
corals <- read.table('ecol 562/corals.csv', header=T, sep=',')

Alternatively, the read.csv function assumes the file is comma-delimited and so doesn't require the sep argument.

#using read.csv
corals <- read.csv('ecol 562/corals.csv', header=T)
names(corals)
 [1] "REEF_NAME"  "SECTOR"     "SHELF"      "LAT_DD"     "LON_DD"     "DATE"       "YEAR"     
 [8] "YEARII"     "WEEK"       "WSSTA"      "TSASUMS_SM" "PREV_1"     "CORAL_COVE" "INC"

The response variable of interest is PREV_1. To assess the nature of its relationship to coral cover and WSSTA, I plot it against each of these variables in turn. To better see the nature of the relationship I superimpose a "locally weighted smooth". This is a non-parametric regression model that responds to local patterns in the data. We'll discuss non-parametric regression in detail later in the course so I'll spare you the details until then. I use the R lowess function to estimate the smooth and the lines function to draw the regression curve. (The lowess function returns a set of x- and y-coordinates of points that can then be connected with line segments.) I limit the y-range of the data so that the lowess curve is more visible. The lowess function doesn't have a data argument so I have to reference the data frame as part of the variable name.

#plot response versus coral cover
plot(PREV_1~CORAL_COVE, data=corals, ylim=c(0,50), cex=.8)
#add regression smoother
lines(lowess(corals$PREV_1~corals$CORAL_COVE), col=2)
#plot response versus WSSTA
plot(PREV_1~WSSTA, data=corals, ylim=c(0,20), cex=.8)
#add regression smoother
lines(lowess(corals$PREV_1~corals$WSSTA), col=2)

The plot against coral cover suggests a segmented or even a quadratic relationship. The plot against WSSTA suggests a weak quadratic relationship. Of course these are just bivariate relationships. It is possible that when we control for other variables in the regression model the nature of the relationship with the response may change.

(a) fig. 1a (b) fig. 1b
Fig. 1  Plots of disease prevalence against (a) coral cover and (b) WSSTA. A locally weighted regression smoother is shown to better indicate the nature of the relationship between the variables.

Normal model

I start by assuming the response is normally distributed and based on Fig. 1 I model the mean as quadratic in WSSTA, linear in coral cover, with an interaction between coral cover and WSSTA.

PREV_1 ~ Normal(μ, σ²), where
μ = β0 + β1 CORAL_COVE + β2 WSSTA + β3 WSSTA² + β4 CORAL_COVE × WSSTA

To include a quadratic term in the model we could create a variable directly in the corals data frame that is the square of WSSTA. Alternatively we can just square WSSTA within the model call. For that we need to use the I function of R, which permits one to perform arithmetic on variables in a model. Without the I function the arithmetic calculations would be ignored.
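To see why I is needed, consider a small simulated example (the variables x and y below are made up, not from the coral data). Inside a formula the caret is a formula operator, so x^2 is reduced to x and the quadratic term is silently dropped unless it is wrapped in I.

```r
#made-up data to illustrate the I function
set.seed(1)
x <- runif(50)
y <- 1 + 2*x - 3*x^2 + rnorm(50, sd=.1)
#without I() the x^2 term is absorbed: only an intercept and x are fit
length(coef(lm(y ~ x + x^2)))
#with I() the quadratic term is included as intended
length(coef(lm(y ~ x + I(x^2))))
```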

model1 <- lm(PREV_1~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals)

Because all of the predictors are continuous, the output from the anova and summary functions is similar here. The anova function is still useful because it carries out variables-added-in-order tests in a logical fashion, so that WSSTA is added before WSSTA^2 and the interaction term is added last.

anova(model1)
Analysis of Variance Table

Response: PREV_1
                  Df Sum Sq Mean Sq F value    Pr(>F)   
CORAL_COVE         1  60212   60212  47.449 3.839e-11 ***
WSSTA              1    127     127   0.100 0.7520837   
I(WSSTA^2)         1  19690   19690  15.516 0.0001038 ***
CORAL_COVE:WSSTA   1  14746   14746  11.620 0.0007500 ***
Residuals        275 348972    1269                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Because this is a variables-added-in-order table, all the tests make sense. First we see there is a significant effect due to coral cover. Controlling for coral cover there is no linear effect of WSSTA, but there is a quadratic effect. Finally the interaction between coral cover and WSSTA is statistically significant.

The summary table provides variables-added-last tests so all but the tests of WSSTA^2 and the interaction are silly. (For instance, it doesn't make sense to test the significance of WSSTA with both the quadratic term and the interaction term already in the model.) From the estimates we see that disease prevalence increases with coral cover and that the coral cover effect on disease prevalence is further increased by WSSTA (the interaction term). The quadratic effect of WSSTA is in the form of a parabola opening down. Without plotting the model we can't tell if the vertex of the parabola occurs at a value of WSSTA within the range of the data, but from Fig. 1b we would suspect that it does.

summary(model1)

Call:
lm(formula = PREV_1 ~ CORAL_COVE * WSSTA + I(WSSTA^2), data = corals)

Residuals:
    Min      1Q  Median      3Q     Max
-86.716 -11.684  -2.682   4.856 263.366

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)      -10.71376    6.28473  -1.705  0.08937 . 
CORAL_COVE         0.45081    0.17105   2.636  0.00888 **
WSSTA              1.32007    1.19505   1.105  0.27029   
I(WSSTA^2)        -0.16345    0.04140  -3.948  0.00010 ***
CORAL_COVE:WSSTA   0.07528    0.02208   3.409  0.00075 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 35.62 on 275 degrees of freedom
Multiple R-squared: 0.2136,            Adjusted R-squared: 0.2021
F-statistic: 18.67 on 4 and 275 DF,  p-value: 1.365e-13

Assessing the fit of the normal model

To assess the fit of the model we can plot the estimated model superimposed on the data. Because there are two continuous predictors in the model we really need three dimensions to fully visualize things, but it is possible to obtain a crude picture in two dimensions. Each observation in the data set has a different value of coral cover. The unique function returns the distinct values of a variable with duplicates removed.

# each observation has a different value of coral cover
length(unique(corals$CORAL_COVE))
[1] 280
dim(corals)
[1] 280  14

We can use this to our advantage by sorting the data frame by coral cover and refitting the model. The model predictions we obtain are now ordered by increasing coral cover so that we can plot the predicted mean against coral cover and connect the predictions by line segments. The result will be a jagged line with the jags corresponding to adjustments in the predicted mean due to an observation's value of WSSTA.

The order function can be used to sort a data frame by one of its variables. To illustrate how order works I create a vector of the first 10 positive integers and then randomly permute them with the sample function. By default the sample function uses the internal clock seed to initialize its random number generator. I set the seed explicitly with the set.seed function so that I can generate the same "random" permutation again if desired. To use set.seed enter a positive integer of any size as its argument. Different choices of the integer will yield different seeds and hence different permutations.

#explanation of the order function
set.seed(20)
x <- sample(1:10)
x
[1] 9 7 3 4 6 5 1 10 8 2

The order function returns the positions of the entries that would appear first, second, etc. if the vector was sorted. So, we see that the element at position 7 would be first, the element at position 10 would be second, etc.

order(x)
[1] 7 10 3 4 6 5 2 9 1 8

The output of the order function can then be used to sort the elements of x.

x[order(x)]
[1] 1 2 3 4 5 6 7 8 9 10

With a data frame we use the order function to sort the rows of the data frame. The following call sorts the rows of the data frame by coral cover and stores the result in a new data frame.

#sort data in coral cover order
corals.sort <- corals[order(corals$CORAL_COVE),]

I now refit the regression model using this new data frame.

#refit model to sorted data
model1 <- lm(PREV_1~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals.sort)

Next I plot prevalence against coral cover and then superimpose the predicted means connected by line segments.

#plot raw data and normal mean
plot(PREV_1~CORAL_COVE, data=corals.sort, ylim=c(-25,350), cex=.6)
lines(corals.sort$CORAL_COVE, predict(model1), col=2)

fig. 2
Fig. 2  Plot of the predicted normal mean for each observation (red). The jags are due to variations in WSSTA for individual observations.

Notice that in a number of places the predicted mean prevalence is negative! We can get a slightly more informative picture if we include the likely range for observations that could be generated by this model. I construct confidence bounds within which we expect 95% of the observations to lie. For this we need the standard deviation of the estimated normal curve. This value is contained in the sigma component of the object created by the summary function.

#add bounds expected to contain 95% of the observations
plot(PREV_1~CORAL_COVE, data=corals.sort, ylim=c(-100,350), cex=.6)
lines(corals.sort$CORAL_COVE, predict(model1) + qnorm(.025) * summary(model1)$sigma, col=4, lty=2)
lines(corals.sort$CORAL_COVE, predict(model1) + qnorm(.975) * summary(model1)$sigma, col=4, lty=2)

fig. 3
Fig. 3  Repeat of Fig. 2 but with 95% confidence bands for individual observations shown (blue).

Notice that although there are some obvious outliers, most of the observations do fall within the 95% bands. Yet it's quite obvious given that the lower confidence band is almost entirely negative that this is a terrible model.

We can get a better sense of that by using the model to generate data and then assessing whether the data generated by the model look at all like the data we actually obtained. To do this I use the rnorm function to generate one observation from a normal distribution for each of the predicted means. In the call below, x corresponds to the mean that is obtained from the predict function applied to the model. In order to get a single value for each mean we need to use the sapply function.

#simulate data from normal model
set.seed(10)
test.data <- sapply(predict(model1), function(x) rnorm(1, x, summary(model1)$sigma))

We plot the simulated data and compare it against the actual data. To facilitate the comparison the graphs are placed one above the other in a single graphics window using the mfrow argument of the global graphics function par. In mfrow we need to specify the dimensions of the graph matrix we wish to create: the number of rows followed by the number of columns. After creating the graphs we then reset mfrow back to its default of 1 row and 1 column.

#display two graphs in same window
par(mfrow=c(2,1))
plot(PREV_1~CORAL_COVE, data=corals.sort, ylim=c(-100,350), cex=.6)
plot(test.data~corals.sort$CORAL_COVE, cex=.6, ylim=c(-100,350))
abline(h=0, col=2, lty=2)
par(mfrow=c(1,1))

fig. 4
Fig. 4  Display of actual data (top) and one realization of data generated from the estimated normal model. Clearly the two do not look at all alike.

It's pretty clear that the simulated data don't look anything like the actual data. Not only are there a lot of negative values in the simulated data set, the data form a wide band around the predicted mean. The actual data form a very narrow band with most of the observations having a prevalence of zero. Furthermore the normal model seems incapable of generating the large outliers that we see in the raw data.

Poisson model

The Poisson distribution is a natural choice for count data so we try fitting it next. The default choice is to use a log link for the mean.

PREV_1 ~ Poisson(λ), where
log λ = β0 + β1 CORAL_COVE + β2 WSSTA + β3 WSSTA² + β4 CORAL_COVE × WSSTA

model2 <- glm(PREV_1~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals.sort, family=poisson)
anova(model2, test='Chisq')
Analysis of Deviance Table

Model: poisson, link: log

Response: PREV_1

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)   
NULL                               279    11827.0             
WSSTA             1     36.6       278    11790.4 1.417e-09 ***
CORAL_COVE        1   4632.3       277     7158.0 < 2.2e-16 ***
I(WSSTA^2)        1   1967.7       276     5190.4 < 2.2e-16 ***
WSSTA:CORAL_COVE  1    278.1       275     4912.3 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

coef(model2)
     (Intercept)            WSSTA       CORAL_COVE       I(WSSTA^2) WSSTA:CORAL_COVE
    -0.034946574      0.305609888      0.033218444     -0.027039714      0.003803533

From the anova output we see that each variable added in order is statistically significant, agreeing with our conclusions from the normal model. The coefficient estimates have the same sign as in the normal model indicating the direction of the effects is also the same. We next try simulating data from the fitted Poisson model to see if the results more closely resemble what we observed. The rpois function generates random values from a specified Poisson distribution. Because we used a log link we have to exponentiate the predicted values to get the means.

par(mfrow=c(2,1))
#simulate data from Poisson model
set.seed(12)
test.data2 <- sapply(exp(predict(model2)), function(x) rpois(1,x))
plot(PREV_1~CORAL_COVE, data=corals.sort, ylim=c(-100,350), cex=.6)
abline(h=0, lty=2, col=2)
plot(test.data2~corals.sort$CORAL_COVE, cex=.6, ylim=c(-100,350))
abline(h=0, lty=2, col=2)
par(mfrow=c(1,1))

fig. 5
Fig. 5  Display of actual data (top) and one realization of data generated from the estimated Poisson model (bottom).

The Poisson model appears to do a lot better than the normal in producing data that resemble what we actually observed. The Poisson model respects the lower bound of zero and is even able to generate a couple of the outliers. One noteworthy deficiency is that the spread in the Poisson data is too regular when compared to the actual data. In particular at low values of coral cover the variability in the predictions is extremely low whereas the actual data show far more spread. The regularity in the simulated data is a consequence of the fact that in the Poisson distribution the mean is equal to the variance.

We can compare the Poisson and normal models using log-likelihood and AIC. Because the Poisson estimates one less parameter than does the normal, AIC provides a fairer comparison.
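Recall that AIC = −2 × log-likelihood + 2 × (number of estimated parameters), which is how AIC penalizes the normal model's extra parameter. A quick check on simulated data (unrelated to the coral data) confirms that R's AIC function computes exactly this quantity.

```r
#verify AIC = -2*logLik + 2*df on a simulated Poisson regression
set.seed(1)
d <- data.frame(x=runif(30))
d$y <- rpois(30, exp(1 + d$x))
m <- glm(y ~ x, data=d, family=poisson)
ll <- logLik(m)
c(by.hand = -2*as.numeric(ll) + 2*attr(ll, 'df'), AIC = AIC(m))
```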

#compare log-likelihood and AIC of Poisson and normal model
logLik(model1)
'log Lik.' -1395.217 (df=6)
logLik(model2)
'log Lik.' -2758.778 (df=5)
AIC(model1, model2)
       df      AIC
model1  6 2802.434
model2  5 5527.556

Using either AIC or log-likelihood we conclude that the Poisson is a far worse model than the normal. Given what we saw in Figs. 4 and 5 this is somewhat surprising. Because we're comparing a density with a probability, perhaps the midpoint approximation of the normal probability is doing a poor job here. To check this we can write our own functions to calculate the log-likelihood and compare the midpoint approximation against using the actual area under the normal curve. Because the maximum likelihood estimate of σ² uses a denominator of n rather than n − p, we need to calculate it explicitly in the function and use a denominator of n. The residuals function extracts estimates of the model errors that can then be used to construct the sum of squared errors (SSE).

#log-likelihood function for normal model using midpoint approximation
norm.loglike1 <- function(model, y) {
  #MLE of sigma^2 uses denominator n
  sigma2 <- sum(residuals(model)^2)/length(y)
  #midpoint (density) approximation to the count probabilities
  prob <- dnorm(y, mean=predict(model), sd=sqrt(sigma2))
  loglike <- sum(log(prob))
  loglike
}

#log-likelihood function for normal model using exact probability calculation
norm.loglike2 <- function(model, y) {
  #MLE of sigma^2 uses denominator n
  sigma2 <- sum(residuals(model)^2)/length(y)
  #exact probability: area under the normal curve from y-.5 to y+.5
  prob <- pnorm(y+.5, mean=predict(model), sd=sqrt(sigma2)) -
          pnorm(y-.5, mean=predict(model), sd=sqrt(sigma2))
  loglike <- sum(log(prob))
  loglike
}

norm.loglike1(model1,corals.sort$PREV_1)
[1] -1395.217
norm.loglike2(model1,corals.sort$PREV_1)
[1] -1395.222

The log-likelihood using the actual probabilities and the log-likelihood using the density approximation are nearly identical, so the poor performance must be due to the Poisson model itself. The source of the problem is the inflexibility of the Poisson distribution. To see this I examine some of the large values in the data set simulated from the Poisson model.

#simulated values
test.data2[test.data2>70]
 53 210  56  55 227 154 226
 74  79  96  81 133 300 307
#corresponding observed values
corals.sort$PREV_1[test.data2>70]
[1] 315   3  18   6  20 343 336
#Poisson means
exp(predict(model2)[test.data2>70])
       53       210        56        55       227       154       226
 72.94906  85.16031 101.85071  87.97483 129.71265 288.82104 287.66155

The second observation shown had a predicted mean of 85.2 that yielded a simulated random count of 79. The observed value for this observation was a count of 3. What's the probability of obtaining such a count under the Poisson model?

dpois(3,exp(predict(model2)[test.data2>70])[2])
[1] 1.066368e-32

It's astronomically low. In fact the total probability of obtaining counts 69 or less is less than .05.

ppois(69,exp(predict(model2)[test.data2>70])[2])
[1] 0.04139172

Under the normal model the predicted mean of this observation is 51.7, quite a bit lower than the Poisson, but still much larger than the observed value of 3. The probability of realizing a count of 3 under the normal model is 0.004, roughly 10^29 times higher than for the Poisson. The probability of seeing a value of 3 or smaller is 0.088 (of course this includes negative values too). Extreme values are much more likely under the normal model than under the Poisson.

predict(model1)[test.data2>70][2]
     211
51.68266
pnorm(3.5, predict(model1)[test.data2>70][2], summary(model1)$sigma) - pnorm(2.5, predict(model1)[test.data2>70][2], summary(model1)$sigma)
[1] 0.004401928
pnorm(3.5, predict(model1)[test.data2>70][2], summary(model1)$sigma)
[1] 0.08809547

Negative binomial model

Basic characteristics

A negative binomial (NB) distribution is discrete and, like the Poisson distribution, is bounded below by 0 but is theoretically unbounded above. The negative binomial distribution is a two-parameter distribution, written X ~ NB(μ, θ), where the only restriction on μ and θ is that they are positive. Here μ is the mean of the distribution and θ is called the size parameter (in the R documentation), or more usually the dispersion parameter (or overdispersion parameter).

A Poisson random variable can be viewed as the limit of a negative binomial random variable when the parameter θ is allowed to become infinite. So, in a sense θ is a measure of deviation from a Poisson distribution. Given that there are infinitely many other choices for θ, this is further evidence of the flexibility of the negative binomial distribution over the Poisson distribution.

The variance of the negative binomial distribution is a quadratic function of the mean:

Var(X) = μ + μ²/θ

Viewed as a function of μ, this is a parabola opening up that crosses the μ-axis at the origin and at the point μ = −θ. As θ → ∞, μ²/θ → 0, and we recover the variance of a Poisson random variable. θ controls the rate at which the parabola grows: for large θ the parabola is very flat, while for small θ it is narrow. Thus θ can be used to describe a whole range of heteroscedastic behavior.
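Both facts, the variance formula and the Poisson limit, can be checked numerically in R. The values of μ and θ below are arbitrary, chosen only for illustration.

```r
#numerical check of Var(X) = mu + mu^2/theta
mu <- 10; theta <- 2
set.seed(1)
x <- rnbinom(1e5, mu=mu, size=theta)
#empirical variance of the draws should be close to mu + mu^2/theta
c(empirical=var(x), theoretical=mu + mu^2/theta)
#as theta grows the NB probabilities approach Poisson probabilities
max(abs(dnbinom(0:50, mu=mu, size=1e6) - dpois(0:50, mu)))
```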

Fitting the negative binomial model

Regression models with a negative binomial distribution can be fit using the glm.nb function of the MASS package. The MASS package is part of the standard installation of R but it does need to be loaded into memory with the library function.

library(MASS)
model3 <- glm.nb(PREV_1~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals.sort)

The glm.nb function uses a log link by default, so we're fitting the following model.

PREV_1 ~ NB(μ, θ), where
log μ = β0 + β1 CORAL_COVE + β2 WSSTA + β3 WSSTA² + β4 CORAL_COVE × WSSTA

The anova function carries out likelihood ratio tests on glm.nb models without the need for a test argument.

anova(model3)
Analysis of Deviance Table

Model: Negative Binomial(0.3699), link: log

Response: PREV_1

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)   
NULL                               279     456.84             
CORAL_COVE        1  108.066       278     348.77 < 2.2e-16 ***
WSSTA             1    2.518       277     346.26    0.1126   
I(WSSTA^2)        1   64.895       276     281.36 7.899e-16 ***
CORAL_COVE:WSSTA  1    2.161       275     279.20    0.1415   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In anova.negbin(model3) : tests made without re-estimating 'theta'

The statistical tests lead us to a final model that is different from before. The interaction of WSSTA and coral cover is not statistically significant in the negative binomial model. The Wald test of the interaction in the output of the summary function agrees with the likelihood ratio test.

summary(model3)
Call:
glm.nb(formula = PREV_1 ~ CORAL_COVE * WSSTA + I(WSSTA^2), data = corals.sort,
    init.theta = 0.3699352907, link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-1.8883  -1.1822  -0.6006  -0.0013   3.7642 

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)      -0.436147   0.321700  -1.356   0.1752   
CORAL_COVE        0.039771   0.008451   4.706 2.53e-06 ***
WSSTA             0.372167   0.065724   5.663 1.49e-08 ***
I(WSSTA^2)       -0.022375   0.002847  -7.861 3.82e-15 ***
CORAL_COVE:WSSTA  0.001884   0.001139   1.654   0.0982 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.3699) family taken to be 1)

    Null deviance: 456.84  on 279  degrees of freedom
Residual deviance: 279.20  on 275  degrees of freedom
AIC: 1427.9

Number of Fisher Scoring iterations: 1

              Theta:  0.3699
          Std. Err.:  0.0379

 2 x log-likelihood:  -1415.8520

Comparing the coefficient estimates of the three models fit thus far, the Poisson and negative binomial fits return fairly similar estimates, but both differ from the normal model. The Poisson and negative binomial models use a log link, while the normal model uses an identity link.

out <- t(sapply(list(model1, model2, model3), coef))
rownames(out) <- c('Normal', 'Poisson', 'NB')
out
         (Intercept) CORAL_COVE     WSSTA  I(WSSTA^2) CORAL_COVE:WSSTA
Normal  -10.71376258 0.45081235 1.3200721 -0.16344627      0.075282215
Poisson  -0.03494657 0.03321844 0.3056099 -0.02703971      0.003803533
NB       -0.43614714 0.03977101 0.3721673 -0.02237514      0.001883727

According to AIC, the negative binomial model is the clear winner.

#compare AIC of models
AIC(model1,model2,model3)
       df      AIC
model1  6 2802.434
model2  5 5527.556
model3  6 1427.852

Checking the fit of the negative binomial model

The dispersion parameter is called theta and is returned as a component of the glm.nb model object.

#dispersion parameter is called theta
names(model3)
 [1] "coefficients"      "residuals"         "fitted.values"     "effects"         
 [5] "R"                 "rank"              "qr"                "family"          
 [9] "linear.predictors" "deviance"          "aic"               "null.deviance"   
[13] "iter"              "weights"           "prior.weights"     "df.residual"     
[17] "df.null"           "y"                 "converged"         "boundary"        
[21] "terms"             "call"              "model"             "theta"           
[25] "SE.theta"          "twologlik"         "xlevels"           "method"          
[29] "control"

Using it we can simulate data from our model using the rnbinom function.

#simulate data from negative binomial model
set.seed(12)
test.data3 <- sapply(fitted(model3), function(x) rnbinom(1, mu=x, size=model3$theta))

I graph the generated data and compare it to both the raw data and the simulated data from the Poisson model.

#graph raw data, Poisson data, and NB data
par(mfrow=c(3,1))
plot(PREV_1~CORAL_COVE, data=corals.sort, ylim=c(0,350), cex=.6)
plot(test.data2~corals.sort$CORAL_COVE, cex=.6, ylim=c(0,350))
plot(test.data3~corals.sort$CORAL_COVE, cex=.6, ylim=c(0,350))
par(mfrow=c(1,1))

fig. 6

Fig. 6  Display of actual data (top) and one realization of data generated from the estimated Poisson model (middle) and negative binomial model (bottom).

Notice that the data generated from the negative binomial model more closely resemble the actual data than do the Poisson data. In particular the negative binomial data show a good deal of variability even at low coral cover values, just as the actual data do.

Log-normal model

The log-transformation has long been recommended for analyzing count data particularly when the counts are highly skewed. The usual assumption that is made is that the log counts are normally distributed.

log(PREV_1) ~ Normal(μ, σ²), where
μ = β0 + β1 CORAL_COVE + β2 WSSTA + β3 WSSTA² + β4 CORAL_COVE × WSSTA

By definition, if a log-transformed response is normally distributed the original variable is said to have a lognormal distribution.
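As a quick check of this definition (the values of μ, σ, and q below are arbitrary), the lognormal CDF agrees with the normal CDF applied to the logged value.

```r
#P(Y <= q) computed two ways agrees when log(Y) ~ N(mu, sigma^2)
mu <- 1; sigma <- .5; q <- 3
c(plnorm(q, mu, sigma), pnorm(log(q), mu, sigma))
```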

If we try to fit a normal model to log-transformed prevalence we obtain an error message.

#because of zeros log is undefined
model4 <- lm(log(PREV_1)~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals.sort)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in foreign function call (arg 4)

The problem is that there are some zero values of the response in the data set and the logarithm of zero is undefined. The usual fix is to add a small constant, a number between 0 and 1, to each of the response values. I refit the model after adding 0.5 to each value of prevalence.

#add a small constant
model4 <- lm(log(PREV_1+.5)~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals.sort)
#AIC of this model is not comparable to previous models
AIC(model4)
[1] 964.7565

Recall that because the response variable has been transformed, the reported AIC is not comparable to the AIC of models fit to the raw response.

Choosing a value for the additive constant

The choice of additive constant in a lognormal model can affect the results. We can formulate this as a maximum likelihood estimation problem and choose the value for the constant that maximizes the likelihood. I write a function that takes two arguments: the value of the constant and the response variable. The first line of the function shown below fits the model for the specified value of the constant. As was explained in lecture 10, the exact count probabilities when using a density are calculated as the area under the curve over the interval (log(y + k – 0.5), log(y + k + 0.5)). When y = 0, the boundary value, the interval is (–∞, log(y + k + 0.5)). Because the log-transformed response is assumed to be normally distributed, these probabilities are easily calculated with the pnorm function.

#function to determine optimal constant to use
norm.loglike3 <- function(k, y) {
  #fit model for the given constant k
  model <- lm(log(PREV_1+k)~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals.sort)
  #MLE of sigma^2
  sigma2 <- sum(residuals(model)^2)/length(y)
  #probability at y = 0 is calculated differently
  prob <- ifelse(y==0,
    pnorm(log(y+k+.5), mean=predict(model), sd=sqrt(sigma2)),
    pnorm(log(y+k+.5), mean=predict(model), sd=sqrt(sigma2)) -
    pnorm(log(y+k-.5), mean=predict(model), sd=sqrt(sigma2)))
  #calculate log-likelihood
  loglike <- sum(log(prob))
  loglike
}

I test the function.

norm.loglike3(.5, corals.sort$PREV_1)
[1] -714.3271
norm.loglike3(.7, corals.sort$PREV_1)
[1] -718.6262
norm.loglike3(.4, corals.sort$PREV_1)
[1] -711.9158
Warning message:
In log(y + k - 0.5) : NaNs produced

The warning message occurs because of the way ifelse is evaluated. Although in the function we choose the value of prob depending on whether y==0 or not, both alternatives are evaluated each time. So when k < 0.5 the evaluation of log(y + k – 0.5) is undefined. The message is annoying but it doesn't affect the results. I calculate the log-likelihood for a range of values for k and plot the results.
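The behavior is easy to demonstrate with a small example (unrelated to the coral data): the rejected branch is still evaluated over the entire vector.

```r
#the rejected branch, sqrt of a negative number, is still evaluated
x <- c(-1, 4)
ifelse(x < 0, 0, sqrt(x))
#returns c(0, 2) but triggers the warning "NaNs produced"
```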

#evaluate function on a range of values
out.L <- sapply(seq(.1,1,.01), function(x) norm.loglike3(x, corals.sort$PREV_1))
There were 40 warnings (use warnings() to see them)
#plot log-likelihood
plot(seq(.1,1,.01), out.L, type='l', ylab='Log-likelihood', xlab='k')

From the graph (Fig. 7a) the maximum appears to be occurring near k = 0.1. I recalculate things and plot over a smaller range.

#evaluate function on a range of values
out.L <- sapply(seq(.01,.4,.01), function(x) norm.loglike3(x, corals.sort$PREV_1))
There were 40 warnings (use warnings() to see them)
#plot log-likelihood
plot(seq(.01,.4,.01), out.L, type='l', ylab='Log-likelihood', xlab='k')

(a) fig. 7a (b) fig. 7b
Fig. 7  Normal log-likelihood for different choices of the additive constant k in log(y + k).

We can use the which.max function to locate the value of k that yields the maximum. The which.max function returns the position in a vector that corresponds to the maximum value.

max(out.L)
[1] -706.0813
#which value gave rise to the max?
which.max(out.L)
[1] 13
seq(.01,.4,.01)[which.max(out.L)]
[1] 0.13

So we should use k = 0.13. I refit the model with this choice of constant.

#refit the model using this constant
model4 <- lm(log(PREV_1+.13)~CORAL_COVE*WSSTA+I(WSSTA^2), data=corals.sort)
anova(model4)
Analysis of Variance Table

Response: log(PREV_1 + 0.13)
                  Df Sum Sq Mean Sq F value    Pr(>F)   
CORAL_COVE         1 262.96 262.964 80.9313 < 2.2e-16 ***
WSSTA              1  14.83  14.826  4.5630   0.03355 * 
I(WSSTA^2)         1  97.32  97.317 29.9509 9.978e-08 ***
CORAL_COVE:WSSTA   1   0.15   0.152  0.0468   0.82881   
Residuals        275 893.54   3.249                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results are in agreement with the negative binomial model. The interaction between coral cover and WSSTA is not significant.

Comparing the four models

I compare the estimates of the regression parameters for all the models we've considered.

#compare coefficients of all the models we fit
out.coef <- t(sapply(list(model1, model2, model3, model4), coef))
rownames(out.coef) <- c('normal', 'Poisson', 'NB', 'lognormal')
out.coef
           (Intercept) CORAL_COVE     WSSTA  I(WSSTA^2) CORAL_COVE:WSSTA
normal    -10.71376258 0.45081235 1.3200721 -0.16344627     0.0752822152
Poisson    -0.03494657 0.03321844 0.3056099 -0.02703971     0.0038035330
NB         -0.43614714 0.03977101 0.3721673 -0.02237514     0.0018837266
lognormal  -1.92097350 0.05825755 0.2289075 -0.01146323    -0.0002418559

The magnitudes differ somewhat, but the signs of the estimates agree, except for the interaction term, which is not significant in the negative binomial and lognormal models.

I collect the log-likelihoods, AIC values, and number of estimated parameters of all four of the models in a single data frame. So that we have a general function for this purpose, I rewrite the log-likelihood function for a log transformation so that the fitted model is one of the arguments.

#log-likelihood for a log-transformed response and discrete data
norm.loglike4 <- function(model, k, y) {
  #MLE of sigma^2
  sigma2 <- sum(residuals(model)^2)/length(y)
  #probability at y = 0 is calculated differently: entire lower tail
  prob <- ifelse(y==0,
     pnorm(log(y+k+.5), mean=predict(model), sd=sqrt(sigma2)),
     pnorm(log(y+k+.5), mean=predict(model), sd=sqrt(sigma2)) -
     pnorm(log(y+k-.5), mean=predict(model), sd=sqrt(sigma2)))
  #calculate log-likelihood
  loglike <- sum(log(prob))
  loglike
}

It's worth noting that using the density version of the log-likelihood here gives a different answer. This is because the probability of the zero category is poorly approximated by the midpoint approximation. The density used here is the one obtained from the change of variables formula for integration (lecture 10).

#log-likelihood for a log-transformed response using the density
norm.loglike5 <- function(model, k, y) {
  #MLE of sigma^2
  sigma2 <- sum(residuals(model)^2)/length(y)
  #density of log(y+k) times the change of variables factor 1/(y+k)
  prob <- dnorm(log(y+k), mean=predict(model), sd=sqrt(sigma2)) * 1/(y+k)
  #calculate log-likelihood
  loglike <- sum(log(prob))
  loglike
}

# using midpoint approximation
norm.loglike5(model4, .13, corals.sort$PREV_1)
[1] -625.7064
# using exact probability
norm.loglike4(model4, .13, corals.sort$PREV_1)
[1] -706.0813
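The gap between these two log-likelihoods is driven almost entirely by the zero counts. A numerical sketch in Python (with arbitrary placeholder values of the mean, standard deviation, and k, not the fitted ones) shows the midpoint/density rule tracking the exact interval probability closely for positive counts but overshooting badly at y = 0, where the interval covers the entire lower tail:

```python
from statistics import NormalDist
from math import log

mu, sigma, k = 0.5, 1.5, 0.13   # illustrative values only
nd = NormalDist(mu, sigma)

def exact_prob(y):
    """P(y - 0.5 < Y <= y + 0.5) for the discretized model; full lower tail at y = 0."""
    hi = nd.cdf(log(y + k + 0.5))
    return hi if y == 0 else hi - nd.cdf(log(y + k - 0.5))

def midpoint_prob(y):
    """Density of log(y + k) times the Jacobian 1/(y + k), as in norm.loglike5."""
    return nd.pdf(log(y + k)) / (y + k)

for y in [0, 1, 5, 20]:
    print(y, exact_prob(y), midpoint_prob(y))
```

For y = 1, 5, 20 the two columns nearly coincide, but at y = 0 the density version substantially overstates the probability, which is why the two log-likelihoods disagree.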

For the lognormal model the estimated parameters consist of the five regression coefficients, the variance σ², and the additive constant k that we estimated separately, making a total of 7 parameters. The Poisson model has 5 estimated parameters, while the negative binomial and normal models each have 6. (The negative binomial model has the additional dispersion parameter θ, while the normal model has the variance σ².)

LL <- c(logLik(model1), logLik(model2), logLik(model3), norm.loglike4(model4, .13, corals.sort$PREV_1))
Warning message:
In log(y + k - 0.5) : NaNs produced
k <- c(AIC(model1, model2, model3)[,1], 7)
aic.model <- c(AIC(model1, model2, model3)[,2], -2*LL[4]+2*k[4])
my.results <- data.frame(LL, k, AIC=aic.model)
rownames(my.results) <- c('normal', 'Poisson', 'NB', 'lognormal')
my.results
                  LL k      AIC
normal    -1395.2168 6 2802.434
Poisson   -2758.7781 5 5527.556
NB         -707.9261 6 1427.852
lognormal  -706.0813 7 1426.163

So the lognormal model narrowly edges out the negative binomial model using AIC. The Poisson and normal models are not even in the race.
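As a check, the AIC column in the table can be reproduced by hand from the log-likelihoods and parameter counts using AIC = -2·LL + 2k; a quick sketch in Python:

```python
# log-likelihoods and parameter counts from the table above
loglik = {'normal': -1395.2168, 'Poisson': -2758.7781,
          'NB': -707.9261, 'lognormal': -706.0813}
npar   = {'normal': 6, 'Poisson': 5, 'NB': 6, 'lognormal': 7}

# AIC = -2 * logLik + 2 * (number of estimated parameters)
aic = {m: -2 * ll + 2 * npar[m] for m, ll in loglik.items()}
for m, a in aic.items():
    print(m, round(a, 3))
```

The lognormal model's extra parameter (k) costs it two AIC units relative to its log-likelihood advantage over the negative binomial, yet it still comes out ahead.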

Graphing the four models together

The final model contains two predictors, coral cover and WSSTA, so to fully visualize the model we would need three dimensions. Because WSSTA is the variable of interest while coral cover is just a control variable, we can choose a convenient value for coral cover and plot the model prediction as a function of WSSTA in two dimensions. The mean coral cover in the data set is 30.8%,

mean(corals.sort$CORAL_COVE)
[1] 30.81672

so a convenient value to use is 30%. I use the interaction model for all four probability distributions even though the interaction term was not significant in the negative binomial and lognormal models. I start by writing a function that returns the regression equation for a given model at a specified value of WSSTA and coral cover.

linfunc <- function(model, wssta, cover) {
  coef(model)[1] + coef(model)[2]*cover + coef(model)[3]*wssta +
  coef(model)[4]*wssta^2 + coef(model)[5]*wssta*cover
}

To draw the models I use the curve function. The curve function accepts a function of a single variable x as its first argument and draws the function over a range specified by the arguments from and to. Because curve is a high-level function it will erase the contents of the graphics window when it is used a second time. To prevent that I include the add=T argument to the subsequent calls to curve. The equations for the Poisson, negative binomial, and lognormal models are all exponentiated to place things on the scale of the raw response.

curve(linfunc(model1,x,30), from=0, to=30, ylim=c(0,25) ,xlab='WSSTA', ylab='Prevalence')
curve(exp(linfunc(model4,x,30)), add=T, lty=2)
curve(exp(linfunc(model2,x,30)), add=T, col=2)
curve(exp(linfunc(model3,x,30)), add=T, col=2, lty=2)
legend('topright', c('normal', 'lognormal', 'Poisson', 'negative binomial'), col=c(1,1,2,2), lty=c(1,2,1,2), cex=.9, bty='n')
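As a sanity check on the curves, the same calculation can be carried out by hand from the Poisson coefficients in the comparison table; sketched in Python, at WSSTA = 10 and 30% coral cover the Poisson model predicts a prevalence of roughly 11.6:

```python
from math import exp

# Poisson-model coefficients copied from the comparison table above
b0, b_cover, b_wssta, b_wssta2, b_int = (
    -0.03494657, 0.03321844, 0.3056099, -0.02703971, 0.0038035330)

def poisson_pred(wssta, cover):
    """Predicted prevalence: exponentiate the linear predictor (log link)."""
    eta = (b0 + b_cover * cover + b_wssta * wssta +
           b_wssta2 * wssta ** 2 + b_int * wssta * cover)
    return exp(eta)

print(poisson_pred(10, 30))
```

This value should correspond to the height of the Poisson curve at WSSTA = 10 in Fig. 8.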

Fig. 8  Plot of predicted regression means for the normal, Poisson, and negative binomial models. The curve for the lognormal model represents its median, not its mean.

The graph of the lognormal model looks unusual. That's because exponentiating the lognormal model returns not the mean but the median on the raw scale. (We'll discuss the reason for this next time.) The lognormal distribution is typically positively skewed, so the median and mean can be very different. The mean of the lognormal distribution can be expressed in terms of μ and σ², the mean and variance of the log-transformed variable.

E(Y) = exp(μ + σ²/2)
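This relationship can be verified numerically; here is a sketch in Python with arbitrary placeholder values of μ and σ, in which integrating e^x against the normal density recovers exp(μ + σ²/2), while exp(μ) gives only the median:

```python
from statistics import NormalDist
from math import exp

mu, sigma = 1.0, 0.8          # illustrative values only
nd = NormalDist(mu, sigma)

# E[exp(X)] for X ~ N(mu, sigma^2), by a midpoint Riemann sum over mu +/- 10*sigma
lo, hi, n = mu - 10 * sigma, mu + 10 * sigma, 40000
dx = (hi - lo) / n
numeric_mean = sum(exp(lo + (i + 0.5) * dx) * nd.pdf(lo + (i + 0.5) * dx)
                   for i in range(n)) * dx

closed_form = exp(mu + sigma ** 2 / 2)   # lognormal mean
median = exp(mu)                         # lognormal median

print(numeric_mean, closed_form, median)
```

The numerical integral matches the closed form to several decimal places, and the median falls well below the mean, reflecting the positive skew.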

I redraw Fig. 8 replacing the graph of the lognormal median with the graph of the lognormal mean.

curve(linfunc(model1,x,30), from=0, to=30, ylim=c(0,25), xlab='WSSTA', ylab='Prevalence')
curve(exp(linfunc(model4,x,30) + summary(model4)$sigma^2/2), add=T, lty=2)
curve(exp(linfunc(model2,x,30)), add=T, col=2)
curve(exp(linfunc(model3,x,30)), add=T, col=2, lty=2)
legend('topright', c('normal', 'lognormal', 'Poisson', 'negative binomial'), col=c(1,1,2,2), lty=c(1,2,1,2), cex=.9, bty='n')

Fig. 9  Plot of predicted regression means for the normal, lognormal, Poisson, and negative binomial models.



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--February 7, 2012
URL: https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/docs/lectures/lecture11.htm