Lecture 40 (lab 13)—Friday, April 20, 2012

Topics

R functions and commands demonstrated

R function options

R packages used

Data sets

Nominal multinomial data

The data set alligators.csv comes from Agresti (2002) but it is used in many other textbooks as an example of multinomial data. It's from a study by the Florida Game and Fresh Water Fish Commission of the types of food that alligators choose to eat in the wild. A sample of 219 alligators was obtained from four Florida lakes. For each alligator the primary food type (by volume) was determined from the animal's stomach contents. Food type categories used were birds, fish, invertebrates, reptiles, and other. The invertebrates were primarily apple snails, aquatic insects, and crayfish. The reptiles were primarily turtles although one stomach contained the tags of 23 baby alligators that had been released in the lake the previous year. The "other" category included amphibians, mammals, plant material, stones, and other debris. In addition to lake, the gender of the animal was noted and its size was recorded as either less than or equal to 2.3 meters or greater than 2.3 meters. The purpose of categorizing size in this way was to distinguish subadults from adults.

alligators <- read.csv("ecol 562/alligators.csv")
dim(alligators)
[1] 219   4
alligators[1:10,]
     food size gender    lake
1    fish <2.3      m hancock
2    fish <2.3      m hancock
3    fish <2.3      m hancock
4    fish <2.3      m hancock
5    fish <2.3      m hancock
6    fish <2.3      m hancock
7    fish <2.3      m hancock
8  invert <2.3      m hancock
9   other <2.3      m hancock
10  other <2.3      m hancock

Each line of the data file is the record of a single alligator listing the predominant food type in its stomach as well as its size, gender, and the lake where it was captured. This is the multinomial equivalent of binary data. Without additional information an appropriate probability model for these data could be either multinomial or Poisson.

As we saw in lecture 38, the distinction between Poisson and multinomial doesn't really matter here because, by conditioning on the sample size, a set of independent Poisson counts becomes a multinomial observation.
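
For reference, the conditioning argument can be stated compactly (a standard identity, added here for completeness):

$$\text{If } Y_1, \dots, Y_k \text{ are independent with } Y_i \sim \text{Poisson}(\mu_i), \text{ then } (Y_1, \dots, Y_k) \,\Big|\, \sum_i Y_i = n \;\sim\; \text{Multinomial}\!\left(n, \; \pi_i = \tfrac{\mu_i}{\sum_j \mu_j}\right).$$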

I cross-tabulate the observations by food type, lake, size, and gender to see the distribution of observations across the categories.

table(alligators$food, alligators$lake, alligators$size, alligators$gender)
, ,  = <2.3,  = f

         george hancock oklawaha trafford
  bird        0       2        0        1
  fish        3      16        3        2
  invert      9       3        9        4
  other       1       3        2        4
  rep         1       2        1        1

, ,  = >2.3,  = f

         george hancock oklawaha trafford
  bird        0       2        1        0
  fish        8       3        0        0
  invert      1       0        1        1
  other       1       3        0        0
  rep         0       1        0        0

, ,  = <2.3,  = m

         george hancock oklawaha trafford
  bird        2       0        0        0
  fish       13       7        2        3
  invert     10       1        2        7
  other       2       5        1        1
  rep         0       0        0        1

, ,  = >2.3,  = m

         george hancock oklawaha trafford
  bird        1       1        0        3
  fish        9       4       13        8
  invert      0       0        7        6
  other       2       2        0        5
  rep         0       0        6        6

We see that the data are rather sparse. Many of the possible combinations of food type, size, gender, and lake were not observed.

The baseline category multinomial logit model

The multinom function in the nnet package can be used to fit baseline category multinomial logit models. The multinomial variable is entered as the response and predictors are included using the usual formula notation of R. I start by fitting a null model, a multinomial model with no predictors. I then follow it up with separate models that include each of the three predictors lake, size, and gender individually.

library(nnet)
fit0 <- multinom(food~1, data=alligators)
fit1 <- multinom(food~gender, data=alligators)
fit2 <- multinom(food~lake, data=alligators)
fit3 <- multinom(food~size, data=alligators)

The anova function applied to two nested multinom models carries out a likelihood ratio test.

anova(fit0, fit1)
Likelihood ratio tests of Multinomial Models

Response: food
   Model Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
1      1       872   604.3629                               
2 gender       868   602.2589 1 vs 2     4 2.104069 0.7166248

From the output above we see that the effect of gender is not statistically significant. Nested and non-nested models can be compared using AIC.

AIC(fit0, fit1, fit2, fit3)
     df      AIC
fit0  4 612.3629
fit1  8 618.2589
fit2 16 593.1677
fit3  8 605.2134

The lake-only model ranks best followed by the size model. Including both lake and size in the model leads to a further improvement in the AIC.

fit4 <- multinom(food~lake+size, data=alligators)
# weights:  30 (20 variable)
initial  value 352.466903
iter  10 value 272.246275
iter  20 value 270.046891
final  value 270.040139
converged
AIC(fit4)
[1] 580.0803

Furthermore each term is statistically significant in the presence of the other.

anova(fit2, fit4)
Likelihood ratio tests of Multinomial Models

Response: food
        Model Resid. df Resid. Dev   Test    Df LR stat.      Pr(Chi)
1        lake       860   561.1677                                  
2 lake + size       856   540.0803 1 vs 2     4 21.08741 0.0003042795

anova(fit3, fit4)
Likelihood ratio tests of Multinomial Models

Response: food
        Model Resid. df Resid. Dev   Test    Df LR stat.      Pr(Chi)
1        size       868   589.2134                                  
2 lake + size       856   540.0803 1 vs 2    12 49.13308 1.982524e-06

If we try to add gender to this model we find that it's not needed.

fit5 <- multinom(food~lake+size+gender, data=alligators)
# weights:  35 (24 variable)
initial  value 352.466903
iter  10 value 270.228588
iter  20 value 268.944257
final  value 268.932741
converged
anova(fit4, fit5)
Likelihood ratio tests of Multinomial Models

Response: food
                 Model Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
1          lake + size       856   540.0803                               
2 lake + size + gender       852   537.8655 1 vs 2     4 2.214796 0.6963214

Interpreting the output from a multinomial model

The summary function applied to the model yields the coefficient estimates and their standard errors.

summary(fit4)
Call:
multinom(formula = food ~ lake + size, data = alligators)

Coefficients:
       (Intercept) lakehancock lakeoklawaha laketrafford   size>2.3
fish      2.723723  -0.6950850    0.6531363  -1.08768971 -0.6306971
invert    2.632914  -2.3534487    1.5903438   0.03428501 -2.0888952
other     1.150998   0.1311222    0.6587950   0.42869526 -0.9622560
rep      -0.942082   0.5476742    3.1120046   1.84756421 -0.2794163

Std. Errors:
       (Intercept) lakehancock lakeoklawaha laketrafford  size>2.3
fish     0.7103906   0.7812574     1.202038    0.8416676 0.6424797
invert   0.7266419   0.9337072     1.227725    0.8631499 0.6939064
other    0.8059565   0.8919633     1.368529    0.9382990 0.7127082
rep      1.2411890   1.3715034     1.585983    1.3173183 0.8062283

Residual Deviance: 540.0803
AIC: 580.0803

Because the multinomial response has five categories, four sets of coefficient estimates are reported. The category "bird", being first alphabetically, was chosen as the reference group for food.

levels(alligators$food)
[1] "bird"   "fish"   "invert" "other"  "rep"  
levels(alligators$lake)
[1] "george"   "hancock"  "oklawaha" "trafford"

The four rows of the coefficient table correspond to separate models for the log odds (fish vs bird), the log odds (invert vs bird), the log odds (other vs bird), and the log odds (reptile vs bird). The reference group for lake is Lake George and the reference group for size is subadult. In the log odds (fish vs bird) equation we see that the log odds for subadult alligators (size < 2.3) in Lake George is 2.72 (the value of the intercept). This log odds decreases in Lake Hancock and Lake Trafford, but increases in Lake Oklawaha. Because the model is additive in lake and size, the log odds of fish vs bird is decreased by 0.63 for adults relative to subadults in each of the four lakes.
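
As a check on this reading of the coefficient table, the short sketch below (my own illustration using the coef and predict methods for multinom objects; the particular lake and size are arbitrary choices) reconstructs the fitted log odds of fish versus bird for an adult alligator in Lake Hancock and compares it to the log ratio of the predicted category probabilities.

# log odds of fish vs bird for an adult (size > 2.3) alligator in Lake Hancock
b <- coef(fit4)
b["fish", "(Intercept)"] + b["fish", "lakehancock"] + b["fish", "size>2.3"]
# the same quantity obtained from the predicted category probabilities
p <- predict(fit4, newdata=data.frame(lake="hancock", size=">2.3"), type="probs")
log(p["fish"]/p["bird"])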

To obtain a formal statistical test of the lake and size effects in each of the food log odds models we can divide the coefficients by their standard errors to obtain Wald statistics.

names(summary(fit4))
 [1] "n"               "nunits"          "nconn"           "conn"          
 [5] "nsunits"         "decay"           "entropy"         "softmax"       
 [9] "censored"        "value"           "wts"             "convergence"   
[13] "fitted.values"   "residuals"       "lev"             "call"          
[17] "terms"           "weights"         "deviance"        "rank"          
[21] "lab"             "coefnames"       "vcoefnames"      "contrasts"     
[25] "xlevels"         "edf"             "AIC"             "is.binomial"   
[29] "digits"          "coefficients"    "standard.errors"
summary(fit4)$coefficients/ summary(fit4)$standard.errors
       (Intercept) lakehancock lakeoklawaha laketrafford   size>2.3
fish     3.8341209  -0.8897003    0.5433576  -1.29230324 -0.9816608
invert   3.6234001  -2.5205424    1.2953584   0.03972081 -3.0103415
other    1.4281144   0.1470040    0.4813891   0.45688555 -1.3501402
rep     -0.7590158   0.3993240    1.9621934   1.40251919 -0.3465721

Wald statistics larger than about two in magnitude correspond to coefficients that are statistically significant at roughly the 5% level. Alternatively we can calculate a two-tailed p-value ourselves.

2*(1-pnorm(abs(summary(fit4)$coefficients/ summary(fit4)$standard.errors)))
        (Intercept) lakehancock lakeoklawaha laketrafford    size>2.3
fish   0.0001260141  0.37362683   0.58688360    0.1962521 0.326266988
invert 0.0002907556  0.01171741   0.19519661    0.9683157 0.002609541
other  0.1532589264  0.88312889   0.63023995    0.6477533 0.176971028
rep    0.4478431203  0.68965450   0.04973997    0.1607603 0.728912781

From the output we determine that the log odds of fish or invertebrates relative to birds is statistically significant (greater than zero) in Lake George. The other two log odds are not significantly different from zero in Lake George. The log odds of invertebrates to birds is significantly lower in Lake Hancock than it is in Lake George, while the log odds of reptiles relative to birds is significantly higher in Lake Oklawaha than in Lake George. The only significant size effect is in the log odds of invertebrates to birds, with the odds being higher for subadults than for adults. The statements about log odds ratios can be made more interpretable by exponentiating the log odds coefficients to obtain odds ratios.

exp(-summary(fit4)$coefficients[2,2])
[1] 10.52179
exp(summary(fit4)$coefficients[4,3])
[1] 22.46604
exp(-summary(fit4)$coefficients[2,5])
[1] 8.075988

Thus we see that the odds of invertebrates to birds being the predominant food type is ten times higher in Lake George than in Lake Hancock, the odds of reptiles to birds being the predominant food type is 22 times higher in Lake Oklawaha than in Lake George, and the odds of invertebrates to birds being the predominant food type is 8 times higher in subadults than in adults.
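
If desired, Wald confidence intervals for these odds ratios can be obtained by exponentiating the corresponding intervals for the log odds ratios. Here is a brief sketch (my own addition, using a normal approximation with a 1.96 multiplier) for the first of these, the invertebrate versus bird odds in Lake George relative to Lake Hancock.

# 95% Wald interval for the invert vs bird odds ratio, Lake George relative to Lake Hancock
est <- summary(fit4)$coefficients[2,2]
se <- summary(fit4)$standard.errors[2,2]
exp(-(est + c(1.96, -1.96)*se))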

Fitting multinomial models using the weights argument

In the alligator data set each observation is a single realization of a multinomial random variable. We can also fit models to grouped multinomial data, which is the way multinomial data are commonly recorded. To reorganize the current data set in this way, I make a table of food by lake by size and by gender and reorganize it as a data frame, adding the appropriate variable names to the columns.

alli.dat <- data.frame(table(alligators$food, alligators$lake, alligators$size, alligators$gender))
dim(alli.dat)
[1] 80  5
alli.dat[1:8,]
    Var1    Var2 Var3 Var4 Freq
1   bird  george <2.3    f    0
2   fish  george <2.3    f    3
3 invert  george <2.3    f    9
4  other  george <2.3    f    1
5    rep  george <2.3    f    1
6   bird hancock <2.3    f    2
7   fish hancock <2.3    f   16
8 invert hancock <2.3    f    3
names(alli.dat)[1:4] <- c('food', 'lake', 'size', 'gender')
alli.dat[1:8,]
    food    lake size gender Freq
1   bird  george <2.3      f    0
2   fish  george <2.3      f    3
3 invert  george <2.3      f    9
4  other  george <2.3      f    1
5    rep  george <2.3      f    1
6   bird hancock <2.3      f    2
7   fish hancock <2.3      f   16
8 invert hancock <2.3      f    3

To fit a multinomial model to data organized in this fashion we have to include the variable that records the counts in the weights argument of multinom. The AIC and log-likelihood we obtain match those obtained with the ungrouped data set.

fit4a <- multinom(food~lake+size, data=alli.dat, weights=Freq)
# weights:  30 (20 variable)
initial  value 352.466903
iter  10 value 272.246275
iter  20 value 270.046891
final  value 270.040139
converged
AIC(fit4a)
[1] 580.0803
logLik(fit4a)
'log Lik.' -270.0401 (df=20)

Fitting the baseline logit model using the VGAM package

The vglm function of the VGAM package (which we'll use below for ordinal regression models) can also be used to fit the baseline category multinomial logit model. For this we specify multinomial in the family argument, along with the level to use as the reference. To fit model fit4 using VGAM we would proceed as follows.

library(VGAM)
fit4.vglm <- vglm(food~size+lake, family=multinomial(refLevel=1), data=alligators)
coef(fit4.vglm, matrix=T)
             log(mu[,2]/mu[,1]) log(mu[,3]/mu[,1]) log(mu[,4]/mu[,1])
(Intercept)           2.7237365         2.63292251          1.1510151
size>2.3             -0.6306597        -2.08886434         -0.9622100
lakehancock          -0.6951176        -2.35347615          0.1310787
lakeoklawaha          0.6532062         1.59042552          0.6588593
laketrafford         -1.0877668         0.03421807          0.4286020
             log(mu[,5]/mu[,1])
(Intercept)          -0.9420566
size>2.3             -0.2793969
lakehancock           0.5476566
lakeoklawaha          3.1120757
laketrafford          1.8474841

The results are virtually identical to what we obtained with the multinom function.

Fitting multinomial models as Poisson models

As was explained in lecture 38, multinomial models can be fit as Poisson models by fixing the margin totals. In formulating the model this means including a large number of auxiliary variables. For instance, the null multinomial model fit0 could be fit with a weights argument as follows.

fit0a <- multinom(food ~ 1, data=alli.dat, weights=Freq)

Here is how you would fit the same model using a Poisson distribution.

glm0 <- glm(Freq~lake*size*gender + food, data=alli.dat, family=poisson)

The count variable Freq is the response of the Poisson model; the original predictors and the original multinomial response food are all entered as main effects. In addition all of the interactions among the original predictors are included, so the formula expands to the following.

Freq ~ lake + size + gender + lake:size + lake:gender + size:gender + lake:size:gender + food

The logic behind this is as follows. There are 80 rows in the weighted version of the data set, one row per food type for each combination of lake, size, and gender. If we divide by the number of food types (5) we get 16, the total number of multinomial observations. We need to constrain these 16 totals in order to get a multinomial distribution. Because these 16 observations are uniquely defined by their combined levels of gender, size, and lake, we can constrain the 16 marginal totals by including gender*size*lake in the model.

nrow(alli.dat)
[1] 80
nrow(alli.dat) / length(unique(alli.dat$food))
[1] 16
length(unique(alli.dat$lake)) * length(unique(alli.dat$size)) * length(unique(alli.dat$gender))
[1] 16

We also need to constrain the marginal totals for the five food categories, so we also include food in the model. Thus lake*size*gender + food must be included in every Poisson model we fit in order to match the results from the multinomial regression model.

Whatever variable or combination of variables uniquely identifies the multinomial totals in a data set is what we need to include in the Poisson model. If there had been an ID variable numbered 1 through 16 that identified the 16 multinomial observations, we could have used that as the predictor instead of gender*size*lake. The models we would get are the same, just parameterized differently.

# create ID variable that identifies multinomial observations
alli.dat$ID <- rep(1:16, rep(5,16))
alli.dat[1:10,]
     food    lake size gender Freq ID
1    bird  george <2.3      f    0  1
2    fish  george <2.3      f    3  1
3  invert  george <2.3      f    9  1
4   other  george <2.3      f    1  1
5     rep  george <2.3      f    1  1
6    bird hancock <2.3      f    2  2
7    fish hancock <2.3      f   16  2
8  invert hancock <2.3      f    3  2
9   other hancock <2.3      f    3  2
10    rep hancock <2.3      f    2  2
#fit model using factor(ID) rather than lake*size*gender
glm0a <- glm(Freq~factor(ID) + food, data=alli.dat, family=poisson)
AIC(glm0, glm0a)
      df      AIC
glm0  20 320.2464
glm0a 20 320.2464

To test the effect of a predictor on food type in the Poisson model we have to add it to the null model as an interaction with food.

#including the effect of gender on food type
glm1 <- glm(Freq~lake*size*gender + food + food:gender, data=alli.dat, family=poisson)

If we compare the null Poisson model with the gender Poisson model using a likelihood ratio test we get the same test statistic that we obtained with the multinomial models.

anova(glm0, glm1, test='Chisq')
Analysis of Deviance Table

Model 1: Freq ~ lake * size * gender + food
Model 2: Freq ~ lake * size * gender + food + food:gender
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        60     116.76                    
2        56     114.66  4   2.1041   0.7166
anova(fit1, fit0)
Likelihood ratio tests of Multinomial Models

Response: food
   Model Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
1      1       872   604.3629                               
2 gender       868   602.2589 1 vs 2     4 2.104069 0.7166248

Our final multinomial model was one that included lake and size as predictors. To fit this as a Poisson model we need to add food:size and food:lake to the null model.

glm4 <- glm(Freq~lake*size*gender + food + food:lake + food:size, data=alli.dat, family=poisson)
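
As a quick check that the Poisson formulation reproduces the multinomial analysis, we can drop food:size from this model and compare the two fits; the deviance difference should match the likelihood ratio statistic from anova(fit2, fit4) above. (This is a sketch; the name glm2 is just a convenient label and is not used elsewhere in these notes.)

# Poisson analogue of the lake-only multinomial model: drop food:size
glm2 <- glm(Freq~lake*size*gender + food + food:lake, data=alli.dat, family=poisson)
anova(glm2, glm4, test='Chisq')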

Goodness of fit for multinomial models

When multinomial data come grouped it is possible to test the fit of the final model by comparing it to a model in which each observation is assigned its own parameter. This is called the saturated model. We have artificially grouped the multinomial data by cross-classifying observations by lake, gender, and size. This yields a distribution of food types for each combination of lake, gender, and size. As was explained above, when the data are organized in this way we have 80 rows of data but only 16 different multinomial observations. Because there are four lakes, two genders, and two sizes, the saturated model requires 4 × 2 × 2 = 16 parameters per logit for a total of 64 parameters. We enter lake*size*gender as the predictor because the 16 multinomial observations are uniquely identified by the 16 different combinations of lake, size, and gender. Again, any variable that uniquely identifies these 16 multinomial observations could be used instead to fit the saturated model.

fit.S <- multinom(food~lake*size*gender, data=alli.dat, weights=Freq)
# weights:  85 (64 variable)
initial  value 352.466903
iter  10 value 264.578561
iter  20 value 249.742554
iter  30 value 244.525050
iter  40 value 243.820680
iter  50 value 243.801639
iter  60 value 243.800900
iter  60 value 243.800899
iter  60 value 243.800899
final  value 243.800899
converged
length(coef(fit.S))
[1] 64

Alternatively we could have used the ID variable created above to fit a separate model to each multinomial observation.

fit.S1 <- multinom(food~factor(ID), data=alli.dat, weights=Freq)
# weights:  85 (64 variable)
initial  value 352.466903
iter  10 value 252.268864
iter  20 value 245.629097
iter  30 value 244.083777
iter  40 value 243.833747
iter  50 value 243.801023
final  value 243.800898
converged

In the Poisson version the marginal constraints are enforced by entering lake*size*gender + food. To estimate the same effects shown in the saturated multinomial model fit.S we need to interact food with all the individual terms shown there: lake*size*gender. So we would have to write the following.

Freq ~ lake*size*gender + food + food:lake + food:size + food:gender + food:lake:size + food:lake:gender + food:size:gender + food:lake:size:gender

We can obtain all these terms with the following simpler expression.

Freq ~ lake*size*gender*food

The total number of estimated parameters is 80, one for each observed count. With the Poisson version we can check to see that we've done things correctly because the residual deviance of the saturated model will be zero and the number of estimated coefficients is equal to the number of observations.

glm.S <- glm(Freq~lake*size*gender*food, data=alli.dat, family=poisson)
# residual deviance is approximately zero
deviance(glm.S)
[1] 4.925747e-10
# one estimated parameter per observation
length(coef(glm.S))
[1] 80

Alternatively we could have used the ID variable created above to identify the multinomial observations and fit the saturated model as follows.

glm.S1 <- glm(Freq~factor(ID) + food + factor(ID):food, data=alli.dat, family=poisson)
AIC(glm.S,glm.S1)
       df      AIC
glm.S  80 323.4853
glm.S1 80 323.4853

A formal goodness of fit test uses a likelihood ratio test to compare the model of interest against the saturated model. If the p-value is small then we have evidence of a lack of fit.

fit.4 <- multinom(food~lake+size, data=alli.dat, weights=Freq)
# weights:  30 (20 variable)
initial  value 352.466903
iter  10 value 272.246275
iter  20 value 270.046891
final  value 270.040139
converged
anova(fit.4, fit.S)
Likelihood ratio tests of Multinomial Models

Response: food
                 Model Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
1          lake + size       300   540.0803                               
2 lake * size * gender       256   487.6018 1 vs 2    44 52.47848 0.1783716
anova(glm4, glm.S, test='Chisq')
Analysis of Deviance Table

Model 1: Freq ~ lake * size * gender + food + food:lake + food:size
Model 2: Freq ~ lake * size * gender * food
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        44     52.478                    
2         0      0.000 44   52.478   0.1784

From the output it would appear that we don't have evidence of lack of fit.

This goodness of fit test based on residual deviance is identical to the G-test of Sokal and Rohlf (2012), which in turn is asymptotically equivalent to the Pearson chi-squared goodness of fit test. If $O_i$ is the observed count in cell $i$ and $E_i$ is the expected count predicted by the model, then the G-test and Pearson test statistics are the following.

G-test:

$$G = 2\sum_i O_i \log\!\left(\frac{O_i}{E_i}\right)$$

Pearson test:

$$X^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Both test statistics have an asymptotic chi-squared distribution with degrees of freedom equal to the number of cells minus the number of estimated parameters (here 80 - 36 = 44 in the Poisson formulation). Both of these tests have restrictions on how small the expected counts can be in order for the chi-squared distribution to still hold. A seat-of-the-pants rule is that no more than 20% of the expected cell counts should be less than 5. We can readily check this by obtaining the predicted counts from the Poisson model. (I exponentiate the predictions because the Poisson model uses a log link. I could have used the fitted function instead, which inverts the link function after obtaining the predictions.)

sum(exp(predict(glm4))<5)
[1] 60
sum(exp(predict(glm4))<5)/length(predict(glm4))
[1] 0.75

So we see that 75% of the expected counts are less than 5. The chi-squared distribution of the goodness of fit test is questionable here, so we should not trust the conclusion of the test that the fit is adequate.
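
For comparison, the Pearson statistic described above can be computed directly from the observed counts and the fitted counts of the Poisson model. This is just a sketch for reference; it is subject to the same concerns about small expected counts as the G-test.

# Pearson chi-squared goodness of fit statistic from the Poisson fit
O <- alli.dat$Freq
E <- fitted(glm4)
X2 <- sum((O-E)^2/E)
X2
pchisq(X2, df=df.residual(glm4), lower.tail=FALSE)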

A simulation-based goodness of fit test

When the theoretical distribution of a test statistic is suspect we can turn to Monte Carlo methods. The idea behind a Monte Carlo test is simple. Since we don't know what the theoretical distribution of our test statistic should be, we generate an empirical approximation to the distribution instead. For the current problem that means using our fitted multinomial model to generate new data. Treating the generated data as a new set of observations we carry out the G-test to obtain the G statistic for the simulated data. We then do this repeatedly to obtain a distribution of G-statistics for data that we know are consistent with our model (because the model generated them). We then check to see if the G-statistic obtained using the actual data looks like a typical member of this distribution. If it looks anomalous then we have evidence for lack of fit.

I start by using the model to obtain the predicted probability distributions for each of the 16 multinomial observations. To do this I create a new data frame that contains the values of the predictors for each multinomial observation along with the number of observations in the data set that have this predictor combination.

mydat <- data.frame(table(alligators$lake, alligators$size, alligators$gender))
names(mydat)[1:3] <- c('lake', 'size', 'gender')
mydat
       lake size gender Freq
1    george <2.3      f   14
2   hancock <2.3      f   26
3  oklawaha <2.3      f   15
4  trafford <2.3      f   12
5    george >2.3      f   10
6   hancock >2.3      f    9
7  oklawaha >2.3      f    2
8  trafford >2.3      f    1
9    george <2.3      m   27
10  hancock <2.3      m   13
11 oklawaha <2.3      m    5
12 trafford <2.3      m   12
13   george >2.3      m   12
14  hancock >2.3      m    7
15 oklawaha >2.3      m   26
16 trafford >2.3      m   28

Using the first three columns of this data frame as the value of the newdata argument of predict, I use the type='probs' setting to obtain the probabilities of each of the food type categories.

out.p <- predict(fit4, type="probs", newdata=mydat[,1:3])
out.p
          bird      fish     invert      other        rep
1  0.029671502 0.4521032 0.41285699 0.09380190 0.01156641
2  0.070400215 0.5353040 0.09309885 0.25374163 0.04745531
3  0.008818267 0.2581872 0.60189518 0.05387241 0.07722691
4  0.035892547 0.1842997 0.51683770 0.17420330 0.08876673
5  0.081071082 0.6574394 0.13967877 0.09791193 0.02389880
6  0.140898571 0.5701968 0.02307179 0.19400899 0.07182382
7  0.029419560 0.4584368 0.24864408 0.06866206 0.19483754
8  0.108222209 0.2957526 0.19296148 0.20066241 0.20240133
9  0.029671502 0.4521032 0.41285699 0.09380190 0.01156641
10 0.070400215 0.5353040 0.09309885 0.25374163 0.04745531
11 0.008818267 0.2581872 0.60189518 0.05387241 0.07722691
12 0.035892547 0.1842997 0.51683770 0.17420330 0.08876673
13 0.081071082 0.6574394 0.13967877 0.09791193 0.02389880
14 0.140898571 0.5701968 0.02307179 0.19400899 0.07182382
15 0.029419560 0.4584368 0.24864408 0.06866206 0.19483754
16 0.108222209 0.2957526 0.19296148 0.20066241 0.20240133

Next I obtain the expected counts. For this I need to multiply the probabilities in each row of out.p by the number of observations that were observed to have that predictor combination, mydat$Freq. To accomplish this I multiply the matrix by the vector. Because the length of the vector equals the number of rows of the matrix, R recycles the vector down the columns, so each row of out.p is multiplied by the corresponding entry of mydat$Freq (effectively a Hadamard, i.e., element-wise, product of out.p with a matrix whose columns all equal mydat$Freq).

out.freq <- out.p*mydat$Freq
out.freq
         bird       fish     invert     other       rep
1  0.41540103  6.3294448  5.7799978 1.3132266 0.1619297
2  1.83040560 13.9179036  2.4205701 6.5972825 1.2338381
3  0.13227400  3.8728086  9.0284277 0.8080861 1.1584036
4  0.43071057  2.2115967  6.2020524 2.0904396 1.0652007
5  0.81071082  6.5743942  1.3967877 0.9791193 0.2389880
6  1.26808714  5.1317715  0.2076461 1.7460809 0.6464144
7  0.05883912  0.9168735  0.4972882 0.1373241 0.3896751
8  0.10822221  0.2957526  0.1929615 0.2006624 0.2024013
9  0.80113055 12.2067864 11.1471387 2.5326514 0.3122930
10 0.91520280  6.9589518  1.2102851 3.2986412 0.6169191
11 0.04409133  1.2909362  3.0094759 0.2693620 0.3861345
12 0.43071057  2.2115967  6.2020524 2.0904396 1.0652007
13 0.97285298  7.8892731  1.6761452 1.1749432 0.2867855
14 0.98628999  3.9913778  0.1615025 1.3580629 0.5027667
15 0.76490855 11.9193557  6.4647461 1.7852136 5.0657761
16 3.03022184  8.2810720  5.4029214 5.6185474 5.6672374

I verify that this was done correctly by summing across each row to see if I get the correct total.

apply(out.freq, 1, sum)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
14 26 15 12 10  9  2  1 27 13  5 12 12  7 26 28
mydat$Freq
 [1] 14 26 15 12 10  9  2  1 27 13  5 12 12  7 26 28

Next I organize the observed counts in a matrix so that they are arranged in the same manner as the expected counts. For this I use the Freq variable, column 5 of the alli.dat data frame created earlier. Because food varies fastest in alli.dat, the matrix needs to be filled one row (of five food counts) at a time. This can be accomplished with the byrow=T argument of matrix.

out.obs <- matrix(alli.dat[,5], ncol=5, byrow=T)
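
In both alli.dat and mydat the 16 predictor combinations are ordered with lake varying fastest, then size, then gender, so the rows of the two objects line up. A quick sanity check (my own addition) is to confirm that the row totals of the observed matrix equal the multinomial sample sizes.

# row totals of the observed counts should equal the multinomial sample sizes
all(rowSums(out.obs) == mydat$Freq)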

Next I write a function to calculate the G-statistic. We need to treat the occurrences of $O_i = 0$ as special cases because the log of zero is undefined. Using the fact that $\lim_{x \to 0^+} x \log x = 0$, we see that these observations contribute zero to the lack of fit. Thus we can carry out the calculation using the ifelse function as follows.

gtest.func <- function(O,E) {
result <- ifelse(O==0, 0, 2*O*log(O/E))
sum(result)
}

I verify that the function returns the actual value of the test statistic when it is given the actual data.

actual <- gtest.func(out.obs, out.freq)
actual
[1] 52.47849
anova(fit.4,fit.S)
Likelihood ratio tests of Multinomial Models

Response: food
                 Model Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
1          lake + size       300   540.0803                               
2 lake * size * gender       256   487.6018 1 vs 2    44 52.47848 0.1783716

The results agree so we're ready to do the simulation. To generate new multinomial data I use the rmultinom function. It takes three arguments:

  1. the number of observations to generate,
  2. size, the value of n for the multinomial distribution, and
  3. prob, a vector of probabilities.

We need to carry out this calculation on the matrix of predicted probabilities one row at a time using the corresponding value of mydat$Freq for n. Written as a function this takes the following form.

function(x) rmultinom(1, size=mydat$Freq[x], prob=out.p[x,])

To apply this separately to each of the sixteen rows of out.p we can use sapply.

sapply(1:16,function(x) rmultinom(1, size=mydat$Freq[x], prob=out.p[x,]))

The sapply function stores the results in columns whereas we need them stored as rows, so we need to transpose the result.

t(sapply(1:16, function(x) rmultinom(1, size=mydat$Freq[x], prob=out.p[x,])))

Finally we need to use the simulated counts to replace the $O_i$ term in the gtest.func function in order to calculate the G-statistic for the simulated data. Here's the complete version of the final function. The variable y is just a placeholder and doesn't actually get used anywhere in the function.

my.simulation <- function(y) {
out.sim <- t(sapply(1:16, function(x) rmultinom(1, size=mydat$Freq[x], prob=out.p[x,])))
gtest.func(out.sim, out.freq)
}

I run the simulation 9999 times and store the results. I first set the random seed so I can recreate the results later if desired.

set.seed(10)
sim.results <- sapply(1:9999, my.simulation)

I append the actual G-statistic calculated using the observed data to the simulation results and calculate the fraction of the simulated results that are equal to or exceed this actual value. This is the simulation-based p-value of the test.

sum(c(sim.results, actual) >= actual)/10000
[1] 0.9123

Because the p-value is large we fail to find evidence for lack of fit. Fig. 1 shows the distribution of the simulated G-statistics along with location of the observed value of the test statistic in this distribution.

hist(c(sim.results, actual), xlab="G-statistic", main="Monte Carlo results")
points(actual, 0, col=2, pch=8)


Fig. 1  Distribution of simulated G-statistics (* denotes actual value)

Quasi-Poisson model

If a multinomial model exhibits a significant lack of fit, it means that one or more of the assumptions of the multinomial model have been violated. These are the same assumptions that are required for the binomial model.

  1. Among the trials that comprise a multinomial observation, the category probabilities are constant. The probabilities should not change from trial to trial.
  2. The trials making up a multinomial observation are independent.
  3. If two multinomial observations have the same values of the predictors then their category probabilities should be the same.

A violation of the third assumption can be partially addressed by including additional predictors that differentiate some of those observations or by including interactions of existing predictors. But beyond making the means model more complicated, dealing with lack of fit problems in a multinomial model is hard because software implementations of the standard corrective measures are limited for multinomial models.

If we formulate the multinomial model as a Poisson model a number of additional options become possible. Violations of the multinomial assumptions translate into overdispersion in the Poisson model. Typical corrections for Poisson overdispersion include replacing the Poisson model with a negative binomial model, adding random effects, or fitting what's called a quasi-Poisson model. Because of the large number of auxiliary terms that are needed to make the conversion from multinomial model to Poisson, fitting a negative binomial model or a mixed effects model can be difficult. A quasi-Poisson model on the other hand is easy to fit. It just involves making a post hoc adjustment to the standard errors of the Poisson model by inflating them with a measure of the overdispersion, the Pearson statistic divided by the residual degrees of freedom.

In our current example, if we had found that the multinomial model did not fit, and assuming that adding more predictors and interactions did not help, we could refit the model as a quasi-Poisson model using the family=quasipoisson argument of glm.

glm4.quasi <- glm(Freq ~ lake * size * gender + food + food:lake + food:size, family = quasipoisson, data = alli.dat)
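
The overdispersion estimate that the quasi-Poisson fit uses can be computed by hand as a check (a sketch; the same quantity is available as the dispersion component of summary.glm). The quasi-Poisson standard errors are the Poisson standard errors multiplied by the square root of this estimate.

# estimated dispersion: Pearson statistic of the Poisson fit divided by its residual df
phi <- sum(residuals(glm4, type="pearson")^2)/df.residual(glm4)
phi
summary(glm4.quasi)$dispersion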

To avoid looking at all of the coefficients in the summary table I use the grep function to locate those coefficients that contain "food" in their name. The first argument to grep is the pattern to search for and the second argument is the object to be searched. In the call below I search the row names of the coefficient matrix for the string 'food'.

grep('food', rownames(summary(glm4)$coefficients))
[1] 7 8 9 10 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

The first four listed occurrences correspond to the main effects of food, so I skip those and extract the remaining entries from both the Poisson and quasi-Poisson models.

#Poisson model
round(summary(glm4)$coefficients[18:33,],3)
                        Estimate Std. Error z value Pr(>|z|)
lakehancock:foodfish      -0.695      0.781  -0.890    0.374
lakeoklawaha:foodfish      0.653      1.202   0.543    0.587
laketrafford:foodfish     -1.088      0.842  -1.292    0.196
lakehancock:foodinvert    -2.353      0.934  -2.521    0.012
lakeoklawaha:foodinvert    1.590      1.228   1.295    0.195
laketrafford:foodinvert    0.034      0.863   0.040    0.968
lakehancock:foodother      0.131      0.892   0.147    0.883
lakeoklawaha:foodother     0.659      1.368   0.481    0.630
laketrafford:foodother     0.429      0.938   0.457    0.648
lakehancock:foodrep        0.548      1.372   0.399    0.690
lakeoklawaha:foodrep       3.112      1.586   1.962    0.050
laketrafford:foodrep       1.847      1.317   1.402    0.161
size>2.3:foodfish         -0.631      0.642  -0.982    0.326
size>2.3:foodinvert       -2.089      0.694  -3.010    0.003
size>2.3:foodother        -0.962      0.713  -1.350    0.177
size>2.3:foodrep          -0.279      0.806  -0.347    0.729
#Quasi-Poisson model
round(summary(glm4.quasi)$coefficients[18:33,],3)
                        Estimate Std. Error t value Pr(>|t|)
lakehancock:foodfish      -0.695      0.897  -0.775    0.443
lakeoklawaha:foodfish      0.653      1.380   0.473    0.638
laketrafford:foodfish     -1.088      0.967  -1.125    0.266
lakehancock:foodinvert    -2.353      1.072  -2.195    0.033
lakeoklawaha:foodinvert    1.590      1.410   1.128    0.265
laketrafford:foodinvert    0.034      0.991   0.035    0.973
lakehancock:foodother      0.131      1.024   0.128    0.899
lakeoklawaha:foodother     0.659      1.571   0.419    0.677
laketrafford:foodother     0.429      1.077   0.398    0.693
lakehancock:foodrep        0.548      1.575   0.348    0.730
lakeoklawaha:foodrep       3.112      1.821   1.709    0.095
laketrafford:foodrep       1.847      1.513   1.221    0.228
size>2.3:foodfish         -0.631      0.738  -0.855    0.397
size>2.3:foodinvert       -2.089      0.797  -2.621    0.012
size>2.3:foodother        -0.962      0.818  -1.176    0.246
size>2.3:foodrep          -0.279      0.926  -0.302    0.764

What we see is that the coefficient estimates of the two models are identical, but the reported standard errors of the quasi-Poisson model are larger to account for the overdispersion. This is a very crude correction but it is better than doing nothing. Of course for the current model we've already demonstrated that making this correction is unnecessary.

Graphical summary of the results

Because multinomial models generate a large number of parameter estimates, it is almost mandatory that we try to summarize the results using graphs. Panel graphs are ideally suited for this. There are a number of ways the results could be presented and in practice it might be useful to consider more than one.

  1. Both lake and food type have multiple categories. It might be interesting to compare food type preference within a lake. On the other hand we could also compare the effect of lake on food type preference. This choice switches which variable plays the role of a conditioning variable in a panel graph.
  2. The natural way to describe the effects of predictors in logistic regression is with odds ratios. One can display odds ratios or log odds ratios. While odds ratios are more interpretable they are also more variable and the confidence intervals of odds ratios that are imprecisely measured will tend to obscure the rest.

As an illustration I summarize the lake effect on food choice using the results from the Poisson fit. As we discovered above these coefficients occupy rows 18:29 of the coefficient table. I extract that portion of the coefficient table as well as the corresponding portion of the variance-covariance matrix of the parameter estimates.

out.coef <- summary(glm4)$coefficients[18:29,]
out.vcov <- vcov(glm4)[18:29,18:29]
rownames(out.coef)
 [1] "lakehancock:foodfish"    "lakeoklawaha:foodfish"   "laketrafford:foodfish" 
 [4] "lakehancock:foodinvert"  "lakeoklawaha:foodinvert" "laketrafford:foodinvert"
 [7] "lakehancock:foodother"   "lakeoklawaha:foodother"  "laketrafford:foodother"
[10] "lakehancock:foodrep"     "lakeoklawaha:foodrep"    "laketrafford:foodrep"  

Currently the coefficients are organized in a way that facilitates comparisons across lakes for a given food preference. To compare food preferences within a lake we need to change this. We can sort the row names of the coefficient matrix or we can refit the model with the order of food and lake in the interaction term switched. I choose to sort the row names.

sort(rownames(out.coef))
 [1] "lakehancock:foodfish"    "lakehancock:foodinvert"  "lakehancock:foodother" 
 [4] "lakehancock:foodrep"     "lakeoklawaha:foodfish"   "lakeoklawaha:foodinvert"
 [7] "lakeoklawaha:foodother"  "lakeoklawaha:foodrep"    "laketrafford:foodfish" 
[10] "laketrafford:foodinvert" "laketrafford:foodother"  "laketrafford:foodrep"  
neworder <- order(rownames(out.coef))
neworder
 [1]  1  4  7 10  2  5  8 11  3  6  9 12
out.coef <- out.coef[neworder,]
out.vcov <- out.vcov[neworder, neworder]
out.coef
                           Estimate Std. Error     z value   Pr(>|z|)
lakehancock:foodfish    -0.69511756  0.7812634 -0.88973524 0.37360807
lakehancock:foodinvert  -2.35347616  0.9337121 -2.52055866 0.01171687
lakehancock:foodother    0.13107865  0.8919677  0.14695448 0.88316795
lakehancock:foodrep      0.54765906  1.3715027  0.39931315 0.68966248
lakeoklawaha:foodfish    0.65320776  1.2019865  0.54344016 0.58682678
lakeoklawaha:foodinvert  1.59042709  1.2276745  1.29547941 0.19515490
lakeoklawaha:foodother   0.65886083  1.3684827  0.48145354 0.63019420
lakeoklawaha:foodrep     3.11207973  1.5859426  1.96229032 0.04972869
laketrafford:foodfish   -1.08776676  0.8416687 -1.29239301 0.19622107
laketrafford:foodinvert  0.03421807  0.8631505  0.03964323 0.96837756
laketrafford:foodother   0.42860197  0.9382993  0.45678599 0.64782488
laketrafford:foodrep     1.84748655  1.3173176  1.40246100 0.16077763

I collect the estimates and standard errors in a single data frame to which I add the two variables that make up the interaction term. An easy way of generating the variables is with the expand.grid function. It combines the levels of two (or more) vectors in all possible ways.

expand.grid(1:4, levels(alligators$lake)[2:4])
   Var1     Var2
1     1  hancock
2     2  hancock
3     3  hancock
4     4  hancock
5     1 oklawaha
6     2 oklawaha
7     3 oklawaha
8     4 oklawaha
9     1 trafford
10    2 trafford
11    3 trafford
12    4 trafford
out.mod <- data.frame(expand.grid(1:4, levels(alligators$lake)[2:4]), est=out.coef[,1], se=out.coef[,2])

The order created by expand.grid matches the row labels of the coefficient table.

rownames(out.mod)
 [1] "lakehancock:foodfish"    "lakehancock:foodinvert"  "lakehancock:foodother" 
 [4] "lakehancock:foodrep"     "lakeoklawaha:foodfish"   "lakeoklawaha:foodinvert"
 [7] "lakeoklawaha:foodother"  "lakeoklawaha:foodrep"    "laketrafford:foodfish" 
[10] "laketrafford:foodinvert" "laketrafford:foodother"  "laketrafford:foodrep"

I wish to display 95% confidence intervals for the log odds ratios as well as a second set of confidence intervals that permit making pairwise comparisons between the log odds ratios displayed in the same panel. I extract the functions and code for doing this from lecture 6.

nor.func1 <- function(alpha, model, sig) {
1 - pt(-qt(1-alpha/2, model$df.residual) * sum(sqrt(diag(sig))) /
  sqrt(c(1,-1) %*% sig %*% c(1,-1)), model$df.residual) -
  pt(qt(1-alpha/2, model$df.residual) * sum(sqrt(diag(sig))) /
  sqrt(c(1,-1) %*% sig %*% c(1,-1)), model$df.residual, lower.tail=F)
}
nor.func2 <- function(a,model,sigma) nor.func1(a, model, sigma)-.95

I organize the remaining code as a function so that I can obtain confidence levels only for the estimates that will get displayed in a single panel. The four arguments of the function correspond to the row numbers of the out.mod data frame, the number of estimates being compared (4), the fitted glm model, and the variance-covariance matrix of the parameter estimates.

ci.func <- function(rowvals, n, glm.model, glm.vmat) {
xvec1b <- numeric(n*(n-1)/2)
vmat <- glm.vmat[rowvals,rowvals]
ind <- 1
for(i in 1:(n-1)) {
for(j in (i+1):n){
sig <- vmat[c(i,j), c(i,j)]
#solve for alpha
xvec1b[ind] <- uniroot(function(x) nor.func2(x, glm.model, sig), c(.001,.999))$root
ind <- ind+1
}}
1-xvec1b
}

Because each panel displays four food type odds ratios there are six pairwise comparisons possible in each panel.

vals1 <- ci.func(1:4, 4, glm4, out.vcov)
vals2 <- ci.func(5:8, 4, glm4, out.vcov)
vals3 <- ci.func(9:12, 4, glm4, out.vcov)
vals1
[1] 0.5248087 0.4946145 0.7268751 0.5836296 0.7338971 0.7321855
vals2
[1] 0.3026384 0.4542485 0.5767233 0.4605825 0.5806192 0.6177434
vals3
[1] 0.4350258 0.5146393 0.6969407 0.5128226 0.6965664 0.7062488

The confidence levels are quite variable. I begin by producing the graph using the smallest reported confidence level for each panel. If a pairwise comparison does not show up as significant at this level then it will not be significant at any higher level either. On the other hand, pairwise comparisons that look significantly different at this setting could appear to be different only because the setting is too low relative to what it should be. Except for changing the labels on the graph, the code shown below is mostly unaltered from what was given in lecture 6.

ci.val <- c(rep(min(vals1),4), rep(min(vals2),4), rep(min(vals3),4))
#confidence intervals
out.mod$low95 <- out.mod$est - 1.96 * out.mod$se
out.mod$up95 <- out.mod$est + 1.96 * out.mod$se
out.mod$low50 <- out.mod$est + out.mod$se * qnorm((1-ci.val)/2)
out.mod$up50 <- out.mod$est + out.mod$se * qnorm(1-(1-ci.val)/2)

I generate the lattice plot.

library(lattice)
myprepanel.ci <- function(x,y,lx,ux,subscripts,...) {
list(xlim=range(x, ux[subscripts], lx[subscripts], finite=TRUE))
}
dotplot(factor(Var1, levels=1:4, labels=paste(c("fish", "invertebrate", "other", "reptile"), ' vs bird', sep=''))~est|factor(Var2, levels=levels(Var2), labels=paste('Lake ', c('Hancock', 'Oklawaha', 'Trafford'),' vs Lake George', sep='')), data=out.mod, xlab='log odds ratio', prepanel=myprepanel.ci, lx=out.mod$low95, ux=out.mod$up95, layout=c(3,1), panel=function(x, y, subscripts){
panel.dotplot(x, y, col=4, pch='+', cex=.6)
panel.segments(out.mod$low95[subscripts], y, out.mod$up95[subscripts], y, lwd=3, col='dodgerblue4', lineend=1)
#new line
panel.segments(out.mod$low50[subscripts], y, out.mod$up50[subscripts], y, lwd=6, col='dodgerblue1', lineend=1)
panel.points(x, y, col='white', pch=16, cex=1.1)
panel.points(x, y, col='dodgerblue4', pch=1, cex=1.2)
panel.abline(v=0, col=2, lty=2)
}, scales=list(x='free'),strip=strip.custom(par.strip.text = list(cex=0.9)))


Fig. 2a  Effect of lake on food choice log odds ratios (using smallest confidence levels)

In the output from the ci.func function, the smallest confidence levels correspond to the 1 vs 2 comparisons in panels 2 and 3, and the 1 versus 3 comparison in panel 1 (number 1 corresponds to the bottom of the panel). I examine things one panel at a time.

Panel 1

The 1 versus 3 comparison is not significant so we can raise the confidence level to the next lowest value, 0.5248087. This means recalculating the confidence intervals and then redoing the plot. The result is shown below (leaving out the lattice code).

vals1
[1] 0.5248087 0.4946145 0.7268751 0.5836296 0.7338971 0.7321855
ci.val <- c(rep(vals1[1],4), rep(min(vals2),4), rep(min(vals3),4))
out.mod$low50 <- out.mod$est + out.mod$se * qnorm((1-ci.val)/2)
out.mod$up50 <- out.mod$est + out.mod$se * qnorm(1-(1-ci.val)/2)


Fig. 2b  Effect of lake on food choice log odds ratios (using 2nd smallest confidence level in panel 1)

At this point we see that the fish and invertebrate intervals do not overlap at the correct confidence level, so these two log odds ratios are clearly different from each other. The 1 versus 4 comparison is not significant. Raising the confidence level to 0.7268 (the suggested value in the vals1 vector) will only increase the overlap, so we've completed the comparisons involving fish.

Turning to the remaining comparisons, the "other" versus bird interval overlaps the reptile versus bird interval with the confidence level set at 0.52. If we raise it to the recommended level of 0.73 they will overlap even more, so we don't have to do anything more with this one either. The only comparisons left to consider are invertebrate versus bird compared with other versus bird and with reptile versus bird. Currently they don't overlap, but we have the confidence levels set too low to be able to make these comparisons. I try changing the levels to their proper values and see if the intervals are still distinct.

ci.val <- c(vals1[1], rep(vals1[5],3), rep(min(vals2),4), rep(min(vals3),4))
ci.val
 [1] 0.5248087 0.7338971 0.7338971 0.7338971 0.3026384 0.3026384 0.3026384 0.3026384
 [9] 0.4350258 0.4350258 0.4350258 0.4350258
out.mod$low50 <- out.mod$est + out.mod$se * qnorm((1-ci.val)/2)
out.mod$up50 <- out.mod$est + out.mod$se * qnorm(1-(1-ci.val)/2)


Fig. 2c  Effect of lake on food choice log odds ratios (using the highest confidence level for panel 1)

From the graph, invertebrates versus bird is different from other versus bird as well as reptile versus bird. Changing this last level screwed up the fish and invertebrate comparison; their intervals now overlap. A solution is to just use the previous settings with the confidence level set at 0.524 (vals1[1]) for all four points. The actual choice of confidence levels is not important as long as the correct relationships between the estimates are shown.

Panels 2 and 3

We still have to do the same thing separately for panels 2 and 3. I spare you the details and just list the results. In panel 2 it turns out that estimate 1 differs from estimate 4, but no other pair is significantly different. In panel 3, estimate 1 differs from estimates 2, 3, and 4, but no other pair does. The final settings I used to get all of these relationships to display correctly are given below and they are not unique. The specified levels don't follow much of a pattern except that they do seem to yield the correct display of the relationships between the different log odds ratios.

ci.val <- c(rep(vals1[1],4), vals2[1], rep(0.58,3), vals3[1], vals3[4], rep(vals3[5],2))
#confidence intervals
out.mod$low95 <- out.mod$est-1.96*out.mod$se
out.mod$up95 <- out.mod$est+1.96*out.mod$se
out.mod$low50 <- out.mod$est+out.mod$se*qnorm((1-ci.val)/2)
out.mod$up50 <- out.mod$est+out.mod$se*qnorm(1-(1-ci.val)/2)

myprepanel.ci <- function(x,y,lx,ux,subscripts,...) {
list(xlim=range(x, ux[subscripts], lx[subscripts], finite=TRUE))
}
dotplot(factor(Var1, levels=1:4, labels=paste(c("fish", "invertebrate","other", "reptile"), ' vs bird', sep=''))~est|factor(Var2, levels=levels(Var2), labels=paste('Lake ',c('Hancock','Oklawaha', 'Trafford'),' vs Lake George', sep='')), data=out.mod, xlab='log odds ratio', prepanel=myprepanel.ci, lx=out.mod$low95, ux=out.mod$up95, layout=c(3,1), panel=function(x, y, subscripts){
panel.dotplot(x, y, col=4, pch='+', cex=.6)
panel.segments(out.mod$low95[subscripts], y, out.mod$up95[subscripts], y, lwd=3, col='dodgerblue4', lineend=1)
#new line
panel.segments(out.mod$low50[subscripts], y, out.mod$up50[subscripts], y, lwd=6, col='dodgerblue1', lineend=1)
panel.points(x, y, col='white', pch=16, cex=1.1)
panel.points(x, y, col='dodgerblue4', pch=1, cex=1.2)
panel.abline(v=0, col=2, lty=2)
}, scales=list(x='free'), strip=strip.custom(par.strip.text = list(cex=0.9)))


Fig. 3  Effect of lake on food choice log odds ratios (final choices for the confidence levels)

From Fig. 3 we see that two of the 95% confidence intervals for the log odds ratios don't include zero. Thus we can say that the odds of choosing invertebrates over birds is significantly lower in Lake Hancock than in Lake George. On the other hand the odds of choosing reptiles over birds is significantly higher in Lake Oklawaha than in Lake George. None of the remaining log odds ratios are statistically significant.

Using the pairwise confidence intervals we can look for differences among the food preference odds ratios for the same two lakes. We see that the odds ratio of choosing invertebrates over birds in Lake Hancock versus Lake George is significantly lower than the other three odds ratios for these two lakes: fish versus bird, "other" versus bird, and reptile versus bird. In the Lake Oklawaha versus Lake George comparison, the odds ratio for choosing fish over bird is significantly lower than the odds ratio of choosing reptile over bird. Finally in the Lake Trafford versus Lake George comparisons the odds ratio for choosing fish over bird is significantly lower than all three of the others: choosing invertebrate over bird, "other" over bird, and reptile over bird.

To plot odds ratios instead of log odds ratios we need to exponentiate the individual estimates and their confidence intervals everywhere they appear in the above code. This doesn't provide a useful picture for this example because some of the estimates are very imprecise. When we exponentiate them they end up dominating the display and making it difficult to draw comparisons across groups. To generate a picture that is at least semi-useful I truncated the confidence intervals for the odds ratios at 30 and changed the plotting symbol for the point estimate. The changes are highlighted in the code below.

ci.val <- c(rep(vals1[1],4), vals2[1], rep(0.58,3), vals3[1], vals3[4], rep(vals3[5],2))
#confidence intervals
out.mod$low95 <- out.mod$est-1.96*out.mod$se
out.mod$up95 <- out.mod$est+1.96*out.mod$se
out.mod$low50 <- out.mod$est+out.mod$se*qnorm((1-ci.val)/2)
out.mod$up50 <- out.mod$est+out.mod$se*qnorm(1-(1-ci.val)/2)
myprepanel.ci <- function(x,y,lx,ux,subscripts,...) {
list(xlim=range(x, ux[subscripts], lx[subscripts], finite=TRUE))
}
dotplot(factor(Var1, levels=1:4, labels=paste(c("fish", "invertebrate","other", "reptile"), ' vs bird', sep=''))~exp(est)|factor(Var2, levels=levels(Var2), labels=paste('Lake ',c('Hancock','Oklawaha', 'Trafford'),' vs Lake George', sep='')), data=out.mod, xlab='Odds ratio', prepanel=myprepanel.ci, lx=exp(out.mod$low95), ux=ifelse(exp(out.mod$up95)<30, exp(out.mod$up95),30), layout=c(3,1), panel=function(x, y, subscripts){
panel.dotplot(x, y, col=4, pch='+', cex=.6)
panel.segments(exp(out.mod$low95[subscripts]), y, exp(out.mod$up95[subscripts]), y, lwd=3, col='dodgerblue4', lineend=1)
#new line
panel.segments(exp(out.mod$low50[subscripts]), y, exp(out.mod$up50[subscripts]), y, lwd=6, col='dodgerblue1', lineend=1)
panel.points(x, y, col='dodgerblue4', pch='|', cex=2, lwd=2)
panel.abline(v=1, col=2, lty=2)
}, scales=list(x='free'), strip=strip.custom(par.strip.text = list(cex=0.9)))


Fig. 4  Effect of lake on food choice odds ratios (truncated at the right)

Ordinal logistic regression

The data set trees.csv is a portion of the North Carolina Vegetation Survey database and consists of the canopy cover values of individual woody plant species from various plots in North Carolina. Because of the difficulty of measuring canopy cover in the field, cover is often recorded on an ordinal scale. The categories used in this database and their roughly equivalent percentage values are: 1 = trace, 2 = 0–1%, 3 = 1–2%, 4 = 2–5%, 5 = 5–10%, 6 = 10–25%, 7 = 25–50%, 8 = 50–75%, 9 = 75–95%, 10 = 95–100%. A portion of the data set is shown below.

trees <- read.csv('ecol 562/trees.csv')
trees[1:12,]
   Plot      SPP  U          V           W date truecover        logw
1   503 ACERRUBR  2   2.952756    3.423852 1999         1  1.23076622
2   506  OSTRVIR 80 141.732283  462.600443 1999         1  6.13686371
3   526  CELTOCC  6   6.889764   16.358404 1999         1  2.79474176
4   526  FAGUGRA 13  10.334646   10.081342 1999         1  2.31068638
5   532  QUERFAL  1  50.000000 1963.495408 1999         1  7.58248154
6    63  QUERALB  6  13.779528  101.574275 2001         1  4.62079030
7    67  VIBURUF  5   2.460630    0.951070 2001         1 -0.05016762
8    68  ULMUALA  5   2.460630    0.951070 2001         1 -0.05016762
9   988  ULMURUB  3   6.397638   11.222626 1999         1  2.41793191
10    1  FRAX1S1  1   1.476378    1.711926 2000         2  0.53761904
11    1  JUNIVIR  9  12.303150   29.102742 2000         2  3.37083239
12    1  ULMUALA  8  21.653543   50.216496 2000         2  3.91634357

Because of inter-observer variability in measuring canopy cover there is interest in developing a regression model that can take as inputs the values of variables that are easily measured at ground level and generate an estimate of canopy cover. The variables U, V, and W in the trees data frame are examples of such variables.

For the purpose of this lecture we'll use W alone as a predictor of plot canopy cover.

As was discussed in lecture 39, one way to deal with ordinal data derived from an underlying continuous metric is to treat the scores as if they were measured on a ratio scale and then use them in an ordinary regression model. Based on the percentages these categories are supposed to represent, it is pretty clear that the categories are not equally spaced (although they are approximately equally spaced on a log base 2 scale). Another approach is to assign to each category the midpoint of the underlying percentage range and again treat the result as if it were a continuous measure. We'll pursue a third strategy, which is to treat these data as purely ordinal and carry out one of the variations of ordinal logistic regression discussed in lecture 39.
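
As an aside, the midpoint approach just mentioned is easy to set up should we want to compare it with the ordinal models later. The sketch below (my own illustration) assigns each cover class the midpoint of its percentage range and stores it as a new numeric variable; the value used for the "trace" class is an arbitrary choice.

# midpoints of the percentage ranges for cover classes 1-10 ("trace" arbitrarily set to 0.05)
midpts <- c(0.05, 0.5, 1.5, 3.5, 7.5, 17.5, 37.5, 62.5, 85, 97.5)
trees$cover.mid <- midpts[trees$truecover]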

To simplify things I focus on a single species that is well-represented in the data set. There are 88 different plots in the data frame, and nine species occur in more than 60 of them (roughly 70% of the plots).

length(unique(trees$Plot))
[1] 88
table(trees$SPP)[table(trees$SPP)>60]
ACERRUBR  CORNFLO  FRAX1S1  JUNIVIR  LIQUSTY  LIRITUL  NYSSSYL  QUERALB  QUERRUB
      88       85       63       68       64       67       70       71       61

Cornus florida was found in 85 of the plots and was recorded as having six different cover values. I use it to demonstrate fitting an ordinal logistic regression model.

table(trees$truecover[trees$SPP=="CORNFLO"])
 2  3  4  5  6  7
 8 22 24 15 11  5
cornus <- trees[trees$SPP=="CORNFLO",]
dim(cornus)
[1] 85  8

Cumulative logit model

A number of R packages can be used to fit the cumulative logit model. A straightforward implementation is the polr function of the MASS package. The only new twist with this function is that it requires an ordered factor, created with the ordered function, as the response variable.
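As a quick aside (not in the original notes), the ordered function just creates a factor whose levels carry an explicit low-to-high ordering, which is what polr uses to define the cumulative probabilities. Using the cornus data frame created above:

cover.ord <- ordered(cornus$truecover)   # cover.ord is an illustrative name
levels(cover.ord)        # "2" "3" "4" "5" "6" "7", ordered lowest to highest
is.ordered(cover.ord)    # TRUE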

library(MASS)
fit1 <- polr(ordered(truecover)~W, data=cornus)
summary(fit1)
Re-fitting to get Hessian

Call:
polr(formula = ordered(truecover) ~ W, data = cornus)

Coefficients:
    Value Std. Error t value
W 0.02684   0.005049   5.316

Intercepts:
    Value   Std. Error t value
2|3 -1.4356  0.3992    -3.5961
3|4  0.5211  0.3108     1.6764
4|5  2.1478  0.3962     5.4217
5|6  3.4230  0.5075     6.7444
6|7  4.9364  0.6753     7.3103

Residual Deviance: 247.6201
AIC: 259.6201

The polr function parameterizes the cumulative logit model as follows.

$$\operatorname{logit}\bigl(P(\text{cover} \le k)\bigr) = \log\frac{P(\text{cover} \le k)}{P(\text{cover} > k)} = \zeta_k - \beta W, \qquad k = 2, 3, \ldots, 6$$

So the polr function models the odds of having a cover value of k or less versus a cover value greater than k. Because of the negative sign in front of the regression coefficient, a positive estimate of β means that larger values of W increase the odds of the higher cover values. The reported estimate of β is 0.027, so large values of W favor the larger cover classes.
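Because β is on the log odds scale, it can help to move to the odds scale. As a small supplement (not in the original notes), exponentiating β gives the multiplicative change in the odds of being in a higher cover class per one-unit increase in W, and MASS provides a profile-likelihood confint method for polr fits.

exp(coef(fit1))      # odds multiplier per one-unit increase in W
exp(confint(fit1))   # 95% profile-likelihood interval, on the odds scale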

The cumulative logit model can also be fit with the VGAM package, which is quite flexible in the variations of the model it can handle. To match the sign of β from the polr function, we can use the following parameterization.

$$\operatorname{logit}\bigl(P(\text{cover} \ge k)\bigr) = \log\frac{P(\text{cover} \ge k)}{P(\text{cover} < k)} = \alpha_k + \beta W, \qquad k = 3, 4, \ldots, 7$$

This parameterization is obtained with the vglm function by specifying cumulative in the family argument along with the settings of reverse and parallel as shown below.

library(VGAM)
fit2 <- vglm(factor(truecover)~W, family=cumulative(reverse=T, parallel=T), data=cornus)
summary(fit2)
Call:
vglm(formula = factor(truecover) ~ W, family = cumulative(reverse = T,
    parallel = T), data = cornus)

Pearson Residuals:
                   Min        1Q    Median        3Q     Max
logit(P[Y>=2]) -2.2923  0.080029  0.168923  0.484102 0.65649
logit(P[Y>=3]) -4.5834 -0.908720  0.156884  0.645716 1.47533
logit(P[Y>=4]) -1.4444 -0.631251 -0.221531  0.369025 2.56225
logit(P[Y>=5]) -3.3772 -0.239888 -0.136126 -0.092288 4.10734
logit(P[Y>=6]) -1.7270 -0.169174 -0.088226 -0.060294 5.42472

Coefficients:
                  Value Std. Error t value
(Intercept):1  1.435673  0.4145891  3.4629
(Intercept):2 -0.520987  0.3087831 -1.6872
(Intercept):3 -2.147748  0.3826250 -5.6132
(Intercept):4 -3.422932  0.5032238 -6.8020
(Intercept):5 -4.936144  0.7169688 -6.8847
W              0.026834  0.0050751  5.2873

Number of linear predictors:  5

Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3]), logit(P[Y>=4]), logit(P[Y>=5]), logit(P[Y>=6])

Dispersion Parameter for cumulative family:   1

Residual Deviance: 247.6201 on 419 degrees of freedom

Log-likelihood: -123.8101 on 419 degrees of freedom

Number of Iterations: 6

The vglm function returns the same estimates as the polr function: the sign of β is the same, but the signs of the individual intercepts are reversed. (Note that the linear predictors in the vglm output are labeled by the factor levels 1–6, which correspond to cover classes 2–7, so logit(P[Y>=2]) is the logit of the probability of being in cover class 3 or higher.)
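We can check this correspondence directly: polr stores its intercepts in the zeta component of the fitted object, while in the vglm fit the first five coefficients are the intercepts and the sixth is the slope. A quick side-by-side comparison (a supplementary check, not in the original notes):

# polr intercepts (zeta) and slope next to the vglm estimates; the vglm
# intercepts should be the negatives of the polr intercepts
round(cbind(polr=c(fit1$zeta, coef(fit1)), vglm=c(-coef(fit2)[1:5], coef(fit2)[6])), 4)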

The two models we've fit are both proportional odds models: a single coefficient describes the effect of the predictor W on every cumulative logit. If we invert the logits and plot the five modeled cumulative probabilities against W, we see the characteristic graphical signature of the proportional odds model. The curves are parallel on the logit scale; on the probability scale they appear as horizontal shifts of one another, flattening out only near zero and one where the probability scale constrains them. The intercept shifts each curve to the left or right, but the nature of the relationship with the predictor W is the same for every curve.

curve(exp(coef(fit2)[1] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[1] + coef(fit2)[6]*x)), from=0, to=215, ylab='Probability', xlab='W', ylim=c(0,1))
curve(exp(coef(fit2)[2] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[2] + coef(fit2)[6]*x)), col=2, add=T)
curve(exp(coef(fit2)[3] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[3] + coef(fit2)[6]*x)), col=3, add=T)
curve(exp(coef(fit2)[4] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[4] + coef(fit2)[6]*x)), col=4, add=T)
curve(exp(coef(fit2)[5] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[5] + coef(fit2)[6]*x)), col=5, add=T)
legend('bottomright', c(expression(P(X>=3)), expression(P(X>=4)), expression(P(X>=5)), expression(P(X>=6)), expression(P(X>=7))), col=1:5, lty=1, bty='n', cex=.85)


Fig. 5   Proportional odds assumption in the cumulative logit model

How might we use these results to address the original problem of finding a regression model to predict canopy cover? The cumulative probability expressions plotted in Fig. 5 can be used to calculate the probabilities of the individual cover classes as follows.

$$\begin{aligned}
P(\text{cover} = 2) &= 1 - P(\text{cover} \ge 3) \\
P(\text{cover} = k) &= P(\text{cover} \ge k) - P(\text{cover} \ge k+1), \qquad k = 3, 4, 5, 6 \\
P(\text{cover} = 7) &= P(\text{cover} \ge 7)
\end{aligned}$$

where each cumulative probability is obtained by inverting its logit: $P(\text{cover} \ge k) = \dfrac{e^{\alpha_k + \beta W}}{1 + e^{\alpha_k + \beta W}}$.

The last expression follows because Cornus florida doesn't have a cover class greater than 7 in the data set. We can plot the estimated probabilities against W.

# P(cover = k) for k = 3, 4, 5, 6: the difference of two adjacent cumulative probabilities
p.xk <- function(k,x) exp(coef(fit2)[k-2] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[k-2] + coef(fit2)[6]*x))-exp(coef(fit2)[k-1] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[k-1] + coef(fit2)[6]*x))
curve(p.xk(3,x), from=0, to=215, ylab='Probability', xlab='W', ylim=c(0, 0.7))
curve(p.xk(4,x), add=T, col=2)
curve(p.xk(5,x), add=T, col=3)
curve(p.xk(6,x), add=T, col=4)
curve(exp(coef(fit2)[5] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[5] + coef(fit2)[6]*x)), col=5, add=T)
curve(1-exp(coef(fit2)[1] + coef(fit2)[6]*x)/ (1+exp(coef(fit2)[1] + coef(fit2)[6]*x)), col=1,lty=2, add=T)
legend('topleft', c(expression(P('cover'==2)), expression(P('cover'==3)), expression(P('cover'==4)), expression(P('cover'==5)), expression(P('cover'==6)), expression(P('cover'==7))), col=c(1,1,2,3,4,5), lty=c(2,1,1,1,1,1), cex=.85, bty='n')


Fig. 6   Estimated probabilities of the individual cover classes as a function of W

For each value of W we can either pick the cover class with the highest probability or assign a distribution of possible cover classes using the probabilities shown. Notice that with the first option we would never assign cover class 2 to any observation, because its probability curve is never the highest.
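Both options are easy to obtain with the predict method for polr fits: type="probs" returns the full set of cover class probabilities and type="class" returns the most probable class. The W values below are arbitrary illustrative choices.

newdat <- data.frame(W=c(10, 50, 150))                   # arbitrary values of W
round(predict(fit1, newdata=newdat, type="probs"), 3)    # distribution over cover classes
predict(fit1, newdata=newdat, type="class")              # most probable cover class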

Testing the proportional odds assumption

The VGAM package provides a way to conduct an analytical test of the proportional odds assumption. We can fit the model that has separate slopes for each of the cumulative logit responses and compare it to the single slope model using a likelihood ratio test. The separate slopes model is obtained with the parallel=F argument. When we try to fit this model to the truecover response variable we receive a large number of warnings.

fit3 <- vglm(factor(truecover)~W, family=cumulative(reverse=T, parallel=F), data=cornus)
Warning messages:
1: In checkwz(wz, M = M, trace = trace, wzeps = control$wzepsilon) :
  4 elements replaced by 1.819e-12
2: In checkwz(wz, M = M, trace = trace, wzeps = control$wzepsilon) :
  9 elements replaced by 1.819e-12
3: In checkwz(wz, M = M, trace = trace, wzeps = control$wzepsilon) :
  11 elements replaced by 1.819e-12
4: In checkwz(wz, M = M, trace = trace, wzeps = control$wzepsilon) :
  11 elements replaced by 1.819e-12
5: In Deviance.categorical.data.vgam(mu = mu, y = y, w = w, residuals = residuals,  :
  fitted values close to 0 or 1
6: In checkwz(wz, M = M, trace = trace, wzeps = control$wzepsilon) :
  11 elements replaced by 1.819e-12
7: In Deviance.categorical.data.vgam(mu = mu, y = y, w = w, residuals = residuals,  :
  fitted values close to 0 or 1
8: In log(prob) : NaNs produced

The warnings indicate that some of the fitted values were very close to zero. Looking at the summary output we see that the model was apparently estimated, but the log-likelihood is missing because some of the estimated probabilities were so close to zero that their logs could not be computed.

summary(fit3)
Call:
vglm(formula = factor(truecover) ~ W, family = cumulative(reverse = T,
    parallel = F), data = cornus)

Pearson Residuals:
                    Min          1Q     Median        3Q    Max
logit(P[Y>=2])  -2.8037  3.0091e-06  0.0019349  0.059714 1.4146
logit(P[Y>=3]) -23.3977 -7.0585e-01  0.1071105  0.405207 2.2853
logit(P[Y>=4])  -1.8237 -6.3842e-01 -0.1790875  0.262293 3.5736
logit(P[Y>=5])  -1.9461 -3.0320e-01 -0.1732860 -0.127485 3.0425
logit(P[Y>=6])  -1.2392 -2.0641e-01 -0.1447240 -0.114505 3.6850

Coefficients:
                   Value Std. Error  t value
(Intercept):1 -0.4925839  0.7012324 -0.70245
(Intercept):2 -1.2911522  0.4262894 -3.02882
(Intercept):3 -2.3210544  0.4317115 -5.37640
(Intercept):4 -2.3009876  0.4302749 -5.34771
(Intercept):5 -2.9251780  0.6338667 -4.61482
W:1            0.2699540  0.1192624  2.26353
W:2            0.0592541  0.0138522  4.27760
W:3            0.0357884  0.0069295  5.16463
W:4            0.0149486  0.0051945  2.87776
W:5            0.0046051  0.0080005  0.57560

Number of linear predictors:  5

Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3]), logit(P[Y>=4]), logit(P[Y>=5]), logit(P[Y>=6])

Dispersion Parameter for cumulative family:   1

Residual Deviance: 220.7533 on 415 degrees of freedom

Log-likelihood: NA on 415 degrees of freedom

Number of Iterations: 10
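The missing log-likelihood is consistent with some of the fitted class probabilities being essentially zero. As a quick check (not part of the original analysis), fitted applied to a vglm object returns the matrix of fitted class probabilities:

min(fitted(fit3))            # smallest fitted class probability
sum(fitted(fit3) < 1e-8)     # how many fitted probabilities are essentially zero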

The summary output indicates that the coefficient of W decreases dramatically as we move up the cover classes. Curiously, even though there is no log-likelihood, a deviance is reported. We can use the deviance to test the proportional odds assumption by comparing the deviance of this model with that of the proportional odds model.

deviance(fit2)-deviance(fit3)
[1] 26.8668
pchisq(deviance(fit2)-deviance(fit3), df=df.residual(fit2)-df.residual(fit3), lower.tail=F)
[1] 2.115029e-05

The test is highly significant, indicating a lack of fit of the proportional odds model. This result may be unreliable, however, because of the warnings we obtained when fitting the model. Such warnings can arise when some of the categories have too few observations, which could certainly be the case here: the smallest and largest cover classes have only 8 and 5 observations, respectively.

table(cornus$truecover)
 2  3  4  5  6  7
 8 22 24 15 11  5

I try collapsing the sparsely populated largest cover class (7) with its neighbor (6) and refitting the two models; the smallest class (2) is relabeled 1 but keeps the same eight observations.

cornus$ord2 <- ifelse(cornus$truecover<3, 1, ifelse(cornus$truecover>=6, 6, cornus$truecover))
table(cornus$ord2)
 1  3  4  5  6
 8 22 24 15 16
fit2a <- vglm(factor(ord2)~W, family=cumulative(reverse=T, parallel=T), data=cornus)
fit3a <- vglm(factor(ord2)~W, family=cumulative(reverse=T, parallel=F), data=cornus)
There were 15 warnings (use warnings() to see them)

The model with parallel=F still produces warnings about fitted values close to zero, but this time it does return a log-likelihood.

logLik(fit3a)
[1] -100.5546
logLik(fit2a)
[1] -111.6845

The large difference in the log-likelihoods suggests that there is still a lack of fit, so I carry out a formal test. (Note: the reported deviance is −2 times the reported log-likelihood.)

pchisq(deviance(fit2a)-deviance(fit3a), df=df.residual(fit2a)-df.residual(fit3a), lower.tail=F)
[1] 5.759664e-05
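Because the deviance is −2 times the log-likelihood, the likelihood ratio statistic used above can be computed either from the deviances or directly from the log-likelihoods. A quick check that the two agree (a supplementary verification, not in the original notes):

deviance(fit2a) - deviance(fit3a)     # LR statistic from the deviances
2*(logLik(fit3a) - logLik(fit2a))     # the same statistic from the log-likelihoods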

There is still a significant lack of fit. I try collapsing the categories even further.

cornus$ord3 <- ifelse(cornus$truecover<4, 1, ifelse(cornus$truecover>=5, 5, cornus$truecover))
table(cornus$ord3)
 1  4  5
30 24 31
fit2b <- vglm(factor(ord3)~W, family=cumulative(reverse=T, parallel=T), data=cornus)
fit3b <- vglm(factor(ord3)~W, family=cumulative(reverse=T, parallel=F), data=cornus)
pchisq(deviance(fit2b)-deviance(fit3b), df=df.residual(fit2b)-df.residual(fit3b), lower.tail=F)
[1] 0.2110906

The lack of fit is no longer significant. If we examine the coefficient tables we see that the separate slope estimates are fairly similar to the estimate of the common slope from the proportional odds model.

summary(fit2b)@coef3
                    Value  Std. Error   t value
(Intercept):1 -1.15270839 0.384852289 -2.995197
(Intercept):2 -3.10338481 0.528154691 -5.875901
W              0.04977642 0.009317223  5.342409
summary(fit3b)@coef3
                    Value  Std. Error   t value
(Intercept):1 -1.40247413 0.451473949 -3.106434
(Intercept):2 -2.70073170 0.549640061 -4.913637
W:1            0.06151160 0.014506732  4.240210
W:2            0.04293498 0.009664812  4.442402
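A rough way to quantify how similar the two slopes are is a Wald comparison of W:1 and W:2 built from the coefficient covariance matrix. This is a supplementary sketch, not part of the original notes; it assumes vcov on a vglm fit returns the covariance matrix with the coefficient names shown above.

# Wald-type comparison of the two slopes in the separate-slopes model fit3b
b <- coef(fit3b)
V <- vcov(fit3b)
d <- unname(b["W:1"] - b["W:2"])                                 # difference in slopes
se <- sqrt(V["W:1","W:1"] + V["W:2","W:2"] - 2*V["W:1","W:2"])   # standard error of the difference
c(difference=d, std.error=se, z=d/se)                            # |z| < 2 is consistent with a common slope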
