So far we've studied regression models for two data types:
For continuous data we assumed that a reasonable data generating mechanism for the observed values was a normal distribution, and we used an ordinary regression model to determine how the mean of that normal distribution varied as a function of the values of various predictor variables.

For count data the standard data generating mechanisms are the Poisson distribution and the negative binomial distribution. To ensure that our regression model returned non-negative estimates for the mean of these distributions we constructed a regression model for the log of the mean. We refer to the log function in this context as the link function for count data.
We now turn to regression models for binary data. Binary data are data in which there are only two possible outcomes: 0 and 1, failure and success. Thus we are dealing with purely nominal data. The simplest case of binary data is when we have a single observed binary outcome. A textbook example of this is flipping a coin once and recording the outcome, heads or tails.
The standard probability model for such a simple experiment is the Bernoulli distribution. The Bernoulli distribution has a single parameter p, the probability of success, where 0 ≤ p ≤ 1. The notation we will use is X ~ Bernoulli(p), to be read "X is distributed Bernoulli with parameter p." Obviously if the probability of success is p, then the probability of failure is 1 – p. The mean of the Bernoulli distribution is p and the variance is p(1 – p). We refer to the single binary outcome as a Bernoulli trial.
Binary data don't become interesting statistically until an experiment produces multiple Bernoulli trials. Depending on how we choose to record the results from this experiment we are led to one of two closely related probability models: the product Bernoulli distribution or the binomial distribution.
The product Bernoulli distribution is the natural model to use when we observe a sequence of n independent Bernoulli trials, each with some probability of success p_i. An example of this in ecology would be the construction of a habitat suitability model for the spatial distribution of an endangered species. Typically we record the presence or absence of the species in a habitat in each of a set of randomly located quadrats. We then try to relate the observed species distribution to measured characteristics of the habitat. Each individual species occurrence is treated as the realization of a Bernoulli random variable whose parameter p is modeled as a function of habitat characteristics.
Suppose X1, X2, … , Xn are n independent binary outcomes from such a habitat suitability study. Let the observed binary outcomes be x1 = 1, x2 = 0, x3 = 0, x4 = 1, … , xn = 1 for quadrats numbered 1, 2, 3, … , n. To formulate the likelihood of our data we begin by writing down the probability of observing the data we obtained, the joint probability of our data.

$$L = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i) = \prod_{i=1}^{n} p_i^{x_i}(1 - p_i)^{1 - x_i}$$
where in the second step we make use of the fact that the individual binary outcomes are assumed to be independent. If we've measured a common set of predictors that vary among the binary observations we can develop a regression model for p_i as a function of these predictors. We then replace the various p_i in the log-likelihood by their corresponding regression models and obtain maximum likelihood estimates of the regression coefficients.
Another common way binary data arise involves a slight variation of the product Bernoulli scenario. Suppose instead of observing the individual outcomes from the independent trials we only observe the number of successes. The classical example of this is an experiment in which a coin is flipped n times and we record the number of heads, but not the order in which the heads and tails appeared. Because we haven't recorded the information about the individual binary trials we are forced to assume (unlike the product Bernoulli model) that the probability of a success p is the same for each of the n trials.
A simple biological illustration of this might be a seed germination experiment. Suppose in such an experiment 100 seeds are planted in a pot and the number of seeds that germinate is recorded. Suppose we observe that k out of 100 seeds germinated. In this case we haven't recorded which seeds germinated, only that k of them did. If we had recorded the order of the events (i.e., which seed yielded which success or failure) then the product Bernoulli model would be appropriate. Now there are many different ways that k successful germinations could have occurred. The first k seeds could have been successes while the last 100 – k were failures, or the last k seeds could have been successes while the first 100 – k were failures, and there are many other possibilities. Each of these possible outcomes contributes to the overall probability of obtaining the k observed successes. Because the seeds are assumed identical, each such outcome has the same probability, which is given by the product Bernoulli model:

$$P(X_1 = x_1, \ldots, X_{100} = x_{100}) = \prod_{i=1}^{100} p^{x_i}(1 - p)^{1 - x_i} = p^{k}(1 - p)^{100 - k}$$
where I replace p_i by p because the probability of success is the same on each Bernoulli trial. So all we need to do is figure out how many different outcomes lead to k successes and multiply that number by their common probability. The answer is provided by the binomial coefficient, n!/(k!(n – k)!), whose derivation is given in any elementary probability text.
Putting this all together, the probability of observing k successes out of n independent trials, where the probability of success p is the same on each trial, is given by the binomial probability mass function.

$$P(X = k) = \binom{n}{k} p^{k}(1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n$$
Formally a binomial random variable arises from a binomial experiment, which is an experiment consisting of a sequence of n independent Bernoulli trials. If X1, X2, … , Xn are independent and identically distributed Bernoulli random variables, each with parameter p, then their sum

$$X = X_1 + X_2 + \cdots + X_n$$

is said to have a binomial distribution with parameters n and p. We write this as X ~ binomial(n, p). Given the mean and variance of the Bernoulli distribution it should not be surprising that the mean of the binomial distribution is np and the variance of the binomial distribution is np(1 – p). To better distinguish the binomial distribution from the product Bernoulli distribution, we sometimes refer to data arising from a binomial distribution as grouped binary data.
In order for an experiment to be considered a binomial experiment four basic assumptions must hold:

(1) The experiment consists of a fixed number n of trials.
(2) Each trial results in one of only two possible outcomes, a success or a failure.
(3) The probability of success p is the same on every trial.
(4) The trials are independent of one another.
If (1) or (2) are violated then a binomial model is completely inappropriate and a different probability distribution should be used. If (3) is violated, it may be possible to salvage things by including random effects in the final model. If (4) is violated it may be possible to account for it by adding a correlation structure to the model.
The dbinom function of R is the probability mass function for the binomial and it returns the probability of a specified value. The syntax of dbinom is dbinom(x, size, prob) where size is what we have been calling n and prob is what we've been calling p. I start by placing three binomial distributions side-by-side. All have n = 10, but p varies with p = 0.1, 0.5, and 0.9. I use the mfrow argument of the par function of R to obtain a graphical display consisting of 1 row and 3 columns.
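A minimal sketch of the plotting code (the exact styling of the original figure is not reproduced here):

par(mfrow = c(1, 3))
for (p in c(0.1, 0.5, 0.9)) {
    # height of each spike is the binomial probability P(X = x)
    plot(0:10, dbinom(0:10, size = 10, prob = p), type = "h",
         xlab = "x", ylab = "P(X = x)", main = paste("p =", p))
}
par(mfrow = c(1, 1))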
Fig. 1 Three binomial probability mass functions with n = 10 and p = 0.1, 0.5, and 0.9
Observe that as p approaches the endpoint values of 0 and 1 the distributions become more skewed. When p = 0.5 the distribution looks quite symmetric. I repeat this for p = 0.5, 0.8, and 0.9, but this time with a much larger number of trials, n = 100.
Fig. 2 Three binomial probability mass functions with n = 100 and p = 0.5, 0.8, and 0.9
Notice that the distributions for p = 0.5 and 0.8 are nearly perfectly symmetrical, while the distribution for p = 0.9 shows only a hint of left skewness.
Because a binomial distribution is the sum of n independent Bernoulli random variables, the central limit theorem applies. According to the central limit theorem when we add up a large number of independent and identically distributed random variables, no matter what their distribution, the sum will tend to look normally distributed. Fig. 2 is an illustration of the central limit theorem in practice and is an empirical rationale for using a normal distribution to approximate the binomial. Practically speaking, if you have binomial data and your n is big enough, you could analyze your binomial counts as if they came from a normal distribution. In order for the approximation to be good the recommendation is that both the products np and n(1–p) need to be large. If both products exceed 5 that is typically large enough for the normal approximation to be a reasonable one.
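As a quick numerical illustration (an addition, not part of the original analysis), we can compare an exact binomial probability with its normal approximation when np and n(1 – p) both comfortably exceed 5:

n <- 100; p <- 0.5
# exact probability that X <= 55
pbinom(55, size = n, prob = p)
# normal approximation with a continuity correction; the two values are close
pnorm(55.5, mean = n * p, sd = sqrt(n * p * (1 - p)))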
To illustrate constructing the likelihood of a binomial random variable we consider a seed germination experiment in which 100 seeds are planted in a pot and the number of seeds that germinate is recorded. Suppose this is done repeatedly for different pots of seeds such that the different pots are each subjected to different light regimes, burial depths, etc. Clearly the first two assumptions of the binomial distribution hold here.
Assumptions (3) and (4), constant p and independent trials, would need to be verified. Suppose there are m pots subjected to different treatments and in each pot we record the total number of germinated seeds. Let the probability of germination in pot i be p_i and, to make things more general, suppose n_i seeds were planted in pot i with y_i of them germinating. We make the usual assumption that the outcomes in different pots are independent of each other. Then we can write down the following likelihood for our model.

$$L = \prod_{i=1}^{m} \binom{n_i}{y_i}\, p_i^{y_i} (1 - p_i)^{n_i - y_i}$$
Except for the leading product of binomial coefficients the remainder of the likelihood looks just like the likelihood of the product Bernoulli model. The only real substantive difference is that in the binomial model the probability p is the same for observations coming from the same group (pot), whereas in the product Bernoulli model all the probabilities are potentially different. As a result, when we formulate a regression model for binomial data our predictors will vary between groups (pots) but will be constant for all the Bernoulli trials that make up a group. In contrast, in the product Bernoulli model each individual binary observation can have different values of the predictors.
Because the binomial distribution assumes that each binomial outcome arises from a sequence of independent Bernoulli trials, a binomial experiment with m binomial random variables each constructed from n independent Bernoulli trials yields a likelihood consisting of a product of m × n Bernoulli probabilities. This is exactly what we would obtain if we had used a product Bernoulli model and viewed the experiment as consisting of m × n independent ordered Bernoulli trials. The likelihoods differ only in the presence of the binomial coefficients.
Because the binomial coefficients in the binomial likelihood don't contain p, they do not affect our estimate of p. So, whether we treat a binomial experiment as grouped binary data or ordered binary data we will obtain the same maximum likelihood estimates of the parameters. Because the likelihood of the binomial model also contains the binomial coefficients the reported log-likelihoods at the maximum likelihood estimates will differ between the two approaches (as will the residual deviance and the AIC), but the estimates, their standard errors, and significance tests will all be the same.
With continuous data our basic regression model took the following form.

$$y_i \sim \text{normal}(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
With count data our basic regression model took the following form.

$$y_i \sim \text{Poisson}(\lambda_i), \qquad \log \lambda_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
(A model similar to this was used for the negative binomial distribution.) The use of the log link with count data is important because it guarantees that the mean of the Poisson (negative binomial) distribution will be greater than zero. A potential downside is that while the predictors act additively on the log mean scale, they act multiplicatively on the mean on the original scale of the response.
Fitting regression models using a binomial distribution poses an additional challenge. The parameter p that we wish to model is bounded on two sides: it is bounded below by zero and above by one. Thus in order for the regression model to yield sensible results we need a link function that will constrain p to lie within these limits. The most popular modern choice for such a link function is the logit function. Thus with binary data the basic logistic regression model takes the following form.

$$y_i \sim \text{Bernoulli}(p_i), \qquad \text{logit}(p_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
The logit, also called a log odds, is defined as follows.

$$\text{logit}(p) = \log\!\left(\frac{p}{1 - p}\right)$$
"Odds" is meant in the sense that a gambler uses the term odds. In truth the concept of "odds" is completely interchangeable with the concept of probability. If you're told that the probability of a success is that translates into saying that the odds are 2 to 1.
If on the other hand you're told that the odds are 1.5 in favor of an event, this means that the probability of a success is 1.5/(1 + 1.5) = 0.6.
So, odds are an equivalent although perhaps less familiar way of talking about probability. Odds can be any number from 0 to ∞. Taking the log of the odds then can yield any number on the real line. So, the logit maps a probability, a number in the open interval (0, 1), onto the set of real numbers. The benefit of using a logit link in binary regression is that it permits the regression equation to take on any value (negative or positive) while guaranteeing that when we back-transform the logit to obtain p, the result will lie in the interval (0, 1).
Although the logit is the most popular link function for binomial data because of its interpretability as a log odds, there are other possible choices for the link function. Historically a commonly used alternative was the probit function, defined to be probit(p) = Φ⁻¹(p), where Φ is the cumulative distribution function of the standard normal distribution. Because Φ(z) = P(Z ≤ z) is a probability for every real number z, the function Φ maps real numbers into the interval [0, 1]. The probit, being the inverse function, therefore maps probabilities onto the real line. By inverting the probit function we can obtain a probability corresponding to a probit regression prediction. For binomial data the glm function of R offers the logit link (the default), the probit link, and a third link called the complementary log-log (cloglog) link.
Although Poisson data resemble binomial data they are fundamentally different. Poisson counts are unbounded above, while binomial counts are bounded above by n. As a result we always know that a binomial response cannot exceed n, the total number of trials, whereas a Poisson response theoretically has no upper bound.
As an example the bycatch data from Assignment 10 that we analyzed using Poisson regression superficially looks like binomial data. The response variable was the number of dolphins caught over a fixed number of tows for each trip. The difference is that we weren't treating the outcome of each tow as a success or a failure. We just counted up the number of dolphins caught, so the individual tows were not Bernoulli trials. On a single tow there could be many dolphins caught, one dolphin, or none. So, potentially at least, there was no maximum value for the number of dolphins caught; the number could conceivably exceed the number of tows. We did include the number of tows in the model as a covariate (or an offset) to account for the fact that the observations in the data set were not equivalent because they were obtained with unequal degrees of effort.
Still, there are situations where it is legitimate to treat binomial data as having an approximate Poisson distribution. The classical instance of this is a binomial experiment in which n is extremely large and p is extremely small. In this case if we let λ = np and calculate probabilities P(X = k) using a Poisson or a binomial model, the answers we get are very similar.
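A quick numerical check (an added illustration, not from the original text): with n large and p small the two sets of probabilities nearly coincide.

n <- 1000; p <- 0.003
# binomial probabilities for 0 through 5 successes and their Poisson counterparts
rbind(binomial = dbinom(0:5, size = n, prob = p),
      poisson  = dpois(0:5, lambda = n * p))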
There is a formal theorem, the Poisson limit theorem, that specifies the conditions under which a Poisson distribution can be viewed as the limiting case of a binomial distribution: if n → ∞ and p → 0 in such a way that np converges to a constant λ, then the binomial(n, p) distribution converges to the Poisson(λ) distribution.
The data set for today's lecture is taken from Piegorsch and Bailer (2005) who describe it as follows (p. 126).
In the late 1980s, the US Geological Survey (USGS) conducted a water quality study of land on Long Island, New York (Eckhardt et al., 1989). As part of the study, results were presented on contamination of the groundwater by the industrial solvent trichloroethylene (TCE). We have Y = number of wells where TCE was detected, out of N sampled wells in total. Also recorded for use as potential predictor variables were the qualitative variables xi1 = land use (with 10 different categories), and xi2 = whether or not sewers were used in the area around the well, along with the quantitative variates xi3 = median concentration (mg/L) of nitrate at the N well sites, and xi4 = median concentration (mg/L) of chloride at the N well sites. Interest existed in identifying if and how these predictor variables affect TCE contamination.
I load and examine the data.
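A sketch of the data step; the file name is a guess (the original source file is not shown), but the data frame is called wells to match the code that follows.

wells <- read.table("wells.txt", header = TRUE)   # file name is an assumption
dim(wells)
head(wells)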
Each row is a single binomial observation, the observed number of successes y out of a fixed total n. An observation is classified as a "success" if the well is contaminated with TCE. The binomial distribution has two parameters, n and p. We are given n for each observation but we need to estimate p, the probability of success. Our goal is to quantify how p depends on the measured predictors: land use (categorical), sewer use (categorical), nitrate (continuous), and chloride (continuous). The land use categories are: undeveloped (undev), agriculture (agri), low-density residential with less than 2 dwellings/acre (resL), medium-density residential with 2-4 dwellings/acre (resM), high-density residential with greater than 4 dwellings/acre (resH), recreation (recr), institution (inst), transportation (trans), commercial (comm), and industrial (indus).
The source of the binomial grouping for these data is unclear. It is possible that a map of Long Island was stratified into areal regions according to land use (and perhaps the presence of sewers) and then within each stratum a certain number of wells were tested. If this description is correct then we do not have a random sample of wells. Land use and sewer are stratum characteristics that are then inherited by all the wells in that stratum. From the description of the study, chloride and nitrate concentration were measured separately for each individual well but the value reported in the data set is the median value of these concentrations for all the wells in the stratum. Thus for better or worse all of the predictors in the data set are measured at the stratum level, i.e., at the level of the binomial observation (not at the level of the individual well). It's probably the case that these are really binary data (measurements on individual wells) that have been categorized for presentation purposes by landscape type and sewer type.
We can get a preliminary sense of the importance of the categorical predictors sewer and land use with a dot plot of the raw success probabilities for each category.
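One way to produce a display along the lines of Fig. 3 (a sketch; the original plotting code is not shown):

# raw success probability for each binomial observation
phat <- wells$y / wells$n
lu <- factor(wells$land.use)
plot(as.numeric(lu), phat, pch = 16,
     col = ifelse(wells$sewer == "yes", "red", "black"),
     xaxt = "n", xlab = "land use", ylab = "proportion of contaminated wells")
axis(1, at = seq_len(nlevels(lu)), labels = levels(lu), las = 2)
legend("topleft", legend = c("sewer present", "sewer absent"),
       col = c("red", "black"), pch = 16)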
Fig. 3 The effect of sewer and land use on the proportion of contaminated wells
From the graph we see that while there are small differences with respect to land use, there are some large effects associated with the presence (red) or absence (black) of sewers.
Logistic regression models are fit in R using the glm function with the argument family=binomial. The logit link is the default and doesn't have to be specified. For true binomial (grouped binary) data the response "variable" is a matrix constructed with the cbind function in which the first column is the number of successes and the second column is the number of failures. I start by fitting an additive logistic regression model that includes only the two categorical variables as predictors.
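A sketch of the call just described (the model object name mod1 is my own):

mod1 <- glm(cbind(y, n - y) ~ land.use + sewer, data = wells, family = binomial)
summary(mod1)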
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5363685 -0.9860019 -0.0001991 0.3541879 1.4394048
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.6568 0.7376 -4.957 7.14e-07 ***
land.usecomm 1.8676 0.8044 2.322 0.02025 *
land.useindus 2.2070 0.8053 2.741 0.00613 **
land.useinst 0.7584 0.8395 0.903 0.36629
land.userecr 0.6676 0.8435 0.791 0.42870
land.useresH 1.7316 0.7784 2.225 0.02611 *
land.useresL 0.6663 1.0501 0.635 0.52572
land.useresM 1.0212 0.7809 1.308 0.19099
land.usetrans 0.7933 0.8360 0.949 0.34267
land.useundev -18.3414 3033.9308 -0.006 0.99518
seweryes 1.5980 0.2955 5.407 6.41e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 146.956 on 19 degrees of freedom
Residual deviance: 15.201 on 9 degrees of freedom
AIC: 82.476
Number of Fisher Scoring iterations: 17
Except for the fact that the model is for the logit of p, much of the summary output should look familiar. The land.use and sewer variables are factors and have been entered into the model as dummy variables. The estimates reported in the summary table are the coefficients of those dummy variables in the model. The coding has been done alphabetically so that the reference group for land.use is agri and the reference group for sewer is sewer = no. As usual, then, the reported estimates are "effects": the effect of switching from the reference category to the level shown.
Although we haven't discussed ways to interpret the logit, one useful fact about logit(p) is that it is an invertible function that is monotonic in p: when p goes up the logit goes up, and when p goes down the logit goes down. We can use this fact to interpret how predictors affect the probability that a well is contaminated. A positive coefficient for a dummy regressor means that when we switch from the reference level to the level indicated by the dummy regressor the logit increases. But if the logit increases then the probability of contamination must also increase. Thus we can see from the reported estimates that for all of the land use categories except "undeveloped" the probability of a well being contaminated by TCE is higher than it is for agricultural land (the reference category), although only some of these differences are significant. Similarly wells on tracts of land that have sewers have a higher probability of being contaminated than do wells on tracts without sewers.
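To make this concrete (an added illustration, not from the original text), the inverse logit function plogis converts a predicted logit back to a probability. Using the estimates above, the predicted probability of contamination for a well on industrial land with sewers is obtained from the intercept plus the land.useindus and seweryes coefficients:

# inverse logit of the predicted logit -3.6568 + 2.2070 + 1.5980; roughly 0.54
plogis(-3.6568 + 2.2070 + 1.5980)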
As was pointed out above, the likelihoods for grouped binary data (under a binomial model) and binary data (under a product Bernoulli model) take the same form and yield the same maximum likelihood estimates for the regression parameters. We can demonstrate this by converting the binomial data of the wells data set to binary data. As binary data each observation should generate wells$y ones and wells$n – wells$y zeros. I write a function that generates the zeros and ones on a per observation basis treating a row of the wells data frame as a vector.
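A sketch of such a function (the original code is not shown); it returns y ones followed by n – y zeros for a single observation:

make.binary <- function(obs) rep(c(1, 0), times = c(obs["y"], obs["n"] - obs["y"]))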
I try the function out on the first four observations of the data set using the apply function. The apply function returns a list because the vectors of zeros and ones created from different rows are of different lengths.
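A sketch of the call; restricting apply to the numeric columns y and n keeps it from coercing the data frame to character.

apply(wells[1:4, c("y", "n")], 1, make.binary)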
$`2`
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[37] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
$`3`
[1] 0 0 0 0 0 0 0
$`4`
[1] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[37] 0 0 0 0 0 0 0 0 0 0 0 0
Finally I do this for all the observations, unlist the results, and add columns indicating the values of the land.use and sewer variables repeated the necessary number of times.
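A sketch of that step. The data frame name wells.raw matches the later text; the name of the binary response column, tce, is my own choice.

wells.raw <- data.frame(
    tce      = unlist(apply(wells[, c("y", "n")], 1, make.binary)),
    land.use = rep(wells$land.use, wells$n),
    sewer    = rep(wells$sewer, wells$n))
dim(wells.raw)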
To fit a logistic regression model to binary data we just enter the binary variable as the response and specify the family argument as binomial.
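A sketch of the binary-data fit, using the tce column created above:

mod1.raw <- glm(tce ~ land.use + sewer, data = wells.raw, family = binomial)
summary(mod1.raw)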
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2409 -0.6666 -0.3332 -0.0001 2.7138
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.657 0.738 -4.96 7.1e-07 ***
land.usecomm 1.868 0.804 2.32 0.0203 *
land.useindus 2.207 0.805 2.74 0.0061 **
land.useinst 0.758 0.839 0.90 0.3663
land.userecr 0.668 0.844 0.79 0.4287
land.useresH 1.732 0.778 2.22 0.0261 *
land.useresL 0.666 1.050 0.63 0.5257
land.useresM 1.021 0.781 1.31 0.1910
land.usetrans 0.793 0.836 0.95 0.3427
land.useundev -15.456 716.838 -0.02 0.9828
seweryes 1.598 0.296 5.41 6.4e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 615.83 on 649 degrees of freedom
Residual deviance: 484.08 on 639 degrees of freedom
AIC: 506.1
Number of Fisher Scoring iterations: 17
If you compare this output with the output from the grouped binary data model above, you'll see that the coefficient estimates, standard errors, and p-values are exactly the same (except for the undeveloped land use type, discussed further below). The AIC, log-likelihood, and residual deviance are all different.
So in terms of obtaining estimates and carrying out statistical tests, it doesn't matter whether we view the response as binomial or binary.
It is sometimes useful to see the actual estimates for each category, rather than the estimated effects. We could obtain these estimates by hand by adding the estimate of the intercept to each estimated effect. To obtain standard errors we would need to use the method described in lecture 4. An easier way when it's possible is to refit the model without an intercept. This allows us to see the individual category estimates for one of the factors in the model, namely the factor variable that is listed first in the model formula.
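A sketch of the no-intercept fit (the - 1 in the formula removes the intercept):

mod2 <- glm(cbind(y, n - y) ~ land.use + sewer - 1, data = wells, family = binomial)
summary(mod2)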
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5363685 -0.9860019 -0.0001991 0.3541879 1.4394048
Coefficients:
Estimate Std. Error z value Pr(>|z|)
land.useagri -3.6568 0.7376 -4.957 7.14e-07 ***
land.usecomm -1.7892 0.4041 -4.427 9.54e-06 ***
land.useindus -1.4498 0.3971 -3.651 0.000261 ***
land.useinst -2.8984 0.4636 -6.252 4.05e-10 ***
land.userecr -2.9892 0.4620 -6.470 9.82e-11 ***
land.useresH -1.9253 0.3478 -5.536 3.10e-08 ***
land.useresL -2.9905 0.7669 -3.899 9.65e-05 ***
land.useresM -2.6357 0.3246 -8.120 4.68e-16 ***
land.usetrans -2.8635 0.4524 -6.329 2.47e-10 ***
land.useundev -21.9983 3033.9307 -0.007 0.994215
seweryes 1.5980 0.2955 5.407 6.41e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 432.213 on 20 degrees of freedom
Residual deviance: 15.201 on 9 degrees of freedom
AIC: 82.476
Number of Fisher Scoring iterations: 17
Observe that land.use = 'agri' has the value of the intercept from the previous model and all of the other land use categories have the old intercept value added to them. These are the predicted logits for each land.use type when sewer='no'. The estimate of sewer is still an effect estimate and is the difference in logits between sewer = 'no' and sewer = 'yes'. If we wanted to see the estimated logits for both sewer categories we would need to list sewer before land.use in the model formula. Removing the intercept has no effect on model fit. The reported AIC values are the same.
The reported standard error for land.use = undev is crazy in all versions of the logistic regression model we've fit. The estimated logit for the undeveloped land use category is –22 with a standard error of roughly 3000, yielding a p-value of essentially 1. This indicates that glm failed to estimate this parameter. If we look at the raw data we can see why this has happened.
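A sketch of examining the rows for the undeveloped category:

wells[wells$land.use == "undev", ]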
Both binomial observations in the undeveloped land use category had 0 successes, i.e., no contaminated wells. This is the only land use category for which this happened. We can get information about successes and failures for each land use category with the xtabs function. xtabs tabulates data by a categorical variable using the value of another variable for the frequencies.
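A sketch of the tabulation:

xtabs(cbind(y, n - y) ~ land.use, data = wells)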
With no successes the maximum likelihood estimate of the probability of success for this category should be zero. This also corresponds to an odds of zero. Because the log of zero is undefined, the logit is undefined for this group thus leading to estimation problems in a logit model. The problem described here is called quasi-complete separation and occurs any time a parameter in a model corresponds to a set of observations that are all successes or all failures. While this is a nuisance when fitting a logistic regression model it is obviously not a bad problem to have. What it means here is that wells on undeveloped land are never contaminated regardless of their sewer status. As a result knowing that land.use = 'undev' perfectly predicts that a well is not contaminated. Unfortunately maximum likelihood estimation has failed as indicated by the ridiculously large standard error reported in the summary table.
When binomial data exhibit quasi-complete separation we have a number of options, each of which is explored below: (1) use an estimation method other than ordinary maximum likelihood, such as an exact method, a Bayesian approach with an informative prior, or the Firth penalized likelihood method; (2) drop the offending category from the analysis and report a separate estimate (with an exact confidence interval) for it; or (3) combine the offending category with a similar category and refit the model.
There are alternatives to maximum likelihood estimation when the data are sparse: exact methods (Mehta and Patel 1995; Venkataraman and Ananthanarayanan 2008), a Bayesian approach with an informative prior (Gelman et al. 2008), and the Firth bias reduction method (Firth 1993) that maximizes a penalized likelihood function. The exact approach is proprietary and is only implemented in SAS and logXact. Furthermore the exact method can fail for complicated models. The R package elrm approximates exact logistic regression using MCMC but I’ve found its performance to be somewhat unsatisfactory.
Firth regression is identical to Bayesian logistic regression with a noninformative Jeffreys prior (Fijorek and Sokolowski 2012). Heinze and Schemper (2002) have shown that the Firth method is a general solution to the problem of complete separation. The logistf package of R (Ploner et al. 2010) carries out the Firth method. An extension of the Firth method to multinomial models is implemented in the pmlr package of R (Colby et al. 2010).
The logistf function of the logistf package requires a binary response variable in which the categories have the numerical values 0 and 1. The wells.raw data frame we created previously will work here.
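A sketch of the Firth fit, using the binary tce response constructed earlier:

library(logistf)
out.firth <- logistf(tce ~ land.use + sewer, data = wells.raw)
summary(out.firth)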
coef se(coef) lower 0.95 upper 0.95 Chisq p
(Intercept) -3.424344 0.670564 -5.025055 -2.313637 68.763252 1.11022e-16
land.usecomm 1.682935 0.744342 0.377902 3.376933 6.695309 9.66669e-03
land.useindus 2.015201 0.745245 0.708822 3.710398 9.935773 1.62097e-03
land.useinst 0.613195 0.778832 -0.793258 2.349624 0.680935 4.09265e-01
land.userecr 0.528785 0.781638 -0.891240 2.267772 0.496065 4.81234e-01
land.useresH 1.545687 0.716133 0.308640 3.201830 6.298950 1.20810e-02
land.useresL 0.654184 0.963398 -1.303080 2.610622 0.469950 4.93010e-01
land.useresM 0.844285 0.718005 -0.401195 2.502388 1.652675 1.98595e-01
land.usetrans 0.646981 0.774896 -0.751044 2.377972 0.769544 3.80358e-01
land.useundev -2.208163 1.578255 -7.148982 0.345864 2.806710 9.38707e-02
seweryes 1.555010 0.286769 1.004491 2.147990 33.464284 7.25847e-09
Likelihood ratio test=124.958 on 10 df, p=0, n=650
The reported estimate and standard error for the undeveloped land use category effect are sensible. According to the output we see that although undeveloped land has a lower probability of well contamination than does agricultural land, the difference is not significant (p = 0.09). What if we had chosen to remove land.use = "undev" from the analysis? What should we report as the estimate for this group? Based on the data alone the obvious estimate for the probability of contamination is zero, but what about a confidence interval for this estimate? Do we really believe that there are no contaminated wells on undeveloped land? How should we quantify our uncertainty when we have a point estimate of zero? To understand how to proceed I start with a less severe situation that I then generalize to the case of zero successes.
There were 76 wells tested for land.use = 'undev'. Suppose we had observed 2 contaminated wells among these 76. The obvious point estimate of the probability of contamination is 2/76 ≈ 0.026. What should we report for a confidence interval? Statistical theory tells us that the variance of a binomial proportion based on n trials is the following.

$$\text{Var}(\hat{p}) = \frac{p(1 - p)}{n}$$
Binomial proportions are in principle sums; hence the central limit theorem applies to them and we expect them to be approximately normally distributed when samples are large. A large-sample 95% confidence interval for a proportion is the following.

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
If we use this formula when the success rate is 2 out of 76, we get a silly confidence interval estimate for p.
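Carrying out the calculation in R (a sketch):

phat <- 2/76
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / 76)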
An interval estimate for a proportion that includes negative values is obviously nonsensical. The usual seat of the pants rule for when the central limit theorem is applicable to proportions is that both np ≥ 5 and n(1 – p) ≥ 5. In our case np = 2 so the guidelines aren't satisfied. One "fix" is to instead assume normality on a logit scale. While this would yield a sensible confidence interval for the current example, it is of no help in dealing with p = 0 because the logit when p = 0 is undefined.
Another solution is to construct what's called an exact confidence interval. It's called exact because it is based not on a normal approximation but instead on the exact tail probabilities of the binomial distribution. The idea is simple. Let the desired confidence interval for p be denoted (pL, pU) and consider the case where we observe X = 2 successes out of n = 76 trials. The upper limit pU is the value of p for which P(X ≤ 2) = 0.025, and the lower limit pL is the value of p for which P(X ≥ 2) = 0.025.
Fig. 4 displays the two binomial distributions corresponding to pU and pL.
Fig. 4 Finding an exact confidence interval for p when X ~ binomial(n = 76, p)
It's not hard to obtain pU and pL iteratively by starting with an initial guess and then refining it. To find pU I just keep trying new values for pU such that pbinom(2,size=76,p=pU) gets as close as possible to 0.025.
To find pL we use the fact that P(X ≥ 2) = P(X > 1) = 1 – P(X ≤ 1) = 1-pbinom(1, size=76, p=pL).
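Rather than iterating by hand, the same search can be handed to uniroot (an added sketch, not the original code):

# upper limit: the value of p for which P(X <= 2) = 0.025
uniroot(function(p) pbinom(2, size = 76, prob = p) - 0.025, c(1e-6, 1))$root
# lower limit: the value of p for which P(X >= 2) = 0.025
uniroot(function(p) 1 - pbinom(1, size = 76, prob = p) - 0.025, c(1e-6, 1))$root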
So, an approximate answer is (.0032, .0918). But there's no need to do this by hand. The binom.test function of R will generate this very same confidence interval.
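Presumably the call was simply the following.

binom.test(2, 76)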
data: 2 and 76
number of successes = 2, number of trials = 76, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.003203005 0.091849523
sample estimates:
probability of success
0.02631579
The interval that we've calculated is called the Clopper-Pearson interval. A number of other kinds of confidence intervals are available for proportions; see, e.g., Agresti (2002), pp. 14–21, and Brown et al. (2001). What about the case when there are 0 successes? Here's what the binom.test function reports.
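The analogous call with zero successes:

binom.test(0, 76)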
data: 0 and 76
number of successes = 0, number of trials = 76, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.00000000 0.04737875
sample estimates:
probability of success
0
The 95% confidence interval is (0.000, 0.047). The convention is to take 0 as the lower limit when zero successes are observed, but the upper limit is calculated in the manner that was just described.
We can combine the undeveloped land use category with another similar category (either theoretically similar or empirically similar) and refit the model. A priori it would make sense to combine it with some other low impact category. Using the individual logit estimates from glm or the empirical probability estimates shown below, it would appear that the undeveloped category is most similar to the agriculture category.
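A sketch of computing the empirical contamination probabilities by land use category:

sort(with(wells, tapply(y, land.use, sum) / tapply(n, land.use, sum)))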
I combine these two categories to make a new land use category and then refit the model with the new land use variable. For the moment I continue to fit a model without an intercept so that I can compare the logits corresponding to the different land use categories.
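A sketch of the recoding and refit; the combined level is called rural to match the coefficient name land.use2rural in the output below, and the object names are my own.

wells$land.use2 <- factor(ifelse(wells$land.use %in% c("agri", "undev"),
                                 "rural", as.character(wells$land.use)))
mod.lu2 <- glm(cbind(y, n - y) ~ land.use2 + sewer - 1, data = wells, family = binomial)
summary(mod.lu2)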
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5494 -1.0623 -0.3001 0.3583 1.7330
Coefficients:
Estimate Std. Error z value Pr(>|z|)
land.use2comm -1.7699 0.4027 -4.395 1.11e-05 ***
land.use2indus -1.4328 0.3956 -3.621 0.000293 ***
land.use2inst -2.8810 0.4622 -6.233 4.58e-10 ***
land.use2recr -2.9745 0.4607 -6.456 1.08e-10 ***
land.use2resH -1.9062 0.3462 -5.507 3.66e-08 ***
land.use2resL -2.9810 0.7660 -3.892 9.96e-05 ***
land.use2resM -2.6230 0.3232 -8.116 4.83e-16 ***
land.use2rural -4.6878 0.7317 -6.407 1.48e-10 ***
land.use2trans -2.8475 0.4511 -6.313 2.74e-10 ***
seweryes 1.5770 0.2937 5.369 7.91e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 432.213 on 20 degrees of freedom
Residual deviance: 19.289 on 10 degrees of freedom
AIC: 84.565
Number of Fisher Scoring iterations: 5
Notice that AIC tells us that this last model is worse than the previous model.
This is an illustration of why AIC should not always be taken seriously. The previous model yielded an estimated logit with a ridiculous standard error for the undeveloped land use category. Although the estimated logit was very negative it was still not significantly different from zero. A logit of zero corresponds to p = 0.50. So even though the point estimate for the probability of contamination for undeveloped land is near zero, the very wide confidence interval tells us the true value could be almost any number between 0 and 1. Reporting the estimates from such a model doesn't make any sense.
After collapsing the agricultural and undeveloped land use categories, land use is now a categorical variable with nine levels. Categorical variables with a lot of levels are extremely unwieldy to work with. One approach is to work with the variable as is and then in the end carry out a sequence of post hoc comparisons, such as all pairwise comparisons, to see what's different from what. With 9 categories this would mean 36 different pairwise comparisons, of which we'd expect roughly two to be significant just by chance. After carrying out all these tests we would then be faced with the task of making sense of them.
I find the post hoc testing approach nonsensical, counter-productive, and inconsistent with basic regression modeling strategies. It is also antithetical to the information-theoretic approach to model selection that is very popular in the biological literature (lecture 19). In keeping with the information-theoretic approach I propose that we should make the re-categorization of the land use variable a part of the model building process itself. To this end I suggest that we fit different models in which the land use variable is categorized in different ways and then use AIC to select the superior categorization. Ideally the re-categorization should be done on theoretical grounds. For instance if we think intensity of land use drives well contamination with TCE then we could try a re-categorization that combines land use categories based on some measure of intensity. Because I don't have any theory to work with here I'm going to use a data-snooping empirical approach. I combine land use categories based on the similarities of their estimated logits. To facilitate this I plot the nine estimated land use logits in a single dot plot.
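A sketch of the dot plot; the first nine coefficients of the no-intercept fit mod.lu2 are the land use logits (at sewer = no), with the tenth being the sewer effect.

land.logits <- coef(mod.lu2)[1:9]
names(land.logits) <- sub("land.use2", "", names(land.logits))
dotchart(sort(land.logits), xlab = "estimated logit (sewer = no)", pch = 16)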
Fig. 5 The predicted logits for different categories of land use
Without worrying about what's significantly different from what, it's pretty clear that the logits fall into three groups. The rural category stands alone but then there are five categories all with logits around –3 and three categories with logits that are larger than –2. These last three all seem to represent high intensity land use: commercial, industrial, and high-density residential. The other five perhaps represent an intermediate level of use intensity. I fit the re-categorization models in stages. First I combine the five middle categories.
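A sketch of the first collapse; the name of the merged level (medium) is my own choice.

mid.cats <- c("inst", "recr", "resL", "resM", "trans")
wells$land.use3 <- factor(ifelse(wells$land.use2 %in% mid.cats,
                                 "medium", as.character(wells$land.use2)))
mod.lu3 <- glm(cbind(y, n - y) ~ land.use3 + sewer - 1, data = wells, family = binomial)
AIC(mod.lu3, mod.lu2)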
AIC argues for the simpler model. Observe that the log-likelihoods of the two models are barely different. If we were to carry out a likelihood ratio test (legitimate here because the two models are nested) we would fail to find a significant difference.
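A sketch of that test:

anova(mod.lu3, mod.lu2, test = "Chisq")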
Model 1: cbind(y, n - y) ~ land.use3 + sewer - 1
Model 2: cbind(y, n - y) ~ land.use2 + sewer - 1
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 14 20.004
2 10 19.289 4 0.71526 0.9494
The summary table of this last model suggests that the three estimated logits from the high intensity use categories are not very different ranging from –1.4 to –1.9. I proceed to combine them into a single category.
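A sketch of the second collapse; the merged level is called high here, again a name of my own choosing.

hi.cats <- c("comm", "indus", "resH")
wells$land.use4 <- factor(ifelse(wells$land.use3 %in% hi.cats,
                                 "high", as.character(wells$land.use3)))
mod.lu4 <- glm(cbind(y, n - y) ~ land.use4 + sewer - 1, data = wells, family = binomial)
AIC(mod.lu4, mod.lu3, mod.lu2)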
AIC chooses the simplest model with just three land use categories. A likelihood ratio test agrees.
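A sketch of the test:

anova(mod.lu4, mod.lu3, test = "Chisq")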
Model 1: cbind(y, n - y) ~ land.use4 + sewer - 1
Model 2: cbind(y, n - y) ~ land.use3 + sewer - 1
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 16 21.502
2 14 20.004 2 1.4978 0.4729
Finally I refit this model without an intercept to obtain the estimated logits of the individual land use categories and their confidence intervals.
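A sketch of that fit; confint returns profile-likelihood confidence intervals for the category logits.

mod.final <- glm(cbind(y, n - y) ~ land.use4 + sewer - 1, data = wells, family = binomial)
summary(mod.final)
confint(mod.final)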
As a final test of model adequacy I try adding an interaction between land use and sewer.
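A sketch of the comparison; the with-intercept parameterization is used here to match the model formulas shown in the output below (the fit itself is equivalent).

mod.add <- glm(cbind(y, n - y) ~ land.use4 + sewer, data = wells, family = binomial)
mod.int <- glm(cbind(y, n - y) ~ land.use4 * sewer, data = wells, family = binomial)
anova(mod.add, mod.int, test = "Chisq")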
Model 1: cbind(y, n - y) ~ land.use4 + sewer
Model 2: cbind(y, n - y) ~ land.use4 * sewer
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 16 21.502
2 14 18.081 2 3.4206 0.1808
The interaction term is not significant so we can stay with the main effects model. We come to the same conclusion if we use AIC to select the final model.
Jack Weiss | Phone: (919) 962-5930 | E-mail: jack_weiss@unc.edu | Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, NC 27599 | Copyright © 2012 | Last revised: November 30, 2012 | URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture26.htm