Lecture 9—Monday, September 24, 2012

Topics

R functions and commands demonstrated

R function options

Analysis of Covariance

Analysis of covariance is just a regression model in which the main predictor(s) of interest is (are) categorical, but a potentially confounding continuous variable was also measured. Thus analysis of covariance is analysis of variance in which there are one or more additional continuous covariates. The terminology dates from a time when it was not generally appreciated that analysis of variance and analysis of covariance are both just special cases of regression.

There are two basic uses for analysis of covariance and related models.

  1. Extending analysis of variance to include a continuous covariate. Here the primary objective is to determine if the mean of a variable y varies across groups, denoted by factor variable g. This is the classic analysis of variance problem. But suppose we discover that the groups also differ in the distribution of a continuous variable x, a variable that also happens to have a linear relationship with the response variable y. Suppose that the relationship between y and x is roughly the same in each of the groups. Analysis of covariance can be used to determine if the mean response of y is different in the different groups g while controlling for the effect of the continuous variable x. Analysis of covariance is the regression of y on both x and g in which x and g have only an additive effect on y.
  2. Separate regressions for different subpopulations. Suppose we know that two continuous variables y and x are linearly related but we suspect that the relationship is not the same in different subgroups g of our population. Either the slopes and/or the intercepts of the relationship may vary. To investigate this we carry out a regression of y on x, g, and the interaction of x and g. If the interaction term is not statistically significant then this model reduces to the analysis of covariance model described in (1).
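
In R model-formula notation these two uses correspond to the following schematic calls. This is only a sketch: y, x, and g are generic placeholders, not variables from the examples below.

# use (1): additive ANCOVA model -- group effect while controlling for x
# lm(y ~ x + g)
# use (2): allow both the slopes and intercepts to differ across groups
# lm(y ~ x * g)   # shorthand for y ~ x + g + x:g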

Analysis of covariance as an extension of analysis of variance

The example we'll consider appears in Crawley (2002), p. 287 (also Crawley 2007), where it is described as follows.

"The next worked example concerns an experiment on the impact of grazing on the seed production of a biennial plant (Ipomopsis). Forty plants were allocated to treatments, grazed and ungrazed, and the grazed plants were exposed to rabbits during the first two weeks of stem elongation. They were then protected from subsequent grazing by the erection of a fence and allowed to regrow. Because initial plant size was thought likely to influence fruit production, the diameter of the top of the rootstock was measured before each plant was potted up. At the end of the growing season, the fruit production (dry weight, mg) was recorded on each of the 40 plants, and this forms the response variable for the analysis."

There are three recorded variables: Root, the initial diameter of the rootstock; Fruit, the dry weight of fruit produced (mg); and Grazing, the treatment (Grazed or Ungrazed).

The goal is to determine the effect that grazing on plants when they're young has on their fruit production when they're adults.

ipo <- read.table( 'ecol 563/ipomopsis.txt', header=T)
ipo[1:10,]
    Root Fruit  Grazing
1  6.225 59.77 Ungrazed
2  6.487 60.98 Ungrazed
3  4.919 14.73 Ungrazed
4  5.130 19.28 Ungrazed
5  5.417 34.25 Ungrazed
6  5.359 35.53 Ungrazed
7  7.614 87.73 Ungrazed
8  6.352 63.21 Ungrazed
9  4.975 24.25 Ungrazed
10 6.930 64.34 Ungrazed

To test for a grazing effect on fruit production we carry out an analysis of variance.

out0 <- lm(Fruit~Grazing, data=ipo)
anova(out0)
Analysis of Variance Table

Response: Fruit
          Df  Sum Sq Mean Sq F value  Pr(>F) 
Grazing    1  2910.4 2910.44  5.3086 0.02678 *
Residuals 38 20833.4  548.25                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So there's a significant effect due to grazing. When we examine the estimated effect we see that it's in a surprising direction. Grazed plants have higher fruit production than ungrazed plants.

# grazing has a positive effect
summary(out0)
Call:
lm(formula = Fruit ~ Grazing, data = ipo)

Residuals:
    Min      1Q  Median      3Q     Max
-52.991 -18.028   2.915  14.049  48.109

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)       67.941      5.236  12.976 1.54e-15 ***
GrazingUngrazed  -17.060      7.404  -2.304   0.0268 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.41 on 38 degrees of freedom
Multiple R-squared: 0.1226,            Adjusted R-squared: 0.09949
F-statistic: 5.309 on 1 and 38 DF,  p-value: 0.02678

It's worth noting that analysis of variance when a categorical predictor has two levels is identical to carrying out a two sample t-test. We can carry out a t-test in R with the t.test function.

# t-test with separate variances
t.test(Fruit ~ Grazing, data = ipo)
            Welch Two Sample t-test

data:  Fruit by Grazing
t = 2.304, df = 37.306, p-value = 0.02689
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.061464 32.058536
sample estimates:
  mean in group Grazed mean in group Ungrazed
               67.9405                50.8805

# t-test with pooled variances = ANOVA
t.test(Fruit ~ Grazing, data = ipo, var.equal=T)
            Two Sample t-test

data:  Fruit by Grazing
t = 2.304, df = 38, p-value = 0.02678
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.070631 32.049369
sample estimates:
  mean in group Grazed mean in group Ungrazed
               67.9405                50.8805

The ANOVA result is identical to the pooled-variances t-test obtained by specifying var.equal=T. The separate variances t-test is equivalent to the variance heterogeneity ANOVA model like the one we fit using gls with a weights argument in lecture 3.
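
As a point of reference, the separate-variances model can be fit explicitly with gls from the nlme package. The call below is a sketch based on the lecture 3 approach and is not part of the original analysis; the estimate and standard error of the grazing effect should agree closely with the Welch t-test above.

library(nlme)
# allow a different residual variance in each grazing group
out0.gls <- gls(Fruit ~ Grazing, data=ipo, weights=varIdent(form=~1|Grazing))
summary(out0.gls)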

If we plot the distribution of fruit mass in the two groups we see that the distributions are consistent with the analytical results. Grazed plants have a higher average fruit production than do ungrazed plants.

boxplot(Fruit~Grazing, data=ipo, ylab='Fruit mass (mg)')


Fig. 1  The effect of grazing on fruit mass

So far we've been ignoring the covariate root size that was also measured as part of the experiment. In Fig. 2 I plot fruit mass versus root size color coding the points according to their grazing status. I use the numeric values of the Grazing factor, 1 and 2, as the argument to col= to specify the color codes 1 and 2. I do the same to get different plotting symbols but this time using 1 and 2 to select either the first or the second element of the vector mypch that specifies open and filled circles.

mypch <- c(1,16)
plot(Fruit~Root, data=ipo, col=as.numeric(Grazing), pch=mypch[as.numeric(Grazing)])
legend('topleft', levels(ipo$Grazing), col=1:2, pch=mypch, cex=.9, bty='n')


Fig. 2  The effect of grazing on fruit mass while controlling for root size

Observe that the root size distributions are different in the two treatment groups. The plants in the grazed group tend to have larger root sizes than those in the ungrazed group. Also notice that there is a strong linear relationship between fruit production and root size. Thus it seems possible that by not accounting for root size in our analysis, the estimation of the treatment effect has been compromised.
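
A quick numerical check of this claim (not part of the original analysis) is to compare the root size distributions directly.

# mean root size by grazing treatment
tapply(ipo$Root, ipo$Grazing, mean)
# a formal test of the difference in root size between the treatment groups
t.test(Root ~ Grazing, data=ipo)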

A poor way to control for root size—create a ratio variable

In the biological literature a method that is often used to adjust for a potentially confounding variable such as root size is to create a new response variable that is the ratio of the original response variable and the confounder. Using this approach here we would analyze the ratio of fruit production to root size rather than fruit production. If we do so we find that the fruit production to root size ratio is not significantly different between the two grazing groups.

# change response variable to a ratio
out0a <- lm(Fruit/Root~Grazing, data=ipo)
# grazing effect is not significant
anova(out0a)
Analysis of Variance Table

Response: Fruit/Root
          Df  Sum Sq Mean Sq F value Pr(>F)
Grazing    1   0.177  0.1773  0.0306  0.862
Residuals 38 219.994  5.7893               

# grazing has negative effect but is not significant
summary(out0a)
Call:
lm(formula = Fruit/Root ~ Grazing, data = ipo)

Residuals:
    Min      1Q  Median      3Q     Max
-5.4980 -1.6605  0.7814  1.5884  3.4426

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)       7.9464     0.5380  14.770   <2e-16 ***
GrazingUngrazed   0.1332     0.7609   0.175    0.862    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.406 on 38 degrees of freedom
Multiple R-squared: 0.0008054,     Adjusted R-squared: -0.02549
F-statistic: 0.03063 on 1 and 38 DF,  p-value: 0.862

Notice that although the point estimate of the grazing effect on the ratio is not significant, it is positive, suggesting that ungrazed plants have a larger fruit to root ratio than do grazed plants. This is the reverse of the direction we found when not accounting for root size.

In addition to creating a variable that is harder to interpret than the original variable, the use of a ratio response makes some unwarranted assumptions about the relationship of fruit production to root size. We'll examine these assumptions below after we consider a better way to adjust for root size—analysis of covariance.

A better way to control for root size—analysis of covariance

Notice that in Fig. 2 the two treatment groups are completely segregated, each lying within a distinct band. At any root size where the treatment groups overlap, the fruit production is seen to be higher for the ungrazed group than for the grazed group. This suggests that there is a treatment effect, although not as simple as an ordinary mean difference between the groups. Given that a linear relationship between fruit mass and root size appears to be appropriate for both treatment groups, there are three possible models to consider here.

Fig. 3  Three possible regression models for the effect of grazing on fruit production

Let Y = fruit mass, X = root size, and Z = a dummy variable for the grazing treatment, with Z = 1 for ungrazed plants and Z = 0 for grazed plants.

The three models can be expressed as follows.

Model 1 (single line):        Y = β0 + β1X + ε
Model 2 (parallel lines):     Y = β0 + β1X + β2Z + ε
Model 3 (non-parallel lines): Y = β0 + β1X + β2Z + β3XZ + ε

Regression models with categorical variables can also be written as multiple linear equations where there is a separate equation for each category of the categorical variable. A regression model with a single dummy variable yields two different equations, one for each value of the dummy variable. To see the connection between the three models described above and Fig. 3, replace the dummy variable Z by its values 0 and 1.
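
To see the dummy variable R creates for the Grazing factor, and hence the design matrix behind these equations, one can inspect the contrasts and the model matrix. This is an added illustration, not part of the original analysis.

# how R codes Grazing with the default treatment contrasts: Grazed = 0, Ungrazed = 1
contrasts(ipo$Grazing)
# first rows of the design matrix for the additive (parallel lines) model
head(model.matrix(~ Root + Grazing, data=ipo))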

Model 2: Parallel lines

Grazed (Z = 0):   Y = β0 + β1X
Ungrazed (Z = 1): Y = (β0 + β2) + β1X

We have the equations of two lines that have the same slope, β1, but different intercepts. The parameter β2 measures how much the intercepts differ. It also corresponds to the vertical distance between the parallel lines in Fig. 3b.

Model 3: Non-parallel lines

Grazed (Z = 0):   Y = β0 + β1X
Ungrazed (Z = 1): Y = (β0 + β2) + (β1 + β3)X

Here we have the equations of two lines with different intercepts and different slopes. The parameter β2 measures how much the intercepts differ. The parameter β3 measures how much the slopes differ.

I fit a sequence of models starting with the single line model, then the additive model, and finally the full interaction model. The additive model is the analysis of covariance model in this example. When we examine model 1 in which root is the only predictor, we see that there is a significant positive effect due to root size.

out1 <- lm(Fruit~Root, data=ipo)
round(summary(out1)$coefficients,3)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -41.286     10.723  -3.850        0
Root          14.022      1.463   9.584        0

Next we turn to model 2, the parallel lines model.

out2 <- lm(Fruit~Root+Grazing, data=ipo)
anova(out2)
Analysis of Variance Table

Response: Fruit
          Df  Sum Sq Mean Sq F value    Pr(>F)   
Root       1 16795.0 16795.0  368.91 < 2.2e-16 ***
Grazing    1  5264.4  5264.4  115.63 6.107e-13 ***
Residuals 37  1684.5    45.5                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the output we see that having controlled for root size, there is a significant grazing effect on fruit production. To see the nature of that effect we can examine the summary table of the model.

# now grazing has a significant negative effect
summary(out2)
Call:
lm(formula = Fruit ~ Root + Grazing, data = ipo)

Residuals:
     Min       1Q   Median       3Q      Max
-17.1920  -2.8224   0.3223   3.9144  17.3290

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)     -127.829      9.664  -13.23 1.35e-15 ***
Root              23.560      1.149   20.51  < 2e-16 ***
GrazingUngrazed   36.103      3.357   10.75 6.11e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared: 0.9291,            Adjusted R-squared: 0.9252
F-statistic: 242.3 on 2 and 37 DF,  p-value: < 2.2e-16

According to model 2, ungrazed plants produce on average 36.1 mg more fruit than grazed plants. Finally I consider model 3, the non-parallel lines model.

# examine interaction model
out3 <- lm(Fruit~Root*Grazing, data=ipo)
# test whether slopes are the same
anova(out3)
Analysis of Variance Table

Response: Fruit
             Df  Sum Sq Mean Sq  F value    Pr(>F)   
Root          1 16795.0 16795.0 359.9681 < 2.2e-16 ***
Grazing       1  5264.4  5264.4 112.8316 1.209e-12 ***
Root:Grazing  1     4.8     4.8   0.1031      0.75   
Residuals    36  1679.6    46.7                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
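
An equivalent way to test whether the slopes differ (an added check, not shown in the original) is to compare the two nested models directly with anova; the F statistic and p-value match the Root:Grazing row above.

# compare the parallel lines model with the interaction model
anova(out2, out3)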

The interaction is not statistically significant, so we can stick with the parallel lines model: there is no evidence that the linear relationship between fruit mass and root size differs between the two grazing groups. I re-examine the coefficients of the parallel lines model.

round(summary(out2)$coefficients,3)
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -127.829      9.664 -13.227        0
Root              23.560      1.149  20.509        0
GrazingUngrazed   36.103      3.357  10.753        0

The parallel lines model is the classic analysis of covariance model. In this model the grazing effect term adds to the intercept, so the intercept of the fruit-root relationship is 36.1 units larger for ungrazed plants than it is for grazed plants. Because the lines are parallel, 36.1 is also the vertical distance between the lines at any root size. So we can say that ungrazed plants produce on average 36.1 mg more fruit than grazed plants after controlling for the size of the plant (based on the size of their root). This result is the exact opposite of the one we obtained when we did not control for the confounding effects of root size. The utility of the analysis of covariance model is that it allows us to statistically control for systematic differences between groups so that we can validly compare the means of the groups.
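
If an interval estimate of the adjusted grazing effect is wanted, confint can be applied to the parallel lines model (an added illustration, not in the original).

# 95% confidence interval for the adjusted grazing effect
confint(out2)['GrazingUngrazed', ]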

Graphing the analysis of covariance model

I graph the raw data and superimpose the best model, the parallel lines model (model 2). A convenient way to superimpose functions on already plotted data is with the curve function. The first argument of curve should be a function of the variable x. To force the curve function to add the graph of the function to the currently active graphics window we need to include the add=T argument.

# write function that yields ANCOVA regression model
coef(out2)
    (Intercept)            Root GrazingUngrazed
     -127.82936        23.56005        36.10325
myfunc <- function(x,z) coef(out2)[1] + coef(out2)[2]*x + coef(out2)[3]*z
# plot models separately by grazing status
mypch <- c(1,16)
plot(Fruit~Root, data=ipo, col=as.numeric(Grazing), pch=mypch[as.numeric(Grazing)])
# grazed group
curve(myfunc(x,0), col=1, lty=2, add=T)
# ungrazed group
curve(myfunc(x,1), col=2, lty=2, add=T)
legend('topleft', levels(ipo$Grazing), col=1:2, pch=mypch, cex=.9, bty='n')
arrows(6.7, myfunc(6.7,0), 6.7, myfunc(6.7,1), code=3, angle=45, length=.07, col=4, lwd=2)
text(6.78, 44, expression(beta[2]==36.1), pos=2, cex=.95, col=4)


Fig. 4  Analysis of covariance model showing the treatment effect β2

The estimate of the grazing effect on fruit mass corresponds to the vertical distance between the parallel lines in Fig. 4. Generally speaking when graphing regression lines we should not extrapolate beyond the range of the data. I redraw Fig. 4 and truncate the regression lines accordingly using the minimum and maximum values of root size in each grazing group. The curve function has from= and to= arguments that allow you to specify x-axis limits in drawing the function.

#redo plot so that regression lines do not extend beyond data
plot(Fruit~Root, data=ipo, col=as.numeric(Grazing), pch=mypch[as.numeric(Grazing)])
curve(myfunc(x,0) ,col=1, lty=2, add=T, from=min(ipo$Root[ipo$Grazing=='Grazed']), to=max(ipo$Root[ipo$Grazing=='Grazed']))
curve(myfunc(x,1), col=2, lty=2, add=T, from=min(ipo$Root[ipo$Grazing=='Ungrazed']), to=max(ipo$Root[ipo$Grazing=='Ungrazed']))
legend('topleft', levels(ipo$Grazing), col=1:2, pch=mypch, cex=.9, bty='n')


Fig. 5  Analysis of covariance model with regression lines truncated to the range of data

The problem with ratios

Having adjusted for root size by using the analysis of covariance model let's revisit the regression model that used the ratio of fruit mass to root size as the response variable and see what's wrong with it. If Y is fruit mass, X is root size, and Z is a dummy variable indicating grazing status, then the regression model we fit for the ratio of fruit mass to root size is the following.

Y/X = γ0 + γ1Z + ε     (1)

where ε is the usual error term. I've used different symbols for the regression coefficients so as not to confuse them with the analysis of covariance model above. Multiply both sides of the equation by X, the root size variable.

Y = γ0X + γ1XZ + ε*     (2)

where ε* is just my notation for the new error term. (Note: Because ε is multiplied by X this model assumes that the standard deviation of the residuals is not constant but varies with the magnitude of X. So, we have a heteroscedastic error model.) Compare this last equation with model 3, the non-parallel lines model, described above.

Y = β0 + β1X + β2Z + β3XZ + ε     (3)

There are two obvious differences between the models shown in eqns (2) and (3).

  1. The first and third terms on the right side of eqn (3) are missing in the ratio model of eqn (2). These terms together define the intercepts of the two lines that correspond to the two grazing treatments. Because the term β2Z is missing in eqn (2) the ratio model assumes the intercepts in a plot of Y versus X are the same for the two grazing treatments. Because β0 is also missing in eqn (2) the ratio model assumes that both of the intercepts are zero! At first glance this may not seem unreasonable; after all, if the root size is zero then the fruit production should also be zero. But the ratio model is actually making a far bigger assumption here. It assumes that the linear relationship that is apparent in Fig. 2 extends beyond the range of the data all the way to the origin. We have no evidence of that. All we know for sure is that over the range of the data, for which the root size is between 5 and 10, the relationship appears to be linear. Over the root size range from 0 to 10 the relationship could be curvilinear.
  2. In the ratio model of eqn (1) a treatment difference corresponds to γ1 ≠ 0. But in eqn (2) we see that γ1 causes the slope of the fruit mass to root size relationship to be different depending on the grazing treatment. Thus when we fit a ratio model and look for a treatment effect we're testing for a difference in these slopes. In the analysis of covariance model the slopes need to be the same in the two treatment groups for the notion of a treatment effect to make sense. (We tested for this by fitting the interaction model and verifying that the interaction is not significant.) When the slopes are the same the effect of grazing on fruit production is the same regardless of the size of the plant. The treatment effect is the vertical distance between the lines. So the ratio model and the analysis of covariance model find a "treatment effect" in completely different portions of the fruit mass versus root size linear relationship. For the ratio model a treatment effect is a difference in slopes. For the analysis of covariance model a treatment effect is a difference in intercepts (after verifying that the slopes are the same). If the slopes are not the same in the analysis of covariance model then there is not a fixed treatment effect; it varies with root size.

Fig. 6 shows the analysis of covariance model of Fig. 4 and superimposes the ratio model of eqn (2) using the coefficients estimated from eqn (1). The badness of the ratio model is readily apparent.

myfunc <- function(x,z) coef(out2)[1] + coef(out2)[2]*x + coef(out2)[3]*z
mypch <- c(1,16)
plot(Fruit~Root, data=ipo, col=as.numeric(Grazing), pch=mypch[as.numeric(Grazing)])
curve(myfunc(x,0), col=1, lty=2, add=T, lwd=2)
curve(myfunc(x,1), col=2, lty=2, add=T, lwd=2)
# add ratio model
curve(coef(out0a)[1]*x, col='pink2', add=T ,lty=3, lwd=2)
curve((coef(out0a)[1]+coef(out0a)[2])*x, col='grey70', add=T, lty=3, lwd=2)
legend('topleft', c(levels(ipo$Grazing), 'ANCOVA', 'ratio'), col=c(1,2,1,2), pch=c(mypch,NA,NA), lty=c(NA,NA,2,3), lwd=c(1,1,2,2), cex=.9, bty='n')


Fig. 6  Analysis of covariance model compared with ratio regression model

Fig. 6 is a bit unfair because fitting the ratio model in eqn (1) using least squares does not yield the same parameter estimates as fitting eqn (2) using least squares (although they are close). Basically the difference is that the mean of a ratio is not equal to the ratio of the means. Still, dividing a response variable by a covariate in an attempt to "standardize" it is generally a bad idea. Ratio variables cause problems in ordinary regression because their distribution is typically non-normal. Even worse, the use of ratios has the potential of inducing spurious correlations between the variables of interest. Recognition of the problems posed by ratio variables dates back to the beginning of statistics itself (Pearson 1897). Over the years researchers in applied disciplines have repeatedly "rediscovered" the ratio problem and have published criticisms, discipline by discipline, of using ratio variables in regression instead of analysis of covariance.

Here are two excerpts of the criticisms found in these papers. From Beaupre and Dunham (1995), p. 880:

(The) dangers (of using ratios) include: (1) identification of spurious relationships which may lead to erroneous biological interpretation, (2) false identification of effects as significant, (3) failure to identify significant effects and (4) potentially large errors in ... estimation.

… We believe that our reanalysis suggests that these problems are far worse than most appreciate. We urge physiologists to abandon analyses of ratio-based indices in favour of ANCOVA.

From Packard and Boardman (1999), p. 1:

Ecological physiologists commonly divide individual values for variables of interest by corresponding measures for body size to adjust (or scale) data that vary in magnitude or intensity with body size of the animals being studied. These ratios are formed in an attempt to increase the precision of data gathered in planned experiments or to remove the confounding effects of body size from descriptive studies...

This procedure for adjusting data is based on the implicit, albeit critical, assumption that the variable of interest varies isometrically with body size. Isometry occurs whenever a plot of a physiological variable (y) against a measure of body size (x) yields a straight line that passes through the origin of a graph with linear coordinates. Alternatively allometry obtains whenever such a plot yields either a curved line or a straight line that intersects the Y-axis at some point other than zero. Most physiological variables change allometrically with body size (Smith 1984), so the assumption of isometry that is fundamental to the use of ratios for scaling data seldom is satisfied ... statistical analyses of ratios lead  to conclusions that are inconsistent with impressions gained from visual examinations of data displayed in bivariate plots. In comparison, analyses of covariance lead to conclusions that agree with impressions gained from these same plots. We therefore recommend that ecological physiologists discontinue using ratios to scale data and that they use the ANCOVA instead.

The opinion of all of these authors is that ratios should almost never be used for statistical control purposes. Their most important objection is that ratios can correctly adjust for the effect of the covariate only when the response and the covariate are linearly related with an intercept of zero. If the intercept is not zero or the relationship is nonlinear (or linear only over a short range), regression with a ratio response can yield spurious results. While the use of ratios is not always inappropriate, if the ratio itself is not of primary interest but is being used only to adjust the value of the response, then analysis of covariance is a far better approach.
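
As an aside, the claim above that fitting eqn (1) and eqn (2) by least squares gives similar but not identical estimates can be checked directly. The sketch below is not part of the original analysis: it fits eqn (2) as a regression through the origin and ignores the heteroscedastic error structure implied by eqn (2).

# fit eqn (2): no intercept, with a separate slope for each grazing treatment
out.eqn2 <- lm(Fruit ~ Root:Grazing - 1, data=ipo)
coef(out.eqn2)
# slopes implied by the ratio model of eqn (1): gamma0 and gamma0 + gamma1
c(coef(out0a)[1], coef(out0a)[1] + coef(out0a)[2])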

Comparing regression models across groups

Forrester and Steele (2004) studied the effect of competition and resource availability on the mortality of gobies. Resource availability for gobies was defined to be the density of available refuges that allow gobies to hide from their potential predators. It is expected that mortality will increase as population density increases because of intraspecific competition for refuges. Because each refuge is only used by a single goby, the effect of increasing population density should be lower when the availability of refuges is high. The experiment was conducted in the natural habitat of gobies where refuge density was classified into three categories: low, medium, and high. Six replicates of each of these refuge density categories were obtained and for each replicate a different goby density was maintained. Using a detailed census of the resident fish population, mortality was recorded as percent survival. Because percentages are bounded on the interval 0 to 100 treating them as normally distributed variables can lead to nonsensical results. Forrester and Steele (2004) chose to ignore these complications and for today's lecture we will too.

Unlike the Ipomopsis example considered previously where the focus was to compare means across groups and regression was used only to account for a confounding variable, here the focus is on the regression equation itself. The question of interest is whether the relationship between goby mortality and goby density is different for different categories of refuge density. In the Ipomopsis example we had to first rule out the presence of an interaction between the continuous predictor "root" and the categorical treatment variable "grazing" in order to estimate a treatment effect. In the goby experiment the interaction between the continuous predictor goby density and the categorical predictor refuge density was the focus of the analysis. If a significant interaction is present it means that the relationship between goby mortality and goby density is different for different refuge densities.

goby <- read.delim('ecol 563/goby.txt')
plot(mortality~density, col=refuge, data=goby, pch=c(1,15,16)[refuge], xlab='Goby density')
legend('topleft', legend=1:3, col=unique(goby$refuge), pch=c(1,15,16), title='Refuge density', cex=.9, bty='n')


Fig. 7  Effect of goby density and refuge density on goby mortality

The graph would seem to indicate that the goby mortality-goby density relationship varies with refuge density. I fit a model that includes an interaction between refuge density and goby density.

mod1 <- lm(mortality ~ density*factor(refuge), data=goby)
anova(mod1)
Analysis of Variance Table

Response: mortality
                       Df Sum Sq Mean Sq F value    Pr(>F)   
density                 1 4.1659  4.1659 49.5146 1.363e-05 ***
factor(refuge)          2 1.2549  0.6274  7.4577 0.0078540 **
density:factor(refuge)  2 2.4788  1.2394 14.7313 0.0005877 ***
Residuals              12 1.0096  0.0841                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The interaction is statistically significant. The regression equation that was fit is the following. If x = goby density,

z2 = 1 if refuge = 2 (and 0 otherwise), and z3 = 1 if refuge = 3 (and 0 otherwise),

then the predicted mean is

mean mortality = β0 + β1x + β2z2 + β3z3 + β4xz2 + β5xz3

This reduces to three regression models, one for each refuge type.

Table 1   Regression model with a continuous predictor and a categorical predictor with three levels

refuge   z2   z3   model for mean mortality
1        0    0    β0 + β1x
2        1    0    (β0 + β2) + (β1 + β4)x
3        0    1    (β0 + β3) + (β1 + β5)x

Next we examine the summary table of the model.

summary(mod1)
Call:
lm(formula = mortality ~ density * factor(refuge))

Residuals:
     Min       1Q   Median       3Q      Max
-0.47439 -0.10553  0.06187  0.15585  0.35897

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)   
(Intercept)               0.0843     0.3009   0.280 0.784129   
density                   1.0756     0.1581   6.805 1.89e-05 ***
factor(refuge)2           1.0493     0.3564   2.944 0.012288 * 
factor(refuge)3           1.4016     0.4021   3.486 0.004500 **
density:factor(refuge)2  -0.6269     0.1762  -3.558 0.003938 ** 
density:factor(refuge)3  -1.1029     0.2035  -5.419 0.000155 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2901 on 12 degrees of freedom
Multiple R-squared: 0.8867,  Adjusted R-squared: 0.8395
F-statistic: 18.78 on 5 and 12 DF,  p-value: 2.664e-05

From the summary table we see that β4 ≠ 0 (p = 0.004), so the slopes for refuge = 1 and refuge = 2 differ. Because the point estimate of β4 is negative, the slope for refuge = 2 is less than the slope for refuge = 1. Similarly, β5 ≠ 0 (p = 0.0002), so the slopes for refuge = 1 and refuge = 3 differ, and because the point estimate of β5 is negative, the slope for refuge = 3 is less than the slope for refuge = 1.
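
The three fitted slopes can be recovered directly from the interaction model coefficients (an added illustration); they should match the separate slopes reported for the reparameterized model later in this lecture.

# slope for refuge = 1, plus the refuge = 2 and refuge = 3 slopes obtained by adding the interaction terms
slopes <- coef(mod1)[2] + c(0, coef(mod1)[5], coef(mod1)[6])
names(slopes) <- paste('refuge', 1:3)
slopes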

To compare the slopes for refuge = 2 and refuge =3 we can refit the model choosing a different reference level for refuge. I make refuge = 3 the reference level in the run below.

mod1.1 <- lm(mortality~density*factor(refuge, levels=3:1), data=goby)
summary(mod1.1)
Call:
lm(formula = mortality ~ density * factor(refuge, levels = 3:1))

Residuals:
     Min       1Q   Median       3Q      Max
-0.47439 -0.10553  0.06187  0.15585  0.35897

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)   
(Intercept)                            1.48587    0.26674   5.571 0.000122 ***
density                               -0.02728    0.12818  -0.213 0.835038   
factor(refuge, levels = 3:1)2         -0.35231    0.32811  -1.074 0.304049   
factor(refuge, levels = 3:1)1         -1.40157    0.40211  -3.486 0.004500 **
density:factor(refuge, levels = 3:1)2  0.47603    0.14996   3.174 0.008003 ** 
density:factor(refuge, levels = 3:1)1  1.10292    0.20351   5.419 0.000155 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2901 on 12 degrees of freedom
Multiple R-squared: 0.8867,  Adjusted R-squared: 0.8395
F-statistic: 18.78 on 5 and 12 DF,  p-value: 2.664e-05

With refuge = 3 as the reference level, a test of H0: β4 = 0 is a test of whether the slopes for refuge = 2 and refuge = 3 are the same. From the output we conclude β4 ≠ 0 (p = 0.008). The point estimate is positive so we conclude that the slope for refuge = 2 exceeds the slope for refuge = 3.
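
An equivalent way to change the reference level (an added alternative; refuge.f and mod1.1b are names introduced here just for illustration) is the relevel function.

# make refuge = 3 the reference level with relevel
goby$refuge.f <- relevel(factor(goby$refuge), ref='3')
mod1.1b <- lm(mortality ~ density*refuge.f, data=goby)
round(summary(mod1.1b)$coefficients, 3)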

To graph the model we can use the abline function applied to lm models fit separately for different values of refuge. The abline function extracts the intercept and slope from the lm model object and draws the regression line.

# use separate regressions to plot the lines
plot(mortality~density, col=refuge, data=goby, pch=c(1,15,16)[refuge], xlab='Goby density')
legend('topleft', legend=1:3, col=unique(goby$refuge), pch=c(1,15,16), title='Refuge density', cex=.9, bty='n')
abline(lm(mortality ~ density, data=goby[goby$refuge==1,]), col=1, lty=2)
abline(lm(mortality ~ density, data=goby[goby$refuge==2,]), col=2, lty=2)
abline(lm(mortality ~ density, data=goby[goby$refuge==3,]), col=3, lty=2)


Fig. 8  Regression of mortality on goby density separately by refuge density

Alternatively we can produce Fig. 8 by extracting the intercepts and slopes ourselves from the full interaction model and then providing them to abline to plot the lines.

# use the full model to plot the lines
plot(mortality~density, col=refuge, data=goby, pch=c(1,15,16)[refuge], xlab='Goby density')
legend('topleft', legend=1:3, col=unique(goby$refuge), pch=c(1,15,16), title='Refuge density', cex=.9, bty='n')
abline(coef(mod1)[1], coef(mod1)[2], col=1, lty=2)
abline(coef(mod1)[1] + coef(mod1)[3], coef(mod1)[2] + coef(mod1)[5], col=2, lty=2)
abline(coef(mod1)[1] + coef(mod1)[4], coef(mod1)[2] + coef(mod1)[6], col=3, lty=2)

Unfortunately abline does not have an option that permits restricting the regression line to the range of the data. For that we could use the curve function like we did in Fig. 5 for the Ipomopsis example. Alternatively we can use the predict function with the newdata argument to obtain the coordinates of the endpoints of the regression line. The range function can be used to calculate the minimum and maximum values of goby density for a given refuge category and the predict function used to obtain the model predictions at those values. I assemble all of this as a function whose sole argument is the refuge density category.

graph.lines <- function(refuge){
# create data frame containing range of goby density for a given refuge
newdat <- data.frame(density=range(goby$density[goby$refuge==refuge]), refuge=c(refuge,refuge))
# convert refuge to a factor
newdat$refuge <- factor(newdat$refuge)
# obtain model predictions
pred.range <- predict(mod1, newdata=newdat)
# add line
lines(newdat$density, pred.range, lty=2, col=refuge)
}

Using the function I add the regression lines to a scatter plot of the data.

plot(mortality~density, col=refuge, data=goby, pch=c(1,15,16)[refuge], xlab='Goby density')
legend('topleft', legend=1:3, col=unique(goby$refuge), pch=c(1,15,16), title='Refuge density', cex=.9, bty='n', lty=2)
graph.lines(1)
graph.lines(2)
graph.lines(3)

Fig. 9   Regression of goby mortality on goby density separately by refuge density in which regression lines are restricted to the range of the data

Assessing treatment differences in the presence of an interaction

If we attempt to use analysis of covariance to control for a continuous covariate and it turns out that there is a significant interaction between the treatment and the covariate then there is no unique treatment effect. The magnitude (and perhaps the direction) of the treatment effect varies with the value of the continuous covariate. So, in this case we might just summarize the results by reporting the different slopes of the regression lines for the different treatments. Still, we might be interested in knowing if there is a range of covariate values for which a significant treatment effect did occur and perhaps obtain a ranking of the treatments over that range. For example, in Fig. 9 it appears that on the left side of the diagram refuge density doesn't matter, but on the right side of Fig. 9 the percent mortality for different values of refuge can be ranked as follows: mortality(refuge = 1) > mortality(refuge = 2) > mortality(refuge = 3). Can we quantify this result and perhaps summarize it graphically?

Using the trick that was discussed previously, I begin by reparameterizing the model so that the parameter estimates we obtain are estimates of the individual slopes and intercepts.

mod1a <- lm(mortality ~ density:factor(refuge) + factor(refuge)-1, data=goby)
summary(mod1a)
Call:
lm(formula = mortality ~ density:factor(refuge) + factor(refuge) -
    1)

Residuals:
     Min       1Q   Median       3Q      Max
-0.47439 -0.10553  0.06187  0.15585  0.35897

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
factor(refuge)1          0.08430    0.30090   0.280 0.784129   
factor(refuge)2          1.13355    0.19108   5.932 6.90e-05 ***
factor(refuge)3          1.48587    0.26674   5.571 0.000122 ***
density:factor(refuge)1  1.07564    0.15807   6.805 1.89e-05 ***
density:factor(refuge)2  0.44875    0.07782   5.767 8.93e-05 ***
density:factor(refuge)3 -0.02728    0.12818  -0.213 0.835038   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2901 on 12 degrees of freedom
Multiple R-squared: 0.985,   Adjusted R-squared: 0.9775
F-statistic: 131.2 on 6 and 12 DF,  p-value: 3.13e-10

Based on the labels of the vector of regression coefficients, I write a function that constructs the appropriate vector of regressors corresponding to given values of density and refuge.

coef(mod1a)
        factor(refuge)1         factor(refuge)2         factor(refuge)3
             0.08430008              1.13355232              1.48586650
density:factor(refuge)1 density:factor(refuge)2 density:factor(refuge)3
             1.07563805              0.44875123             -0.02728036
myvec <- function(r,x) c(r==1, r==2, r==3, x*(r==1), x*(r==2), x*(r==3))

Our goal is to calculate the confidence level that allows individual overlapping confidence intervals to be used to graphically assess treatment differences separately at each value of goby density. To accomplish this we need the ci.func function that was first presented in lecture 5.

# function to calculate difference-adjusted confidence intervals
ci.func <- function(rowvals, lm.model, glm.vmat) {
nor.func1a <- function(alpha, model, sig) 1-pt(-qt(1-alpha/2, model$df.residual) * sum(sqrt(diag(sig))) / sqrt(c(1,-1) %*% sig %*%c (1,-1)), model$df.residual) - pt(qt(1-alpha/2, model$df.residual) * sum(sqrt(diag(sig))) / sqrt(c(1,-1) %*% sig %*% c(1,-1)), model$df.residual, lower.tail=F)
nor.func2 <- function(a,model,sigma) nor.func1a(a,model,sigma)-.95
n <- length(rowvals)
xvec1b <- numeric(n*(n-1)/2)
vmat <- glm.vmat[rowvals,rowvals]
ind <- 1
for(i in 1:(n-1)) {
for(j in (i+1):n){
sig <- vmat[c(i,j), c(i,j)]
#solve for alpha
xvec1b[ind] <- uniroot(function(x) nor.func2(x, lm.model, sig), c(.001,.999))$root
ind <- ind+1
}}
1-xvec1b
}

This function requires the variance-covariance matrix of the means. I illustrate how to create it using one of the observed goby densities for refuge = 1.

goby$density[goby$refuge==1]
[1] 1.1250 0.7500 1.5625 1.6250 2.5625 2.8750

I build a data frame consisting of the three values of refuge density with goby density = 1.125.

mydat <- data.frame(r=1:3, x=1.125)
mydat
  r     x
1 1 1.125
2 2 1.125
3 3 1.125

I construct a vector of regressors for each of the three observations in the data frame and assemble them in a matrix.

cmat <- t(apply(mydat, 1, function(x) myvec(x[1],x[2])))
cmat
     r r r     x     x     x
[1,] 1 0 0 1.125 0.000 0.000
[2,] 0 1 0 0.000 1.125 0.000
[3,] 0 0 1 0.000 0.000 1.125

I obtain the variance-covariance matrix of the means and use it as input to the ci.func function to obtain the difference-adjusted confidence levels needed to make pairwise comparisons between the three refuge category means.

vmat1 <- cmat%*%vcov(mod1a)%*%t(cmat)
ci.func(1:3, mod1a, vmat1)
[1] 0.8515959 0.8506829 0.8513936

I assemble the above steps in a function so that we can obtain the difference-adjusted confidence levels using any goby density value.

my.ci <- function(x){
mydat <- data.frame(r=1:3, x=x)
cmat <- t(apply(mydat, 1, function(x) myvec(x[1],x[2])))
vmat1 <- cmat %*% vcov(mod1a) %*% t(cmat)
ci.func(1:3, mod1a, vmat1)
}

Finally I use this function on each of the different goby density values using sapply to apply the function separately to each density value.

sort(unique(goby$density))
 [1] 0.1875 0.5625 0.7500 1.0000 1.1250 1.1875 1.2500 1.4375 1.5625 1.6250 1.7500
[12] 1.8125 2.3125 2.5625 2.8125 2.8750 3.6875 4.7500
ci.values <- sapply(sort(unique(goby$density)), my.ci)
range(ci.values)
[1] 0.8506703 0.8687014

The range of confidence levels reported is from 0.85 to 0.87 so there isn't much variability. Clearly we can get by with using just one value. I add 95% and 85% confidence intervals for the mean to each point on the regression line that corresponds to an observed goby density. To illustrate the steps I start by doing just refuge = 1.

par(lend=1)
# set up the graph
plot(mortality~density, data=goby, ylim=c(0,4), type='n', xlab='Goby density')
# obtain the goby density values for refuge = 1
cur.x <- sort(unique(goby$density[goby$refuge==1]))
cur.x
[1] 0.7500 1.1250 1.5625 1.6250 2.5625 2.8750
# obtain values for the regressors
my.c <- t(sapply(cur.x, function(x) myvec(1,x)))
my.c
     [,1] [,2] [,3]   [,4] [,5] [,6]
[1,]    1    0    0 0.7500    0    0
[2,]    1    0    0 1.1250    0    0
[3,]    1    0    0 1.5625    0    0
[4,]    1    0    0 1.6250    0    0
[5,]    1    0    0 2.5625    0    0
[6,]    1    0    0 2.8750    0    0
# calculate the means
cur.y <- my.c%*%coef(mod1a)
cur.y
          [,1]
[1,] 0.8910286
[2,] 1.2943929
[3,] 1.7649845
[4,] 1.8322119
[5,] 2.8406226
[6,] 3.1767595
# obtain standard errors of the mean
cur.se <- sqrt(diag(my.c %*% vcov(mod1a)%*% t(my.c)))
cur.se
[1] 0.1975068 0.1542166 0.1220689 0.1200533 0.1746923 0.2136492

Using the means and standard errors I plot the regression line, 95% confidence intervals of the mean at each goby density value for refuge = 1, and the 85% difference-adjusted intervals.

lines(cur.x, cur.y, col=2)
segments(cur.x, cur.y + qt(.025,mod1a$df.residual)*cur.se, cur.x, cur.y+qt(.975, mod1a$df.residual)*cur.se, col='pink2')
segments(cur.x, cur.y + qt((1-.85)/2, mod1a$df.residual)*cur.se, cur.x, cur.y + qt(1-(1-.85)/2, mod1a$df.residual)*cur.se, col=2, lwd=3)

Fig. 10   Preliminary graph showing confidence intervals for refuge = 1

Next I assemble things into a function whose argument is the value of refuge. I include two color arguments: one specifies the color of the 95% confidence intervals, the other the color of the regression line and the difference-adjusted intervals.

# function to draw regression line and confidence intervals
plot.func <- function(r, col1, col2) {
cur.x <- sort(unique(goby$density[goby$refuge==r]))
my.c <- t(sapply(cur.x, function(x) myvec(r,x)))
cur.y <- my.c%*%coef(mod1a)
cur.se <- sqrt(diag(my.c %*% vcov(mod1a)%*% t(my.c)))
lines(cur.x, cur.y, col=col2)
segments(cur.x, cur.y + qt(.025,mod1a$df.residual)*cur.se, cur.x, cur.y+qt(.975, mod1a$df.residual)*cur.se, col=col1)
segments(cur.x, cur.y + qt((1-.86)/2,mod1a$df.residual)*cur.se, cur.x, cur.y + qt(1-(1-.86)/2, mod1a$df.residual)*cur.se, col=col2, lwd=3)
}

Finally I draw the graph.

plot(mortality~density, data=goby, ylim=c(0,4), type='n', xlab='Goby density')
par(lend=1)
plot.func(1, 'pink2', 2)
plot.func(2, 'dodgerblue1', 'dodgerblue4')
plot.func(3, 'seagreen3', 'seagreen4')
legend('topleft', legend=1:3, col=c(2, 'dodgerblue4', 'seagreen4'), title='Refuge density', cex=.9, bty='n', lty=1, lwd=1)

Fig. 11  Graph to compare mean mortality across refuge level for specific values of goby density

What we can see from Fig. 11 is that when goby density > 1.8 (approximately), refuge = 3 has a significantly lower mean mortality than do the other two values of refuge. The sparse data make it difficult to determine these change points precisely, but probably when goby density > 2.8, refuge = 2 has a significantly lower mortality than does refuge = 1.



