Final Part 2—Solution

elfin <- read.csv( 'ecol 562/elfin.csv')
names(elfin)
[1] "Occupied_Unoccupied" "TotalCanopy"         "WildIndigoSize"    
[4] "NearestWildIndigo"   "DistanceNearestTree" "DirectionTree"     
[7] "Slope"               "SlopeAspect"       
dim(elfin)
[1] 454   8
apply(elfin, 2, function(x) sum(is.na(x)))
Occupied_Unoccupied         TotalCanopy      WildIndigoSize
                  0                   0                   0
  NearestWildIndigo DistanceNearestTree       DirectionTree
                  0                   0                   0
              Slope         SlopeAspect
                  0                 126

We see that the SlopeAspect variable has a lot of missing data.

Question 1

In order to fit a classification tree we have to convert the response variable, Occupied_Unoccupied, into a factor.

library(rpart)
set.seed(20)
out2 <- rpart(factor(Occupied_Unoccupied)~., data=elfin, control=rpart.control(cp=.00001))
out2
n= 454

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 454 207 0 (0.5440529 0.4559471) 
   2) TotalCanopy< 8.2 210  40 0 (0.8095238 0.1904762) 
     4) WildIndigoSize< 58.88 127   9 0 (0.9291339 0.0708661) *
     5) WildIndigoSize>=58.88 83  31 0 (0.6265060 0.3734940) 
      10) DistanceNearestTree>=595 46   5 0 (0.8913043 0.1086957) *
      11) DistanceNearestTree< 595 37  11 1 (0.2972973 0.7027027) 
        22) DirectionTree>=247.5 9   4 0 (0.5555556 0.4444444) *
        23) DirectionTree< 247.5 28   6 1 (0.2142857 0.7857143) *
   3) TotalCanopy>=8.2 244  77 1 (0.3155738 0.6844262) 
     6) WildIndigoSize< 58.155 116  49 0 (0.5775862 0.4224138) 
      12) TotalCanopy>=65.235 33   2 0 (0.9393939 0.0606061) *
      13) TotalCanopy< 65.235 83  36 1 (0.4337349 0.5662651) 
        26) TotalCanopy< 18.88 30   5 0 (0.8333333 0.1666667) *
        27) TotalCanopy>=18.88 53  11 1 (0.2075472 0.7924528) *
     7) WildIndigoSize>=58.155 128  10 1 (0.0781250 0.9218750) *

plotcp(out2)

Fig. 1  Plot of cross-validation error versus Cp

out2$cptable
          CP nsplit rel error   xerror      xstd
1 0.43478261      0  1.000000 1.000000 0.0512667
2 0.08695652      1  0.565217 0.579710 0.0453906
3 0.07487923      2  0.478261 0.541063 0.0443735
4 0.03623188      4  0.328502 0.444444 0.0413761
5 0.00483092      6  0.256039 0.323671 0.0365086
6 0.00001000      7  0.251208 0.352657 0.0378117

The plot is unambiguous. The first tree whose cross-validation error is within one standard deviation of the minimum cross-validation error is the tree with minimum cross-validation error. We should select a tree of size 7 (nsplit = 6) corresponding to Cp = 0.0048.

which.min(out2$cptable[,"xerror"])
5
5
out2a <- prune(out2, cp=out2$cptable[5,"CP"])
plot(out2a, margin=0.075)
text(out2a, cex=0.8, use.n=T)

Fig. 2  Pruned classification tree
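
For reference, here is a short sketch (not part of the original output; the object names are arbitrary) that automates the one-SE rule described above directly from the cptable. Given the table shown, it selects the same row (cp = 0.0048) as the manual choice.

#automate the one-SE rule: find the minimum cross-validation error,
#then the simplest tree whose xerror is within one SE of that minimum
cpt <- out2$cptable
min.row <- which.min(cpt[, "xerror"])
one.se <- cpt[min.row, "xerror"] + cpt[min.row, "xstd"]
best.row <- min(which(cpt[, "xerror"] <= one.se))
out2b <- prune(out2, cp=cpt[best.row, "CP"])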

Question 2

I fit a GAM using all the predictors and then try to simplify it. We can't specify more knots than there are unique values (the default choice is k = 10), so I determine the number of unique values for each variable.

apply(elfin,2, function(x) sum(!is.na(unique(x))))
Occupied_Unoccupied         TotalCanopy      WildIndigoSize   NearestWildIndigo
                  2                 184                 402                 114
DistanceNearestTree       DirectionTree               Slope         SlopeAspect
                212                   8                  15                   8

We need to use no more than 8 knots for DirectionTree and SlopeAspect.

library(mgcv)
gam1 <- gam(Occupied_Unoccupied~s(TotalCanopy)+ s(WildIndigoSize)+ s(NearestWildIndigo)+ s(DistanceNearestTree)+ s(DirectionTree,k=8)+ s(Slope)+ s(SlopeAspect,k=8), data=elfin, method="ML", family=binomial)
summary(gam1)
Family: binomial
Link function: logit

Formula:
Occupied_Unoccupied ~ s(TotalCanopy) + s(WildIndigoSize) + s(NearestWildIndigo) +
    s(DistanceNearestTree) + s(DirectionTree, k = 8) + s(Slope) +
    s(SlopeAspect, k = 8)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|) 
(Intercept)  -0.6010     0.3566  -1.685   0.0919 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                         edf Ref.df Chi.sq  p-value   
s(TotalCanopy)         3.409  4.249 31.697 2.96e-06 ***
s(WildIndigoSize)      1.000  1.000 32.884 9.79e-09 ***
s(NearestWildIndigo)   1.000  1.000  9.586  0.00196 **
s(DistanceNearestTree) 2.091  2.629  4.592  0.16224   
s(DirectionTree)       1.000  1.000  3.928  0.04749 * 
s(Slope)               1.000  1.000  0.054  0.81608   
s(SlopeAspect)         1.000  1.000  0.667  0.41416   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.58   Deviance explained = 53.1%
ML score = 116.83  Scale est. = 1         n = 328

The summary table suggests that Slope and SlopeAspect are not significant, but we know that the p-values of the Wald tests for GAMs can be quite unreliable. The different reduced models we might fit and compare against this one are not all equivalent because the variable SlopeAspect has a large number of missing values.

#check for missing values
apply(elfin,2,function(x) sum(is.na(x)))
Occupied_Unoccupied         TotalCanopy      WildIndigoSize   NearestWildIndigo
                  0                   0                   0                   0
DistanceNearestTree       DirectionTree               Slope         SlopeAspect
                  0                   0                   0                 126

The gam function will silently remove these missing values anytime SlopeAspect is a predictor in the model. Models that are fit to different sets of observations cannot be compared using significance testing or AIC. It makes sense to test SlopeAspect first because if it can be dropped then the remaining variables can be examined using the full data set. To make models with and without SlopeAspect comparable I need to explicitly remove the observations that have missing values for SlopeAspect before fitting the model without SlopeAspect.

#remove observations with missing values of SlopeAspect
elfin2 <- elfin[!is.na(elfin$SlopeAspect),]

As was explained in lecture 39, the default number of knots (10) is set too low for several of these smooths and has the effect of constraining them. As a result, removing SlopeAspect from the model partially releases this constraint, causing the two models to use different smoothing parameters. The models are then not truly nested and the likelihood ratio test is invalid. We can fix this by increasing the number of knots and/or using the smoothing parameters from the bigger model in the smaller model.

gam1 <- gam(Occupied_Unoccupied~s(TotalCanopy,k=20) + s(WildIndigoSize,k=20) + s(NearestWildIndigo,k=20) +s(DistanceNearestTree,k=20)+ s(DirectionTree,k=8)+ s(Slope)+ s(SlopeAspect,k=8), data=elfin2, method="ML", family=binomial)
gam2 <- gam(Occupied_Unoccupied~s(TotalCanopy,k=20) + s(WildIndigoSize,k=20) + s(NearestWildIndigo,k=20)+ s(DistanceNearestTree,k=20) + s(DirectionTree,k=8)+ s(Slope), data=elfin2, method="ML", family=binomial, sp=gam1$sp[1:6])
anova(gam2, gam1)
Analysis of Deviance Table

Model 1: Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DistanceNearestTree,
    k = 20) + s(DirectionTree, k = 8) + s(Slope)
Model 2: Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DistanceNearestTree,
    k = 20) + s(DirectionTree, k = 8) + s(Slope) + s(SlopeAspect,
    k = 8)
  Resid. Df Resid. Dev     Df Deviance
1    317.41     213.32               
2    316.42     212.53 0.9898    0.79907

The change in deviance is less than 1 on roughly 1 degree of freedom, so we can drop SlopeAspect. With that taken care of I refit the model to the full data set.

gam2 <- gam(Occupied_Unoccupied~s(TotalCanopy,k=20) + s(WildIndigoSize,k=20) + s(NearestWildIndigo,k=20)+ s(DistanceNearestTree,k=20) + s(DirectionTree,k=8)+ s(Slope), data=elfin, method="ML", family=binomial)
summary(gam2)

Family: binomial
Link function: logit

Formula:
Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DistanceNearestTree,
    k = 20) + s(DirectionTree, k = 8) + s(Slope)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.7841     0.2947   -2.66   0.0078 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                         edf Ref.df Chi.sq  p-value   
s(TotalCanopy)         3.514  4.408 37.816 2.04e-07 ***
s(WildIndigoSize)      2.532  3.172 52.407 3.24e-11 ***
s(NearestWildIndigo)   1.000  1.000 13.037 0.000305 ***
s(DistanceNearestTree) 3.281  4.091  8.895 0.067595 . 
s(DirectionTree)       1.000  1.000  4.152 0.041576 * 
s(Slope)               1.000  1.000  0.179  0.672329   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.587   Deviance explained = 53.1%
ML score = 164.05  Scale est. = 1         n = 454

Slope is the next candidate to be dropped. I use the same smoothing parameters as in gam2.

gam3 <- gam(Occupied_Unoccupied~s(TotalCanopy,k=20) + s(WildIndigoSize,k=20) + s(NearestWildIndigo,k=20)+ s(DistanceNearestTree,k=20) + s(DirectionTree,k=8), data=elfin, method="ML", family=binomial, sp=gam2$sp[1:5])
anova(gam3, gam2, test='Chisq')
Analysis of Deviance Table

Model 1: Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DistanceNearestTree,
    k = 20) + s(DirectionTree, k = 8)
Model 2: Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DistanceNearestTree,
    k = 20) + s(DirectionTree, k = 8) + s(Slope)
  Resid. Df Resid. Dev      Df Deviance Pr(>Chi)
1    441.66     293.59                         
2    440.67     293.43 0.98928  0.15979    0.685

Slope can be dropped. I refit the model using its own smoothing parameters.

gam3 <- gam(Occupied_Unoccupied~s(TotalCanopy,k=20) + s(WildIndigoSize,k=20) + s(NearestWildIndigo,k=20)+ s(DistanceNearestTree,k=20) + s(DirectionTree,k=8), data=elfin, method="ML", family=binomial)
summary(gam3)
Family: binomial
Link function: logit

Formula:
Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DistanceNearestTree,
    k = 20) + s(DirectionTree, k = 8)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.7793     0.2893  -2.694  0.00707 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                         edf Ref.df Chi.sq  p-value   
s(TotalCanopy)         3.527  4.420 37.900 1.99e-07 ***
s(WildIndigoSize)      2.629  3.294 52.377 3.99e-11 ***
s(NearestWildIndigo)   1.000  1.000 13.167 0.000285 ***
s(DistanceNearestTree) 3.238  4.036  8.795 0.067999 . 
s(DirectionTree)       1.000  1.000  4.132 0.042079 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.588   Deviance explained = 53.1%
ML score = 164.14  Scale est. = 1         n = 454

DistanceNearestTree is the next candidate to be dropped. I consider a likelihood ratio test.

gam3a <- gam(Occupied_Unoccupied~s(TotalCanopy,k=20) + s(WildIndigoSize,k=20) + s(NearestWildIndigo,k=20)+  s(DirectionTree,k=8), data=elfin, method="ML", family=binomial)
anova(gam3a, gam3, test='Chisq')
Analysis of Deviance Table

Model 1: Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DirectionTree,
    k = 8)
Model 2: Occupied_Unoccupied ~ s(TotalCanopy, k = 20) + s(WildIndigoSize,
    k = 20) + s(NearestWildIndigo, k = 20) + s(DistanceNearestTree,
    k = 20) + s(DirectionTree, k = 8)
  Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)   
1    443.64     315.02                             
2    441.61     293.41 2.0304   21.607 2.129e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The likelihood ratio test and the Wald test are in strong disagreement. The likelihood ratio test argues that we should retain DistanceNearestTree. So our final model contains smooths of five predictors, three of which appear to be nonlinear (based on the reported edf) and two of which are effectively linear.

Question 3

My basic strategy is to fit a logistic regression model using all of the variables and then gradually weed out those predictors that are flawed and/or have no relationship to the response. I use the tree results as well as the GAM to suggest interactions and to improve the structural forms of the remaining predictors. I start the process by including all the variables individually and additively in a logistic regression model.

mod1 <- glm(Occupied_Unoccupied~., data=elfin, family=binomial)
round(summary(mod1)$coefficients,4)
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -0.0255     0.5314 -0.0481   0.9617
TotalCanopy          -0.0025     0.0075 -0.3344   0.7381
WildIndigoSize        0.0285     0.0045  6.2689   0.0000
NearestWildIndigo    -0.0096     0.0031 -3.1454   0.0017
DistanceNearestTree  -0.0047     0.0008 -5.6250   0.0000
DirectionTree        -0.0027     0.0015 -1.8231   0.0683
Slope                -0.0054     0.0529 -0.1024   0.9184
SlopeAspect           0.0001     0.0020  0.0651   0.9481

Dealing with the angle variables

The variable SlopeAspect is an obvious candidate for omission. Recall that SlopeAspect is a variable with a lot of missing values. Both Slope and SlopeAspect are angles and angles require some care when including them in regression models. I examine their values.

unique(elfin$Slope)
 [1]  1  0  2  3  6  8  7  9 10  4  5 14 13 11 12
unique(elfin$SlopeAspect)
[1]  45  NA  90 135 315 270 225   0 180

The values for Slope are all rather close together so nothing more needs to be done with it, but SlopeAspect has a wide range. Angles are circular measures, so 0 and 360 are the same angle. On a circular scale 315 is as close to 0 as 45 is. Currently the regression model treats 0 and 315 as being the furthest apart. How an angle should enter a regression model depends on how we think the angle affects the response; this is a substantive issue, not a statistical one. Having said that, given the range of the data it almost certainly makes very little sense to use the raw angle measured in degrees. One possibility is to use the cosine or sine of the angle. I try each in turn and both together.
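
As a quick illustration of the circular-distance point (a side calculation, not part of the model fitting), the cosine transform used below maps 45 and 315 degrees to the same value, reflecting the fact that both are 45 degrees from 0 on the circle even though the raw angles are 270 degrees apart.

#cosine treats 45 and 315 degrees as equally far from 0
round(cos(c(0, 45, 315)/360*2*pi), 4)
[1] 1.0000 0.7071 0.7071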

mod1a <- update(mod1, .~. - SlopeAspect + cos(SlopeAspect/360*2*pi))
mod1b <- update(mod1, .~. - SlopeAspect + sin(SlopeAspect/360*2*pi))
mod1c <- update(mod1, .~. - SlopeAspect + cos(SlopeAspect/360*2*pi) + sin(SlopeAspect/360*2*pi))
AIC(mod1, mod1a, mod1b, mod1c)
      df      AIC
mod1   8 278.9459
mod1a  8 277.5965
mod1b  8 278.8972
mod1c  9 279.4565

The model with lowest AIC uses cosine. If we examine the summary table we find that its coefficient is not significant. So, we can drop it.

round(summary(mod1a)$coefficients,4)
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                     0.0708     0.4887  0.1448   0.8849
TotalCanopy                    -0.0018     0.0076 -0.2378   0.8120
WildIndigoSize                  0.0288     0.0046  6.2798   0.0000
NearestWildIndigo              -0.0098     0.0031 -3.1678   0.0015
DistanceNearestTree            -0.0048     0.0009 -5.6024   0.0000
DirectionTree                  -0.0028     0.0015 -1.8989   0.0576
Slope                          -0.0100     0.0513 -0.1959   0.8447
cos(SlopeAspect/360 * 2 * pi)   0.3017     0.2609  1.1561   0.2476
mod1.5 <- update(mod1a, .~. - cos(SlopeAspect/360*2*pi))
round(summary(mod1.5)$coefficients,4)
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -0.4783     0.3798 -1.2592   0.2080
TotalCanopy           0.0118     0.0064  1.8519   0.0640
WildIndigoSize        0.0251     0.0036  7.0083   0.0000
NearestWildIndigo    -0.0092     0.0024 -3.9142   0.0001
DistanceNearestTree  -0.0022     0.0004 -4.9592   0.0000
DirectionTree        -0.0023     0.0012 -1.8910   0.0586
Slope                -0.0513     0.0402 -1.2768   0.2017

Slope is not significant so I drop it next.

mod2 <- update(mod1.5, .~.-Slope)
round(summary(mod2)$coefficients,4)
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -0.5813     0.3713 -1.5656   0.1175
TotalCanopy           0.0106     0.0063  1.6879   0.0914
WildIndigoSize        0.0247     0.0035  6.9579   0.0000
NearestWildIndigo    -0.0093     0.0024 -3.9494   0.0001
DistanceNearestTree  -0.0022     0.0004 -4.8832   0.0000
DirectionTree        -0.0023     0.0012 -1.9111   0.0560

Dealing with nonlinear variables

At this point all the terms remaining in the model also appear in our final GAM. I examine the graphs of their smooths.

plot(gam3, pages=1, scale=0)

Fig. 3  Plots of the estimated smooths from a generalized additive model (GAM)

NearestWildIndigo and DirectionTree look linear. DirectionTree has some other issues that we'll deal with later. The remaining variables look nonlinear. The smooth for TotalCanopy looks very much like a quadratic function. I try that next.

mod3 <- update(mod2, .~. + I(TotalCanopy^2))
round(summary(mod3)$coefficients,4)
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          -2.3772     0.4356 -5.4569   0.0000
TotalCanopy           0.2050     0.0257  7.9828   0.0000
WildIndigoSize        0.0238     0.0036  6.5535   0.0000
NearestWildIndigo    -0.0100     0.0026 -3.8014   0.0001
DistanceNearestTree  -0.0006     0.0003 -2.4510   0.0142
DirectionTree        -0.0031     0.0014 -2.2048   0.0275
I(TotalCanopy^2)     -0.0026     0.0003 -7.3942   0.0000

The quadratic term is significant. The relationship with DistanceNearestTree also looks quadratic. I try that next.

mod4 <- update(mod3, .~. + I(DistanceNearestTree^2))
round(summary(mod4)$coefficients,4)
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)               -1.6275     0.4988 -3.2626   0.0011
TotalCanopy                0.1667     0.0279  5.9746   0.0000
WildIndigoSize             0.0236     0.0037  6.4551   0.0000
NearestWildIndigo         -0.0097     0.0026 -3.7544   0.0002
DistanceNearestTree       -0.0023     0.0007 -3.2258   0.0013
DirectionTree             -0.0029     0.0014 -2.0231   0.0431
I(TotalCanopy^2)          -0.0022     0.0004 -6.0635   0.0000
I(DistanceNearestTree^2)   0.0000     0.0000  2.6629   0.0077

The quadratic term is again significant. The partial response function for WildIndigoSize looks like a logarithm. I replace WildIndigoSize with log(WildIndigoSize) and then compare the two models using AIC.

mod5 <- update(mod4, .~. - WildIndigoSize + log(WildIndigoSize))
AIC(mod4, mod5)
     df      AIC
mod4  8 328.4710
mod5  8 324.3546

So the log model is preferred. The summary table of our model thus far is shown below.

round(summary(mod5)$coefficients,4)
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)               -7.1632     1.0922 -6.5585   0.0000
TotalCanopy                0.1651     0.0283  5.8258   0.0000
NearestWildIndigo         -0.0097     0.0027 -3.6162   0.0003
DistanceNearestTree       -0.0022     0.0007 -3.3081   0.0009
DirectionTree             -0.0029     0.0014 -2.0492   0.0404
I(TotalCanopy^2)          -0.0021     0.0004 -5.8326   0.0000
I(DistanceNearestTree^2)   0.0000     0.0000  2.9107   0.0036
log(WildIndigoSize)        1.7786     0.2441  7.2865   0.0000

The DirectionTree variable

DirectionTree is a strange variable. Does it make sense to treat it as continuous? Currently its coefficient is negative and significant, which would mean that as the angle to the nearest tree gets bigger we're less likely to find the butterfly. If we examine the variable we see that it has only a few unique values.

unique(elfin$DirectionTree)
[1] 315  45 270  90   0 180 135 225

I try taking the sine and cosine of the angle, treating it as a factor, and dropping it entirely.

mod6a <- update(mod5, .~.-DirectionTree + sin(DirectionTree/360*2*pi))
mod6b <- update(mod5, .~.-DirectionTree + cos(DirectionTree/360*2*pi))
mod6c <- update(mod5, .~.-DirectionTree + factor(DirectionTree))
mod6d <- update(mod5, .~.-DirectionTree)
AIC(mod5, mod6a, mod6b, mod6c, mod6d)
      df      AIC
mod5   8 324.3546
mod6a  8 325.6400
mod6b  8 328.3542
mod6c 14 325.0915
mod6d  7 326.6492

So the linear model ranks best, although the factor model comes close. The likelihood ratio tests for both are statistically significant.

anova(mod6d, mod6c, test='Chisq')
Analysis of Deviance Table

Model 1: Occupied_Unoccupied ~ TotalCanopy + NearestWildIndigo + DistanceNearestTree +
    I(TotalCanopy^2) + I(DistanceNearestTree^2) + log(WildIndigoSize)
Model 2: Occupied_Unoccupied ~ TotalCanopy + NearestWildIndigo + DistanceNearestTree +
    I(TotalCanopy^2) + I(DistanceNearestTree^2) + log(WildIndigoSize) +
    factor(DirectionTree)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi) 
1       447     312.65                      
2       440     297.09  7   15.558  0.02948 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova(mod6d, mod5, test='Chisq')
Analysis of Deviance Table

Model 1: Occupied_Unoccupied ~ TotalCanopy + NearestWildIndigo + DistanceNearestTree +
    I(TotalCanopy^2) + I(DistanceNearestTree^2) + log(WildIndigoSize)
Model 2: Occupied_Unoccupied ~ TotalCanopy + NearestWildIndigo + DistanceNearestTree +
    DirectionTree + I(TotalCanopy^2) + I(DistanceNearestTree^2) +
    log(WildIndigoSize)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi) 
1       447     312.65                      
2       446     308.36  1   4.2947  0.03823 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As a final comparison I plot the partial response functions for the factor model and the linear model for DirectionTree and superimpose them. To simplify obtaining the factor model estimates I refit the model without the intercept.

mod6c.1 <- update(mod5, .~. - DirectionTree + factor(DirectionTree)-1)
coef(mod6c.1)
             TotalCanopy        NearestWildIndigo      DistanceNearestTree
            1.636235e-01            -1.006312e-02            -2.325251e-03
        I(TotalCanopy^2) I(DistanceNearestTree^2)      log(WildIndigoSize)
           -2.134585e-03             2.642030e-07             1.819255e+00
  factor(DirectionTree)0  factor(DirectionTree)45  factor(DirectionTree)90
           -7.436173e+00            -7.552560e+00            -7.466269e+00
factor(DirectionTree)135 factor(DirectionTree)180 factor(DirectionTree)225
           -7.189403e+00            -7.617518e+00            -8.528675e+00
factor(DirectionTree)270 factor(DirectionTree)315
           -7.221354e+00            -9.034128e+00
coef(mod5)
             (Intercept)              TotalCanopy        NearestWildIndigo
           -7.163231e+00             1.651143e-01            -9.659265e-03
     DistanceNearestTree            DirectionTree         I(TotalCanopy^2)
           -2.172650e-03            -2.942651e-03            -2.128156e-03
I(DistanceNearestTree^2)      log(WildIndigoSize)
            2.503037e-07             1.778623e+00
plot(c(0,45,90,135,180,225,270,315), coef(mod6c.1)[7:14], type='b', ylab='logit', xlab="DirectionTree")
abline(coef(mod5)[1], coef(mod5)[5], lty=2)
legend('bottomleft', c('factor model', 'linear model'), lty=c(1,2), pch=c(1,NA), bty='n', cex=.9)

Fig. 4  Fitted relationship between logit and DirectionTree: individual estimates for each angle from the factor model along with the estimated linear relationship.

It's difficult to tell if DirectionTree is capturing anything real in the data. The means for individual angles show an extreme amount of variability about the regression line.

Including interactions based on the classification tree

Because TotalCanopy appears repeatedly on the right branch of the classification tree, it suggests that the response may have a nonlinear relationship with TotalCanopy. The GAM has already suggested a quadratic relationship so we don't learn anything more about nonlinearities from looking at the classification tree.

The variables TotalCanopy, WildIndigoSize, and DistanceNearestTree occur along a single branch of the classification tree. This suggests that interactions among TotalCanopy, WildIndigoSize, and DistanceNearestTree might be worth including in an ordinary logistic regression model. On the other hand, the right branch also includes TotalCanopy and WildIndigoSize with nearly the same split value for WildIndigoSize, so perhaps only one of the component two-factor interactions is needed. I try adding an interaction among these three variables. Each of the variables has a modified form in the model (two are quadratic and one is logged), so choosing the proper form for the interaction is complicated. For instance, we could choose to involve the quadratic terms.

mod6 <- update(mod5, .~. + TotalCanopy * log(WildIndigoSize) * DistanceNearestTree)
anova(mod5, mod6, test='Chisq')
Analysis of Deviance Table

Model 1: Occupied_Unoccupied ~ TotalCanopy + NearestWildIndigo + DistanceNearestTree +
    DirectionTree + I(TotalCanopy^2) + I(DistanceNearestTree^2) +
    log(WildIndigoSize)
Model 2: Occupied_Unoccupied ~ TotalCanopy + NearestWildIndigo + DistanceNearestTree +
    DirectionTree + I(TotalCanopy^2) + I(DistanceNearestTree^2) +
    log(WildIndigoSize) + TotalCanopy:log(WildIndigoSize) + TotalCanopy:DistanceNearestTree +
    DistanceNearestTree:log(WildIndigoSize) + TotalCanopy:DistanceNearestTree:log(WildIndigoSize)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       446     308.36                    
2       442     301.11  4   7.2454   0.1235

anova(mod6, test='Chisq')
Analysis of Deviance Table

Model: binomial, link: logit

Response: Occupied_Unoccupied

Terms added sequentially (first to last)

                                                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)   
NULL                                                                  453     625.85             
TotalCanopy                                          1   21.830       452     604.02 2.979e-06 ***
NearestWildIndigo                                    1   55.862       451     548.16 7.772e-14 ***
DistanceNearestTree                                  1   65.682       450     482.47 5.298e-16 ***
DirectionTree                                        1    3.590       449     478.88  0.058122 . 
I(TotalCanopy^2)                                     1   87.980       448     390.90 < 2.2e-16 ***
I(DistanceNearestTree^2)                             1   10.481       447     380.42  0.001206 **
log(WildIndigoSize)                                  1   72.069       446     308.35 < 2.2e-16 ***
TotalCanopy:log(WildIndigoSize)                      1    0.951       445     307.40  0.329508   
TotalCanopy:DistanceNearestTree                      1    1.100       444     306.30  0.294269   
DistanceNearestTree:log(WildIndigoSize)              1    0.157       443     306.15  0.691767   
TotalCanopy:DistanceNearestTree:log(WildIndigoSize)  1    5.037       442     301.11  0.024805 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC(mod5,mod6)
     df      AIC
mod5  8 324.3546
mod6 12 325.1091

The three-factor interaction by itself is significant (p = 0.025). The principle of marginality requires that if we add a three-factor interaction to the model we must also include all of its component two-factor interactions and main effects. When we carry out a likelihood ratio test of all the interaction terms simultaneously, the test statistic is not significant (p = 0.12). In addition, the AIC is higher for the three-factor interaction model (325.1 versus 324.4). So I retain the additive model, mod5.
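
As a possible follow-up to the observation above that a single two-factor interaction might suffice, one could test, say, the TotalCanopy by log(WildIndigoSize) interaction on its own while still respecting marginality (its main effects are already in mod5). This is only a sketch of how such a check could be set up (the name mod7 is mine); it is not pursued further here.

#hypothetical check of a single two-factor interaction
mod7 <- update(mod5, .~. + TotalCanopy:log(WildIndigoSize))
anova(mod5, mod7, test='Chisq')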

Question 4

Classification tree

The predict function applied to a classification tree yields the predicted probabilities of each response category separately for each observation.

predict(out2a)[1:4,]
            0         1
[1,] 0.833333 0.1666667
[2,] 0.929134 0.0708661
[3,] 0.929134 0.0708661
[4,] 0.078125 0.9218750

In each row of the output from predict the entry with the larger probability is the prediction for that observation. Because the values sum to 1 in each row, the prediction is the entry whose value exceeds 0.5.

table(as.numeric(predict(out2a)[,2]>=.5), elfin$Occupied_Unoccupied)
      0   1
  0 215  21
  1  32 186

We could also have obtained this by using the type='class' argument of predict.

predict(out2a, type='class')[1:4]
[1] 0 0 0 1
Levels: 0 1
table(predict(out2a, type='class'), elfin$Occupied_Unoccupied)
  
      0   1
  0 215  21
  1  32 186

I calculate the fraction of correct predictions.

sum(diag(table(as.numeric(predict(out2a)[,2]>=.5), elfin$Occupied_Unoccupied)))/ sum(table(as.numeric(predict(out2a)[,2]>=.5), elfin$Occupied_Unoccupied))
[1] 0.88326

88.3% of the observations are classified correctly as true positives and true negatives.
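
Since this same calculation is repeated for each model below, a small helper function can wrap it (a minimal sketch; the name accuracy is mine, not from the assignment). It assumes a vector of predicted probabilities of category 1 and the observed response.

#fraction of correct predictions at cutoff c = 0.5
accuracy <- function(p, y, cutoff=0.5) {
   tab <- table(as.numeric(p >= cutoff), y)
   sum(diag(tab))/sum(tab)
}
accuracy(predict(out2a)[,2], elfin$Occupied_Unoccupied)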

GAM 

For the GAM I use model gam3 with five smooths.

s(TotalCanopy) + s(WildIndigoSize) + s(DistanceNearestTree) + s(NearestWildIndigo) + s(DirectionTree)

table(as.numeric(fitted(gam3)>.5), elfin$Occupied_Unoccupied)
  
      0   1
  0 213  29
  1  34 178
sum(diag(table(as.numeric(fitted(gam3)>.5), elfin$Occupied_Unoccupied)))/sum(table(as.numeric(fitted(gam3)>.5), elfin$Occupied_Unoccupied))
[1] 0.8612335

86.1% of the observations are classified correctly as true positives and true negatives.

Logistic regression

For the logistic regression model I use model mod5.

Occupied_Unoccupied ~ TotalCanopy + NearestWildIndigo + DistanceNearestTree + DirectionTree + I(TotalCanopy^2) + I(DistanceNearestTree^2) + log(WildIndigoSize)

table(as.numeric(fitted(mod5)>.5), elfin$Occupied_Unoccupied)
  
      0   1
  0 212  30
  1  35 177
sum(diag(table(as.numeric(fitted(mod5)>.5), elfin$Occupied_Unoccupied)))/ sum(table(as.numeric(fitted(mod5)>.5), elfin$Occupied_Unoccupied))

[1] 0.8568282

85.7% of the observations are classified correctly as true positives and true negatives.

Table 2  Confusion matrices (using c = 0.5) for four different models: classification tree, generalized additive model (GAM), logistic regression, and logistic regression with interactions

Both the classification tree and the GAM beat the logistic regression model, having both fewer false positives and fewer false negatives. Of the three models, the classification tree has the fewest false positives and the fewest false negatives.
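
For convenience, the three confusion matrices computed above can be regenerated together with a short sketch like the following (assuming out2a, gam3, and mod5 as fit earlier).

#recompute the three confusion matrices at c = 0.5
preds <- list(tree=predict(out2a)[,2], GAM=fitted(gam3), logistic=fitted(mod5))
lapply(preds, function(p) table(predicted=as.numeric(p >= .5), observed=elfin$Occupied_Unoccupied))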

Question 5

I display the ROC curves for the logistic regression model, the GAM, and the classification tree. As we saw in Question 4, applying the predict function to a classification tree yields the probability of 0 (column 1) and the probability of 1 (column 2). For the ROC curve we need the probability of 1.

#get the classification tree probability of category 1
tree.pred <- predict(out2a)[,2]
library(ROCR)
#prediction objects
pred.logist1 <- prediction(fitted(mod5), elfin$Occupied_Unoccupied)
pred.gam <- prediction(fitted(gam3), elfin$Occupied_Unoccupied)
pred.tree <- prediction(tree.pred, elfin$Occupied_Unoccupied)
#performance objects
perf.logist1 <- performance(pred.logist1, 'tpr', 'fpr')
perf.gam <- performance(pred.gam, 'tpr', 'fpr')
perf.tree <- performance(pred.tree, 'tpr', 'fpr')
#create the graph
plot(perf.logist1@x.values[[1]], perf.logist1@y.values[[1]], type='s', ylab='True positive rate (TPR)', xlab='False positive rate (FPR)', col=2, lwd=2)
lines(perf.tree@x.values[[1]], perf.tree@y.values[[1]], type='s', col='grey60', lwd=2)
lines(perf.gam@x.values[[1]], perf.gam@y.values[[1]], type='s', col=4, lty=1)
legend('bottomright', c('logistic', 'GAM', 'classification tree'), col=c(2,4,'grey60'),lty=c(1,1,1), lwd=c(2,1,2), cex=.9, bty='n', title=expression(bold('Model')))

Fig. 5  ROC curves for a logistic regression model, a five-variable GAM, and a classification tree.

Question 6

auc.vals <- sapply(list(pred.logist1, pred.gam, pred.tree), function(x) performance(x, 'auc')@y.values[[1]][1])
names(auc.vals) <- c('logistic', 'GAM', 'tree')
auc.vals
 logistic       GAM      tree
0.9284359 0.9338536 0.9136302

All of these values are exceedingly high.

Question 7

Because the classification tree has only six splits (seven terminal nodes), its ROC curve has only a handful of jumps. Because the logistic regression model and the GAM treat the variables as continuous, their ROC curves are much smoother. As a result the AUCs for these models exceed that of the classification tree. Thus for certain cut-points (0.5, for instance) the classification tree beats both logistic regression models (although rarely the GAM), but on average it does worse. So, surprisingly, the parametric and semi-parametric models beat the algorithmic approach here. The GAM is the clear winner, but perhaps at the cost of interpretability and some overfitting. In truth, all four models do a spectacular job of classification here.

