Lecture 37 (lab 12)—Friday, April 13, 2012

Topics

Regression and classification trees; cost-complexity pruning with cross-validation and the one standard deviation rule; random forests; density versus count as a response variable

R functions and commands demonstrated

rpart, printcp, plotcp, prune, predict, summary, the plot and text methods for rpart objects, as.party, randomForest, na.roughfix, fitdistr, dotchart, mosaicplot

R function options

method, control, and parms arguments of rpart; cp and minsplit arguments of rpart.control; margin, uniform, and branch arguments of plot.rpart; use.n and pretty arguments of text.rpart; type argument of predict; importance and na.action arguments of randomForest

R packages used

rpart, partykit, randomForest, MASS

Overview of the data

The data are from Zuur et al. (2007), who use them to introduce classification and regression trees in their Chapter 9. The data set used to illustrate regression trees is a Bahamas fisheries data set consisting of unpublished data from The Bahamas National Trust / Greenforce Andros Island Marine Study. Zuur et al. (2007) describe the data as follows (p. 147):

Parrotfish densities are used as the response variable, with the explanatory variables: algae and coral cover, location, time (month), and fish survey method (method 1: point counts, method 2: transects). There were 402 observations measured at 10 sites.

They choose not to use CoralRichness in the analysis and we will follow suit. Because the response variable Parrot is parrotfish density, a continuous variable, we will use these data to construct a regression tree.

Bahama <- read.table( 'ecol 562/Bahama.txt', header=T)
names(Bahama)
[1] "Sample" "Parrot" "CoralRichness" "CoralTotal" "Month" "Station"
[7] "Method"
dim(Bahama)
[1] 402 7
Bahama[1:4,]
  Sample Parrot CoralRichness CoralTotal Month Station Method
1      1  25.46            11       5.73     5       1      1
2      2  16.55            11       5.73     5       1      1
3      3  15.28            11       5.73     5       1      1
4      4  10.19            11       5.73     5       1      1

Zuur et al. (2007) use a second data set to illustrate classification trees. They describe the data and their objectives as follows (p. 152–153).

These data are from an annual ditch monitoring programme at a waste management and soil recycling facility in the UK. The soil recycling facility has been storing and blending soils on this site for many years, but it has become more intensive during the last 8 to 10. Soils are brought into the stockpile area from a range of different site locations, for example from derelict building sites, and are often saturated when they arrive. They are transferred from the stockpile area to nearby fields and spread out in layers approximately 300 mm deep. As the soils dry, stones are removed, and they are fertilised with farm manure and seeded with agricultural grasses. These processes recreate the original soil structure, and after about 18 months, the soil is stripped and stockpiled before being taken off-site to be sold as topsoil.

The main objective of the monitoring was to maintain a long-term surveillance of the surrounding ditch water quality and identify any changes in water quality that may be associated with the works and require remedial action. The ecological interest of the site relates mainly to the ditches: in particular the large diversity of aquatic invertebrates and plants, several of which are either nationally rare or scarce. Water quality data were collected four times a year in Spring, Summer, Autumn and Winter, and was analysed for the following parameters: pH, electrical conductivity (μS/cm), biochemical oxygen demand (mg/l), ammoniacal nitrogen (mg/l), total oxidised nitrogen, nitrate, nitrite (mg/l), total suspended solids (mg/l), chloride (mg/l), sulphate (mg/l), total calcium (mg/l), total zinc (μg/l), total cadmium (μg/l), dissolved lead (μg/l), dissolved nickel (μg/l), orthophosphate (mg/l) and total petroleum hydrocarbons (μg/l). In addition to water quality observations, ditch depth was measured during every visit. The data analysed here was collected from five sampling stations between 1997 and 2001. Vegetation and invertebrate data were also collected, but not used in this example.

The underlying question now is whether we can make a distinction between observations from different ditches based on the measured variables.

The response variable, Site, is a categorical variable with five categories and hence will be used to construct a classification tree.

Ditch <- read.table('ecol 562/Ditch.txt', header=T)
names(Ditch)
 [1] "Site"                    "Year"             "Month"         "Depth"
 [5] "pH"                      "Conductivity"     "BOD"           "Ammoniacal_Nitrogen"
 [9] "Total_Oxidised_Nitrogen" "Suspended_Solids" "Chloride"      "Sulphate"
[13] "Total_Calcium"           "Total_Zinc"       "Total_Cadmium" "Total_Lead"
[17] "Total_Nickel"            "Total_Phosphate"
dim(Ditch)
[1] 48 18
Ditch[1:4,]
  Site Year Month Depth   pH Conductivity BOD Ammoniacal_Nitrogen Total_Oxidised_Nitrogen
1    1 1997    11    NA 7.62          511 7.1                 0.1                    1.42
2    1 1999     1  0.40 8.51          778 7.9                 0.2                    1.43
3    1 1999     4  0.25 7.71          517 3.6                 0.1                    3.88
4    1 1999     7  0.40 7.77         1892 3.4                 0.1                    0.05
  Suspended_Solids Chloride Sulphate Total_Calcium Total_Zinc Total_Cadmium Total_Lead
1              441       43       63         104.0       0.18            NA       0.07
2              759       78      196         141.0       0.24         0.005       0.14
3               30       65      123          77.6       0.04         0.005       0.05
4               51      266      405         158.0       0.05         0.005       0.05
  Total_Nickel Total_Phosphate
1         0.03              NA
2         0.03            0.05
3         0.03            0.30
4         0.03            0.17

The Ditch data set contains only 48 observations. This makes it less than ideal for building a tree model.

Density as a response variable

It's worth considering how we might analyze the Bahama data using a parametric regression model. Regression trees are indifferent to the nature of the response variable, but the nature of the response does matter when choosing a probability distribution for a parametric model. The variable of interest, parrotfish density, presumably was obtained by counting parrotfish and dividing the total by the area surveyed. While this may seem like a perfectly natural variable, ratios can be exceedingly difficult to analyze statistically. We can get a sense of the difficulty when we look at the distribution of the parrotfish density variable (Fig. 1).

Fig. 1  Distribution of parrotfish densities

Fig. 1 is a bar chart of the different parrotfish densities in the data set. Keep in mind that the overall distribution of the response is largely irrelevant in regression. Instead it is the distribution of the response at various predictor combinations that matters. Because density is a non-negative quantity and we see from the graph that much of the data is close to and even achieves its lower boundary of zero, a distribution that permits negative values such as a normal distribution is not likely to be a good choice here. If we turn to positive distributions such as the lognormal and gamma we find there is another problem. These distributions don't allow zero values.

library(MASS)
fitdistr(Bahama$Parrot, 'lognormal')
Error in fitdistr(Bahama$Parrot, "lognormal") :
  need positive values to fit a log-Normal
glm(Bahama$Parrot~1, data=Bahama, family=Gamma)
Error in eval(expr, envir, enclos) :
  non-positive values not allowed for the gamma family

Yet 7% (n = 29) of the observed parrotfish densities are zero. One solution is to assume the zeros aren't really zero, but are small positive numbers that fall below our detection threshold. Alternatively we could analyze the zeros separately by fitting a binomial model to the presence-absence of parrotfish. Then conditional on parrotfish being present we could fit a lognormal model to the nonzero densities. Splitting up the data like this diminishes our ability to detect effects of interest. It is quite likely that the factors that cause density to be low also cause density to be zero.
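As a quick check of that percentage, and as a minimal sketch of the two-part approach just described (CoralTotal is used here purely as an illustrative predictor, not a recommended model):

sum(Bahama$Parrot == 0)
mean(Bahama$Parrot == 0)
#part 1: binomial model for the presence-absence of parrotfish
present <- as.numeric(Bahama$Parrot > 0)
part1 <- glm(present ~ CoralTotal, data=Bahama, family=binomial)
#part 2: lognormal model fit to the nonzero densities only
part2 <- lm(log(Parrot) ~ CoralTotal, data=Bahama, subset=Parrot > 0)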

All of these issues have arisen because we've chosen to record the response variable as density. The actual variable in nature is a count, the number of parrotfish observed on a transect, which we then chose to divide by the area surveyed. If the transects vary in length then the number of fish observed will vary also, so we need to control for area. Rather than forming a ratio we can control for the effect of area by including it in the regression model as a covariate.

parrotfish.count ~ b0 + b1*area + b2*x

The advantage of doing it this way is that the response variable is still a count and we can make use of probability models for counts (Poisson, negative binomial, etc.) all of which permit a count to be equal to zero.

If we insist that the results be interpretable as effects on density then there is yet another option: use parrotfish counts as the response variable but use log(area) as an offset rather than as a predictor. An offset is a term in a regression model whose coefficient is constrained to be 1. In R we would fit a model with an offset as follows.

glm(parrotfish.count ~ offset(log(area)) + x, family=poisson)

Because the default link function for Poisson regression is a log link, the above syntax fits the following regression model.

log(mean parrotfish count) = log(area) + b0 + b1*x
log(mean parrotfish count) - log(area) = b0 + b1*x
log(mean parrotfish count/area) = log(mean parrotfish density) = b0 + b1*x

where the last step follows from a property of logarithms. So if we assume parrotfish counts have a Poisson distribution (which allows zeros), then by including log area as an offset we end up with a model for density. The same result holds if the parrotfish counts are assumed to have a negative binomial distribution. The bottom line is that fitting models to densities is a messy statistical problem, while fitting models to counts is easy. Avoid working with densities unless you have no other choice.
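As a sketch, the negative binomial version of the offset model could be fit with glm.nb from the MASS package; parrotfish.count, area, and x are the same hypothetical variables used above.

library(MASS)
#negative binomial regression for counts with log(area) as an offset
glm.nb(parrotfish.count ~ offset(log(area)) + x)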

Regression trees

Regression trees are fit with the rpart function in the rpart package. The basic syntax is the same as what we've used with most of the regression functions in R. Because Month, Station, and Method are recorded as numerical values I declare them as factors explicitly in the model formula. Because Parrot is a continuous variable, rpart fits a regression tree by default. To explicitly request a regression tree we can add the argument method="anova".

library(rpart)
parrot_tree <- rpart(Parrot ~ CoralTotal + as.factor(Month) + as.factor(Station) + as.factor(Method), data = Bahama)

The default print method for rpart is to print a text representation of the tree. Generally you shouldn't use the summary function on an rpart object unless you want to see all the details of the tree construction.

parrot_tree
n= 402

node), split, n, deviance, yval
      * denotes terminal node

 1) root 402 50190.0 10.780
   2) as.factor(Method)=2 244 7044.0 6.449
     4) CoralTotal< 4.955 87 1401.0 3.526 *
     5) CoralTotal>=4.955 157 4488.0 8.069
      10) as.factor(Station)=2,3,6,7,10 122 2593.0 7.075 *
      11) as.factor(Station)=1,5,9 35 1355.0 11.530 *
   3) as.factor(Method)=1 158 31490.0 17.470
     6) as.factor(Station)=3,4 25 781.7 3.157 *
     7) as.factor(Station)=1,2,5,6,7,8,9,10 133 24630.0 20.160
      14) as.factor(Month)=5,7,8,10 94 11590.0 17.090
        28) as.factor(Station)=1,5,6,7,8 66 6717.0 15.530 *
        29) as.factor(Station)=2,9,10 28 4328.0 20.780 *
      15) as.factor(Month)=11 39 10020.0 27.560
        30) CoralTotal>=4.395 27 7647.0 24.810 *
        31) CoralTotal< 4.395 12 1713.0 33.740 *

Each line of the text tree contains the following information.

  1. The nodes are numbered in the order they are created.
  2. The value following the node number is the split condition: the variable and the value of that variable used to determine the split.
  3. The next value, n, is the number of observations at the node.
  4. The fourth value is the deviance at the node, which for a regression tree is equal to the residual sum of squares.
  5. The last value, yval, is the mean of the response variable for all observations assigned to the node. If this value is followed by an asterisk, *, then the node is a terminal node (leaf).

Each split leads to a reduction in total deviance. At the root (null) node the total deviance is 50,190.0. After the first split the total deviance is reported to be

7044.0+31490.0
[1] 38534

so the deviance has been reduced by splitting on the variable Method. At each stage of the tree we can sum the deviances at all the terminal nodes and compare it to the deviance at the root node to obtain an R2, the fraction of the original total deviance that has been explained by the tree.

(50190.0-(7044.0+31490.0))/50190.0
[1] 0.2322

Thus splitting the data set into groups based on the variable Method and using the mean response in each group as the predicted value has explained 23% of the original variability in parrotfish density. We can obtain a graphical summary of the regression tree by plotting the rpart object and then using the text function to add labels to the nodes.

plot(parrot_tree, margin=.1)
text(parrot_tree, cex=.85, use.n=T)

Fig. 2  Graphical display of the tree

The margin= argument in plot expands the white space around the tree to prevent truncation of the text that comes next. The use.n= argument of text adds the information about the number of observations occurring at each of the terminal nodes. Notice that rpart has relabeled the categories of each factor variable as a, b, c, etc. This is to avoid overlap with long category names. We can get the actual category levels to display by including the pretty argument of text. Setting pretty=1 causes the numeric category levels to be printed.

plot(parrot_tree, margin=.1)
text(parrot_tree, cex=.85, use.n=T, pretty=1)

Fig. 3   Tree with argument pretty=1

By default the tree is drawn so that the lengths of the branches indicate the relative magnitudes of the drops in deviance at each split point. This typically means that the information in the terminal leaves gets crowded together because the changes in deviance there are relatively small. We can correct this by adding the uniform=T argument to plot to obtain a tree with branches of equal length.

plot(parrot_tree, margin=.1, uniform=T)
text(parrot_tree, cex=.8, use.n=T, pretty=1)

Fig. 4   Tree with uniform branching

We can change the appearance of the branches with the branch argument. The default choice is branch=1 which yields the rectangular shoulders shown. Choosing branch=0 yields branches with no shoulders. Intermediate values for branch yield trees with shorter shoulders. I leave out the pretty argument to obtain a more compact display.

plot(parrot_tree, margin=.1, uniform=T, branch=0)
text(parrot_tree, cex=.8, use.n=T)
plot(parrot_tree, margin=.1, uniform=T, branch=0.5)
text(parrot_tree, cex=.8, use.n=T)


Fig. 5   The effect of varying the branch argument on the displayed tree. (a) branch=0; (b) branch=0.5. Figs. 2, 3, and 4 show branch=1.

We can obtain the value of the cost complexity parameter for trees of different sizes with the printcp function.

printcp(parrot_tree)
Regression tree:
rpart(formula = Parrot ~ CoralTotal + as.factor(Month) + as.factor(Station) +
     as.factor(Method), data = Bahama)

Variables actually used in tree construction:
[1] as.factor(Method) as.factor(Month) as.factor(Station) CoralTotal

Root node error: 50188/402 = 125

n= 402

     CP nsplit rel error xerror  xstd
1 0.232      0      1.00   1.00 0.116
2 0.121      1      0.77   0.77 0.083
3 0.060      2      0.65   0.71 0.075
4 0.023      3      0.59   0.67 0.071
5 0.013      4      0.56   0.64 0.072
6 0.011      5      0.55   0.65 0.073
7 0.011      6      0.54   0.64 0.071
8 0.010      7      0.53   0.63 0.069

  1. The column labeled CP is the complexity parameter we discussed in lecture 36. It serves as a penalty term to control tree size and decreases monotonically as the number of splits (nsplit) increases: the smaller the value of CP, the more complex the tree (the greater the number of splits).
  2. For a regression tree, the relative error (rel error) is the average deviance of the current tree divided by the average deviance of the null tree (see the quick check after this list).
  3. The cross-validation error (xerror) is based on a 10-fold cross-validation and is again measured relative to the deviance of the null model. As expected the cross-validation error is greater than the relative error. Using the same data to both fit and test a model results in over-optimistic fit diagnostics.
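Note that rel error is just one minus the R2 we computed earlier; a quick check for the one-split tree:

#rel error for the one-split tree is one minus the R-squared computed above
1 - 0.2322
[1] 0.7678

which agrees with the value of 0.77 reported in row 2 of the CP table.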

While relative error is guaranteed to decrease as the tree gets more complex, this will not normally be the case for cross-validation error. Because the cross-validation error is still decreasing in the output shown above, the default tree size is probably too small. We need to refit the model and force rpart to carry out additional splits. This can be accomplished with the control argument of rpart. I use the rpart.control function to specify a value for cp= that is smaller than the default value of 0.01. Because the cross-validation error results are random, I use the set.seed function first to set the seed for the random number stream so that the results obtained are reproducible.

set.seed(20)
parrot_tree2 <- rpart(Parrot ~ CoralTotal + as.factor(Month) + as.factor(Station) + as.factor(Method), data = Bahama, control=rpart.control(cp=.001))

The CP table exists as a component of the rpart object.

names(parrot_tree2)
[1] "frame" "where" "call" "terms" "cptable" "splits" "method" "parms"
[9] "control" "functions" "csplit" "y" "ordered"
parrot_tree2$cptable
         CP nsplit rel error xerror    xstd
1  0.232124      0    1.0000 1.0094 0.11738
2  0.121241      1    0.7679 0.7880 0.08532
3  0.060104      2    0.6466 0.7248 0.08104
4  0.023014      3    0.5865 0.6288 0.06845
5  0.013218      4    0.5635 0.6327 0.07226
6  0.010802      5    0.5503 0.6329 0.07345
7  0.010760      6    0.5395 0.6509 0.07497
8  0.007509      7    0.5287 0.6416 0.07510
9  0.005288      8    0.5212 0.6474 0.07387
10 0.004993     10    0.5107 0.6464 0.07350
11 0.003426     11    0.5057 0.6424 0.07367
12 0.001722     12    0.5022 0.6401 0.07318
13 0.001369     13    0.5005 0.6399 0.07202
14 0.001124     14    0.4991 0.6439 0.07249
15 0.001000     15    0.4980 0.6440 0.07249

From the output it appears that xerror has achieved an interior minimum. We can obtain an informative graphical display of the cp table with the plotcp function.

plotcp(parrot_tree2)


Fig. 6  Plot of CP table. The dotted line denotes the upper limit of the one standard deviation rule.

From the graph it would appear that the minimum cross-validation error occurred for the fourth tree listed in the CP table, the tree with a total of 4 leaves (3 splits). We can verify this by examining the table by eye or more formally as follows.

#minimum cross-validation error
min(parrot_tree2$cptable[,"xerror"])
[1] 0.6288
#location of minimum in CP table
which.min(parrot_tree2$cptable[,"xerror"])
4
4
# the tree with the minimum cross-validation error
parrot_tree2$cptable[4,]
     CP  nsplit rel error  xerror    xstd
0.02301 3.00000   0.58653 0.62884 0.06845

So, if we were to use the tree that minimized the cross-validation error we would choose a tree with 3 splits. As was noted in lecture 36, the problem with such a choice is that the cross-validation error is a random quantity. There is no guarantee that if we were to fit the sequence of trees again using a different random seed that the same tree would minimize the cross-validation error.

A more robust alternative to choosing the minimum cross-validation error is the one standard deviation rule: choose the smallest tree whose cross-validation error is within one standard error of the minimum. Using this rule we would still end up with the same tree. The first tree whose point estimate of the cross-validation error falls within one xstd of the minimum is the minimum tree itself, the tree of size 4. Below I extract the CP value of this tree and refit the tree using the prune function on the current tree object.
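The rule can also be applied programmatically; here is a minimal sketch that finds the smallest tree whose xerror falls below the minimum plus one xstd.

cp.tab <- parrot_tree2$cptable
min.row <- which.min(cp.tab[,"xerror"])
se.limit <- cp.tab[min.row,"xerror"] + cp.tab[min.row,"xstd"]
#row number of the smallest tree whose xerror falls below the limit (row 4 here)
which(cp.tab[,"xerror"] < se.limit)[1]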

cp.choice <- parrot_tree2$cptable[4,"CP"]
cp.choice
[1] 0.02301
pruned.tree <- prune(parrot_tree2, cp=cp.choice)
pruned.tree
n= 402

node), split, n, deviance, yval
      * denotes terminal node

 1) root 402 50190.0 10.780
   2) as.factor(Method)=2 244 7044.0 6.449 *
   3) as.factor(Method)=1 158 31490.0 17.470
     6) as.factor(Station)=3,4 25 781.7 3.157 *
     7) as.factor(Station)=1,2,5,6,7,8,9,10 133 24630.0 20.160
      14) as.factor(Month)=5,7,8,10 94 11590.0 17.090 *
      15) as.factor(Month)=11 39 10020.0 27.560 *

plot(pruned.tree, margin=0.1)
text(pruned.tree, cex=.9, use.n=T, pretty=1)

Fig. 7  The pruned tree

This result is probably the researcher's worst nightmare. None of the important predictors turn out to be biological variables. Instead we see that the survey method (point counts versus transects) is the most important predictor of parrotfish density. The right branch, consisting only of the point count surveys, is then split by the location at which the sample was taken (station), and eight of these stations are then further split by the month of the sample.

The predict function works with tree objects. For a regression tree it returns the mean of the leaf to which the observation is assigned.

predict(pruned.tree)[1:20]
 [1] 17.09394 17.09394 17.09394 17.09394 17.09394 17.09394 17.09394 17.09394
 [9] 17.09394 17.09394 17.09394 17.09394 17.09394 17.09394 17.09394 17.09394
[17] 27.55513 27.55513 27.55513 27.55513
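Because every observation in a leaf receives the leaf mean as its prediction, the pruned tree produces only four distinct predicted values; a quick check (sketch):

#the four distinct predictions are the yval entries at the leaves of the printed tree
sort(unique(predict(pruned.tree)))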

Just like the original rpart object, the object returned by prune contains quite a bit of information.

names(pruned.tree)
[1] "frame"   "where"     "call"   "terms" "cptable" "splits" "method" "parms"
[9] "control" "functions" "csplit" "y"     "ordered"

The $where component indicates to which leaf the different observations have been assigned.

pruned.tree$where
  [1] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [39] 2 2 2 2 2 2 2 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
 [77] 7 7 7 7 7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 2 2
[115] 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2
[153] 2 2 2 2 2 2 2 2 6 6 6 6 6 6 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 6
[191] 6 6 6 6 6 6 6 6 6 6 6 6 6 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[229] 2 2 2 2 2 2 2 2 2 2 2 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
[267] 6 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[305] 2 2 2 2 2 2 2 2 2 2 2 2 6 6 6 6 2 2 2 2 2 2 2 6 6 6 6 6 6 6 7 7 7 7 7 7 2 2
[343] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 6 6 6 6 6 6 6 7 7 7 7 7 7 7 2 2 2
[381] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
unique(pruned.tree$where)
[1] 6 7 2 4

The numbers themselves correspond to row numbers in the $frame component that defines the final tree.

pruned.tree$frame
                  var   n  wt     dev   yval complexity ncompete nsurrogate
1   as.factor(Method) 402 402 50188.3 10.781   0.232124        3          3
2              <leaf> 244 244  7044.2  6.449   0.023014        0          0
3  as.factor(Station) 158 158 31494.2 17.471   0.121241        2          1
6              <leaf>  25  25   781.7  3.157   0.001722        0          0
7    as.factor(Month) 133 133 24627.6 20.162   0.060104        2          2
14             <leaf>  94  94 11587.6 17.094   0.010802        0          0
15             <leaf>  39  39 10023.5 27.555   0.013218        0          0

Notice that rows 2, 4, 6, and 7 are labeled as leaves but have row names 2, 6, 14, and 15. The row names in the frame are the node numbers, which here also give the left-to-right positions of the leaves in the plotted tree. The missing numbers correspond to splits and leaves that don't appear in the pruned tree. We can use this information to display box plots in the same order as the left-to-right positions of the leaves of the pruned tree. I adjust the margins of each graph with the mar argument of par to try to position the box plots directly below the corresponding leaves of the tree. I use the contents of the $where component to group the observations by their final leaf assignment.

#create a factor for leaf positions
where <- factor(pruned.tree$where)
#calculate sample sizes as a check
n <- tapply(Bahama$Parrot, where, length)
#place boxplot directly below tree and remove some white space
par(mfrow=c(2,1))
#save old margin settings
oldmar <- par("mar")
#margins for tree
par(mar=c(0,4.1,1.1,1.1))
plot(pruned.tree, margin=.2)
text(pruned.tree, cex=.8, use.n=T, pretty=1)
#margins for boxplot
par(mar=c(2.1,5.1,0,1.8))
boxplot(Bahama$Parrot~where, varwidth=T, pars=list(axes=F))
axis(2)
mtext(side=1,line=.5, text=paste("n=",n), at=1:4, cex=.9)
#restore original margins
par(mar = oldmar)


Fig. 8  Pruned tree with box plots of the distributions of the response at the terminal nodes

Based on the box plots it appears that the first two splits of the tree have managed to pull out most of the surveys that yielded the lowest densities of parrotfish.

The partykit package of R produces enhanced displays of regression and classification trees. It does its own version of Fig. 8. To use it we need to convert the tree created by rpart into a party object with the as.party function. I pass an additional argument, id=FALSE, to the plot function to suppress the display of the node numbers in the terminal panels of the tree.

library("partykit")
plot(as.party(pruned.tree), tp_args=list(id=FALSE))


Fig. 9  Regression tree diagram produced by the partykit package

Classification trees

Classification trees are constructed the same way as regression trees, so I skip over most of the preliminaries. The data in the Ditch data set are clearly heterogeneous but perhaps we can ignore this because we're treating the annual measurements at each site as replicates of that site. I start with a small value for cp so that the initial tree is as large as possible. Because we don't have much data (only 48 observations) I also set a lower value for the minsplit criterion. By default it's set at 20 so that any group with fewer than 20 observations in it would not be split further. (Note: by default minbucket, the minimum number of observations allowed to form a terminal node, is set to minbucket = minsplit/3 so reducing minsplit also decreases minbucket.) The default split criterion is the Gini index. This can be changed to the entropy measure. To carry this out we would need to add a parms argument to the rpart call: parms=list(split="information").

To avoid typing all of the variable names in the model formula I use the ~. notation and remove those variables that should not be in the model by not including them in the data argument. To obtain a classification tree the response variable needs to be a factor. Alternatively one can add the argument method='class' to obtain a classification tree.
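As a sketch, here is what the entropy-based call mentioned above would look like; below I fit only the default Gini version.

#same tree request but using the entropy (information) split criterion
ditch_tree_info <- rpart(factor(Site)~., data=Ditch[,c(1,4:18)], control=rpart.control(cp=0.001, minsplit=15), parms=list(split="information"))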

set.seed(50)
ditch_tree <- rpart(factor(Site)~., data=Ditch[,c(1,4:18)], control=rpart.control(cp=0.001, minsplit=15))
ditch_tree
n= 48

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 48 38 1 (0.21 0.19 0.21 0.21 0.19)
   2) Total_Calcium>=118 25 16 5 (0.32 0.28 0 0.04 0.36)
     4) Conductivity< 1505 11 6 1 (0.45 0.45 0 0.091 0) *
     5) Conductivity>=1505 14 5 5 (0.21 0.14 0 0 0.64) *
   3) Total_Calcium< 118 23 13 3 (0.087 0.087 0.43 0.39 0)
     6) Depth>=0.505 8 0 3 (0 0 1 0 0) *
     7) Depth< 0.505 15 6 4 (0.13 0.13 0.13 0.6 0)
      14) Conductivity< 622.5 5 3 1 (0.4 0 0.4 0.2 0) *
      15) Conductivity>=622.5 10 2 4 (0 0.2 0 0.8 0) *

The output resembles that of the regression tree with the following differences.

  1. Because the response variable consists of multiple categories what's reported are the proportions of observations that occur in each of the categories (yprob).
  2. The value labeled yval is the category with the highest predicted probability.
  3. The deviance column is replaced by something called "loss". The loss is the number of misclassified observations: the number of observations at that node whose observed category is not equal to yval (see the quick check after this list).
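As a quick check of the loss reported at the root node (a sketch): there are 48 observations, 10 of which belong to the most common site, so predicting that site for everyone misclassifies 38 observations.

#loss at the root node: observations not in the modal category
48 - max(table(Ditch$Site))
[1] 38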

I next display a graph of the tree.

plot(ditch_tree, margin=0.1)
text(ditch_tree, use.n = T, cex = .9)

Fig. 10   A classification tree

At each leaf we have the predicted category for the members of that leaf using a simple majority rule. The use.n=T argument caused the actual number of observations in the five categories to be displayed below the prediction.

I next generate a plot of the cross-validation error for different values of the cost complexity parameter.

plotcp(ditch_tree)


Fig. 11   The cross-validation error for a sequence of classification trees

The results are fairly unambiguous. A tree with four leaves is best. I examine the CP table.

ditch_tree$cptable
       CP nsplit rel error xerror    xstd
1 0.23684      0    1.0000 1.1842 0.04413
2 0.18421      1    0.7632 1.1842 0.04413
3 0.13158      2    0.5789 1.0526 0.06795
4 0.02632      3    0.4474 0.7368 0.08989
5 0.00100      4    0.4211 0.7368 0.08989

I prune the tree using the optimal value of CP and display the results.

myprune <- prune(ditch_tree, cp=ditch_tree$cptable[4,"CP"])
plot(myprune, margin=.1)
text(myprune, use.n = T, cex = .9)

Fig. 12   The pruned tree

The predict function applied to classification trees reports the category probabilities for the leaf to which an observation is assigned.

predict(myprune)[1:10,]
              1         2         3          4         5
 [1,] 0.1333333 0.1333333 0.1333333 0.60000000 0.0000000
 [2,] 0.4545455 0.4545455 0.0000000 0.09090909 0.0000000
 [3,] 0.1333333 0.1333333 0.1333333 0.60000000 0.0000000
 [4,] 0.2142857 0.1428571 0.0000000 0.00000000 0.6428571
 [5,] 0.2142857 0.1428571 0.0000000 0.00000000 0.6428571
 [6,] 0.4545455 0.4545455 0.0000000 0.09090909 0.0000000
 [7,] 0.4545455 0.4545455 0.0000000 0.09090909 0.0000000
 [8,] 0.4545455 0.4545455 0.0000000 0.09090909 0.0000000
 [9,] 0.4545455 0.4545455 0.0000000 0.09090909 0.0000000
[10,] 0.2142857 0.1428571 0.0000000 0.00000000 0.6428571
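If we want the predicted category itself rather than the probabilities, we can add the type='class' argument (a sketch):

#predicted site for the first ten observations
predict(myprune, type='class')[1:10]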

The default partykit display for classification trees is a bar chart for response variables with more than two categories. Based on the displayed frequency counts in the leaves it's clear that we've classified one group perfectly and two others fairly well.

library("partykit")
plot(as.party(myprune), tp_args=list(id=FALSE))


Fig. 13  Classification tree diagram produced by the partykit package

Another way we can summarize the classification results graphically is with a mosaic plot, which replaces the box plots for a regression tree and the bar charts produced by the partykit package.

par(mfrow=c(2,1))
par(mar=c(0.5,2.1,0.5,0))
plot(myprune, margin=.3)
text(myprune, use.n = T, cex = .9)
par(xpd=T)
par(mar=c(2.1,3.1,0.1,1))
out.t <- table(myprune$where, Ditch$Site)
#remove labels for where categories
rownames(out.t) <- rep('',4)
mosaicplot(out.t, main='', xlab='', las=1)
par(xpd=F)
par(mar=c(4.1,5.1,2.1,1.1))
par(mfrow=c(1,1))


Fig. 14   Classification tree and corresponding mosaic plot

In a mosaic plot the categories that don't appear at a leaf are shown with dashed lines in the appropriate positions. The widths of the boxes reflect the number of observations assigned to the four leaves of the tree.

The first split of the classification tree was based on calcium levels. We can graphically assess how well this split discriminated the sites with a dot chart in which a line is superimposed at the calcium split value. Each site type is displayed with a different symbol type.

dotchart(Ditch$Total_Calcium,  pch = Ditch$Site, xlab = "Range", ylab = "Sample", main = "Total Calcium")
abline(v=118,lty=2,col=2)


Fig. 15   Dot plot of the primary split variable, Total_Calcium

From the graph it's pretty clear that total calcium level does a very good job at discriminating two groups of sites. Sites 1, 2, and 5 look very different from sites 3 and 4. For most of the sites only a couple of years end up being classified into the wrong group, while site 3 has been classified perfectly.
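We can quantify this impression with a simple tabulation of the sites against the primary split (a sketch):

#number of observations from each site on either side of the Total_Calcium split
with(Ditch, table(Site, Total_Calcium >= 118))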

So far we've been ignoring the summary results from an rpart run because the output is so copious. It contains an entire summary of the tree fitting process. One useful portion of the summary output is the description of surrogate splits. Below I show this information for the first split at Total_Calcium.

summary(myprune)
Call:
rpart(formula = factor(Site) ~ ., data = Ditch[, c(1, 4:18)],
    control = rpart.control(cp = 0.001, minsplit = 15))
 n= 48

       CP nsplit rel error xerror    xstd
1 0.23684      0    1.0000 1.1842 0.04413
2 0.18421      1    0.7632 1.1842 0.04413
3 0.13158      2    0.5789 1.0526 0.06795
4 0.02632      3    0.4474 0.7368 0.08989

Node number 1: 48 observations, complexity param=0.2368
  predicted class=1 expected loss=0.7917
    class counts:    10     9    10    10     9
 probabilities:   0.208 0.188 0.208 0.208 0.188
left son=2 (25 obs) right son=3 (23 obs)
Primary splits:
    Total_Calcium < 118   to the right, improve=6.392, (0 missing)
    Chloride      < 71    to the right, improve=6.142, (0 missing)
    Sulphate      < 205   to the right, improve=5.792, (0 missing)
    Conductivity  < 703   to the right, improve=5.125, (0 missing)
    Depth         < 0.505 to the right, improve=4.807, (4 missing)
Surrogate splits:
    Chloride            < 71    to the right, agree=0.917, adj=0.826, (0 split)
    Conductivity        < 722.5 to the right, agree=0.896, adj=0.783, (0 split)
    Sulphate            < 261   to the right, agree=0.875, adj=0.739, (0 split)
    Ammoniacal_Nitrogen < 0.25  to the right, agree=0.812, adj=0.609, (0 split)
    Total_Zinc          < 0.035 to the right, agree=0.708, adj=0.391, (0 split)

The section labeled "Primary splits" lists the top five options that were considered for forming the first split. From this we can see that Total_Calcium was not a clear winner. Chloride, the second choice, does nearly as well in decreasing impurity. If the goal in building a classification tree is to screen out variables for a subsequent parametric regression, then one might consider including some of competitive primary split variables in the parametric regression too.

The section labeled Surrogate splits considers other splits that most closely agree with the group allocations obtained with Total_Calcium < 118. For instance we see that choosing to split on Chloride < 71 instead of Total_Calcium < 118 would have yielded a classification of observations at this split that is 91.7% in agreement. This can be useful in predicting new observations that have missing values for the primary split variable. So if a new observation did not have a value for Total_Calcium we could use Chloride instead and most likely obtain a split assignment similar to the one we would have made had Total_Calcium been available.

Random forests

Two R packages that implement random forests are randomForest and party. The randomForest package is an R port of the Fortran code of the original implementation of random forests by Leo Breiman. The party package uses conditional inference trees that supposedly minimize the bias that random forests can exhibit with correlated variables. The function in the randomForest package for fitting random forests is randomForest. It uses a formula interface just like rpart.

library(randomForest)
rf <- randomForest(factor(Site)~., data=Ditch[,c(1,4:18)])
Error in na.fail.default(list(`factor(Site)` = c(1L, 1L, 1L, 1L, 1L, 1L,  :
  missing values in object

Unlike rpart, which handles missing values automatically via surrogate splits, the randomForest function needs to be told how to deal with them. A standard choice is na.action=na.roughfix. This imputes missing values using either the median of the non-missing observations (numeric variables) or the most frequent category (factor variables). I add this argument as well as importance=T in order to obtain importance statistics.
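To see what na.roughfix does, we can also apply it to the data frame directly (a sketch; Depth is one of the variables with missing values):

#the missing Depth values are replaced by the median of the non-missing ones
Ditch.fix <- na.roughfix(Ditch[,c(1,4:18)])
summary(Ditch.fix$Depth)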

rf <- randomForest(factor(Site)~., data=Ditch[,c(1,4:18)], na.action=na.roughfix, importance=T)
names(rf)
 [1] "call"            "type"            "predicted"       "err.rate"      
 [5] "confusion"       "votes"           "oob.times"       "classes"       
 [9] "importance"      "importanceSD"    "localImportance" "proximity"     
[13] "ntree"           "mtry"            "forest"          "y"             
[17] "test"            "inbag"           "terms"   

If we print the object we get details about the fit.

rf
Call:
 randomForest(formula = factor(Site) ~ ., data = Ditch[, c(1,      4:18)], importance = T, na.action = na.roughfix)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 47.92%
Confusion matrix:
  1 2 3 4 5 class.error
1 1 4 2 0 3   0.9000000
2 5 3 0 0 1   0.6666667
3 0 0 7 2 1   0.3000000
4 1 0 2 7 0   0.3000000
5 1 0 0 1 7   0.2222222  

From the printout we see that by default the forest was built from 500 trees each of which is a bootstrap sample from the actual data. Observations that are left out of the bootstrap sample form the "out of bag" (OOB) set. The OOB observations from each tree were used to assess fit. The printout reports that at each split point a random sample of three predictors (= the floor of the square root of the total number of predictors) was obtained and used to determine the best split.
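The $oob.times component of the fitted object records how often each observation was out of bag; because each bootstrap sample omits roughly a third of the observations, each observation should be OOB in roughly a third of the 500 trees (a quick sketch):

#number of trees for which each observation was out of bag
summary(rf$oob.times)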

The error rate of 47.92% can be calculated from the confusion matrix shown at the bottom of the output. The rows of the confusion matrix correspond to the actual categories while the columns are the predicted categories. The predictions of the OOB observations are based on the majority opinion of the 500 trees. Entries on the diagonal are the number of times the random forest made the correct prediction. Thus we see that in site 1 only one of the observations was correctly predicted by the forest; the other nine were incorrectly assigned to other categories. The best prediction rate is for site 5 where 7 of the 9 observations are correctly classified by the random forest.
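As a check (sketch), the 47.92% error rate can be reproduced from the confusion matrix stored in the fitted object.

#drop the class.error column and compute 1 minus the proportion on the diagonal
conf.mat <- rf$confusion[, 1:5]
1 - sum(diag(conf.mat))/sum(conf.mat)
[1] 0.4791667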

By specifying importance=T in the function call we obtain measures of each variable's importance to the classification.

rf$importance
                                    1            2            3            4
Depth                   -0.0028000000  0.025466667  0.098857143  0.044647619
pH                       0.0102142857 -0.013457143 -0.007304762 -0.002835714
Conductivity            -0.0202833333  0.005333333  0.079966667  0.031947619
BOD                      0.0029476190 -0.003266667 -0.001033333 -0.004314286
Ammoniacal_Nitrogen     -0.0077190476 -0.008735714 -0.001833333 -0.003407143
Total_Oxidised_Nitrogen -0.0089833333  0.008333333 -0.001319048  0.004400000
Suspended_Solids        -0.0202666667  0.018619048  0.020009524  0.016200000
Chloride                -0.0059333333  0.015628571  0.085352381  0.045450000
Sulphate                 0.0013380952  0.007666667  0.041842857  0.062373810
Total_Calcium           -0.0134666667 -0.013161905  0.098500000  0.064019048
Total_Zinc              -0.0121166667  0.032247619  0.041219048  0.006061905
Total_Cadmium            0.0016000000 -0.000200000 -0.005266667 -0.001685714
Total_Lead              -0.0046333333  0.028266667  0.005366667  0.009885714
Total_Nickel            -0.0006333333 -0.001833333 -0.001185714 -0.003500000
Total_Phosphate          0.0057190476 -0.007200000  0.027109524  0.161309524
                                                   5 MeanDecreaseAccuracy MeanDecreaseGini
Depth                    0.0020190476         0.0306509511        3.4146216
pH                      -0.0113476190        -0.0066050167        1.7914344
Conductivity             0.1069190476         0.0359485513        4.1764033
BOD                     -0.0033000000        -0.0014124502        1.5922467
Ammoniacal_Nitrogen      0.0988000000         0.0138814218        2.3691919
Total_Oxidised_Nitrogen -0.0012619048         0.0006801351        1.9740028
Suspended_Solids         0.0006666667         0.0058266886        2.7005667
Chloride                 0.0797619048         0.0406627907        4.3421930
Sulphate                 0.0101000000         0.0230129595        3.4430097
Total_Calcium            0.0588571429         0.0366480276        3.7074724
Total_Zinc               0.0099857143         0.0146729988        2.1329549
Total_Cadmium           -0.0014142857        -0.0016455866        0.4718458
Total_Lead               0.0136666667         0.0101774810        1.3720880
Total_Nickel            -0.0009333333        -0.0019895049        0.6116379
Total_Phosphate          0.0130333333         0.0365652276        3.2338879

The two columns labeled MeanDecreaseAccuracy and MeanDecreaseGini contain the two importance measures discussed in lecture 36. MeanDecreaseAccuracy measures how much the prediction is affected if the values of a predictor are randomly permuted among the other observations in the OOB set. A good way to display these results is with a dot chart in which the variables are sorted by their importance.

dotchart(sort(rf$importance[,6]))


Fig. 16   Variable importance as indicated by the mean decrease in accuracy

The variables appear to fall roughly into groups with the first five or six variables being clearly more important than the rest. It's useful to compare this picture to the one we get using the importance measure based on the Gini index.

dotchart(sort(rf$importance[,7]))


Fig. 17   Variable importance as determined by the mean decrease in Gini index

While there has been a change in the order of the variables, the first six ranked variables using either importance measure are the same.

The predict function works with randomForest objects. By default, the predict function returns the majority vote over the trees for which the observation was out of bag (OOB).

predict(rf)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
 3  2  3  5  5  2  1  5  2  2  1  2  1  2  1  1  2  1  5  3  3  3  5  3  3  3  4
28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 3  4  4  4  4  4  4  4  3  3  4  1  1  5  4  5  5  5  5  5  5
Levels: 1 2 3 4 5

If type='prob' is added as an argument we get instead the proportion of times that a given observation was assigned to the different categories while OOB.

predict(rf,type='prob')[1:10,]
            1         2          3          4           5
1  0.13513514 0.1783784 0.43243243 0.24864865 0.005405405
2  0.17277487 0.7172775 0.01570681 0.07853403 0.015706806
3  0.17112299 0.1016043 0.39037433 0.32085561 0.016042781
4  0.22784810 0.1075949 0.01265823 0.03797468 0.613924051
5  0.23936170 0.2074468 0.01063830 0.02127660 0.521276596
6  0.25714286 0.4914286 0.06285714 0.07428571 0.114285714
7  0.33701657 0.2983425 0.19889503 0.03867403 0.127071823
8  0.29729730 0.1783784 0.02162162 0.02702703 0.475675676
9  0.11728395 0.3580247 0.18518519 0.22839506 0.111111111
10 0.03763441 0.7688172 0.00000000 0.01075269 0.182795699

Cited References

Zuur, A. F., E. N. Ieno, and G. M. Smith. 2007. Analysing Ecological Data. Springer, New York.



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--April 17, 2012
URL: https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/docs/lectures/lecture37.htm