The basic method used in constructing a decision tree is an example of a greedy algorithm. In building the tree we choose each partition to obtain the best performance from the current tree, without regard to the performance of trees built later. One implication of this is that the obvious stopping rule, namely to keep splitting until the reduction in impurity fails to exceed a pre-determined minimum threshold ε, is a bad stopping rule. Because a greedy algorithm fails to look ahead, a nearly worthless split at one stage can open the way to a very worthwhile split at a later stage. To allow such splits to be found, a better strategy is to grow a very large tree and then prune it back. The standard criterion used in pruning trees is the cost complexity criterion cp.
Suppose we grow a very large tree that has a total of n leaves. Assume that the tree is constructed as was described last time so that the best tree is found at each stage. If we reverse things, the best tree at the previous stage is the one that can be obtained by combining adjacent nodes in such a way that impurity increases the least. Thus pruning a tree retraces our steps and finds the best tree at each smaller size. Let T denote the current tree and let m denote the number of splits that were used to create this tree, so that T has m + 1 leaves. The cost complexity (CC) function for this tree is defined as follows.

$$CC(T) = \sum_{i=1}^{m+1} D_i + \lambda m$$
Here $D_i$ is the impurity at terminal node i: RSS for a continuous response or one of the categorical impurity measures defined last time. The parameter λ is a penalty term. By varying λ we can select different-sized trees from our sequence of best trees. The principle behind the cost complexity function is comparable to the rationale for the penalized criterion used to fit smoothing splines in generalized additive models:

$$\sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 + \lambda \int f''(x)^2 \, dx$$
Here λ plays a similar role as it does in the cost complexity function.
The rpart function of the rpart package of R does not report λ from the cost complexity function, but rather a scaled version of λ that it denotes cp, the cost complexity criterion. In rpart, cp is λ divided by the impurity of the null tree,

$$cp = \frac{\lambda}{D_{\text{null}}},$$

where a null tree is one with no splits and $D_{\text{null}}$ is its impurity. rpart displays the sequence of best trees obtained as the value of cp is varied.
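As a sketch of how this looks in practice, the following code grows a deliberately large regression tree by setting a very small cp and then displays the sequence of best trees. The data frame mydata and the formula y ~ x1 + x2 + x3 are hypothetical placeholders.

library(rpart)

# Grow a large tree: a tiny cp value lets rpart keep splitting
big.tree <- rpart(y ~ x1 + x2 + x3, data = mydata, method = "anova",
                  control = rpart.control(cp = 0.001, xval = 10))

# The cp table lists one row per best tree: cp, number of splits,
# relative error, cross-validated error (xerror), and its standard error (xstd)
printcp(big.tree)
plotcp(big.tree)   # plot of xerror against cp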
So the problem of when to stop building a tree has been converted into a problem of when to stop pruning a tree. To answer this we turn to cross-validation. For each best tree of a fixed size (or alternatively for each value of cp yielding a different tree) we carry out a k-fold cross-validation. To do this we randomly divide the data D into k subsets such that

$$D = D_1 \cup D_2 \cup \cdots \cup D_k, \qquad D_i \cap D_j = \varnothing \;\text{ for } i \neq j.$$
Typically k is chosen to be 10. We leave out each one of the k subsets in turn and fit the tree to a data set consisting of the remaining k – 1 subsets combined. Fitting the tree means running the observations through the previously constructed tree branches to obtain the predictions at the leaves: $\bar{y}_j$, the mean response in leaf j, for a regression tree, or the most frequent class in the leaf for a classification tree. The data that were left out are then run through the tree and are used to calculate the impurity of the tree at each node. This process is repeated k times, once for each subset, and the relative average cross-validation error (xerror) is calculated as follows.

$$\text{xerror} = \frac{\sum_{j=1}^{k} \text{error}(D_j)}{D_{\text{null}}}$$

Here error(D_j) is the prediction error of the tree on the held-out subset $D_j$ and $D_{\text{null}}$ is the impurity of the null tree, so that xerror is 1 for a tree with no splits.
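rpart carries out this cross-validation automatically (the xval argument of rpart.control used above), but the random partition itself is easy to construct by hand. A minimal sketch, again assuming the hypothetical data frame mydata and k = 10:

# Randomly assign each of n observations to one of k folds of (nearly) equal size
set.seed(42)          # arbitrary seed so the partition is reproducible
n <- nrow(mydata)
k <- 10
fold <- sample(rep(1:k, length.out = n))

# Observations with fold == j form the held-out subset D_j;
# the remaining k - 1 folds are combined to fit the tree
table(fold)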
Stopping rule #1: Choose the value cp (or equivalently the number of splits) that produces a tree that minimizes cross-validation relative error xerror. The problem with this stopping rule is that xerror is random because it depends on the k-fold partition that was actually obtained. Thus a better rule is one that takes into account the variability of xerror.
Stopping rule #2: Choose the first (largest) value of cp such that xerror < min(xerror) + xstd. Here xstd is the estimate of the standard deviation of xerror that is calculated from the tree with the minimum value of xerror.
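Both stopping rules can be applied directly to the cp table of the large tree grown earlier. The following sketch (using the hypothetical big.tree object from above) extracts the chosen cp under each rule and prunes the tree.

cptab <- big.tree$cptable

# Stopping rule #1: the cp value that minimizes xerror
cp.min <- cptab[which.min(cptab[, "xerror"]), "CP"]

# Stopping rule #2 (the 1-SE rule): the first (largest) cp whose xerror falls
# below min(xerror) + xstd, with xstd taken from the minimizing row
cutoff <- min(cptab[, "xerror"]) + cptab[which.min(cptab[, "xerror"]), "xstd"]
cp.1se <- cptab[which(cptab[, "xerror"] < cutoff)[1], "CP"]

pruned.tree <- prune(big.tree, cp = cp.1se)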
Regression trees have come a long way in the 25 years since their inception. Recent attention has focused on what are called ensemble methods, which go by colorful names such as boosting, bagging, and random forests. The basic idea behind all of these methods is to grow trees on perturbed versions of the data, each time obtaining a set of predictions, and then to average the predictions over the repeated samples. Ensemble methods try to address two problems that can arise with individual decision trees: a single tree tends to overfit the data set used to build it, and a single tree can be unduly influenced by outliers or a few influential observations.
Random forests get around both problems by injecting randomness into the process. This is done by including random sampling at two stages of the tree-building process.
In the random forest implementation available for R, bootstrap samples (500 by default) are drawn from the raw data and used to grow trees, so each of the 500 trees is built from a different set of observations. A bootstrap sample is a random sample drawn with replacement, so the same observation can appear in a sample multiple times. Typically, with a large number of observations, about 2/3 of them will appear in a given bootstrap sample and about 1/3 will be left out. The machine learning terminology for this is "bagging", short for "bootstrap aggregation". Observations selected for the sample are said to be "in the bag" and those left out are said to be "out of bag". The in-bag observations serve as the training data set for building the tree. The out-of-bag observations serve as the test data set for determining how well the tree predicts new observations.
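The roughly 2/3 versus 1/3 split is easy to verify with a quick simulation in base R (the sample size of 1000 is arbitrary):

# Fraction of distinct observations appearing in one bootstrap sample;
# the expected value is 1 - (1 - 1/n)^n, which approaches 1 - exp(-1), about 0.632
set.seed(1)
n <- 1000
in.bag <- unique(sample(n, size = n, replace = TRUE))
length(in.bag) / n        # roughly 2/3 "in the bag", the rest "out of bag"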
Instead of using all the available variables to define a split at a decision node, a small random sample of predictors is selected and evaluated for split points. Thus within a given tree the variables available at each split point vary depending on the random sample obtained. In R, if there are p predictors available in the data set, a random sample of √p (rounded down) of these predictors is selected at each split point for a classification tree. For regression trees the default is 1/3 of the predictors. In addition to introducing variability into the trees, working with a smaller predictor set at each decision node reduces the computational load.
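A minimal randomForest call illustrating these two defaults is sketched below. The data frame mydata, the factor response grp, and the continuous response y are hypothetical, and mtry is written out explicitly only to make the default values visible.

library(randomForest)

p <- 3   # number of predictors in the hypothetical formula below

# Classification forest: default mtry is floor(sqrt(p)) predictors per split
rf.class <- randomForest(grp ~ x1 + x2 + x3, data = mydata,
                         ntree = 500, mtry = floor(sqrt(p)))

# Regression forest: default mtry is floor(p/3) predictors per split (at least 1)
rf.reg <- randomForest(y ~ x1 + x2 + x3, data = mydata,
                       ntree = 500, mtry = max(floor(p/3), 1))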
Given the different data sets and different predictor sets, we expect the performance of the different trees to vary. Each tree is built to its maximum depth with no pruning, so that it performs well (has minimum bias) on the data set used to build it. Although each tree overfits its particular data set, across the ensemble each tree overfits differently, because the variables differ (as a result of the random selection of variables at nodes) and the data sets differ (as a result of bagging). The individual trees are analogous to a panel of experts, each of whom has a different and very specific area of expertise but who collectively encompass a wide range of knowledge. As a result an ensemble of trees tends to perform well with new data.
Random forests use ensemble scoring. For a regression forest the predictions of the individual trees are averaged; for a classification forest the individual trees vote and the majority class is returned. Because so many trees are involved, describing the fit of a random forest is more complicated than it is for a single decision tree.
One statistic reported by a random forest is the OOB (out of bag) estimate of the error rate. This is calculated using the out of bag observations. Because bootstrap samples were used to build each tree, each observation will on average be OOB in about 1/3 of the trees. For each tree in which the observation is OOB we can calculate the prediction error. This error then gets averaged across all trees for which that observation is OOB. For classification trees the prediction error is the proportion of trees for which the OOB observation is misclassified. Finally we average this over all observations to obtain a single summary statistic. The OOB estimate of the error rate tells us how well the forest will generalize to new data.
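For the hypothetical classification forest fit above, the OOB error rate appears when the fitted object is printed and is also stored in the err.rate component:

print(rf.class)               # includes the "OOB estimate of error rate" and a confusion matrix

# err.rate has one row per tree: the cumulative OOB error rate as trees are added,
# overall and by class; the last row corresponds to the value reported by print()
tail(rf.class$err.rate, 1)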
A random forest also reports the importance of each variable in the forest, which can be calculated in a number of different ways. The Gini importance of a variable is the average decrease in the Gini index whenever that variable is used to make a decision at a split point. Another importance measure is the (scaled) mean decrease in prediction accuracy. For this, each variable in turn has its values randomly permuted among the OOB observations. The predictions after permutation are obtained and compared against the predictions before permutation, and the average decrease in accuracy is calculated. The larger this decrease is, the more important the variable. The importance values of the different variables are typically displayed in a dot plot in which the variables are ranked by their importance.
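Both measures are available from the importance function of randomForest, and varImpPlot produces the usual dot plot. To obtain the permutation-based measure the forest has to be grown with importance = TRUE, so the hypothetical forest is refit here with that option.

rf.class <- randomForest(grp ~ x1 + x2 + x3, data = mydata,
                         ntree = 500, importance = TRUE)

importance(rf.class, type = 1)   # mean decrease in accuracy (permutation measure)
importance(rf.class, type = 2)   # mean decrease in Gini index
varImpPlot(rf.class)             # dot plot with variables ranked by importance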
In classification trees we may wish to treat false positive errors differently from false negative errors. There are tuning parameters (the sampsize argument in the R implementation of randomForest for example) that can be used to force a tree to take one of the errors more seriously than the other. This is analogous to adjusting the probability cut-off in a decision rule in order to change the sensitivity and specificity of logistic regression models.
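One hedged sketch of this idea uses sampsize with stratified sampling: drawing the same number of in-bag observations from each class forces the trees to pay equal attention to both classes. The two-class factor grp and the sample size of 30 per class are hypothetical.

# Stratified bootstrap sampling: draw 30 in-bag observations from each class,
# so the rarer class is not swamped by the common one
rf.bal <- randomForest(grp ~ x1 + x2 + x3, data = mydata, ntree = 500,
                       strata = mydata$grp, sampsize = c(30, 30))

rf.bal$confusion   # compare the class-specific error rates with those of rf.class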
Random forests are ensembles of hundreds of unpruned decision trees. They are typically used with large data sets that have hundreds or even thousands of input variables; it is even possible to use random forests when there are more variables than observations. Unlike single trees, random forests are relatively insensitive to noise. Because the individual trees making up the forest are built to maximum depth, they tend to have low bias. Classification forests are also good at handling under-represented classes.
The process of bagging and the use of small random samples of predictors at each node protects random forests against overfitting and guards against the undue influence of outliers. This built-in randomness also makes each tree in the ensemble a more or less independent model. For a recent introduction to random forests for ecologists see Cutler et al. (2007).
Cutler, D. R., T. C. Edwards, K. H. Beard, A. Cutler, K. T. Hess, J. Gibson, and J. J. Lawler. 2007. Random forests for classification in ecology. Ecology 88: 2783–2792.