Final Exam—Part 2

Due Date

Wednesday, April 25, 2012

Instructions

The same rules and policies apply to this "exam" as to ordinary assignments. Part 3 of the final will be assigned on the last day of class and will be due on the day of the scheduled final exam.

Data Source

The file elfin.csv contains the data for this exercise. This is a comma-delimited text file in which the variable names appear in the first row. The data are analyzed in Albanese et al. (2008).
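As a minimal sketch of what "comma-delimited with variable names in the first row" means in practice, the snippet below parses such a file and counts missing values in one column. It is written in Python purely for illustration (the assignment itself refers to R functions). The rows are invented placeholders, not values from elfin.csv, and the use of "NA" as the missing-value code is an assumption.

```python
import csv
import io

# Invented placeholder rows, NOT taken from elfin.csv. Only the column names
# come from this handout. "NA" as the missing-value code is an assumption.
toy_csv = """Occupied_Unoccupied,TotalCanopy,Slope,SlopeAspect
1,12.5,4,180
0,60.0,2,NA
0,35.0,0,NA
"""

def read_rows(handle, missing="NA"):
    """Return one dict per data row, mapping the missing code to None."""
    reader = csv.DictReader(handle)  # the first row supplies the variable names
    rows = []
    for raw in reader:
        rows.append({name: (None if value == missing else float(value))
                     for name, value in raw.items()})
    return rows

rows = read_rows(io.StringIO(toy_csv))
n_missing = sum(row["SlopeAspect"] is None for row in rows)
print(list(rows[0].keys()), n_missing)
```

To read the real file, the same hypothetical read_rows helper could be passed an open file handle instead of the in-memory string.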

Background

According to Albanese et al. (2008, p. 603), "the frosted elfin (Callophrys irus) is a localized and declining butterfly found in xeric open habitats maintained by disturbance." The authors examined four study sites in southeastern Massachusetts to assess whether females preferentially deposited eggs on host plants within specific microhabitats. They recorded the locations of wild indigo plants on which female frosted elfins were observed depositing an egg, and they obtained a random sample of unoccupied wild indigo plants for comparison with the larvae-occupied group. They then measured seven vegetative and other environmental variables for both groups of wild indigo plants. The recorded variables are described in the table below.

Variable: Description
TotalCanopy: The woody plant canopy cover over the center of the wild indigo plant, computed as the average of four spherical densitometer readings taken at breast height (1.5 m), one in each of the four cardinal directions.
WildIndigoSize: The maximum foliar width (cm) of the wild indigo plant multiplied by the maximum height (cm) of the wild indigo plant.
NearestWildIndigo: The linear distance (cm) from the edge of the wild indigo plant to the edge of the nearest neighboring wild indigo plant.
DistanceNearestTree: The linear distance (cm) from the nearest edge of the wild indigo plant to the nearest edge of a woody plant >2.5 m in height.
DirectionTree: The direction (in degrees, measured with a compass) of the main stem of the nearest woody plant >2.5 m in height.
Slope: The steepest slope angle (in degrees, measured with an optical clinometer) from the center of the wild indigo plant.
SlopeAspect: The direction of the steepest slope angle (in degrees, measured with a compass) from the center of the wild indigo plant. This variable is missing for about 25% of the observations.
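Compass bearings wrap around, so 359 degrees and 1 degree are numerically far apart yet point in nearly the same direction. The sketch below shows one common way to handle this, replacing a bearing with its sine and cosine, alongside a helper that measures the angular gap between two bearings. It is an illustration in Python rather than the R used in the course, the function names are hypothetical, and it is not the only defensible encoding.

```python
import math

def encode_angle(degrees):
    """Map a compass bearing in degrees to its (sin, cos) components."""
    radians = math.radians(degrees)
    return math.sin(radians), math.cos(radians)

def circular_distance(a, b):
    """Smallest angle, in degrees, between two compass bearings."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)

# Raw values differ by 358, but the bearings are only 2 degrees apart.
gap = circular_distance(359.0, 1.0)

# 0 and 360 degrees describe the same direction, so their encodings
# agree up to floating-point rounding.
sin0, cos0 = encode_angle(0.0)
sin360, cos360 = encode_angle(360.0)
print(gap, round(sin360 - sin0, 12), round(cos360 - cos0, 12))
```

Whether such an encoding makes sense depends on whether the variable is truly circular, which has to be judged variable by variable.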

The variable Occupied_Unoccupied in the data set is a binary variable that indicates whether an indigo plant harbored an egg (Occupied_Unoccupied = 1) or did not (Occupied_Unoccupied = 0).

Questions

  1. Fit a classification tree to the above occupancy data. Find the optimal tree size and display your final tree graphically with proper labeling. Be sure to use the set.seed function to set the random seed manually before you make the final run of rpart to carry out the cross-validation for determining how best to prune the tree. Report the value of the seed you used so that I can replicate your results if needed.
  2. Fit a GAM to these data with the binary response Occupied_Unoccupied. Simplify the model using appropriate tests and/or criteria.
  3. Fit a habitat suitability model to these same data but this time using ordinary logistic regression. Report your final logistic regression model and explain why you decided to include the terms you did (both why you retained terms and how you came up with the terms in the first place).
    1. Three of the predictors in this data set are angles. Explain why angles are unusual variables and suggest possible ways of including angles in the regression model. Are all of these choices reasonable for each of the angle variables present here?
    2. Use the generalized additive logistic regression model (GAM) from Question 2 to help decide on the proper functional forms for the predictors in the model.
    3. Use the classification tree you obtained in Question 1 to suggest possible interactions and nonlinearities in the variables in your logistic regression model and try to add them to the model.
    4. Are the nonlinearities suggested by the classification tree given further support by the GAM?
  4. Using a cut-off of 0.5, obtain confusion matrices (classification tables) for the classification tree, the GAM, and the ordinary logistic regression model. Note: The predict function can be used with classification trees to obtain predicted probabilities for the 0 and 1 categories.
  5. Obtain ROC curves for the classification tree, GAM, and ordinary logistic regression model. Display all three ROC curves on the same graph.
  6. Obtain AUC for all three models.
  7. Based on the above results, which of the three models appears to be the best?
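The quantities requested in Questions 4 through 6 can be defined compactly. The sketch below is a language-neutral illustration in Python, not the R workflow expected for the assignment: it builds a confusion matrix at a cut-off, traces ROC points over all thresholds, and computes AUC by the trapezoidal rule. The labels and probabilities are invented toy values, the function names are hypothetical, and treating a probability exactly equal to the cut-off as a predicted 1 is an assumed convention.

```python
def confusion_matrix(y_true, p_hat, cutoff=0.5):
    """Return counts keyed by (observed, predicted) for 0/1 labels."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(y_true, p_hat):
        # Assumed convention: p >= cutoff is classified as 1.
        table[(y, int(p >= cutoff))] += 1
    return table

def roc_points(y_true, p_hat):
    """(false positive rate, true positive rate) pairs over all thresholds."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(p_hat), reverse=True):
        tp = sum(1 for y, p in zip(y_true, p_hat) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_hat) if p >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented toy data, not from elfin.csv.
y = [1, 1, 0, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

table = confusion_matrix(y, p)
curve = roc_points(y, p)
area = auc(curve)
print(table, round(area, 4))
```

Applying the same three functions to the predicted probabilities from each fitted model gives directly comparable tables, curves, and areas.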

Hints

Question 2

The number of knots used in the smoothing spline cannot exceed the number of unique values of a variable. The default value for knots is k = 10. You will need to reduce this for a couple of the variables in order to be able to estimate a smooth for that variable.
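The constraint in this hint amounts to a simple check per predictor: count the distinct non-missing values and cap k at that count. The sketch below illustrates the arithmetic in Python (the assignment itself is carried out in R). The helper name and the toy values are hypothetical, and the rule is taken directly from the hint rather than from any particular package's documentation.

```python
def max_basis_size(values, default_k=10):
    """Largest usable k: the default, capped by the number of distinct values."""
    distinct = {v for v in values if v is not None}
    return min(default_k, len(distinct))

# Invented toy columns, not from elfin.csv.
few_values = [0, 2, 2, 4, 5, 0, None]   # only 4 distinct non-missing values
many_values = list(range(12))           # 12 distinct values

k_few = max_basis_size(few_values)
k_many = max_basis_size(many_values)
print(k_few, k_many)
```

Running such a count on every predictor before fitting shows which smooths need a reduced k and which can keep the default.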

Reference



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum in Ecology and the Environment, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--April 15, 2012
URL: https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/docs/assignments/finalpart2.htm