Assignment 9
Due Date
Friday, March 23, 2012
Data Source
The data set for this assignment is ozone.txt, a tab-delimited text file.
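A tab-delimited file like this can be read with read.delim(). The sketch below is only illustrative: it writes a tiny stand-in file with made-up values to a temporary path (the names demo_path and ozone_demo are hypothetical) so that it runs without the course data. With the real file in the working directory, the call would simply be read.delim("ozone.txt").

```r
# Hypothetical stand-in for ozone.txt: a tiny tab-delimited file with
# made-up values, so the example runs without the course data.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("O3\ttemp\tibh",
             "3\t40\t2693",
             "5\t45\t590",
             "5\t54\t1450"),
           demo_path)

# read.delim() defaults to sep = "\t" and header = TRUE.
ozone_demo <- read.delim(demo_path)
str(ozone_demo)
```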
Overview
The data set comes from a study of the relationship between atmospheric ozone concentration and meteorology in the Los Angeles basin. It was first presented by Breiman and Friedman (1985) and was further analyzed by Faraway (2006). The data consist of daily measurements of ozone concentration (maximum one-hour average) and eight meteorological quantities for 330 days of 1976. The variables are listed below.
Dependent Variable
O3: Upland ozone concentration (ppm)
Predictors
- temp: Sandburg Air Force Base temperature (°C)
- ibh: inversion base height (ft.)
- dpg: Daggett pressure gradient (mm Hg)
- vis: visibility (miles)
- vh: Vandenburg 500 millibar height (in)
- humidity: humidity (percent)
- ibt: inversion base temperature (°F)
- wind: wind speed (mph)
- day: day of the year to account for possible seasonal effects not captured by the meteorological variables
Questions
- Fit an additive model that includes O3 as the response along with separate smooths of each of the nine predictor variables.
- According to the Wald tests in the summary table, two of the smooths may not be statistically significant. The reported Wald tests are somewhat unreliable, though; better statistical tests can be obtained by fitting with method="ML". Refit the model using method="ML" and try dropping the two variables in question one at a time. Use the anova function with test='F' to compare two nested models at a time. Remove those variables whose smooths are not significant according to the anova function.
- Plot the smooths and determine which of the smooths are roughly linear. Replace the smooths for those variables with parametric linear terms.
- Plots of the four remaining smooths suggest that a piecewise linear curve with two pieces might approximate the pattern displayed by each smooth. Try replacing each of the four smooths, one at a time, with an appropriate piecewise linear curve. Estimate the location of the breakpoint (knot) by fitting a large number of models, one per candidate breakpoint, and selecting the model that provides the best fit.
- Argue that it is statistically defensible to replace only one of the smooths with a piecewise linear curve. (Remember that the estimated breakpoint location should count as one of the estimated parameters.)
- For this variable, plot the smooth along with the replacement breakpoint model on the same graph so that the two functions can be readily compared.
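The mechanics of Questions 1 to 3 can be sketched on simulated data. This is a minimal illustration and not a solution: the data frame sim and the variables x1, x2, x3 are hypothetical stand-ins for the ozone predictors, and the mgcv package is assumed to be installed.

```r
library(mgcv)  # provides gam(), s(), and the anova/plot methods for gam fits

set.seed(1)
n <- 300
# Simulated stand-in data (not the ozone measurements): x1 has a curved
# effect, x2 a linear effect, and x3 no effect at all.
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- sin(2 * pi * sim$x1) + 2 * sim$x2 + rnorm(n, sd = 0.3)

# Additive model with a separate smooth of each predictor, fit by ML.
fit_full <- gam(y ~ s(x1) + s(x2) + s(x3), data = sim, method = "ML")
print(summary(fit_full))  # approximate Wald-type tests for each smooth

# Drop one questionable smooth and compare the nested pair with an F test.
fit_drop <- gam(y ~ s(x1) + s(x2), data = sim, method = "ML")
cmp <- anova(fit_drop, fit_full, test = "F")
print(cmp)

# Inspect the smooths; a roughly straight curve suggests a linear term.
plot(fit_full, pages = 1)
fit_lin <- gam(y ~ s(x1) + x2, data = sim, method = "ML")
```

For the assignment itself the same pattern would apply with O3 as the response and the nine predictors in place of x1, x2, and x3.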
Hints
- Question 4: To fit a breakpoint (BP) model for the variable x at breakpoint c, include the following two terms in the regression model: x + I((x-c)*(x>c)). Together these two terms replace s(x). For each of the four smooths, take your full GAM from Question 3 and replace only that one smooth with the corresponding BP terms. Loop through possible values of the breakpoint and choose the best one. Repeat this separately for each smooth, keeping the rest of the model as it was in Question 3. In each case, then, the model you are fitting should contain 3 smooths, 4 linear terms, and 1 BP term. The models differ only in which smooth gets swapped for the BP terms; the other smooths are still there. Finally, compare each BP model separately to the GAM from Question 3.
- Question 6: For the smooth, plot the GAM from Question 3 and use the select argument to choose the smooth you want. Overlay the breakpoint term on top of this graph. You will need to choose an intercept for the breakpoint regression model so that the smooth and the breakpoint model pass through the same point at the left edge of the graph. It will take trial and error to determine the y-coordinate of the smooth at the left-hand endpoint. Once you have it, you can work out how much to add to the BP terms so that the BP model starts at the same point. There is an example of this in the lecture 26 notes, where I overlay the regression term for log(distance) on the smooth of distance.
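The two hints can be sketched together on simulated data. Several things here are assumptions rather than part of the assignment: the data frame sim and its variables x and z are hypothetical, the candidate grid is arbitrary, AIC is used as one possible yardstick with 2 added for the estimated breakpoint, and the left-edge value of the smooth is read from predict() instead of being found by trial and error. The column hinge holds pmax(x - c, 0), which equals the (x-c)*(x>c) term from the hint.

```r
library(mgcv)

set.seed(2)
n <- 300
# Simulated stand-in data: the effect of x is piecewise linear with a true
# knot at 0.6; z has a curved effect that stays in the model as s(z).
sim <- data.frame(x = runif(n), z = runif(n))
sim$y <- 1 + 2 * sim$x - 5 * pmax(sim$x - 0.6, 0) +
  sin(2 * pi * sim$z) + rnorm(n, sd = 0.3)

# Reference GAM in which x still enters as a smooth.
gam_ref <- gam(y ~ s(x) + s(z), data = sim, method = "ML")

# Fit one BP model per candidate breakpoint and record its log-likelihood.
knots <- seq(0.1, 0.9, by = 0.01)
loglik <- sapply(knots, function(k) {
  d <- sim
  d$hinge <- pmax(d$x - k, 0)          # same as (x - k) * (x > k)
  as.numeric(logLik(gam(y ~ x + hinge + s(z), data = d, method = "ML")))
})

# Best-fitting breakpoint and the corresponding BP model.
k_hat <- knots[which.max(loglik)]
sim$hinge <- pmax(sim$x - k_hat, 0)
bp_fit <- gam(y ~ x + hinge + s(z), data = sim, method = "ML")

# AIC comparison, charging one extra parameter for the estimated breakpoint.
aic_bp  <- AIC(bp_fit) + 2
aic_gam <- AIC(gam_ref)
print(c(breakpoint = k_hat, AIC_BP = aic_bp, AIC_GAM = aic_gam))

# BP contribution of x on a grid, before any vertical shift.
x_grid <- seq(min(sim$x), max(sim$x), length.out = 200)
b <- coef(bp_fit)
bp_curve <- b[["x"]] * x_grid + b[["hinge"]] * pmax(x_grid - k_hat, 0)

# Value of the centred smooth s(x) at the left edge of the plot.
left <- data.frame(x = x_grid[1], z = mean(sim$z))
s_left <- predict(gam_ref, newdata = left, type = "terms")[, "s(x)"]

# Shift the BP curve so both functions start at the same point, then overlay.
shift <- s_left - bp_curve[1]
plot(gam_ref, select = 1)
lines(x_grid, bp_curve + shift, col = "red", lwd = 2)
abline(v = k_hat, lty = 3)
```

Precomputing the hinge column keeps the model formula free of loop variables, which makes the fits easier to reuse with predict(); the I((x-c)*(x>c)) form from the hint describes the same term.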
Cited references
- Breiman, L. and J. H. Friedman. 1985. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80: 580–598.
- Faraway, Julian J. 2006. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman & Hall/CRC Press: Boca Raton, FL.