Assignment 6

Due Date

Friday, February 24, 2012

Data Source

The data for this assignment are described at the NOAA web site as follows: "The carbon dioxide data, measured as the mole fraction in dry air, on Mauna Loa constitute the longest record of direct measurements of CO₂ in the atmosphere. … Data are reported as a dry mole fraction defined as the number of molecules of carbon dioxide divided by the number of molecules of dry air multiplied by one million (ppm)."

Preliminary Steps

The goal of this exercise is to develop a regression model for the monthly mean atmospheric CO₂ levels in Hawaii from 1958 to 2012. The model will be purely descriptive using time as the only predictor. The two variables of interest are

average: the monthly average atmospheric CO₂ level. This variable is missing for some months.
decimal date: the year and month converted to a decimal value.

Obtain the data and read it into R. According to the description of the data missing values are recorded as -99.99. Use the na.strings='-99.99' argument of read.table to identify those values as missing. Note: the missing code needs to be given in quotes.

Create a variable, call it ID, that starts at 1 for the earliest observation and has as its last value the number of observations in the data set. We'll use this variable to model periodicities in the CO₂ levels over time.

Questions

Plot monthly average atmospheric CO₂ level against decimal date (with points connected by line segments). Superimpose a linear trend on this scatter plot. Does the trend in the data look linear?
Fit a linear trend model and a quadratic model to the data and test one against the other. Which appears to be the better model?
Plot the ACF of the residuals from the best model of Question 2 using the default setting for lag.max. The displayed lags represent months. What obvious periodicity is displayed?
To account for the observed annual cycle add appropriate trigonometric terms that are functions of the ID variable to the best model from Question 2. Verify that the new terms are statistically significant.
Plot the ACF of the residuals of the model of Question 4 using lag.max=500. What do you see?
Add trigonometric terms that are functions of the ID variable to account for the long-range periodic behavior seen in Question 5. Fit a sequence of models with different integer choices for the period and select the period that yields a model with the lowest AIC. Verify that all the terms in the model are statistically significant.
Plot the ACF of the residuals of the best model of Question 6 using the default settings for lag.max. You should see evidence of a short-range period. Guess the period or fit a sequence of models as in Question 6 to identify the period precisely. Verify that all terms of your new model are statistically significant.
Plot the ACF of the residuals of the best model of Question 7. There still may be evidence of long range periodicities, but the short range ACF should be well-behaved. Plot the PACF of the residuals of this model. Based on what you see in these two plots what correlation structures might you consider?
Add a number of appropriate correlation structures to the best model of Question 7 and select the best of these. Warning: depending on the speed of your computer some of these models may take a long time to fit!
Plot the ACF of the residuals of your best model from Question 9 using type='normalized' to see if the remaining correlation has been accounted for. You should see that except for a significant lag 12 correlation (and its multiples), the remaining residual correlation has been removed.
To remove this last significant correlation try fitting an AR(12) correlation model instead for the residuals. This AR(12) model will not have the lowest AIC of the correlation models considered because the gls function forces us to include all of the intermediate autoregressive parameters, most of which are not needed here.
Plot the predicted mean of your final model superimposed on a plot of the raw data over time.

Hints

Preliminary steps

Because the lines of text at the top of the file begin with a comment symbol, #, it is possible to read the web page directly into R from the web site as is. Specify the URL as the file location and don't include the header= argument of read.table, or specify it as header=F. Without a header R will create generic names for the columns so you will have to supply your own names with the names function after you read in the file.
If instead you would rather read in the file using the names that appear in the file you will need to first edit the file as follows.
- Remove the preliminary text and modify the header so that the variable names occupy only a single row.
- Remove the comment symbol, #, that begins the row of variable names.
- Add names for the variables that occupy the first two columns of the file that are currently unnamed.
- The variable in the last column is named "#days". Because # is the comment symbol in R you must remove the # character from the name before you try to read the data into R.

Question 3 and others involving an ACF

The vector of residuals returned from the models you fit will have the missing observations removed. So it will be shorter than your original data set. This messes up the temporal spacing of the observations. In order to obtain an honest ACF we need to re-introduce those missing values back into the residual vector. The following three-step protocol will do this.

Create a data frame that contains the residuals of the model and the ID variable. You will need to remove those ID values corresponding to missing values of the response before constructing the data frame. This data frame will have dimension 640 × 2.
Create a second data frame that contains the full ID variable. This data frame will have dimension 647 × 1. Make sure the ID variable in both data frames has the same name.
Use the merge function to merge the two data frames. The merge function is used to combine the variables from two data frames using common named fields, in this case ID, to line up the observations properly. We also need to include the argument all=T of merge so that unmatched cases are included.

The data frame this produces will have a column of residuals in which the missing values have been inserted. As an example, suppose the original input data frame is called co2 and the best model from Question 2 is called out2. The three steps listed above would be carried out as follows.

data1 <- data.frame(res=residuals(out2), ID=co2$ID[!is.na(co2$average)])

data2 <- data.frame(ID=co2$ID)

data3 <- merge(data1, data2, all=T)

data3[1:9,]

ID        res
1 1 1.5132024
2 2 3.1844621
3 3 3.1663749
4 4         NA
5 5 1.3888820
6 6 0.3902997
7 7 -1.4084469
8 8         NA
9 9 -1.4172666

Before examining the ACF of the residuals you will need to carry out these three steps for each of the models you fit.

Question 10

For residuals produced by the gls function there is a type='normalized' argument,

residuals(gls.model, type='normalized'),

that extracts normalized residuals in which the standardized residuals are pre-multiplied by the inverse square root factor of the estimated error correlation matrix. If the chosen correlation structure is correct the normalized residuals should no longer display the significant ACF pattern you were trying to remove.

Course Home Page

Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--February 20, 2012
URL: https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/docs/assignments/assign6.htm