Assignment 6—Solution

co2 <- read.table('ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2_mm_mlo.txt', na.strings='-99.99')
names(co2) <- c('year', 'month', 'decimal.date', 'average', 'interpolated', 'trend', 'days')
co2$ID <- 1:nrow(co2)
co2[1:4,]
  year month decimal.date average interpolated  trend days ID
1 1958     3     1958.208  315.71       315.71 314.61   -1  1
2 1958     4     1958.292  317.45       317.45 315.29   -1  2
3 1958     5     1958.375  317.50       317.50 314.71   -1  3
4 1958     6     1958.458      NA       317.10 314.85   -1  4

Question 1

plot(average~decimal.date, data=co2, ylab=expression(paste('CO'[2],' level')), xlab='Year', type='l')
abline(lm(average~decimal.date, data=co2), col=2)


Fig. 1  Raw data plus linear trend model

The plot shows a marked deviation from linearity.
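
A quick way to see the curvature is to plot the residuals of the linear fit against time; they bow systematically rather than scattering around zero. A minimal sketch (fit.lin is just a temporary name):

fit.lin <- lm(average ~ decimal.date, data=co2)
plot(co2$decimal.date[!is.na(co2$average)], residuals(fit.lin), type='l', xlab='Year', ylab='Residual')
abline(h=0, lty=2)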

Question 2

model1 <- lm(average ~ decimal.date, data=co2)
model2 <- lm(average ~ decimal.date + I(decimal.date^2), data=co2)
anova(model1, model2)
Analysis of Variance Table

Model 1: average ~ decimal.date
Model 2: average ~ decimal.date + I(decimal.date^2)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)   
1    638 7284.0                                 
2    637 3051.4  1    4232.5 883.57 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The quadratic term is highly significant.

Question 3

# pad the residual series with NA at months where average is missing so the ACF lags remain evenly spaced
data1 <- data.frame(res=residuals(model2), ID=co2$ID[!is.na(co2$average)])
data2 <- data.frame(ID=co2$ID)
data3 <- merge(data1, data2, all=T)
acf(data3$res, na.action=na.pass, main='')
mtext('Model 2 residuals', side=3, line=.5, cex=1)


Fig. 2  ACF for the model 2 residuals

The plot shows an obvious cycle at 12 months (the point at which the graph begins to repeat itself).
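
A quick numerical check of the seasonality is to average the model 2 residuals by calendar month; because data3 is sorted by ID, it lines up row for row with co2. A rough sketch:

round(tapply(data3$res, co2$month, mean, na.rm=TRUE), 2)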

Question 4

model3 <- update(model2, .~. + sin(2*pi/12*ID) + cos(2*pi/12*ID))
anova(model2, model3)
Analysis of Variance Table

Model 1: average ~ decimal.date + I(decimal.date^2)
Model 2: average ~ decimal.date + I(decimal.date^2) + sin(2 * pi/12 *
    ID) + cos(2 * pi/12 * ID)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)   
1    637 3051.41                                 
2    635  532.88  2    2518.5 1500.6 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The addition of the trigonometric terms of period 12 is highly significant.

Question 5

data1 <- data.frame(res=residuals(model3), ID=co2$ID[!is.na(co2$average)])
data2 <- data.frame(ID=co2$ID)
data3 <- merge(data1, data2, all=T)
acf(data3$res, na.action=na.pass, main='', lag.max=500)
mtext('Model 3 residuals', side=3, line=.5, cex=1)


Fig. 3  ACF for the model 3 residuals

Although the oscillations damp down with increasing lag, it's pretty clear we have a cycle somewhere in excess of 300 months.
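
As a rough side check, the most negative autocorrelation should occur near half the long period, so doubling that lag gives a crude estimate of the period. A sketch only (long.acf is just a temporary name):

long.acf <- acf(data3$res, na.action=na.pass, lag.max=500, plot=FALSE)
long.acf$lag[which.min(long.acf$acf)]      # lag of the most negative autocorrelation
2*long.acf$lag[which.min(long.acf$acf)]    # implied period in months, very roughly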

Question 6

I try fitting models with a period of 250 months up to a period of 400 months and save the AIC from each model.

my.aic <- matrix(NA, ncol=2, nrow=151)
for(k in 250:400) {
    temp.model4 <- update(model3, .~. + cos(2*pi/k*ID) + sin(2*pi/k*ID))
    my.aic[k-249,] <- c(k, AIC(temp.model4))
}
which.min(my.aic[,2])
[1] 76
my.aic[76,]
[1]  325.000 1484.963
plot(my.aic[,1], my.aic[,2], type='l', xlab='Period', ylab='AIC')


Fig. 4   AIC values for different periodic models, period 250 to 400

Although a period of 325 minimizes AIC, it's pretty clear from the graph that a number of other choices for the period would do nearly as well. This is not surprising given we only have enough data to observe two full cycles.
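
To see just how flat the AIC profile is, one can list every period whose AIC lies within two units of the minimum, using the common rule of thumb that models within 2 AIC units are essentially indistinguishable:

my.aic[my.aic[,2] < min(my.aic[,2]) + 2, 1]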

model4 <- update(model3, .~. + sin(2*pi/325*ID) + cos(2*pi/325*ID))
anova(model3, model4)
Analysis of Variance Table

Model 1: average ~ decimal.date + I(decimal.date^2) + sin(2 * pi/12 *
    ID) + cos(2 * pi/12 * ID)
Model 2: average ~ decimal.date + I(decimal.date^2) + sin(2 * pi/12 *
    ID) + cos(2 * pi/12 * ID) + sin(2 * pi/325 * ID) + cos(2 *
    pi/325 * ID)
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)   
1    635 532.88                                
2    633 371.98  2     160.9 136.9 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The addition of the trigonometric terms of period 325 is highly significant.

Question 7

data1 <- data.frame(res=residuals(model4), ID=co2$ID[!is.na(co2$average)])
data2 <- data.frame(ID=co2$ID)
data3 <- merge(data1, data2, all=T)
acf(data3$res, na.action=na.pass, main='')
mtext('Model 4 residuals', side=3, line=.5, cex=1)


Fig. 5   ACF of the model 4 residuals

The graph shows an obvious cycle at six months.

model5 <- update(model4, .~. + sin(2*pi/6*ID) + cos(2*pi/6*ID))
anova(model4, model5)
Analysis of Variance Table

Model 1: average ~ decimal.date + I(decimal.date^2) + sin(2 * pi/12 *
    ID) + cos(2 * pi/12 * ID) + sin(2 * pi/325 * ID) + cos(2 *
    pi/325 * ID)
Model 2: average ~ decimal.date + I(decimal.date^2) + sin(2 * pi/12 *
    ID) + cos(2 * pi/12 * ID) + sin(2 * pi/325 * ID) + cos(2 *
    pi/325 * ID) + sin(2 * pi/6 * ID) + cos(2 * pi/6 * ID)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)   
1    633 371.98                                 
2    631 184.47  2    187.51 320.69 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The addition of the trigonometric terms of period 6 is highly significant.

Question 8

data1 <- data.frame(res=residuals(model5), ID=co2$ID[!is.na(co2$average)])
data2 <- data.frame(ID=co2$ID)
data3 <- merge(data1, data2, all=T)
acf(data3$res, na.action=na.pass, main='')
mtext('Model 5 residuals', side=3, line=.5, cex=1)
pacf(data3$res, na.action=na.pass, main='')
mtext('Model 5 residuals', side=3, line=.5, cex=1)


Fig. 6   ACF of the model 5 residuals

The decaying pattern of the ACF is highly suggestive of an autoregressive process.


Fig. 7   PACF of the model 5 residuals

The sharp spike at lag 1 in the PACF suggests that the autoregressive process may be AR(1). The two much smaller but still significant spikes at lags 2 and 3 suggest we should also consider an AR(2) or AR(3). Because the drop after the first spike is sudden while the decay over spikes 2 and 3 is gradual, we might also consider adding some moving average terms.
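
For comparison, the theoretical ACF and PACF of an AR(1) process can be generated with ARMAacf; the coefficient .9 below is purely illustrative, not an estimate from these data:

ARMAacf(ar=.9, lag.max=25)             # geometric decay, as in Fig. 6
ARMAacf(ar=.9, lag.max=25, pacf=TRUE)  # single spike at lag 1, as in Fig. 7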

Question 9

I start by fitting the last model using the gls function of the nlme package.

library(nlme)
out.g <- gls(average ~ decimal.date + I(decimal.date^2) + sin(2*pi/12*ID) + cos(2*pi/12*ID) + sin(2*pi/325*ID) + cos(2*pi/325*ID) + sin(2*pi/6*ID) + cos(2*pi/6*ID), data=co2, method='ML', na.action=na.omit)

Based on the ACF and PACF, I consider autoregressive terms up to order 3. Because the PACF is a bit ambiguous, I also include moving average terms up to order 3.

cor.results <- NULL
for(p in 0:3) {
for(q in 0:3) {
if(p>0 | q>0) {
cor.temp <- update(out.g, correlation=corARMA(p=p, q=q, form = ~ ID))
cor.results <- rbind(cor.results, c(p, q, logLik(cor.temp), AIC(cor.temp))) }
}
}
cor.results2 <- rbind(c(0,0, logLik(model5), AIC(model5)), cor.results)
colnames(cor.results2) <- c('p','q','LL','AIC')
cor.results2
      p q        LL       AIC
 [1,] 0 0 -510.0550 1040.1099
 [2,] 0 1 -324.1492  670.2985
 [3,] 0 2 -260.6680  545.3360
 [4,] 0 3 -223.3909  472.7818
 [5,] 1 0 -169.3214  360.6429
 [6,] 1 1 -156.9674  337.9348
 [7,] 1 2 -155.8479  337.6958
 [8,] 1 3 -155.8348  339.6697
 [9,] 2 0 -159.4348  342.8696
[10,] 2 1 -155.8277  337.6554
[11,] 2 2 -155.8166  339.6332
[12,] 2 3 -155.8134  341.6267
[13,] 3 0 -156.0176  338.0353
[14,] 3 1 -155.8686  339.7372
[15,] 3 2 -155.7893  341.5785
[16,] 3 3 -144.5356  321.0711

We see a large drop in AIC in going from model5 (p=0, q=0) to an AR(1) model and again to an ARMA(1,1) model. After that there is essentially no further improvement until the last model examined, the ARMA(3,3), which has the lowest AIC of the set. I refit that model as model6.

model6 <- update(out.g, correlation=corARMA(p=3, q=3, form=~ID))
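
To make the ranking explicit, the table can also be sorted by AIC (a quick check that just reorders cor.results2):

cor.results2[order(cor.results2[,'AIC']),][1:5,]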

Question 10

data1 <- data.frame(res=residuals(model6, type='normalized'), ID=co2$ID[!is.na(co2$average)])
data2 <- data.frame(ID=co2$ID)
data3 <- merge(data1, data2, all=T)
acf(data3$res, na.action=na.pass, main='')
mtext('Model 6 residuals', side=3, line=.5, cex=1)

Fig. 8a shows the ACF of the normalized residuals. There are still a number of significant spikes at various lags. We haven't been correcting for the large number of tests that we've been doing. Because we've included a period 325 term determined on the basis of examining an ACF, I feel justified in including a Bonferroni correction for 325 tests (Fig. 8b).
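
The code for Fig. 8b is not shown here; it presumably mirrors the Bonferroni-adjusted call used for Fig. 9b in Question 11, something along these lines:

acf(data3$res, na.action=na.pass, main='', ci=1-.05/325)
mtext('Model 6 residuals (Bonferroni correction for 325 tests)', side=3, line=.5, cex=.9)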


Fig. 8  ACF for the ARMA(3,3) quadratic model with three periodic terms using (a) alpha = .05 and (b) alpha = .05/325, a Bonferroni correction to account for testing at 325 different lags.

So, even with such a large Bonferroni correction there remains a significant spike at lag 12. Apparently the trig terms have not adequately accounted for the annual cycle.

Question 11

I replace the ARMA(3,3) model with an AR(12) model.

model7 <- update(out.g, correlation=corARMA(p=12, q=0, form=~ID))
AIC(model7)
[1] 331.361
data1 <- data.frame(res=residuals(model7, type='normalized'), ID=co2$ID[!is.na(co2$average)])
data2 <- data.frame(ID=co2$ID)
data3 <- merge(data1, data2, all=T)
acf(data3$res, na.action=na.pass, main='')
mtext('Model 7 residuals', side=3, line=.5, cex=1)
acf(data3$res, na.action=na.pass, main='', ci=1-.05/30)
mtext('Model 7 residuals (Bonferroni correction for 30 tests)', side=3, line=.5, cex=.9)

Now a fairly modest Bonferroni correction removes the significance of the spike at lag 12.


Fig. 9  ACF for the AR(12) quadratic model with three periodic terms using (a) alpha = .05 and (b) alpha = .05/30, a Bonferroni correction to account for testing at 30 different lags.

Question 12

plot(average~decimal.date, data=co2, ylab=expression(paste('CO'[2],' level')), xlab='Year', type='l', lwd=2, col='grey70')
#reintroduce missing values into the predicted values time series
pred1 <- data.frame(pred=predict(model7), ID=co2$ID[!is.na(co2$average)])
pred3 <- merge(pred1, data2, all=T)
pred3[1:6,]
  ID     pred
1  1 316.0438
2  2 317.0623
3  3 317.6913
4  4       NA
5  5 315.5368
6  6 313.4073
lines(co2$decimal.date, pred3$pred, col=2)
legend('topleft', c('raw data', 'model'), col=c('grey70',2), lty=c(1,1), lwd=c(2,1), cex=.9, bty='n')


Fig. 10   Plot of model predictions and the raw data

I zoom in on the more recent portion of the plot to get a better sense of the fit.

plot(average~decimal.date, data=co2[co2$decimal.date>=2000,], ylab=expression(paste('CO'[2],' level')), xlab='Year', type='l',lwd=2,col='grey70')
lines(co2$decimal.date[co2$decimal.date>=2000], pred3$pred[co2$decimal.date>=2000], col=2)
legend('topleft', c('raw data', 'model'), col=c('grey70',2), lty=c(1,1), lwd=c(2,1), cex=.9, bty='n')


Fig. 11   Plot of model predictions and the raw data for 2000 – 2012

It appears that there is a degree of irregularity to the annual cycle that we fail to match. The amplitude varies slightly over time. If we had some useful time-dependent predictors we might be able to explain this deviation from a purely sinusoidal pattern.
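
A crude way to see the drifting amplitude is to remove the quadratic trend from the gap-free interpolated series and compute the peak-to-trough range within each calendar year. A rough sketch (detrended is just a temporary name, and the first and last calendar years are incomplete):

detrended <- residuals(lm(interpolated ~ decimal.date + I(decimal.date^2), data=co2))
round(tapply(detrended, co2$year, function(x) diff(range(x))), 2)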

The standard tool for detecting periodic signals in a time series is the periodogram (spectrum). If we examine the spectrum of the residuals from the quadratic model, the three cycles we found show up as the dominant peaks. (I refit the model to the interpolated series because the spectrum function does not permit missing data.) This is further justification for the model we've fit.

out2 <- lm(interpolated~decimal.date + I(decimal.date^2), data=co2)
out.sp <- spectrum(residuals(out2), spans=3, log='no')
1/out.sp$freq[out.sp$spec>30]
[1] 324.00000 12.22642 12.00000 11.78182 6.00000
plot(1/out.sp$freq, out.sp$spec, type='h', ylab='spectrum', xlab='period', col=2, lwd=2)


Fig. 12   Spectrum of the residuals from the quadratic model

