Lecture 33—Monday, April 2, 2012

Topics

R functions and commands demonstrated

R packages used

Data sets with both spatial and temporal correlation

Our approach has been to treat spatial and temporal correlation in data as a nuisance, problems to be addressed after we've fit the best model we can using the predictors at our disposal. Beginning in lecture 15 we examined data in which measurements were obtained over time on the same subject as well as on different subjects yielding a hierarchical data structure with the potential for residual temporal correlation. Beginning in lecture 30 we turned to spatially referenced data where even after fitting a regression model with spatially referenced predictors the spatial correlation persisted. Today we examine a data set in which the residuals of our best regression model exhibit both temporal and spatial correlation.

To illustrate this we return to a data set that was analyzed in Assignment 8. The data consist of repeated observations over a four year period on 691 different parcels of land in eastern North Carolina. The goal was to determine factors that affect the decision to use burning as a land management tool. We used GEE to fit a logistic regression model to the binary response. GEE provides marginal interpretations of the parameters rather than the subject-specific interpretations provided by mixed effects models, and was needed to deal with the unusual temporal correlation structure exhibited by the data. Observations separated by three years were positively correlated while observations separated by one or two years were negatively correlated. This correlation structure is consistent with the rule-of-thumb that recommends burning every three years.

I load the data and fit a GEE model with an unstructured correlation matrix. For completeness I fit the model using both the gee and geepack packages.

burn <- read.csv('ecol 562/burn.csv')
library(gee)
gee.un <- gee(burn50 ~ Hectares + factor(rx_cat3) + zmin_drd + prop_timb + Prop_no_b + zmin_dr + zmin_ddv + zmin_ddv:factor(rx_cat3), family = binomial, data = burn, id=PolyFID, corstr="unstructured", scale.fix=T)
library(geepack)
geeglm.un <- geeglm(burn50 ~ Hectares + factor(rx_cat3) + zmin_drd + prop_timb + Prop_no_b + zmin_dr + zmin_ddv + zmin_ddv:factor(rx_cat3), family = binomial, data = burn, id=PolyFID, corstr="unstructured", scale.fix=T)

The data consist of repeated observations on parcels of land. Because the parcels are spatially referenced and have spatial extent there is the potential for spatial correlation. For instance, it's not unreasonable for nearby parcels to have been burned at the same. As a first step toward understanding this we plot the physical locations of the parcels and examine their distance relationships.

Using shapefiles in R

To help display the spatial relationships between parcels I obtain a state and county boundary shapefile from the North Carolina Geodetic Survey web site. Shapefiles can be read into R with functions from the maptools package. Below I use the readShapePoly function for this purpose.

#read in NC shape file
library(maptools)
nc1 <- readShapePoly( "ecol 562/NC_State_County_Boundary_NAD83HARN/NC_State_County_Boundary_NAD83HARN.shp")
plot(nc1)

Fig. 1

Fig. 1  Plot of North Carolina shapefile

Because the coordinates of each site appear multiple times in the data set (for each of the four years of measurement), I use the tapply function to extract one set of coordinates from each site.

#select one value from each PolyFID
x.un <- tapply(burn$X, burn$PolyFID, function(x) x[1])
y.un <- tapply(burn$Y, burn$PolyFID, function(x) x[1])

To zoom in on the portion of North Carolina where the data are, I use the ranges of the plot coordinates to set the x-limits and y-limits of the North Carolina shapefile map. We're told that the map is scaled in feet, while the coordinates in the burn data set are scaled in meters, so we need to use a meters-to-feet conversion factor when plotting the sites. I also subtract and add a small amount to the lower and upper limits of the range when specifying the limits in the plot function.

range(x.un)
[1] 707957.92 858158.03
range(y.un)
[1]  70192.845 174379.181
#conversion factor: meters to feet
mult <- 3.28083989501312
plot(nc1, xlim=range(x.un)*mult + c(-100,100), ylim=range(y.un)*mult + c(-100, 100))
#add PolyFID locations
points(x.un*mult, y.un*mult, col=2, cex=.5)

I also add a key showing the distance scale used on the map.

#add distance scale to graph
segments(2500000, 234000, 2610000, 234000)
text(2500000+seq(0,33000,2500)*mult, 234000, '|', cex=.5, col='grey50')
text(2500000+seq(0,30000,10000)*mult, 234000,'|')
text(2500000+seq(0,30000,10000)*mult, 234000, c(0,'10','20','30'), cex=.7, pos=1)
text(2610000, 234000, 'km', pos=4, cex=.7)

Parcel locations
Fig. 2  Spatial distribution of parcels (PolyFIDs) from the burn data set in North Carolina  

The map reveals a substantial amount of clustering of the parcels. Unlike the wheat yield data we examined in lecture 32 where the observations were equally spaced, the spatial distribution of the burn sites is quite patchy consisting of both dense and loose clusters of points some with large gaps between them. To get a better sense of the distances involved I calculate all the pairwise distances between sites and examine their distribution.

#calculate geographic distances
out.d <- dist(cbind(x.un, y.un))
range(out.d)
[1]     21.786218 164640.824375

So sites are as close as 22 meters and as far as part as 165 kilometers! I display a histogram of the distances and superimpose a kernel density estimate using the density function. Because the boundary value of zero will distort the kernel density estimate, I append to the distances their negative values and fit the kernel density to both. I then discard the negative values and double the remaining y-coordinates so that the area under the kernel density still integrates to 1. I begin by converting the "dist" object created by the dist function to a matrix and extract the unique distances that lie in the lower triangle of the matrix. The distribution of distances turns out to be distinctly bimodal with a long tail.

dist.mat <- as.matrix(out.d)
unique.d <- as.vector(dist.mat[lower.tri(dist.mat)])
hist(unique.d, probability=T, xlab='distance', main='')
#add kernel density estimate by reflecting data across y-axis
out.ker <- density(c(-unique.d, unique.d))
names(out.ker)
[1] "x"         "y"         "bw"        "n"         "call"      "data.name"
[7] "has.na"
#throw away negative values and rescale density
new.x <- out.ker$x[out.ker$x>0]
new.y <- 2*out.ker$y[out.ker$x>0]
lines(new.x, new.y, col=2)

fig. 3
Fig. 3  Histogram and kernel density estimate of the distribution of inter-parcel distances  

We next turn to assessing whether there is lingering spatial correlation in the residuals from the GEE model. Because the Mantel and semivariogram functions require that all the observations have distinct coordinates so that none of the pairwise distances are zero, I restrict the analysis to just the residuals from 2004. I extract the spatial coordinates and Pearson residuals from the geeglm model and subset them by year.

#restrict data and results to 2004
burn04 <- burn[burn$year==2004,]
#residuals from 2004
res04 <- residuals(geeglm.un, type='pearson')[burn$year==2004]
newdat <- data.frame(x=burn04$X, y=burn04$Y, r=res04)

I use the mantel function from vegan to obtain a randomization test of the Mantel correlation of the residual distances in 2004 with the geographic distances.

#Mantel test
dist.g <- dist(newdat[,1:2])
dist.r <- dist(newdat[,3])
library(vegan)
mantel(dist.g, dist.r)
Mantel statistic based on Pearson's product-moment correlation

Call:
mantel(xdis = dist.g, ydis = dist.r)

Mantel statistic r: 0.05939
      Significance:
0.001

Empirical upper confidence limits of r:
   90%    95%  97.5%    99%
0.0171 0.0230 0.0307 0.0397

Based on 999 permutations

The correlation although small is statistically significant. As we'll see the small correlation is probably due to the very local nature of the spatial correlation coupled with the very large spatial extent of the data. Only nearby observations are correlated; the vast majority of the observations are uncorrelated.

Accounting for spatio-temporal correlation

So how do we account for the spatial correlation in a way that will also allows us to simultaneously account for the temporal correlation? In lecture 32 we had success by including the spatial coordinates of the observations as predictors in the model, either in the form of a response surface or as nonparametric smoothers in a GAM. Using this approach here we would just need to add appropriate geographic terms to the regression equation in the GEE model for temporal correlation. Unfortunately this approach doesn't work here. Adding geographic terms to the mean process model does not substantially reduce the spatial correlation. This shouldn't be surprising because it's unlikely that the large-scale geographic location of a site has much to do with the decision to burn or not. Such decisions are likely to be very local affecting only whether adjacent sites are burned together or not, a decision that does not vary with latitude and longitude. Contrast this with the effect of geography on wheat yield in the experiment of lecture 32. There unmeasured soil conditions that varied geographically could easily have led to the spatial correlation in yield that was detected.

Because the parcels have areal extent we could view these as lattice data and define neighborhoods for each parcel consisting of those parcels that share boundaries with a given parcel. A frequently used spatial correlation model for lattice data is the ICAR model, an acronym for the intrinsic conditional autoregressive model. In a CAR model each parcel is assigned its own unique random effect but also receives the weighted inputs from the random effects of its neighbors. This is a kind of mixed effects model but because of its complicated nature it is most easily implemented using Bayesian methods. Because we need to couple the spatial correlation with a temporal correlation that we saw in Assignment 8 was not correctly captured by including random effects, we can reject a CAR model for these data.

Given that GEE was able to effectively handle the temporal correlation it would be nice if we could add the spatial correlation to the GEE model. When the data are sorted by parcel and there is only temporal correlation among observations from the same parcel, the correlation matrix of the full data set is a block diagonal matrix in which the temporal correlation matrix occurs repeatedly along the diagonal, once for each parcel, and there are zeros everywhere else. A straightforward way to extend a temporal correlation matrix to a spatio-temporal correlation matrix is to make the additional assumption that the space-time covariance is separable, meaning that it can be written as the product of a spatial covariance and a temporal covariance. See, e.g., Gumpertz et al. (2000) and Lin (2010).

With the assumption of separability the spatio-temporal correlation matrix is obtained as the Kronecker product of the spatial correlation matrix estimated from a variogram, say, and the temporal correlation matrix estimated from GEE or directly from the model residuals. In the Kronecker product the temporal correlations appear repeatedly as identical 4 × 4 matrices along the block diagonal. These same blocks also appear in the off-diagonal positions but multiplied by an appropriate spatial decay constant that decreases as a function of the distance separating the plots. These off-diagonal elements describe how the temporal correlation pattern for observations from the same site is diminished by spatial extent when observations from different sites are considered. The main attraction of separability apart from its simplicity is that if the temporal matrix and the spatial matrix are both legitimate correlation matrices (i.e., they are both invertible and positive definite), then the Kronecker product of these two matrices will also be a legitimate correlation matrix. There is no guarantee of this happening when the matrix is assembled in a piecemeal fashion.

As an illustration suppose A is a 2 × 2 spatial correlation matrix and B is a 3 × 3 temporal correlation matrix.

a matrix      B matrix

Then the Kronecker product, kronecker is defined to be the following 6 × 6 matrix.

kronecker

For the burn data set we would choose A to be a 691 × 691 matrix of spatial correlations between the 691 parcels and B to be the 4 × 4 matrix of temporal correlations among the four years 2004 through 2007. The final matrix kronecker is 2764 × 2764, the dimensionality of the data set.

Methods for obtaining the spatio-temporal correlation matrix

We can construct the spatial correlation matrix A and/or the complete spatio-temporal correlation matrix C in a number of ways. I list three.

  1. Choose only one time period to estimate the spatial correlation, e.g. year 2004, and fit a parametric semivariogram model to the empirical semivariogram and use it to estimate a spatial correlation function. Evaluate the correlation function on the matrix of pairwise distances to obtain the spatial correlation matrix between parcels. Construct the spatio-temporal correlation matrix as the Kronecker product of the distance matrix and the temporal correlation matrix (estimated from GEE or as the correlation of the model residuals by year). Fit a GEE model using this matrix as the correlation matrix with an id variable in which all observations are assigned to the same group. For example, to use a user-specified correlation matrix in the geeglm function set corstr="fixed" option and use the zcor argument to list the entire lower triangle of the spatio-temporal correlation matrix as was done in assignment 8. The process could then be repeated for a different year to hopefully demonstrate that estimating the spatial correlation using a different year does not substantively change the conclusions drawn from the model.
  2. It is possible to analyze multiple dependent variables jointly in a multivariable semivariogram using the gstat package of R. The method for doing so is described beginning on p. 206 of Applied Spatial Data Analysis with R by Bivand et al. In a multivariable analysis the residuals from different years are treated as different variables measured from the same observational unit, the plot. The fit.lmc function in the gstat package calculates direct and cross semivariograms for variables individually and in pairs. What makes this approach especially attractive is that the fit.lmc function will constrain the matrix of all ten unique partial sill estimates to be positive definite, a property that is fundamental when using the cross-semivariograms to construct a correlation matrix. Using this approach there is no need to calculate a separate temporal correlation matrix because this method will return the spatio-temporal correlation directly as a function of distance. It also relaxes the separability assumption and allows for the simultaneous estimation of both the temporal and spatial correlation. Unfortunately given that the same initial parametric settings are used for all the semivariograms, the ten individual semivariogram models can be extremely difficult to fit in this fashion.
  3. A third approach is a hybrid of these two. In it we estimate separate empirical semivariograms for each year (direct semivariograms) and for each combination of years (cross-semivariograms). The next step is to fit a theoretical semivariogram model to the set of empirical semivariances using a nonlinear mixed effects model with the source of the empirical semivariogram as the grouping variable and in which one or more of the semivariogram parameters are allowed to vary by group. This approach uses all the data from all four years simultaneously thus addressing one possible criticism of method 1. Furthermore the mixed effects model will yield more robust estimates of the correlation than method 2 and avoid convergence problems. It turns out for these data that the semivariograms and cross-semivariograms from method 2 fall into two natural groups (of five semivariograms each) so that a good way to proceed is to obtain a separate set of correlation parameters for each group using two separate mixed effects models. As in method 1 the spatial extent of the spatio-temporal correlation matrix is populated by allowing the elements of the temporal correlation matrix to decay with distance. Thus this is still a separable model but one in which the elements of the spatial matrix are obtained in a more general fashion.

In all cases it is necessary to check the entries of the constructed spatio-temporal correlation matrix for feasibility. The Prentice bounds for binary data must still hold and the functions presented in lecture 23 can be used to check this. Given the way in which the correlation matrix was constructed, feasibility violations in the spatial part of the matrix should be corrected by adjusting the parameters of the overall semivariogram model that was used to generate the matrix, rather than by tweaking only the individual matrix entries that violate the bounds.

Using a single year to estimate the spatial correlation matrix

I use the 2004 residuals to estimate a semivariogram model. I start by generating the empirical semivariogram. I try bins with centers that are a distance 100 m apart and extend the calculations out to 5000 meters.

#obtain empirical semivariogram of residuals
library(geoR)
geodat <- as.geodata(newdat)
geodat.v1 <- variog(geodat, uvec=seq(50,4950,100), max.dist=5000, option='bin')
geodat.v1$n
 [1]   8  64 142 153 167 213 231 288 291 310 335 373 349 368 366 368 387 384 378 396
[21] 408 367 378 426 428 386 433 428 406 447 438 448 458 478 469 458 472 442 474 473
[41] 488 474 510 519 511 496 502 517 523 552

The first bin clearly has too few observations so I try moving the center of the first bin to a larger value.

#move center of first bin so it has more data
geodat.v1 <- variog(geodat, uvec=seq(75,4950,100), max.dist=3500, option='bin')
geodat.v1$n
 [1]  17  82 151 156 175 229 227 292 316 323 343 351 351 368 371 390 352 402 376 389
[21] 404 388 374 440 418 391 433 407 424 446 453 439 457 480 472 484 440 442 485 466
[41] 469 527 499 505 493 530 493 508 525 427
plot(geodat.v1)

fig. 4
Fig. 4  Semivariogram of 2004 residuals from a GEE model  

The graph seems to be turning back on itself after reaching what appears to be a sill of around 1.3. I try extending the graph out even further to see if this pattern persists.

#extend the distances out further
geodat.v2 <- variog(geodat, uvec=seq(75,14950,100), max.dist=10000, option='bin')
plot(geodat.v2)

fig. 5
Fig. 5  Semivariogram of 2004 residuals from a GEE model carried out to 10,000 meters.  

Now we see that at around 3000 m the graph becomes quite noisy and begins to oscillate around the putative sill of 1.3. Considering the arrangement of parcels shown in Fig. 2 the patchiness of the data set may be causing this. Once a certain distance is reached the semivariances are being calculated between observations from clusters that are geographically disjoint. The mean process model probably doesn't adequately account for all the spatial heterogeneity in the data set and this is manifesting itself in the noisy semivariogram at greater distances. In any case it's pretty clear that the local correlation that we're trying to detect dissipates at around 3500 m, so we should stop calculating the semivariances at that point. We'll fit theoretical models to this semivariogram in the next lecture.

#cut calculations off at 3500
geodat.v1 <- variog(geodat, uvec=seq(75,4950,100), max.dist=3500, option='bin')
plot(geodat.v1)

fig 6
Fig. 6  Semivariogram of 2004 residuals from a GEE model  

Cited Reference

Course Home Page


Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--April 3, 2012
URL: https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/docs/lectures/lecture33.htm