Lecture 42—Wednesday, April 25, 2012

Outline of lecture

Comments on stratified random samples
Cluster samples
Contrasting stratified and cluster samples
Systematic random samples
The survey package of R
Using survey sampling methods for statistical modeling
Cited references
R code used to produce figures

Comments on stratified random samples

Stratified random sampling is appropriate when the population of interest consists of coherent subpopulations. In stratified random sampling we treat the subpopulations as separate populations in their own right for sampling purposes and we call the subpopulations strata. The point estimates and their precisions for the individual strata are then reassembled to form a stratified estimate of the mean and variance.

For stratified random sampling to be effective, the strata should be individually more homogeneous than the population as a whole. We want most of the variability to lie between strata rather than within strata. In ecology strata are often chosen to be habitat types. If an area exhibits an environmental gradient that is likely to have an effect on the variable we're measuring then we should construct strata boundaries that lie perpendicular to that gradient. In this case the strata are constructs that we've invented that may or may not correspond to obvious subpopulations in the field. The primary advantage of stratified random sampling is that the standard errors of our estimates are decreased because the variability that exists between the strata has been eliminated from the analysis.

Within the strata any other possible sampling scheme is legitimate. Thus we can have stratified simple random samples, stratified systematic random samples, or stratified cluster random samples. The one practical issue that needs to be decided is how to allocate resources when sampling the different strata. Proportional allocation is a logical and relatively simple approach. Stratified random sampling nearly always results in a gain in precision unless the researcher is totally clueless about his/her study subjects.

Cluster samples

One-stage cluster samples

Suppose the tract occupied by our schematic population is crisscrossed by roads that provide easy access to groups of units (Fig. 1). Logistically it makes sense to sample these units as a group. Groups of sampling units that exist for reasons unrelated to the variable of interest are called clusters. If we take a random sample of clusters and then take all members of the selected clusters for our sample we've created what's called a one-stage cluster sample. Because each cluster contains four elements, to obtain a one-stage cluster sample of size 12 we need only select three clusters. Because there are nine clusters total it follows that there are possible one-stage cluster samples of size 3.

choose(9,3)
[1] 84

Using the sample function in R we can obtain a list of the clusters for our one-stage cluster sample as follows.

sample(1:9, 3, replace=FALSE)

An example of a one-stage cluster sample taken from our schematic population is shown in Fig. 1.


Fig. 1 One-stage cluster sample	Fig. 2 Two-stage cluster sample

Two-stage cluster samples

A two-stage cluster sample is an extension of a one-stage cluster sample in that it adds a second sampling stage. After selecting the clusters (primary sample units) we then choose a subsample of units from each cluster (secondary sample units), generally using simple random sampling. To obtain a sample of size 12 from our schematic population using a two-stage cluster sample, we could select six clusters at the first stage and then two units from each of those clusters at the second stage. (Other possibilities are to select four primary units followed by three secondary units from each, etc.) There are different cluster samples of this type.

choose(9,6)*(choose(4,2))^6
[1] 3919104

Using R we obtain the six primary sampling units as follows.

sample(1:9, 6, replace = FALSE)

Then if we think of the units in each cluster as being numbered 1, 2, 3, and 4 in some order, we can obtain the secondary units as follows.

sapply(1:6, function(x) sample(1:4, 2, replace=F))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    3    4    3    2    2
[2,]    4    1    3    2    3    3

An example of one such two-stage cluster sample is shown in Fig. 2.

General comments about cluster samples

Cluster samples are often not chosen by design but are a product of circumstances. If the primary sampling unit is a quadrat or a sampling device (like a dip net or a bucket), but the object of interest is the individual of which many are found in each primary sampling unit, then a cluster sample has been obtained. Cluster samples are identified by the number of stages that are present.

In a single stage (or simple) cluster sample, clusters are selected and then every unit occurring in that cluster is also taken.
In multistage cluster samples the clusters are selected and then subsampled one or more times. The number of nested subsamples that are present defines the number of stages.

Cluster sampling tends to work best when the individual clusters are as diverse as possible, at least as diverse as the population as a whole. Typically this will not be the case because cluster elements tend to be positively correlated and hence less diverse than the population as a whole. As a result parameter estimates obtained using cluster samples tend to be less precise (more variable) than estimates obtained from simple random sampling.

An advantage of cluster sampling is that a sample frame is not needed for the whole population, only for the clusters that happen to be included in the sample, and if we're doing only a one-stage cluster sample, then not even that. For a one-stage cluster sample our sample frame is just a list of the clusters. In environmental science clusters tend to be spatial or temporal units.

Contrasting stratified and cluster samples

Stratified random samples and cluster samples are easy to distinguish after the fact, but perhaps less so before sampling begins. It's conceivable that one person's strata might be another person's clusters.

Stratified random sample	Cluster sample
In a stratified random sample all strata are included in the sample.	In a cluster sample only a random subset of clusters is included.
Stratified sampling is done to increase precision.	Cluster sampling is done to decrease cost.
A "good" stratum will look very different from the overall population and in particular be different from the other strata comprising that population.	A "good" cluster should look just like the population as a whole and should resemble all other clusters in the population.
Strata should be internally homogeneous.	Clusters should be internally heterogeneous.

Systematic random samples

Fig. 3 Systematic sample

Systematic samples are very popular because they are very easy to carry out and are often logistically attractive. Unfortunately they have a number of problems statistically. After a random starting point in the sample frame, a systematic sample selects additional sample units at regular intervals. Thus for a systematic sample to be feasible it must be possible to order the sampling units in some way.

To obtain a systematic sample of size 12 from our schematic population of size 36, we randomly choose a value from 1, 2, or 3 as a starting place and then select every third value from that starting point until we cycle through the population. Observe that in this instance there are only three possible starting points and hence only three possible systematic samples of size 12 from this population.

Using R we obtain our systematic sample as follows.

seq(sample(1:3,1),36,3)
[1] 2 5 8 11 14 17 20 23 26 29 32 35

Fig. 3 illustrates one such sample. While the regularity of the sample may be a bit disconcerting, such a sample can be either very good or very bad.

It's very bad if it turns out there is a gradient running east-west, because then we've managed to miss much of the gradient having obtained samples from only two points in the gradient.
It's very good if it turns out there is a gradient running north-south because then we've managed to obtain samples from all possible points in the gradient.

Besides its goodness or badness, the real difficulty with systematic samples is in trying to come up with an honest measure of sampling variability. To accomplish this often requires making some rather extreme assumptions or carrying out some ad hoc statistical gymnastics.

A systematic sample is an ordered sample. The implicit assumption in systematic sampling is that it is possible to order the target area in some way and to use that order to pick a sample. Means obtained with a systematic sample of size n will be unbiased for the population mean (equal to the population mean on average) if the sample is spread out over the population in equal increments in such a way that is an integer, where N is the population size.

Generally speaking, if the target area is large and the sampling area is small, there are no particular statistical problems inherent in using a systematic sample instead of a simple random sample. An example of such a situation would be taking a transect along an ocean bottom stopping every kilometer or so to take a sample.

How should one obtain the variance of systematic sample estimates?

If the population is fairly homogeneous it is probably OK to use the variance formulas for simple random sampling.
If the population is sparse or otherwise heterogeneous, it is better to replicate the systematic sample and use the replicates to obtain a variance estimate. The replication may be done by simply dividing a single systematic sample into sections.

If a population exhibits a trend then a judiciously selected systematic sample can be more precise than a simple random sample of the same size.

The survey package of R

There exist formulas for calculating simple statistics and their standard errors for most of the standard sampling designs. But when sampling protocols are complicated it becomes necessary to approximate things numerically either using available software or by writing your own routines. The survey package in R can provide sample estimates along with their standard errors for all of the standard sample designs plus many additional ones. The survey package is not part of the standard R installation. So the first time you plan to use it you first need to obtain this package from the R web site.

library(survey)

The survey package was written by Thomas Lumley of the University of Washington, Lumley (2007), and he has provided a number of useful references for his package.

The package has its own web site, http://faculty.washington.edu/tlumley/survey/ where you can find many helpful documents on using the package. In particular you can find worked examples from a number of standard textbooks on survey sampling.
In addition manuals and vignettes for the survey package can be found in the doc folder of the library folder of the folder that contains R. A vignette is an expository document that illustrates a particular feature of a package.
You can also find the standard documentation for all the survey package functions at the R web site. The URL for the pdf file that documents the syntax and use of all the functions in the package as well as vignettes is http://cran.r-project.org/web/packages/survey/index.html.

To illustrate its capabilities I use it to analyze data collected with a one-stage cluster sample. The data file is Surinam.csv, a comma-delimited file with the variable names in the first row of the file. The data are a complete inventory of all the trees in a 60 ha section of forest in Surinam. It is taken from Schreuder et al. (2004) who use it to illustrate applications of survey sampling theory to forestry.

trees <- read.csv('ecol 562/Surinam.csv')

names(trees)

[1] "Diameter_cm" "Longitude"   "Latitude" "Height_m" "Volume_cum"
[6] "SubPlot"     "x2"          "x3"       "DBHClass" "Diameter_in"
[11] "Height_ft"   "Volume_cuft" "CC"

The variable SubPlot records the subplot in which a tree occurred. There are 60 subplots identified by a combination of letter and number.

unique(trees$SubPlot)

[1] D1 A1 B1 C1 D2 A2 C2 B2 C3 D3 A3 B3 C4 A4 B4 D4 D5 A5 B5
[20] C5 D6 B6 C6 A6 D7 C7 A7 B7 C8 D8 B8 A8 D9 A9 B9 C9 A10 D10
[39] B10 C10 C11 A11 D11 B11 B12 C12 D12 A12 B13 C13 A13 D13 D14 C14 B14 A14 D15
[58] C15 A15 B15
60 Levels: A1 A10 A11 A12 A13 A14 A15 A2 A3 A4 A5 A6 A7 A8 A9 B1 B10 ... D9

SubPlot is a factor variable in R, as indicated by the comment at the end of the unique output above "60 levels: A1 A10" etc. A factor level has both a name (the name of the subplot in this case) and a number (a number from 1 to 60). We can take advantage of the numeric values to color the subplots geographically. In the plot command below I use the numeric level of the factor levels as the value of the col= argument. Because there are only 8 numeric color levels in R, the color levels are recycled and repeated for the 60 subplot values. As Fig. 4 shows the subplots are quadrats that are geographically distinct.

plot(Latitude~Longitude, data=trees, cex=.5, col=as.numeric(SubPlot))

Fig. 4 The geographic distribution of SubPlots in the population

In a one-stage cluster sample we obtain a random sample of clusters and then select all the elements in that cluster. I begin by obtaining a sorted list of the unique cluster names.

plot.list <- sort(unique(trees$SubPlot))

num.plots <- length(plot.list)

num.plots

[1] 60

Because comparisons between cluster means form the basis for the variance calculation in a one-stage cluster sample, the smallest sample size we can take that is still useful is a cluster sample with two clusters. I use the sample function to select three clusters from the list of unique clusters.

set.seed(20)

my.plot <- plot.list[sample(1:num.plots,3)]

my.plot

[1] D2 D1 B10
60 Levels: A1 A10 A11 A12 A13 A14 A15 A2 A3 A4 A5 A6 A7 A8 A9 B1 B10 B11 B12 B13 B14 B15 ... D9

Next we need to select the elements from the population of trees that belong to clusters D2, D1, and B10. This is easily done with the Boolean intersection operator, %in%. I first illustrate its use.

AB <- LETTERS[1:2]

[1] "A" "B"

test <- c("A","C","a","B","D")

test %in% AB

[1] TRUE FALSE FALSE TRUE FALSE

So x%in%y tests whether each element in x is among the elements in y. I next obtain the one-stage cluster sample.

cluster.sample <- trees[trees$SubPlot %in% my.plot,]

plot(Latitude~Longitude, data=trees, col=c('grey70',2)[as.numeric(SubPlot %in% my.plot)+1])

Fig. 5 Location of the cluster sample

To use the survey package we first have to define the nature of the sample using the svydesign function. For the simple sample schemes we've discussed the following four arguments are all that we need.

id: This identifies the primary sampling unit (and the secondary units if they exist).
strata: Use this option if there are strata you wish to include as part of the analysis.
fpc: This specifies the finite population correction factor, the FPC. You can either enter the FPC yourself or enter the population size on which the FPC should be based. Note: The arguments for fpc should refer to variable(s) in a data frame. If the variables do not exist you can create them on the fly when specifying the argument. This argument is optional.
data: This specifies the data frame.

For id, strata, and fpc if what comes next is a variable occurring in the specified data frame then the first character in the argument must be a ~. For more complicated designs there are additional arguments that are useful. The probs argument can be used to specify the individual probabilities of selection for each element of the sample. Thus we can use the survey package to analyze any kind of sample as long as the probabilities of selection are known. The weights argument can be used instead of the probs argument (the weights are the reciprocal of the selection probabilities). In addition the weights argument can be used to handle missing data, nonresponses, and many of the other biases that can creep into sample designs.

To construct the survey object for a one-stage cluster sample we need to specify the id, fpc, and data arguments.

id=~SubPlot: The primary sampling units are the clusters, not the individual trees. Thus the id argument is used to reflect this.
fpc=rep(60, nrow(cluster.sample): The FPC is now calculated in terms of the sampling fraction of clusters, i.e., how many clusters were sampled relative to the number of clusters in the population. Since there are 60 clusters in the population, I enter this value for fpc and let R calculate the FPC from it.

I create the svydesign object and calculate the sample mean and standard error of the Diameter_cm variable. I then construct a 95% confidence interval for the mean using the confint function.

cluster.sample$fpc <- rep(60, nrow(cluster.sample))

cluster.obj <- svydesign(id=~SubPlot, fpc=~fpc, data=cluster.sample)

#get means and confidence intervals

cluster.out <- svymean(~Diameter_cm, cluster.obj)

cluster.out

mean SE
Diameter_cm 41.043 1.5667

confint(cluster.out)

2.5 % 97.5 %
Diameter_cm 37.97257 44.11389

We can also fit regression models that account for the manner in which the data are collected using the svyglm function of the survey package. Essentially svyglm uses glm to fit the model and then adjusts the standard errors of the parameter estimates based on the sample design that was used. When the sample design is a cluster sample, the svyglm function is an alternative to fitting a mixed effects model or to specifying a correlation structure using generalized least squares or generalized estimating equations.

out.svy <- svyglm(Volume_cum~Diameter_cm, design=cluster.obj)

summary(out.svy)

Call:
svyglm(Volume_cum ~ Diameter_cm, design = cluster.obj)

Survey design:
svydesign(id = ~SubPlot, fpc = ~fpc, data = cluster.sample)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.50414    1.15804 -3.889     0.16
Diameter_cm 0.15329    0.02924   5.243     0.12

(Dispersion parameter for gaussian family taken to be 2.470582)

Number of Fisher Scoring iterations: 2

We can compare the above results to what we get by fitting a random intercepts model. With only three clusters, the mixed effects model will not be very accurate here.

library(nlme)

out.lme <- lme(Volume_cum~Diameter_cm, random=~1|SubPlot, data=cluster.sample, method="ML")

summary(out.lme)

Linear mixed-effects model fit by maximum likelihood
Data: cluster.sample
AIC BIC logLik
1305.587 1320.985 -648.7936

Random effects:
Formula: ~1 | SubPlot
(Intercept) Residual
StdDev: 4.665527e-05 1.569542

Fixed effects: Volume_cum ~ Diameter_cm
                Value Std.Error DF   t-value p-value
(Intercept) -4.504145 0.21536764 343 -20.91375       0
Diameter_cm 0.153294 0.00482657 343 31.76040       0
Correlation:
            (Intr)
Diameter_cm -0.92

Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.03580046 -0.44650216 0.01902902 0.43778939 10.52751911

Number of Observations: 347
Number of Groups: 3

The point estimates are the same but the standard errors are six times larger using the survey package versus using the mixed effects model. Given that we know that the variance components will be difficult to estimate with so few clusters, the more conservative results from the cluster sample analysis are probably a better choice.

Using survey sampling methods for statistical modeling

There is a direct relationship between basic survey sample sampling designs and some of the standard statistical designs we've considered in this course. For instance,

When strata are used to control for heterogeneity they are serving the same role as blocking in experimental designs.
Cluster samples are examples of hierarchical data. Thus survey sample methods provide yet another way to deal with correlated (clustered) data.

In a survey sample the focus is on the selection probabilities of the individual sample units. In hierarchical designs the probability of selecting any given unit from the underlying population can at least in principle be specified. Using these probabilities expectations and variances can be fully specified, although in practice the variances may need to be approximated. Survey sampling software, such as the survey package of R, can be used to carry out regressions in which the sample design is taken into account when calculating the standard errors of regression parameters.

We have covered four different methods for dealing with correlated and clustered (hierarchical) data in this course.

Survey sample methods in which the focus is on the selection probabilities of individual elements.
Mixed effects models in which observational heterogeneity is modeled through the use of random effects.
Generalized least squares for Gaussian data in which variance heterogeneity and correlation are accounted for by specifying a variance-covariance matrix for the residuals.
Marginal models based on generalized estimating equations in which variance heterogeneity and correlation is incorporated directly into the estimating equation derived from the model's log-likelihood or quasi-likelihood.

Cited references

Lumley, Thomas. 2007. Survey Analysis in R. http://faculty.washington.edu/tlumley/survey/
Schreuder, Hans T., Ernst, Richard, and Ramirez-Maldonado, Hugo. 2004. Statistical techniques for sampling and monitoring natural resources. Gen. Tech. Rep. RMRS-GTR-126. Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. 111 p. Publication: http://www.fs.fed.us/rm/pubs/rmrs_gtr126.pdf
Accompanying data sets and R code: http://www.fs.fed.us/rm/pubs/rmrs_gtr126_data_sets.zip

R code used to produce figures

#Fig. 1

par(mar=c(2.1,4.1,2,1))

mydat <- data.frame(rep(1:6,6), rep(6:1,rep(6,6)))

colnames(mydat) <- c('x','y')

clusts <- c(rep(rep(1:3, rep(2,3)),2), rep(rep(4:6,rep(2,3)),2), rep(rep(7:9, rep(2,3)),2))

clusts2 <- rep(c(rep(1:2,3), rep(3:4,3)),3)

mysub <- mydat[clusts%in%sample(1:9,3, replace=FALSE),]

plot(mysub$y~mysub$x, pch=16, cex=2, ylim=c(0.5,6.5), xlim=c(0.5,6.5), col='pink', axes=FALSE, xlab='',ylab='')

axis(1, at=1:6, label=rep(' ',6))

axis(2, at=6:1, label=seq(1,31,6), las=1, cex.axis=.8)

box()

points(mysub$x, mysub$y, pch=1, cex=2, col=2)

points(mydat$x, mydat$y, pch=4, cex=.8)

abline(h=c(2.5,4.5), lty=3, lwd=2)

abline(v=c(2.5,4.5), lty=3, lwd=2)

mtext(side=3, line=.5, 'One-Stage Cluster Sample')

#Fig. 2

level2 <- sample(1:9,6)

level1 <- sapply(1:6, function(x) sample(1:4,2))

plot(mydat$y~mydat$x, pch=16, cex=2, ylim=c(0.5,6.5), xlim=c(0.5,6.5), col='pink', axes=FALSE, xlab='', ylab='', type='n')

mydat1 <- data.frame(mydat,clusts,clusts2)

for(i in 1:6) {

mysub <- mydat1[mydat1$clusts==level2[i],]

mysub2 <- mysub[mysub$clusts2%in%level1[,i],]

points(mysub2$x, mysub2$y, col='pink', cex=2 ,pch=16)

points(mysub2$x, mysub2$y, col=2, cex=2, pch=1)

}

axis(1, at=1:6, label=rep(' ',6))

axis(2, at=6:1, label=seq(1,31,6), las=1, cex.axis=.8)

box()

points(mydat$x, mydat$y, pch=4, cex=.8)

abline(h=c(2.5,4.5), lty=3, lwd=2)

abline(v=c(2.5,4.5), lty=3, lwd=2)

mtext(side=3, line=.5, 'Two-Stage Cluster Sample')

#Fig. 3

par(mar=c(2.1,4.1,2,1))

mydat <- data.frame(rep(1:6,6), rep(6:1,rep(6,6)))

colnames(mydat) <- c('x','y')

mysub <- mydat[seq(sample(1:3,1),36,3),]

plot(mysub$y~mysub$x, pch=16, cex=2, ylim=c(0.5,6.5), xlim=c(0.5,6.5), col='pink', axes=FALSE, xlab='', ylab='')

axis(1, at=1:6, label=rep(' ',6))

axis(2, at=6:1, label=seq(1,31,6), las=1, cex.axis=.8)

box()

points(mysub$x, mysub$y, pch=1, cex=2, col=2)

points(mydat$x, mydat$y, pch=4, cex=.8)

mtext(side=3, line=.5, 'Systematic Random Sample')

Course Home Page

Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--April 27, 2012
URL: https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/docs/lectures/lecture42.htm