Lecture 41—Monday, April 23, 2012

Outline of lecture

Overview of sampling

In standard statistical formulas it is assumed that observations were selected with equal probabilities. Furthermore this assumption of equal probabilities extends to all pairs of observations, all triplets of observations, etc. In short it is assumed that observations were obtained via simple random sampling with replacement (SRSWR). If target populations are large then simple random sampling without replacement (SRSWOR), the actual approach, is a reasonable approximation to SRSWR.

In environmental studies the assumption of equal selection probabilities is almost never realized. Observations typically come in groups.

  1. In surveys of popular opinion the survey instrument may be sent to households and then all adults in the household are asked to fill out the survey. Thus households are the primary sampling unit but the subject of interest is the opinion of the individual (not the household). When a household is selected we automatically get all individuals in that household. Thus certain sample combinations are immediately ruled out and therefore not all pairs of observations have an equal probability of selection. For instance, if there are M households in the sampling frame and we sample two of them (with equal probability and without replacement), then the probability that two particular individuals from two different households are both selected is $1/\binom{M}{2} = 2/[M(M-1)]$, while the probability that two particular individuals from the same household are both selected is $2/M$.

Fig. 1  A lake monitoring study with time nested in lakes

  2. In a monitoring study lakes are selected and then these same lakes are measured repeatedly over time. Thus we may have a random sample of lakes but we don't have a random sample of lake-times. If a lake is selected to be monitored then we typically obtain measurements for that lake at all times. In Fig. 1 lakes 3 and 4 were randomly chosen. Notice it's not possible (except by error) to obtain a sample of size 6 in which we obtain a single time from lake 1, two times from lake 2, and three times from lake 3, the sort of situation that would be possible if lake-times were randomly chosen. The "nesting" of data as seen in Fig. 1 is characteristic of sample designs with unequal selection probabilities.
  3. In environmental sampling quadrats are often used to obtain samples distributed randomly over an area of interest. If the purpose is not to characterize the quadrats themselves but instead to characterize the individuals in a quadrat (individual plants, animals, species, etc.) then once again we don't have equal selection probabilities of individuals. The probability that two selected individuals came from different quadrats is different from the probability that two selected individuals came from the same quadrat.
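The household example in the first item can be checked by brute force. The sketch below assumes a hypothetical frame of M = 5 households from which two are drawn with equal probability and without replacement. The household labels and the individuals a, b, and c are illustrative only.

```r
# Hypothetical sampling frame: M = 5 households (assumed for illustration)
M <- 5
all_samples <- combn(M, 2)   # each column is one possible sample of 2 households

# Individuals a and b live in household 1; individual c lives in household 2.
# P(a and b both selected) = P(household 1 is in the sample)
p_same <- mean(apply(all_samples, 2, function(s) 1 %in% s))

# P(a and c both selected) = P(households 1 and 2 are both in the sample)
p_diff <- mean(apply(all_samples, 2, function(s) all(c(1, 2) %in% s)))

p_same   # 0.4, which equals 2/M
p_diff   # 0.1, which equals 2/(M*(M-1))
```

Because the two joint inclusion probabilities differ, pairs of individuals are not equally likely to appear together in the sample, even though every household has the same chance of selection.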

In sampling theory a distinction is made between the population and the sample. The population is the entity P to which you want your conclusions to apply. The sample is a subset S of P. It is the set of units for which you have data.

Statistical analysis is only appropriate when we have a probability sample. A probability sample is one in which each element of the population has a known probability of being included in the sample. Typically we start with a list of all the elements in the population. This list is called the sampling frame. In a simple random sample (SRS) each element in the sampling frame has an equal probability of being included in the sample. It is a specific kind of probability sample.

A general probability sample can differ from a simple random sample in two important ways.

  1. The inclusion probabilities of different elements from the population may be unequal. These unequal probabilities can be for single elements, pairs of elements, triplets of elements, etc.
  2. The sampling unit can be different from the population unit of interest. In some of the examples given above the interest was in the individual but the primary sampling unit was a higher-level entity that contained one or more individuals.

Simple random sampling is sometimes called unrestricted random sampling. All other samples are referred to as restricted random samples. Deviations from simple random sampling are often unavoidable.

  1. SRS may be impossible. Consider trying to do a simple random sample of protozoa in a pond. Because of scale issues the best we can do is to take a SRS of aliquots of pond water, even though the unit of interest is the individual protozoan.
  2. SRS may be impractical. Taking a SRS of trees in a large forest could be a logistical nightmare if selected trees are far apart in an area without roads.
  3. SRS may offer few benefits. Certain restricted sampling designs can do nearly as well as SRS at far less cost.

Sampling, the frequentist perspective

Fig. 2  Sampling from two populations

In the frequentist approach to statistics all focus is on the sample. Suppose we draw a sample from population 1 and calculate the sample mean $\bar{x}_1$, which we take as an estimate of the population mean $\mu_1$.

Similarly suppose we draw a sample from population 2 and calculate the sample mean $\bar{x}_2$, which we take as an estimate of the population mean $\mu_2$. We observe for our two samples that $\bar{x}_1 > \bar{x}_2$. Are we therefore justified in concluding that $\mu_1 > \mu_2$?

Because we only have the sample at hand on which to base our conclusions, we should hesitate in making this logical leap. If we were to go back and obtain a second sample from each population we would not expect to obtain exactly the same sample means again. Even the direction of the inequality might change the second time around. The way to address this is to quantify the expected variability of means of samples drawn repeatedly from a population. If we knew how much sample means are likely to vary in repeated sampling we could then assess whether it is likely that the direction of the inequality would change in a subsequent sample.

The point of this is just to remind you that sample means are random variables. As such, they have a distribution, called a sampling distribution. In the frequentist perspective, the precision of a sample estimate depends not on the one sample you have but on all the possible samples you might have obtained. When you engage in restricted random sampling the distribution of possible samples you might obtain is different from what it would be for unrestricted random sampling. As a result the estimate of the precision of the sample estimate changes under these different sampling regimes. Thus we need to account for the sampling design in order to draw correct inferences from samples.
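A small simulation can make the idea of a sampling distribution concrete. The sketch below assumes two hypothetical populations whose means differ by 0.5. The population values, the sample size, and the number of repetitions are all made up for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Two hypothetical finite populations (values assumed for illustration)
pop1 <- rnorm(10000, mean = 10.0, sd = 2)
pop2 <- rnorm(10000, mean = 9.5, sd = 2)

n <- 20      # size of each sample
B <- 5000    # number of repeated samples

# Approximate the sampling distribution of each sample mean
means1 <- replicate(B, mean(sample(pop1, n)))
means2 <- replicate(B, mean(sample(pop2, n)))

# Spread of the simulated sample means versus sigma / sqrt(n)
c(simulated = sd(means1), theory = sd(pop1) / sqrt(n))

# Fraction of repeated sample pairs in which the inequality reverses
prop_reversed <- mean(means1 < means2)
prop_reversed
```

Even though population 1 has the larger mean, a noticeable fraction of sample pairs show the opposite ordering, which is why a single observed inequality is not enough on its own.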

Sampling from finite populations

In elementary statistics you learned that the variance of the sample mean is given by $\mathrm{Var}(\bar{x}) = \sigma^2/n$ and is estimated by $s^2/n$. Here $\sigma^2$ is the population variance, $s^2$ is the variance of the sample, and $n$ is the sample size. This is a very cool formula. It tells us that in order to understand how the sample mean varies under repeated sampling we need only look at quantities that we can measure from a single sample. Coupling this result with the central limit theorem allows us to make probabilistic statements about whether $\mu_1$ is likely to be greater than $\mu_2$.

If our sample sizes are large (or they're small but the underlying populations we're sampling from are normally distributed with known variances), then a confidence interval can be constructed for the mean difference using the standard normal distribution and the variances of the means. If this confidence interval does not contain zero and is strictly positive we can have some confidence in concluding that $\mu_1 > \mu_2$.

If our sample sizes are small and the underlying population is not normally distributed, but the variance is known, or the sample sizes are small and we use the sample variance as an estimate of the population variance of an underlying normal population, then we should use a t-distribution instead of a normal distribution to construct the confidence interval.
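The two interval constructions can be compared side by side. This is a minimal sketch using made-up measurements. The vectors x1 and x2 are hypothetical, and the t-based interval relies on R's t.test function, which uses the Welch approximation by default.

```r
# Hypothetical measurements from two samples (made-up values)
x1 <- c(12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5)
x2 <- c(9.6, 10.2, 8.9, 11.0, 9.1, 10.4, 8.7, 9.9)

d  <- mean(x1) - mean(x2)                                # estimated mean difference
se <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))  # SE of the difference

# Normal-based 95% interval (the large-sample construction)
ci_z <- d + c(-1, 1) * qnorm(0.975) * se

# t-based 95% interval for mean(x1) - mean(x2)
ci_t <- t.test(x1, x2)$conf.int

ci_z
ci_t
```

Both intervals are centered on the observed difference. The t-based interval is wider because the t quantile exceeds the normal quantile, reflecting the extra uncertainty from estimating the variances.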

In truth the formula for the variance of the sample mean was derived assuming simple random sampling with replacement (SRSWR). When a sample is made with replacement then the element selected on the first draw has no influence on which element is selected on the second draw. The probabilities do not change. If our sample size is n and the population size is N, then the probability of drawing any single element is $1/N$ and the probability of drawing any two elements in succession is $\frac{1}{N} \cdot \frac{1}{N} = \frac{1}{N^2}$.

Typically though we don't sample with replacement. Because populations are finite the probabilities change on each draw as the pool of available elements changes. When sampling without replacement the probability of drawing any two elements in succession is $\frac{1}{N} \cdot \frac{1}{N-1}$, accounting for order. When we sample without replacement from a finite population a correlation is induced between the elements in our sample and this in turn requires a modification of the formula for the variance of the sample mean.

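For a simple random sample without replacement (SRSWOR), the variance of the sample mean is given by

$$\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$$

and is estimated with

$$\widehat{\mathrm{Var}}(\bar{x}) = \frac{s^2}{n} \cdot \frac{N-n}{N} = \frac{s^2}{n}\left(1 - \frac{n}{N}\right)$$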

The additional factor that appears in each of these expressions is called the finite population correction factor (FPC). Since we're usually dealing with a sample estimate we'll generally use FPC to refer to the second usage.

The ratio of the sample size to the population size that appears in the last expression is called the sampling fraction and is usually denoted by $f = n/N$.

Observe that if N, the population size, is very large, then the FPC is approximately equal to 1 and can be ignored. Consequently one can think of sampling with replacement as being roughly equivalent to sampling from an infinite or, more realistically, from a very large population. In words, the FPC tells us how much extra precision we have achieved when the sample size comes close to the population size. It is to our advantage to include the FPC in our calculations because its presence decreases the variance of our estimates (making them appear to be more precise).
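The effect of the finite population correction can be verified by enumerating every possible sample from a tiny population. The sketch below assumes a hypothetical population of six values and uses the standard textbook expressions, which are written out in the comments. Here sigma2 is the population variance computed with divisor N.

```r
pop <- c(2, 5, 7, 8, 11, 15)   # hypothetical population values
N <- length(pop)               # population size
n <- 3                         # sample size

# Every possible SRSWOR of size n (each column is one sample)
samples <- combn(pop, n)
xbar <- colMeans(samples)

# True variance of the sample mean over all choose(N, n) samples
true_var <- mean((xbar - mean(xbar))^2)

# Formula: sigma^2 / n * (N - n) / (N - 1), with sigma^2 using divisor N
sigma2 <- mean((pop - mean(pop))^2)
formula_var <- sigma2 / n * (N - n) / (N - 1)

# Sample-based estimate s^2 / n * (1 - n/N), computed for every sample
est_var <- apply(samples, 2, var) / n * (1 - n / N)

c(true = true_var, formula = formula_var, mean_estimate = mean(est_var))
```

All three numbers agree: the formula reproduces the true variance exactly, and the sample-based estimate is correct on average. Dropping the factor (1 - n/N) would overstate the variance, in this case by a factor of two.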

Examples of different sampling schemes

Fig. 3 is a schematic diagram of a population consisting of 36 units. We'll use this population to illustrate various sampling schemes. If it helps you can think of the diagram as representing the spatial locations of individuals in a population. The numbers on the left margin will be used to identify the individual elements in the population. We start numbering the elements from 1 in the top left corner to 36 in the bottom right corner. These numbers comprise the sampling frame for the population.

In all of our examples we will draw a sample of size 12 from this population. Thus in terms of the notation developed previously: $N = 36$ and $n = 12$.

Fig. 3  The population of potential sample units
Fig. 4  A simple random sample of size 12

Simple random sample (without replacement)

There are a total of $\binom{36}{12} = 1{,}251{,}677{,}700$ possible simple random samples of size 12 that can be drawn from this population.

choose(36,12)
[1] 1251677700

To produce one of these samples in R we can use the sample function. We will supply two arguments and, optionally, a third. (Strictly speaking only x is required: if size is omitted, sample returns a random permutation of x. There is also a fourth optional argument, prob, that is of no interest to us at the moment.) In default order the arguments are the following.

  1. x: a vector of elements from which to take the sample.
  2. size: the size of the desired sample.
  3. replace: whether to sample with (TRUE) or without (FALSE) replacement. The default is replace=FALSE, which is the option we will use.

By default the sample function uses the internal clock value to initialize the random number stream that it uses to obtain a random sample from x. The initialization can be manually set with the set.seed function.

To obtain a random sample without replacement of size 12 from the numbers 1 through 36, use the following command.

sample(1:36, 12, replace=FALSE)

Fig. 4 illustrates the sample that was obtained.
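To make a draw reproducible, set the seed immediately before sampling. In the sketch below the seed value 2012 is arbitrary, and the conversion from element number to grid position assumes the 6 x 6 layout numbered left to right and top to bottom, as described for the schematic population.

```r
set.seed(2012)   # arbitrary seed, chosen only for illustration
s1 <- sample(1:36, 12, replace = FALSE)

set.seed(2012)   # resetting the seed reproduces the same draw
s2 <- sample(1:36, 12, replace = FALSE)
identical(s1, s2)   # TRUE

# Translate element numbers into grid positions (row 1 is the top row)
grid_row <- (s1 - 1) %/% 6 + 1
grid_col <- (s1 - 1) %% 6 + 1
data.frame(element = s1, row = grid_row, col = grid_col)
```

Recording the seed alongside the analysis lets someone else regenerate exactly the same sample later.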

Stratified random sample

Fig. 5   A stratified random sample using proportional allocation

The first example of a restricted random sample that we will consider is a stratified random sample. In a stratified random sample the population is assumed to consist of multiple subpopulations, called strata, and we take random samples separately from each. Assume for our schematic population that elements 1–24 make up stratum 1 and elements 25–36 make up stratum 2. We will take our random sample of size 12 in a way that guarantees that a portion of the sample comes from each stratum.

There are a number of methods for deciding how many elements to select from the different strata.

  1. Equal allocation: we take equal numbers of sample elements from each stratum.
  2. Proportional allocation: we choose elements in such a way that the proportion of elements that make up the various strata in our sample is identical to the proportion of elements that make up the various strata in the population.
  3. Neyman allocation: choose elements based on the sizes of the population strata and their variances.
  4. Optimal allocation: choose elements based on the sizes of the population strata, the variances of the population strata, and the costs of obtaining observations from each stratum.

In truth any allocation scheme is OK if it can be objectively justified and the selection schemes are recorded.
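The four allocation rules listed above can be compared numerically for the two strata of the schematic population (24 and 12 elements). This is a sketch using the usual textbook formulas for Neyman and optimal allocation. The stratum standard deviations S_h and per-unit costs cost_h are hypothetical values chosen so that the results come out as whole numbers.

```r
N_h <- c(24, 12)   # stratum sizes from the schematic population
n   <- 12          # total sample size

# Hypothetical stratum standard deviations and per-unit costs (assumed values)
S_h    <- c(1, 4)
cost_h <- c(1, 4)

equal        <- rep(n / length(N_h), length(N_h))
proportional <- n * N_h / sum(N_h)
neyman       <- n * N_h * S_h / sum(N_h * S_h)
optimal      <- n * (N_h * S_h / sqrt(cost_h)) / sum(N_h * S_h / sqrt(cost_h))

rbind(equal, proportional, neyman, optimal)
# equal:        6 6
# proportional: 8 4
# neyman:       4 8
# optimal:      6 6
```

With these assumed values Neyman allocation shifts effort toward the smaller but more variable stratum 2, while the higher cost of sampling stratum 2 pulls the optimal allocation back toward stratum 1. In general the computed sizes are not whole numbers and need rounding.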

I illustrate proportional allocation using our schematic population. Since the strata are in the proportions two thirds (stratum 1) to one third (stratum 2) in the population, our sample of size 12 should exhibit the same proportions. Thus we should choose 8 elements from stratum 1 and 4 elements from stratum 2. There are $\binom{24}{8}\binom{12}{4} = 364{,}058{,}145$ different proportional allocation samples that can be drawn from this population.

choose(24,8)*choose(12,4)
[1] 364058145

We can obtain a list of elements for our stratified sample by using the R sample function twice.

  1. Stratum 1:  sample(1:24, 8, replace=FALSE)
  2. Stratum 2:  sample(25:36, 4, replace=FALSE)

One such stratified random sample is shown in Fig. 5.
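The two calls to sample can be wrapped into a single step that works for any number of strata. In this sketch the vector names stratum and n_h and the seed are hypothetical choices made for illustration.

```r
stratum <- rep(c(1, 2), times = c(24, 12))  # stratum label for elements 1..36
n_h <- c(8, 4)                              # proportional allocation

set.seed(562)  # arbitrary seed for reproducibility
strat_sample <- unlist(lapply(seq_along(n_h), function(h) {
  ids <- which(stratum == h)              # sampling frame for stratum h
  ids[sample.int(length(ids), n_h[h])]    # SRSWOR within the stratum
}))

strat_sample
table(stratum[strat_sample])   # 8 elements from stratum 1, 4 from stratum 2
```

Indexing with sample.int rather than calling sample on the element numbers directly avoids a surprise when a stratum contains a single element, because sample(x) with a single number x samples from 1:x.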

R code used to produce figures

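#Fig. 1: time nested in lakes, with lakes 3 and 4 selected
par(mar=c(5.1,4.1,2,1))
# one row per lake-time: 5 lakes (y) crossed with 3 times (x)
mydat <- data.frame(rep(1:5,3), rep(3:1, rep(5,3)))
colnames(mydat) <- c('y','x')
# set up an empty plot region, then add custom axes
plot(mydat$y~mydat$x, type='n', ylim=c(0.5,5.5), xlim=c(0.5,3.5), axes=FALSE, xlab='Time' , ylab='Lake')
axis(1, cex.axis=.9, at=1:3)
axis(2, las=1, cex.axis=.9)
box()
# open squares mark every lake-time; red crosses mark the selected lakes
points(mydat$x, mydat$y, pch=22, cex=2)
points(mydat$x[mydat$y%in%c(3,4)], mydat$y[mydat$y%in%c(3,4)], pch=4, cex=1.7, col=2)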
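#Fig. 3: the full population of 36 units on a 6 x 6 grid
par(mar=c(2.1,4.1,2,1))
# x = column, y = row; y=6 is drawn at the top so element 1 is top left
mydat <- data.frame(rep(1:6,6), rep(6:1,rep(6,6)))
colnames(mydat) <- c('x','y')
plot(mydat$y~mydat$x, type='n', ylim=c(0.5,6.5), xlim=c(0.5,6.5), axes=FALSE, xlab='', ylab='')
axis(1, at=1:6, labels=rep(' ',6))
# label each row with the number of its first element (1, 7, ..., 31)
axis(2, at=6:1, labels=seq(1,31,6), las=1, cex.axis=.8)
box()
points(mydat$x, mydat$y, pch=4, cex=.8)
mtext(side=3, line=.5, 'Population')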
#Fig. 4: a simple random sample of 12 of the 36 units
par(mar=c(2.1,4.1,2,1))
mydat <- data.frame(rep(1:6,6),rep(6:1,rep(6,6)))
colnames(mydat) <- c('x','y')
# rows of mydat are in element order, so sampled row numbers are element numbers
mysub <- mydat[sample(1:36, 12, replace=FALSE),]
plot(mysub$y~mysub$x, pch=16, cex=2, ylim=c(0.5,6.5), xlim=c(0.5,6.5), col='pink', axes=FALSE, xlab='', ylab='')
axis(1, at=1:6, labels=rep(' ',6))
axis(2, at=6:1, labels=seq(1,31,6), las=1, cex.axis=.8)
box()
# outline the sampled units, then overlay the whole population
points(mysub$x, mysub$y, pch=1, cex=2, col=2)
points(mydat$x, mydat$y, pch=4, cex=.8)
mtext(side=3, line=.5, 'Simple Random Sample')
#Fig. 5: a stratified random sample with proportional allocation
par(mar=c(2.1,4.1,2,4.5))
mydat <- data.frame(rep(1:6,6),rep(6:1,rep(6,6)))
colnames(mydat) <- c('x','y')
# 8 units from stratum 1 (elements 1-24) and 4 from stratum 2 (elements 25-36)
mysub <- mydat[c(sample(1:24, 8, replace=FALSE), sample(25:36, 4, replace=FALSE)),]
plot(mysub$y~mysub$x, pch=16, cex=2, ylim=c(0.5,6.5), xlim=c(0.5,6.5), col='pink', axes=FALSE, xlab=' ', ylab=' ')
axis(1, at=1:6, labels=rep(' ',6))
axis(2, at=6:1, labels=seq(1,31,6), las=1, cex.axis=.8)
box()
# dotted line separates stratum 1 (top four rows) from stratum 2 (bottom two rows)
abline(h=2.5, lty=3, lwd=2)
points(mysub$x, mysub$y, pch=1, cex=2, col=2)
points(mydat$x, mydat$y, pch=4, cex=.8)
mtext(side=4, line=.5, las=1, at=c(1.4,4.5), text=c("Stratum\n     2", "Stratum\n     1"), cex=.85, font=2)
mtext(side=3, line=.5, 'Stratified Random Sample')



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--April 27, 2012
URL: https://sakai.unc.edu/access/content/group/2842013b-58f5-4453-aa8d-3e01bacbfc3d/public/Ecol562_Spring2012/docs/lectures/lecture41.htm