Lecture 23—Monday, November 12, 2012

Topics

Bayes rule; Bayesian versus frequentist approaches to parameter estimation; conjugate priors; Markov chain Monte Carlo (the Metropolis algorithm, the Gibbs sampler, rejection sampling); fitting a regression model with WinBUGS and JAGS

R functions and commands demonstrated

read.table, lm, nlm, qnorm, setwd, getwd, sapply, bugs (arm/R2WinBUGS), jags (R2jags)

R function options

hessian (nlm); n.chains, n.iter, n.burnin, n.thin, debug, bugs.directory (bugs and jags)

R packages used

arm (R2WinBUGS), R2jags

A brief history of statistics

Bayesian inference is as old as statistics itself. The frequentist approach on the other hand is a more recent innovation and derives from the uncomfortable marriage of methodologies developed by the Neyman-Pearson-Wald school (hypothesis testing) and the Fisherian school (everything else). While there are fundamental differences in the Bayesian and frequentist approaches that are difficult to reconcile, most statisticians treat them as just two options on a menu and are willing to use the tools they offer without necessarily embracing the accompanying baggage.

The history of Bayesian statistics began in 1763 with a paper written by Thomas Bayes in which he outlined a formula that has come to be called Bayes rule. In the early 1900s the frequentists (Fisher, Neyman, and Pearson) rejected the Bayesian approach, after which it temporarily slipped into obscurity and all focus shifted to frequentist methods. Bayesian methods experienced a revival in the early 1990s, so much so that Bayesian approaches are now the hottest area of modern statistics.

While there has been considerable rancor in the literature (particularly among scientists) with regard to these two approaches, I think it is fair to say that Bayesian inference has always been recognized as being the more legitimate approach to doing statistics. Because Bayesian inference was difficult to carry out except in a few fairly artificial cases, the frequentist approach has held sway by default. Except for a few universities that have built their reputation on being Bayesian hotbeds (Duke, for example), most statistics departments have tended to include only token representatives of the Bayesian viewpoint among their faculty.

This situation dramatically changed about 20 years ago. Using methodologies originally developed by physicists, Bayesian inference is now easy to do. Furthermore Bayesian methods are able to solve problems that are currently impossible for the frequentist approach to solve. This has generated a great deal of interest in Bayesian statistics among scientists in general. I introduce Bayesian statistics here not as an alternative to frequentist statistics but as a collection of tools that can be useful in many instances.

Bayes rule

On first pass, Bayes rule would seem to be just a trivial extension of the concept of conditional probability. I begin by reviewing the definition of conditional probability. Let Fig. 1 represent a sample space S consisting of 12 equally likely sample points. The events A and B are the subsets of sample points shown in the figure. Under the equally likely assumption we have the following.

Fig. 1  The sample space S with events
A and B (R code)

Conditional probabilities effectively change the sample space. For instance, P(A|B) restricts the sample space to B and is calculated under the equally likely scenario by counting up the number of sample points that are both in A and B and dividing by the number of sample points in B. Formally we have

P(A|B) = P(A ∩ B) / P(B)

With this background, Bayes rule can be derived as follows. Let A and B be any two events. Using the definition of conditional probability we can write the following.

P(A|B) = P(A ∩ B) / P(B)  and  P(B|A) = P(A ∩ B) / P(A)

Having solved for the joint probability in each expression we can set the results equal to each other.

P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A),  which yields  P(B|A) = P(A|B) P(B) / P(A)

This latter formula is essentially Bayes rule but not as it is usually written. Imagine a set of events B1, B2, … , Bn that are pairwise disjoint and partition the sample space S such that B1 ∪ B2 ∪ ⋯ ∪ Bn = S. Then we can write P(A), the denominator in the above expression, as follows.

P(A) = P(A ∩ B1) + P(A ∩ B2) + ⋯ + P(A ∩ Bn) = Σi P(A|Bi) P(Bi)

Substituting this expression for P(A) in the denominator of the formula for P(B|A) given above yields Bayes rule.

P(Bj|A) = P(A|Bj) P(Bj) / Σi P(A|Bi) P(Bi)

If B is continuous, the sum is replaced by an integral.

A simple application of Bayes rule (not covered in lecture)

The following Bayes rule problem is typical of the kind that appears in textbooks on elementary probability theory and biostatistics.

Suppose a certain drug test is 99% sensitive and 99% specific, that is, the test will correctly identify a drug user as testing positive 99% of the time, and will correctly identify a non-user as testing negative 99% of the time. This would seem to be a relatively accurate test, but Bayes' theorem can be used to demonstrate the relatively high probability of misclassifying non-users as users. Let's assume a corporation decides to test its employees for drug use, and that only 0.5% of the employees actually use the drug. What is the probability that, given a positive drug test, an employee is actually a drug user?

The assertions made in the problem can be formulated as conditional probabilities. Let's make the following identifications.

D = event that an employee uses drugs
Dᶜ = event that an employee does not use drugs

Notice that these two events form an exhaustive partition of our sample space, i.e., D ∪ Dᶜ = S. Either an employee uses drugs or he/she does not. Continuing with our identifications, let

T = event that an employee tests positive on a drug test
Tᶜ = event that an employee tests negative on a drug test

These two events form a second partition of our sample space, i.e., T ∪ Tᶜ = S. Sensitivity and specificity are formulated as conditional probabilities.

sensitivity: P(T|D) = 0.99

specificity: P(Tᶜ|Dᶜ) = 0.99

Conditional probabilities satisfy the laws of probability, so given drug use either an individual tests positive or not. Thus we also know

P(Tᶜ|D) = 1 − P(T|D) = 0.01, and similarly P(T|Dᶜ) = 1 − P(Tᶜ|Dᶜ) = 0.01. We're also told P(D) = 0.005, from which it follows that P(Dᶜ) = 0.995 because these two probabilities have to add to 1. We're asked to find P(D|T). From the definition of conditional probability we have

P(D|T) = P(D ∩ T) / P(T) = P(T|D) P(D) / [P(T|D) P(D) + P(T|Dᶜ) P(Dᶜ)]

where in the last step I use Bayes rule in which our sample space of people with positive tests is partitioned as T = (T ∩ D) ∪ (T ∩ Dᶜ). Plugging in the numbers we obtain the following.

P(D|T) = (0.99 × 0.005) / (0.99 × 0.005 + 0.01 × 0.995) ≈ 0.332

(.99*.005)/(.99*.005+.01*.995)
[1] 0.3322148

This low probability seems anomalous given the apparently high accuracy of the test until we realize that the activity being tested for is quite rare. While I present this largely as an illustration of Bayes rule in action, there is another message to take home from this example. The order of the events in a conditional probability truly matters. In the example P(D|T) and P(T|D) are not only different conceptually but also very different numerically.

The Bayesian approach to statistical inference

In the Bayesian approach to statistics Bayes rule becomes both a method of estimation and a way to make inferences. Here's Bayes rule again in its continuous form.

P(B|A) = P(A|B) P(B) / ∫ P(A|B) P(B) dB

Now let B = θ, a parameter, and A = data. Then Bayes rule can be written as follows.

p(θ | data) = p(data | θ) p(θ) / ∫ p(data | θ) p(θ) dθ

With these definitions we can express Bayes rule as

p(θ | data) ∝ p(data | θ) × p(θ),  i.e.,  posterior ∝ likelihood × prior

where the constant of proportionality is given by the reciprocal of the integral shown above. Thus with Bayes rule we update the prior probability of θ via the likelihood to obtain the posterior probability of θ.

The posterior probability is a statement of our belief about the state of nature. It takes into consideration our prior beliefs as well as what we've learned from our data. If we have no substantive prior beliefs then we can use what are called "vague" or "uninformative" priors. In these cases we essentially derive all of our inference from the likelihood, just like in the frequentist approach to statistics. A subtle difference is that because we used Bayes rule to set up our problem the conclusions we draw about the parameters can be formulated as probability statements.

A contrast of the frequentist and Bayesian approaches to parameter estimation

To illustrate the differences between the frequentist and Bayesian approaches to parameter estimation, I apply the two methodologies to a common problem. The problem is simple enough that classical Bayesian methods can be used. An experiment is carried out to determine the seed germination success of a particular genetic strain. Eighty seeds were sown separately in a single tray placed in a greenhouse. The total number of seeds that successfully germinated was recorded. Suppose we observe that 42 seeds germinated. We make the assumptions that the seeds are independent, so that the individual seeds in a given tray do not interfere with each other, and that the seeds have an identical germination success rate p.

Regardless of whether we are taking a frequentist or Bayesian approach, we begin by writing down the likelihood for this experiment, the probability of observing the data that we observed. Let Xi = outcome for seed i. This outcome is 1 if the seed germinated, and 0 otherwise. If we list the outcome for each seed in order, then the probability of our data is the following.

P(X1 = x1, X2 = x2, … , X80 = x80) = ∏i p^xi (1 − p)^(1 − xi) = p^42 (1 − p)^38

This is an example of what's called a product Bernoulli distribution, which is closely related to the more familiar binomial distribution.
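A quick check in R illustrates the relationship: the binomial probability of 42 successes in 80 trials differs from the product Bernoulli probability only by the constant choose(80, 42), which does not involve p, so the two likelihoods are maximized at the same value of p. (The few values of p below are arbitrary; the counts 42 and 38 come from the germination example.)

p <- c(.3, .5, .7)
bernoulli <- p^42 * (1-p)^38              # product Bernoulli likelihood at a few values of p
binomial <- dbinom(42, size=80, prob=p)   # binomial likelihood at the same values
binomial/bernoulli                        # a constant ratio, equal to choose(80, 42)
choose(80, 42)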

Frequentist approach to this problem

Frequentists work exclusively with the log-likelihood.

log L(p) = 42 log(p) + 38 log(1 − p)

To find the maximum likelihood estimate of p using calculus we can differentiate the log-likelihood, set the result equal to zero, and solve for p. For this problem these steps are easy.

d log L/dp = 42/p − 38/(1 − p) = 0  ⟹  42(1 − p) = 38p  ⟹  p̂ = 42/80 = 0.525

Observe that the estimate is just the total number of successes (germinations) divided by the total number of trials, i.e., the fraction of successes. As is often the case the maximum likelihood estimate is a fairly natural estimate. To obtain this result numerically we can write a function that calculates the negative log-likelihood as a function of p and then use the nlm function to find its minimum.

binomial.negLL <- function(p) -(42*log(p)+38*log(1-p))
out1 <- nlm(binomial.negLL, .5, hessian=T)
out1$estimate
[1] 0.5249995

This is the total number of successes divided by the total number of trials.

42/80
[1] 0.525

Using the Hessian we can calculate a 95% Wald confidence interval for the proportion.

#Wald confidence interval
out1$estimate+c(qnorm(.025), qnorm(.975))*sqrt(1/out1$hessian)
[1] 0.4155734 0.6344256

Bayesian approach to this problem

Bayesians also begin with the likelihood.

P(data | p) = p^42 (1 − p)^38

In addition to the likelihood a Bayesian requires a prior probability for p. The prior probability is a probability distribution that summarizes our knowledge about p before we collected the current data. In a classical Bayesian analysis we would like a probability distribution for the prior that readily combines with the likelihood so that the posterior distribution can be obtained by inspection. Furthermore a prior for p should be a probability model that assigns non-zero probability to values in the interval (0, 1) and zero probability outside this interval. The beta distribution is an example of such a probability distribution. The probability density function of a beta distribution with parameters a and b is the following.

f(p) = p^(a − 1) (1 − p)^(b − 1) / B(a, b),  0 < p < 1

where B(a, b) is a normalizing constant called the beta function. The mean of a beta distribution in terms of its parameters is a/(a + b). Using a generic beta prior for p, the posterior probability of p can be written as follows.

p(p | data) = [p^42 (1 − p)^38 × p^(a − 1) (1 − p)^(b − 1) / B(a, b)] / ∫ p^42 (1 − p)^38 × p^(a − 1) (1 − p)^(b − 1) / B(a, b) dp

If we ignore the normalizing constants in the denominator, we see that the numerator has the form of a beta distribution with parameters 42 + a and 38 + b. Thus the normalizing constant must reduce to B(42 + a, 38 + b) and hence the posterior distribution is also a beta distribution.

p(p | data) = p^(42 + a − 1) (1 − p)^(38 + b − 1) / B(42 + a, 38 + b),  i.e.,  p | data ~ Beta(42 + a, 38 + b)

The mean of this distribution is

(42 + a) / (42 + a + 38 + b) = (42 + a) / (80 + a + b)

In order to agree with the frequentist estimate, we need to choose an uninformative prior for p. For a proportion bounded on the interval [0, 1] a natural choice is the uniform distribution on this interval. A uniform distribution on (0, 1) has the constant density f(p) = 1 and is a special case of a beta distribution in which a = b = 1. For a Bayesian point estimate of p we can use the mean of the posterior distribution that is given by

posterior mean = (42 + 1) / (80 + 1 + 1) = 43/82 ≈ 0.5243902

Notice that the Bayesian estimate of p is similar but not identical to the frequentist estimate. The difference between the two estimates would be smaller with a larger sample size because the influence of the prior would be further reduced.
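In R the conjugate analysis takes only a couple of lines. This is a minimal sketch assuming the uniform Beta(1, 1) prior used above; qbeta supplies a 95% credible interval from the posterior quantiles.

a <- 1; b <- 1                         # uniform prior is Beta(1, 1)
post.a <- 42 + a; post.b <- 38 + b     # posterior is Beta(43, 39)
post.a/(post.a + post.b)               # posterior mean, 43/82
qbeta(c(.025, .975), post.a, post.b)   # 95% credible interval for p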

A modern approach to Bayesian estimation

The beta distribution is referred to as a conjugate distribution of the product Bernoulli distribution (more generally, it is a conjugate distribution of the binomial distribution). They are conjugates because the product of the two distributions yields a new distribution of the same form as the prior, in this case a beta distribution. Historically this was how Bayesian estimation proceeded. We construct a likelihood and then try to find a suitable prior such that the prior is a conjugate of the likelihood, or at least such that the normalizing integral in the denominator of Bayes rule can be evaluated. The problem with the classical approach is that it restricts the use of Bayesian methods to only very simple problems. When we turn to regression problems in which there are multiple parameters of interest, each with their own priors, there is no simple way to select the priors so that the corresponding posterior distribution can be determined.

The manner in which statisticians carry out Bayesian estimation changed dramatically in the early 1990s. Today Bayesian analysis is carried out using a set of techniques referred to as Markov Chain Monte Carlo methods that make it possible to sample from the posterior distribution, for a given prior and likelihood, without the need to obtain an explicit formula. Markov chain Monte Carlo (MCMC for short) has two components to its name: Monte Carlo and Markov chain.

  1. The "Monte Carlo" in MCMC refers to Monte Carlo methods, the name being derived from the casino. Technically a Monte Carlo method is a stochastic simulation in which values from a probability distribution are obtained by sampling. Monte Carlo methods include a large class of statistical methods that depend on random number generation and sampling from computer-generated values to obtain approximate answers to statistical questions. In MCMC, the Monte Carlo methods are used for obtaining random samples from the desired posterior distribution, posterior.
  2. "Markov chain" refers to the algorithm used for generating sequential random values that lead us to the putative posterior distribution. If a Markov chain is formulated correctly, theory guarantees that eventually the values it generates will be samples from the desired posterior distribution. Formally, we need what's called an ergodic Markov chain because such chains possess a stationary distribution. The methods commonly used in MCMC are guaranteed to produce ergodic Markov chains.

So a Markov chain is how we get to the posterior distribution and Monte Carlo is how we obtain samples of that posterior distribution once we get there. Markov chain Monte Carlo doesn't return a formula for the posterior distribution, just a set of values. But with a large enough sample we can characterize most features of interest about the posterior distribution—quantiles, means, medians, etc. The trick in Markov chain Monte Carlo is to find a Markov chain whose stationary distribution is the posterior distribution p(θ | data). With such a Markov chain we just let it run for a while (the so-called burn-in period) to give the chain time to reach its stationary distribution. Once there the chain will roughly visit values in the stationary distribution in proportion to their probabilities. Thus values in more probable regions of the distribution will be visited more often than values in the tail of the distribution.

A full discussion of MCMC methods and why MCMC works is beyond the scope of this course. Some of the references cited at the end of this document provide more information. The programs BUGS and JAGS are adaptive in their choice of method. They will use Gibbs sampling, adaptive rejection sampling, the Metropolis algorithm, or various other MCMC methodologies depending on the nature of the problem you submit. To give a hint at the flavor of what's involved I briefly describe the logic behind two of these MCMC methods: the Metropolis algorithm and Gibbs sampling.

Metropolis algorithm

The Metropolis algorithm originates from work in statistical physics dating back to the 1950s and was adapted for Bayesian estimation only in the last 25 years. The method uses the fact that knowing the full formula for the posterior distribution, including the normalizing constant, is unnecessary if one only needs to deal with the ratios of posteriors. Ignoring the details of the normalizing constant Bayes rule can be written as follows.

p(θ | data) = k × p(data | θ) p(θ)

where k is the normalizing constant. When we look at the ratio of the posterior densities at two distinct values of the parameter, θ1 and θ2, the normalizing constant cancels, so we only need to work with the likelihood and the prior, the two quantities for which we have formulas.

p(θ1 | data) / p(θ2 | data) = [p(data | θ1) p(θ1)] / [p(data | θ2) p(θ2)]

The Metropolis algorithm uses what's called a proposal distribution (also a jumping or instrumental distribution) to generate candidate values of θ. The proposal distribution is denoted

q(θ* | θ), the probability of proposing a candidate value θ* given the current value θ.

In the Metropolis algorithm the proposal distribution must be one that is symmetric, i.e., q(θa | θb) = q(θb | θa). In another variation, the Metropolis-Hastings algorithm, the symmetry requirement is replaced with a more general condition.

At step t of the Metropolis algorithm, a new candidate value θ* is drawn from the proposal distribution given the current value θt−1. A form of rejection sampling is then used to decide if we should keep this new value. We calculate

α = min(1, [p(data | θ*) p(θ*)] / [p(data | θt−1) p(θt−1)])

and the new parameter value θ* is accepted with probability α. In the Metropolis-Hastings algorithm the formula used for α is slightly more complicated in that it also involves a ratio of the proposal distributions. In practice if α < 1 we draw a random value u from the uniform distribution on (0, 1) and keep θ* if u ≤ α; otherwise we retain the current value.

It turns out that the series of θ values generated by this algorithm forms a Markov chain whose stationary distribution is the desired posterior distribution p(θ | data). The Markov chain generated by the Metropolis-Hastings algorithm converges to p(θ | data) regardless of the choice of proposal distribution q, but the choice of q can affect how fast the convergence takes place.

Here's a trivial illustration of how the Metropolis algorithm might work in practice. Suppose there are only seven possible values of θ, the numbers 1 through 7, and that we can evaluate the prior, the likelihood, and their product at each of these seven values. The product of the prior and the likelihood is the numerator of Bayes rule and is referred to as the target distribution, f(θ) = p(data | θ) p(θ). It's the posterior density without the appropriate normalizing constant. For our example with seven distinct values of θ, suppose the target distribution takes the following form, where for convenience I've chosen f(θ) = θ (Fig. 2).

Fig. 2   Hypothetical target distribution, f(θ) = θ for θ = 1, 2, … , 7.

We begin by choosing one of the seven possible values of θ at random. This is our starting value. Next we need a proposal distribution, a way of generating new values of θ given the current value of θ. A simple choice is to either select the observation immediately to the left of θ with probability 1/2 or to select the observation immediately to the right of θ with probability 1/2. Thus if the current value of θ is θi, we next choose either θi – 1 or θi + 1 with equal probabilities. (Note: we would need to treat the endpoints θ = 1 and θ = 7 as special cases.) We can implement this by choosing a random number between 0 and 1. If the random number is less than 0.5, we choose θi – 1; if this number is greater than 0.5 we choose θi + 1.

Having selected a candidate value we have to decide whether to keep it (move there). For that we evaluate the target distribution function at the candidate value and compare it to the value of the target distribution function at the current value. If the target distribution is larger at the candidate value, so that the ratio of the target distributions is greater than 1, we move there. If the ratio is less than one then we move to the candidate value with probability equal to the ratio. For the target distribution in Fig. 2 our decision rule is the following.

  1. If we draw θi + 1 we move there because f(θi + 1)/f(θi) = (θi + 1)/θi > 1.
  2. If we draw θi – 1 we move there with probability f(θi – 1)/f(θi) = (θi – 1)/θi, otherwise we stay where we are.

This leads to the formal decision rule given previously. To see how this would work in practice suppose the random starting point is θ = 4. I generate six random numbers between 0 and 1.

set.seed(10)
probs <- runif(6)
probs
[1] 0.50747820 0.30676851 0.42690767 0.69310208 0.08513597 0.22543662
  1. We start at θ = 4. We need to choose between the adjacent values θ = 3 and θ = 5.
  2. Because probs[1] > 0.5 we choose θ = 5. Then because f(5) = 5 > 4 = f(4) we move to θ = 5.
  3. Because probs[2] < 0.5 we next choose θ = 4. Then because f(4)/f(5) = 0.8 > probs[3], we move to θ = 4.
  4. Because probs[4] > 0.5 we next choose θ = 5. Then because f(5) = 5 > 4 = f(4) we move to θ = 5.
  5. Because probs[5] < 0.5 we next choose θ = 4. Then because f(4)/f(5) = 0.8 > probs[6], we move to θ = 4.

So after five steps our Markov chain consists of the values {4, 5, 4, 5, 4}. The claim is that if we carry out this protocol long enough the relative frequencies of the elements of our sample will correspond to the probabilities of the posterior distribution (which is a multiple of the target distribution). The rationale for this assertion is that our choices in moving from one point to another are entirely determined by the relative magnitudes of the posterior probabilities.
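This protocol is easy to automate. The sketch below is only an illustration (the function name metropolis.demo and the run length are arbitrary, and out-of-range proposals at the endpoints are simply rejected); run long enough, the relative frequencies of the sampled values should approach the target probabilities θ/28.

# toy Metropolis sampler on theta = 1, ..., 7 with target f(theta) = theta
f <- function(theta) theta
metropolis.demo <- function(n.steps, start=4) {
  theta <- numeric(n.steps)
  theta[1] <- start
  for(t in 2:n.steps) {
    cur <- theta[t-1]
    # symmetric proposal: step left or right with probability 1/2
    cand <- cur + sample(c(-1, 1), 1)
    # treat the endpoints as special cases: an out-of-range proposal is rejected
    if(cand < 1 || cand > 7) cand <- cur
    # accept the candidate with probability min(1, f(cand)/f(cur))
    if(runif(1) <= f(cand)/f(cur)) theta[t] <- cand else theta[t] <- cur
  }
  theta
}
set.seed(10)
sim <- metropolis.demo(100000)
round(table(sim)/length(sim), 3)   # relative frequencies of 1, ..., 7
round((1:7)/28, 3)                 # target probabilities theta/28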

For the case in which there is a continuum of possible values for θ we need a more elaborate proposal distribution. A typical choice might be a normal distribution centered at the current value with a standard deviation that is treated as a tuning parameter. Because it generally takes a while to explore the parameter space, particularly if we start in an unrepresentative portion of it, the Metropolis algorithm requires a burn-in period. The observations obtained during the burn-in period are discarded. Also, because it will take a while for the relative frequencies in our sample to stabilize around those of the posterior distribution, the Metropolis algorithm needs to be run for a long time after the initial burn-in period. (There are a number of diagnostics that have been developed for determining how long to run the algorithm and when to terminate the burn-in period.) Once we're satisfied that we're sampling from the stationary (posterior) distribution, we let the Markov chain run some more, this time saving the values it generates. The values we obtain are then used to obtain Monte Carlo estimates of whatever characteristics of the posterior distribution we desire.
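As a sketch of the continuous case, the following code samples the posterior of the germination probability p from the seed example, using the target p^42 (1 − p)^38 (the likelihood times a uniform prior) on the log scale and a normal proposal. The proposal standard deviation of 0.05, the run length, and the burn-in length are arbitrary choices for illustration.

# Metropolis sampler for the germination probability p with a uniform prior
log.target <- function(p) if(p <= 0 || p >= 1) -Inf else 42*log(p) + 38*log(1-p)
n.iter <- 20000; burn.in <- 5000
p.chain <- numeric(n.iter)
p.chain[1] <- 0.5                                # starting value
set.seed(10)
for(t in 2:n.iter) {
  cand <- rnorm(1, mean=p.chain[t-1], sd=0.05)   # symmetric normal proposal
  log.alpha <- log.target(cand) - log.target(p.chain[t-1])
  p.chain[t] <- if(log(runif(1)) <= log.alpha) cand else p.chain[t-1]
}
post <- p.chain[-(1:burn.in)]                    # discard the burn-in period
mean(post)                                       # should be near 43/82
quantile(post, c(.025, .975))                    # a 95% credible interval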

The Gibbs sampler

The Gibbs sampler is another of the Markov chain Monte Carlo methods implemented in the software BUGS and JAGS. The Gibbs sampler yields a Markov chain whose stationary distribution is the posterior distribution we want. Thus after a finite amount of time we're guaranteed that the sequential values of the Markov chain are correlated samples from the desired posterior distribution. The relative frequency of observations from different intervals corresponds approximately to the probabilities defined by the posterior distribution over those intervals.

Suppose the target distribution p(θ | data) depends on k parameters θ1, θ2, … , θk. Then the Gibbs sampler requires k conditional probability statements of the form

p(θi | θ1, … , θi−1, θi+1, … , θk, data),  i = 1, 2, … , k

These probabilities are often obtainable from hierarchical models when the models are formulated from a Bayesian perspective. Typically the basic regression model is itself a conditional statement, the random effects are conditional on the values of parameters of a normal distribution, and all the remaining parameters are conditional on their priors.

With the conditional probabilities formulated we choose starting values θ1(0), θ2(0), … , θk(0) and then use the probability statements one by one to update these values by drawing new values from the full conditional probability distributions. At stage j of the iteration the update equations will look as follows.

θ1(j) ~ p(θ1 | θ2(j−1), θ3(j−1), … , θk(j−1), data)
θ2(j) ~ p(θ2 | θ1(j), θ3(j−1), … , θk(j−1), data)
θ3(j) ~ p(θ3 | θ1(j), θ2(j), θ4(j−1), … , θk(j−1), data)
⋮
θk(j) ~ p(θk | θ1(j), θ2(j), … , θk−1(j), data)

Notice that once a new value is generated it is immediately used as the conditioning value for the remaining parameters in all subsequent conditional probability distributions at stage j. This is clearly a Markov chain. It also turns out to be an ergodic Markov chain.

lim n→∞ P(θ(n) = θj | θ(0) = θi) = π(θj),

i.e., in the limit the probability of going from θi to θj in n steps depends only on the final state. When the Gibbs sampler is formulated from the likelihood and priors of a Bayesian model, the limiting probability π(θ) is in fact the desired posterior probability p(θ | data).

So, we set up the Gibbs sampler and we let it run for a while (the burn-in period) to give the chain time to "reach" its limiting distribution. Once there the chain visits values roughly in proportion to their probabilities. With a large enough sample of chain values we can obtain estimates of these probabilities and use them to estimate other parameters of interest. Bayesian problems are well-suited for the Gibbs sampler because they are already based on a series of conditional probability statements. The good news is that we don't have to set up the equations for the Gibbs sampler ourselves; WinBUGS and JAGS do it for us based on the model definition file we give them.
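To make the update scheme concrete, here is a minimal hand-coded Gibbs sampler of the sort BUGS constructs automatically. The model is a normal likelihood with unknown mean mu and precision tau, with a vague normal prior on mu and a vague gamma prior on tau, so that both full conditionals have standard forms. The data vector y and the prior settings are invented purely for illustration.

# Gibbs sampler for y[i] ~ N(mu, 1/tau)
# priors: mu ~ N(0, precision 1e-6), tau ~ Gamma(.001, .001)
set.seed(10)
y <- rnorm(50, mean=10, sd=2)            # made-up data
n <- length(y); ybar <- mean(y)
mu0 <- 0; t0 <- 1e-6; a0 <- .001; b0 <- .001
n.iter <- 5000
mu <- tau <- numeric(n.iter)
mu[1] <- ybar; tau[1] <- 1/var(y)        # starting values
for(j in 2:n.iter) {
  # update mu from its full conditional given the current tau
  prec <- t0 + n*tau[j-1]
  mu[j] <- rnorm(1, mean=(t0*mu0 + n*tau[j-1]*ybar)/prec, sd=sqrt(1/prec))
  # update tau from its full conditional given the new mu
  tau[j] <- rgamma(1, shape=a0 + n/2, rate=b0 + sum((y - mu[j])^2)/2)
}
# posterior summaries for mu after discarding a burn-in period
mean(mu[-(1:1000)])
quantile(mu[-(1:1000)], c(.025, .975))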

When conjugate priors are chosen so that the posterior distribution has a known form WinBUGS can bypass all this and sample directly from the posterior distribution using standard algorithms. One such standard algorithm is the inversion method. When this is not the case WinBUGS can use a method called adaptive rejection sampling. Rejection sampling requires that we can find an envelope function, g(x), that when multiplied by some constant, m, is greater than the true posterior density, f(x), at all values of x. The envelope function needs to be one from which sampling is easy. Having drawn a value z from the envelope function, we examine the ratio R,

R = f(z) / (m × g(z))

We then draw a number u from the uniform distribution on the interval (0, 1). If R > u, we accept the value of z as the new value of our parameter. Otherwise we keep the old value. The envelope function used by BUGS is constructed from a set of tangents to the log of the conditional density.
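A bare-bones, non-adaptive version of rejection sampling is easy to code. In this sketch the target f is the Beta(43, 39) posterior from the seed germination example, the envelope g is the uniform density on (0, 1), and m is chosen just above the maximum of f (which occurs at the mode 42/80). BUGS builds a much cleverer envelope from tangent lines, as described above; this is only meant to show the accept/reject logic.

# simple rejection sampling from a Beta(43, 39) target with a uniform envelope
f <- function(x) dbeta(x, 43, 39)        # target density
g <- function(x) dunif(x)                # envelope density
m <- f(42/80) * 1.01                     # m chosen so that m*g(x) >= f(x) everywhere
set.seed(10)
draws <- numeric(0)
while(length(draws) < 1000) {
  z <- runif(1)                          # draw a value from the envelope
  R <- f(z)/(m*g(z))                     # the ratio R
  if(runif(1) < R) draws <- c(draws, z)  # accept z with probability R
}
mean(draws)                              # should be near the posterior mean 43/82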

The frequentist philosophy of statistics and the Bayesian critique

For the frequentist everything hinges on the notion of repeated sampling. In the informal definition of probability the probability of an event represents the long range relative frequency of the occurrence of that event. To say that the probability of heads is one half when a fair coin is tossed once means that if we were to flip a fair coin repeatedly the long run relative frequency of heads is one half. (Note: another way to interpret probability in this example is to treat the possible outcomes of a coin flip as equally likely.)

Statistical inference derives from the sampling distribution of a statistic. The sampling distribution is the set of estimates obtained when all possible samples are selected from a target population. Properties of a statistic (unbiasedness), standard errors, etc. derive from the sampling distribution of that statistic.

Implicit in the frequentist approach to estimation is that there is a fixed quantity in nature, the parameter θ, that we wish to measure with a statistic. For example, the bias of a statistic is the difference between the mean of the sampling distribution of the statistic and the true value that is being estimated. So, for a frequentist it is parameters that are fixed while it is the data that are random. This is diametrically opposed to the Bayesian point of view. For a Bayesian it is the data that are fixed and parameters that are random.

The notion that the parameter is a fixed quantity in nature causes problems in interpretation for the frequentist. When we construct a confidence interval for a parameter we like to treat the interval as a probability statement for the parameter—the set of likely values it might take. But in truth if the parameter is fixed then it is either in the interval we've constructed or it's not. There's no probability associated with it. The probability instead derives from the sampling distribution. We call it a 95% confidence interval because we're guaranteed that 95% of the intervals we might have constructed, if we had obtained all possible samples from the population, do in fact contain the true parameter value. All we can do is hope that this is one of the lucky ones.

It's been said that constructing a 95% confidence interval is analogous to running from an elephant while shooting an elephant gun over your shoulder. Did you hit it? You don't know but you know that 95% of the time an elephant gun is known to stop an elephant. You just hope this is one of those times. The bottom line is that when θ is treated as real and fixed, then it's only our methods that can have probability associated with them.

The frequentist perspective uses p-values as measures of evidence against hypotheses, but the p-value is a measure of how likely the observed data are under the null hypothesis, i.e., P(data | H0). This doesn't address the probability of the null hypothesis; for that we need P(H0 | data). So the p-value addresses the wrong probability. We've seen from our simple drug example application of Bayes rule that the order matters in conditional probability statements. Practically speaking the frequentist is trying to draw inference about whether a specific individual who tested positive on a test is a drug user using only the sensitivity of the test.

A Bayesian has other problems with p-values beyond the fact that they estimate the wrong probability. By definition a p-value is the probability under the null hypothesis of obtaining data as extreme or more extreme than the data one actually obtained. A Bayesian would wonder why we are basing our decision about the null hypothesis using data we didn't obtain. Shouldn't we base our decisions on the evidence at hand and not on the evidence we might have obtained?

The Bayesian alternative to the frequentist approach

In the Bayesian approach, probability is a matter of opinion. It still must follow the standard rules of probability (the Kolmogorov axioms, for instance), but other than that things are pretty flexible. For this reason Bayesians are said to take a subjective approach to probability (as opposed to the "objective" approach of the frequentists).

As far as the existence of a true value of θ in nature, Bayesians are of an open mind. θ may or may not be real, but in the Bayesian perspective it doesn't matter. All we know about θ is what we believe about it. As knowledge accumulates our beliefs about θ become more focused. Since the value of θ is a matter of belief, and probability is a matter of belief, for all intents and purposes θ can be viewed as a random quantity. Consequently to a Bayesian parameters are random variables, not fixed constants. As a result confidence intervals pose no philosophical dilemma for a Bayesian. Since parameters are random we can make probability statements about their values. Thus a confidence interval for θ is a probability statement about the likely values of θ. To avoid confusion with frequentist confidence intervals, Bayesians often call their intervals "credible intervals".

A Bayesian takes pride in the fact that Bayes rule as formulated above essentially encapsulates inductive science in a single formula. In science we develop theories about nature. Observation and experiment then cause us to modify those theories. Further data collection causes further modification. Thus science is a constant dynamic between prior and posterior probabilities, with prior probabilities becoming modified into posterior probabilities through experimentation at which point they become the prior probabilities of future experiments. Thus the Bayesian perspective accounts for the cumulative nature of science.

A frequentist critique of the Bayesian approach

The frequentist retort is that this is a mischaracterization of science. Science is inherently objective and has no use for subjective quantities such as prior probabilities. Science should be based on data alone and not the prejudices of the researcher.

The Bayesian rejoinder to this is first that science does have a subjective component. The "opinions" of scientists dictate the kinds of research questions they pursue. In any case if there is concern that a prior probability unfairly skews the results, the analysis can be rerun with other priors. In Bayesian analysis it is fairly typical to carry out analyses with a range of priors to demonstrate that results are robust to the choice of prior.

In fact both schools of thought are correct here. While Bayesians may have described the ideal scientific method, in truth consensus in science is informal at best. Perhaps there should be a current prior probability in vogue for everything in science, but typically there isn't. Without this consensus the inherent subjectivity of priors does seem to be a problem.

It's interesting then that Bayesians and frequentists reject the other's approaches for essentially the same reason.

Until Markov chain Monte Carlo came along this entire argument was moot because very few interesting problems could be solved using Bayes rule. The reason lies in the integral that appears in the denominator. For most higher dimensional scenarios evaluating this integral is an intractable problem.

Historically the only estimation problems Bayesians were able to solve were trivial ones where the integral can be ignored because the form of the posterior probability can be determined by inspection. These situations involve what are called conjugate priors in which the likelihood and prior combine in a nice fashion so that the posterior probability can be determined without integration. For instance, a normal prior and a normal likelihood yield a normal posterior of a known form. A binomial likelihood coupled with a beta prior yield a beta posterior of known form. There are a number of other examples of conjugates, but the list is not very long. Now that MCMC methods have obviated this problem by sampling from the posterior distribution directly the pendulum of opinion has swung back in the Bayesian direction.

Bayesian statistics—why should a frequentist care?

While there are many good reasons for taking a Bayesian perspective, I address this section to someone who is committed to doing statistics from a frequentist perspective. So why in general should a frequentist care about the Bayesian approach and have to grapple with the issue of choosing prior probabilities? I give two reasons.

  1. If you have lots of data, then the choice of prior won't matter much, so parameter estimates based on the posterior distribution will resemble parameter estimates based on the likelihood. Thus one can get MLEs via MCMC. For instance, in a simple estimation problem with a normal likelihood and a normal prior it turns out that the posterior mean is just a weighted sum of the sample mean and the prior mean: posterior mean = w1 × sample mean + w2 × prior mean, where the weights are measures of relative precision. The weights are of the form w1 = (n/s2)/k and w2 = (1/τ2)/k, in which n is the sample size, s2 and τ2 are the variances of the sample and the prior respectively, and k = n/s2 + 1/τ2 is a normalizing constant. Clearly from the weights if the sample size n is large the sample mean will dominate this sum and the prior will have little effect (see the short numerical check after this list).
  2. With only a moderate amount of data we can choose to carry out an "objective" Bayesian analysis. In this approach we select an uninformative (vague, flat, diffuse) prior so that the prior has minimal impact on determining the posterior distribution. That this is possible is also clear from the weighted mean example above. If the variance of the normal prior is large (yielding an uninformative prior) the corresponding weight will be small.
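The weighted-mean behavior in item 1 is easy to verify numerically. In the sketch below the sample mean, prior mean, and the two variances are arbitrary numbers chosen for illustration; the point is only that the weight on the prior mean shrinks as the sample size n grows.

# posterior mean of a normal mean: a weighted sum of the sample mean and prior mean
post.mean <- function(n, xbar, s2, prior.mean, tau2) {
  k <- n/s2 + 1/tau2                     # normalizing constant for the weights
  w1 <- (n/s2)/k; w2 <- (1/tau2)/k
  w1*xbar + w2*prior.mean
}
post.mean(n=5, xbar=10, s2=25, prior.mean=0, tau2=4)     # small n: prior pulls the estimate toward 0
post.mean(n=500, xbar=10, s2=25, prior.mean=0, tau2=4)   # large n: essentially the sample mean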

The upshot of this is that MCMC can be used to obtain parameter estimates for models that are far too complicated to fit using ordinary frequentist optimization methods. The typical situations where the Bayesian perspective may be necessary are the following:

One or more of these four conditions is typical in ecology. The only real problem from a practical standpoint with using Markov chain Monte Carlo in model estimation is that it exacerbates the temptation to toss parsimony out the window and to fit extremely complicated models. This temptation should be resisted at all costs!

Estimating an analysis of covariance model using JAGS and BUGS

Software such as WinBUGS and JAGS that implements Markov chain Monte Carlo (MCMC) methodology can be used to carry out Bayesian estimation. BUGS is an acronym for "Bayesian inference Using Gibbs Sampling" while JAGS is an acronym for "Just Another Gibbs Sampler". Both use the BUGS programming language to formulate the likelihood of our data under our chosen model along with prior distributions for the model parameters. Based on the information we provide, WinBUGS and JAGS set up the appropriate Gibbs sampler (or some alternative MCMC method) that yields Markov chains that we can then track.

To illustrate the use of this software I refit the analysis of covariance model we considered in lecture 9. The data were taken from Crawley (2002), who described three variables of interest: Fruit, Root, and Grazing (a two-level factor, grazed versus ungrazed).

Our final model from lecture 9 was the following.

ipo <- read.table( 'ecol 563/ipomopsis.txt', header=T)
out2 <- lm(Fruit ~ Root + Grazing, data=ipo)

We will fit this same model using Bayesian methods and compare the estimates to those returned by lm.

WinBUGS/JAGS Preliminaries

Fig. 3  R file menu choices

WinBUGS is a stand-alone package that is not particularly easy to use. Andrew Gelman of Columbia University has written an R package, arm, that among other things permits running WinBUGS models from within R by calling on the bugs function from the R2WinBUGS package. This simplifies the use of WinBUGS considerably because we can use R to set up the analysis and process the results and then use WinBUGS only to fit the model.

JAGS also uses the BUGS programming language with some minor differences. It is not stand-alone software but requires R in order to run. The R2jags package can be used to run JAGS models from R. The jags function from this package returns output that matches the format of the output generated by WinBUGS.

To run WinBUGS using bugs requires the following.

  1. Version 2.4 (or later) of R. The arm package is not compatible with earlier versions of R.
  2. The arm package of R.
  3. WinBUGS 1.4 plus patches.
  4. The perpetual key that removes any restrictions on the use of WinBUGS.

WinBUGS can be installed to any location on your machine. If WinBUGS is not installed in the default location, C:\Program Files\WinBUGS14, then the correct path needs to be specified when you invoke WinBUGS from within R. To run JAGS you need the JAGS software and the R2jags package from the CRAN site.

At the core of every WinBUGS or JAGS run is a model definition file that specifies the details of the model. This is a text file written in the BUGS language. It should be created in Notepad or some other text editor. It is important that the working directory of R be set to the directory where this text file has been saved. The working directory can be set by making a menu choice or by using the setwd command at the R console. To set the working directory using menus on Windows go to the File menu and select Change dir... (Fig. 3). In the dialog box that appears set the directory to match the location where the WinBUGS model definition file lives. This directory now becomes the default location where R will save graphics and other output. On a Mac choose R > Preferences > Start up and make the changes in the Initial Working Directory section. I prefer to use the setwd command from the console (see below).

IMPORTANT CAVEAT!! Do not bury your BUGS program inside many folders. Try to store it only a few layers deep. More importantly, do not save it in folders with long and complicated names. WinBUGS will fail to run and give you a completely incomprehensible error message to boot.

Describing the model in the BUGS language

WinBUGS likes things simple. For instance WinBUGS accepts only numeric data. Character data and factors are not allowed so factors need to be converted to a set of dummy variables. In addition variables must be entered into WinBUGS in a special way. To facilitate this I begin by pulling out the individual variables of the data frame ipo that we will need for the regression analysis. In doing so I try to keep the variable names simple and generic.

x <- ipo$Root                    # continuous predictor
y <- ipo$Fruit                   # response
z <- as.numeric(ipo$Grazing)-1   # two-level factor converted to a 0-1 dummy variable

The BUGS model definition file that we'll construct at one point makes use of the number of observations in the data set as a constant. So I calculate this value in R and assign the result to another object.

n <- length(y)

Next I write the Bayesian model using the BUGS language. Much of the language resembles R but with subtle differences. The language is documented in manuals available at the site where you downloaded WinBUGS and/or JAGS. I enter the lines in NotePad and save the resulting file as ipomodel.txt. Here's the model definition file followed by a line-by-line explanation.

model{
#likelihood
for(i in 1:n) {
  y[i]~dnorm(y.hat[i],tau.y)
  y.hat[i] <- b0+b1*x[i]+b2*z[i]
}
#priors
b0~dnorm(0,.000001)
b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
#tau.y~dgamma(.001,.001)
tau.y <- pow(sigma.y,-2)
sigma.y~dunif(0,10000)
}

A line by line commentary on the code

  1. The model begins with the key word model. The actual model code is contained within the pair of curly braces { }.
  2. A for loop is used to describe the probability model for the data, the likelihood so to speak. Notice that the index runs from 1 to n, where n is the number of observations, so this loop specifies a probability model for each observation. To make the code more portable I do not hardwire the number of observations explicitly into the code but represent it with the variable n instead.
  3. The dnorm "function" in WinBUGS differs from that of R. First it is not really a function (it's missing the first argument of the dnorm function of R). It is just the way BUGS denotes the normal distribution, much like in the standard mathematical notation Y ~ Normal(0, 1). The first "argument" of dnorm is the mean of the distribution and the second argument specifies what's called the precision.
  4. The second line of the loop further defines the mean, y.hat. The use of the assignment arrow means that y.hat is a derived quantity. The formula given is just the analysis of covariance regression model: b0 + b1x + b2z.
  5. The definition of y.hat here appears to be redundant. It would seem that we could have made the formula b0+b1*x[i]+b2*z[i] the first argument of the dnorm function and skipped the definition of y.hat entirely. You should never do this! WinBUGS is easily confused when complicated expressions are used as arguments in probability functions. Regression formulas and the like should always be specified outside of probability functions just as we're doing here.
  6. The remainder of the code specifies priors for the model parameters. The next three lines place what are hopefully uninformative priors on the regression parameters b0, b1, and b2. Since a priori these coefficients can be anything, we use normal priors because the support for the normal distribution is the entire real line. The second argument is the precision and we choose this to be low. As was noted above, the precision is the reciprocal of the variance so here we're choosing a prior that is a normal distribution with a variance of 1,000,000. This yields a fairly wide distribution implying that we have little knowledge about the true values of b0, b1, and b2.
  7. The next line defines the quantity tau.y, the precision of the normal distribution for the likelihood. The usual approach is to use a gamma distribution (the line that is commented out). The approach we use here is to place a prior on the standard deviation rather than on the precision itself. Hence the line tau.y <- pow(sigma.y,-2) is used to connect the precision to the standard deviation thus causing the precision to be a derived quantity rather than a parameter. Notice the use of the pow (for power) function instead of writing this out with the exponentiation operator. This is the preferred way of writing powers in the BUGS language.
  8. Finally the last line sets a uniform prior on the standard deviation, sigma.y, uniform on the interval (0, 10000). Setting the priors on a standard deviation parameter can be tricky. We need to set the precision low enough to have a minimal effect on the results, but not so low that the sampler used by BUGS has trouble generating values. When the precision is set too low you'll obtain a message such as: "cannot bracket slice for node sigma.y".

Some remarks about WinBUGS/JAGS models in general

BUGS code looks just like R code but with important differences. BUGS code is used to specify the model. The lines of code are not instructions that are executed. In fact BUGS code is not executed at all; it is a declarative language. Although lines are entered sequentially, they're not interpreted sequentially. BUGS parses the entire model and then runs a process. BUGS can make use of for loops for model specification but their use is just a matter of convenience. There are six kinds of BUGS objects.

  1. Modeled data: in code these are defined with a ~. For example y ~ followed by a probability distribution. The variable y here is the response in our regression model. Modeled data are objects that are assigned probability distributions.
  2. Unmodeled data: objects that are not assigned probability distributions. Examples include predictors, constants, and index variables. In the above code x, z, and n are examples of unmodeled data.
  3. Modeled parameters: these are given informative “priors” that themselves depend on parameters called hyperparameters. These are what a frequentist would call random effects. There are none in the above model.
  4. Unmodeled parameters: these are given uninformative priors. So in truth all parameters are modeled in the Bayesian approach. All the rest of the frequentist parameters fall under this category.
  5. Derived quantities: these objects are typically defined with the assignment arrow, <-
  6. Looping indexes: i, j, etc.

BUGS uses probability functions with names that match many of R's probability functions but there is a twist. For instance the dnorm function is parameterized in terms of the precision (reciprocal variance) rather than the standard deviation. Secondly, dnorm is not really a function in WinBUGS (as it is in R) but is just a label for the normal distribution. The ~ as used here has the meaning it has mathematically, namely "is distributed as", and is not the formula operator of R.

In this course we will use the following strategies for setting uninformative priors for model parameters.

  1. For nonnegative quantities (standard deviations) or restricted range quantities (correlations) we will use a uniform prior.
  2. For all other quantities we will use a normal prior.

The most important modeling strategy when using BUGS is to start simple and add complexity gradually, each time checking that the current model works and makes sense before moving to a more complex model.

Running the WinBUGS/JAGS model from R

With the model saved to the working directory, we enter the following three lines back at the console prompt in R.

ipo.data <- list("n", "y", "x", "z")
ipo.inits <- function() {list(b0=rnorm(1), b1=rnorm(1), b2=rnorm(1), sigma.y=runif(1))}
ipo.parms <- c("b0", "b1", "b2", "sigma.y")

These three lines create objects that will be passed to WinBUGS/JAGS for use in estimating the model.

I next set the working directory (if this hasn't already been done using the menu, Fig. 3) using the setwd function. The working directory is the one that contains the BUGS model definition file. It is not a good idea to bury things in folders that are many layers deep. The containing folders should also have fairly simple names. If you violate either of these recommendations, WinBUGS may fail to run (and the error message you get will be completely cryptic). To see what the current working directory is use the getwd() command.

getwd()
"C:/Users/jmweiss/Documents/"
setwd("C:/Users/jmweiss/Documents/ecol 563")

We're now ready to run the model. On a Mac I load the R2jags package (which must be initially downloaded from the CRAN site because it's not part of the standard R installation--be sure to check the install dependencies box when you download it) and make the following call using the jags function from R2jags.

library(R2jags)
ipo.1j <- jags(ipo.data, ipo.inits, ipo.parms, "ipomodel.txt", n.chains=3, n.iter=100)

On Windows I load the arm package (which must be initially downloaded from the CRAN site because it is not part of the standard R installation) and make the following call using the bugs function from arm.

library(arm)
ipo.1 <- bugs(ipo.data, ipo.inits, ipo.parms, "ipomodel.txt", bugs.directory = "C:/WinBUGS14", n.chains=3, n.iter=100, debug=T)


Fig. 4  WinBUGS report on the model

Shown below is an explanation of each of the arguments of jags and bugs.

  1. data = ipo.data  The first argument to bugs is the data object that was created above listing the variable names.
  2. inits = ipo.inits  The second argument is the function that sets the initial values for each chain.
  3. parameters = ipo.parms  The third argument is an object containing the names of the parameters we wish to monitor.
  4. model.file = "ipomodel.txt"  The fourth argument is the name of the file that contains the WinBUGS code that defines the model.
  5. bugs.directory = "C:/WinBUGS14"  This is an optional argument that specifies the directory that contains the WinBUGS program, WinBUGS14.exe. If you did the default installation to the C: drive in the Program Files folder, leave off this argument.
  6. n.chains=3  This argument lists the number of Markov chains to create. Three is generally enough to check proper mixing.
  7. n.iter=100  This argument specifies the number of iterations of Gibbs sampling to carry out. For a first run this should be a small number since more than likely the code will fail to run anyway.
  8. debug=T  The last argument turns the debugger on. With debug=T R will freeze and the WinBUGS window will remain open so that error messages can be examined and interpreted. Closing the WinBUGS window will then unfreeze R and return control to the console.

JAGS provides minimal output. WinBUGS provides a lot more information. Fig. 4 shows the partial contents of the log window in WinBUGS from this run. The important thing to observe is that there are no error messages here.

The rest of the log (Fig. 5) displays summary statistics and graphs of individual Markov chains for different parameters and statistics. With only 100 iterations contributing to these reports, there is little to be learned from them. Observe though that the last 50 iterations of the chains already appear to be mixing.


Fig. 5  Additional information in the WinBUGS log window

Since there were no errors we can now run the model for real with a reasonable number of iterations. In the run below I specify 10,000 iterations. By default the bugs function is set up to treat the first half of the iterates as part of the burn-in period. So the first 5,000 iterations of this run are discarded and only the values from the second half of the run are saved. (This can be changed by specifying a value for the n.burnin argument of bugs.)

Furthermore the bugs function uses a default thinning rate of max(1, floor(n.chains*(n.iter-n.burnin)/1000)). The n.thin argument can be used to override this default. With three chains, n.iter = 10000, and n.burnin = 5000, the thinning rate is 15. Thus only every 15th simulation value is saved. With 3 chains this means BUGS will return 1002 (3 times 334) samples, which, if convergence has been attained, will be samples from the posterior distributions of each parameter. It is not necessary to keep the debugger on at this point, but I do so in order to inspect some of the trace plots of the Markov chains that are produced. JAGS thins things so that 1000 observations remain in each chain. Thus JAGS returns three times as many observations as does WinBUGS.
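The default thinning rate for this run can be checked directly.

max(1, floor(3*(10000-5000)/1000))
[1] 15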

ipo.1 <- bugs(ipo.data, ipo.inits, ipo.parms, "ipomodel.txt", bugs.directory = "C:/WinBUGS14", n.chains=3, n.iter=10000, debug=T)

Fig. 6 shows plots of the three Markov chains for the standard deviation parameter, sigma.y. Observe that the trajectories are thoroughly interdigitated, indicating that the chains have mixed fairly well. The fact that all three chains are traversing the same range of y-values provides evidence that the chains are now sampling from the posterior distribution of sigma.y. The fact that the individual chains are jumping around a lot indicates there is very little autocorrelation in the chains. This is good because it implies that the returned observations are nearly independent, meaning that our effective sample size is close to the actual sample size.

Fig. 6  Trace plots of the individual Markov chains for the standard deviation of the response

This last model would be fit using JAGS as follows.

ipo.1j <- jags(ipo.data, ipo.inits, ipo.parms, "ipomodel.txt", n.chains=3, n.iter=10000)

Information returned by the bugs (jags) function

To unfreeze the R console, close the WinBUGS window. We can view summary statistics for the posterior distributions of each parameter by typing the name of the object to which the bugs output was assigned.

ipo.1
Inference for Bugs model at "ipomodel.txt", fit using WinBUGS,
 3 chains, each with 10000 iterations (first 5000 discarded), n.thin = 15
 n.sims = 1002 iterations saved
           mean   sd   2.5%    25%    50%    75%  97.5% Rhat n.eff
b0       -128.0 10.4 -148.8 -135.0 -127.6 -121.3 -107.7    1   500
b1         23.6  1.2   21.2   22.8   23.5   24.4   26.0    1   600
b2         36.1  3.6   29.2   33.8   36.0   38.5   43.5    1   700
sigma.y     7.0  0.9    5.5    6.4    6.9    7.6    8.7    1  1000
deviance  267.6  3.2  263.6  265.2  266.9  269.2  276.0    1   830

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = Dbar-Dhat)
pD = 4.0 and DIC = 271.6
DIC is an estimate of expected predictive error (lower deviance is better).

The output shows the mean, standard deviation, and various quantiles of the posterior distributions of the four parameters b0, b1, b2, and sigma.y. To be sure that we are actually sampling from the posterior distribution we should examine the column labeled Rhat. Rhat is a mixing index. Numerically it is the square root of the ratio of the variance of the mixture of all the chains to the average within-chain variance. If the chains are well-mixed these two variances should be roughly the same, yielding a ratio of approximately 1. Since Rhat is reported to be 1 for all the parameters this is evidence that the chains are well-mixed and that the Gibbs sampler is sampling from the posterior distribution.
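Using the sims.array component of the bugs object (described below), a crude version of this ratio can be computed by hand. The bugs function reports a slightly more refined version of Rhat, so the two numbers need not match exactly.

# crude Rhat for b0: sqrt(pooled-chain variance / mean within-chain variance)
b0.chains <- ipo.1$sims.array[,,'b0']
sqrt(var(as.vector(b0.chains)) / mean(apply(b0.chains, 2, var)))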

A lot of other useful information is contained in different components of the object created by the bugs function.

names(ipo.1)
 [1] "n.chains"      "n.iter"      "n.burnin"   "n.thin"
 [5] "n.keep"        "n.sims"      "sims.array" "sims.list"
 [9] "sims.matrix"   "summary"     "mean"       "sd"
[13] "median"        "root.short"  "long.short" "dimension.short"
[17] "indexes.short" "last.values" "is.DIC"     "DICbyR"
[21] "pD"            "DIC"         "model.file" "program"

  1. The component called summary contains the information shown in the above summary table but organized in matrix form.
  2. The three components mean, sd, and median repeat some of the information contained in summary.
  3. The three components sims.array, sims.list, sims.matrix each contain the 1002 samples from the posterior distribution for each parameter but organized in different ways.

With JAGS this same information is contained in the $BUGSoutput component of the jags object.

names(ipo.1j)
[1] "model"              "BUGSoutput"         "parameters.to.save"
[4] "model.file"         "n.iter"             "DIC"
names(ipo.1j$BUGSoutput)
 [1] "n.chains"        "n.iter"          "n.burnin"        "n.thin"        
 [5] "n.keep"          "n.sims"          "sims.array"      "sims.list"     
 [9] "sims.matrix"     "summary"         "mean"            "sd"            
[13] "median"          "root.short"      "long.short"      "dimension.short"
[17] "indexes.short"   "last.values"     "program"         "model.file"     
[21] "isDIC"           "DICbyR"          "pD"              "DIC"      

So with the jags object we access these components as ipo.1j$BUGSoutput$sims.matrix, ipo.1j$BUGSoutput$summary, etc.

The sims.array component arranges the information in a three-dimensional array: a set of matrices, one for each parameter. Each matrix contains the samples obtained from each Markov chain as separate columns, listed in the order in which the sampled observations were obtained. Thus each column records the sampling history of one Markov chain. For the current example each parameter is represented by a 334 × 3 matrix. Below I display the first two sampled observations in each chain for all of the parameters.

ipo.1$sims.array[1:2,,]
, , b0

       [,1]   [,2]   [,3]
[1,] -135.1 -136.3 -132.1
[2,] -112.3 -138.2 -133.8

, , b1

      [,1]  [,2]  [,3]
[1,] 24.24 24.77 23.84
[2,] 21.73 24.49 24.16

, , b2

      [,1]  [,2]  [,3]
[1,] 38.22 35.02 37.19
[2,] 30.64 42.11 37.07

, , sigma.y

      [,1]  [,2]  [,3]
[1,] 5.911 7.994 7.598
[2,] 7.157 6.489 6.518

, , deviance

      [,1]  [,2]  [,3]
[1,] 266.4 269.3 267.1
[2,] 266.5 267.9 264.7

dim(ipo.1$sims.array[,,'b0'])
[1] 334 3

We can use the information contained in the sims.array component to make our own trace plot of the Markov chains in order to evaluate mixing.
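For instance, the following minimal sketch overlays the three chains for sigma.y in a single plot and then examines the autocorrelation within one chain; matplot and acf are base R functions, and the dimension name 'sigma.y' matches the parameter names shown above.

# trace plot of the three chains for sigma.y, one line per chain
matplot(ipo.1$sims.array[, , 'sigma.y'], type='l', lty=1, col=1:3, xlab='Saved iteration', ylab='sigma.y')
# autocorrelation within one chain; rapidly decaying lags indicate nearly independent draws
acf(ipo.1$sims.array[, 1, 'sigma.y'])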

In the sims.list component, the chains are combined into a single long vector and their order is randomly permuted (so that the actual sample history of the observations is lost). The individual parameters can be accessed as list objects, as in ipo.1$sims.list$b0, ipo.1$sims.list$b1, etc. Below I select the first five elements of each parameter and in the second call I request the first five samples from the posterior distribution of the parameter b0.

sapply(ipo.1$sims.list, function(x) x[1:5])
         b0    b1    b2 sigma.y deviance
[1,] -112.3 21.73 30.64   7.157    266.5
[2,] -125.0 23.27 34.16   6.311    263.7
[3,] -133.6 24.02 37.37   5.950    267.8
[4,] -111.0 21.84 29.60   6.957    268.1
[5,] -126.2 23.46 31.99   7.176    267.8
ipo.1$sims.list$b0[1:5]
[1] -112.3 -125.0 -133.6 -111.0 -126.2

In the sims.matrix component the three chains are combined again but this time the samples from the posterior distributions occupy the columns of a matrix. As was the case with the sims.list component, the row order of the observations in the sims.matrix component has been randomly permuted (so that the actual sample history is lost).

ipo.1$sims.matrix[1:10,]
          b0    b1    b2 sigma.y deviance
 [1,] -112.3 21.73 30.64   7.157    266.5
 [2,] -125.0 23.27 34.16   6.311    263.7
 [3,] -133.6 24.02 37.37   5.950    267.8
 [4,] -111.0 21.84 29.60   6.957    268.1
 [5,] -126.2 23.46 31.99   7.176    267.8
 [6,] -122.3 22.55 37.08   7.475    268.2
 [7,] -131.6 23.94 36.66   7.791    266.0
 [8,] -135.5 24.69 37.03   6.838    265.5
 [9,] -126.2 23.38 36.71   6.076    264.3
[10,] -113.8 21.83 34.88   9.046    272.9

ipo.1$sims.matrix[1:5,'b0']
[1] -112.3 -125.0 -133.6 -111.0 -126.2

We can use this information to calculate our own summary statistics. Here are the means of the sampled posterior distributions. The values match what's reported in the summary table above.

apply(ipo.1$sims.matrix,2,mean)
         b0          b1          b2     sigma.y    deviance
-127.950409   23.579501   36.143623    7.019558  267.581836
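In the same way we can compute any other summary we like, for instance 95% credible intervals, by applying quantile to the columns of sims.matrix.

# 2.5%, 50%, and 97.5% quantiles of each posterior distribution
apply(ipo.1$sims.matrix, 2, quantile, probs=c(.025, .5, .975))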

Priors that are too informative: comparing Bayesian and frequentist results

It's useful at this point to contrast the estimates obtained from the Bayesian posterior distributions with what we obtained previously from fitting the model using the lm function and ordinary least squares.

round(ipo.1$summary,3)
             mean     sd     2.5%      25%      50%      75%    97.5%  Rhat n.eff
b0       -127.950 10.352 -148.798 -135.000 -127.600 -121.325 -107.710 1.003   500
b1         23.580  1.225   21.171   22.820   23.540   24.377   26.019 1.003   600
b2         36.144  3.599   29.210   33.782   36.050   38.490   43.509 1.002   700
sigma.y     7.020  0.851    5.499    6.421    6.944    7.599    8.740 1.002  1000
deviance  267.582  3.204  263.600  265.200  266.900  269.200  275.992 1.004   830

The ordinary least squares results for this model are the following.

coef(out2)
    (Intercept)            Root GrazingUngrazed
     -127.82936        23.56005        36.10325

In general we don't expect the frequentist and Bayesian estimates to be the same, but for a simple, well-defined problem such as this one with truly uninformative priors, we expect them to be close. If the differences in the estimates are large there are three possible explanations.

  1. A longer burn-in period is needed. We're not yet sampling from the posterior distribution.
  2. There's a mistake in how the model was defined.
  3. Some of the priors are weakly informative and they're unduly influencing the results.

Explanation (1) can be ruled out by looking at chain diagnostics, as we've done. Explanation (3) is always a possibility: what's uninformative for one problem may be informative for another. The intercept in this model is an order of magnitude larger than either of the effect estimates. Suppose I increase the precision of the normal priors on the regression parameters by a factor of 100 in the model definition file. I save the new file as badipomodel.txt. The changed prior specifications are the three dnorm statements flagged by the comment in the model definition below.

model{
for(i in 1:n) {
y[i]~dnorm(y.hat[i],tau.y)
y.hat[i] <- b0+b1*x[i]+b2*z[i]
}
# priors are too informative
b0~dnorm(0,.0001)
b1~dnorm(0,.0001)
b2~dnorm(0,.0001)
tau.y <- pow(sigma.y,-2)
sigma.y~dunif(0,100)
}

I rerun the model with the new model definition file and then display the summary statistics of the posterior distributions again.

#refit model
ipo.3 <- bugs(ipo.data, ipo.inits, ipo.parms, "badipomodel.txt", bugs.directory = "C:/WinBUGS14", n.chains=3, n.iter=50000, debug=T)
ipo.3
Inference for Bugs model at "ipomodel2.txt", fit using WinBUGS,
 3 chains, each with 50000 iterations (first 25000 discarded), n.thin = 75
 n.sims = 1002 iterations saved
           mean   sd   2.5%    25%    50%    75%  97.5% Rhat n.eff
b0       -126.5 10.1 -146.5 -133.5 -126.3 -119.8 -107.0    1  1000
b1         23.4  1.2   21.2   22.6   23.4   24.2   25.8    1  1000
b2         35.8  3.6   28.8   33.6   35.8   38.2   42.4    1   760
sigma.y     7.0  0.8    5.5    6.4    6.9    7.5    8.8    1  1000
deviance  267.3  3.0  263.6  265.2  266.7  268.7  275.4    1  1000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = Dbar-Dhat)
pD = 3.8 and DIC = 271.2
DIC is an estimate of expected predictive error (lower deviance is better).

coef(out2)
    (Intercept)            Root GrazingUngrazed
     -127.82936        23.56005        36.10325

Now the Bayesian and frequentist estimates of the intercept are noticeably further apart than before. So what went wrong? Choosing a precision of .0001 is equivalent to a normal variance of 10000, or a normal standard deviation of 100. This prior says, for instance, that a priori we are 95% certain that the intercept lies roughly in the interval (–200, 200). The estimate of the intercept turns out to be about –126, so clearly the prior we're using is quite informative: its standard deviation is the same order of magnitude as the parameter we're trying to estimate. For a prior to be uninformative its standard deviation needs to be at least a couple of orders of magnitude greater than the estimate, yielding a range of uncertainty that is wider than the range of reasonable values of the parameter.
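A back-of-the-envelope normal-normal calculation shows why the intercept moved. Treating the diffuse-prior result for b0 from the earlier summary table (mean –127.95, sd 10.35) as a stand-in for the likelihood, the approximate posterior mean under a N(0, sd = 100) prior is the precision-weighted average of the prior mean 0 and –127.95.

# approximate posterior mean of b0 as a precision-weighted average (normal-normal sketch)
prior.prec <- 1/100^2       # prior sd = 100, i.e., the dnorm(0,.0001) prior
lik.prec <- 1/10.35^2       # sd of b0 under the diffuse prior, used as a rough likelihood sd
(0*prior.prec + (-127.95)*lik.prec)/(prior.prec + lik.prec)
# roughly -126.6, close to the -126.5 reported by WinBUGS above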

The obvious solution would be to always choose the precision to be very small. The problem is that in complicated models this can lead to numerical instability. BUGS can become computationally unstable when the priors on parameters are given too wide a range. For instance, if I change the prior for sigma.y to sigma.y~dunif(0,1000000), WinBUGS runs the model but I get no output. Instead the following lines appear amidst a lot of other text in the WinBUGS log window.

thin.updater(15)
update(334)
cannot bracket slice for node sigma.y

This last line is an error message that indicates the chosen prior for sigma.y is too diffuse.

Golden rule for fitting models with MCMC

The best strategy for fitting Bayesian models successfully with MCMC is to rescale the data so that all estimated parameters end up being smaller than 10 in absolute value and typically close to 1. Things can always be rescaled back later, either within the WinBUGS program or in R. If such rescaling is done for the current problem, even a prior precision of .0001 for the regression parameters turns out to be effectively uninformative. A minimal sketch of one way to do the rescaling is shown below.
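This sketch assumes ipo.data is a list whose elements y, x, z, and n match the names used in the model definition file; the specific divisors (100 for the response, 10 for root size) are just illustrative choices that bring the parameters close to 1.

# rescale the data so the parameters are O(1); the same model file can be reused
ipo.data.scaled <- ipo.data
ipo.data.scaled$y <- ipo.data$y/100       # response divided by 100
ipo.data.scaled$x <- ipo.data$x/10        # root size divided by 10
# after fitting, convert the posterior draws back to the original scale:
#   b0 = 100*b0.scaled,  b1 = 10*b1.scaled,  b2 = 100*b2.scaled,  sigma.y = 100*sigma.y.scaled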

Cited reference

Crawley, Michael J. 2002. Statistical Computing: An Introduction to Data Analysis Using S-Plus. Wiley, New York.

Introductory books and articles on Bayesian estimation

R Code

A compact collection of all the R code displayed in this document appears here.



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--November 20, 2012
URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture23.htm