Lecture 25—Monday, November 19, 2012

Topics

- Bayesian estimation of a random intercepts model
- A random intercepts model with patch-level predictors
- Obtaining Bayesian estimates of other parameters
- Bayesian model selection: deviance, pD, and DIC

R packages used

- arm
- coda
- lattice
- lme4
- R2jags

Recreating the variables for a Bayesian analysis

I reload the birds data set and create the variables needed to fit the Poisson regression models in WinBUGS and JAGS.

birds <- read.csv('ecol 563/birds.csv')
birds.short <- birds[!is.na(birds$S),]
# create variables for WinBUGS
y <- birds.short$S
year2 <- as.numeric(birds.short$year==2006)
year3 <- as.numeric(birds.short$year==2007)
n <- length(y)
# number of patches
J <- length(unique(birds.short$patch))
# patch identifier renumbered
patch <- as.numeric(factor(birds.short$patch))
setwd("C:/Users/jmweiss/Documents/ecol 563/models")

Bayesian estimation of a random intercepts model

In the frequentist approach the transition from a Poisson separate intercepts model to a Poisson random intercepts model requires switching from glm to lmer.

# separate intercepts model
model2.glm <- glm(S~factor(patch)+factor(year), data=birds.short, family=poisson)
# random intercepts model
library(lme4)
model3.lmer <- lmer(S~factor(year) + (1|patch), data=birds.short, family=poisson)

Let i denote the patch and let j denote the observation from that patch (a measurement made in one of the three years). The random intercepts model can be written in two equivalent ways.

Random effects formulation

yij ~ Poisson(μij)
log(μij) = β0 + β1·year2ij + β2·year3ij + u0i,  u0i ~ N(0, τ²)

Random intercepts formulation

yij ~ Poisson(μij)
log(μij) = β0i + β1·year2ij + β2·year3ij,  β0i ~ N(β0, τ²)

In the random effects formulation the focus is on the patch random effects, the u0i, which are assumed to have a normal distribution with mean zero. In the random intercepts formulation the focus is on the individual patch intercepts, the β0i, which are assumed to have a normal distribution with mean β0. The connection between these two formulations is that β0i = β0 + u0i. When we fit this model using lmer we automatically get both formulations. We can extract the random effects, the u0i, with the ranef function.

ranef(model3.lmer)[[1]][1:10,1]
 [1] -0.008987382 -0.280241025  0.018405654  0.267184478  0.157106981  0.468749019  0.028217857
 [8]  0.331646243  0.423017080  0.359845270

We obtain the random intercepts, the β0i, with the coef function.

coef(model3.lmer)[[1]][1:10,1]
 [1] 3.182561 2.911308 3.209954 3.458733 3.348656 3.660298 3.219767 3.523195 3.614566 3.551394

We can also obtain the random intercepts from the random effects by adding the fixed effect estimate of the intercept, β0, to the random effects.

fixef(model3.lmer)[1] + ranef(model3.lmer)[[1]][1:10,1]
 [1] 3.182561 2.911308 3.209954 3.458733 3.348656 3.660298 3.219767 3.523195 3.614566 3.551394

There is yet another way to think about this model that focuses on the structural characteristics of the data. It's called the multilevel formulation and is written as follows.

Level 1 (observations within patches):  yij ~ Poisson(μij),  log(μij) = β0i + β1·year2ij + β2·year3ij
Level 2 (patches):  β0i ~ N(β0, τ²)

This formulation recognizes that there are quantities that vary among the observations within a patch, such as the variable year, and there are quantities that are constant within a patch but vary between patches, such as the random effects. The multilevel formulation will be especially useful when we add predictors to the model that characterize the patches.

The Bayesian version of the random intercepts model makes explicit use of the multilevel formulation. We've already seen this when we specified the separate intercepts model. As a result moving from the separate intercepts model to the random intercepts model requires only minor changes in the BUGS code. We need to replace the uninformative prior for the individual intercepts with an informative one. This change in the BUGS code is indicated below.

Separate intercepts model (model2.txt)
model{
for(i in 1:n) {
y[i]~dpois(mu.hat[i])
log.mu[i] <- a[patch[i]] + b1*year2[i] + b2*year3[i]
mu.hat[i] <- exp(log.mu[i])
}
#priors
for(j in 1:J){
a[j]~dnorm(0,.000001)
}
b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
}

Random intercepts model (model3.txt)

model{
for(i in 1:n) {
y[i]~dpois(mu.hat[i])
log.mu[i] <- a[patch[i]] + b1*year2[i] + b2*year3[i]
mu.hat[i] <- exp(log.mu[i])
}
#level-2 model
for(j in 1:J){
   a[j]~dnorm(a.hat[j],tau.a)
   a.hat[j] <- mu.a
}
#priors
mu.a~dnorm(0,.000001)
tau.a <- pow(sigma.a,-2)
sigma.a~dunif(0,10000)

b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
}

In the separate intercepts model we assigned separate uninformative priors to each intercept. To a Bayesian the random intercepts model differs in that each intercept is given an informative prior, so the intercepts become modeled parameters. The notation I used for the intercepts in the BUGS code differs slightly from what I used above for the frequentist model.

aj ~ N(mu.a, sigma.a²)

Here aj plays the role of β0i, mu.a corresponds to β0, and sigma.a corresponds to τ.

The parameters that appear in the priors for the intercepts, the mean mu.a and the standard deviation sigma.a, are in turn given their own priors, called hyperpriors. In a similar vein the parameters mu.a and sigma.a are sometimes called hyperparameters. To fit this model in WinBUGS we just add elements to the initial values list (bird.inits) and the parameter vector (bird.parms) that correspond to these new parameters.

bird.data <- list("y", "year2", "year3", "n", "patch", "J")
bird.inits <- function() {list(a=rnorm(J), b1=rnorm(1), b2=rnorm(1), mu.a=rnorm(1), sigma.a=runif(1))}
bird.parms <- c("a", "b1", "b2", "mu.a", "sigma.a")

The calls to WinBUGS and JAGS are shown below.

# WinBUGS
library(arm)
model3.bugs <- bugs(bird.data, bird.inits, bird.parms, "model3.txt", bugs.directory = "C:/WinBUGS14", n.chains=3, n.iter=10000, debug=T)
# JAGS
library(R2jags)
model3.jags <- jags(bird.data, bird.inits, bird.parms, "model3.txt", n.chains=3, n.iter=10000)
dim(model3.bugs$summary)
[1] 106   9

There are 106 estimated parameters, far too many to squeeze the trace plots of the Markov chains into a single graphics window. Instead I plot them 16 panels per page. The last set of trajectories is shown in Fig. 1. All of the chains appear to be mixing well.

model3.mcmc <- as.mcmc.list(model3.bugs)
xyplot(model3.mcmc, layout=c(4, 4))

Fig. 1   Trace plots of individual Markov chains from WinBUGS.

Checking the summary object we find that all of the Rhats are less than 1.1 and the effective sample sizes are large.

max(model3.bugs$summary[,"Rhat"])
[1] 1.007241
min(model3.bugs$summary[,"n.eff"])
[1] 290

The means of the Bayesian posterior distributions for the population intercept, the regression coefficients, and the standard deviation of the random effects are all very close to the frequentist MLEs. (In the notation I've used mu.a corresponds to the frequentist intercept β0.)

# Bayesian estimates
model3.bugs$summary[102:105, "mean"]
          b1           b2         mu.a      sigma.a
  -0.1508430   -0.1177506    3.1878453    0.2951361
# frequentist estimates
fixef(model3.lmer)
     (Intercept) factor(year)2006 factor(year)2007
       3.1915487       -0.1541447       -0.1201470
attr(VarCorr(model3.lmer)$patch, "stddev")
(Intercept)
  0.2879088

The frequentist predictions of the random intercepts are also close to the Bayesian posterior means. (Only the first 12 of these are shown below.)

model3.bugs$summary[1:12,"mean"]
    a[1]     a[2]     a[3]     a[4]     a[5]     a[6]     a[7]     a[8]
3.172729 2.892569 3.201317 3.452633 3.340987 3.652841 3.208471 3.513790
    a[9]    a[10]    a[11]    a[12]
3.611343 3.543788 3.196539 3.305855
coef(model3.lmer)[[1]][1:12,1]
 [1] 3.182561 2.911308 3.209954 3.458733 3.348656 3.660298 3.219767 3.523195
 [9] 3.614566 3.551394 3.194726 3.314897

We can also fit the random effects formulation of the random intercepts model as a Bayesian model. The BUGS code for the random intercepts version and the random effects version of the model are shown below with the differences highlighted.

Random intercepts version (model3.txt)
model{
for(i in 1:n) {
y[i]~dpois(mu.hat[i])
log.mu[i] <- a[patch[i]] + b1*year2[i] + b2*year3[i]
mu.hat[i] <- exp(log.mu[i])
}
#level-2 model
for(j in 1:J){
   a[j]~dnorm(mu.a,tau.a)
}
#priors
mu.a~dnorm(0,.000001)
tau.a <- pow(sigma.a,-2)
sigma.a~dunif(0,10000)

b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
}
Random effects version (model3a.txt)

model{
for(i in 1:n) {
y[i]~dpois(mu.hat[i])
log.mu[i] <- a[patch[i]] + b1*year2[i] + b2*year3[i]
mu.hat[i] <- exp(log.mu[i])
}
#level-2 model
for(j in 1:J){
   a[j] <- mu.a + u0[j]
   u0[j]~dnorm(0,tau.a)
}
#priors
mu.a~dnorm(0,.000001)
tau.a <- pow(sigma.a,-2)
sigma.a~dunif(0,10000)

b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
}

Running the random effects version of the model requires changing the initial values and the returned parameters.

bird.inits <- function() {list(u0=rnorm(J), b1=rnorm(1), b2=rnorm(1), mu.a=rnorm(1), sigma.a=runif(1))}
bird.parms <- c("u0", "b1", "b2", "mu.a", "sigma.a")
model3a.bugs <- bugs(bird.data, bird.inits, bird.parms, "model3a.txt", bugs.directory="C:/WinBUGS14", n.chains=3, n.iter=10000, debug=T)

I compare the frequentist predictions of the random effects with the Bayesian means. Although there are some differences, the results are similar.

model3a.bugs$summary[1:12,"mean"]
       u0[1]        u0[2]        u0[3]        u0[4]        u0[5]
-0.029531178 -0.295107819  0.003504299  0.254502339  0.150943809
       u0[6]        u0[7]        u0[8]        u0[9]       u0[10]
 0.464154867  0.025682889  0.327686000  0.419483867  0.353741133
      u0[11]       u0[12]
-0.004123341  0.121433980
ranef(model3.lmer)[[1]][1:12,1]
 [1] -0.008987382 -0.280241025  0.018405654  0.267184478  0.157106981
 [6]  0.468749019  0.028217857  0.331646243  0.423017080  0.359845270
[11]  0.003177495  0.123347898

We've already seen that the lmer function does not report a log-likelihood for Poisson models that is comparable to what is reported for other Poisson models (lecture 22). Even if we adjust for this and also take into account the difference between Dhat and Dbar in the Bayesian formulation, the Bayesian and frequentist deviances with mixed effects models are not directly comparable. In the Bayesian calculations the reported deviance is based on the joint likelihood of the data and the random effects, an expression we would write as follows.

p(y, u0 | β, τ) = p(y | β, u0) × p(u0 | τ) = ∏ij Poisson(yij; μij) × ∏i N(u0i; 0, τ²)

In the frequentist approach the reported log-likelihood is based on the marginal likelihood of y in which the random effects have been integrated out. These two likelihoods are not the same nor do they contain the same information about model fit.
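In symbols the marginal likelihood is

p(y | β, τ) = ∫ p(y | β, u0) p(u0 | τ) du0

an integral that has no closed form for Poisson models and so must be approximated numerically (lmer uses a Laplace approximation by default).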

A random intercepts model with patch-level predictors

So far all the random intercepts models for richness that we've considered have included a single predictor, year, whose value varied across observations. The other variable, patch, played a role not as a predictor but as a structural variable. Patch describes how the data are organized: separate annual observations were made on the same patch, yielding a hierarchical data set. The layers of the hierarchy are referred to as levels, and so the birds data set is also a 2-level data set. Level 1 corresponds to the individual annual observations; level 2 corresponds to the patches, and the patch itself is referred to as a level-2 unit. The variable year is a level-1 predictor. This breakdown resembles the split plot design discussed in lecture 10 where we had split plot (level 1) units as well as whole plot (level 2) units.

The next complication is to include level-2 predictors in the model. For the birds data set a level-2 predictor is a patch characteristic. All observations made on the same patch will necessarily share the same value of a level-2 predictor. The distinction between level-1 and level-2 predictors is important because the level of a predictor determines how one should measure background variability for use in a statistical test of the predictor's effect on the response. With level-1 predictors measurements from the same patch or from different patches all count as suitable replicates (although there are issues of correlation that need to be addressed), but with a level-2 predictor only measurements made on different patches should contribute to assessing the effect of that predictor. Taking repeated measurements of the same level-2 unit and treating them as if they were true replicates of the effect of the level-2 predictor is called pseudo-replication.

In the birds data set the variables landscape and area can serve as level-2 predictors. Landscape classifies the land use type for each patch and area records the size of each patch. Because they are level-2 predictors they have the same value for all observations made on the same patch. When we fit the model using the lmer function we just include level-2 predictors as ordinary predictors; lmer figures out the appropriate way to deal with them based on their variability in the data set. Previously we used log(area) as the regressor rather than area.

library(lme4)
model4.lmer <- lmer(S~factor(year) + log(area) + landscape + (1|patch), data=birds.short, family=poisson)

Creating the level-2 predictors

Contrary to our experience with lmer, a formal distinction between level-1 and level-2 predictors must be made when fitting a Bayesian model. As we've seen, a level-1 predictor has a different value for each observation while a level-2 predictor has a different value for each level-2 unit, here the patch. In the Bayesian model the data for the level-2 regressors must consist of only one observation per patch. To accomplish this we use the tapply function with the anonymous function function(x) x[1] to select one value of log(area) and landscape from each patch. The following line of code extracts one value of log(area) from each patch.

L.area <- tapply(log(birds.short$area), birds.short$patch, function(x) x[1])

If we examine the class of the variable L.area that tapply has created we discover that it is of type "array".

class(L.area)
[1] "array"
L.area
       ag1a        ag1b        ag1c        ag1d        ag2a        ag2b        ag2c
 1.25566803 -0.48529516  0.80426229  2.11382872  0.31747398  4.08226850 -0.58325863
       ag2d        ag3a        ag3b        ag3c        ag3d        ag3e        ag3f
 1.45861502  2.97737690  1.70314090  0.17591217  1.12879134  0.65645424 -0.01196129
       ag4a        ag4b        ag4c        ag4d        ag4e        ag4f        ag5a
-0.02894873  1.70198933  1.49187731  3.99180410  2.93134152  0.99117214  2.21296689
       ag5b        ag5c        ag5d        ag5e         b1a         b1b         b1c
 0.18857303  3.48248324  1.35823901  1.97287691  2.58970230  0.54765232 -0.54266836
        b1d         b1e         b1f         b1g         b2a         b2b         b2c
 0.80358443  1.24468274 -0.22189433  0.14928170  2.16974141  0.61853203  1.63539321
        b2d         b2e         b3a         b3b         b3c         b3d         b3e
 0.75783281  0.31648516  1.47343760  1.37052419  0.09454227  0.74487167  1.72847540
        b3f         b4a         b4b         b4c         b4d         b4e         b5a
 0.26138199  1.45703186  2.30915873  0.46469650  0.38346511 -0.21811995  1.05605267
        b5b         b5c         b5d       Ref1a       Ref1b       Ref1c       Ref1d
 1.41900354  1.10624972  0.84156719  1.70474809  0.18232156  1.25276297  0.18232156
      Ref1e       Ref1g       Ref1h       Ref2a       Ref2b       Ref2c       Ref2d
 2.01490302  2.70805020          NA  1.25276297  0.78845736  1.70474809  0.78845736
      Ref2e       Ref3a       Ref3b       Ref3c       Ref3d       Ref4a       Ref4b
 0.78845736  0.18232156  1.70474809  1.25276297  0.78845736  0.18232156  1.25276297
      Ref4c       Ref4d       Ref5a       Ref5b       Ref5c       Ref5d       Ref5e
 0.78845736  0.78845736  0.18232156  1.70474809  1.70474809  1.25276297  0.18232156
      Ref6a       Ref6b       Ref6c       Ref6d       Ref6e       Ref6f         u1a
 0.18232156  0.78845736  1.25276297  1.25276297  2.01490302  3.21887582  0.99864818
        u1b         u1c         u1d         u1e         u2a         u2b         u2c
 1.54320497  0.52933787  0.34623913  1.92809282  1.53700588  0.35959235  1.19088049
        u2d         u2e         u3a         u3b         u3c         u3d         u4a
 0.23008174  1.50218226  1.91458168  1.64380994  1.14376245  0.32579492  0.59559073
        u4b         u4c         u4d         u4e
 0.01921263  1.06914298  0.98235007  2.12134565

While JAGS has no problem with data in this format, WinBUGS does. WinBUGS accepts only numeric data. We can change the class of L.area to type "numeric" by using either the as.vector function or the as.numeric function on the output of tapply.

L.area <- as.vector(tapply(log(birds.short$area), birds.short$patch, function(x) x[1]))
class(L.area)
[1] "numeric"
L.area
  [1]  1.25566803 -0.48529516  0.80426229  2.11382872  0.31747398  4.08226850
  [7] -0.58325863  1.45861502  2.97737690  1.70314090  0.17591217  1.12879134
 [13]  0.65645424 -0.01196129 -0.02894873  1.70198933  1.49187731  3.99180410
 [19]  2.93134152  0.99117214  2.21296689  0.18857303  3.48248324  1.35823901
 [25]  1.97287691  2.58970230  0.54765232 -0.54266836  0.80358443  1.24468274
 [31] -0.22189433  0.14928170  2.16974141  0.61853203  1.63539321  0.75783281
 [37]  0.31648516  1.47343760  1.37052419  0.09454227  0.74487167  1.72847540
 [43]  0.26138199  1.45703186  2.30915873  0.46469650  0.38346511 -0.21811995
 [49]  1.05605267  1.41900354  1.10624972  0.84156719  1.70474809  0.18232156
 [55]  1.25276297  0.18232156  2.01490302  2.70805020          NA  1.25276297
 [61]  0.78845736  1.70474809  0.78845736  0.78845736  0.18232156  1.70474809
 [67]  1.25276297  0.78845736  0.18232156  1.25276297  0.78845736  0.78845736
 [73]  0.18232156  1.70474809  1.70474809  1.25276297  0.18232156  0.18232156
 [79]  0.78845736  1.25276297  1.25276297  2.01490302  3.21887582  0.99864818
 [85]  1.54320497  0.52933787  0.34623913  1.92809282  1.53700588  0.35959235
 [91]  1.19088049  0.23008174  1.50218226  1.91458168  1.64380994  1.14376245
 [97]  0.32579492  0.59559073  0.01921263  1.06914298  0.98235007  2.12134565

The data are now in the format that WinBUGS requires, but the display of values reveals another problem. There is a missing value for L.area. Using the output above from when L.area was of class "array" we see that this missing value corresponds to a patch labeled "Ref1h". When we try to list its data in the current data frame we find that this patch doesn't exist.

birds.short[birds.short$patch=='Ref1h',]
[1] patch     S         year      landscape area      log.area. ENN       log.ENN.
<0 rows> (or 0-length row.names)

On the other hand patch 'Ref1h' is present in the original data frame where it has a non-missing value for area.

birds[birds$patch=='Ref1h',]
    patch  S year landscape area log.area. ENN log.ENN.
59  Ref1h NA 2005    Forest  2.2 0.3424227   0        0
161 Ref1h NA 2006    Forest  2.2 0.3424227   0        0
263 Ref1h NA 2007    Forest  2.2 0.3424227   0        0

Notice that this patch has missing values for richness S in all three years and as a result all of its observations were deleted when we created the birds.short data frame. The levels of a factor are defined when the factor is first created and these levels are then inherited by all subsets of the original data frame, regardless of how many levels of the factor are actually represented in the subset. The way around this is to use the factor function again to redefine the factor variable so that it will have only the unique levels that are present in the current data frame. It's worth noting that the as.factor function, which a lot of people in this class like to use, does not work here. The as.factor function will only create a factor from a variable that is not a factor. If it is already a factor it just leaves it alone.
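A toy example (a hypothetical three-level factor, not part of the birds data) makes this behavior concrete.

f <- factor(c('a', 'b', 'c'))
levels(f[1:2])             # "a" "b" "c" -- subsetting keeps the ghost level
levels(factor(f[1:2]))     # "a" "b"     -- factor() rebuilds the levels
levels(as.factor(f[1:2]))  # "a" "b" "c" -- as.factor() returns the factor unchanged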

length(L.area)
[1] 102
# as.factor does not remove the ghost level
L.area <- as.vector(tapply(log(birds.short$area), as.factor(birds.short$patch), function(x) x[1]))
length(L.area)
[1] 102
# the factor function does remove the ghost level
L.area <- as.vector(tapply(log(birds.short$area), factor(birds.short$patch), function(x) x[1]))
length(L.area)
[1] 101
L.area
  [1]  1.25566803 -0.48529516  0.80426229  2.11382872  0.31747398  4.08226850
  [7] -0.58325863  1.45861502  2.97737690  1.70314090  0.17591217  1.12879134
 [13]  0.65645424 -0.01196129 -0.02894873  1.70198933  1.49187731  3.99180410
 [19]  2.93134152  0.99117214  2.21296689  0.18857303  3.48248324  1.35823901
 [25]  1.97287691  2.58970230  0.54765232 -0.54266836  0.80358443  1.24468274
 [31] -0.22189433  0.14928170  2.16974141  0.61853203  1.63539321  0.75783281
 [37]  0.31648516  1.47343760  1.37052419  0.09454227  0.74487167  1.72847540
 [43]  0.26138199  1.45703186  2.30915873  0.46469650  0.38346511 -0.21811995
 [49]  1.05605267  1.41900354  1.10624972  0.84156719  1.70474809  0.18232156
 [55]  1.25276297  0.18232156  2.01490302  2.70805020  1.25276297  0.78845736
 [61]  1.70474809  0.78845736  0.78845736  0.18232156  1.70474809  1.25276297
 [67]  0.78845736  0.18232156  1.25276297  0.78845736  0.78845736  0.18232156
 [73]  1.70474809  1.70474809  1.25276297  0.18232156  0.18232156  0.78845736
 [79]  1.25276297  1.25276297  2.01490302  3.21887582  0.99864818  1.54320497
 [85]  0.52933787  0.34623913  1.92809282  1.53700588  0.35959235  1.19088049
 [91]  0.23008174  1.50218226  1.91458168  1.64380994  1.14376245  0.32579492
 [97]  0.59559073  0.01921263  1.06914298  0.98235007  2.12134565

We need to create a level-2 version of the landscape variable too.

Lscape <- as.vector(tapply(birds.short$landscape, factor(birds.short$patch), function(x) x[1]))
Lscape
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[41] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[81] 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

The factor levels of the landscape variable have been automatically converted to numeric values by tapply. For the regression model we need to create three indicator variables that indicate the 2nd, 3rd, and 4th levels of landscape.

Lscape2 <- as.numeric(Lscape==2)
Lscape3 <- as.numeric(Lscape==3)
Lscape4 <- as.numeric(Lscape==4)

The BUGS model

Level-2 predictors should appear in the level-2 part of the BUGS model. So in the BUGS code for the random intercepts model the level-2 predictors should appear as predictors in the model for a.hat, the individual patch mean. The table below shows the basic random intercepts model in the left panel and indicates in the right panel where the level-2 predictors should be placed in the model. I save the model as model4.txt.

Random intercepts model
model{
for(i in 1:n) {
y[i]~dpois(mu.hat[i])
log.mu[i] <- a[patch[i]] + b1*year2[i] + b2*year3[i]
mu.hat[i] <- exp(log.mu[i])
}
#level-2 model
for(j in 1:J){
   a[j]~dnorm(a.hat[j],tau.a)
   a.hat[j]<-mu.a
}
b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
mu.a~dnorm(0,.000001)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,10000)
}

Random intercepts model with level-2 predictors (model4.txt)

model{
for(i in 1:n) {
y[i]~dpois(mu.hat[i])
log.mu[i] <- a[patch[i]] + b1*year2[i] + b2*year3[i]
mu.hat[i] <- exp(log.mu[i])
}
#level-2 model
for(j in 1:J){
a[j]~dnorm(a.hat[j],tau.a)
a.hat[j]<-mu.a + g1*L.area[j] + g2*Lscape2[j] + g3*Lscape3[j] + g4*Lscape4[j]
}
g1~dnorm(0,.000001)
g2~dnorm(0,.000001)
g3~dnorm(0,.000001)
g4~dnorm(0,.000001)

b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
mu.a~dnorm(0,.000001)
tau.a<-pow(sigma.a,-2)
sigma.a~dunif(0,10000)
}

The new variables are added to the data list (bird.data), the new parameters are added to the parameter vector (bird.parms), and initial values for the new parameters are included in the bird.inits function.

# objects for BUGS
bird.data <- list("y", "year2", "year3", "n", "patch", "J", "L.area", "Lscape2", "Lscape3", "Lscape4")
bird.inits <- function() {list(a=rnorm(J), b1=rnorm(1), b2=rnorm(1), g1=rnorm(1), g2=rnorm(1), g3=rnorm(1), g4=rnorm(1), sigma.a=runif(1), mu.a=rnorm(1))}
bird.parms <- c("a", "b1", "b2", "sigma.a", "mu.a", "g1", "g2", "g3", "g4")
# fit BUGS model
library(arm)
# short test run to verify the model compiles and runs
model4.bugs <- bugs(bird.data, bird.inits, bird.parms, "model4.txt", bugs.directory = "C:/WinBUGS14", n.chains=3, n.iter=100, debug=T)
# full run
model4.bugs <- bugs(bird.data, bird.inits, bird.parms, "model4.txt", bugs.directory = "C:/WinBUGS14", n.chains=3, n.iter=10000, debug=T)
max(model4.bugs$summary[,"Rhat"])
[1] 1.010421
min(model4.bugs$summary[,"n.eff"])
[1] 250
# using JAGS
library(R2jags)
model4.jags <- jags(bird.data, bird.inits, bird.parms, "model4.txt", n.chains=3, n.iter=10000)
round(model4.bugs$summary[102:110,],3)
             mean     sd     2.5%      25%      50%      75%    97.5%  Rhat n.eff
b1         -0.131  0.034   -0.197   -0.155   -0.132   -0.109   -0.067 1.000  1000
b2         -0.108  0.031   -0.166   -0.130   -0.109   -0.087   -0.045 1.000  1000
mu.a        3.062  0.054    2.956    3.027    3.063    3.099    3.168 1.000  1000
sigma.a     0.148  0.020    0.112    0.134    0.146    0.160    0.190 1.003  1000
g1          0.223  0.021    0.181    0.209    0.223    0.237    0.264 1.000  1000
g2         -0.228  0.057   -0.339   -0.263   -0.229   -0.192   -0.113 1.000  1000
g3         -0.065  0.053   -0.167   -0.102   -0.066   -0.030    0.039 1.001  1000
g4         -0.266  0.061   -0.388   -0.305   -0.267   -0.227   -0.151 1.001  1000
deviance 1452.312 13.603 1428.000 1442.000 1452.000 1461.000 1481.000 1.002  1000

These results compare favorably with the frequentist results obtained above. The largest difference occurs in the standard deviation of the level-2 random effects, sigma.a, labeled (Intercept) in the VarCorr output.

fixef(model4.lmer)
     (Intercept) factor(year)2006 factor(year)2007        log(area)
      3.06613422      -0.13039499      -0.10910950       0.22222232
landscapeBauxite  landscapeForest   landscapeUrban
     -0.22858415      -0.06744431      -0.26663006
attr(VarCorr(model4.lmer)[[1]],'stddev')
(Intercept)
  0.1383838
Predictor           Parameter   Bayesian estimate        Frequentist estimate
                                (posterior mean)         (MLE)
Intercept           β0           3.062                    3.066
Year 2              β1          –0.131                   –0.130
Year 3              β2          –0.108                   –0.109
log(Area)           γ1           0.223                    0.222
Bauxite             γ2          –0.228                   –0.229
Forest              γ3          –0.065                   –0.067
Urban               γ4          –0.266                   –0.267
Random effects SD   τ            0.148                    0.138
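The same comparison can be assembled directly in R. The following is a minimal sketch; it relies on the row names of model4.bugs$summary and the element names returned by the lmer extractor functions shown above.

# posterior means and MLEs side by side
bayes <- model4.bugs$summary[c("mu.a", "b1", "b2", "g1", "g2", "g3", "g4", "sigma.a"), "mean"]
freq <- c(fixef(model4.lmer), attr(VarCorr(model4.lmer)[[1]], "stddev"))
round(cbind(Bayesian = bayes, Frequentist = freq), 3)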

Obtaining Bayesian estimates of other parameters

One of the major attractions of Bayesian estimation is the ease with which it is possible to obtain interval estimates for auxiliary parameters that are functions of parameters in the model. To obtain the posterior distribution of an auxiliary parameter defined by a function we just apply that function to the posterior distributions of the parameters that appear in its formula. We can do this either by defining the auxiliary parameters explicitly in the model or by calculating them after the model is fit.

Method 1: Add the desired parameters to the BUGS model

Suppose we want to obtain interval estimates of the mean bird richness in the year 2005 for patches with area = 1 separately for each of the four landscape types. The fitted model for the log mean is the following.

log(μ) = mu.a + g1·L.area + g2·Lscape2 + g3·Lscape3 + g4·Lscape4 + b1·year2 + b2·year3

Because 2005 is the reference year (year2 = year3 = 0) and log(1) = 0 makes the L.area term vanish, the desired means are the following.

mean1 = exp(mu.a)         (agriculture, the reference landscape)
mean2 = exp(mu.a + g2)    (bauxite)
mean3 = exp(mu.a + g3)    (forest)
mean4 = exp(mu.a + g4)    (urban)

We can add these calculations to the BUGS program file.

Model with landscape means (model4.txt)
model{
for(i in 1:n) {
y[i]~dpois(mu.hat[i])
log.mu[i] <- a[patch[i]] + b1*year2[i] + b2*year3[i]
mu.hat[i] <- exp(log.mu[i])
}
#level-2 model
for(j in 1:J){
a[j]~dnorm(a.hat[j],tau.a)
a.hat[j] <- mu.a + g1*L.area[j] + g2*Lscape2[j] + g3*Lscape3[j] + g4*Lscape4[j]
}
g1~dnorm(0,.000001)
g2~dnorm(0,.000001)
g3~dnorm(0,.000001)
g4~dnorm(0,.000001)
b1~dnorm(0,.000001)
b2~dnorm(0,.000001)
mu.a~dnorm(0,.000001)
tau.a <- pow(sigma.a,-2)
sigma.a~dunif(0,10000)
mean1 <- exp(mu.a)
mean2 <- exp(mu.a + g2)
mean3 <- exp(mu.a + g3)
mean4 <- exp(mu.a + g4)
}

We can refit this model from scratch, or better yet, we can restart the model we previously estimated using the last values of the Markov chains of the sampled parameters as the new initial values. These are stored in the $last.values component of the bugs object and can be used instead of bird.inits as the initial values. The other change is that we need to add the means to the bird.parms object so that samples of their posterior distributions are returned to us.

bird.parms2 <- c("a", "b1", "b2", "mu.a", "sigma.a", "g1", "g2", "g3", "g4", "mean1", "mean2", "mean3", "mean4")
# rerun model using last values of previous run
model4a.bugs <- bugs(bird.data, model4.bugs$last.values, bird.parms2, "model4.txt", bugs.directory="C:/WinBUGS14", n.chains=3, n.iter=10000, debug=T)
# JAGS run
model4a.jags <- jags(bird.data, model4.jags$BUGSoutput$last.values, bird.parms2, "model4.txt", n.chains=3, n.iter=10000)
# percentile credible intervals
apply(model4a.bugs$sims.matrix[, c("mean1", "mean2", "mean3", "mean4")], 2, function(x) quantile(x, c(.025,.975)))
        mean1  mean2   mean3 mean4
2.5%  19.1905 15.301 18.2105 14.63
97.5% 23.6900 18.610 21.8995 18.08
# HPD credible intervals
HPDinterval(as.mcmc(model4a.bugs$sims.matrix[, c("mean1", "mean2", "mean3", "mean4")]))
      lower upper
mean1 19.37 23.84
mean2 15.40 18.66
mean3 18.21 21.91
mean4 14.63 18.08
attr(,"Probability")
[1] 0.9500998

Method 2: Use the model results already obtained

Given that we already have samples from the posterior distributions of all the parameters that are needed for the calculations of the means, we don't actually need to fit the model again. Instead we can perform the arithmetic on the vectors of samples and then obtain the interval estimates from the results. I first calculate the posterior distributions of the means and then assemble the results in a matrix in the order "agricultural", "bauxite", "forest", and "urban".

# posterior samples of the four landscape means on the log scale,
# in the order agriculture, bauxite, forest, urban
mean.matrix <- cbind(model4.bugs$sims.matrix[,"mu.a"],
                     model4.bugs$sims.matrix[,"mu.a"] + model4.bugs$sims.matrix[,"g2"],
                     model4.bugs$sims.matrix[,"mu.a"] + model4.bugs$sims.matrix[,"g3"],
                     model4.bugs$sims.matrix[,"mu.a"] + model4.bugs$sims.matrix[,"g4"])
# back-transform to the mean scale
mean.matrix <- exp(mean.matrix)

The percentile 95% credible intervals can be calculated in the usual fashion.

apply(mean.matrix, 2, function(x) quantile(x, c(.025, .975)))
          [,1]     [,2]     [,3]     [,4]
2.5%  19.22141 15.40112 18.36952 14.79520
97.5% 23.75932 18.84194 21.93979 18.19401

Or we can convert the matrix to an mcmc object and calculate the HPD intervals.

mean.mcmc <- as.mcmc(mean.matrix)
HPDinterval(mean.mcmc)
        lower    upper
var1 19.20172 23.71244
var2 15.48853 18.88561
var3 18.26013 21.82482
var4 14.87229 18.24881
attr(,"Probability")
[1] 0.9500998

Bayesian model selection

The log-likelihood in Bayesian and frequentist models

WinBUGS and JAGS, and for that matter Bayesians in general, use the term deviance to mean –2 times the log-likelihood of the model.

deviance = –2 × log-likelihood

This differs from the classical definition of deviance used in generalized linear models, where the (scaled) deviance was defined as twice the difference in log-likelihoods between the so-called saturated model (a model in which a separate parameter is estimated for every observation) and the current model. The deviance defined in this way can serve as a goodness-of-fit statistic for Poisson or grouped binary data, but only when the expected cell counts meet certain minimum cell-size criteria.

Now in certain circumstances the log-likelihood of the saturated model turns out to be zero and for these cases the Bayesian and classical definitions of the deviance are the same. One example is with binary data in which the response is assumed to have a Bernoulli distribution. For most other probability models the saturated model will contribute a nonzero term to the classical definition of the deviance and as a result the classical definition and the Bayesian definition will not agree. Unfortunately the Bayesian use of the term deviance, in which the contribution of the saturated model to the deviance is ignored, is now well-established in the literature. As a result use of the term "deviance" is ambiguous.
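The distinction is easy to demonstrate in R. The sketch below fits an illustrative Poisson GLM to the birds data (the particular model is unimportant) and computes the deviance both ways; for Bernoulli data the two numbers would coincide because the saturated model's log-likelihood is zero.

# classical versus "Bayesian" deviance for a Poisson model
fit <- glm(S ~ factor(year), data=birds.short, family=poisson)
deviance(fit)                  # classical: 2*(saturated logLik - model logLik)
-2 * as.numeric(logLik(fit))   # Bayesian usage: -2 * model logLik
# the difference equals 2 * logLik of the saturated model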

We saw in lecture 24 that for Poisson fixed effects models, the estimated log-likelihood returned by WinBUGS based on Dhat is approximately the same as the log-likelihood returned by frequentist software. This will generally be the case when the models in question involve only fixed effects (although differences can arise due to the way variances are estimated by maximum likelihood). When random effects are thrown into the mix, the situation changes and the log-likelihoods returned are no longer the same.

In hierarchical models with both fixed and random effects, the log-likelihoods estimated by Bayesian and frequentist software can differ markedly. This occurs because Bayesians and frequentists are no longer estimating the same quantity. As was explained in lecture 22, frequentists work with what is called the marginal likelihood, the likelihood obtained after integrating out the random effects from the joint likelihood of the parameters and the random effects. Bayesians instead work directly with the joint likelihood and treat the individual random effects as parameters to be estimated. This poses an interesting conundrum. We've argued that one of the advantages that random effects models have over fixed effects (separate regressions) models is that they are more parsimonious and do not overfit models to data. But if in the Bayesian approach the random effects are estimated directly as part of fitting the model, then where is the parsimony?

DIC as a measure of fit

Bayesians address the issue of parsimony with a model characteristic they call the effective number of estimated parameters, or pD. The term "effective" is used because unlike the parameters obtained in a separate regressions model, the parameters in a Bayesian mixed effects model are subject to constraints.

  1. If even a weakly informative prior is used for a parameter in a Bayesian model, then that parameter is no longer free to take on any potential value in the parameter space.
  2. Random effects are assumed to be drawn from a common distribution, typically multivariate normal. This distribution dictates that certain sets of values for the parameter are more likely to occur than other sets. Thus the presence of a common distribution imposes additional constraints on the parameters.

These ideas get used formally in the Bayesian formulation of model complexity. If model parameters are constrained, either through their priors or because they arise from a common distribution as is the case with random effects, then the actual number of parameters in the model is not a true reflection of model complexity. Instead we need to know the effective number of parameters, pD, a correction to the actual number of parameters that takes into account their mutual constraints. Bayesians then use pD to calculate DIC, the Bayesian version of the information-theoretic model selection statistic AIC.

Recall that AIC estimates expected relative Kullback-Leibler information and so measures the relative fit of competing models. In a simplistic sense AIC can be thought of as a penalized log-likelihood in which increases in log-likelihood that are achieved by increasing a model's complexity are penalized according to the number of extra parameters that have been added to the model. Using Bayesian notation AIC can be written as follows.

AIC = Dmin + 2K

Here Dmin is the deviance calculated at the modes of the posterior distributions of all parameters in the model and K is the number of parameters estimated in fitting the model. Bayesians replace AIC with DIC, which is estimated as follows.

DIC = Dhat + 2·pD

where Dhat is the deviance calculated at the means of the individual parameter posterior distributions and pD is as defined above. DIC is an acronym for "deviance information criterion". All of this raises the question of how model complexity, pD, in Bayesian models is determined. There are a number of definitions but the most popular one is due to Spiegelhalter et al. (2002). (David Spiegelhalter is one of the primary developers of the WinBUGS software.) Their definition is

pD = Dbar – Dhat

where Dbar is the mean of the posterior distribution of the deviance.

Their rationale for this definition is as follows. Each MCMC sample of D is calculated using the current values of the parameters in the Markov chains that are samples from their respective posterior distributions. Because at any given iteration these values are not likely to be at the modes, or means for that matter, of their respective posterior distributions, the calculated log-likelihood obtained from D will be less than its maximum possible value. Said differently the deviance obtained in this fashion will be larger than Dhat. The mean of all the deviance values in the sample, Dbar, will also exceed Dhat and hence pD defined by the above formula should be non-negative.

When there are more constraints on the individual parameters, D has less room to vary, so the individual values of D will on average be closer to Dhat. When there are fewer constraints on the parameters, values of D can vary more and will typically be further from Dhat. When there are no constraints at all on the parameters (for instance, a separate intercepts regression model with uninformative priors) pD attains its largest possible value, which turns out to equal K, the number of parameters in the frequentist definition of AIC. As the number and strength of the constraints increase, pD decreases from this maximum value. While additional steps are needed to make this argument rigorous, this is essentially the motivation for the Bayesian definition of DIC. Additional information can be found in McCarthy (2007).
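A toy simulation (separate from the birds analysis) illustrates the definition. For a single Poisson mean with an essentially flat prior, pD computed as Dbar – Dhat should come out close to 1, the number of free parameters.

# simulate data and draw from the posterior of a single Poisson mean
set.seed(10)
y <- rpois(50, lambda=5)
# under a flat prior the posterior of lambda is Gamma(sum(y)+1, length(y))
post <- rgamma(3000, shape=sum(y)+1, rate=length(y))
D <- sapply(post, function(lam) -2*sum(dpois(y, lam, log=TRUE)))
Dbar <- mean(D)
Dhat <- -2*sum(dpois(y, mean(post), log=TRUE))
Dbar - Dhat    # effective number of parameters; approximately 1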

The bugs function returns pD as a model component.

model4.bugs$pD
[1] 59.039

In the current model there are seven regression parameters and one variance term, for a total of eight parameters. This is the number of parameters frequentists use to calculate AIC.

attr(logLik(model4.lmer),'df')
[1] 8

Yet WinBUGS has determined there are 59.039 effective parameters, roughly 51 more than the frequentist count of eight. Random effects in Bayesian models are treated as modeled parameters, but because they are constrained to come from a common distribution they are not free to vary independently of each other. So, even though there are 101 random intercepts in the model,

dim(ranef(model4.lmer)[[1]])
[1] 101 1

due to the constraints WinBUGS counts them as being effectively equivalent to 51 free parameters. WinBUGS also returns DIC as a model component.

model4.bugs$DIC
[1] 1511.35

Although WinBUGS does not return Dhat it does return Dbar. This is enough for us to calculate DIC ourselves because Dhat is not needed as the following algebra shows.

Dhat = Dbar – pD

Plugging this expression for Dhat into the DIC formula yields the following alternative formula.

DIC = Dhat + 2·pD = (Dbar – pD) + 2·pD = Dbar + pD

Using this we find

model4.bugs$summary["deviance","mean"]
[1] 1452.312
model4.bugs$summary["deviance","mean"] + model4.bugs$pD
[1] 1511.351

which agrees with the reported DIC. Interestingly JAGS reports completely different values for pD and DIC than does WinBUGS.

model4.jags$BUGSoutput$DIC
[1] 1551.947
model4.jags$BUGSoutput$pD
[1] 98.73004

The mean deviance Dbar reported by JAGS is in close agreement with the value reported by WinBUGS.

model4.jags$BUGSoutput$summary["deviance", "mean"]
[1] 1453.217

Using the alternative formula for DIC, we obtain the value reported by JAGS.

model4.jags$BUGSoutput$summary["deviance", "mean"] + model4.jags$BUGSoutput$pD
[1] 1551.947

Thus the difference between JAGS and WinBUGS lies in how pD is calculated (R2jags appears to estimate pD as half the posterior variance of the deviance rather than as Dbar – Dhat). An explanation of what JAGS is doing can be found here.

Because DIC and AIC in mixed effects models work with different log-likelihoods, they are not directly comparable. Whether AIC is an appropriate way to compare mixed effects models is a matter of debate. An alternative definition of AIC has been proposed by Vaida and Blanchard (2005) for mixed effects models. They call their version conditional AIC and it is essentially a frequentist formulation of DIC. They promote the use of conditional AIC for model comparison when the random effects are of primary interest and not merely nuisance parameters. At present it is not known how to calculate conditional AIC except in a few special cases.

Cited references

McCarthy, M. A. 2007. Bayesian Methods for Ecology. Cambridge University Press, Cambridge.

Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde. 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 64: 583–639.

Vaida, F., and S. Blanchard. 2005. Conditional Akaike information for mixed-effects models. Biometrika 92: 351–370.

R Code

A compact collection of all the R code displayed in this document appears here.



Jack Weiss
Phone: (919) 962-5930
E-Mail: jack_weiss@unc.edu
Address: Curriculum for the Environment and Ecology, Box 3275, University of North Carolina, Chapel Hill, 27599
Copyright © 2012
Last Revised--November 23, 2012
URL: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture25.htm