You will need these packages installed, but I recommend not attaching them with library(), because they mask other useful functions. Instead, I prefer to call their functions with the :: operator.
#library(data.table)
#library(jagsUI)
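As a trivial illustration of the :: style (not part of the analysis, just showing the syntax):
# Call rbindlist() without attaching data.table at all
data.table::rbindlist(list(data.frame(x = 1), data.frame(x = 2)))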
Setting a seed is always important when randomness is involved.
set.seed(1989) # The last time the Oakland A's won the World Series. Last time ever :(
I am going to simulate data about caterpillar size collected over time. Say we collected Papilio rutulus caterpillars at 5 sites for different numbers of years. We are interested in the impacts of temperature on size. Essentially, I am going to estimate the change in size at 5 sites that is driven by temperature (to varying extents). There will also be trends at some sites that are not driven by temperature.
There is probably a better way to simulate this, but here we are. I am going to write a for loop to simulate data at each site separately and then combine them at the end.
These are the parameters for the simulation:
# Names of the sites
site_names <- c("SiteA", "SiteB", "SiteC", "SiteD", "SiteE")
# Numbers of years of observation at each site
n_years <- c(60, 45, 32, 75, 45)
# Coefficients for the impact of year on caterpillar size at each site
year_cater_coefs <- c(-0.07, -0.4, 0, 0.05, -0.1)
# Coefficients for the impact of temperature on caterpillar size at each site
temp_cater_coefs <- c(-2, -1, -1.5, -2.5, 0.01)
# Makes empty list to store data from each site
sim_data_list <- list()
for (i in 1:length(site_names)) {
years <- rev(2019-1:n_years[i]) # Make a vector of observation years ending in 2018
temperature <- 16.5 + rnorm(length(years), 0, 0.75) # Simulate temperature data
# This is the linear model. I centered the predictors because when I was making up
# the coefficients I was thinking about them in terms of their actual effects. With
# centered predictors, the intercept (50) is the mean caterpillar size at the mean
# temperature and mean year, which is just easier to think about. Finally, I added some error.
cat_size <- 50 +
year_cater_coefs[i]*scale(years, scale = F)[,1] +
temp_cater_coefs[i]*scale(temperature, scale = F)[,1] +
rnorm(length(years), 0, 1)
# Here we combine all the simulated values and store them in a list.
sim_data_list[[i]] <- data.frame(site = rep(site_names[i], length(years)),
years = years,
temperature = temperature,
cater_size = cat_size)
}
# Combines data from different sites
simulated_data <- data.table::rbindlist(sim_data_list)
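It is worth eyeballing the simulated data before modeling (just a quick sanity check):
head(simulated_data) # first few rows: site, years, temperature, cater_size
table(simulated_data$site) # observations per site should match n_years above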
Now let’s think Bayes: we have five sites, let’s say from the same region. Using a hierarchical model we can estimate an effect at each site AND an overall regional effect, using information the model learns from the site-level estimates.
There are a lot of options for Bayesian hierarchical modeling in R; I will use JAGS here.
JAGS wants data structured in a list (shown below). In this case, the first four elements are the actual data and the last two are counts. You will always need the total number of observations (N). If your model is hierarchical, you will also need the number of groups in each level (Nsite); in this case it is 5, because there are 5 sites. Note: scale your predictors! 1) It makes them comparable and 2) it helps the model converge. Here I only center them, because I simulated them on their natural scales; if I scaled here, we would not recover the original coefficients.
simulated_data$years <- scale(simulated_data$years, scale = F)[,1]
simulated_data$temperature <- scale(simulated_data$temperature, scale = F)[,1]
model_data <- list(size = simulated_data$cater_size, #caterpillar size
years = simulated_data$years, #years
temp = simulated_data$temperature, #temperature
sites = as.numeric(factor(simulated_data$site)), #Site ID's, made to be numbers
N = length(simulated_data$site), #Total number of observations
Nsite = length(unique(simulated_data$site))) #Total number of sites
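If you want to double-check that everything lines up before running the model (optional, but cheap insurance):
str(model_data) # the data vectors and the counts N and Nsite should all be consistent
model_data$N == nrow(simulated_data) # should be TRUE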
Once we have the data structured correctly, we then write the model statement. This is written as text and saved to a text file. I will write the model here, with annotations throughout.
sim_model <- "model {
# Here we model caterpillar size as a normal distribution with mean mu and precision sigma
# (note: JAGS parameterizes dnorm with a precision, 1/variance, not a variance or sd).
# We are saying that mean caterpillar size (mu) is a function of an intercept (alpha)
# and linear relationships with year (beta1) and temperature (beta2).
# The brackets next to alpha, beta1, and beta2 are what make this hierarchical. They seem
# complicated, but really they are just indexing. Essentially we are saying that we want a beta
# estimate for every site, and since we have 5 sites, we will get 5 different beta1 and beta2 estimates.
# The loop below runs over every observation, 1 through N (defined in the data above).
for(i in 1:N) {
# Size is drawn from a normal distribution
size[i] ~ dnorm(mu[i], sigma)
# This is the actual linear model
mu[i] <- alpha[sites[i]] + beta1[sites[i]] * years[i] + beta2[sites[i]] * temp[i]
}
# This next bit is critical to the hierarchy. Since Nsite = 5, you are saying there are
# 5 different pools of data, so you will have beta1[1] when j = 1 ... and beta1[5] when j = 5.
# Same goes for beta2 and alpha. Since the linear model above knows which pool each observation
# belongs to, it can link the proper alpha and betas from here.
for (j in 1:Nsite) {
beta1[j] ~ dnorm(hbmu1, hbsig1)
beta2[j] ~ dnorm(hbmu2, hbsig2)
alpha[j] ~ dnorm(himu, hisig)
}
# Uninformative hyperpriors
# These are priors on your priors.
# These priors inform your site-level estimates,
# which inform your overall estimate.
# Usually a normal distribution is specified with a mean and standard
# deviation, but JAGS parameterizes dnorm as (mean, precision), where
# precision = 1/variance. So dnorm(0, 0.001) below is a very flat
# normal, and the dgamma priors below are on precisions.
# I use a gamma distribution for the precision parameters because gamma
# distributions do not allow negative values, and a precision (like a
# variance) cannot be negative.
hbmu1 ~ dnorm(0, 0.001)
hbmu2 ~ dnorm(0, 0.001)
himu ~ dnorm(0, 0.001)
hbsig1 ~ dgamma(2, 0.1)
hbsig2 ~ dgamma(2, 0.1)
hisig ~ dgamma(2, 0.1)
sigma ~ dgamma(2, 0.1)
}
"
writeLines(sim_model, con="sim_model.txt")
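If the precision parameterization feels unintuitive, a quick back-of-the-envelope conversion shows how flat those dnorm(0, 0.001) hyperpriors really are:
# JAGS dnorm(mean, tau) uses precision tau = 1/variance
tau <- 0.001
1 / sqrt(tau) # implied standard deviation, ~31.6: very flat for coefficients on this scale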
Now we run the model. There are many options for running JAGS; I personally like jagsUI. We will run 10,000 iterations on each of 4 chains. The number of iterations will vary depending on the problem: the more complicated, the more iterations. That said, if a model does not converge, raising the number of iterations probably won’t fix it; it is more likely an issue with the model itself. The burn-in period is the period where the model is still learning and those samples may not be reliable, so we throw them out (how many depends on the n.burnin argument). You also have to tell it where the text file with the model is.
mod <- jagsUI::jags(data = model_data,
n.adapt = 2000,
n.burnin = 1000,
n.iter = 10000,
n.chains = 4,
modules = "glm",
model.file = "sim_model.txt",
parameters.to.save = c("alpha", "beta1", "beta2", "hbmu1", "hbmu2"),
verbose = TRUE)
##
## Processing function input.......
##
## Done.
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 257
## Unobserved stochastic nodes: 22
## Total graph size: 1827
##
## Initializing model
##
## Adaptive phase, 2000 iterations x 4 chains
## If no progress bar appears JAGS has decided not to adapt
##
##
## Burn-in phase, 1000 iterations x 4 chains
##
## |**************************************************| 100%
##
## Sampling from joint posterior, 9000 iterations x 4 chains
##
## |**************************************************| 100%
##
## Calculating statistics.......
##
## Done.
Here you can examine the chains for convergence. A trace plot is quite simply a time series of the samples across the iterations of the model. You want plots that look like a long, fuzzy caterpillar. This tells us that the model has “converged” on an answer.
plot(mod, parameters = "hbmu1")
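Trace plots are one check; it is also standard to look at the Gelman-Rubin statistic (Rhat), which jagsUI reports on its output object. Values near 1 (a common rule of thumb is below 1.1) suggest the chains have mixed. A minimal sketch, assuming the standard jagsUI output structure:
mod$Rhat$hbmu1 # Rhat for the regional year effect; values near 1 indicate convergence
range(mod$summary[, "Rhat"], na.rm = TRUE) # or scan all monitored parameters at once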
Let’s compare what we found to what we simulated, and to results from ordinary linear models (lm) fit separately at each site.
lms <- rbind(coef(lm(cater_size ~ years + temperature, data = subset(simulated_data, site == "SiteA"))),
coef(lm(cater_size ~ years + temperature, data = subset(simulated_data, site == "SiteB"))),
coef(lm(cater_size ~ years + temperature, data = subset(simulated_data, site == "SiteC"))),
coef(lm(cater_size ~ years + temperature, data = subset(simulated_data, site == "SiteD"))),
coef(lm(cater_size ~ years + temperature, data = subset(simulated_data, site == "SiteE"))))
compare <- data.frame(Simulated_year_coef = year_cater_coefs,
Freq_year_coef = lms[,2],
Bayes_year_coef = mod$mean$beta1,
Simulated_temp_coef = temp_cater_coefs,
Freq_temp_coef = lms[,3],
Bayes_temp_coef = mod$mean$beta2)
rownames(compare) <- site_names
compare
## Simulated_year_coef Freq_year_coef Bayes_year_coef Simulated_temp_coef Freq_temp_coef Bayes_temp_coef
## SiteA -0.07 -0.06040948 -0.06019246 -2.00 -1.88860055 -1.86222484
## SiteB -0.40 -0.39834102 -0.39450324 -1.00 -1.13306617 -1.17271038
## SiteC 0.00 -0.01110290 -0.01665838 -1.50 -1.46350183 -1.46589321
## SiteD 0.05 0.05096463 0.05076454 -2.50 -2.51633641 -2.45768824
## SiteE -0.10 -0.10528487 -0.10450801 0.01 0.03772159 -0.07397222
So if the estimates are the same, what is the point? There are many reasons (but let’s stick to two for now).
If you look at the compare table you will see that the estimates are not exactly the same as the simulated values, but they are close.
Let’s look at beta1 (year) at SiteA. Here you see that the posterior contains the true value. Since we will never know the true value with real data, posterior densities like this one give us a great idea of the range of plausible answers.
# This is the posterior
plot(density(mod$sims.list$beta1[,1]), main = "Posterior Distribution")
# Lets add a line for our lm estimate
abline(v = compare[1,2], col = "green")
# And the true value
abline(v = compare[1,1], col = "yellow")
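To put numbers on that range, you can pull a 95% credible interval straight from the posterior samples stored by jagsUI:
# 95% credible interval for the SiteA year effect
quantile(mod$sims.list$beta1[, 1], probs = c(0.025, 0.975))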
Another reason Bayes is worthwhile is those across-site estimates! Those are stored in the hbmu1 and hbmu2 parameters we defined in the model, which, if you recall, were the means of the priors on the site-level estimates.
Here is the posterior for the estimate of the impact of temperature across all sites.
# There is some considerable variation, given the variation within sites
plot(density(mod$sims.list$hbmu2), main = "Posterior Distribution")
# Let's add a line for the mean estimate from each of our sites
abline(v = mod$mean$beta2[1], col = "green") #SiteA
abline(v = mod$mean$beta2[2], col = "yellow") #SiteB
abline(v = mod$mean$beta2[3], col = "blue") #SiteC
abline(v = mod$mean$beta2[4], col = "purple") #SiteD
abline(v = mod$mean$beta2[5], col = "black") #SiteE
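And to summarize that regional temperature effect as a single number with uncertainty (same idea as the site-level interval above):
# Posterior mean and 95% credible interval for the regional temperature effect
mean(mod$sims.list$hbmu2)
quantile(mod$sims.list$hbmu2, probs = c(0.025, 0.975))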