Now that you have learned some of the most basic linear models we are going to add some complexity by introducing different error structures, mixed effects, and some basic Bayes. The Bayes stuff can be a pain to run on your computer so please reach out to me if you are having any trouble.
Set working directory.
setwd("...your_working_directory/")
Load in packages
library(lme4)
library(rjags) #Let me know if this gives you trouble
library(tidyverse)
library(brms)
library(bayesplot)
#library(jagsUI) #I do not recommend importing the full library
#BEST was removed from CRAN August, 2023 we can still install archived packages several ways
#url <- "https://cran.r-project.org/src/contrib/Archive/BEST/BEST_0.5.4.tar.gz"
#install.packages(url, type="source", repos=NULL)
#require(devtools)
#install_version("BEST", version = "0.5.4", repos = "http://cran.us.r-project.org")
# https://cran.r-project.org/src/contrib/Archive/BEST/
library(BEST)
Import data.
complex_data <- read.csv("Complex_models_data.csv")
Examine data.
colnames(complex_data) #Returns column names
## [1] "elevation" "genus" "area" "herbivory" "diversity" "plotID" "surv"
head(complex_data) #Returns first 6 entries for each column
## elevation genus area herbivory diversity plotID surv
## 1 high Piper 400.0000 76 4 1 0
## 2 high Piper 133.3000 5 2 2 1
## 3 high Piper 171.5278 3 2 3 1
## 4 high Piper 38.0000 0 0 4 1
## 5 high Piper 825.0000 1 2 5 1
## 6 high Piper 900.0000 5 3 6 1
Summarize data.
summary(complex_data) #Returns summary of each column.
## elevation genus area herbivory diversity plotID
## Length:179 Length:179 Min. : 12.5 Min. : 0.00 Min. :0.000 Min. : 1.00
## Class :character Class :character 1st Qu.: 105.5 1st Qu.: 1.00 1st Qu.:1.000 1st Qu.: 8.00
## Mode :character Mode :character Median : 192.0 Median : 5.00 Median :2.000 Median :15.00
## Mean : 247.4 Mean :12.16 Mean :2.095 Mean :15.42
## 3rd Qu.: 355.6 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:23.00
## Max. :1093.8 Max. :87.00 Max. :6.000 Max. :31.00
## surv
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6872
## 3rd Qu.:1.0000
## Max. :1.0000
You should notice that these data are very similar to last week’s
data. In fact they are the same except for one additional column:
“surv.” This column describes the survival of these plants measured at
the end of the season. A value of 1 indicates if a plant was alive at
the end of the season and 0 = dead. We will use these data to consider
the glm()
function and mixed models.
Reading Resources:
The GLzM is a generalization of the ordinary linear model for
response variables with error distributions other than normal. You have
essentially performed GLzM in the previous lab without knowing it or
specifying link functions. GLzM are linear models, but the use of a link
function enables them to be adapted them to data with non-normal error
distributions. A note on terminology: in many texts GLM refers to
general linear models, and GlzM refers to generalized
linear models. In R, lm()
runs GLM’s, while
glm()
runs GLzM’s. This can be confusing!
Components of the Generalized Linear Model:
Some notes:
Before going too far, first let’s take a step back and think about what an error distribution is. Last week a common assumption in all of the basic linear models was that the residuals were normally distributed. Here is what this looks like in an equation:
\[y = \alpha + \beta X' + E\] \[E \sim Normal(0,\sigma^2)\] Here the response variable (\(y\)) equals the intercept (\(\alpha\)) + coefficients (\(\beta\)) * the predictors (\(X'\)) + the residuals. The residuals are the error that remains in your model after you have accounted for the effects of the predictor variables. As you can see the ordinary linear model expects these errors (\(E\)) to be normally distributed and on average be 0. This is a problem for many types of ecological data, like survival and counts, because the nature of the response variables creates wacky error distributions.
Enter the GLzM. This generalization of the linear model allows different types of error structures to be applied through the use of “link” functions. The link function essentially constrains the linear regression (the systematic component) to predict within the range of possible outcomes. Link function can be thought of transforming the model predictions in contrast to transforming the response variables (which we did in previous labs). For example, in logistic regression predictions from the linear model are transformed in such a way that they are constrained to fall between 0 and 1. So while the linear model allows values between -∞ and +∞, the link function ensures (“constricts”) predictions fall between 0 and 1.
The link functions vary depending on the error distribution. Generally speaking the error distributions draw from the exponential family of distributions (fig. 1)- for example, the binomial, Poisson, and gamma distributions.
family = "gaussian"
: linear regressionfamily = "binomial"
: logistic regressionfamily = binomial(link = "probit")
: probitfamily = "poisson"
: Poisson regressionFor a brush up on the parameters, application and other details of these distributions refer to Appendix in Ecological Statistics Fox et al 2015 OR Ecological Models and Data in R by Bolker 2008 for a great review.
Today we will focus on the binomial distribution. If you wish to see a great resource exploring count data (e.g. application of the Poisson distribution) Click here for an example. The math behind many of these functions is beyond the scope of this lab, however I recommend that you read about some as they are important for properly interpreting the coefficients your models produce. We will consider the binomial error distribution (i.e. logistic regression) in some detail below. Click here if you need a refresher on logistic regression.
So let’s get on with the lab. For this first section we are interested in the effect of herbivory on plant survival at the end of the year. Thinking about what you did last week, what would you do?
Probably something like this…
normal_model <- lm(surv ~ herbivory, data = complex_data)
summary(normal_model)
##
## Call:
## lm(formula = surv ~ herbivory, data = complex_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85489 0.00212 0.10850 0.18173 0.68298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.909813 0.030013 30.31 <2e-16 ***
## herbivory -0.018308 0.001392 -13.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3316 on 177 degrees of freedom
## Multiple R-squared: 0.4943, Adjusted R-squared: 0.4914
## F-statistic: 173 on 1 and 177 DF, p-value: < 2.2e-16
plot(complex_data$surv, complex_data$herbivory)
Let’s examine the residuals
hist(residuals(normal_model))
They look a little wonky. There is definite left skew - variance is
not constant. Let’s also examine what the model thinks using
predict()
. Predict()
simply takes a value from
your predictor variables (your real predictors by default) and produces
the value that your model predicts. This is usually a great gut check to
perform on models. You do not want your model predicting crazy
things.
hist(predict(normal_model), main = "")
That’s real weird. Our model is predicting negative values. This is a
problem because you cannot have a less than 0 chance of survival.
Something is wrong here. No surprise. We know survival
is a
binary response variable coded with {0,1} binomial variable, so we
shouldn’t have expected the error distribution to reflect a normal error
distribution.
Let’s try a generalized linear model using the glm()
function. The glm()
function has the argument “family” that
we use to specify the error distribution. We would use “Poisson” for
count data and will use “binomial” for survival data. You can use
?family
for details on how R uses these arguments
binom_model <- glm(surv ~ herbivory, family = binomial(link = "logit"), data = complex_data)
# note that the default link function for binomial family is "logit". So you could just type `family="binomial"`- but here I'll specify it to emphasize "under the hood processes"
summary(binom_model)
##
## Call:
## glm(formula = surv ~ herbivory, family = binomial(link = "logit"),
## data = complex_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.38193 0.64283 6.817 9.32e-12 ***
## herbivory -0.33804 0.05422 -6.234 4.55e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 222.447 on 178 degrees of freedom
## Residual deviance: 77.003 on 177 degrees of freedom
## AIC: 81.003
##
## Number of Fisher Scoring iterations: 7
Components of the summary
output:
Click
here for a useful resource explaining the various components of the
summary()
output for glm()
Residual Deviance: Deviance is defined by the difference between the log-likelihood of the saturated model and the log-likelihood of the fitted model. Essentially this value measures the extent to which the likelihood of the saturated model exceeds the likelihood of the proposed model. Low values imply that the fitted model and high values = poor fit. For an explanation on residual deviance and deviance residuals, watch these two videos video1 and video2. Deviance refers to the difference in the log-likelihood of the saturated and fitted model.
Deviance residuals: The R output shows the distribution of each samples contribution to residual deviance. Deviance residuals represent the square root of the contribution that each data point has to the overall residual deviance. Squaring these residuals and summing over all observations yields the residual deviance statistic mentioned above.
Null Deviance: Measure of the difference between
the fitted model and the null model (an intercept only model). A low
null deviance, relative to the measure of residual deviance, implies
that the data can be modeled well merely using the intercept. If the
null deviance is high, relative to the residual deviance, this implies
the model you fit reduces error (deviance) when compared to an intercept
only model. Note that it is possible to perform a significance test on
this drop in deviance, similar to an F-test in a least-squares
regression. The p-value is calculated using the difference in deviances
from the null to the residual, and the difference in degrees of freedom
from the null to the residual, using a chi-square distribution. It is
not part of the output, but can be done.
pchisq(binom_model$null.deviance - binom_model$deviance, binom_model$df.null - binom_model$df.residual, lower.tail=FALSE)
AIC (Akaike information criterion): information-theoretic measure that describes the quality of a model by taking into account the number of parameters (k) in the fitted model. This measure of model fit captures the trade-off between under and over fitting. AIC = -2*lnLmax + 2k, It has two terms: the deviance component -2*lnLmax (explained previously) and the estimated number of parameters (2k). Thus, AIC increases with the number of parameters and declines with the fit to the data. On its own the value is not useful, but it can help you evaluate models if you compare the AIC to other models. For a review of these concepts I recommend this video AND pg. 30-33 in Ecological Statistics by Fox et al. 2015.
Coefficients: This will be explained in detail below. Interpretation depends on the distribution and link function used.
Ok, back to our example. Let’s first examine the residuals.
hist(residuals(binom_model))
These reflect the spread of the deviance residuals we saw using
summary()
. Deviance residuals should be checked for
symmetry: the median value should be close to zero, and the 1Q and 3Q
values should be roughly equidistant from the median.
hist(predict(binom_model, type = "response"), main = "")
This is MUCH better. The predictions are all between 0 and 1. But how do we interpret our results? What does this mean?
Before moving on to interpretation, let’s take a quick look at what these models look like. Model 1 is the linear model (straight line) and model 2 is the logistic regression (sigmoid shape). Which looks better?
plot(0:87,predict(binom_model, newdata = data.frame(herbivory= 0:87), type = "response"), xlab = "Herbivory", ylab = "Probability of survival", type = "l", ylim = c(-1,1), col = 'blue')
lines(0:87,predict(normal_model, newdata = data.frame(herbivory= 0:87), type = "response"), col = 'red')
points(complex_data$herbivory, complex_data$surv)
Interpreting glm()
coefficients
Recall that the coefficient for a continuous variable in an ordinary
linear model should be interpreted as the predicted change in the
response variable for every unit of change in the predictor.
However this becomes more complicated when using link functions.
normal_model <- lm(surv ~ herbivory, data = complex_data)
(the worse model) can be straight forwardly interpreted because we do
not use a link.
coef(normal_model)
## (Intercept) herbivory
## 0.909813 -0.018308
In this case model 1 is saying that for every increase in 1 unit of herbivory the model predicts a 1.8% decrease in survival. The intercept indicates that if there is no herbivory (i.e. the slope = 0) then we expect 90% of the plants to survive. BUT remember that this model is not a “good” model.
This is the better model, but unfortunately the logit link makes these coefficients more difficult to understand.
coef(binom_model)
## (Intercept) herbivory
## 4.3819318 -0.3380374
To interpret them we need to consider what the logit link is doing. Here is the equation for a logistic regression:
\[P = \frac{e^{\alpha + \beta X'}}{1 + e^{\alpha + \beta X'}}\]
This equation is what gives
binom_model <- glm(surv ~ herbivory, family = binomial(link = "logit"), data = complex_data)
its sigmoid shape, but you can see that the coefficients associated with
the model (the alphas and betas) are also going through a
transformation. This means that the coefficients the model produces are
no longer on the easy linear scale we used before. In a binomial
regression, coefficients are on the log odds scale.
So looking at the coefficients for binom_model
we know
that every unit of herbivory causes a log odds decrease in survival of
0.33. WTF does that mean? We can get this to odds using
exp()
.
exp(0.33)
## [1] 1.390968
This tells us that for every 1 unit increase of herbivory that plant is 1.39 times less likely to survive. That’s more useful. But this is still not great because “more likely” is a relative term. You can extract the real probability from the model by using the predict() function. What is the probability of survival if you receive 20 units of herbivory?
predict(binom_model, newdata = data.frame(herbivory = 20), type = "response")
## 1
## 0.08480241
Under the hood, this function is just plugging in the predicted value into the logistic regression equation (with the coefficients filled in). Here is what that looks like:
1/(1 + exp(-(4.3819318 + -0.3380374*20)))
## [1] 0.0848024
We can generate the full curve by predicting the full range of values and plotting them. This is the full relationship between herbivory and survival.
plot(0:87, predict(binom_model, newdata = data.frame(herbivory= 0:87), type = "response"), xlab = "Herbivory", ylab = "Probability of survival")
In summary, when you use links like the logit or log, a transformation is occurring under the hood that makes the data more difficult (but not impossible) to interpret.
Checking for Understanding
lm()
appropriate for these data? - why or why not? c) Find
a more suitable distribution and run a glm()
and interpret
its outputs.Reading Resource:
Ecological Statistics Chapter 13: Linear and generalized linear mixed models by Benjamin M. Bolker ed. Fox et al.2015
Online Tutorial:
Video Resources: (I struggled to find video tutorials I liked on this topic. If you can share ones you found and appreciated, that would be great!)
Video 2 (I appreciated these video the most, but they are limited to examples with categorical predictors:)
A Brief Background
Linear Mixed Effects (LME) Models are very useful tools in ecology because they help us account for many of the pitfalls that belie ecological data. I will not get into pseudoreplication and non-independence among observations in detail here, but these are issues that should always be considered. For data sets with grouped designs (nested/hierarchical) or repeated measures across individuals, mixed models can help out account for variation from those groups. For example, a random intercept may be used to account for variation across sites from which measures were taken. This effectively accounts for variation across sites in their baseline levels of the response (y-intercept). Mixed effect models are essentially regression models with added components that account for variation in the intercept and/or slope parameters. The linear model contains fixed effects that reflect the mean intercept and mean slopes while mixed effects approach includes both fixed effects and random effects hence “mixed effects”. One of the most crucial considerations when building mixed models is deciding what is a random effect and what isn’t. Often people treat fixed effects as “what we care about” and random effects as “other potential data tomfoolery to account for” but this isn’t quite right.
A random effect is a parameter (either slope or intercept) in the model that varies across some level of a variable. The model would still contain a fixed intercept, representing the mean intercept for all levels. Each level then gets their own estimated random effect (from the distribution of random effects), representing an adjustment from the mean that is unique to the level.
Note that different people have different perspectives on which variables should be considered random and which are fixed. Generally, a variable can be considered random when it represents a random sampling of factors from a population (E.g. if you took multiple measurements within specimens selected randomly from their population, within dates selected randomly from the year, within sites selected randomly from a country, etc.) and when you don’t care about the effect of an individual random factor on your response variable, but still want to account for variation in the response caused by the random variable.
When should a random effect be a random slope, random intercept, or random slope and a random intercept? It depends of course. Fig. 4 below shows examples taken from Harrison et al. 2018 shows an example a) intercept changes between groups or b) the slope (which is the effect of a predictor variable) and the intercept varies between groups. The random slope and random intercept model is the most conservative but also takes the most data to fit.
Hypothetical Example: Drawing inspiration from Lord of the Rings
Here we are interested in the difference in mass between 5 measured groups. Lets say they are “ents”. We measured 25 ents in 5 groups for 125 total individuals. Let’s look at a couple of ways to model this:
model1 <- lm(mass ~ ent_group, data = data_mines_of_moria)
(assumes that they were sampled independent)model2 <- lmer(mass ~ 1 + (1|ent_group), data = data_mines_of_moria)
(This model implies we sampled 5 ent groups randomly drawn from a larger
population. If we did it again we might get different groups.)Let’s add another layer of complexity. Let’s now say we measured 5 ent groups in 3 forests (for 15 total ent groups). Yes I know ents are only in Fangorn forest. But let’s say that we found them in 3 forests. Are the 5 groups (125 individuals) we measured in each forest truly independent? It seems likely that ents from the same forest will be more similar than ents from other forests, so while we have 15 groups, it is really more like 3. This is one type of pseudoreplication and we need to account for it. In mixed models we can specify how data are nested. Example:
model3 <- lmer(mass ~ 1 + (1|forest/ent_group), data = data_mines_of_moria)
.
It says that we want to examine the differences in mass between
individuals who were measured in groups that are nested in forests.Both of the mixed models presented thus far have been a little unusual in that they are only intercept models (ANOVAs essentially). Let’s get back to our data set.
Example using our data set
I want to answer this question: how does herbivory impact plant survival at the end of the growing season? Remember that we measured three genera of plants in plots at two different elevations. So what do you think is fixed and what is random? Notice we will be modeling survival (the binomial variable we used in the example for generalized linear models).
binom_mixed <- glmer(surv ~ herbivory + (1|genus), family = "binomial", data = complex_data)
## boundary (singular) fit: see help('isSingular')
summary(binom_mixed)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
## Family: binomial ( logit )
## Formula: surv ~ herbivory + (1 | genus)
## Data: complex_data
##
## AIC BIC logLik -2*log(L) df.resid
## 83.0 92.6 -38.5 77.0 176
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.3866 -0.0026 0.1324 0.2198 3.2851
##
## Random effects:
## Groups Name Variance Std.Dev.
## genus (Intercept) 0 0
## Number of obs: 179, groups: genus, 3
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.38193 0.64284 6.817 9.33e-12 ***
## herbivory -0.33804 0.05423 -6.234 4.55e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## herbivory -0.885
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
exp(0.33)
## [1] 1.390968
This model treats genus as a random effect. A debate for sure. One could argue that this is “something we care about” while another says that if our question is about plants broadly these three genera are just samples of a larger plant species pool. If our question was how does survival vary among Miconia, Piper, and Psychotria it would be fixed.
Our model is overfitted (as indicated by the warning:
boundary (singular) fit: see ?isSingular
) – that is, the
random effects structure is too complex to be supported by the data,
hence a model without the random effects structure is likely better. If
we were to interpret the coefficients anyway, we would draw the
following conclusions (remember survival is a binary outcome with a
logit link, so the raw estimates are on the log-odds scale):
Below we try different models where binom_mixed
includes
a random intercept model fixes the slope to be the same between groups
and binom_random
includes a random slope and random
intercept model. There is no answer for all situations and which to use
should be a decision made by thinking about your data and question. In R
code it looks like this:
#Random intercept
binom_mixed <- glmer(surv ~ herbivory + (1|genus), family = "binomial", data = complex_data)
## boundary (singular) fit: see help('isSingular')
summary(binom_mixed)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
## Family: binomial ( logit )
## Formula: surv ~ herbivory + (1 | genus)
## Data: complex_data
##
## AIC BIC logLik -2*log(L) df.resid
## 83.0 92.6 -38.5 77.0 176
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.3866 -0.0026 0.1324 0.2198 3.2851
##
## Random effects:
## Groups Name Variance Std.Dev.
## genus (Intercept) 0 0
## Number of obs: 179, groups: genus, 3
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.38193 0.64284 6.817 9.33e-12 ***
## herbivory -0.33804 0.05423 -6.234 4.55e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## herbivory -0.885
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
#Random slope & random intercept
binom_random <- glmer(surv ~ herbivory + (herbivory|genus), family = "binomial", data = complex_data)
## boundary (singular) fit: see help('isSingular')
summary(binom_random)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
## Family: binomial ( logit )
## Formula: surv ~ herbivory + (herbivory | genus)
## Data: complex_data
##
## AIC BIC logLik -2*log(L) df.resid
## 87.0 102.9 -38.5 77.0 174
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.3866 -0.0026 0.1324 0.2198 3.2851
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## genus (Intercept) 9.747e-12 3.122e-06
## herbivory 5.265e-14 2.294e-07 -1.00
## Number of obs: 179, groups: genus, 3
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.38193 0.64284 6.817 9.33e-12 ***
## herbivory -0.33804 0.05423 -6.234 4.55e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## herbivory -0.885
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
The outputs are the same because these are overfitted with the additional parameters.
Exploring Mixed Effects Models with Categorical Data
Try out this. It’s an rpub with examples of mixed effects models applied to simulated categorical genetic data.
Check for Understanding
Adapted from https://www.youtube.com/watch?v=-4K-kKXNths&t=386s
Resources:
Quick Overviews:
Videos: Big Picture/ Bayes Rule AND Visuals showing how the Posterior Distribution changes with respect to changes in the Prior distribution and Likelihood Function
A Deeper Dive:
This section, which applies Bayesian Statistics, is intended to provide some example code and demonstrate that frequentist and Bayesian approaches often give the same answer (especially with a fair amount of data) and show how the interpretation of their outputs varies. I use Bayes frequently in my own work and would happily expand on Bayesian models in my office hours if some of you want to try hierarchical models when analyzing each others data.
Bayesian and frequentist statistics are two approaches to inference that use a different definition of probability. Just as suite of methods exist to frequentist statistics, Bayesian statistics can be applied to linear models, t-tests, hierarchical models, etc. In frequentist statistics, probability is derived from the frequency of an event occurring repeatedly. An event, as it may apply to an ecologist, is some model parameter or measure of effect (e.g. slope coefficient - which, don’t forget, represents a hypothesis we are testing). In Bayesian statistics, you are concerned with the probability of an event occurring given (or conditional on) your prior knowledge. In our case, the event is again a parameter and the knowledge or information available to you, is your data. This is often simply put as:
This has some interesting implications when thinking about interpreting p-values and confidence/credible intervals. Bayesian analysis also uses priors, which is a common (and in my opinion a bit overblown) criticism. Finally, Bayesian analysis treats all variables as random samples drawn from a larger distribution. While frequentist analyses return single coefficient estimates, Bayesian analyses return probability distributions.
Components of Bayes Rule in the image below:
Put another way…
Recall the differences among Bayesian and frequentist approaches.
Now that you hopefully feel more comfortable with the Bayesian definition of probability, what is the process underlying how parameters are estimated? Markov Chain Monte Carlo (MCMC) sampling is central to the implementation of Bayes probability theory to estimate parameters. MCMC functions to approximate the parameter given the joint distribution of the prior and data. A Visual of MCMC AND A Visual of MCMC chains using trace plots. Note that this video uses a particular algorithm for sampling-Metropolis-Hastings-but many algorithms exist including Gibbs Sampling- which you will see is often applied in ecology.
Figure Taken from this video lecture which provides an overview of MCMC sampling
Bayesian t-test
Let’s return to the t-test we performed last week. We were interested in the difference in herbivory between high and low elevation plots. We also used a sqrt() transformation.
freq_t.test <- t.test(sqrt(herbivory) ~ elevation, data = complex_data)
freq_t.test
##
## Welch Two Sample t-test
##
## data: sqrt(herbivory) by elevation
## t = 0.91792, df = 169.95, p-value = 0.36
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
## -0.3515798 0.9627352
## sample estimates:
## mean in group high mean in group low
## 2.845379 2.539801
Now let’s do a Bayesian t-test. You will notice right away that you need to include priors for two parameters: the mean and variance. Here we are saying that based on our prior information, we think the mean or standard deviation of each group is somewhere in the range of this distribution. Click here for examples on the different forms a prior distribution can take. Start time 17:16 min AND Click here to see how changes in the prior influence to the posterior density distribution. How do you know how to define your prior distribution? For an informed prior you can consult literature, talk to experts, use previous data. Alternatively there are less informed priors: uniform priors (same probability across all values), or weakly informed priors that specify a reasonable distribution. “Uninformative” prior is a bit of a misnomer, but shows up in the literature.
plot(density(rnorm(1000, 10, 10)), main ="")
Although this distribution is broad, this is actually a pretty informative prior. Often Bayesian analyses will reduce the influence of their prior by setting “weakly informative” or “uniform priors”. weakly informative priors are when the prior is something like a normal distribution with a mean of 0 and an sd of 10,000. Essentially we are saying that any value is possible, however less extreme values are slightly more likely. Uniform priors state that all values are equally probable. Priors are important and should be carefully considered in any analysis. Personally I think they are a strength of Bayesian analyses.
Stan is a library for analyzing Bayesian models using No-U-Turn sampler (a variant of Hamiltonian Monte Carlo). It’s written in C++ and was developed by an entire team.
Here, we are using the BRMS package and function brm to fit a Bayesian model. We can adjust chains, iterations, and cores based on available computing power. For now, we’ll let R detect how many cores to use.
#Prior for bayesian t-test
# brm will select these defaults, but you can always set them in the function itself
get_prior(bf(herbivory ~ elevation), data = complex_data)
## prior class coef group resp dpar nlpar lb ub source
## (flat) b default
## (flat) b elevationlow (vectorized)
## student_t(3, 5, 7.4) Intercept default
## student_t(3, 0, 7.4) sigma 0 default
CHAINS <- 4
ITER <- 2000
WARMUP <- 1000
BAYES_SEED <- 1234
options(mc.cores = parallel::detectCores()) # Depending on your computer, you can always change this in the brm function, too
BRMout <- brms::brm(
bf(sqrt(herbivory) ~ elevation),
data = complex_data,
chains = CHAINS, iter = ITER, warmup = WARMUP, seed = BAYES_SEED,
file = NULL # You could save results to a file
)
## Compiling Stan program...
## Start sampling
Let’s check our output. First, we’ll plot the posterior predictions. They overlap with our observed data pretty well so we can move on to our summary statistics. The elevationlow coefficient represents the difference in the means. Because the 95% interval for elevationlow includes zero, there is not strong evidence for a difference between groups. Our Bayesian results support our previous frequentist results.
#posterior predictive check
pp_check(BRMout)
## Using 10 posterior draws for ppc type 'dens_overlay' by default.
#looks good
summary(BRMout)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: sqrt(herbivory) ~ elevation
## Data: complex_data (Number of observations: 179)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 2.85 0.24 2.38 3.32 1.00 4235 3142
## elevationlow -0.31 0.33 -0.94 0.34 1.00 4057 3274
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 2.24 0.12 2.01 2.48 1.00 3526 3172
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
brm_data <- as.data.frame(BRMout)
# Plot the estimated difference in means
mcmc_intervals(brm_data, regex_pars = "elevationlow", prob = 0.95)
## Warning: `prob_outer` (0.9) is less than `prob` (0.95)
## ... Swapping the values of `prob_outer` and `prob`
JAGS (Just Another Gibbs Sampler) was written in C++ by Martyn Plummer in 2007. It’s a stand alone program for analyzing Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation
#rjags wants its data input in a list, rather than a data.frame
model_data <- with(complex_data,
list(herb = sqrt(herbivory), #sqrt herbivory
elevation = ifelse(elevation == "high", 1, 0), #elevation transformed to 1 or 0
N = length(elevation))) #Total number of observations
#JAGS code is similar to R code, but notice here that instead of "~" relating two variables as in R, it represents sampling from a distribution.
sim_model <- "model{
for(i in 1:N){
herb[i] ~ dnorm(mu[i], tau)
mu[i] <- alpha + beta1 * elevation[i]
}
#weakly informative priors
alpha ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
tau ~ dgamma(2, 0.1)
}
"
writeLines(sim_model, con="t_model.txt")
jagsmodel <- jagsUI::jags(data = model_data,
n.adapt = 2000,
n.burnin = 1000,
n.iter = 10000,
n.chains = 4,
modules = "glm",
model.file = "t_model.txt",
parameters.to.save = c("alpha", "beta1", "tau"),
verbose = FALSE)
Before doing any else, it is best to make sure the Bayesian model actually worked. Recall that these models are not producing a single coefficient estimate. They are producing thousands of estimates and we want to make sure that all of these estimates converged on one answer. We can do this by examine the traceplots, which should look like fuzzy caterpillars to reflect the “random walk” process. We can also examine the gelman rubin diagnostic (rhat in jagsUI). We want this to be very close to 1 and I mean VERY close. Values like 0.99 or 1.01 might not be terrible, but they indicate something is probably wrong. Values like 0.95 are bad. Here I show the traceplot and rhat for alpha only. You should examine these for all of the variables.
plot(jagsmodel$sims.list$alpha, type = "l")
jagsmodel$Rhat$alpha
## [1] 0.9999817
Lets plot the both posterior distributions (low is black, high is red). Remember that these posteriors are one of the best things about Bayesian analyses. They are scaled so that the integral of each of curve sums to 1 and thus can be interpreted in probabilistic terms. For instance, if an entire posterior distribution is above 1, as both are here, we can that given the data, there is a ~100% probability that the true mean sqrt(herbivory) of both elevations is above 1.
plot(density(jagsmodel$sims.list$alpha), xlim = c(0,5), ylim = c(0,2))
lines(density((jagsmodel$sims.list$alpha +
jagsmodel$sims.list$beta1)), col = "red")
We can also make relative statements. For instance, we can say that there is a 82% chance that sqrt herbivory is greater at high elevations. We can do this by taking all the samples for low elevation (alpha) and comparing them will of of the samples of high elevation (alpha + beta). The below code makes instances where high > low = 1 and instance where low > high = 0, then adds them all up, and divides it my the total samples. This returns the probability that high is greater than low.
low_samples <- jagsmodel$sims.list$alpha
high_samples <- jagsmodel$sims.list$alpha + jagsmodel$sims.list$beta1
sum(ifelse(low_samples < high_samples, 1, 0))/length(low_samples)
## [1] 0.8246111
We can make minor changes to the above model to make this a Bayesian ANOVA. I will write a model below to examine the difference in sqrt herbivory between the three different genera of plants we measured.
Unlike the previous functions we have used for frequentist models (aov, glm, etc.), where we can simply code genus as a factor and write “sqrt(herb) ~ genus”, when we write our own model we have to dummy code our data. This is what functions like glm are actually doing under the hood when we give them a categorical variable with more than 2 levels. We need to make a new column for every level in our categorical variable. In each column a 1 means that species is present and a 0 means it is absent. This is probably easier to demonstrate. Take a look at the data before and after running the following code.
anova_data <- complex_data |>
mutate(ones = 1, fill = 0) |>
spread(key = genus, value = ones, fill = 0) |>
select(herbivory, Miconia, Piper, Psychotria)
Now that we have reformatted our data, let’s put it in the form that jags wants.
model_data <- with(anova_data,
list(herb = sqrt(herbivory), #sqrt herbivory
miconia = Miconia,
piper = Piper,
psychotria = Psychotria,
N = length(herbivory))) #Total number of observations
Now let’s write an ANOVA. Like before we can that sqrt(herbivory) is drawn from a normal distribution, with a mean (mu) and variance (sigma). We then predict the mean using the three plant species. Beta1 corresponds to Miconia, beta2 corresponds to Piper, and beta3 corresponds to Psychotria. So if an observation in the data is of Miconia the model is mu <- beta1 * 1 + beta2 * 0 + beta3 * 0, thus beta1 is the coefficient for the mean herbivory of Miconia. The other two plants work the same way.
sim_model <- "model{
for (i in 1:N){
herb[i] ~ dnorm(mu[i], sigma)
mu[i] <- beta1 * miconia[i] + beta2 * piper[i] + beta3 * psychotria[i]
}
#weakly informative priors
beta1 ~ dnorm(0, 0.001)
beta2 ~ dnorm(0, 0.001)
beta3 ~ dnorm(0, 0.001)
sigma ~ dgamma(2, 0.1)
}
"
writeLines(sim_model, con="aov_model.txt")
jags_aov <- jagsUI::jags(data = model_data,
n.adapt = 2000,
n.burnin = 1000,
n.iter = 10000,
n.chains = 4,
modules = "glm",
model.file = "aov_model.txt",
parameters.to.save = c("beta1", "beta2", "beta3", "sigma"),
verbose = FALSE)
Sometimes plotting more than two posteriors together can be overwhelming to look at, so we can represent posteriors like this as well. Here we see the median and 95% credible intervals. Credible intervals are not the same thing as confidence intervals! Often if 95% confidence intervals do not overlap people say that there is a greater than 95% chance that those groups are different BUT THIS IS WRONG. We can only make probability statements like this using Bayesian analysis. For example, we can say that There is a greater than 95% probability that Piper and Psychotria are different because the credible intervals do not overlap.
plot_dat <- data.frame('genus' = c("Miconia", "Piper", "Psychotria"), #jags outputs can be a huge pain
'median' = c(jags_aov$q50[[1]], jags_aov$q50[[2]], jags_aov$q50[[3]]),
'q2.5' = c(jags_aov$q2.5[[1]], jags_aov$q2.5[[2]], jags_aov$q2.5[[3]]),
'q97.5' = c(jags_aov$q97.5[[1]], jags_aov$q97.5[[2]], jags_aov$q97.5[[3]]))
ggplot(data=plot_dat) +
geom_errorbar(aes(genus, ymin = q2.5, ymax = q97.5), width = 0) +
geom_point(aes(genus, median), pch = 21, fill = "gray65") +
labs(y="Sqrt herbivory") +
theme_classic()
This was very brief introduction to priors and posterior distributions, but in order to perform Bayesian analyses or understand publications these concepts are of fundamental importance. I am always happy to talk about implementing these models. I have added a fairly thorough demonstration of a hierarchical model in this lab’s folder. I think that Bayesian hierarchical models are rad and a great way of accounting for many of the problems in ecological data.
Check for Understanding:
lm()
appropriate for these data? - why or why not? c) Find
a more suitable distribution and run a glm()
and interpret
its outputs.model1 <- lm(diversity ~ area, data = complex_data) #for example
summary(model1)
##
## Call:
## lm(formula = diversity ~ area, data = complex_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6293 -1.0092 0.0392 1.0262 3.4658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7423063 0.1627080 10.708 < 2e-16 ***
## area 0.0014255 0.0005118 2.785 0.00593 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.367 on 177 degrees of freedom
## Multiple R-squared: 0.04199, Adjusted R-squared: 0.03658
## F-statistic: 7.759 on 1 and 177 DF, p-value: 0.005928
hist(complex_data$diversity)
hist(resid(model1))
1a. Discrete probability distribution. Given that diversity is an example of count data and can’t be less than zero, it is reasonable to assume that samples of diversity are drawn from a poisson distribution. Review here
1b. No, because linear models assume normally distributed errors and I believe the appropriate error distribution for this is a non-normal (specifically, I will try out Poisson).
1c.
poisson_model <- glm(diversity ~ area, data = complex_data, family = 'poisson')
summary(poisson_model)
##
## Call:
## glm(formula = diversity ~ area, family = "poisson", data = complex_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5791240 0.0824885 7.021 2.21e-12 ***
## area 0.0006157 0.0002343 2.628 0.00859 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 208.48 on 178 degrees of freedom
## Residual deviance: 202.00 on 177 degrees of freedom
## AIC: 615.84
##
## Number of Fisher Scoring iterations: 5
hist(resid(poisson_model))
predict(poisson_model, newdata = data.frame(area = 1000), type = "response")
## 1
## 3.303117
An increase in area of 1,000 units (near the upper limit of area recorded) is predicted to result in an extra 3.3 species sampled.
changing the t-test priors
model_data <- with(complex_data,
list(herb = sqrt(herbivory), #sqrt herbivory
elevation = ifelse(elevation == "high", 1, 0), #elevation transformed to 1 or 0
N = length(elevation))) #Total number of observations
old_model <- "model{
for(i in 1:N){
herb[i] ~ dnorm(mu[i], tau)
mu[i] <- alpha + beta1 * elevation[i]
}
#weakly informative priors
alpha ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
tau ~ dgamma(2, 0.1)
}
"
#setting the prior for alpha to 5, and the prior for beta to -50, both with sd changed from 1/.001 to 1/.1
new_model <- "model{
for(i in 1:N){
herb[i] ~ dnorm(mu[i], tau)
mu[i] <- alpha + beta1 * elevation[i]
}
#weakly informative priors
alpha ~ dnorm(5, 0.1)
beta1 ~ dnorm(-50, 0.1)
tau ~ dgamma(2, 0.1)
}
"
writeLines(old_model, con="t_model.old.txt")
writeLines(new_model, con="t_model.new.txt")
jagsold <- jagsUI::jags(data = model_data,
n.adapt = 2000,
n.burnin = 1000,
n.iter = 10000,
n.chains = 4,
modules = "glm",
model.file = "t_model.old.txt",
parameters.to.save = c("alpha", "beta1", "tau"),
verbose = FALSE)
jagsnew <- jagsUI::jags(data = model_data,
n.adapt = 2000,
n.burnin = 1000,
n.iter = 10000,
n.chains = 4,
modules = "glm",
model.file = "t_model.new.txt",
parameters.to.save = c("alpha", "beta1", "tau"),
verbose = FALSE)
ggplot()+
geom_density(aes(jagsold$sims.list$alpha, linetype = 'old model', color = 'alpha'))+
geom_density(aes(jagsold$sims.list$beta1, linetype = 'old model', color = 'beta1'))+
geom_density(aes(jagsnew$sims.list$alpha, linetype = 'updated model', color = 'alpha'))+
geom_density(aes(jagsnew$sims.list$beta1, linetype = 'updated model', color = 'beta1'))+
theme_classic()
How does this change the results and inference?
setting a negative prior for beta made the posterior beta1 more negative, and setting a positive prior for alpha made the posterior alpha more positive. Note that the breadth of the distribution had to be made much smaller to see these effects. with sd = 1/.001 (the second argument of dnorm in JAGS is the inverse of the standard deviation, not the sd as in dnorm in R) the old and new distributions were exactly the same. This is why some people take umbrage with the idea of priors, as they can shift data towards your biases (the Bayesian idea being that your biases are data that is worth incorporating into your models). This is not always an issue, and is a huge benefit in some cases, e.g. hierarchical models where the upper levels set the priors for nested levels of the model
Below is old code from the BEST package which was archived but can still be installed using devtools.
splt_herb <- split(complex_data$herbivory, complex_data$elevation)
priors <- list(muM = 10, muSD = 10)
#BESTmcmc here runs a Bayesian t-test. This can take awhile depending on your computer
#Other packages that can do this include brms and rjags, which we will explore in a little while
BESTout <- BESTmcmc(y1 = sqrt(splt_herb$high),
y2 = sqrt(splt_herb$low),
priors = priors,
parallel = F,
verbose = F)
####FOR MAC USERS: If you get error `Error in makePSOCKcluster(names = spec, ...) : Cluster setup failed. 3 of 3 workers failed to connect.` Run the following
#cl <- parallel::makeCluster(2, setup_strategy = "sequential")
## WORKAROUND: https://github.com/rstudio/rstudio/issues/6692
## Revert to 'sequential' setup of PSOCK cluster in RStudio Console on macOS and R 4.0.0
#if (Sys.getenv("RSTUDIO") == "1" && !nzchar(Sys.getenv("RSTUDIO_TERM")) &&
# Sys.info()["sysname"] == "Darwin" && getRversion() >= "4.0.0") {
# parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
#}
You will notice that this takes much longer to run - which reflects one downside (or perhaps constraint) to Bayesian models. This is in large part because of the MCMC sampling which, recall, take thousands of random samples to build probability distributions.
plot(BESTout, which="mean")
This shows a distribution of the estimated difference in herbivory
between high and low elevations. So while t.test()
gives us
one value, the bayes analysis gives thousands of values. The first thing
to notice is that the mean of this distribution (0.32) is very close to
the mean difference from the frequentist t-test (0.31). The best thing
about this is that we can say so much more! For instance since 83% of
the distribution is above 0 we can say that there is an 83& chance
that herbivory (well actually, the sqrt herbivory) is higher in high
elevations than low elevations. These continuous statements about
probability are one of the best things about about bayes.
A frequentist p-value of 0.05 is often misinterpreted as a 5% chance that the null hypothesis is true and thus a 95% chance that it is false (and that the alternate is true). There are a couple of problems with this. In this example, the p-value of 0.36 means that assuming that the null hypothesis is true (that there is no difference between high and low), we would expect results this extreme 36% of the time. Frequentist p-values are NOT a probability statement about the “truth” of the alternate hypothesis. In Bayes land, we can use these distributions to make probability statements. If 83% of the mean differences distribution are above 0, we can say that, given these data, there is an 83% chance there is a difference between groups.
A Bayesian analysis produces a distribution for each variable in the
model. Here you can see the summary statistics for each distribution
that was calculated. mu1
is the mean of the high elevation
group and sigma 1 is the variance around that mean. So yes we have a
distribution (with variance) for the estimated variance.
summary(BESTout)
## mean median mode HDI% HDIlo HDIup compVal %>compVal
## mu1 2.722 2.721 2.721 95 2.1785 3.264
## mu2 2.401 2.397 2.397 95 1.9802 2.834
## muDiff 0.321 0.321 0.272 95 -0.3427 0.965 0 83.3
## sigma1 2.300 2.296 2.312 95 1.8714 2.739
## sigma2 1.799 1.799 1.796 95 1.4010 2.193
## sigmaDiff 0.501 0.498 0.485 95 0.0208 1.002 0 97.9
## nu 22.195 13.855 7.210 95 2.4834 68.778
## log10nu 1.189 1.142 0.966 95 0.5826 1.893
## effSz 0.156 0.156 0.132 95 -0.1704 0.462 0 83.3
Above we used a specific wrapper function BestMCMC
, but
often Bayesian models are more complex. For example, Bayesian
hierarchical models are super useful tools to account for inherent
nested structure within our data, but models like these are typically
implemented using jags
or stan
(although you
can code your own sampler). These tools require you to write out the
full model. This can be difficult to do at first, BUT when you do this
the model becomes MUCH more understandable, because you wrote it.
Nothing is happening under the hood (at least when it comes to the
model).
Here is the same model above, but coded in jags
. All
jags models follow this format. You make a list of the data then write
everything in a for loop. Here we say that sqrt(herb)
is
drawn from a normal distribution, with a mean (mu) and a variance (tau).
We then say that this mean is predicted by our linear model (which is
the same as the t-test above). Pay attention to how I coded the data. I
said that high elevation = 1 and low elevation = 0. So, alpha represents
the mean of low elevation. This is because when an observations is “low”
in the data, the elevation = 0, so the model is
mu[i] <- alpha + beta1 * 0
, so beta1 goes away. When an
observation is “high” in the data, the elevation = 1, so the model is
mu[i] <- alpha + beta1 * 1
, so the beta1 here is
difference between low and high elevations. Recall that is the
exact same interpretation that aov()
or glm()
gives us.