Now that you have learned some of the most basic linear models we are going to add some complexity by introducing different error structures, mixed effects, and some basic Bayes. The Bayes stuff can be a pain to run on your computer so please reach out to me if you are having any trouble.

Set working directory.

setwd("...your_working_directory/")

Load in packages

library(lme4)
library(rjags) #Let me know if this gives you trouble
library(tidyverse)
library(brms)
library(bayesplot)
#library(jagsUI) #I do not recommend importing the full library

#BEST was removed from CRAN August, 2023 we can still install archived packages several ways
#url <- "https://cran.r-project.org/src/contrib/Archive/BEST/BEST_0.5.4.tar.gz"
#install.packages(url, type="source", repos=NULL)  
#require(devtools)
#install_version("BEST", version = "0.5.4", repos = "http://cran.us.r-project.org")
# https://cran.r-project.org/src/contrib/Archive/BEST/

library(BEST) 

Import data.

complex_data <- read.csv("Complex_models_data.csv")

Examine data.

colnames(complex_data) #Returns column names
## [1] "elevation" "genus"     "area"      "herbivory" "diversity" "plotID"    "surv"
head(complex_data) #Returns first 6 entries for each column
##   elevation genus     area herbivory diversity plotID surv
## 1      high Piper 400.0000        76         4      1    0
## 2      high Piper 133.3000         5         2      2    1
## 3      high Piper 171.5278         3         2      3    1
## 4      high Piper  38.0000         0         0      4    1
## 5      high Piper 825.0000         1         2      5    1
## 6      high Piper 900.0000         5         3      6    1

Summarize data.

summary(complex_data) #Returns summary of each column.
##   elevation            genus                area          herbivory       diversity         plotID     
##  Length:179         Length:179         Min.   :  12.5   Min.   : 0.00   Min.   :0.000   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.: 105.5   1st Qu.: 1.00   1st Qu.:1.000   1st Qu.: 8.00  
##  Mode  :character   Mode  :character   Median : 192.0   Median : 5.00   Median :2.000   Median :15.00  
##                                        Mean   : 247.4   Mean   :12.16   Mean   :2.095   Mean   :15.42  
##                                        3rd Qu.: 355.6   3rd Qu.:15.00   3rd Qu.:3.000   3rd Qu.:23.00  
##                                        Max.   :1093.8   Max.   :87.00   Max.   :6.000   Max.   :31.00  
##       surv       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.6872  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

You should notice that these data are very similar to last week’s data. In fact they are the same except for one additional column: “surv.” This column describes the survival of these plants measured at the end of the season. A value of 1 indicates if a plant was alive at the end of the season and 0 = dead. We will use these data to consider the glm() function and mixed models.

The Generalized Linear Model

Reading Resources:

The GLzM is a generalization of the ordinary linear model for response variables with error distributions other than normal. You have essentially performed GLzM in the previous lab without knowing it or specifying link functions. GLzM are linear models, but the use of a link function enables them to be adapted them to data with non-normal error distributions. A note on terminology: in many texts GLM refers to general linear models, and GlzM refers to generalized linear models. In R, lm() runs GLM’s, while glm() runs GLzM’s. This can be confusing!

Figure 1: generalized linear model link funcitons
Figure 1: generalized linear model link funcitons

Components of the Generalized Linear Model:

Click Here For Video Review

Some notes:

Before going too far, first let’s take a step back and think about what an error distribution is. Last week a common assumption in all of the basic linear models was that the residuals were normally distributed. Here is what this looks like in an equation:

\[y = \alpha + \beta X' + E\] \[E \sim Normal(0,\sigma^2)\] Here the response variable (\(y\)) equals the intercept (\(\alpha\)) + coefficients (\(\beta\)) * the predictors (\(X'\)) + the residuals. The residuals are the error that remains in your model after you have accounted for the effects of the predictor variables. As you can see the ordinary linear model expects these errors (\(E\)) to be normally distributed and on average be 0. This is a problem for many types of ecological data, like survival and counts, because the nature of the response variables creates wacky error distributions.

Enter the GLzM. This generalization of the linear model allows different types of error structures to be applied through the use of “link” functions. The link function essentially constrains the linear regression (the systematic component) to predict within the range of possible outcomes. Link function can be thought of transforming the model predictions in contrast to transforming the response variables (which we did in previous labs). For example, in logistic regression predictions from the linear model are transformed in such a way that they are constrained to fall between 0 and 1. So while the linear model allows values between -∞ and +∞, the link function ensures (“constricts”) predictions fall between 0 and 1.

The link functions vary depending on the error distribution. Generally speaking the error distributions draw from the exponential family of distributions (fig. 1)- for example, the binomial, Poisson, and gamma distributions.

For a brush up on the parameters, application and other details of these distributions refer to Appendix in Ecological Statistics Fox et al 2015 OR Ecological Models and Data in R by Bolker 2008 for a great review.

Table 1. Common error distribution in glm() for ecological data. Taken from Fox et al 2015
Table 1. Common error distribution in glm() for ecological data. Taken from Fox et al 2015

Exploring Generalized Linear Models: Binomial Distribution

Today we will focus on the binomial distribution. If you wish to see a great resource exploring count data (e.g. application of the Poisson distribution) Click here for an example. The math behind many of these functions is beyond the scope of this lab, however I recommend that you read about some as they are important for properly interpreting the coefficients your models produce. We will consider the binomial error distribution (i.e. logistic regression) in some detail below. Click here if you need a refresher on logistic regression.

Figure 2. Logistic Regression Overview.
Figure 2. Logistic Regression Overview.
Figure 3. Logistic Regression vs. Linear Regression
Figure 3. Logistic Regression vs. Linear Regression

Image Taken from here

So let’s get on with the lab. For this first section we are interested in the effect of herbivory on plant survival at the end of the year. Thinking about what you did last week, what would you do?

Probably something like this…

normal_model <- lm(surv ~ herbivory, data = complex_data)
summary(normal_model)
## 
## Call:
## lm(formula = surv ~ herbivory, data = complex_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85489  0.00212  0.10850  0.18173  0.68298 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.909813   0.030013   30.31   <2e-16 ***
## herbivory   -0.018308   0.001392  -13.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3316 on 177 degrees of freedom
## Multiple R-squared:  0.4943, Adjusted R-squared:  0.4914 
## F-statistic:   173 on 1 and 177 DF,  p-value: < 2.2e-16
plot(complex_data$surv, complex_data$herbivory)

Let’s examine the residuals

hist(residuals(normal_model))

They look a little wonky. There is definite left skew - variance is not constant. Let’s also examine what the model thinks using predict(). Predict() simply takes a value from your predictor variables (your real predictors by default) and produces the value that your model predicts. This is usually a great gut check to perform on models. You do not want your model predicting crazy things.

hist(predict(normal_model), main = "")

That’s real weird. Our model is predicting negative values. This is a problem because you cannot have a less than 0 chance of survival. Something is wrong here. No surprise. We know survival is a binary response variable coded with {0,1} binomial variable, so we shouldn’t have expected the error distribution to reflect a normal error distribution.

Let’s try a generalized linear model using the glm() function. The glm() function has the argument “family” that we use to specify the error distribution. We would use “Poisson” for count data and will use “binomial” for survival data. You can use ?family for details on how R uses these arguments

binom_model <- glm(surv ~ herbivory, family = binomial(link = "logit"), data = complex_data) 
# note that the default link function for binomial family is "logit". So you could just type `family="binomial"`- but here I'll specify it to emphasize "under the hood processes"
summary(binom_model)
## 
## Call:
## glm(formula = surv ~ herbivory, family = binomial(link = "logit"), 
##     data = complex_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.38193    0.64283   6.817 9.32e-12 ***
## herbivory   -0.33804    0.05422  -6.234 4.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 222.447  on 178  degrees of freedom
## Residual deviance:  77.003  on 177  degrees of freedom
## AIC: 81.003
## 
## Number of Fisher Scoring iterations: 7

Components of the summary output:

Click here for a useful resource explaining the various components of the summary() output for glm()

  • Residual Deviance: Deviance is defined by the difference between the log-likelihood of the saturated model and the log-likelihood of the fitted model. Essentially this value measures the extent to which the likelihood of the saturated model exceeds the likelihood of the proposed model. Low values imply that the fitted model and high values = poor fit. For an explanation on residual deviance and deviance residuals, watch these two videos video1 and video2. Deviance refers to the difference in the log-likelihood of the saturated and fitted model.

  • Deviance residuals: The R output shows the distribution of each samples contribution to residual deviance. Deviance residuals represent the square root of the contribution that each data point has to the overall residual deviance. Squaring these residuals and summing over all observations yields the residual deviance statistic mentioned above.

  • Null Deviance: Measure of the difference between the fitted model and the null model (an intercept only model). A low null deviance, relative to the measure of residual deviance, implies that the data can be modeled well merely using the intercept. If the null deviance is high, relative to the residual deviance, this implies the model you fit reduces error (deviance) when compared to an intercept only model. Note that it is possible to perform a significance test on this drop in deviance, similar to an F-test in a least-squares regression. The p-value is calculated using the difference in deviances from the null to the residual, and the difference in degrees of freedom from the null to the residual, using a chi-square distribution. It is not part of the output, but can be done. pchisq(binom_model$null.deviance - binom_model$deviance, binom_model$df.null - binom_model$df.residual, lower.tail=FALSE)

  • AIC (Akaike information criterion): information-theoretic measure that describes the quality of a model by taking into account the number of parameters (k) in the fitted model. This measure of model fit captures the trade-off between under and over fitting. AIC = -2*lnLmax + 2k, It has two terms: the deviance component -2*lnLmax (explained previously) and the estimated number of parameters (2k). Thus, AIC increases with the number of parameters and declines with the fit to the data. On its own the value is not useful, but it can help you evaluate models if you compare the AIC to other models. For a review of these concepts I recommend this video AND pg. 30-33 in Ecological Statistics by Fox et al. 2015.

  • Coefficients: This will be explained in detail below. Interpretation depends on the distribution and link function used.

Ok, back to our example. Let’s first examine the residuals.

hist(residuals(binom_model))

These reflect the spread of the deviance residuals we saw using summary(). Deviance residuals should be checked for symmetry: the median value should be close to zero, and the 1Q and 3Q values should be roughly equidistant from the median.

hist(predict(binom_model, type = "response"), main = "")

This is MUCH better. The predictions are all between 0 and 1. But how do we interpret our results? What does this mean?

Before moving on to interpretation, let’s take a quick look at what these models look like. Model 1 is the linear model (straight line) and model 2 is the logistic regression (sigmoid shape). Which looks better?

plot(0:87,predict(binom_model, newdata = data.frame(herbivory= 0:87), type = "response"), xlab = "Herbivory", ylab = "Probability of survival", type = "l", ylim = c(-1,1), col = 'blue')
lines(0:87,predict(normal_model, newdata = data.frame(herbivory= 0:87), type = "response"), col = 'red')
points(complex_data$herbivory, complex_data$surv)

Interpreting glm() coefficients

Recall that the coefficient for a continuous variable in an ordinary linear model should be interpreted as the predicted change in the response variable for every unit of change in the predictor. However this becomes more complicated when using link functions. normal_model <- lm(surv ~ herbivory, data = complex_data) (the worse model) can be straight forwardly interpreted because we do not use a link.

coef(normal_model)
## (Intercept)   herbivory 
##    0.909813   -0.018308

In this case model 1 is saying that for every increase in 1 unit of herbivory the model predicts a 1.8% decrease in survival. The intercept indicates that if there is no herbivory (i.e. the slope = 0) then we expect 90% of the plants to survive. BUT remember that this model is not a “good” model.

This is the better model, but unfortunately the logit link makes these coefficients more difficult to understand.

coef(binom_model)
## (Intercept)   herbivory 
##   4.3819318  -0.3380374

To interpret them we need to consider what the logit link is doing. Here is the equation for a logistic regression:

\[P = \frac{e^{\alpha + \beta X'}}{1 + e^{\alpha + \beta X'}}\]

This equation is what gives binom_model <- glm(surv ~ herbivory, family = binomial(link = "logit"), data = complex_data) its sigmoid shape, but you can see that the coefficients associated with the model (the alphas and betas) are also going through a transformation. This means that the coefficients the model produces are no longer on the easy linear scale we used before. In a binomial regression, coefficients are on the log odds scale.

So looking at the coefficients for binom_model we know that every unit of herbivory causes a log odds decrease in survival of 0.33. WTF does that mean? We can get this to odds using exp().

exp(0.33)
## [1] 1.390968

This tells us that for every 1 unit increase of herbivory that plant is 1.39 times less likely to survive. That’s more useful. But this is still not great because “more likely” is a relative term. You can extract the real probability from the model by using the predict() function. What is the probability of survival if you receive 20 units of herbivory?

predict(binom_model, newdata = data.frame(herbivory = 20), type = "response")
##          1 
## 0.08480241

Under the hood, this function is just plugging in the predicted value into the logistic regression equation (with the coefficients filled in). Here is what that looks like:

1/(1 + exp(-(4.3819318 + -0.3380374*20)))
## [1] 0.0848024

We can generate the full curve by predicting the full range of values and plotting them. This is the full relationship between herbivory and survival.

plot(0:87, predict(binom_model, newdata = data.frame(herbivory= 0:87), type = "response"), xlab = "Herbivory", ylab = "Probability of survival")

In summary, when you use links like the logit or log, a transformation is occurring under the hood that makes the data more difficult (but not impossible) to interpret.

Checking for Understanding

  1. Use the diversity variable, measured as richness (e.g. the count of unique species of herbivores feeding on the plant), as a response variable in a model using predictors of your choosing. a) What is a probable distribution from which diversity is drawn? b) Is a lm() appropriate for these data? - why or why not? c) Find a more suitable distribution and run a glm() and interpret its outputs.

Mixed Models

Reading Resource:

Online Tutorial:

Video Resources: (I struggled to find video tutorials I liked on this topic. If you can share ones you found and appreciated, that would be great!)

A Brief Background

Linear Mixed Effects (LME) Models are very useful tools in ecology because they help us account for many of the pitfalls that belie ecological data. I will not get into pseudoreplication and non-independence among observations in detail here, but these are issues that should always be considered. For data sets with grouped designs (nested/hierarchical) or repeated measures across individuals, mixed models can help out account for variation from those groups. For example, a random intercept may be used to account for variation across sites from which measures were taken. This effectively accounts for variation across sites in their baseline levels of the response (y-intercept). Mixed effect models are essentially regression models with added components that account for variation in the intercept and/or slope parameters. The linear model contains fixed effects that reflect the mean intercept and mean slopes while mixed effects approach includes both fixed effects and random effects hence “mixed effects”. One of the most crucial considerations when building mixed models is deciding what is a random effect and what isn’t. Often people treat fixed effects as “what we care about” and random effects as “other potential data tomfoolery to account for” but this isn’t quite right.

A random effect is a parameter (either slope or intercept) in the model that varies across some level of a variable. The model would still contain a fixed intercept, representing the mean intercept for all levels. Each level then gets their own estimated random effect (from the distribution of random effects), representing an adjustment from the mean that is unique to the level.

Note that different people have different perspectives on which variables should be considered random and which are fixed. Generally, a variable can be considered random when it represents a random sampling of factors from a population (E.g. if you took multiple measurements within specimens selected randomly from their population, within dates selected randomly from the year, within sites selected randomly from a country, etc.) and when you don’t care about the effect of an individual random factor on your response variable, but still want to account for variation in the response caused by the random variable.

When should a random effect be a random slope, random intercept, or random slope and a random intercept? It depends of course. Fig. 4 below shows examples taken from Harrison et al. 2018 shows an example a) intercept changes between groups or b) the slope (which is the effect of a predictor variable) and the intercept varies between groups. The random slope and random intercept model is the most conservative but also takes the most data to fit.

Figure 4: Illustration of models with random slopes or random slopes AND intercepts Taken from Harrison 2018.
Figure 4: Illustration of models with random slopes or random slopes AND intercepts Taken from Harrison 2018.

Hypothetical Example: Drawing inspiration from Lord of the Rings

Here we are interested in the difference in mass between 5 measured groups. Lets say they are “ents”. We measured 25 ents in 5 groups for 125 total individuals. Let’s look at a couple of ways to model this:

Let’s add another layer of complexity. Let’s now say we measured 5 ent groups in 3 forests (for 15 total ent groups). Yes I know ents are only in Fangorn forest. But let’s say that we found them in 3 forests. Are the 5 groups (125 individuals) we measured in each forest truly independent? It seems likely that ents from the same forest will be more similar than ents from other forests, so while we have 15 groups, it is really more like 3. This is one type of pseudoreplication and we need to account for it. In mixed models we can specify how data are nested. Example:

Both of the mixed models presented thus far have been a little unusual in that they are only intercept models (ANOVAs essentially). Let’s get back to our data set.

Example using our data set

I want to answer this question: how does herbivory impact plant survival at the end of the growing season? Remember that we measured three genera of plants in plots at two different elevations. So what do you think is fixed and what is random? Notice we will be modeling survival (the binomial variable we used in the example for generalized linear models).

binom_mixed <- glmer(surv ~ herbivory + (1|genus), family = "binomial", data = complex_data)
## boundary (singular) fit: see help('isSingular')
summary(binom_mixed)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
##  Family: binomial  ( logit )
## Formula: surv ~ herbivory + (1 | genus)
##    Data: complex_data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##      83.0      92.6     -38.5      77.0       176 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.3866 -0.0026  0.1324  0.2198  3.2851 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  genus  (Intercept) 0        0       
## Number of obs: 179, groups:  genus, 3
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.38193    0.64284   6.817 9.33e-12 ***
## herbivory   -0.33804    0.05423  -6.234 4.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## herbivory -0.885
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
exp(0.33)
## [1] 1.390968

This model treats genus as a random effect. A debate for sure. One could argue that this is “something we care about” while another says that if our question is about plants broadly these three genera are just samples of a larger plant species pool. If our question was how does survival vary among Miconia, Piper, and Psychotria it would be fixed.

Our model is overfitted (as indicated by the warning: boundary (singular) fit: see ?isSingular) – that is, the random effects structure is too complex to be supported by the data, hence a model without the random effects structure is likely better. If we were to interpret the coefficients anyway, we would draw the following conclusions (remember survival is a binary outcome with a logit link, so the raw estimates are on the log-odds scale):

Below we try different models where binom_mixed includes a random intercept model fixes the slope to be the same between groups and binom_random includes a random slope and random intercept model. There is no answer for all situations and which to use should be a decision made by thinking about your data and question. In R code it looks like this:

#Random intercept
binom_mixed <- glmer(surv ~ herbivory + (1|genus), family = "binomial", data = complex_data)
## boundary (singular) fit: see help('isSingular')
summary(binom_mixed)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
##  Family: binomial  ( logit )
## Formula: surv ~ herbivory + (1 | genus)
##    Data: complex_data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##      83.0      92.6     -38.5      77.0       176 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.3866 -0.0026  0.1324  0.2198  3.2851 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  genus  (Intercept) 0        0       
## Number of obs: 179, groups:  genus, 3
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.38193    0.64284   6.817 9.33e-12 ***
## herbivory   -0.33804    0.05423  -6.234 4.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## herbivory -0.885
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
#Random slope & random intercept
binom_random <- glmer(surv ~ herbivory + (herbivory|genus), family = "binomial", data = complex_data)
## boundary (singular) fit: see help('isSingular')
summary(binom_random)
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
##  Family: binomial  ( logit )
## Formula: surv ~ herbivory + (herbivory | genus)
##    Data: complex_data
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##      87.0     102.9     -38.5      77.0       174 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.3866 -0.0026  0.1324  0.2198  3.2851 
## 
## Random effects:
##  Groups Name        Variance  Std.Dev.  Corr 
##  genus  (Intercept) 9.747e-12 3.122e-06      
##         herbivory   5.265e-14 2.294e-07 -1.00
## Number of obs: 179, groups:  genus, 3
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.38193    0.64284   6.817 9.33e-12 ***
## herbivory   -0.33804    0.05423  -6.234 4.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## herbivory -0.885
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

The outputs are the same because these are overfitted with the additional parameters.

Exploring Mixed Effects Models with Categorical Data

Try out this. It’s an rpub with examples of mixed effects models applied to simulated categorical genetic data.

Check for Understanding

  1. In which of the situations below would you choose to model data with: a) Random slope b) Random Intercept c) Random slope and intercept

Adapted from https://www.youtube.com/watch?v=-4K-kKXNths&t=386s

A Bayesian Approach to Linear Regression

Resources:

Quick Overviews:

A Deeper Dive:

This section, which applies Bayesian Statistics, is intended to provide some example code and demonstrate that frequentist and Bayesian approaches often give the same answer (especially with a fair amount of data) and show how the interpretation of their outputs varies. I use Bayes frequently in my own work and would happily expand on Bayesian models in my office hours if some of you want to try hierarchical models when analyzing each others data.

Bayesian and frequentist statistics are two approaches to inference that use a different definition of probability. Just as suite of methods exist to frequentist statistics, Bayesian statistics can be applied to linear models, t-tests, hierarchical models, etc. In frequentist statistics, probability is derived from the frequency of an event occurring repeatedly. An event, as it may apply to an ecologist, is some model parameter or measure of effect (e.g. slope coefficient - which, don’t forget, represents a hypothesis we are testing). In Bayesian statistics, you are concerned with the probability of an event occurring given (or conditional on) your prior knowledge. In our case, the event is again a parameter and the knowledge or information available to you, is your data. This is often simply put as:

This has some interesting implications when thinking about interpreting p-values and confidence/credible intervals. Bayesian analysis also uses priors, which is a common (and in my opinion a bit overblown) criticism. Finally, Bayesian analysis treats all variables as random samples drawn from a larger distribution. While frequentist analyses return single coefficient estimates, Bayesian analyses return probability distributions.

Components of Bayes Rule in the image below:

Figure 5. Generalization of Bayes Rule. Taken from Gotelli & Ellison 2013
Figure 5. Generalization of Bayes Rule. Taken from Gotelli & Ellison 2013

Put another way…

Figure 6: Components of Bayes Rule
Figure 6: Components of Bayes Rule

Recall the differences among Bayesian and frequentist approaches.

Table 2: Comparisons among Bayesian and Frequentist approaches
Table 2: Comparisons among Bayesian and Frequentist approaches

Now that you hopefully feel more comfortable with the Bayesian definition of probability, what is the process underlying how parameters are estimated? Markov Chain Monte Carlo (MCMC) sampling is central to the implementation of Bayes probability theory to estimate parameters. MCMC functions to approximate the parameter given the joint distribution of the prior and data. A Visual of MCMC AND A Visual of MCMC chains using trace plots. Note that this video uses a particular algorithm for sampling-Metropolis-Hastings-but many algorithms exist including Gibbs Sampling- which you will see is often applied in ecology.

Figure 7: Process of MCMC sampling to estimate posterior distribution
Figure 7: Process of MCMC sampling to estimate posterior distribution

Figure Taken from this video lecture which provides an overview of MCMC sampling

Bayesian t-test

Let’s return to the t-test we performed last week. We were interested in the difference in herbivory between high and low elevation plots. We also used a sqrt() transformation.

freq_t.test <- t.test(sqrt(herbivory) ~ elevation, data = complex_data)
freq_t.test
## 
##  Welch Two Sample t-test
## 
## data:  sqrt(herbivory) by elevation
## t = 0.91792, df = 169.95, p-value = 0.36
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
##  -0.3515798  0.9627352
## sample estimates:
## mean in group high  mean in group low 
##           2.845379           2.539801

Now let’s do a Bayesian t-test. You will notice right away that you need to include priors for two parameters: the mean and variance. Here we are saying that based on our prior information, we think the mean or standard deviation of each group is somewhere in the range of this distribution. Click here for examples on the different forms a prior distribution can take. Start time 17:16 min AND Click here to see how changes in the prior influence to the posterior density distribution. How do you know how to define your prior distribution? For an informed prior you can consult literature, talk to experts, use previous data. Alternatively there are less informed priors: uniform priors (same probability across all values), or weakly informed priors that specify a reasonable distribution. “Uninformative” prior is a bit of a misnomer, but shows up in the literature.

plot(density(rnorm(1000, 10, 10)), main ="")

Although this distribution is broad, this is actually a pretty informative prior. Often Bayesian analyses will reduce the influence of their prior by setting “weakly informative” or “uniform priors”. weakly informative priors are when the prior is something like a normal distribution with a mean of 0 and an sd of 10,000. Essentially we are saying that any value is possible, however less extreme values are slightly more likely. Uniform priors state that all values are equally probable. Priors are important and should be carefully considered in any analysis. Personally I think they are a strength of Bayesian analyses.

Stan is a library for analyzing Bayesian models using No-U-Turn sampler (a variant of Hamiltonian Monte Carlo). It’s written in C++ and was developed by an entire team.

Here, we are using the BRMS package and function brm to fit a Bayesian model. We can adjust chains, iterations, and cores based on available computing power. For now, we’ll let R detect how many cores to use.

#Prior for bayesian t-test
# brm will select these defaults, but you can always set them in the function itself
get_prior(bf(herbivory ~ elevation), data = complex_data)
##                 prior     class         coef group resp dpar nlpar lb ub       source
##                (flat)         b                                               default
##                (flat)         b elevationlow                             (vectorized)
##  student_t(3, 5, 7.4) Intercept                                               default
##  student_t(3, 0, 7.4)     sigma                                     0         default
CHAINS <- 4
ITER <- 2000
WARMUP <- 1000
BAYES_SEED <- 1234
options(mc.cores = parallel::detectCores())  # Depending on your computer, you can always change this in the brm function, too

BRMout <- brms::brm(
  bf(sqrt(herbivory) ~ elevation),
  data = complex_data,
  chains = CHAINS, iter = ITER, warmup = WARMUP, seed = BAYES_SEED,
  file = NULL  # You could save results to a file
)
## Compiling Stan program...
## Start sampling

Let’s check our output. First, we’ll plot the posterior predictions. They overlap with our observed data pretty well so we can move on to our summary statistics. The elevationlow coefficient represents the difference in the means. Because the 95% interval for elevationlow includes zero, there is not strong evidence for a difference between groups. Our Bayesian results support our previous frequentist results.

#posterior predictive check
pp_check(BRMout)
## Using 10 posterior draws for ppc type 'dens_overlay' by default.

#looks good

summary(BRMout)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: sqrt(herbivory) ~ elevation 
##    Data: complex_data (Number of observations: 179) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Regression Coefficients:
##              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept        2.85      0.24     2.38     3.32 1.00     4235     3142
## elevationlow    -0.31      0.33    -0.94     0.34 1.00     4057     3274
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     2.24      0.12     2.01     2.48 1.00     3526     3172
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
brm_data <- as.data.frame(BRMout)

# Plot the estimated difference in means
mcmc_intervals(brm_data, regex_pars = "elevationlow", prob = 0.95)
## Warning: `prob_outer` (0.9) is less than `prob` (0.95)
## ... Swapping the values of `prob_outer` and `prob`

JAGS (Just Another Gibbs Sampler) was written in C++ by Martyn Plummer in 2007. It’s a stand alone program for analyzing Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation

#rjags wants its data input in a list, rather than a data.frame
model_data <- with(complex_data,
                   list(herb      = sqrt(herbivory), #sqrt herbivory
                        elevation = ifelse(elevation == "high", 1, 0), #elevation transformed to 1 or 0
                        N         = length(elevation))) #Total number of observations
#JAGS code is similar to R code, but notice here that instead of "~" relating two variables as in R, it represents sampling from a distribution. 
sim_model <- "model{

          for(i in 1:N){  
              herb[i] ~ dnorm(mu[i], tau)
              mu[i] <- alpha + beta1 * elevation[i]
          }   

          #weakly informative priors
          
          alpha ~  dnorm(0, 0.001)
          beta1 ~ dnorm(0, 0.001)
          tau ~ dgamma(2, 0.1)
          
}   
"
writeLines(sim_model, con="t_model.txt")

jagsmodel <- jagsUI::jags(data = model_data,
                    n.adapt = 2000,
                    n.burnin = 1000,
                    n.iter = 10000,
                    n.chains = 4,
                    modules = "glm",
                    model.file = "t_model.txt",
                    parameters.to.save = c("alpha", "beta1", "tau"),
                    verbose = FALSE)

Before doing any else, it is best to make sure the Bayesian model actually worked. Recall that these models are not producing a single coefficient estimate. They are producing thousands of estimates and we want to make sure that all of these estimates converged on one answer. We can do this by examine the traceplots, which should look like fuzzy caterpillars to reflect the “random walk” process. We can also examine the gelman rubin diagnostic (rhat in jagsUI). We want this to be very close to 1 and I mean VERY close. Values like 0.99 or 1.01 might not be terrible, but they indicate something is probably wrong. Values like 0.95 are bad. Here I show the traceplot and rhat for alpha only. You should examine these for all of the variables.

plot(jagsmodel$sims.list$alpha, type = "l")

jagsmodel$Rhat$alpha
## [1] 0.9999817

Lets plot the both posterior distributions (low is black, high is red). Remember that these posteriors are one of the best things about Bayesian analyses. They are scaled so that the integral of each of curve sums to 1 and thus can be interpreted in probabilistic terms. For instance, if an entire posterior distribution is above 1, as both are here, we can that given the data, there is a ~100% probability that the true mean sqrt(herbivory) of both elevations is above 1.

plot(density(jagsmodel$sims.list$alpha), xlim = c(0,5), ylim = c(0,2))
lines(density((jagsmodel$sims.list$alpha +
              jagsmodel$sims.list$beta1)), col = "red")

We can also make relative statements. For instance, we can say that there is a 82% chance that sqrt herbivory is greater at high elevations. We can do this by taking all the samples for low elevation (alpha) and comparing them will of of the samples of high elevation (alpha + beta). The below code makes instances where high > low = 1 and instance where low > high = 0, then adds them all up, and divides it my the total samples. This returns the probability that high is greater than low.

low_samples <- jagsmodel$sims.list$alpha
high_samples <- jagsmodel$sims.list$alpha + jagsmodel$sims.list$beta1

sum(ifelse(low_samples < high_samples, 1, 0))/length(low_samples)
## [1] 0.8246111
Bayesian ANOVA

We can make minor changes to the above model to make this a Bayesian ANOVA. I will write a model below to examine the difference in sqrt herbivory between the three different genera of plants we measured.

Unlike the previous functions we have used for frequentist models (aov, glm, etc.), where we can simply code genus as a factor and write “sqrt(herb) ~ genus”, when we write our own model we have to dummy code our data. This is what functions like glm are actually doing under the hood when we give them a categorical variable with more than 2 levels. We need to make a new column for every level in our categorical variable. In each column a 1 means that species is present and a 0 means it is absent. This is probably easier to demonstrate. Take a look at the data before and after running the following code.

anova_data <- complex_data |> 
  mutate(ones = 1, fill = 0) |> 
  spread(key = genus, value = ones, fill = 0) |>  
  select(herbivory, Miconia, Piper, Psychotria)

Now that we have reformatted our data, let’s put it in the form that jags wants.

model_data <- with(anova_data, 
                   list(herb = sqrt(herbivory), #sqrt herbivory
                        miconia = Miconia,
                        piper = Piper,
                        psychotria = Psychotria,
                        N = length(herbivory))) #Total number of observations

Now let’s write an ANOVA. Like before we can that sqrt(herbivory) is drawn from a normal distribution, with a mean (mu) and variance (sigma). We then predict the mean using the three plant species. Beta1 corresponds to Miconia, beta2 corresponds to Piper, and beta3 corresponds to Psychotria. So if an observation in the data is of Miconia the model is mu <- beta1 * 1 + beta2 * 0 + beta3 * 0, thus beta1 is the coefficient for the mean herbivory of Miconia. The other two plants work the same way.

sim_model <- "model{

          for (i in 1:N){  
              herb[i] ~ dnorm(mu[i], sigma)
              mu[i] <- beta1 * miconia[i] + beta2 * piper[i] + beta3 * psychotria[i]
          }   

          #weakly informative priors
          
          beta1 ~ dnorm(0, 0.001)
          beta2 ~ dnorm(0, 0.001)
          beta3 ~ dnorm(0, 0.001)
          sigma ~ dgamma(2, 0.1)
          
}   
"

writeLines(sim_model, con="aov_model.txt")
jags_aov <- jagsUI::jags(data = model_data,
                    n.adapt = 2000,
                    n.burnin = 1000,
                    n.iter = 10000,
                    n.chains = 4,
                    modules = "glm",
                    model.file = "aov_model.txt",
                    parameters.to.save = c("beta1", "beta2", "beta3", "sigma"),
                    verbose = FALSE)

Sometimes plotting more than two posteriors together can be overwhelming to look at, so we can represent posteriors like this as well. Here we see the median and 95% credible intervals. Credible intervals are not the same thing as confidence intervals! Often if 95% confidence intervals do not overlap people say that there is a greater than 95% chance that those groups are different BUT THIS IS WRONG. We can only make probability statements like this using Bayesian analysis. For example, we can say that There is a greater than 95% probability that Piper and Psychotria are different because the credible intervals do not overlap.

plot_dat <- data.frame('genus'  = c("Miconia", "Piper", "Psychotria"), #jags outputs can be a huge pain
                       'median' = c(jags_aov$q50[[1]],   jags_aov$q50[[2]],   jags_aov$q50[[3]]),
                       'q2.5'   = c(jags_aov$q2.5[[1]],  jags_aov$q2.5[[2]],  jags_aov$q2.5[[3]]),
                       'q97.5'  = c(jags_aov$q97.5[[1]], jags_aov$q97.5[[2]], jags_aov$q97.5[[3]]))

ggplot(data=plot_dat) +
  geom_errorbar(aes(genus, ymin = q2.5, ymax = q97.5), width = 0) +
  geom_point(aes(genus, median), pch = 21, fill = "gray65") +
  labs(y="Sqrt herbivory") +
  theme_classic()

This was very brief introduction to priors and posterior distributions, but in order to perform Bayesian analyses or understand publications these concepts are of fundamental importance. I am always happy to talk about implementing these models. I have added a fairly thorough demonstration of a hierarchical model in this lab’s folder. I think that Bayesian hierarchical models are rad and a great way of accounting for many of the problems in ecological data.

Check for Understanding:

  1. Change the priors on one of these models, plot the prior distribution, and then re-run the model. How does this change the results and inference?

Answers

  1. Use the diversity variable, measured as richness (e.g. the count of unique species of herbivores feeding on the plant), as a response variable in a model using predictors of your choosing. a) What is a probable distribution from which diversity is drawn? b) Is a lm() appropriate for these data? - why or why not? c) Find a more suitable distribution and run a glm() and interpret its outputs.
model1 <- lm(diversity ~ area, data = complex_data) #for example
summary(model1)
## 
## Call:
## lm(formula = diversity ~ area, data = complex_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6293 -1.0092  0.0392  1.0262  3.4658 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.7423063  0.1627080  10.708  < 2e-16 ***
## area        0.0014255  0.0005118   2.785  0.00593 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.367 on 177 degrees of freedom
## Multiple R-squared:  0.04199,    Adjusted R-squared:  0.03658 
## F-statistic: 7.759 on 1 and 177 DF,  p-value: 0.005928
hist(complex_data$diversity)

hist(resid(model1))

1a. Discrete probability distribution. Given that diversity is an example of count data and can’t be less than zero, it is reasonable to assume that samples of diversity are drawn from a poisson distribution. Review here

1b. No, because linear models assume normally distributed errors and I believe the appropriate error distribution for this is a non-normal (specifically, I will try out Poisson).

1c.

poisson_model <- glm(diversity ~ area, data = complex_data, family = 'poisson') 
summary(poisson_model)
## 
## Call:
## glm(formula = diversity ~ area, family = "poisson", data = complex_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.5791240  0.0824885   7.021 2.21e-12 ***
## area        0.0006157  0.0002343   2.628  0.00859 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 208.48  on 178  degrees of freedom
## Residual deviance: 202.00  on 177  degrees of freedom
## AIC: 615.84
## 
## Number of Fisher Scoring iterations: 5
hist(resid(poisson_model))

predict(poisson_model, newdata = data.frame(area = 1000), type = "response")
##        1 
## 3.303117

An increase in area of 1,000 units (near the upper limit of area recorded) is predicted to result in an extra 3.3 species sampled.

  1. In which of the situations below would you choose to model data with: a) Random slope b) Random Intercept c) Random slope and intercept Answer in image below
  1. Change the priors on one of these models, plot the prior distribution, and then re-run the model. How does this change the results and inference?

changing the t-test priors

model_data <- with(complex_data,
                   list(herb      = sqrt(herbivory), #sqrt herbivory
                        elevation = ifelse(elevation == "high", 1, 0), #elevation transformed to 1 or 0
                        N         = length(elevation))) #Total number of observations

old_model <- "model{

          for(i in 1:N){  
              herb[i] ~ dnorm(mu[i], tau)
              mu[i] <- alpha + beta1 * elevation[i]
          }   

          #weakly informative priors
          
          alpha ~  dnorm(0, 0.001)
          beta1 ~ dnorm(0, 0.001)
          tau ~ dgamma(2, 0.1)
          
}   
"
#setting the prior for alpha to 5, and the prior for beta to -50, both with sd changed from 1/.001 to 1/.1
new_model <- "model{

          for(i in 1:N){  
              herb[i] ~ dnorm(mu[i], tau)
              mu[i] <- alpha + beta1 * elevation[i]
          }   

          #weakly informative priors
          
          alpha ~  dnorm(5, 0.1)
          beta1 ~ dnorm(-50, 0.1)
          tau ~ dgamma(2, 0.1)
          
}   
"
writeLines(old_model, con="t_model.old.txt")
writeLines(new_model, con="t_model.new.txt")

jagsold <- jagsUI::jags(data = model_data,
                    n.adapt = 2000,
                    n.burnin = 1000,
                    n.iter = 10000,
                    n.chains = 4,
                    modules = "glm",
                    model.file = "t_model.old.txt",
                    parameters.to.save = c("alpha", "beta1", "tau"),
                    verbose = FALSE)

jagsnew <- jagsUI::jags(data = model_data,
                    n.adapt = 2000,
                    n.burnin = 1000,
                    n.iter = 10000,
                    n.chains = 4,
                    modules = "glm",
                    model.file = "t_model.new.txt",
                    parameters.to.save = c("alpha", "beta1", "tau"),
                    verbose = FALSE)


ggplot()+
  geom_density(aes(jagsold$sims.list$alpha, linetype = 'old model', color = 'alpha'))+
  geom_density(aes(jagsold$sims.list$beta1, linetype = 'old model', color = 'beta1'))+
  geom_density(aes(jagsnew$sims.list$alpha, linetype = 'updated model', color = 'alpha'))+
  geom_density(aes(jagsnew$sims.list$beta1, linetype = 'updated model', color = 'beta1'))+
  theme_classic()

How does this change the results and inference?

setting a negative prior for beta made the posterior beta1 more negative, and setting a positive prior for alpha made the posterior alpha more positive. Note that the breadth of the distribution had to be made much smaller to see these effects. with sd = 1/.001 (the second argument of dnorm in JAGS is the inverse of the standard deviation, not the sd as in dnorm in R) the old and new distributions were exactly the same. This is why some people take umbrage with the idea of priors, as they can shift data towards your biases (the Bayesian idea being that your biases are data that is worth incorporating into your models). This is not always an issue, and is a huge benefit in some cases, e.g. hierarchical models where the upper levels set the priors for nested levels of the model

Below is old code from the BEST package which was archived but can still be installed using devtools.

splt_herb <- split(complex_data$herbivory, complex_data$elevation)
priors <- list(muM = 10, muSD = 10)

#BESTmcmc here runs a Bayesian t-test. This can take awhile depending on your computer
#Other packages that can do this include brms and rjags, which we will explore in a little while
BESTout <- BESTmcmc(y1 = sqrt(splt_herb$high), 
                    y2 = sqrt(splt_herb$low),
                    priors = priors,
                    parallel = F,
                    verbose = F)


####FOR MAC USERS: If you get error `Error in makePSOCKcluster(names = spec, ...) : Cluster setup failed. 3 of 3 workers failed to connect.` Run the following
#cl <- parallel::makeCluster(2, setup_strategy = "sequential")
## WORKAROUND: https://github.com/rstudio/rstudio/issues/6692
## Revert to 'sequential' setup of PSOCK cluster in RStudio Console on macOS and R 4.0.0
#if (Sys.getenv("RSTUDIO") == "1" && !nzchar(Sys.getenv("RSTUDIO_TERM")) && 
#    Sys.info()["sysname"] == "Darwin" && getRversion() >= "4.0.0") {
#  parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
#}

You will notice that this takes much longer to run - which reflects one downside (or perhaps constraint) to Bayesian models. This is in large part because of the MCMC sampling which, recall, take thousands of random samples to build probability distributions.

plot(BESTout, which="mean")

This shows a distribution of the estimated difference in herbivory between high and low elevations. So while t.test() gives us one value, the bayes analysis gives thousands of values. The first thing to notice is that the mean of this distribution (0.32) is very close to the mean difference from the frequentist t-test (0.31). The best thing about this is that we can say so much more! For instance since 83% of the distribution is above 0 we can say that there is an 83& chance that herbivory (well actually, the sqrt herbivory) is higher in high elevations than low elevations. These continuous statements about probability are one of the best things about about bayes.

A frequentist p-value of 0.05 is often misinterpreted as a 5% chance that the null hypothesis is true and thus a 95% chance that it is false (and that the alternate is true). There are a couple of problems with this. In this example, the p-value of 0.36 means that assuming that the null hypothesis is true (that there is no difference between high and low), we would expect results this extreme 36% of the time. Frequentist p-values are NOT a probability statement about the “truth” of the alternate hypothesis. In Bayes land, we can use these distributions to make probability statements. If 83% of the mean differences distribution are above 0, we can say that, given these data, there is an 83% chance there is a difference between groups.

A Bayesian analysis produces a distribution for each variable in the model. Here you can see the summary statistics for each distribution that was calculated. mu1 is the mean of the high elevation group and sigma 1 is the variance around that mean. So yes we have a distribution (with variance) for the estimated variance.

summary(BESTout)
##             mean median  mode HDI%   HDIlo  HDIup compVal %>compVal
## mu1        2.722  2.721 2.721   95  2.1785  3.264                  
## mu2        2.401  2.397 2.397   95  1.9802  2.834                  
## muDiff     0.321  0.321 0.272   95 -0.3427  0.965       0      83.3
## sigma1     2.300  2.296 2.312   95  1.8714  2.739                  
## sigma2     1.799  1.799 1.796   95  1.4010  2.193                  
## sigmaDiff  0.501  0.498 0.485   95  0.0208  1.002       0      97.9
## nu        22.195 13.855 7.210   95  2.4834 68.778                  
## log10nu    1.189  1.142 0.966   95  0.5826  1.893                  
## effSz      0.156  0.156 0.132   95 -0.1704  0.462       0      83.3

Above we used a specific wrapper function BestMCMC, but often Bayesian models are more complex. For example, Bayesian hierarchical models are super useful tools to account for inherent nested structure within our data, but models like these are typically implemented using jags or stan (although you can code your own sampler). These tools require you to write out the full model. This can be difficult to do at first, BUT when you do this the model becomes MUCH more understandable, because you wrote it. Nothing is happening under the hood (at least when it comes to the model).

Here is the same model above, but coded in jags. All jags models follow this format. You make a list of the data then write everything in a for loop. Here we say that sqrt(herb) is drawn from a normal distribution, with a mean (mu) and a variance (tau). We then say that this mean is predicted by our linear model (which is the same as the t-test above). Pay attention to how I coded the data. I said that high elevation = 1 and low elevation = 0. So, alpha represents the mean of low elevation. This is because when an observations is “low” in the data, the elevation = 0, so the model is mu[i] <- alpha + beta1 * 0, so beta1 goes away. When an observation is “high” in the data, the elevation = 1, so the model is mu[i] <- alpha + beta1 * 1, so the beta1 here is difference between low and high elevations. Recall that is the exact same interpretation that aov() or glm() gives us.