In this last lab we are going to examine structural equation modeling and then return to a principle that we have already mentioned in this class: that in statistics there are often multiple paths to the same answer.

Introduction to SEM

Resources:

Book: Cause and Correlation in Biology, Bill Shipley 2016

Website: Introduction to the Framework

Website: Video and 3-part seminar. I highly recommend going through these seminars if you are interested in this method. The question often pops up, “Isn’t SEM just linear regression?”, and I think these do a great job showing how the methods differ with respect to covariance in residual errors.

PDF: Lavaan reference Sheet

Structural Equation Modeling (SEM)

Structural equation modeling is a linear model framework that fits simultaneous regression equations, optionally with latent variables. Two types of models within this framework are confirmatory factor analysis and path analysis. The working definition here is that structural equation modeling is a model selection approach to path analysis that either includes latent variables or statistically compares different path models, with the goal of comparing different causal hypotheses and examining indirect effects.

Latent variables are key components of SEM. A latent variable is one that cannot be measured or observed directly, but whose influence can be summarized through observable indicator variables - variables that are measured directly. That is to say, we can study things that are inherently impossible to measure by analyzing the relationships among things they cause. An example of a latent variable in ecology is climate change. Since no one has invented a climate-changeometer, we use measures of climate variables like temperature and precipitation, or rates of change in these variables, as indicators for the latent construct - climate change. These methods are widely used outside of ecology; fields such as psychology use survey items to construct latent variables of interest such as depression, anxiety, and intelligence. The theory and methods are also very useful to apply to, and consider for, ecological questions.
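In lavaan’s model syntax (covered below), a latent variable is defined with the `=~` operator. As a quick sketch, a hypothetical measurement model for climate change might look like this (the variable names are made up for illustration):

```r
# Hypothetical measurement model sketch: the unobservable construct
# "climatechange" is defined (=~) by indicator variables we can measure.
climate_model <- '
  climatechange =~ temp_trend + precip_trend
'
```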

SEM can account for several types of relationships among variables, which are easiest to see in a path diagram.

Path Diagrams

Let’s start with a heuristic visualization to understand the components of a path diagram and the strengths of the SEM approach.

Figure 1: example structural equation model

There’s a lot to take in here, and not all of these symbols will be used in every SEM diagram.

Let’s look at a simplified version of this model:

Figure 2: a simple SEM

In the above model we are saying that x1 is a predictor of y1 and y2. Additionally, we are saying that y1 is a predictor of y2. This illustrates another key feature of SEMs: we can separate the direct effect of x1 on y2 from the indirect effect of x1 on y2 that is mediated by y1. Each line in the diagram represents a statistical model. For instance, the model predicting y1 looks like this: y1 = x1 + error. Each response variable gets its own error term (ζ), although these are not always shown in diagrams.
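To make that concrete, each arrow in Figure 2 corresponds to one regression equation. A minimal base-R sketch, with made-up effect sizes, fits the two implied models with lm():

```r
set.seed(1)                  # reproducible simulated example
x1 <- rnorm(100)             # exogenous predictor
y1 <- 2 * x1 + rnorm(100)    # y1 = x1 + error      (arrow x1 -> y1)
y2 <- x1 + y1 + rnorm(100)   # y2 = x1 + y1 + error (arrows x1 -> y2, y1 -> y2)

fit_y1 <- lm(y1 ~ x1)        # the model behind the arrow x1 -> y1
fit_y2 <- lm(y2 ~ x1 + y1)   # the models behind x1 -> y2 and y1 -> y2
```

An SEM fits both equations simultaneously rather than one at a time, which is what allows it to partition direct and indirect effects.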

Also note that we can’t have arrows that form loops in our models (the specific terminology is that these are acyclic graphs), so a model like this is invalid:

Figure 3: An example of an invalid SEM

This constraint is part of why SEMs are so useful for causal analysis. The basic idea of causality, that causes drive effects and not the other way around, is built into the assumptions of the model.

Typically, before conducting the analysis, it is useful to explicitly draw out the metamodel (a model with the variables of interest, including the (+/-) relationships, to represent your hypotheses).

lavaan: R package to explore SEM

Frequently used syntax in lavaan

lavaan model strings use a few special operators: ~ for regressions, =~ for latent variable definitions, ~~ for variances and covariances, and ~ 1 for intercepts. Here’s an example of how these operators are used when building a model in R:

model <- '
# regressions
y1 + y2 ~ f1 + f2 + x1 + x2
f1 ~ f2 + f3
f2 ~ f3 + x1 + x2

# latent variable definitions
f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6
f3 =~ y7 + y8 + y9 + y10

# variances and covariances
y1 ~~ y1
y1 ~~ y2
f1 ~~ f2

# intercepts
y1 ~ 1
f1 ~ 1'

Practice

Let’s get on to working with data. For this lab we will simulate everything.

#Load libraries
library(lavaan)
library(lavaanPlot)
library(piecewiseSEM)
library(tidyverse)
library(rjags)
library(jagsUI)
library(coda)
library(matrixStats)
library(pals)

Here we will simulate data of caterpillar richness measured over 60 years. First I simulate that temperature is increasing. Then I simulate that caterpillar richness is declining as a function of both temperature and year. Then I add a precipitation variable with no trends and no effect on caterpillars.

set.seed(2002)

semdata <- data.frame('years' = 1:60) |> #60 years of data
  within({
    # precip is simulated from a random normal distribution with a mean of 100 and standard deviation of 10; 60 values are drawn from that distribution (one for each year).
    precip <- rnorm(60, 100, 10) 
    
    #temperature is modeled with an intercept of 0.05 and a positive slope of 0.05, so it increases by 0.05 degrees Celsius each year, plus some error/noise drawn from a Gaussian distribution.
    temperature <- 0.05 + 0.05*years + rnorm(60, 1, 0.5) 

    #simulate the caterpillar richness as a function of years and temps
    catrich <- 350 - 2*years -50*temperature + rnorm(60, 50, 50)
  })


#visualize trend of temperature across the 60 years
ggplot(semdata, aes(years, temperature))+
  geom_point()+
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

#visualize caterpillar richness across years
ggplot(semdata, aes(years, catrich))+
  geom_point()+
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

In the case where we didn’t simulate data and are instead hypothesizing about these patterns, we would draw a metamodel of the relationships we are about to test (it can still be nice to do this even when you aren’t planning on using an SEM). Below is the metamodel for the model we are about to write. Note that this metamodel is very simplified in that it only displays the edges that represent regression or path coefficients. When creating your model, it is important to consider correlated error structure; the covariances of residuals and the residual variances of the variables are typically not displayed in a path diagram, but the model output will report this information.

Figure 4: A metamodel showing the causal hypotheses we want to test

Let’s first examine the observed or sample covariance matrix to get a sense of these relationships.

cov(semdata)
##                   years     catrich temperature      precip
## years         305.00000 -1358.42825  15.3312769 -23.4048821
## catrich     -1358.42825  8825.22098 -82.1936621  90.2473145
## temperature    15.33128   -82.19366   1.0431215   0.2863871
## precip        -23.40488    90.24731   0.2863871  97.8564803

Now let’s run our hypothesized SEM. First, we need to write a model statement. Then we run the model using sem() and plot it using lavaanPlot().

##Write the model
# In the model below, the only exogenous variable is years
# Endogenous variables are precipitation, temperature, and caterpillar richness
# Precipitation and temperature are predicted by years; caterpillar richness is predicted by years and temperature

model1 <- '
  precip + temperature ~ years
  catrich ~ years + temperature'

# fitting the sem model to your data
model1.fit <- sem(model1, data = semdata) 
## Warning: lavaan->lav_data_full():  
##    some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate
# get a summary of model fit statistics (`fit.measures = TRUE`), standardized estimates (`standardized = TRUE`; these match the coefficients we would get had we scaled all the data first), and R-squared values (`rsq = TRUE`)
summary(model1.fit, rsq = TRUE, fit.measures = TRUE, standardized = TRUE) 
## lavaan 0.6-19 ended normally after 21 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                            60
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 5.118
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.024
## 
## Model Test Baseline Model:
## 
##   Test statistic                               175.155
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.976
##   Tucker-Lewis Index (TLI)                       0.854
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -580.210
##   Loglikelihood unrestricted model (H1)       -577.651
##                                                       
##   Akaike (AIC)                                1176.420
##   Bayesian (BIC)                              1193.175
##   Sample-size adjusted Bayesian (SABIC)       1168.013
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.262
##   90 Percent confidence interval - lower         0.077
##   90 Percent confidence interval - upper         0.504
##   P-value H_0: RMSEA <= 0.050                    0.034
##   P-value H_0: RMSEA >= 0.080                    0.948
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.054
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   precip ~                                                              
##     years            -0.077    0.072   -1.059    0.290   -0.077   -0.135
##   temperature ~                                                         
##     years             0.050    0.004   13.027    0.000    0.050    0.860
##   catrich ~                                                             
##     years            -1.702    0.652   -2.609    0.009   -1.702   -0.314
##     temperature     -54.746   11.124   -4.921    0.000  -54.746   -0.592
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .precip ~~                                                             
##    .catrich          64.990   57.207    1.136    0.256   64.990    0.148
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .precip           94.459   17.246    5.477    0.000   94.459    0.982
##    .temperature       0.268    0.049    5.477    0.000    0.268    0.261
##    .catrich        2034.063  371.367    5.477    0.000 2034.063    0.231
## 
## R-Square:
##                    Estimate
##     precip            0.018
##     temperature       0.739
##     catrich           0.769
#Plot model
lavaanPlot(name = "model1", model1.fit, coefs = TRUE) # prints a path diagram of the relationships specified in `model1`, along with the estimated path coefficients from the fitted model (here, the unstandardized estimates from the summary output)

Let’s break down the output:

Figure 5: Lavaan model output

Model Fit Estimates

The general goal of SEM is to identify a set of maximally likely population parameters given

  1. the variances and covariances of the observed data and
  2. several user-specified paths thought to give rise to said variances and covariances.

Thus, model fit is based on the comparison of the observed variance-covariance matrix and model-implied covariance matrix. If the model is an accurate representation of the patterns among the data, the variance-covariance matrix using the estimated parameters should match the variance-covariance of the sample or observed matrix.
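For a single path, this equivalence can be verified directly in base R: with y regressed on x, the model-implied covariance of x and y is the path coefficient times the variance of x, which reproduces the observed covariance exactly (the data and effect size here are simulated purely for illustration):

```r
set.seed(42)
x <- rnorm(50)
y <- 1.5 * x + rnorm(50)

b <- coef(lm(y ~ x))[["x"]]    # estimated path coefficient for x -> y
implied_cov  <- b * var(x)     # model-implied covariance of x and y
observed_cov <- cov(x, y)      # observed (sample) covariance

all.equal(implied_cov, observed_cov)  # TRUE: the fitted path reproduces it
```

With more paths than free parameters the match is no longer exact, and the discrepancy between the two matrices is what the fit statistics below summarize.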

  A) Estimator Methods
  B) Model Test User Model

This is the section describing the fitted model. Values in this section are reported in publications for model fit (e.g., χ2 = 5.12, df = 1, p = 0.024)

  C) Model Test Baseline Model

The baseline model represents a null model whereby your observed variables are constrained to covary with no other variables - covariances are fixed to 0 such that only variances are estimated in the covariance matrix. It represents a poor fitting model. This is tested against your fitted model.

D-G) More model fit statistics

A bunch of other model fit statistics, mostly beyond the scope of this lesson. In the spirit of somewhat arbitrary cut-offs: RMSEA less than 0.05, CFI/TLI above 0.90 to 0.95, and SRMR less than 0.08 are commonly accepted as indicators of “good fit”; see Hu and Bentler (1999).

Parameter Estimates

  H) Regression Coefficients

Values in this section get interpreted just like any other linear model. These are the path coefficients that are presented on the path diagrams.

I, J) Variance Covariance Outputs

In these sections it is important to note that a (.) before a variable name (e.g., .precip) indicates a residual term. Variances and covariances can also be displayed on the path diagram, and there are specific ways to illustrate those relationships; however, this information is often omitted from published path diagrams for simplicity. Published path diagrams in ecology typically only include edges representing the regression coefficients.

  I) Covariances

This section displays the covariance of the residuals among the endogenous variables (precip and catrich - the predicted variables, i.e. those with arrows pointing to them). In our example the value can be interpreted as a positive association between the residuals of precip and catrich that is not accountedted for by the exogenous variables. By default, sem() allows all exogenous variables to covary and also frees the residual covariances among endogenous variables that are not connected by a regression, which is why the residual terms of precip and catrich are allowed to covary in this section even though we did not write that in our model statement. Recall, to designate a covariance (or variance) we use the ~~ operator; thus .precip ~~ .catrich tells lavaan to include a residual covariance for precip and catrich. Note that the value in the std.all column represents the correlation.

  J) Variances

This section reports the residual variance of the endogenous variables precip, temperature, and catrich - the left-over variance that is not explained by the predictor(s). Notice that precip has a lot of unexplained variance (std.all = 0.982) while catrich (0.231) and temperature (0.261) have much smaller residual variances. This makes sense since we simulated temperature and catrich to vary with years.

  K) R-squared

The amount of total variance explained by the model for each variable. As we would expect, very little of the variance in precip is explained by the model. Had we simulated precip as a function of years, we would have expected much more of its variance to be explained.
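Because the standardized solution puts every variable on unit variance, each R-squared is simply one minus the standardized residual variance (the Std.all column of the Variances section). Checking this against the summary output above:

```r
# standardized residual variances from the Std.all column above
resid_var <- c(precip = 0.982, temperature = 0.261, catrich = 0.231)

r_squared <- 1 - resid_var  # R-squared = 1 - standardized residual variance
round(r_squared, 3)         # 0.018, 0.739, 0.769: matches the R-Square section
```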

Let’s try the same model with our variables scaled:

semdata_scaled <- apply(semdata, MARGIN = 2, scale)

model1.scaled <- sem(model1, data = semdata_scaled)
summary(model1.scaled, rsq = TRUE, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 14 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                            60
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 5.118
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.024
## 
## Model Test Baseline Model:
## 
##   Test statistic                               175.155
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.976
##   Tucker-Lewis Index (TLI)                       0.854
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -168.878
##   Loglikelihood unrestricted model (H1)       -166.319
##                                                       
##   Akaike (AIC)                                 353.755
##   Bayesian (BIC)                               370.510
##   Sample-size adjusted Bayesian (SABIC)        345.348
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.262
##   90 Percent confidence interval - lower         0.077
##   90 Percent confidence interval - upper         0.504
##   P-value H_0: RMSEA <= 0.050                    0.034
##   P-value H_0: RMSEA >= 0.080                    0.948
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.054
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   precip ~                                                              
##     years            -0.135    0.128   -1.059    0.290   -0.135   -0.135
##   temperature ~                                                         
##     years             0.860    0.066   13.027    0.000    0.860    0.860
##   catrich ~                                                             
##     years            -0.316    0.121   -2.609    0.009   -0.316   -0.314
##     temperature      -0.595    0.121   -4.921    0.000   -0.595   -0.592
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .precip ~~                                                             
##    .catrich           0.070    0.062    1.136    0.256    0.070    0.148
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .precip            0.965    0.176    5.477    0.000    0.965    0.982
##    .temperature       0.257    0.047    5.477    0.000    0.257    0.261
##    .catrich           0.230    0.042    5.477    0.000    0.230    0.231
## 
## R-Square:
##                    Estimate
##     precip            0.018
##     temperature       0.739
##     catrich           0.769
lavaanPlot(name = "scaled", model1.scaled, coefs = TRUE)

SEM is especially powerful when we use it to test different hypotheses and perform model comparison. This next model does not include a direct trend in caterpillars over time. We can then perform model comparison using AIC or some other metric.

Figure 6: Creating a second metamodel with no direct effect of years on catrich
model2 <- 'precip + temperature ~ years
    catrich ~ temperature'

model2.fit <- sem(model2, data = semdata_scaled) #still using scaled data
summary(model2.fit, fit.measures = TRUE, standardized = TRUE, rsquare=TRUE)
## lavaan 0.6-19 ended normally after 13 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         7
## 
##   Number of observations                            60
## 
## Model Test User Model:
##                                                       
##   Test statistic                                11.148
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.004
## 
## Model Test Baseline Model:
## 
##   Test statistic                               175.155
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.946
##   Tucker-Lewis Index (TLI)                       0.838
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -171.892
##   Loglikelihood unrestricted model (H1)       -166.319
##                                                       
##   Akaike (AIC)                                 357.785
##   Bayesian (BIC)                               372.445
##   Sample-size adjusted Bayesian (SABIC)        350.428
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.276
##   90 Percent confidence interval - lower         0.134
##   90 Percent confidence interval - upper         0.444
##   P-value H_0: RMSEA <= 0.050                    0.007
##   P-value H_0: RMSEA >= 0.080                    0.985
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.055
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   precip ~                                                              
##     years            -0.101    0.127   -0.796    0.426   -0.101   -0.102
##   temperature ~                                                         
##     years             0.860    0.066   13.027    0.000    0.860    0.860
##   catrich ~                                                             
##     temperature      -0.870    0.066  -13.147    0.000   -0.870   -0.860
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .precip ~~                                                             
##    .catrich           0.112    0.067    1.680    0.093    0.112    0.222
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .precip            0.966    0.176    5.477    0.000    0.966    0.990
##    .temperature       0.257    0.047    5.477    0.000    0.257    0.261
##    .catrich           0.262    0.048    5.477    0.000    0.262    0.260
## 
## R-Square:
##                    Estimate
##     precip            0.010
##     temperature       0.739
##     catrich           0.740
lavaanPlot(name = "model2", model2.fit, labels = names(semdata_scaled), coefs = TRUE)
AIC(model1.fit, model1.scaled, model2.fit)
##               df       AIC
## model1.fit     8 1176.4203
## model1.scaled  8  353.7550
## model2.fit     7  357.7849

INTERPRETATION SUMMARY Comparing the two models fit to the scaled data, the first model appears to be the best (note that AIC is only comparable among models fit to the same data, so the unscaled fit should be left out of this comparison). Let’s take a closer look. We can see that years has a positive effect on temperature, which is to say that temperature is increasing over time. We can also see that temperature has a direct negative effect on caterpillar richness. There is also a slight negative trend in richness over time after accounting for temperature change, and a slight negative trend in precip. The precip result should be interpreted with caution, as we did not simulate a relationship between precip and year; in fact, if you look at the model output, there is no reason to suspect it is anything other than random noise (high p-value).

Now for the indirect effect. We only have one in this model: the effect of year on caterpillar richness that is mediated by temperature change. We calculate an indirect effect by multiplying the path coefficients along the path, and the total effect by adding the direct and indirect effects. Using the standardized estimates, the total impact of year on caterpillars is -0.32 + (0.86 × -0.60), which is about -0.83.
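lavaan can compute derived quantities like this for you if you label the paths in the model string and use the := operator (e.g., temperature ~ b1*years; catrich ~ b2*years + b3*temperature; indirect := b1*b3; total := b2 + b1*b3). Doing the arithmetic by hand with the standardized estimates from the scaled fit above:

```r
b1 <- 0.860   # years -> temperature (standardized estimate)
b2 <- -0.316  # years -> catrich, the direct effect
b3 <- -0.595  # temperature -> catrich

indirect <- b1 * b3       # effect of years on catrich mediated by temperature
total    <- b2 + indirect # direct + indirect = total effect of years
round(c(indirect = indirect, total = total), 2)  # -0.51 and -0.83
```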

Checking for Understanding

  1. Try a few other SEMs with the above data. a) How does the AIC vary between them? b) Try reversing the causality. What happens then?

Multiple paths to the same answer

In this class we have introduced many techniques, and depending on your data and questions some will be more appropriate than others. That said, often what some may call the wrong technique will still yield reasonable interpretations that reflect some of the underlying patterns in your data. This is not to say picking the right technique is unimportant - it is very important - but it can be good to try several techniques to see where they agree and where they differ. It is SUPER important to remember that you probably will not report all of the things you tried, and the results you do report should be chosen based on whether the test was appropriate, not on which one gives you the “best” results.

For this last section we will look at the same data set a bunch of different ways. I have simulated relationships between several variables. Elevation is a factor with two levels: high and low. At each of the two elevations, 100 plots were placed. In each plot, the leaf area of the plant and herbivory (an ordinal variable from 0 to 2) were measured, and survival at the end of the season was observed.

multiple_paths <- read.csv("multiple_paths.csv")
summary(multiple_paths) #examine the data
##     plot_num       elevation           leaf_area       herbivory        survival    
##  Min.   :  1.00   Length:200         Min.   :199.2   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 50.75   Class :character   1st Qu.:220.5   1st Qu.:0.000   1st Qu.:0.000  
##  Median :100.50   Mode  :character   Median :227.7   Median :1.000   Median :0.000  
##  Mean   :100.50                      Mean   :228.0   Mean   :1.065   Mean   :0.315  
##  3rd Qu.:150.25                      3rd Qu.:235.0   3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :200.00                      Max.   :259.2   Max.   :2.000   Max.   :1.000
Figure 7: Metamodel for the generated data

SEMs

These data were simulated with SEMs in mind, so it seems reasonable to analyze them using SEMs. There are a couple of different ways we can treat our variables. The first interesting one to examine is herbivory, which was measured in an ordinal fashion and has three categories. Because of that, treating it like any other continuous variable is probably not the best thing to do; however, it is easier to interpret, so let’s start there.

#scale herbivory
scaled_paths <- within(multiple_paths, {
  herbivory <- scale(herbivory)
  leaf_area <- scale(leaf_area)
})

model <- '
  leaf_area + herbivory ~ elevation
  survival ~ herbivory + elevation'

#Note that writing the model like this is analogous to
#model <- 'leaf_area ~ elevation 
#   survival ~  elevation
#   herbivory ~ elevation
#   survival ~ herbivory'

#fit model
scaled_fit <- sem(model, data = scaled_paths)

#summary output
summary(scaled_fit, rsq = T, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 10 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                           200
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 2.322
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.128
## 
## Model Test Baseline Model:
## 
##   Test statistic                               108.420
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.987
##   Tucker-Lewis Index (TLI)                       0.923
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -643.959
##   Loglikelihood unrestricted model (H1)             NA
##                                                       
##   Akaike (AIC)                                1303.919
##   Bayesian (BIC)                              1330.306
##   Sample-size adjusted Bayesian (SABIC)       1304.961
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.081
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.224
##   P-value H_0: RMSEA <= 0.050                    0.220
##   P-value H_0: RMSEA >= 0.080                    0.649
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.031
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   leaf_area ~                                                           
##     elevation         0.598    0.135    4.441    0.000    0.598    0.300
##   herbivory ~                                                           
##     elevation         0.927    0.125    7.425    0.000    0.927    0.465
##   survival ~                                                            
##     herbivory        -0.191    0.034   -5.649    0.000   -0.191   -0.410
##     elevation         0.367    0.067    5.445    0.000    0.367    0.395
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .leaf_area ~~                                                          
##    .survival         -0.009    0.028   -0.333    0.739   -0.009   -0.024
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .leaf_area         0.906    0.091   10.000    0.000    0.906    0.910
##    .herbivory         0.780    0.078   10.000    0.000    0.780    0.784
##    .survival          0.178    0.018   10.000    0.000    0.178    0.826
## 
## R-Square:
##                    Estimate
##     leaf_area         0.090
##     herbivory         0.216
##     survival          0.174
#plot it in a path diagram
lavaanPlot(name = "MODEL1", scaled_fit, coefs = TRUE)

Instead of treating herbivory like a continuous variable, we can treat it as a categorical variable. This is a more conservative approach because we get an estimate for each herbivory level. The downside is that the model looks a bit messier and will be slightly harder to fit; that said, we have more than enough data here. First I dummy code the categorical variable, then I scale leaf area, before finally running the model.

###dummy code herbivory into a categorical variable with 3 levels 
dummy_paths <- mutate(multiple_paths, ones = 1) |> 
  spread(key = herbivory, value = ones, fill = 0) |>
  setNames(c('plot_num','elevation','leaf_area','survival','herb0','herb1','herb2'))

#each level of herbivory is now a column filled with 1,0 where 1 indicated that the sample had that level of herbivory
head(dummy_paths)
##   plot_num elevation leaf_area survival herb0 herb1 herb2
## 1        1      high  223.9341        0     0     1     0
## 2        2      high  241.1445        0     0     1     0
## 3        3      high  238.8826        0     0     0     1
## 4        4      high  232.0278        0     0     0     1
## 5        5      high  247.5750        1     1     0     0
## 6        6      high  223.0350        0     1     0     0
#Fun tangent. If you are interested, this is the same step as above but showing you a base R method to do this:

baseR <- cbind(subset(multiple_paths, select = -herbivory),
  do.call(
    cbind, 
    lapply(0:max(multiple_paths$herbivory), 
           \(h) 0+(multiple_paths$herbivory == h))))|>
  setNames(c('plot_num','elevation','leaf_area','survival','herb0','herb1','herb2'))

head(dummy_paths == baseR)
##      plot_num elevation leaf_area survival herb0 herb1 herb2
## [1,]     TRUE      TRUE      TRUE     TRUE  TRUE  TRUE  TRUE
## [2,]     TRUE      TRUE      TRUE     TRUE  TRUE  TRUE  TRUE
## [3,]     TRUE      TRUE      TRUE     TRUE  TRUE  TRUE  TRUE
## [4,]     TRUE      TRUE      TRUE     TRUE  TRUE  TRUE  TRUE
## [5,]     TRUE      TRUE      TRUE     TRUE  TRUE  TRUE  TRUE
## [6,]     TRUE      TRUE      TRUE     TRUE  TRUE  TRUE  TRUE
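Yet another base R route, if you are curious, is `model.matrix()`, which builds indicator columns directly from a factor. A minimal sketch (the small example vector and the column names are just for illustration):

```r
# model.matrix() with `- 1` (drop the intercept) returns one 0/1 indicator column per factor level
herb_dummies <- model.matrix(~ factor(herbivory) - 1,
                             data = data.frame(herbivory = c(0, 1, 1, 2)))
colnames(herb_dummies) <- c("herb0", "herb1", "herb2")
```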
#ok. End tangent
# Let's use `dummy_paths`
dummy_paths$leaf_area <- scale(dummy_paths$leaf_area) #scale leaf area

Now that we have the data set dummy_paths ready to go with categorical variables, let’s run a similar model, updated to treat herbivory as categorical.

model <- '
  leaf_area + herb1 + herb2 ~ elevation
  survival ~ herb1 + herb2 + elevation'

dummy_fit <- sem(model, data = dummy_paths)
summary(dummy_fit, rsq = TRUE, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 10 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        11
## 
##   Number of observations                           200
## 
## Model Test User Model:
##                                                       
##   Test statistic                                78.230
##   Degrees of freedom                                 3
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               181.187
##   Degrees of freedom                                10
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.561
##   Tucker-Lewis Index (TLI)                      -0.465
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -634.292
##   Loglikelihood unrestricted model (H1)             NA
##                                                       
##   Akaike (AIC)                                1290.585
##   Bayesian (BIC)                              1326.866
##   Sample-size adjusted Bayesian (SABIC)       1292.017
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.354
##   90 Percent confidence interval - lower         0.289
##   90 Percent confidence interval - upper         0.424
##   P-value H_0: RMSEA <= 0.050                    0.000
##   P-value H_0: RMSEA >= 0.080                    1.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.157
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   leaf_area ~                                                           
##     elevation         0.598    0.135    4.441    0.000    0.598    0.300
##   herb1 ~                                                               
##     elevation        -0.090    0.067   -1.345    0.179   -0.090   -0.095
##   herb2 ~                                                               
##     elevation         0.420    0.061    6.881    0.000    0.420    0.438
##   survival ~                                                            
##     herb1            -0.307    0.063   -4.889    0.000   -0.307   -0.293
##     herb2            -0.473    0.069   -6.869    0.000   -0.473   -0.455
##     elevation         0.361    0.066    5.439    0.000    0.361    0.362
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .leaf_area ~~                                                          
##    .survival         -0.010    0.028   -0.354    0.723   -0.010   -0.025
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .leaf_area         0.906    0.091   10.000    0.000    0.906    0.910
##    .herb1             0.224    0.022   10.000    0.000    0.224    0.991
##    .herb2             0.186    0.019   10.000    0.000    0.186    0.809
##    .survival          0.177    0.018   10.000    0.000    0.177    0.711
## 
## R-Square:
##                    Estimate
##     leaf_area         0.090
##     herb1             0.009
##     herb2             0.191
##     survival          0.289
lavaanPlot(name = "dummy_paths", dummy_fit, coefs = TRUE)

You will probably recall from the complex models lab that we do not usually model binomial data with a Gaussian linear model, which is what both of the previous models have done. lavaan can only fit Gaussian models, so we need to use the piecewiseSEM package for fancier models.

#first, scale leaf area like we did in the previous models
psem_model <- transform(multiple_paths, leaf_area = scale(leaf_area))

# `psem` is a function in the `piecewiseSEM` package that allows you to fit an SEM while specifying distributions other than the normal for each response - something lavaan cannot do. The syntax also differs: instead of enclosing the model in quotes, the function `psem` encloses a set of `lm` and `glm` calls, one for each relationship

glmsem <- psem(
  lm(leaf_area ~ elevation, psem_model), 
  lm(herbivory ~ elevation, psem_model),
  glm(survival ~ herbivory + leaf_area + elevation, "binomial", psem_model)
)

# As a summary output we will use `coefs`. This outputs the regression coefficients (path coefficients) for our model
coefs(glmsem)
## Warning: Categorical or non-linear variables detected. Please refer to documentation for interpretation of
## Estimates!
##     Response        Predictor Estimate Std.Error  DF Crit.Value P.Value Std.Estimate    
## 1  leaf_area        elevation        -         -   1    19.5256  0.0000            - ***
## 2  leaf_area elevation = high 224.8589    1.0051 198   223.7262  0.0000            - ***
## 3  leaf_area  elevation = low 231.1396    1.0051 198   229.9753  0.0000            - ***
## 4  herbivory        elevation        -         -   1    54.5795  0.0000            - ***
## 5  herbivory elevation = high     0.69    0.0718 198     9.6121  0.0000            - ***
## 6  herbivory  elevation = low     1.44    0.0718 198    20.0600  0.0000            - ***
## 7   survival        herbivory  -1.3462    0.2766 196    -4.8670  0.0000            - ***
## 8   survival        leaf_area  -0.0073    0.0168 196    -0.4345  0.6639            -    
## 9   survival        elevation        -         -   1    26.9222  0.0000            - ***
## 10  survival elevation = high  -2.0447    0.3288 Inf    -6.2180  0.0000            - ***
## 11  survival  elevation = low   0.1363     0.251 Inf     0.5431  0.5870            -

PiecewiseSEM does not have a “nice” plotting function. You could make one in illustration software (e.g., Adobe Illustrator, PowerPoint, Inkscape). Also check out ggraph, tidygraph, and Rgraphviz if you want to do this more programmatically.
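As a sketch of the programmatic route, here is one way to draw a rough path diagram with igraph. The edge labels below are hypothetical placeholders (signs of the hypothesized paths), not fitted coefficients - in practice you would paste in the estimates from `coefs(glmsem)`:

```r
library(igraph)

# edge list for the hypothesized paths; extra columns become edge attributes
paths <- data.frame(
  from  = c("elevation", "elevation", "herbivory", "leaf_area", "elevation"),
  to    = c("leaf_area", "herbivory", "survival",  "survival",  "survival"),
  label = c("+", "+", "-", "0", "+"))  # placeholder signs, not estimates

g <- graph_from_data_frame(paths, directed = TRUE)
plot(g, edge.label = E(g)$label, vertex.shape = "rectangle",
     vertex.size = 45, vertex.size2 = 20, vertex.color = "gray90")
```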

If you look at the coefficient directionality and the magnitude of the estimates, the patterns follow our overall story!

Linear models

So far we have only looked at different types of SEMs. What if we fit regular linear models instead? Can we still pick up the signal in the data?

linear <- lm(survival ~ elevation + herbivory + leaf_area + elevation*herbivory, data = multiple_paths)
summary(linear)
## 
## Call:
## lm(formula = survival ~ elevation + herbivory + leaf_area + elevation * 
##     herbivory, data = multiple_paths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7940 -0.2750 -0.1565  0.4634  1.0518 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.612963   0.681641   0.899 0.369631    
## elevationlow            0.418324   0.115111   3.634 0.000357 ***
## herbivory              -0.214975   0.059984  -3.584 0.000428 ***
## leaf_area              -0.001088   0.003042  -0.358 0.721008    
## elevationlow:herbivory -0.041847   0.084693  -0.494 0.621795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4269 on 195 degrees of freedom
## Multiple R-squared:  0.1765, Adjusted R-squared:  0.1596 
## F-statistic: 10.45 on 4 and 195 DF,  p-value: 1.086e-07
binom <- glm(survival ~ elevation + herbivory + leaf_area + elevation*herbivory, family = "binomial", data = multiple_paths)
summary(binom)
## 
## Call:
## glm(formula = survival ~ elevation + herbivory + leaf_area + 
##     elevation * herbivory, family = "binomial", data = multiple_paths)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             1.04381    3.74622   0.279 0.780530    
## elevationlow            1.70865    0.60216   2.838 0.004546 ** 
## herbivory              -1.86889    0.56165  -3.327 0.000876 ***
## leaf_area              -0.00656    0.01673  -0.392 0.694908    
## elevationlow:herbivory  0.74054    0.64743   1.144 0.252700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 249.22  on 199  degrees of freedom
## Residual deviance: 208.95  on 195  degrees of freedom
## AIC: 218.95
## 
## Number of Fisher Scoring iterations: 5
simple_binom <- glm(survival ~ elevation, family = "binomial", data = multiple_paths)
summary(simple_binom)
## 
## Call:
## glm(formula = survival ~ elevation, family = "binomial", data = multiple_paths)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.2657     0.2414  -5.243 1.58e-07 ***
## elevationlow   0.9017     0.3156   2.857  0.00428 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 249.22  on 199  degrees of freedom
## Residual deviance: 240.75  on 198  degrees of freedom
## AIC: 244.75
## 
## Number of Fisher Scoring iterations: 4

We lose some of the indirect-effect interpretation with these models, but we retain the conclusions that herbivory has a negative effect on survival, that high elevations have lower survival, and that leaf area is not very predictive.
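One way to make the binomial coefficients easier to interpret is to exponentiate them into odds ratios. A small sketch, using the herbivory estimate from the `binom` summary above:

```r
# exp() of a logit-scale coefficient gives an odds ratio
herb_coef <- -1.86889   # herbivory estimate from summary(binom) above
exp(herb_coef)          # ~0.15: each unit of herbivory multiplies the odds of survival by ~0.15
# equivalently, for all terms at once: exp(coef(binom))
```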

Bayesian Hierarchical Model

We can also model these data as a Bayesian hierarchical model. Since plots are nested within elevations, we can estimate both elevation-specific and across-elevation effects. We’ll work with the dummy_paths data set:

##############################
# Step 1:                    #
#Data.frame to list for rJAGS#
##############################

# Put the model data into a list. The jags function requires this formatting.
#note we scale leaf area and change elevation to a numeric variable so jags can interpret it
model_data <- with(dummy_paths,
                   list(
                     surv  = survival,
                     herb0 = herb0,
                     herb1 = herb1,
                     herb2 = herb2,
                     area  = scale(leaf_area)[,1], 
                     elevs = as.numeric(factor(elevation)),
                     N     = length(plot_num), #jags requires the number of observations to be specified. In our case each row is a unique plot, so if we use length on the plot column, we will get the number of observations 
                     Nelev = length(unique(elevation))))

#############################
# Step 2:                   #
#Build Model                #
#############################
#Model specification 

"
Here are the parts of the model:

`for(i in 1:N)` :         for each observation i through N observations. Here N = 200 

`surv[i] ~ dbern(p[i])`:  y is distributed as a Bernoulli distribution with parameter p, 
                          p = probability of success and in this case refers to the 
                          probability of survival. i refers to each trial. 
                          The number of trials equals the number of observations.
                          
`logit(p[i])`:            The logit probability of success for the ith observation 

`logit(p[i]) <- beta0[elevs[i]] * herb0[i] + beta1[elevs[i]] * herb1[i] +  
beta2[elevs[i]] * herb2[i] + beta3[elevs[i]] * area[i]`: 
                          The logit probability of success for the ith observation 
                          is a linear function of the different categories of 
                          herbivory (herb0, herb1, herb2) and area. 
                          The brackets next to the beta coefficients (e.g. beta1[elevs[i]]) 
                          indicate the hierarchical components of the model: 
                          a separate beta coefficient is estimated at each level of 
                          elevation for each level of herbivory and for area.
                          
Now we define uninformative priors.
Prior distributions are a central component of Bayesian models.

`for (j in 1:Nelev)`:     For the j'th elevation in 1 through the number of the elevations  
                          (which is 2 levels of elevation; high and low)
                          
`dnorm(hbmu0, 0.001)`:    Each beta coefficient has a prior distribution. In this 
                          example, we have chosen uninformative priors that are 
                          defined as coming from a normal distribution with two parameters: 
                          mean (hbmu0) & precision (0.001). You may notice that 
                          typically in a Gaussian distribution, we specify the mean 
                          and standard deviation. For priors it is the convention 
                          to use precision, which is the inverse variance (1/variance).  
                          
Finally we define uninformative hyperpriors (good band name).
Notice in this chunk of code, hbmu is a parameter of a parameter (e.g. hbmu0 is a hyperparameter 
since it is used to model the parameter beta0). Since it is a hyperparameter for the prior 
distribution, it is called a hyperprior. We need to include hyperparameters in the model as 
random variables since we do not know their value. Like above, we made these hyperpriors 
uninformative using a normal distribution to model our uncertainty in hbmu. We model each 
hyperprior with a mean of 0 and precision of 0.001.      
"

sim_model <- "model {

          for(i in 1:N) {  
              surv[i] ~ dbern(p[i]) 
              logit(p[i]) <- beta0[elevs[i]] * herb0[i] + 
              beta1[elevs[i]] * herb1[i] +   
              beta2[elevs[i]] * herb2[i] +   
              beta3[elevs[i]] * area[i]      
          }
        
          for (j in 1:Nelev) {
               beta0[j] ~ dnorm(hbmu0, 0.001)
               beta1[j] ~ dnorm(hbmu1, 0.001)
               beta2[j] ~ dnorm(hbmu2, 0.001)
               beta3[j] ~ dnorm(hbmu3, 0.001)
          } 

          hbmu0 ~ dnorm(0, 0.001)
          hbmu1 ~ dnorm(0, 0.001)
          hbmu2 ~ dnorm(0, 0.001)
          hbmu3 ~ dnorm(0, 0.001)
}   
"

writeLines(sim_model, con="sim_model.txt")

############################################################
# Step 3:                                                  #
#Run data through the sampler and specify sampler details  #
############################################################

#This specifies details of the MCMC chain (i.e., number of burnin, number of iterations, number of chains)
jags_model <- jags(data = model_data,
                    n.adapt = 2000,
                    n.burnin = 1000,
                    n.iter = 10000,
                    n.chains = 4,
                    modules = "glm",
                    model.file = "sim_model.txt",
                    parameters.to.save = c("beta0", "beta1", "beta2", "beta3", "hbmu0", "hbmu1", "hbmu2", "hbmu3"),
                    verbose = F)
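Before interpreting the posteriors, it is worth checking that the chains converged. Assuming the jagsUI package (whose `jags()` interface this lab appears to use), something like the following would do; treat this as a sketch, since the exact output slots can vary between package versions:

```r
# visual check: the chains for each parameter should overlap and
# look like "fuzzy caterpillars" with no trends
traceplot(jags_model, parameters = c("hbmu0", "hbmu1", "hbmu2", "hbmu3"))

# numeric check: Gelman-Rubin Rhat values near 1 suggest convergence
jags_model$Rhat
```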
# The `jags_model` output is an array. The section of code below pulls the necessary information out of that array to make a handy data.frame called `plot_data`. `med` represents the median of the posterior distribution: these are the beta coefficients we report. `lc` and `uc` represent the respective lower and upper 95% credible intervals around that estimate. Recall that the beta coefficients and their credible intervals represent the effect of herbivory and leaf area on survival at each elevation (high and low)
plot_data <- with(jags_model,
                  data.frame(
                    'label' = c("High_Herb0", "High_Herb1", "High_Herb2", "High_Leaf",
                                "Low_Herb0",  "Low_Herb1",  "Low_Herb2",  "Low_Leaf"),
                    'med'   = c(q50$beta0[1],  q50$beta1[1],   q50$beta2[1],   q50$beta3[1],
                                q50$beta0[2],  q50$beta1[2],   q50$beta2[2],   q50$beta3[2]),
                    'lc'   = c(q2.5$beta0[1],  q2.5$beta1[1],  q2.5$beta2[1],  q2.5$beta3[1],
                               q2.5$beta0[2],  q2.5$beta1[2],  q2.5$beta2[2],  q2.5$beta3[2]),
                    'uc'   = c(q97.5$beta0[1], q97.5$beta1[1], q97.5$beta2[1], q97.5$beta3[1],
                               q97.5$beta0[2], q97.5$beta1[2], q97.5$beta2[2], q97.5$beta3[2]),
                    'col'  = c('high_herb','high_herb', 'high_herb', 'leaf_area',
                               'low_herb','low_herb','low_herb','leaf_area')))

plot_data
##        label         med         lc         uc       col
## 1 High_Herb0 -0.34648097 -0.9997522  0.2877027 high_herb
## 2 High_Herb1 -3.09697389 -5.0919547 -1.8428158 high_herb
## 3 High_Herb2 -3.02170515 -6.3379414 -1.2268727 high_herb
## 4  High_Leaf  0.03307931 -0.5214955  0.5885796 leaf_area
## 5  Low_Herb0  1.35027904  0.1180363  2.9340244  low_herb
## 6  Low_Herb1  0.17345041 -0.5588493  0.9097685  low_herb
## 7  Low_Herb2 -1.00734040 -1.6503275 -0.4114526  low_herb
## 8   Low_Leaf -0.16131156 -0.6418516  0.3050866 leaf_area

Now let’s plot the estimated beta coefficients and their 95% credible intervals for the effects of herbivory and leaf area on survival at each elevation category.

ggplot(plot_data) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray", linewidth = 2) +
  geom_errorbarh(aes(xmin = lc, xmax = uc, y = label, color = col), height = 0, linewidth = 2) +
  geom_point(aes(med, label, color = col), pch = 21, fill = "gray65", size=3) +
  labs(x = "Point Estimates of beta coefficients from posterior distribution (95% CI)", y="", color='')+
  scale_color_manual(values = ocean.haline(4)[-1])+
  theme_classic()

When beta coefficients are close to zero and the credible intervals overlap zero, we have low confidence in an effect of those variables on survival. Here we see the same pattern as before: increasingly negative effects of herbivory at high elevation (teal), no effect of leaf area (green), and greater survival overall at low elevations (yellow).
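Because these betas are on the logit scale, you can convert any of them back to a survival probability with the inverse logit. For example, using the `High_Herb0` posterior median from `plot_data` above:

```r
# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_surv <- plogis(-0.346)  # High_Herb0 median from plot_data above
p_surv                    # ~0.41: predicted survival probability at high elevation, herbivory level 0
```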

#This dataframe includes the OVERALL beta coefficients and credible intervals of herbivory and leaf area effect on survival
plot_overall <- with(jags_model,
                     data.frame(
                       'label' = c("Herb0", "Herb1", "Herb2", "Leaf"),
                       'med' = c(q50$hbmu0, q50$hbmu1, q50$hbmu2, q50$hbmu3),
                       'lc' = c(q2.5$hbmu0, q2.5$hbmu1, q2.5$hbmu2, q2.5$hbmu3),
                       'uc' = c(q97.5$hbmu0, q97.5$hbmu1, q97.5$hbmu2, q97.5$hbmu3)))
  
#plot the posterior distributions
ggplot(plot_overall) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray", linewidth = 2) +
  geom_errorbarh(aes(xmin = lc, xmax = uc, y = label), height = 0, linewidth = 2) +
  geom_point(aes(med, label), pch = 21, fill = "gray65", size=3) +
  labs(x = "", y="") +
  theme_classic()

There are no cross-elevation effects, as we can see in the estimated hyperparameters (the overall effects). Cross-elevation effects were not part of the original simulation, so this also checks out.

Check for Understanding

  1. Go back to the SEM data (caterpillar richness over time) and try some other types of analyses. Even if they are weird. One idea is to break one of the continuous variables into groups (maybe something like early and late years). Is the story the same?

Key Takeaways