In this last lab we are going to examine structural equation modeling and then return to a principle that we have already mentioned in this class, that in statistics there are often multiple paths to the same answer.
Resources:
Book: Cause and Correlation in Biology, Bill Shipley 2016
Website: Introduction to the Framework
Website: Video and 3-part seminar * I highly recommend going through these seminars if you are interested in this method. The question often pops up “Isn’t SEM just linear regression” and I think this does a great job showing how these methods differ with respect to covariance in residual errors.
Structural equation modeling (SEM) is a linear model framework that fits simultaneous regression equations, optionally with latent variables. Two types of models within this framework are confirmatory factor analysis and path analysis. The working definition here is that structural equation modeling is a model selection approach to path analysis that either includes latent variables or statistically compares different path models, with the goal of comparing different causal hypotheses and examining indirect effects.
Latent variables are key components of SEM. A latent variable is a variable that cannot be measured or observed directly, but whose influence can be summarized through observable indicator variables - variables that are measured directly. That is to say, we can study things that are inherently impossible to measure by analyzing the relationships among the things they cause. An example of a latent variable in ecology is climate change. Since no one has invented a climate-changeometer, we use measures of climate variables like temperature and precipitation, or rates of change in these variables, as indicators for the latent construct - climate change. These methods are widely used outside of ecology, in part because fields such as psychology use survey items to construct latent variables of interest such as depression, anxiety, and intelligence, but the theory and methods are also very useful to apply to, and consider for, ecological questions.
SEM accounts for the following relationships:
Let’s walk through a heuristic visualization to understand the components of a path diagram and the strength of the SEM approach.
There’s a lot to take in here, and not all of these symbols will be used in every SEM diagram.
Exogenous variables: represented by (x1). These are independent variables (either observed or latent) that explain endogenous variables. Arrows point away from them, not at them.
Endogenous variables: represented by (y1 and y2). These are dependent variables (either observed or latent) that are explained by exogenous variables in the model. Arrows point toward them and these represent causal paths.
Latent variables: represented by (L1). These are unobserved variables whose values are inferred from observed indicator variables in the model.
Arrows: (also called edges) are relationships among variables. Sometimes arrows are used to denote positive relationships while blunted lines represent negative relationships. Single headed arrows represent a causal relationship, while double headed arrows represent covariance, i.e. a relationship where we don’t know the direction of effect.
Vertices: (also called nodes) are the variables themselves.
Let’s look at a simplified version of this model:
In the above model we are saying that x1 is a predictor of y1 and y2. Additionally, we are saying that y1 is a predictor of y2. This illustrates another key feature of SEMs: we can separate the direct effect of x1 on y1 from the indirect effect of x1 on y2 that is mediated by y1. Each arrow in the diagram represents a statistical model; for instance, the model predicting y1 looks like y1 = x1 + error. Each response variable gets its own error term (ζ), although these are not always shown in diagrams.
Also note that we can’t have arrows that form loops in our models (the specific terminology is that these are acyclic graphs), so a model like this is invalid:
This is why SEMs are so useful for causal analysis. The basic idea of causality, that causes drive effects and not the other way around, is built into the assumptions of the model.
Typically, before conducting the analysis, it is useful to explicitly draw out the metamodel (a diagram of the variables of interest, including the hypothesized (+/-) relationships).
Frequently used syntax in lavaan:
~ regression (response ~ predictors)
=~ latent variable definition (latent =~ indicators)
~~ variance or covariance
~ 1 intercept
:= defined parameter (e.g., an indirect effect)
Here’s an example of how these operators are used when building a model in R:
model <- '
# regressions
y1 + y2 ~ f1 + f2 + x1 + x2
f1 ~ f2 + f3
f2 ~ f3 + x1 + x2
# latent variable definitions
f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6
f3 =~ y7 + y8 + y9 + y10
# variances and covariances
y1 ~~ y1
y1 ~~ y2
f1 ~~ f2
# intercepts
y1 ~ 1
f1 ~ 1'
Let’s get on to working with real data. For this lab we will simulate everything.
#Load libraries
library(lavaan)
library(lavaanPlot)
library(piecewiseSEM)
library(tidyverse)
library(rjags)
library(jagsUI)
library(coda)
library(matrixStats)
library(pals)
Here we will simulate data on caterpillar richness measured over 60 years. First I simulate a temperature that increases over time. Then I simulate caterpillar richness declining as a function of both temperature and year. Finally I add a precipitation variable with no trend and no effect on caterpillars.
set.seed(2002)
semdata <- data.frame('years' = 1:60) |> #60 years of data
within({
# precip is simulated from a normal distribution with a mean of 100 and standard deviation of 10; 60 values are drawn (one for each year).
precip <- rnorm(60, 100, 10)
# temperature is modeled with an intercept of 0.05 and a positive slope of 0.05, so it increases by 0.05 degrees Celsius each year, plus some Gaussian noise.
temperature <- 0.05 + 0.05*years + rnorm(60, 1, 0.5)
# simulate caterpillar richness as a function of years and temperature
catrich <- 350 - 2*years - 50*temperature + rnorm(60, 50, 50)
})
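As a quick sanity check (a sketch that simply re-runs the same generating code), ordinary lm() fits should recover slopes near the values we plugged into the simulation:

```r
set.seed(2002)
# Re-create the simulated data exactly as above
semdata <- data.frame('years' = 1:60) |>
  within({
    precip <- rnorm(60, 100, 10)
    temperature <- 0.05 + 0.05*years + rnorm(60, 1, 0.5)
    catrich <- 350 - 2*years - 50*temperature + rnorm(60, 50, 50)
  })

# The generating slope for temperature ~ years was 0.05
coef(lm(temperature ~ years, data = semdata))["years"]

# The generating slopes for catrich were -2 (years) and -50 (temperature)
coef(lm(catrich ~ years + temperature, data = semdata))[c("years", "temperature")]
```

The recovered slopes will not match the generating values exactly because of the simulated noise, but they should be close; the same quantities reappear below as path coefficients in the SEM.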
#visualize trend of temperature across the 60 years
ggplot(semdata, aes(years, temperature))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
#visualize caterpillar richness across years
ggplot(semdata, aes(years, catrich))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
In the case where we hadn’t simulated the data and were hypothesizing about these patterns, we would draw a metamodel of the relationships we are about to test (it can still be nice to do this even when you aren’t planning on using an SEM). Below is the metamodel for the model we are about to write. Note that this metamodel is very simplified in that it only displays the edges that represent regression or path coefficients. When creating your model, it is important to consider correlated error structure. The covariance of residuals and the residual variances of the variables are typically not displayed in a path diagram, but the model output will report this information.
Let’s first examine the observed or sample covariance matrix to get a sense of these relationships.
cov(semdata)
## years catrich temperature precip
## years 305.00000 -1358.42825 15.3312769 -23.4048821
## catrich -1358.42825 8825.22098 -82.1936621 90.2473145
## temperature 15.33128 -82.19366 1.0431215 0.2863871
## precip -23.40488 90.24731 0.2863871 97.8564803
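Covariances are easier to eyeball as correlations. Base R’s cov2cor() rescales the matrix above (this assumes the `semdata` object from the simulation chunk):

```r
# Convert the sample covariance matrix to a correlation matrix
round(cov2cor(cov(semdata)), 2)
```

The strong years-temperature and temperature-catrich correlations, and the near-zero correlations involving precip, preview the SEM results below.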
Now let’s run our hypothesized SEM. First we write a model statement, then we fit the model using sem() and plot it using lavaanPlot().
## Write the model
# In the model below, years is the only exogenous variable; precipitation,
# temperature, and caterpillar richness are endogenous.
# Precipitation is predicted by years, caterpillar richness is predicted by
# years and temperature, and temperature is predicted by years.
model1 <- '
precip + temperature ~ years
catrich ~ years + temperature'
# fitting the sem model to your data
model1.fit <- sem(model1, data = semdata)
## Warning: lavaan->lav_data_full():
## some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate
# get a summary including fit statistics (fit.measures = TRUE), standardized estimates (standardized = TRUE; these match the coefficients we would get if we scaled all the data first), and R-squared values (rsq = TRUE)
summary(model1.fit, rsq = TRUE, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 21 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 8
##
## Number of observations 60
##
## Model Test User Model:
##
## Test statistic 5.118
## Degrees of freedom 1
## P-value (Chi-square) 0.024
##
## Model Test Baseline Model:
##
## Test statistic 175.155
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.976
## Tucker-Lewis Index (TLI) 0.854
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -580.210
## Loglikelihood unrestricted model (H1) -577.651
##
## Akaike (AIC) 1176.420
## Bayesian (BIC) 1193.175
## Sample-size adjusted Bayesian (SABIC) 1168.013
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.262
## 90 Percent confidence interval - lower 0.077
## 90 Percent confidence interval - upper 0.504
## P-value H_0: RMSEA <= 0.050 0.034
## P-value H_0: RMSEA >= 0.080 0.948
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.054
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## precip ~
## years -0.077 0.072 -1.059 0.290 -0.077 -0.135
## temperature ~
## years 0.050 0.004 13.027 0.000 0.050 0.860
## catrich ~
## years -1.702 0.652 -2.609 0.009 -1.702 -0.314
## temperature -54.746 11.124 -4.921 0.000 -54.746 -0.592
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip ~~
## .catrich 64.990 57.207 1.136 0.256 64.990 0.148
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip 94.459 17.246 5.477 0.000 94.459 0.982
## .temperature 0.268 0.049 5.477 0.000 0.268 0.261
## .catrich 2034.063 371.367 5.477 0.000 2034.063 0.231
##
## R-Square:
## Estimate
## precip 0.018
## temperature 0.739
## catrich 0.769
# Plot model
lavaanPlot(name = "model1", model1.fit, coefs = TRUE) # draws the path diagram for the relationships specified in `model1`, labeled with the unstandardized path coefficients from the fitted model
Let’s break down the output:
Model Fit Estimates
The general goal of SEM is to identify the set of maximally likely population parameters given the observed data. Thus, model fit is based on a comparison of the observed variance-covariance matrix and the model-implied covariance matrix. If the model is an accurate representation of the patterns in the data, the variance-covariance matrix implied by the estimated parameters should closely match the observed (sample) variance-covariance matrix.
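We can look at both matrices directly in lavaan (a sketch assuming the `model1.fit` object fitted above):

```r
library(lavaan)

# Observed (sample) covariance matrix used by the fit
observed <- lavInspect(model1.fit, "sampstat")$cov

# Model-implied covariance matrix at the estimated parameters
implied <- fitted(model1.fit)$cov

# The discrepancies between the two drive the chi-square test statistic
round(observed - implied, 3)
```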
This is the section describing the fitted model. Values in this section are reported in publications as model fit (e.g., χ2 = 2.3, df = 1, p = 0.128)
The baseline model represents a null model whereby your observed variables are constrained to covary with no other variables - covariances are fixed to 0 such that only variances are estimated in the covariance matrix. It represents a poor fitting model. This is tested against your fitted model.
D-G) More model fit statistics
A bunch of other model fit statistics, mostly beyond the scope of this lesson. In the spirit of somewhat arbitrary cut-offs: RMSEA less than 0.05, CFI/TLI above 0.90 to 0.95, and SRMR less than 0.08 are accepted as indicators of “good fit” (Hu and Bentler 1999).
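If you only want the headline fit indices rather than the full summary, lavaan’s fitMeasures() can pull them by name (again assuming `model1.fit` from above):

```r
library(lavaan)

# Extract selected fit statistics from the fitted model by name
fitMeasures(model1.fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))
```

These are the values to compare against the Hu and Bentler cut-offs.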
Parameter Estimates
Values in this section get interpreted just like any other linear model. These are the path coefficients that are presented on the path diagrams.
Estimates: Contains the (estimated or fixed) parameter value for each model parameter. You have the option to report unstandardized estimates (reported under Estimate) or standardized estimates (reported under Std.all). Standardized estimates are most useful for interpreting effects when neither predictors nor outcomes have natural scales, and for gauging relative effect sizes across predictors on different scales. It is also an option to do partial standardization if you wish to say that a one unit increase in the predictor leads to some change in standard deviations of the response. If you wish for the output to give partially standardized coefficients, use summary(fit, std.nox = TRUE, rsquare = TRUE).
Example interpretation: the unstandardized coefficient for precip ~ years indicates that each year precipitation changes by -0.077 units. Using standardized coefficients, for every 1 standard deviation increase in year, precipitation decreases by 0.135 standard deviations.
Std.err: Contains the standard error for each estimated parameter.
Z-value: Contains the Wald statistic (obtained simply by dividing the parameter estimate by its standard error); the last column (P(>|z|)) contains the p-value for testing the null hypothesis that the parameter equals zero in the population.
I, J) Variance Covariance Outputs
In these sections it is important to note that a (.) before a variable name (e.g., .precip) indicates that it is a residual term. Variances and covariances can be displayed on the path diagram, and there are specific ways to illustrate those relationships; however, this information is often omitted from published path diagrams for simplicity. Published path diagrams in ecology typically only include edges representing the regression coefficients.
This section displays the covariance of the residuals of the endogenous variables precip and catrich (the predicted variables, i.e. those with arrows pointing to them). In our example the value can be interpreted as a positive association between the residuals of precip and catrich that was not accounted for by the exogenous variables. Here the residual terms of precip and catrich have been freed to covary (the sem() wrapper did this automatically). Recall, to designate a covariance (or variance) we use the ~~ operator. Thus, .precip ~~ .catrich tells lavaan to include a residual covariance for precip and catrich. Note that values in the std.all column represent the correlation.
This section reports the residual variance of the endogenous variables precip, temperature and catrich - the left-over variance that is not explained by the predictor(s). In the Std.all column, notice that precip has a lot of unexplained variance (0.982) while catrich (0.231) and temperature (0.261) have much smaller residual variances. This makes sense, since we simulated temperature and catrich to vary with each other and with years.
The amount of total variance explained by the model for each variable. As we would expect, very little of the variance in precip was explained by the model. Had we simulated precip to depend on years, we would have expected the model to explain much more of its variance.
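These R-squared values, and the parameter estimates themselves, can also be extracted programmatically, which is handy for building tables (a sketch assuming `model1.fit`):

```r
library(lavaan)

# R-squared for each endogenous variable
lavInspect(model1.fit, "rsquare")

# Parameter estimates as a data frame (one row per parameter)
head(parameterEstimates(model1.fit, standardized = TRUE))
```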
Let’s try the same model with our variables scaled:
semdata_scaled <- apply(semdata, MARGIN = 2, scale)
model1.scaled <- sem(model1, data = semdata_scaled)
summary(model1.scaled, rsq = TRUE, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 14 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 8
##
## Number of observations 60
##
## Model Test User Model:
##
## Test statistic 5.118
## Degrees of freedom 1
## P-value (Chi-square) 0.024
##
## Model Test Baseline Model:
##
## Test statistic 175.155
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.976
## Tucker-Lewis Index (TLI) 0.854
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -168.878
## Loglikelihood unrestricted model (H1) -166.319
##
## Akaike (AIC) 353.755
## Bayesian (BIC) 370.510
## Sample-size adjusted Bayesian (SABIC) 345.348
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.262
## 90 Percent confidence interval - lower 0.077
## 90 Percent confidence interval - upper 0.504
## P-value H_0: RMSEA <= 0.050 0.034
## P-value H_0: RMSEA >= 0.080 0.948
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.054
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## precip ~
## years -0.135 0.128 -1.059 0.290 -0.135 -0.135
## temperature ~
## years 0.860 0.066 13.027 0.000 0.860 0.860
## catrich ~
## years -0.316 0.121 -2.609 0.009 -0.316 -0.314
## temperature -0.595 0.121 -4.921 0.000 -0.595 -0.592
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip ~~
## .catrich 0.070 0.062 1.136 0.256 0.070 0.148
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip 0.965 0.176 5.477 0.000 0.965 0.982
## .temperature 0.257 0.047 5.477 0.000 0.257 0.261
## .catrich 0.230 0.042 5.477 0.000 0.230 0.231
##
## R-Square:
## Estimate
## precip 0.018
## temperature 0.739
## catrich 0.769
lavaanPlot(name = "scaled", model1.scaled, coefs = TRUE)
SEMs are especially powerful when we use them to test different hypotheses and perform model comparison. This next model does not include a trend in caterpillars over time. We can then perform model comparison using AIC or some other metric.
model2 <- 'precip + temperature ~ years
catrich ~ temperature'
model2.fit <- sem(model2, data = semdata_scaled) #still using scaled data
summary(model2.fit, fit.measures = TRUE, standardized = TRUE, rsquare=TRUE)
## lavaan 0.6-19 ended normally after 13 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 7
##
## Number of observations 60
##
## Model Test User Model:
##
## Test statistic 11.148
## Degrees of freedom 2
## P-value (Chi-square) 0.004
##
## Model Test Baseline Model:
##
## Test statistic 175.155
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.946
## Tucker-Lewis Index (TLI) 0.838
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -171.892
## Loglikelihood unrestricted model (H1) -166.319
##
## Akaike (AIC) 357.785
## Bayesian (BIC) 372.445
## Sample-size adjusted Bayesian (SABIC) 350.428
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.276
## 90 Percent confidence interval - lower 0.134
## 90 Percent confidence interval - upper 0.444
## P-value H_0: RMSEA <= 0.050 0.007
## P-value H_0: RMSEA >= 0.080 0.985
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.055
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## precip ~
## years -0.101 0.127 -0.796 0.426 -0.101 -0.102
## temperature ~
## years 0.860 0.066 13.027 0.000 0.860 0.860
## catrich ~
## temperature -0.870 0.066 -13.147 0.000 -0.870 -0.860
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip ~~
## .catrich 0.112 0.067 1.680 0.093 0.112 0.222
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip 0.966 0.176 5.477 0.000 0.966 0.990
## .temperature 0.257 0.047 5.477 0.000 0.257 0.261
## .catrich 0.262 0.048 5.477 0.000 0.262 0.260
##
## R-Square:
## Estimate
## precip 0.010
## temperature 0.739
## catrich 0.740
lavaanPlot(name = "model2", model2.fit, labels = names(semdata_scaled), coefs = TRUE)
AIC(model1.fit, model1.scaled, model2.fit)
## df AIC
## model1.fit 8 1176.4203
## model1.scaled 8 353.7550
## model2.fit 7 357.7849
INTERPRETATION SUMMARY Note that AIC is only comparable between models fit to the same data, so the unscaled model1.fit cannot be compared with the two scaled models; the meaningful comparison is model1.scaled (AIC 353.8) versus model2.fit (AIC 357.8), which favors the first model. Taking a closer look at it: years has a positive effect on temperature, which is to say that temperature is increasing over time. We can also see that temperature has a direct negative effect on caterpillar richness. There is also a slight negative trend in richness over time after accounting for temperature change, and a slight negative trend in precip. The precip result should be interpreted with caution, as we did not simulate a relationship between precip and year; in fact, if you look at the model output, there is no reason to suspect this is anything other than random noise (high p-value).
Now for the indirect effect. We only have one in this model: the impact of year on caterpillar richness that is mediated by temperature change. The indirect effect is the product of the path coefficients along the mediated route, and the total effect adds the direct path. Using the scaled model, the total impact of year on caterpillars is -0.32 + (0.86 * -0.60), which is about -0.83.
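Rather than multiplying coefficients by hand, lavaan can estimate indirect and total effects directly if we label the paths and use the := operator (a sketch; the label names a, b, and c are arbitrary):

```r
library(lavaan)

# Labeled version of model 1 with derived indirect and total effects
model1_effects <- '
  precip ~ years
  temperature ~ a * years               # year -> temperature path
  catrich ~ c * years + b * temperature # direct path and temperature -> richness path
  indirect := a * b                     # indirect effect of year via temperature
  total := c + (a * b)                  # direct + indirect
'
effects_fit <- sem(model1_effects, data = semdata_scaled)

# The := rows carry the derived effects
subset(parameterEstimates(effects_fit), op == ":=")
```

A bonus of this approach is that lavaan also reports standard errors and p-values for the derived quantities, which a hand calculation does not give.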
Checking for Understanding
In this class we have introduced many techniques, and depending on your data and questions some will be more appropriate than others. That said, what some may call the wrong technique will often still yield reasonable interpretations that reflect some of the underlying patterns in your data. This is not to say that picking the right technique is unimportant (it is very important), but it can be good to try several techniques to see where they agree and where they differ. It is SUPER important to remember that you probably will not report everything you tried, and the results you do report should be chosen based on whether the test was appropriate, not on which one gives you the “best” results.
For this last section we will look at the same data set in a bunch of different ways. I have simulated relationships among several variables. Elevation is a factor with two levels: high and low. At each of the two elevations, 100 plots were placed. In each plot, the leaf area of the plant and its herbivory (an ordinal variable from 0 to 2) were measured, and survival at the end of the season was recorded.
multiple_paths <- read.csv("multiple_paths.csv")
summary(multiple_paths) #examine the data
## plot_num elevation leaf_area herbivory survival
## Min. : 1.00 Length:200 Min. :199.2 Min. :0.000 Min. :0.000
## 1st Qu.: 50.75 Class :character 1st Qu.:220.5 1st Qu.:0.000 1st Qu.:0.000
## Median :100.50 Mode :character Median :227.7 Median :1.000 Median :0.000
## Mean :100.50 Mean :228.0 Mean :1.065 Mean :0.315
## 3rd Qu.:150.25 3rd Qu.:235.0 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :200.00 Max. :259.2 Max. :2.000 Max. :1.000
These data were simulated with SEMs in mind, so it seems reasonable to analyze them using SEMs. There are a couple of different ways we can consider treating our variables. The first interesting one to examine is herbivory, which was measured in an ordinal fashion with three categories. Treating it like any other continuous variable is probably not ideal, but it is easier to interpret, so let’s start there.
#scale herbivory
scaled_paths <- within(multiple_paths, {
herbivory <- scale(herbivory)
leaf_area <- scale(leaf_area)
})
model <- '
leaf_area + herbivory ~ elevation
survival ~ herbivory + elevation'
#Note that writing the model like this is analogous to
#model <- 'leaf_area ~ elevation
# survival ~ elevation
# herbivory ~ elevation
# survival ~ herbivory'
#fit model
scaled_fit <- sem(model, data = scaled_paths)
#summary output
summary(scaled_fit, rsq = T, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 10 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 8
##
## Number of observations 200
##
## Model Test User Model:
##
## Test statistic 2.322
## Degrees of freedom 1
## P-value (Chi-square) 0.128
##
## Model Test Baseline Model:
##
## Test statistic 108.420
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.987
## Tucker-Lewis Index (TLI) 0.923
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -643.959
## Loglikelihood unrestricted model (H1) NA
##
## Akaike (AIC) 1303.919
## Bayesian (BIC) 1330.306
## Sample-size adjusted Bayesian (SABIC) 1304.961
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.081
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.224
## P-value H_0: RMSEA <= 0.050 0.220
## P-value H_0: RMSEA >= 0.080 0.649
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.031
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## leaf_area ~
## elevation 0.598 0.135 4.441 0.000 0.598 0.300
## herbivory ~
## elevation 0.927 0.125 7.425 0.000 0.927 0.465
## survival ~
## herbivory -0.191 0.034 -5.649 0.000 -0.191 -0.410
## elevation 0.367 0.067 5.445 0.000 0.367 0.395
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .leaf_area ~~
## .survival -0.009 0.028 -0.333 0.739 -0.009 -0.024
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .leaf_area 0.906 0.091 10.000 0.000 0.906 0.910
## .herbivory 0.780 0.078 10.000 0.000 0.780 0.784
## .survival 0.178 0.018 10.000 0.000 0.178 0.826
##
## R-Square:
## Estimate
## leaf_area 0.090
## herbivory 0.216
## survival 0.174
#plot it in a path diagram
lavaanPlot(name = "MODEL1", scaled_fit, coefs = TRUE)
Instead of treating herbivory like a continuous variable, we can treat it as a categorical variable. This is a more conservative approach because we get an estimate for each herbivory level. The downside is that the model looks a bit messier and will be slightly harder to fit; that said, we have more than enough data here. First I dummy code the herbivory levels, then I scale leaf area, before finally running the model.
###dummy code herbivory into a categorical variable with 3 levels
dummy_paths <- mutate(multiple_paths, ones = 1) |>
spread(key = herbivory, value = ones, fill = 0) |>
setNames(c('plot_num','elevation','leaf_area','survival','herb0','herb1','herb2'))
#each level of herbivory is now a column filled with 1,0 where 1 indicated that the sample had that level of herbivory
head(dummy_paths)
## plot_num elevation leaf_area survival herb0 herb1 herb2
## 1 1 high 223.9341 0 0 1 0
## 2 2 high 241.1445 0 0 1 0
## 3 3 high 238.8826 0 0 0 1
## 4 4 high 232.0278 0 0 0 1
## 5 5 high 247.5750 1 1 0 0
## 6 6 high 223.0350 0 1 0 0
#Fun tangent. If you are interested, this is the same step as above but showing you a base R method to do this:
baseR <- cbind(subset(multiple_paths, select = -herbivory),
do.call(
cbind,
lapply(0:max(multiple_paths$herbivory),
\(h) 0+(multiple_paths$herbivory == h))))|>
setNames(c('plot_num','elevation','leaf_area','survival','herb0','herb1','herb2'))
head(dummy_paths == baseR)
## plot_num elevation leaf_area survival herb0 herb1 herb2
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#ok. End tangent
# Let's use `dummy_paths`
dummy_paths$leaf_area <- scale(dummy_paths$leaf_area) #scale leaf area
Now that we have the data set dummy_paths ready to go with categorical variables, let’s run a similar model, updated with categorical herbivory.
model <- '
leaf_area + herb1 + herb2 ~ elevation
survival ~ herb1 + herb2 + elevation'
dummy_fit <- sem(model, data = dummy_paths)
summary(dummy_fit, rsq = TRUE, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 10 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 11
##
## Number of observations 200
##
## Model Test User Model:
##
## Test statistic 78.230
## Degrees of freedom 3
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 181.187
## Degrees of freedom 10
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.561
## Tucker-Lewis Index (TLI) -0.465
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -634.292
## Loglikelihood unrestricted model (H1) NA
##
## Akaike (AIC) 1290.585
## Bayesian (BIC) 1326.866
## Sample-size adjusted Bayesian (SABIC) 1292.017
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.354
## 90 Percent confidence interval - lower 0.289
## 90 Percent confidence interval - upper 0.424
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.157
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## leaf_area ~
## elevation 0.598 0.135 4.441 0.000 0.598 0.300
## herb1 ~
## elevation -0.090 0.067 -1.345 0.179 -0.090 -0.095
## herb2 ~
## elevation 0.420 0.061 6.881 0.000 0.420 0.438
## survival ~
## herb1 -0.307 0.063 -4.889 0.000 -0.307 -0.293
## herb2 -0.473 0.069 -6.869 0.000 -0.473 -0.455
## elevation 0.361 0.066 5.439 0.000 0.361 0.362
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .leaf_area ~~
## .survival -0.010 0.028 -0.354 0.723 -0.010 -0.025
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .leaf_area 0.906 0.091 10.000 0.000 0.906 0.910
## .herb1 0.224 0.022 10.000 0.000 0.224 0.991
## .herb2 0.186 0.019 10.000 0.000 0.186 0.809
## .survival 0.177 0.018 10.000 0.000 0.177 0.711
##
## R-Square:
## Estimate
## leaf_area 0.090
## herb1 0.009
## herb2 0.191
## survival 0.289
lavaanPlot(name = "dummy_paths", dummy_fit, coefs = TRUE)
You will probably recall from the complex models lab that we do not usually model binomial data with a regular linear model, which is what both of the previous models have done. lavaan assumes Gaussian (normal-theory) responses for models like these, so we need the piecewiseSEM package for fancier models.
# first, scale leaf area like we did in the previous models
psem_model <- transform(multiple_paths, leaf_area = scale(leaf_area))
# `psem` is a function in the `piecewiseSEM` package that lets you build an SEM
# while specifying non-normal distributions for individual responses, which lavaan
# does not allow. The syntax also differs: instead of enclosing the model in quotes,
# `psem` wraps a set of fitted `lm`/`glm` calls, one per relationship.
glmsem <- psem(
lm(leaf_area ~ elevation, psem_model),
lm(herbivory ~ elevation, psem_model),
glm(survival ~ herbivory + leaf_area + elevation, "binomial", psem_model)
)
# As a summary output we will use `coefs`. This outputs the regression coefficients (path coefficients) for our model
coefs(glmsem)
## Warning: Categorical or non-linear variables detected. Please refer to documentation for interpretation of
## Estimates!
## Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
## 1 leaf_area elevation - - 1 19.5256 0.0000 - ***
## 2 leaf_area elevation = high 224.8589 1.0051 198 223.7262 0.0000 - ***
## 3 leaf_area elevation = low 231.1396 1.0051 198 229.9753 0.0000 - ***
## 4 herbivory elevation - - 1 54.5795 0.0000 - ***
## 5 herbivory elevation = high 0.69 0.0718 198 9.6121 0.0000 - ***
## 6 herbivory elevation = low 1.44 0.0718 198 20.0600 0.0000 - ***
## 7 survival herbivory -1.3462 0.2766 196 -4.8670 0.0000 - ***
## 8 survival leaf_area -0.0073 0.0168 196 -0.4345 0.6639 -
## 9 survival elevation - - 1 26.9222 0.0000 - ***
## 10 survival elevation = high -2.0447 0.3288 Inf -6.2180 0.0000 - ***
## 11 survival elevation = low 0.1363 0.251 Inf 0.5431 0.5870 -
piecewiseSEM does not have a “nice” plotting function. You could make one in illustration software (e.g. Adobe Illustrator, PowerPoint, Inkscape, etc.). Also check out ggraph, tidygraph, and Rgraphviz if you want to try to do this more programmatically.
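If you just need a quick-and-dirty diagram, base R graphics can get you surprisingly far. A minimal sketch (node positions and paths are chosen by hand to match our model; this is only a stand-in for the packages above):

```r
# lay out an empty canvas with no axes
plot(NULL, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE, xlab = "", ylab = "")

# hand-picked positions for the four variables in our model
nodes <- data.frame(
  name = c("elevation", "leaf_area", "herbivory", "survival"),
  x = c(1, 5, 5, 9),
  y = c(5, 8, 2, 5)
)

# the hypothesized paths among them
paths <- data.frame(
  from = c("elevation", "elevation", "elevation", "leaf_area", "herbivory"),
  to   = c("leaf_area", "herbivory", "survival", "survival", "survival")
)

# draw one arrow per path, then label the nodes on top
for (i in seq_len(nrow(paths))) {
  a <- nodes[nodes$name == paths$from[i], ]
  b <- nodes[nodes$name == paths$to[i], ]
  arrows(a$x, a$y, b$x, b$y, length = 0.1)
}
text(nodes$x, nodes$y, nodes$name, font = 2)
```

This is purely cosmetic, but it makes it easy to keep the diagram in the same script as the model.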
If you look at the coefficient directionality and the magnitude of the estimates, the patterns follow our overall story!
So far we have only looked at different types of SEMs. What if we fit ordinary linear models instead? Can we still pick up the signal in the data?
linear <- lm(survival ~ elevation + herbivory + leaf_area + elevation*herbivory, data = multiple_paths)
summary(linear)
##
## Call:
## lm(formula = survival ~ elevation + herbivory + leaf_area + elevation *
## herbivory, data = multiple_paths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7940 -0.2750 -0.1565 0.4634 1.0518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.612963 0.681641 0.899 0.369631
## elevationlow 0.418324 0.115111 3.634 0.000357 ***
## herbivory -0.214975 0.059984 -3.584 0.000428 ***
## leaf_area -0.001088 0.003042 -0.358 0.721008
## elevationlow:herbivory -0.041847 0.084693 -0.494 0.621795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4269 on 195 degrees of freedom
## Multiple R-squared: 0.1765, Adjusted R-squared: 0.1596
## F-statistic: 10.45 on 4 and 195 DF, p-value: 1.086e-07
binom <- glm(survival ~ elevation + herbivory + leaf_area + elevation*herbivory, family = "binomial", data = multiple_paths)
summary(binom)
##
## Call:
## glm(formula = survival ~ elevation + herbivory + leaf_area +
## elevation * herbivory, family = "binomial", data = multiple_paths)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.04381 3.74622 0.279 0.780530
## elevationlow 1.70865 0.60216 2.838 0.004546 **
## herbivory -1.86889 0.56165 -3.327 0.000876 ***
## leaf_area -0.00656 0.01673 -0.392 0.694908
## elevationlow:herbivory 0.74054 0.64743 1.144 0.252700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 249.22 on 199 degrees of freedom
## Residual deviance: 208.95 on 195 degrees of freedom
## AIC: 218.95
##
## Number of Fisher Scoring iterations: 5
simple_binom <- glm(survival ~ elevation, family = "binomial", data = multiple_paths)
summary(simple_binom)
##
## Call:
## glm(formula = survival ~ elevation, family = "binomial", data = multiple_paths)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2657 0.2414 -5.243 1.58e-07 ***
## elevationlow 0.9017 0.3156 2.857 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 249.22 on 199 degrees of freedom
## Residual deviance: 240.75 on 198 degrees of freedom
## AIC: 244.75
##
## Number of Fisher Scoring iterations: 4
We lose some of the indirect-effect interpretation with these, but we still retain the interpretation that herbivory has a negative effect on survival, that high elevations have lower survival, and that leaf area is not very predictive.
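One piece of that lost indirect-effect interpretation can be recovered by hand: an indirect effect is the product of the coefficients along a path. A minimal sketch on simulated (hypothetical) data, where x affects y only through a mediator m:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
m <- 0.8 * x + rnorm(n)        # x -> m path, true coefficient 0.8
y <- -0.5 * m + rnorm(n)       # m -> y path, true coefficient -0.5

a <- coef(lm(m ~ x))["x"]      # estimated x -> m path
b <- coef(lm(y ~ m + x))["m"]  # estimated m -> y path, controlling for x

indirect <- a * b              # indirect effect of x on y through m
indirect                       # true value is 0.8 * -0.5 = -0.4
```

This is the same arithmetic an SEM does for you (along with proper standard errors for the product), which is part of why the SEM framing is convenient.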
We can also model these data as a Bayesian hierarchical model. Since
plots are nested in elevation we can estimate both elevation specific
and across elevation effects. We’ll work with the
dummy_paths
data set:
##############################
# Step 1:                     #
# Data frame to list for JAGS #
##############################
# Put the model data into a list; the `jags` function requires this format.
# Note we scale leaf area and convert elevation to a numeric variable so JAGS can interpret it.
model_data <- with(dummy_paths,
list(
surv = survival,
herb0 = herb0,
herb1 = herb1,
herb2 = herb2,
area = scale(leaf_area)[,1],
elevs = as.numeric(factor(elevation)),
N = length(plot_num), #jags requires the number of observations to be specified. In our case each row is a unique plot, so if we use length on the plot column, we will get the number of observations
Nelev = length(unique(elevation))))
#############################
# Step 2:                    #
# Build model                #
#############################
# Model specification
"
Here are the parts of the model:
`for(i in 1:N)`: for each observation i in 1 through N. Here N = 200.
`surv[i] ~ dbern(p[i])`: surv is distributed as a Bernoulli distribution with parameter p,
                         where p is the probability of success - in this case the
                         probability of survival. i refers to each trial, and the
                         number of trials equals the number of observations.
`logit(p[i])`: the logit (log-odds) probability of success for the ith observation.
`logit(p[i]) <- beta0[elevs[i]] * herb0[i] + beta1[elevs[i]] * herb1[i] +
                beta2[elevs[i]] * herb2[i] + beta3[elevs[i]] * area[i]`:
                         The logit probability of success for the ith observation
                         is a linear function of the different categories of
                         herbivory (herb0, herb1, herb2) and of leaf area.
                         The brackets next to the beta coefficients (e.g. beta1[elevs[i]])
                         indicate the hierarchical components of the model,
                         such that a beta coefficient is estimated at each level of
                         elevation for each herbivory category and for leaf area.
Now we define uninformative priors. Prior distributions are a central component of Bayesian models.
`for (j in 1:Nelev)`: For the j'th elevation in 1 through the number of the elevations
(which is 2 levels of elevation; high and low)
`dnorm(hbmu0, 0.001)`: Each beta coefficient has a prior distribution. In this
                       example, we have chosen uninformative priors that are
                       defined as coming from a normal distribution with two parameters:
                       a mean (hbmu0) and a precision (0.001). You may notice that
                       typically in a Gaussian distribution we specify the mean
                       and standard deviation; for the priors it is the convention
                       to use precision, which is the inverse variance (1/variance).
Finally, we define uninformative hyperpriors (good band name).
Notice in this chunk of code, hbmu is a parameter of a parameter (e.g. hbmu0 is a hyperparameter
since it is used to model the parameter beta0). Since it is a hyperparameter for the prior
distribution, it is called a hyperprior. We need to include hyperparameters in the model as
random variables since we do not know their values. Like above, we made these hyperpriors
uninformative by using a normal distribution to model our uncertainty in hbmu, with a
mean of 0 and a precision of 0.001.
"
sim_model <- "model {
for(i in 1:N) {
surv[i] ~ dbern(p[i])
logit(p[i]) <- beta0[elevs[i]] * herb0[i] +
beta1[elevs[i]] * herb1[i] +
beta2[elevs[i]] * herb2[i] +
beta3[elevs[i]] * area[i]
}
for (j in 1:Nelev) {
beta0[j] ~ dnorm(hbmu0, 0.001)
beta1[j] ~ dnorm(hbmu1, 0.001)
beta2[j] ~ dnorm(hbmu2, 0.001)
beta3[j] ~ dnorm(hbmu3, 0.001)
}
hbmu0 ~ dnorm(0, 0.001)
hbmu1 ~ dnorm(0, 0.001)
hbmu2 ~ dnorm(0, 0.001)
hbmu3 ~ dnorm(0, 0.001)
}
"
writeLines(sim_model, con="sim_model.txt")
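Two conventions in the model above are easy to sanity-check in base R: precision is the inverse of variance, and the logit link (with its inverse, `plogis()`) connects the linear predictor to a probability. This is just an aside, not part of the lab's model:

```r
# A prior precision of 0.001 corresponds to a standard deviation of
# sqrt(1/0.001), roughly 31.6 - very wide on the logit scale, which is
# what makes these priors effectively uninformative.
prior_sd <- sqrt(1 / 0.001)
prior_sd     # ~31.62

# The logit link and its inverse: a linear predictor of 0
# corresponds to a survival probability of 0.5.
qlogis(0.5)  # logit(0.5) = 0
plogis(0)    # inverse logit of 0 = 0.5
```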
############################################################
# Step3: #
#Run data through the Sampler and specify sampler details #
############################################################
#This specifies details of the MCMC chain (i.e., number of burnin, number of iterations, number of chains)
jags_model <- jags(data = model_data,
n.adapt = 2000,
n.burnin = 1000,
n.iter = 10000,
n.chains = 4,
modules = "glm",
model.file = "sim_model.txt",
parameters.to.save = c("beta0", "beta1", "beta2", "beta3", "hbmu0", "hbmu1", "hbmu2", "hbmu3"),
verbose = F)
# The `jags_model` output is an array. The code below pulls the necessary information out of the array to make a handy data.frame called `plot_data`. `med` is the median of the posterior distribution - these are the beta coefficients we report. `lc` and `uc` are the lower and upper bounds of the 95% credible interval around each estimate. Recall that the beta coefficients and their credible intervals represent the effect of herbivory and leaf area on survival at each level of elevation (high and low).
plot_data <- with(jags_model,
data.frame(
'label' = c("High_Herb0", "High_Herb1", "High_Herb2", "High_Leaf",
"Low_Herb0", "Low_Herb1", "Low_Herb2", "Low_Leaf"),
'med' = c(q50$beta0[1], q50$beta1[1], q50$beta2[1], q50$beta3[1],
q50$beta0[2], q50$beta1[2], q50$beta2[2], q50$beta3[2]),
'lc' = c(q2.5$beta0[1], q2.5$beta1[1], q2.5$beta2[1], q2.5$beta3[1],
q2.5$beta0[2], q2.5$beta1[2], q2.5$beta2[2], q2.5$beta3[2]),
'uc' = c(q97.5$beta0[1], q97.5$beta1[1], q97.5$beta2[1], q97.5$beta3[1],
q97.5$beta0[2], q97.5$beta1[2], q97.5$beta2[2], q97.5$beta3[2]),
'col' = c('high_herb','high_herb', 'high_herb', 'leaf_area',
'low_herb','low_herb','low_herb','leaf_area')))
plot_data
## label med lc uc col
## 1 High_Herb0 -0.34648097 -0.9997522 0.2877027 high_herb
## 2 High_Herb1 -3.09697389 -5.0919547 -1.8428158 high_herb
## 3 High_Herb2 -3.02170515 -6.3379414 -1.2268727 high_herb
## 4 High_Leaf 0.03307931 -0.5214955 0.5885796 leaf_area
## 5 Low_Herb0 1.35027904 0.1180363 2.9340244 low_herb
## 6 Low_Herb1 0.17345041 -0.5588493 0.9097685 low_herb
## 7 Low_Herb2 -1.00734040 -1.6503275 -0.4114526 low_herb
## 8 Low_Leaf -0.16131156 -0.6418516 0.3050866 leaf_area
Now let’s plot the estimated beta coefficients and their credible intervals for the effects of herbivory and leaf area on survival at each elevation category:
ggplot(plot_data) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray", linewidth = 2) +
geom_errorbarh(aes(xmin = lc, xmax = uc, y = label, color = col), height = 0, linewidth = 2) +
geom_point(aes(med, label, color = col), pch = 21, fill = "gray65", size=3) +
labs(x = "Point Estimates of beta coefficients from posterior distribution (95% CI)", y="", color='')+
scale_color_manual(values = ocean.haline(4)[-1])+
theme_classic()
When beta coefficients are close to zero and their credible intervals overlap zero, we have low confidence in an effect of those variables on survival. Here we see the same pattern as before: increasingly negative effects of herbivory at high elevation (teal), no effect of leaf area (green), and greater survival overall at low elevations (yellow).
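For intuition, the posterior medians and credible intervals in `plot_data` are just quantiles of the MCMC draws. A minimal sketch on simulated (hypothetical) draws standing in for one coefficient's posterior:

```r
set.seed(42)
# stand-in for MCMC output: 4000 posterior draws of one beta coefficient
draws <- rnorm(4000, mean = -1.3, sd = 0.3)

med <- median(draws)                             # posterior median (the `med` column)
ci  <- quantile(draws, probs = c(0.025, 0.975))  # 95% credible interval (`lc`, `uc`)
med
ci
# Because the whole interval lies below zero, we would report a clear
# negative effect, as with the high-elevation herbivory coefficients.
```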
#This dataframe includes the OVERALL beta coefficients and credible intervals of herbivory and leaf area effect on survival
plot_overall <- with(jags_model,
data.frame(
'label' = c("Herb0", "Herb1", "Herb2", "Leaf"),
'med' = c(q50$hbmu0, q50$hbmu1, q50$hbmu2, q50$hbmu3),
'lc' = c(q2.5$hbmu0, q2.5$hbmu1, q2.5$hbmu2, q2.5$hbmu3),
'uc' = c(q97.5$hbmu0, q97.5$hbmu1, q97.5$hbmu2, q97.5$hbmu3)))
#plot the posterior distributions
ggplot(plot_overall) +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray", linewidth = 2) +
geom_errorbarh(aes(xmin = lc, xmax = uc, y = label), height = 0, linewidth = 2) +
geom_point(aes(med, label), pch = 21, fill = "gray65", size=3) +
labs(x = "", y="") +
theme_classic()
There are no cross-elevation effects, as we can see from the higher-level (hyperparameter) estimates. Cross-elevation effects were not part of the original simulation, so this also checks out.
Check for Understanding
Structural Equation Modeling via path analysis is a useful framework for testing causal hypotheses and understanding direct and indirect relationships among variables, particularly when considering latent variables.
Path diagrams are a strength of the method: they let you carefully outline hypotheses as metamodels and then illustrate the direct and indirect relationships among variables using model-estimated coefficients (and other properties, e.g. residual variance/covariance estimates).
Model fit is evaluated by comparing the observed covariance matrix to the model-estimated covariance matrix.
The piecewiseSEM package makes it possible to model variables with non-normal distributions.
The overall relationships among variables can be recovered by many approaches, and you gain more confidence in them when you try several techniques and see where they agree and where they differ.
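The model-fit idea in the points above can be sketched in base R: for a single path y = b*x + e, the model-implied covariance between x and y is b * var(x), which we can compare to the observed covariance. This only illustrates the principle, not how lavaan computes its fit statistics:

```r
set.seed(7)
n <- 1000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)  # a single hypothesized path: x -> y

b <- coef(lm(y ~ x))["x"]

observed <- cov(x, y)    # observed covariance between x and y
implied  <- b * var(x)   # covariance implied by the fitted path coefficient
c(observed = observed, implied = implied)
```

With a single path the model is saturated, so the implied covariance reproduces the observed one exactly (OLS sets b = cov(x, y)/var(x)); SEM fit statistics gain traction from the covariances a model constrains, e.g. paths fixed to zero.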