Resources:

Online Tutorial: UCLAstats

Readings Recommended by Lee for this script:

Rogosa, D. (1980). Comparing nonparallel regression lines. Psychological Bulletin, 88, 307–321.

Bauer, D. J., & Curran, P. J. (2005). Probing interactions in fixed and multilevel regression: Inferential and graphical techniques. Multivariate Behavioral Research, 40(3), 373–400.

Set working directory.

setwd("...your_working_directory/")

Load the following libraries. If not yet installed, use install.packages("package_name")

library("car")
library("data.table")
library("tidyverse")
library("pals")
library("jtools")
library("ggplot2")
library("ggiraphExtra")
library("emmeans")
library("corrgram")
library("Matrix")

Exploring Interactions

In last week's lab, we explored linear models that contained only main effects. In other words, we focused on models that evaluated the distinct impact of two predictor variables on the response. However, it may be the case that the magnitude of the effect of a predictor variable (x1) on the response y is modified by a second predictor variable (x2, or some grouping factor). For example, a large difference in y may be observed between two levels of x2 at lower values of x1, but that difference may disappear at higher values of x1. Interaction terms explore whether the effect of one predictor on the response (Y ~ X1) changes for different levels of a second predictor (X2). Visually, interactions can be illustrated in many ways, but examples frequently draw on non-parallel lines in x-y plots (e.g. intersecting lines).

Figure 1: Different Illustrations of an Interaction

What exactly are we trying to evaluate when we ask “Is there an interaction?”

Note in this lab, we will focus on effect sizes when interpreting the outputs - not null hypothesis testing (e.g. p-values/significance testing). The first section focuses on interpreting the coef output. In other words, understanding the effect sizes as they relate to linear models with interactions. In the second section, you will manipulate interactions through simulations.

Section 1: Interpreting Outputs

Note: This section covers ONLY multiplicative interactions. We will explore non-multiplicative interactions using simulated data in Section 2.

Interactions with two categorical predictors (each with two levels)

We will use the Salaries data set for this example. Salaries comes with the car package and contains salaries for professors of different ranks (assistant professor, associate professor, and professor), from two disciplines (A and B), along with binary sex data (male, female), among other variables. Below we explore whether sex modifies the relationship between salary and discipline. In other words, perhaps males earn more in one discipline and females in the other.
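
Before modeling, it can help to glance at the design. A quick cross-tabulation (just a descriptive check) shows how many professors fall in each discipline-by-sex cell:

table(Salaries$discipline, Salaries$sex)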

salary_model <- lm(salary ~ discipline*sex, data = Salaries)
coef(salary_model)
##         (Intercept)         disciplineB             sexMale disciplineB:sexMale 
##            89064.94            22169.58            21635.04           -14109.19
interaction.plot(Salaries$discipline, Salaries$sex, Salaries$salary)

Remember, coef returns the effect sizes (aka differences in means). The interpretation of coef is as follows.

  • Intercept or β0: the mean salary for females in discipline A. Females in discipline A make on average $89,064.94 (refer to eq. A below).
  • disciplineB or β1: the effect (change in mean salary) of discipline B for females. On average, females make $22,169.58 more in discipline B than in discipline A, so the mean salary for females in discipline B is $111,234.52 (89,064.94 + 22,169.58) (refer to eq. B below).
  • sexMale or β2: the effect (change in mean salary) of being male in discipline A. Males in discipline A make on average $21,635.04 more than females in discipline A, so the average salary for males in discipline A is $110,699.98 (89,064.94 + 21,635.04) (refer to eq. C below).
  • disciplineB:sexMale or β3: the change in the discipline effect if sex is male, or, equivalently, the change in the sex effect if the discipline is B. The mean salary for males in discipline B is obtained by adding all the coefficients together: $118,760.37 (89,064.94 + 22,169.58 + 21,635.04 - 14,109.19) (refer to eq. D below).

If those interpretations make perfect sense to you, skip ahead to the next section. Otherwise, it is a helpful exercise, especially if you are newer to stats, to write out the mathematical formula and remind yourself what is happening under the hood; the way R codes a model can sometimes detach us from the mathematical model.

First a quick review of model components:

Figure 2: Components of a linear model

Now we will create a regression equation that solves for each of the conclusions above.

Calculating Intercept/ Salary for females in discipline A:

Remember, using the lm() function with two categorical variables is equivalent to performing a 2x2 ANOVA (since ANOVAs are linear models, you can get R to summarize aov() and lm() fits in the same way using summary.aov()). With categorical variables, R dummy codes the levels, which determines the structure of the regression formula. Here are the dummy codes relevant to our model above:

discipline: A = 0; B = 1
sex: female = 0; male = 1

How do I know that this is the way R dummy codes these variables? By default, R treats the first level of a factor (alphabetical order) as the reference level, coded 0, and creates a dummy variable coded 1 for each remaining level. The coef headings disciplineB and sexMale tell you that those two levels were assigned dummy code = 1 for their respective variables.
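
If you want to confirm the coding R used, you can inspect the factor levels and contrasts directly; the first level listed is the reference level (coded 0):

levels(Salaries$discipline) #the first level listed is the reference for discipline
levels(Salaries$sex)        #the first level listed is the reference for sex
contrasts(Salaries$sex)     #shows the 0/1 dummy coding R will use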

Let's input the coefficients and dummy code accordingly. If we are interested in the salary for females in discipline A, this implies the dummy codes discipline = 0 and sexMale = 0. All terms cancel with the exception of the intercept. Yay! It makes sense that the intercept represents the mean salary for females in discipline A.

Calculating salary for females in discipline B:

If we are interested in the salary for females in discipline B, this implies the dummy codes discipline = 1 and sexMale = 0. Only the sex term and the interaction term cancel this time. Hence, we add the effect of discipline to the intercept.

Calculating salary for males in discipline A:

Calculating salary for males in discipline B:
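
As a sanity check on the hand calculations above, the four cell means can be computed directly from the data; they should match the sums of coefficients worked out for each group:

aggregate(salary ~ discipline + sex, data = Salaries, FUN = mean)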

In summary, the effect of sex on salary is modified by the discipline to which the professors belong. Male professors generally earn higher salaries than female professors (ugh), and this disparity is greater in discipline A than in discipline B.

Need Review? Video1 OR Video2

Interaction with one continuous and one categorical predictor

Let’s grab some example data from online.

autompg = read.table(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", 
    quote = "\"", 
    comment.char = "", 
    stringsAsFactors = FALSE) |>
  
  setNames(c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")) |> #name the columns
  
  subset(hp   != "?") |> #remove missing data, which is stored as "?"
  subset(name != "plymouth reliant") |> #remove the plymouth reliant, as it causes some issues
  subset(!(cyl %in% c(3, 5))) |> #remove 3 and 5 cylinder cars (which are very rare.)
  select(-name) |> #remove the "name" column
  
  within({
    hp       <- as.numeric(hp) #change horsepower from character to numeric
    domestic <- as.numeric(origin == 1) #create a dummy variable: domestic cars = 1, foreign = 0
    cyl      <- as.factor(cyl) #change cyl from numeric to factor
  }) 

Using autompg, let's model mpg as a function of engine displacement (disp), whether the car is domestic, and their interaction:

#There are two ways to code the same interaction in R using the lm() function. One uses the `*` operator and the other the `:` operator. 
model.disp_dom1 <- lm(mpg ~ disp*domestic, data = autompg)
model.disp_dom2 <- lm(mpg ~ disp + domestic + disp:domestic, data = autompg) 

#R interprets * as a combination of + (addition) and : (interaction). R should see these models as exactly the same

#Yay! outputs yield same results-just as we'd expect.
summary(model.disp_dom1)
## 
## Call:
## lm(formula = mpg ~ disp * domestic, data = autompg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8332  -2.8956  -0.8332   2.2828  18.7749 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.05484    1.80582  25.504  < 2e-16 ***
## disp           -0.15692    0.01668  -9.407  < 2e-16 ***
## domestic      -12.57547    1.95644  -6.428 3.90e-10 ***
## disp:domestic   0.10252    0.01692   6.060 3.29e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared:  0.7011, Adjusted R-squared:  0.6987 
## F-statistic: 296.3 on 3 and 379 DF,  p-value: < 2.2e-16
summary(model.disp_dom2)
## 
## Call:
## lm(formula = mpg ~ disp + domestic + disp:domestic, data = autompg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8332  -2.8956  -0.8332   2.2828  18.7749 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.05484    1.80582  25.504  < 2e-16 ***
## disp           -0.15692    0.01668  -9.407  < 2e-16 ***
## domestic      -12.57547    1.95644  -6.428 3.90e-10 ***
## disp:domestic   0.10252    0.01692   6.060 3.29e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared:  0.7011, Adjusted R-squared:  0.6987 
## F-statistic: 296.3 on 3 and 379 DF,  p-value: < 2.2e-16

Let’s look at this visually:

#Define intercepts. 
int_for = coef(model.disp_dom1)[1]
int_dom = coef(model.disp_dom1)[1] + coef(model.disp_dom1)[3] #Consider the formula above that we worked out by hand. The calculation for int_dom should make sense now. 

#Define slopes
slope_for = coef(model.disp_dom1)[2]
slope_dom = coef(model.disp_dom1)[2] + coef(model.disp_dom1)[4]#Consider the formula above that we worked out by hand. The calculation for slope_dom should make sense now. 

#plot the coefficients and slopes using base R functions
plot(mpg ~ disp, data = autompg, col = domestic + 1, pch = domestic + 1, xlim = c(0, 500), ylim = c(0, 50))
abline(int_for, slope_for, col = 1, lty = 1, lwd = 2) # line for foreign cars
abline(int_dom, slope_dom, col = 2, lty = 2, lwd = 2) # line for domestic cars
legend("topright", c("Foreign", "Domestic"), pch = c(1, 2), col = c(1, 2))

#or if you don't want to do the math
ggplot(autompg, aes(disp, mpg, color = as.factor(domestic)))+
  geom_point()+
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

Again, since we are emphasizing effect sizes (not significance testing) let’s just look at the coef

coef(model.disp_dom1)
##   (Intercept)          disp      domestic disp:domestic 
##    46.0548423    -0.1569239   -12.5754714     0.1025184

Interpreting Model Outputs:

Remember from the ANOVA example: we cannot interpret the main effects without reference to a given level of the other variable. disp = -0.16 is not interpreted as an overall effect on mpg. Instead, to interpret disp = -0.16 we must take into account the two levels of domestic.

Means and slopes for each group:

Skip ahead if you can interpret the coef output easily and those interpretations make perfect sense. We will now write out the model like before and show how we came to the conclusions above.

Let’s interpret coefficients by implementing dummy coding.

Let's first consider the distinct effect of engine displacement disp on mpg for foreign cars. To do so, we dummy code domestic as 0. Solving the equation reduces it to mpg = 46.05 + (-0.16*disp). We can now interpret the coefficient estimate for disp in the summary output above as: in foreign cars, the estimated change in average fuel efficiency (mpg) is a reduction of 0.16 for every one-unit increase in engine displacement. The mean mpg for foreign cars with zero displacement is 46.05.

Now let's solve for the distinct effect of engine displacement disp on mpg for domestic cars. To do so, we dummy code domestic as 1. Solving the equation, you will see we end up with the following: β1 + β3 = -0.05 and β0 + β2 = 33.48. We can now interpret the coefficient estimate for disp as: in domestic cars, the estimated change in average fuel efficiency (mpg) is a reduction of 0.05 for every one-unit increase in engine displacement. We also see that the intercept for domestic cars is 33.48.

Summary: foreign cars have a stronger negative relationship between engine displacement and miles per gallon than domestic cars. At lower values of displacement, foreign cars have greater fuel efficiency than domestic cars. However, as displacement increases above roughly 120, domestic cars have greater fuel efficiency.
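
The displacement at which the two fitted lines cross can be computed directly from the intercepts and slopes defined above, by setting the two line equations equal and solving (a quick sketch):

#displacement where the foreign and domestic lines intersect (roughly 120)
(int_dom - int_for) / (slope_for - slope_dom)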

Review: Video 1

Check for Understanding

  1. Fit the following model, which includes a categorical variable with 3 levels (cyl), and interpret the effect sizes: lm(mpg ~ disp*cyl).

Interaction with two continuous predictors

Now let’s consider two continuous predictor variables (disp and hp, engine displacement and horsepower) and their effect on miles per gallon.

# hp = horsepower
model.disp_hp <- lm(mpg ~ disp *hp, data = autompg)
coef(model.disp_hp)
##   (Intercept)          disp            hp       disp:hp 
## 52.4081997848 -0.1001737655 -0.2198199720  0.0005658269

As we did before, let's write out the equation (eq. a). First, we can rearrange terms by factoring out x1 (eq. b). Once we input the coef (eq. c), we see that for a one-unit increase in x1 (disp), the mean of Y (mpg) changes by β1 + β3*x2. This indicates that the change in mean Y for each unit increase in x1 will differ across values of x2 (hp) (eq. d).

Let's solve for x2 (eq. e-g). For a one-unit increase in x2 (hp), the mean of Y (mpg) changes by β2 + β3*x1 (eq. g). This indicates that the change in mean Y for each unit increase in x2 will differ across values of x1 (disp).
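
To make this concrete, here is a short sketch that computes the disp slope (β1 + β3*hp) at two arbitrary hp values, straight from the coefficient vector:

b <- coef(model.disp_hp)
b[["disp"]] + b[["disp:hp"]] * 100 #slope of disp when hp = 100
b[["disp"]] + b[["disp:hp"]] * 200 #slope of disp when hp = 200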

To plot this we can bin data and you will see an example of this later in this script using simulated data where values are grouped by quantiles. For now, let’s interpret the coef.

coef(model.disp_hp)
##   (Intercept)          disp            hp       disp:hp 
## 52.4081997848 -0.1001737655 -0.2198199720  0.0005658269
  • β0 = 52.408 is the estimated average mpg for a car with 0 disp and 0 hp
  • β1 = -0.10 is the estimated change in average mpg for a one-unit increase in disp, for a car with 0 hp. For cars with nonzero hp, the effect of disp on mpg increases by 0.0005 for every one-unit increase in hp.
  • β2 = -0.219 is the estimated change in average mpg for a one-unit increase in hp, for a car with 0 disp. For cars with nonzero disp, the effect of hp on mpg increases by 0.0005 for every one-unit increase in disp.
  • β3 = 0.0005 is the estimated modification to the change in average mpg for an increase in disp, for a car of a given hp (or vice versa).

Need Review? Video1

You may be asking: surely there is an automated way to get those slopes for interactions? Of course there is! But the approach above hopefully built a better understanding. Use emmeans::emtrends() for a quick output of the slopes at each level of the modifier variable. For example:

emtrends(model.disp_dom1, ~ domestic, var = "disp")
##  domestic disp.trend      SE  df lower.CL upper.CL
##         0    -0.1569 0.01670 379  -0.1897  -0.1241
##         1    -0.0544 0.00282 379  -0.0599  -0.0489
## 
## Confidence level used: 0.95
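
emtrends() also works when the moderator is continuous; you just supply the values at which to probe. The hp values below are arbitrary choices for illustration:

emtrends(model.disp_hp, ~ hp, var = "disp", at = list(hp = c(50, 100, 200)))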

Section 2: The Basics of Interactions with Simulated Data

These are some functions that will come in handy later. split_data() takes a variable var in data and breaks it into a specified number of quantiles n_breaks. It adds a column break_num to data, which records the quantile bin to which each observation belongs. ggint() makes an interaction plot where colors represent quantile bins, which will save us from copy-pasting this code over and over.

split_data <- function(data, n_breaks, var){
  
  breaks <- quantile(data[,var], seq(0, 1, by = 1/n_breaks)) #named vector of quantiles for the input column
  data$break_num <- as.factor(sapply(data[,var], \(i) max(which(breaks <= i)))) #the quantile for each point
  
  #fixes max value returning n_breaks + 1 when it should be n_breaks
  return(within(data,{break_num[data[,var] == max(data[,var])] = n_breaks})) 
}

ggint <- function(data){
  return(
    ggplot(data) + 
      geom_point(aes(x1, y, fill = break_num), pch = 21, color = "black") + 
      geom_smooth(aes(x1, y, group = break_num), color = "black", se = FALSE, method = lm, linewidth = 1.5)+
      geom_smooth(aes(x1, y, color = break_num), se = FALSE, method = "lm", linewidth = 1) + 
      scale_fill_manual(values = ocean.haline(6)) + #pals::ocean contains colorblind friendly palettes
      scale_color_manual(values = ocean.haline(6)) + 
      labs(fill = "x2 quantile", color = "x2 quantile") + 
      theme_classic())
}

Main Effects Model- No Interactions

Before we examine interactions let’s start with a dataset with no interactions - only additive effects. The code below creates that data frame. It is important to remember that we are simulating data, therefore, we are defining what predicts the response variable (y). This is in contrast to when you have data in hand and you fit a model to find what predicts y. Here we are defining y as y <- 5*x1 + -2*x2 + rnorm(1000, 0, 2). Which is to say we are simulating y to be dependent on two main effects: x1 and x2 with an error term (noise) that is normally distributed. Notice that by making our errors normally distributed we are “forcing” our model to meet the assumptions of linear models.

set.seed(335)

# simulate two uncorrelated predictor variables: x1 and x2. If we wanted to simulate correlated variables,
# we would do something like this: x2 <- x1 + rnorm(1000, 0, 0.2)
additive <- data.frame('x1' = rnorm(1000, 0, 1), 'x2' = rnorm(1000, 0, 1)) |>
  within({y <- 5*x1 + -2 *x2 + rnorm(1000, 0, 2)}) #generate a response variable that is predicted by x1 and x2, with noise.

To look for interactions in each data set, we will examine whether the relationship between x1 and y changes for different subsets of the data. Since x2 is currently a continuous variable, we will make it categorical using the split_data function, breaking x2 into 6 quantile bins (because we specified n_breaks = 6). The new column break_num is created in the data frame slopes_plot: break_num = 1 represents the lowest values of x2, while break_num = 6 represents the highest values of x2.

slopes_plot <- split_data(additive, 6, "x2")
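
A quick way to confirm the split worked as intended is to count the observations per quantile bin; each of the 6 bins should hold roughly 1000/6 observations:

table(slopes_plot$break_num)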

Let’s plot the data. We see that the relationship between x1 and y is independent of x2. We know this because the slope of the line does not change whether looking exclusively at the low values of x2 or the high values of x2. No interactions here.

ggint(slopes_plot)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Let’s view this relationship another way by plotting the slope estimates for each x2 quantile. Note here we have n_breaks = 20 rather than n_breaks = 6 from above. The latter would only leave us with 6 points on the graph.

# `coef_plot_data` is a data frame of slope estimates (aka. effect sizes or coefficients) for the effect of X1 on Y for each group (here, quantile of X2). 
slopes_plot <- split_data(additive, 20, "x2")

coef_plot_data <- data.frame(
  'x1' = sapply(1:20, \(i) coef(lm(y ~ x1, data = subset(slopes_plot, break_num == i)))[[2]]),
  'x2' = 1:20)

ggplot(coef_plot_data, aes(x1, x2)) + 
 geom_point() + 
 labs(x = "Effect size of x1 on y", y = "Quantile of x2") + 
 theme_classic()

Here we see how the slope estimate varies across subsets of the data. Again, no pattern here; in other words, there is no interaction between x1 and x2. The effect of x1 on y is not modified by x2. As we simulated, these predictors act independently of each other.

While the plot above shows compelling visual evidence that the x1 slope does not depend on the x2 quantile, let's verify using the effect size from a linear model. The effect size is essentially zero (0.21 +/- 4.1 SE).

summary(lm(x2 ~ x1, data = coef_plot_data))
## 
## Call:
## lm(formula = x2 ~ x1, data = coef_plot_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5538 -4.8155  0.0106  4.6486  9.5761 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.4441    20.6794   0.457    0.653
## x1            0.2101     4.1059   0.051    0.960
## 
## Residual standard error: 6.078 on 18 degrees of freedom
## Multiple R-squared:  0.0001455,  Adjusted R-squared:  -0.0554 
## F-statistic: 0.002619 on 1 and 18 DF,  p-value: 0.9598

Model With Interaction Effect

Now we add a multiplicative interaction, with still-uncorrelated predictors. Same steps as above, but note the interaction added to the simulated response in the term 2*x1*x2.

multiplicative <- data.frame('x1' = rnorm(1000, 0, 1), 'x2' = rnorm(1000, 0, 1)) |>
  within({y <- 5*x1 + 2*x2 + 2*x1*x2 + rnorm(1000, 0, 2)}) #here you can see the interaction (2*x1*x2)

slopes_plot <- split_data(multiplicative, 6, "x2")

#Now we see an interaction. As the values of x2 increase, so does the strength
#of the relationship between x1 and y.
ggint(slopes_plot)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

We can see here that for different levels of x2, the effect (or slope) of x1 on y differs.

Now, when we plot the slopes for different quantiles of x2, we should expect to see some relationship. A pattern indicates that the effect of one variable on the response varies depending on the value of the other predictor.
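
Of course, since we simulated the interaction directly, we can also just fit the full interaction model and check that the estimated coefficients land near the values we used in the simulation:

coef(lm(y ~ x1*x2, data = multiplicative)) #estimates should be close to 5 (x1), 2 (x2), and 2 (x1:x2)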

#Now this coefficient plot might be more interesting. Let's examine the relationship
#between the values of x2 and the slope between x1 and y.
slopes_plot <- split_data(multiplicative, 20, "x2")

coef_plot_data <- data.frame(
  'x1' = sapply(1:20, \(i) coef(lm(y ~ x1, data = subset(slopes_plot, break_num == i)))[[2]]),
  'x2' = 1:20)

#We see a linear relationship, like we simulated!
ggplot(coef_plot_data, aes(x1, x2)) + 
 geom_point() + 
 labs(x = "Effect size of x1 on y", y = "Quantile of x2") + 
 theme_classic()

#Here we see a slope of around 3, not quite what we simulated, but close.
summary(lm(x2 ~ x1, data = coef_plot_data))
## 
## Call:
## lm(formula = x2 ~ x1, data = coef_plot_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2501 -0.6889 -0.1723  0.5711  2.5269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.5392     0.7689  -5.903 1.37e-05 ***
## x1            2.9956     0.1434  20.894 4.52e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.21 on 18 degrees of freedom
## Multiple R-squared:  0.9604, Adjusted R-squared:  0.9582 
## F-statistic: 436.6 on 1 and 18 DF,  p-value: 4.522e-14

Probing interactions

By evaluating the slopes at different levels of the modifier variable we have "probed the interaction". By binning x2, we can see whether high or low values of x2 are more important in modifying the effect of x1 on y. A novice may limit their excitement to a significant interaction, but it is possible to further explore the pattern of that interaction by asking which levels of the modifier matter most. For example, we could say something like "At high values of the moderator x2 the effect of x1 on the response was very important, but at lower levels not so much". Hypothesis tests can be applied to assess the statistical significance of the coefficient of the modifier. For whatever reason, the term "probing" is more common in the psychology literature, and the approach is often referred to more generally as "moderator analysis". To explore the concept of probing, Lee recommends reading Rogosa (1980) and Bauer and Curran (2005). For a video explanation using point-and-click software Click here
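
As a sketch of probing in R, we can fit the interaction model to the simulated multiplicative data and ask emtrends() for the x1 slope at a few chosen values of the moderator. The probe points -1, 0, and 1 are arbitrary (roughly one SD below the mean, the mean, and one SD above, since x2 was simulated as standard normal), and probe_model is just an illustrative name:

probe_model <- lm(y ~ x1*x2, data = multiplicative)
emtrends(probe_model, ~ x2, var = "x1", at = list(x2 = c(-1, 0, 1))) #x1 slope at low, average, and high x2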

What if the interaction is not multiplicative?

Now you will notice that when we simulate the response variable y, the interaction effect is exponential: exp(x1 + x2).

exponential <- data.frame('x1' = rnorm(1000, 0, 1), 'x2' = rnorm(1000, 0, 1)) |>
  within({y <- 5*x1 + 2*x2 + exp(x1 + x2) + rnorm(1000, 0, 2)}) #Here we simulate an exponential interaction

slopes_plot <- split_data(exponential, 6, "x2")

ggint(slopes_plot)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Interesting: for the highest quantile of x2, we see what looks like a non-linear increase in y at higher values of x1. Let's look at the plot of coefficients.

slopes_plot <- split_data(exponential, 20, "x2")

coef_plot_data <- data.frame(
  'x1' = sapply(1:20, \(i) coef(lm(y ~ x1, data = subset(slopes_plot, break_num == i)))[[2]]),
  'x2' = 1:20)

#Here we see that the effect of x1 on y explodes at high values of x2.
ggplot(data = coef_plot_data) + 
 geom_point(aes(x1, x2), pch = 21, fill = "gray65", color = "black") + 
 labs(x = "Effect size of x1 on y", y = "Quantile of x2") + 
 theme_classic()

summary(lm(x2 ~ x1, data = coef_plot_data))
## 
## Call:
## lm(formula = x2 ~ x1, data = coef_plot_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.005 -3.173 -0.456  3.657  5.346 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0323     2.0750   0.497    0.625    
## x1            1.2163     0.2417   5.033 8.64e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.917 on 18 degrees of freedom
## Multiple R-squared:  0.5846, Adjusted R-squared:  0.5615 
## F-statistic: 25.33 on 1 and 18 DF,  p-value: 8.641e-05

Very cool! Much clearer. For an exponential interaction, the effect of x1 on y is modified by x2, and this effect is especially strong at higher values of x2.

Let's compare some models using AIC. Model 1 uses typical interaction coding; notice that the x1:x2 slope is close to the slope we got a couple of outputs up. Basically, this model draws a straight line through that exponential relationship.

model_linear <- lm(y ~ x1 + x2 + x1*x2, data = exponential)
summary(model_linear) 
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1 * x2, data = exponential)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.806 -2.403 -0.548  1.594 66.070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8241     0.1561   18.09   <2e-16 ***
## x1            7.8708     0.1573   50.03   <2e-16 ***
## x2            4.7530     0.1589   29.92   <2e-16 ***
## x1:x2         3.1270     0.1587   19.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.937 on 996 degrees of freedom
## Multiple R-squared:  0.7993, Adjusted R-squared:  0.7987 
## F-statistic:  1322 on 3 and 996 DF,  p-value: < 2.2e-16
#model 2 includes an exponential interaction term (we simulated exp(x1 + x2), so this is similar in spirit, though not identical)
model_exp <- lm(y ~ x1 + x2 + exp(x1*x2), data = exponential)
summary(model_exp) 
## 
## Call:
## lm(formula = y ~ x1 + x2 + exp(x1 * x2), data = exponential)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.3037  -1.8098  -0.3427   1.6274  29.8417 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.64514    0.11323   14.53   <2e-16 ***
## x1            7.44491    0.11121   66.95   <2e-16 ***
## x2            4.54927    0.11177   40.70   <2e-16 ***
## exp(x1 * x2)  0.56825    0.01336   42.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.468 on 996 degrees of freedom
## Multiple R-squared:  0.901,  Adjusted R-squared:  0.9007 
## F-statistic:  3021 on 3 and 996 DF,  p-value: < 2.2e-16
#An AIC difference of roughly 700 is pretty compelling. The exponential model fits the data much better
AIC(model_linear, model_exp)
##              df      AIC
## model_linear  5 6037.378
## model_exp     5 5330.858

What if the interactive variables are correlated?

Remember: interaction terms are not added to models because you suspect that the two predictors are correlated. They are added because you suspect that one variable modifies the other variable's effect on y. Nevertheless, to satisfy curiosity, let's generate a response variable that is predicted by correlated variables x1 and x2, and see how models with and without the interaction compare.

Generate correlated variable.

#correlated predictors
x1 <- rnorm(1000, 0, 1)
x2 <- x1 + rnorm(1000, 0, 0.2) # here we correlate x2 to x1
cor(x1, x2) #Cool! We successfully correlated x2 to x1
## [1] 0.9804142

Generate a response variable that is predicted by x1 and x2, no interaction.

y <- 5*x1 + -2*x2 + rnorm(1000, 0, 2)

We can use the VIF (variance inflation factor) to detect multicollinearity, which we expect in this case because we correlated the simulated variables. The value tells us by how much the variance of a coefficient estimate is inflated because of multicollinearity. The lowest possible value is 1, which means the variance is no larger than it would be if the predictor were uncorrelated with the other predictors (absence of multicollinearity). Higher values indicate correlated predictors. What counts as high? Some say values > 5.
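
For intuition, the VIF of a predictor is 1/(1 - R²) from regressing that predictor on the other predictors; with only two predictors this reduces to 1/(1 - r²), which you can check against the vif() output below:

1 / (1 - cor(x1, x2)^2) #should match the vif() values reported below (~25.8)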

model <- lm(y ~ x1 + x2)
vif(model)
##       x1       x2 
## 25.78121 25.78121
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3850 -1.3645  0.0228  1.2694  6.7666 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006565   0.063611  -0.103    0.918    
## x1           5.351415   0.320067  16.720  < 2e-16 ***
## x2          -2.273416   0.314393  -7.231 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.011 on 997 degrees of freedom
## Multiple R-squared:  0.7102, Adjusted R-squared:  0.7096 
## F-statistic:  1222 on 2 and 997 DF,  p-value: < 2.2e-16

Model y with an interaction

y <- 5*x1 + -2*x2 + 2*x1*x2 + rnorm(1000, 0, 2)
model_int <- lm(y ~ x1 + x2 + x1*x2)
vif(model_int) 
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##        x1        x2     x1:x2 
## 25.784239 25.787843  1.000873
summary(model_int)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1 * x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1678 -1.3044 -0.0509  1.2498  5.7364 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.007115   0.074566  -0.095    0.924    
## x1           4.819437   0.308132  15.641  < 2e-16 ***
## x2          -1.788074   0.302691  -5.907 4.77e-09 ***
## x1:x2        1.994971   0.041897  47.616  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.936 on 996 degrees of freedom
## Multiple R-squared:  0.8315, Adjusted R-squared:  0.831 
## F-statistic:  1638 on 3 and 996 DF,  p-value: < 2.2e-16

Whether the model includes an interaction term or not, multicollinearity is detected between the two variables.

Section 3: Playing with Interactions (Lee Dyer Style!)

First we will simulate data. Your job later will be to play around and modify the values of the coefficients and the equations for the simulated responses and predictors, to see how they change the interpretation.

We'll first set the coefficients. We are establishing the raw effect sizes here (alpha & beta1 through beta4). Second, we'll generate covariates and some error terms that will be used to define (or "cause") the various response variables (A, B, e1, e2, e3, int_err, int, int_add, e). You will notice that covariates A (two levels) and B (two levels) will serve to generate a data set with a traditional ANOVA design. We'll then generate response and predictor variables (temp, precip, interact, drought, abundance1, abundance2, mortal, f). The code chunk below generates all sorts of variables so that later you can modify them to test all sorts of questions.

set.seed(345)

simulation <- data.frame(
  'alpha' =  rep(1, 1000),
  'beta1' =  2, 
  'beta2' = -1,
  'beta3' = -2,
  'beta4' =  1)|>
  within({
    ## Factorial variables
    A = rep(0:1, each = 500) # '0' 500 times, '1' 500 times
    B = rep(c(0,1,0,1), each = 250) # '0'x250, '1'x250, '0'x250, '1'x250
  
    ## Error variables
    e1 = rnorm(1000, 10, sd = 2)
    e2 = rnorm(1000, 5, sd = 1)
    e3 = rnorm(1000, 1, sd = 0.5)
  
    ## Interaction variables
    int_err = A*B + e1
    int     = A*B #multiplicative interaction
    int_add = (A + 1) + (B*3) #additive interaction
    e       = rnorm(1000, 10, sd = 2) 
    
    ## Simulated Continuous PREDICTOR Variables
    temp     = A + e1 #This makes a continuous variable where each level of A will have a different overall mean
    precip   = B + e2 #This makes a continuous variable where each level of B, will have a different overall mean
    interact = temp*precip #simulating an interaction variable
    drought  = temp + precip + e3 #This is also an additive interaction if added to one of the response vars
    
    ## Simulated RESPONSE Variables
    #`abundance1` models abundance on the two categorical predictors and their interaction
    #`abundance2` models abundance two observed continuous predictors `precip` and `temp` and their interaction
    abundance1   = alpha + beta1*A + beta2*B + beta3*A*B + e 
    abundance2   = alpha + beta1*temp + beta2*precip + beta4*temp*precip + e 
    abundance1.z = (abundance1 - mean(abundance1)) / sd(abundance1) #standardized from`abundance1` above
    abundance2.z = (abundance2 - mean(abundance2)) / sd(abundance2) #standardized from`abundance2` above
    mortal       = beta2*abundance2 + e # mortality is modeled on `abundance2`
    mortal.z     = (mortal - mean(mortal)) / sd(mortal) #standardized response variable of `mortal`
    
    #this is intended to represent a correlated predictor variable. It's correlated with `abundance1`
    f <- ifelse(A == 0, abundance1 + 2 , abundance1 - 2) 
  })

Exploring Two Categorical Predictors

Let’s now apply our simulated data. We will start by modeling the unstandardized response variable. Can we estimate our original linear model parameters? We should be able to! No matter how much you mess with the beta coefficients we established, when the unstandardized response is used, the linear model with the interaction should yield coefficients that match them. Recall, we set the coefficients above to be:

beta1 = 2, beta2 = -1, beta3 = -2

model1 <- lm(abundance1 ~ A + B + A*B , data = simulation) #modeling with the unstandardized variable `abundance1`
summary(model1)
## 
## Call:
## lm(formula = abundance1 ~ A + B + A * B, data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6068 -1.3093  0.0053  1.2336  6.3523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.1887     0.1218  91.845  < 2e-16 ***
## A             1.7287     0.1723  10.034  < 2e-16 ***
## B            -1.1881     0.1723  -6.896 9.46e-12 ***
## A:B          -1.6173     0.2436  -6.638 5.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.926 on 996 degrees of freedom
## Multiple R-squared:  0.2707, Adjusted R-squared:  0.2685 
## F-statistic: 123.3 on 3 and 996 DF,  p-value: < 2.2e-16
model2 <- lm(abundance1.z ~ A + B + A*B , data = simulation) #modeling with the standardized variable `abundance1.z`
summary(model2)
## 
## Call:
## lm(formula = abundance1.z ~ A + B + A * B, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48954 -0.58136  0.00235  0.54774  2.82058 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.05951    0.05409   1.100    0.271    
## A            0.76758    0.07650  10.034  < 2e-16 ***
## B           -0.52755    0.07650  -6.896 9.46e-12 ***
## A:B         -0.71811    0.10818  -6.638 5.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8553 on 996 degrees of freedom
## Multiple R-squared:  0.2707, Adjusted R-squared:  0.2685 
## F-statistic: 123.3 on 3 and 996 DF,  p-value: < 2.2e-16

Yep, as expected:

Key takeaways here:

Explore Two Continuous Predictors

model <- lm(abundance2 ~ precip + precip*temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip + precip * temp, data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5528 -1.2886  0.0206  1.2534  6.4130 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.10114    1.61194   8.128 1.29e-15 ***
## precip      -1.35980    0.28865  -4.711 2.82e-06 ***
## temp         1.80881    0.15157  11.934  < 2e-16 ***
## precip:temp  1.03355    0.02709  38.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.927 on 996 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9901 
## F-statistic: 3.327e+04 on 3 and 996 DF,  p-value: < 2.2e-16
ggPredict(model, se = TRUE, interactive = TRUE) 

Check for Understanding

  2. Try modifying the covariates and rerunning the code.
  2a) Do the fixed effects stay constant or change with the inclusion of new terms? Consider why or why not.
  2b) Now try using the standardized response variable abundance2.z. Do the coefficients remain the same?
  3. For insight into how we detect (or fail to detect) interactions, you will now modify abundance2, develop models, and compare outputs for model fit and coefficients. Substitute the abundance2 variable in the code above with these options (non-multiplicative interactions and no main effects) and rerun the code:

For example, let’s say I reran abun = alpha + beta1*temp + beta2*precip + beta3*log(temp + 1)*precip + e. I would then compare summary(model1) and summary(model2) for a model that:

  4. Consider: why do these yield the same outputs, despite different formatting?
model1 <- aov(abundance1.z ~ A + B + int, data = simulation) # Fit an ANOVA with interaction (int)
summary(model1)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## A             1   41.7   41.72   57.04 9.64e-14 ***
## B             1  196.5  196.52  268.66  < 2e-16 ***
## int           1   32.2   32.23   44.06 5.22e-11 ***
## Residuals   996  728.5    0.73                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2 <-aov(abundance1.z ~ A*B, data = simulation) #Fit an ANOVA with interaction (A*B)
summary(model2)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## A             1   41.7   41.72   57.04 9.64e-14 ***
## B             1  196.5  196.52  268.66  < 2e-16 ***
## A:B           1   32.2   32.23   44.06 5.22e-11 ***
## Residuals   996  728.5    0.73                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  5. a) Why do the effect sizes for the main effects remain the same between models? b) Why does f have a minimal effect on standardized abundance? c) Why is the A:B interaction important, but not A:f, in model2? d) Does including an interaction with a correlated variable reduce its effect on the response?
model1 <- aov(abundance1.z ~ A + B + f, data = simulation)#Fit an ANOVA with 3 variables, no interaction
summary(model1)
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## A             1   41.7    41.7 3.183e+30 <2e-16 ***
## B             1  196.5   196.5 1.499e+31 <2e-16 ***
## f             1  760.8   760.8 5.804e+31 <2e-16 ***
## Residuals   996    0.0     0.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2 <- aov(abundance1.z ~ A + B + f + A*B + A*f, data = simulation)#Fit an ANOVA with 3 variables, and some interactions
summary(model2)
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## A             1   41.7    41.7 3.184e+30 <2e-16 ***
## B             1  196.5   196.5 1.500e+31 <2e-16 ***
## f             1  760.8   760.8 5.806e+31 <2e-16 ***
## A:B           1    0.0     0.0 2.321e+00  0.128    
## A:f           1    0.0     0.0 1.220e-01  0.727    
## Residuals   994    0.0     0.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model1 <- aov(abundance1.z ~ A + B + f + A*f, data = simulation)#Fit an ANOVA with 3 variables, and some interactions
summary(model1)
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## A             1   41.7    41.7 3.180e+30 <2e-16 ***
## B             1  196.5   196.5 1.498e+31 <2e-16 ***
## f             1  760.8   760.8 5.799e+31 <2e-16 ***
## A:f           1    0.0     0.0 1.580e-01  0.691    
## Residuals   995    0.0     0.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  6. How do you interpret the effect sizes for the models using standardized and unstandardized responses?
model1 <- lm(abundance2.z ~ temp + precip + int, data = simulation)
summary(model1)
## 
## Call:
## lm(formula = abundance2.z ~ temp + precip + int, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68610 -0.09342 -0.00247  0.09089  0.66298 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.732234   0.035585  -189.2   <2e-16 ***
## temp         0.385957   0.002360   163.6   <2e-16 ***
## precip       0.485373   0.004687   103.5   <2e-16 ***
## int          0.022693   0.011942     1.9   0.0577 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1559 on 996 degrees of freedom
## Multiple R-squared:  0.9758, Adjusted R-squared:  0.9757 
## F-statistic: 1.337e+04 on 3 and 996 DF,  p-value: < 2.2e-16
model2 <- lm(abundance2 ~ temp + precip + int, data = simulation)
summary(model2)
## 
## Call:
## lm(formula = abundance2 ~ temp + precip + int, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2836  -1.8087  -0.0479   1.7597  12.8361 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -45.88649    0.68896   -66.6   <2e-16 ***
## temp          7.47255    0.04569   163.6   <2e-16 ***
## precip        9.39735    0.09075   103.5   <2e-16 ***
## int           0.43935    0.23121     1.9   0.0577 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.018 on 996 degrees of freedom
## Multiple R-squared:  0.9758, Adjusted R-squared:  0.9757 
## F-statistic: 1.337e+04 on 3 and 996 DF,  p-value: < 2.2e-16

Answers

  1. Fit the following model, which includes a categorical variable with 3 levels (cyl), and interpret the effect sizes: lm(mpg ~ disp*cyl).
model.disp_cyl <- lm(mpg ~ disp*as.factor(cyl), data = autompg)
coef(model.disp_cyl)
##          (Intercept)                 disp      as.factor(cyl)6      as.factor(cyl)8 disp:as.factor(cyl)6 
##          43.59051971          -0.13069347         -13.20026081         -20.85706424           0.08298924 
## disp:as.factor(cyl)8 
##           0.10817135
cs <- coef(model.disp_cyl)
ggplot(autompg, aes(disp, mpg, color = as.factor(cyl)))+
  geom_point()+
  geom_abline(intercept = cs[1], slope = cs[2], color = "#F8766D") + 
  geom_abline(intercept = cs[1] + cs[3], slope = cs[2] + cs[5], color = "#00BA38") + 
  geom_abline(intercept = cs[1] + cs[4], slope = cs[2] + cs[6], color = "#619CFF")

Solve using the formula:
Effect of displacement on mpg in 4-cylinder cars:

Interpretation: A 4-cylinder car with zero displacement has an average of 43.59 mpg. In a 4-cylinder car, the change in mpg for a 1-unit increase in engine displacement is -0.13.

Effect of displacement on mpg in 6-cylinder cars:

Interpretation: A 6-cylinder car with zero displacement has an average of 30.39 mpg. In a 6-cylinder car, the change in mpg for a 1-unit increase in engine displacement is -0.05.

Effect of displacement on mpg in 8-cylinder cars:

Interpretation: An 8-cylinder car with zero displacement has an average of 22.74 mpg. In an 8-cylinder car, the change in mpg for a 1-unit increase in engine displacement is -0.02.
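
These per-cylinder intercepts and slopes can also be pulled straight from the coefficient vector as a quick check on the hand calculations above (cs was defined in the answer code):

#intercepts for 4-, 6-, and 8-cylinder cars
c(cs[[1]], cs[[1]] + cs[[3]], cs[[1]] + cs[[4]])
#disp slopes for 4-, 6-, and 8-cylinder cars
c(cs[[2]], cs[[2]] + cs[[5]], cs[[2]] + cs[[6]])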

  2. Try modifying the covariates and rerunning the code.
  2a) Do the fixed effects stay constant or change with the inclusion of new terms? Consider why or why not.

Below I explore three examples to draw attention to some important points.

Example 1: Excluding an interaction term

# original model
model <- lm(abundance2 ~ precip + temp + precip*temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip + temp + precip * temp, data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5528 -1.2886  0.0206  1.2534  6.4130 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.10114    1.61194   8.128 1.29e-15 ***
## precip      -1.35980    0.28865  -4.711 2.82e-06 ***
## temp         1.80881    0.15157  11.934  < 2e-16 ***
## precip:temp  1.03355    0.02709  38.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.927 on 996 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9901 
## F-statistic: 3.327e+04 on 3 and 996 DF,  p-value: < 2.2e-16
# modified model (interaction removed)
model <- lm(abundance2 ~ precip + temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip + temp, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3096  -1.8105   0.0029   1.8156  12.9973 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -46.16159    0.67446  -68.44   <2e-16 ***
## precip        9.44342    0.08757  107.84   <2e-16 ***
## temp          7.48504    0.04527  165.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.022 on 997 degrees of freedom
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9756 
## F-statistic: 2e+04 on 2 and 997 DF,  p-value: < 2.2e-16

Let's take lm(abundance2 ~ precip + temp + precip*temp, data = simulation) to be the truth, since we simulated abundance2 this way. Let's see how deviations from the "truth" affect how we would understand the relationships:

Observation: In the model without the interaction term, the coefficient estimate for precip changed in magnitude and direction.

Discussion: Main effects will change in value and sometimes direction when an interaction term is removed. Interaction terms change how we interpret main effects. Interaction terms can be thought of as modifier terms, such that when they are included in the model, they modify how we interpret the main effect. The interpretation becomes a CONDITIONAL statement; main effects become CONDITIONAL effects: the effect of a predictor variable becomes conditional on the value of the other predictor variable (e.g. the effect of x1 on y depends on x2). Let's first focus on directionality. In the model with the interaction term, the precip coefficient (-1.35980) makes sense, given that when temp = 0 the effect of precip on abundance2 was negative (refer to the dark blue line in the ggPredict plot generated earlier). In the model without the interaction, we no longer need to interpret precip as conditional on temp. Now, taken on its own, precip has a positive relationship with abundance2 (9.44342). That makes sense too! Consider the overall trend between precip and abundance2 in the graph below:

model <- lm(abundance2 ~ precip + temp, data = simulation)
ggPredict(model, se = TRUE, interactive = TRUE) 

Observation: In the model without the interaction, the standard errors decreased (certainty around the estimate changed).

Discussion: Let's consider precip. In the model with the interaction, the standard error and p-value were (se = 0.28865, p-value = 2.82e-06), which represents a decrease in certainty in the parameter estimate for precip compared to the model without the interaction term (se = 0.08757, p-value < 2e-16). This is a result of the correlation between precip and the interaction term precip*temp; there is multicollinearity. The more correlated the predictors are, the more difficult it is to estimate how much variation in the response each predictor is responsible for. If two predictors are highly correlated, it is difficult to ascribe that variation to one over the other. As a result, standard errors increase for variables that share a model with variables they are correlated with. We also see this reflected in the p-value for precip: while still "significant" in both cases, the significance decreased in the model with the interaction term. In summary, multicollinearity reduces the power of the significance test.
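
We can check that claimed correlation directly; precip and the product term it appears in should show a clear positive correlation:

with(simulation, cor(precip, precip * temp))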

Observation: There are fewer degrees of freedom (DF) in the model without the interaction term.

Discussion: There are 3 degrees of freedom in the model with the interaction term and 2 degrees of freedom in the model without it. Degrees of freedom here refer specifically to the number of predictor terms estimated by the model: df = 3 refers to the precip, temp, and precip*temp terms, and df = 2 refers to the precip and temp terms. It should therefore also make sense why the F-statistic changed between the two models, since the DF enter the calculation of the F statistic.

Observation: In the model without the interaction term, the y-intercept was (-46.16159) compared to (13.10114).

Discussion: The y-intercept is the predicted response when the predictors are zero; in this case we are looking for the point on the y-axis where precip and temp = 0. An abundance of -46.16 is not biologically possible, so the y-intercept is not meaningful here; it is simply the predicted value when precip and temp = 0. Refer to the dark blue line (the line that represents temp = 0) and where it intersects the y-axis when precip is also zero.
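
One way to see that the intercept is just the model's prediction when both predictors equal zero (a quick sketch; model_no_int is simply an illustrative name for the no-interaction model refit here):

model_no_int <- lm(abundance2 ~ precip + temp, data = simulation)
predict(model_no_int, newdata = data.frame(precip = 0, temp = 0)) #returns the intercept, about -46.16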

Example 2: The inclusion of an uncorrelated fixed effect

# original model
model <- lm(abundance2 ~ precip, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip, data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.938 -10.859   0.037  11.157  56.209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.1847     2.6195   11.52   <2e-16 ***
## precip        9.8516     0.4664   21.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.1 on 998 degrees of freedom
## Multiple R-squared:  0.3089, Adjusted R-squared:  0.3082 
## F-statistic: 446.1 on 1 and 998 DF,  p-value: < 2.2e-16
# modified model (uncorrelated fixed effect is included)
model <- lm(abundance2 ~ precip + temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip + temp, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3096  -1.8105   0.0029   1.8156  12.9973 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -46.16159    0.67446  -68.44   <2e-16 ***
## precip        9.44342    0.08757  107.84   <2e-16 ***
## temp          7.48504    0.04527  165.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.022 on 997 degrees of freedom
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9756 
## F-statistic: 2e+04 on 2 and 997 DF,  p-value: < 2.2e-16

Observation: The inclusion of an UNCORRELATED predictor variable (e.g. temp) increased the certainty around the parameter estimate for precip.

Discussion: We know from our specification of abundance2 that temp explains a lot of the variation in abundance2. We also know temp is NOT correlated with precip because they were simulated independently. Thus, we can see that the inclusion of another fixed effect, temp, reduced the standard error of the precip coefficient; our estimate of it became more certain as a lot of the previously unaccounted-for variation became accounted for by temp. Had these variables been correlated (multicollinear), the magnitude of the effect could have been very different, and in some cases the sign can flip from + to - and vice versa. Secondly, holding all else constant, had abundance2 been simulated with another variable instead of temp, including temp in the model would not have accounted for residual variation in abundance2 and therefore would not have changed the coefficient for precip in the same way.
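
That temp and precip are effectively uncorrelated in the simulation is easy to verify:

with(simulation, cor(temp, precip)) #should be close to zero, since temp and precip were built from independent pieces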

Observation: The range of the residuals is much greater in the model with precip as the only predictor and smaller in the model with both temp and precip.

Discussion: We would expect more residual variation in the precip-only model than in the model that includes temp, because more of the variation in abundance2 is accounted for when we add the other variable that we designed abundance2 to depend on.

Observation: DF changed

Discussion: same explanation as in previous example

Example 3: include an interaction but exclude a main effect

#original model
model <- lm(abundance2 ~ precip + temp + precip:temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip + temp + precip:temp, data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5528 -1.2886  0.0206  1.2534  6.4130 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.10114    1.61194   8.128 1.29e-15 ***
## precip      -1.35980    0.28865  -4.711 2.82e-06 ***
## temp         1.80881    0.15157  11.934  < 2e-16 ***
## precip:temp  1.03355    0.02709  38.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.927 on 996 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9901 
## F-statistic: 3.327e+04 on 3 and 996 DF,  p-value: < 2.2e-16
# modified model (main effect in the interaction removed)
model <- lm(abundance2 ~ precip + precip:temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip + precip:temp, data = simulation)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.582 -1.374 -0.001  1.403  7.208 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.970242   0.335108   95.40   <2e-16 ***
## precip      -4.673908   0.084109  -55.57   <2e-16 ***
## precip:temp  1.350961   0.005515  244.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.06 on 997 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9887 
## F-statistic: 4.364e+04 on 2 and 997 DF,  p-value: < 2.2e-16

Observation: The intercept increased from 13.10114 in the model with the temp main effect to 31.970242 in the model without it.

Discussion: That is because 31.970242 marks the intersection with the y-axis when precip = 0. Since that model does not include temp as a main effect, the interpretation of the intercept is only with respect to precip. In the other model, the y-intercept represents the point where both precip and temp = 0. Compare the intercepts. You will also notice that dropping the temp main effect forces the intercept to be the same for all levels of temp: all the lines will cross the abundance2 axis at 31.970242. If there is reason to believe that the average abundance2 at precip = 0 should be the same for all levels of temp, then it would make sense to model it this way. For example, when the effect of a treatment variable is tested at multiple dosages against a control group, it may be reasonable to assume that the baseline response is properly represented by the average response of all samples, regardless of treatment, at dosage = 0.

model <- lm(abundance2 ~ precip + precip:temp, data = simulation)
ggPredict(model)

Observation: Standard errors decreased in the model without temp.

Discussion: temp is correlated with the interaction term precip*temp, so by removing the main-effect term we become more confident about how the variation in abundance2 is attributed to the remaining terms.

Observation: DF changed

Discussion: same explanation as in previous examples

2b) Now try using the standardized response variable abundance2.z. Do the coefficients remain the same?

Below I repeat what I did above but replace abundance2 with abundance2.z. I won't go into detail with each example because the patterns are the same. The coefficients are now understood in units of standard deviations from the mean, since abundance2.z is abundance2 standardized. So we would expect the coefficients, standard errors, ranges of the residuals, etc. to be much smaller in magnitude, but the relative contribution of each term to explaining variance in abundance2 remains the same.

Example 1: Excluding an interaction term

model <- lm(abundance2.z ~ precip + temp + precip*temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2.z ~ precip + temp + precip * temp, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28680 -0.06655  0.00106  0.06474  0.33123 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.685526   0.083257 -44.267  < 2e-16 ***
## precip      -0.070234   0.014909  -4.711 2.82e-06 ***
## temp         0.093425   0.007829  11.934  < 2e-16 ***
## precip:temp  0.053383   0.001399  38.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09955 on 996 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9901 
## F-statistic: 3.327e+04 on 3 and 996 DF,  p-value: < 2.2e-16
# modified model (interaction removed)
model <- lm(abundance2.z ~ precip + temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2.z ~ precip + temp, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68744 -0.09351  0.00015  0.09377  0.67131 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.746442   0.034836  -193.7   <2e-16 ***
## precip       0.487752   0.004523   107.8   <2e-16 ***
## temp         0.386602   0.002338   165.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1561 on 997 degrees of freedom
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9756 
## F-statistic: 2e+04 on 2 and 997 DF,  p-value: < 2.2e-16

Example 2: Including an uncorrelated main effect

model <- lm(abundance2.z ~ precip, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2.z ~ precip, data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7859 -0.5609  0.0019  0.5762  2.9032 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.80316    0.13530  -20.72   <2e-16 ***
## precip       0.50884    0.02409   21.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8317 on 998 degrees of freedom
## Multiple R-squared:  0.3089, Adjusted R-squared:  0.3082 
## F-statistic: 446.1 on 1 and 998 DF,  p-value: < 2.2e-16
# modified model (uncorrelated fixed effect is included)
model <- lm(abundance2.z ~ precip + temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2.z ~ precip + temp, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68744 -0.09351  0.00015  0.09377  0.67131 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.746442   0.034836  -193.7   <2e-16 ***
## precip       0.487752   0.004523   107.8   <2e-16 ***
## temp         0.386602   0.002338   165.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1561 on 997 degrees of freedom
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9756 
## F-statistic: 2e+04 on 2 and 997 DF,  p-value: < 2.2e-16

Example 3: Excluding a variable as a main effect but keeping it in the interaction term

#original model
model <- lm(abundance2.z ~ precip + temp + precip:temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2.z ~ precip + temp + precip:temp, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28680 -0.06655  0.00106  0.06474  0.33123 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.685526   0.083257 -44.267  < 2e-16 ***
## precip      -0.070234   0.014909  -4.711 2.82e-06 ***
## temp         0.093425   0.007829  11.934  < 2e-16 ***
## precip:temp  0.053383   0.001399  38.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09955 on 996 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9901 
## F-statistic: 3.327e+04 on 3 and 996 DF,  p-value: < 2.2e-16
# modified model (main effect in the interaction removed)
model <- lm(abundance2.z ~ precip + precip:temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2.z ~ precip + precip:temp, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28832 -0.07099 -0.00005  0.07246  0.37230 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.7109373  0.0173083 -156.63   <2e-16 ***
## precip      -0.2414071  0.0043442  -55.57   <2e-16 ***
## precip:temp  0.0697771  0.0002848  244.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1064 on 997 degrees of freedom
## Multiple R-squared:  0.9887, Adjusted R-squared:  0.9887 
## F-statistic: 4.364e+04 on 2 and 997 DF,  p-value: < 2.2e-16
  1. For insight into how we detect (or fail to detect) interactions, you will now modify abundance2, develop models, and compare the outputs for model fit and the coefficients. Substitute the abundance2 variable in the code above with these options (non-multiplicative interactions and no main effects) and rerun the code:

Example 1: A model with non-linear interaction term

# modify how we model `abundance2`. Recall that `abundance2` was originally simulated as: `abundance2 = alpha + beta1*temp + beta2*precip + beta4*temp*precip + e`

simulation <- transform(simulation, abundance2.new = alpha + beta1*temp + beta2*precip + beta3*log(temp + 1)*precip + e)

# Let's first run the model that we know properly reflects how `abundance2` was simulated
model <- lm(abundance2.new ~ precip + temp + precip:log(temp + 1), data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2.new ~ precip + temp + precip:log(temp + 
##     1), data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5863 -1.2880  0.0078  1.2405  6.3998 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           12.9179     1.3458   9.598  < 2e-16 ***
## precip                -1.8215     0.5978  -3.047  0.00237 ** 
## temp                   1.8267     0.1253  14.582  < 2e-16 ***
## precip:log(temp + 1)  -1.6646     0.2457  -6.775 2.13e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.927 on 996 degrees of freedom
## Multiple R-squared:  0.9236, Adjusted R-squared:  0.9234 
## F-statistic:  4014 on 3 and 996 DF,  p-value: < 2.2e-16
# Now let's run a model with all relevant terms, but leave `temp` untransformed in the interaction term
model <- lm(abundance2 ~ precip + temp + precip:temp, data = simulation)
summary(model)
## 
## Call:
## lm(formula = abundance2 ~ precip + temp + precip:temp, data = simulation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5528 -1.2886  0.0206  1.2534  6.4130 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.10114    1.61194   8.128 1.29e-15 ***
## precip      -1.35980    0.28865  -4.711 2.82e-06 ***
## temp         1.80881    0.15157  11.934  < 2e-16 ***
## precip:temp  1.03355    0.02709  38.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.927 on 996 degrees of freedom
## Multiple R-squared:  0.9901, Adjusted R-squared:  0.9901 
## F-statistic: 3.327e+04 on 3 and 996 DF,  p-value: < 2.2e-16

Observation: Despite the response being simulated with linear main effects and a non-linear (log-transformed) interactive effect, the interaction term comes out strongly significant in both models, whether it is entered as log-transformed or untransformed.

Discussion: This illustrates that we may include an interaction term, and find it significant, yet still fail to characterize its functional form correctly (a comparison sketch follows below).
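One way to probe which characterization fits better is to fit both forms of the interaction to the same response and compare information criteria. A minimal sketch, assuming abundance2.new exists as simulated above (m_log and m_lin are just illustrative names):

m_log <- lm(abundance2.new ~ precip + temp + precip:log(temp + 1), data = simulation)
m_lin <- lm(abundance2.new ~ precip + temp + precip:temp, data = simulation)
AIC(m_log, m_lin) # the lower AIC should point to the log form if that is how the data were generated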

  1. Consider why these yield the same outputs, despite being written differently.
model <- aov(abundance1.z ~ A + B + int, data = simulation) # Fit an ANOVA with interaction (int)
summary(model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## A             1   41.7   41.72   57.04 9.64e-14 ***
## B             1  196.5  196.52  268.66  < 2e-16 ***
## int           1   32.2   32.23   44.06 5.22e-11 ***
## Residuals   996  728.5    0.73                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model <-aov(abundance1.z ~ A*B, data = simulation) #Fit an ANOVA with interaction (A*B)
summary(model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## A             1   41.7   41.72   57.04 9.64e-14 ***
## B             1  196.5  196.52  268.66  < 2e-16 ***
## A:B           1   32.2   32.23   44.06 5.22e-11 ***
## Residuals   996  728.5    0.73                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Recall that int is the multiplicative interaction between A and B - in other words, int = A*B. Also recall that A*B is shorthand for a model containing both main effects plus their interaction, so aov(abundance1.z ~ A*B, data = simulation) is equivalent to aov(abundance1.z ~ A + B + A:B, data = simulation). A quick check of that shorthand is sketched below.
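The two formulas build identical design matrices (a sketch, assuming simulation is loaded):

all.equal(model.matrix(~ A + B + A:B, data = simulation),
          model.matrix(~ A * B, data = simulation)) # TRUE - same columns, in the same order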

    1. a) Why do the effect sizes remain the same between models? b) Why does f have a minimal effect on standardized abundance? c) Why is the A:B interaction important, but not the A:f in model2? d) Does including an interaction with a correlated variable reduce its effect on the response?
model <- aov(abundance1.z ~ A + B + A*B, data = simulation)
summary(model)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## A             1   41.7   41.72   57.04 9.64e-14 ***
## B             1  196.5  196.52  268.66  < 2e-16 ***
## A:B           1   32.2   32.23   44.06 5.22e-11 ***
## Residuals   996  728.5    0.73                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model <- aov(abundance1.z ~ A + B + f, data = simulation)
summary(model)
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## A             1   41.7    41.7 3.183e+30 <2e-16 ***
## B             1  196.5   196.5 1.499e+31 <2e-16 ***
## f             1  760.8   760.8 5.804e+31 <2e-16 ***
## Residuals   996    0.0     0.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model <- aov(abundance1.z ~ A + B + A*B + A*f, data = simulation)
summary(model)
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## A             1   41.7    41.7 3.184e+30 <2e-16 ***
## B             1  196.5   196.5 1.500e+31 <2e-16 ***
## f             1  760.8   760.8 5.806e+31 <2e-16 ***
## A:B           1    0.0     0.0 2.321e+00  0.128    
## A:f           1    0.0     0.0 1.220e-01  0.727    
## Residuals   994    0.0     0.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model <- aov(abundance1.z ~ A + B + A*f, data = simulation)
summary(model)
##              Df Sum Sq Mean Sq   F value Pr(>F)    
## A             1   41.7    41.7 3.180e+30 <2e-16 ***
## B             1  196.5   196.5 1.498e+31 <2e-16 ***
## f             1  760.8   760.8 5.799e+31 <2e-16 ***
## A:f           1    0.0     0.0 1.580e-01  0.691    
## Residuals   995    0.0     0.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because of the specifications for f and abundance1.z, all the variation in abundance1.z is accounted for by A and f, so the model fit is essentially perfect (R^2 = 1). Try summary(lm(abundance1.z ~ A + f, data = simulation)) and summary(lm(abundance1.z ~ A + f + A*B, data = simulation)); both are sketched below.
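The two suggested checks (both should report a Multiple R-squared of essentially 1, since A and f jointly determine abundance1.z):

summary(lm(abundance1.z ~ A + f, data = simulation))
summary(lm(abundance1.z ~ A + f + A*B, data = simulation))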

  1. How do you interpret the effect sizes for the models using standardized and unstandardized responses?
model1 <- lm(abundance2.z ~ temp + precip + int, data = simulation)
summary(model1)
## 
## Call:
## lm(formula = abundance2.z ~ temp + precip + int, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68610 -0.09342 -0.00247  0.09089  0.66298 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.732234   0.035585  -189.2   <2e-16 ***
## temp         0.385957   0.002360   163.6   <2e-16 ***
## precip       0.485373   0.004687   103.5   <2e-16 ***
## int          0.022693   0.011942     1.9   0.0577 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1559 on 996 degrees of freedom
## Multiple R-squared:  0.9758, Adjusted R-squared:  0.9757 
## F-statistic: 1.337e+04 on 3 and 996 DF,  p-value: < 2.2e-16
model2 <- lm(abundance2 ~ temp + precip + int, data = simulation)
summary(model2)
## 
## Call:
## lm(formula = abundance2 ~ temp + precip + int, data = simulation)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2836  -1.8087  -0.0479   1.7597  12.8361 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -45.88649    0.68896   -66.6   <2e-16 ***
## temp          7.47255    0.04569   163.6   <2e-16 ***
## precip        9.39735    0.09075   103.5   <2e-16 ***
## int           0.43935    0.23121     1.9   0.0577 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.018 on 996 degrees of freedom
## Multiple R-squared:  0.9758, Adjusted R-squared:  0.9757 
## F-statistic: 1.337e+04 on 3 and 996 DF,  p-value: < 2.2e-16

Notice that the significance of the coefficients (t values and p-values) does not change between the models with the standardized and unstandardized response variable; only the estimates and standard errors are rescaled. A sketch of that relationship follows.
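Because only the response was rescaled, the t values match exactly and each slope in the standardized model equals the corresponding unstandardized slope divided by the standard deviation of abundance2 (a sketch, assuming model1, model2 and simulation are still in memory):

summary(model1)$coefficients[, "t value"] # standardized response
summary(model2)$coefficients[, "t value"] # unstandardized response - identical t values
coef(model2)[-1] / sd(simulation$abundance2) # should match coef(model1)[-1]; the intercepts differ because of the mean shift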

Interpretation of Outputs:

lm(abundance2.z ~ temp + precip + int, data = simulation)

lm(abundance2 ~ temp + precip + int, data = simulation)