Resources

Throughout, I recommend links to online resources that review the concepts and under-the-hood math in more detail. Below I highlight some texts and online resources that may be helpful to you.

Recommended Texts:

  1. Ecological Statistics by Fox et al. 2015
  2. Biostatistics: The Bare Essentials by Norman and Streiner (focuses on concepts; no R code)
  3. Discovering Statistics Using R by Field, Miles, and Field (covers stats concepts and R code)

Recommended Online Resources:

YouTube Channels & Videos:

Online Courses & Tutorials:

A reference guide that may help you interpret R outputs:

Introduction

The focus of this lab includes the following statistical techniques: t-test, ANOVA, and basic linear regression. The code focuses on testing the assumptions of these techniques and interpreting their outputs. In addition, we review transformations and standardization. Ideally, you leave this lab understanding how the methods are related, especially with regard to how they integrate the concept of variance.

In addition to understanding the under-the-hood math behind these techniques, it is very important to remind ourselves of the bigger idea to which the techniques belong. This will help develop your understanding of these methods. These techniques fall under the larger umbrella of hypothesis testing (review here) and maximum likelihood (review here & review here), which in turn fall under the larger umbrella of frequentist inference, one approach to statistical inference. Frequentist inference may be compared to an alternative approach: Bayesian inference. Bayesian inference will not be covered until later in the course, but if you are curious how it compares to frequentist stats you can check out this video.

Table 1. Comparison of Frequentist and Bayesian approaches to statistical inference. Taken from Shrepnek 2007

Let’s Begin

There are “checking for understanding” questions throughout. These do not need to be turned in, but if you are struggling to answer, please ask for help. Answers to these questions are at the end of the document.

Set working directory.

setwd("...your_working_directory/")

Import data.

example_data <- read.csv("intro_LM_data.csv")

Examine data.

colnames(example_data) #Returns column names
## [1] "elevation" "genus"     "area"      "herbivory" "diversity" "plotID"
head(example_data)     #Returns first 6 entries for each column
##   elevation genus     area herbivory diversity plotID
## 1      high Piper 400.0000        76         4      1
## 2      high Piper 133.3000         5         2      2
## 3      high Piper 171.5278         3         2      3
## 4      high Piper  38.0000         0         0      4
## 5      high Piper 825.0000         1         2      5
## 6      high Piper 900.0000         5         3      6

Summarize data.

summary(example_data)
##   elevation            genus                area          herbivory       diversity         plotID     
##  Length:179         Length:179         Min.   :  12.5   Min.   : 0.00   Min.   :0.000   Min.   : 1.00  
##  Class :character   Class :character   1st Qu.: 105.5   1st Qu.: 1.00   1st Qu.:1.000   1st Qu.: 8.00  
##  Mode  :character   Mode  :character   Median : 192.0   Median : 5.00   Median :2.000   Median :15.00  
##                                        Mean   : 247.4   Mean   :12.16   Mean   :2.095   Mean   :15.42  
##                                        3rd Qu.: 355.6   3rd Qu.:15.00   3rd Qu.:3.000   3rd Qu.:23.00  
##                                        Max.   :1093.8   Max.   :87.00   Max.   :6.000   Max.   :31.00

Check for Understanding:

1) Why are some variables summarized with means and quantiles and others not?

2) Which column(s) in these data are categorical with two levels?

Explore a single variable: Herbivory

summary(example_data$herbivory)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00   12.16   15.00   87.00
boxplot(example_data$herbivory)

boxplot(example_data$herbivory)$stats #returns the five boxplot statistics (whisker ends, hinges, median); see also ?quantile
##      [,1]
## [1,]    0
## [2,]    1
## [3,]    5
## [4,]   15
## [5,]   35

Verify you understand the various descriptive statistics and how to interpret R outputs. Click here for review on descriptive statistics

Check for Understanding

3) Why are outputs from summary(example_data$herbivory) and boxplot(example_data$herbivory)$stats different? You should be able to discern why the max values differ (i.e. 87 and 35, respectively) and why summary() returns 6 values while $stats returns 5 values.

4) Verify that you are able to match the following terms and values with their location on boxplot(example_data$herbivory):

5) You should understand what ‘quartiles’ represent. Considering the quartiles for example_data$herbivory, how do you interpret Q1 = 1, Q2 = 5, and Q3 = 15?

T-test

This section is taken from: https://rpubs.com/cwoods/t_test

A t-test is a parametric test that compares two means and tells you if they are significantly different from each other. A t-test has two assumptions that have to be met:

  1. The residuals have to be normally distributed for each group;

  2. There is equal variance among groups (homogeneity of variance).

The residuals are the leftover variation that is not accounted for by your explanatory variable. Residuals plots are useful for checking these two assumptions. Before we can interpret our statistical output, we should make sure the assumptions above are met.

Background of Theory

A t-test depends on the t distribution. The t statistic measures the difference between two group means relative to the variation within the groups. If the difference between your two groups is large compared to the variation within them (the means are far apart and the variance around each mean is small), then the groups will be significantly different from each other.
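As a point of reference (this formula is not given in the original text), the two-sample t statistic used by Welch's test, which is R's default and the version we run below, is

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

where the numerator is the difference between the group means and the denominator summarizes the variation within the groups.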

Here are some statistical terms you should understand to interpret your t-test.

Most of the time you are only testing whether the values differ from each other, regardless of direction, so a two-tailed t-test is used; in that case the rejection area under each tail of the curve is 2.5% (0.025; Figure 1). If you are interested in whether one mean is only larger or only smaller than another, then you use a one-tailed t-test. If the t statistic generated by your test is larger than the critical t value, the area under the curve beyond it is smaller than 5%, therefore the p-value is < 0.05 and the means are significantly different: you have explained more variance than expected by random chance. If you declare a significant effect when there really is none, you will be wrong less than 5% of the time.

Figure 1. Example of a two-tailed normal distribution.

In the figure below, the t statistic can be conceptualized as the ratio of the difference in means (blue arrow) over the variance you can’t explain (red braces). If you divide the blue arrows in each figure by the red braces, the t statistic is much larger in the left figure than the right figure. That means you are explaining more variation than you aren’t explaining in the left figure (large difference between groups and small variation within groups) and the data in the left figure are statistically significant. The data in the right figure are not statistically significant because there is a small difference between groups (small blue arrow) and large variation within groups (large red braces).

Figure 2. Average (± SE) percent coverage of lichen under high light (<75% canopy cover) and low light (>75% canopy cover) on sycamore trees in Tacoma, WA. Data are used for illustrative purposes.

It is important to note that there are a few types of t tests (e.g. Student’s and Welch’s t-test). They differ in their assumptions and therefore utility. When deciding which to use, consider the following comparisons.

Table 2: Comparison of Student's t-test and Welch's t-test

t.test() allows you to perform both types of t-test depending on how you set the var.equal argument. The default is var.equal = FALSE (i.e. a Welch's t-test); you will notice "Welch Two Sample t-test" in the heading of the output. If you change the argument to var.equal = TRUE (i.e. a Student's t-test), the header changes accordingly. Try both scenarios to familiarize yourself with the t.test() function, as in the sketch below. Because Welch's t-test is not derived under the assumption of equal variances, it allows you to compare the means of two populations without first having to test for equal variances. Even so, we do test assumptions in the code below, to reinforce understanding of model assumptions and how to evaluate residual plots.
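To see the difference in practice, you can run the same formula both ways (a quick sketch using the lab data; only the Welch version is interpreted in the next section):

t.test(herbivory ~ elevation, data = example_data)                   #Welch's t-test (default, var.equal = FALSE)
t.test(herbivory ~ elevation, data = example_data, var.equal = TRUE) #Student's t-test (assumes equal variances)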

Apply T-test to data

Before running statistical tests I always find it useful to plot the data first. Let’s take a look at a plot of the raw data to see if the results of this t-test make sense.

boxplot(herbivory ~ elevation, data = example_data)

We see two groups (high and low elevation) with different levels of herbivory.

Okay, let’s do a Welch’s t-test. Here we will examine the difference in herbivory between the high elevation and low elevation groups.

herb_t.test <- t.test(herbivory ~ elevation, data = example_data) #We save the results as "herb_t.test."
herb_t.test #Examine the results of the t-test.
## 
##  Welch Two Sample t-test
## 
## data:  herbivory by elevation
## t = 1.3249, df = 170, p-value = 0.187
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
##  -1.730763  8.796057
## sample estimates:
## mean in group high  mean in group low 
##           13.93820           10.40556

Here we see that the mean herbivory in the high group is 13.9 and 10.4 in the low group. This means the “high” group has about 3.5 cm^2 more herbivory on average. This is the effect size or raw coefficient. It is up to the researcher(s) to know if this is biologically significant or not.

Another note: here the p-value is 0.187, meaning that if there were truly no difference between the groups, we would expect to observe a difference at least this extreme 18.7% of the time.
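If you would rather pull these numbers out of the saved object than read them off the printout, the object returned by t.test() stores them as named elements (a minimal sketch):

herb_t.test$estimate                              #the two group means
herb_t.test$estimate[1] - herb_t.test$estimate[2] #difference in means (the raw effect size)
herb_t.test$p.value                               #the p-value
herb_t.test$conf.int                              #95% confidence interval for the difference in means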

What R is doing under the hood (illustrated here with a one-sample t-test on simulated data):

sample.data <- rgamma(10,2,.1) #simulate 10 observations (no seed is set, so values vary run to run)
null.mean <- 10                #hypothesized mean under the null

sample.size <- length(sample.data)
sample.mean <- mean(sample.data)
sample.sd <- sd(sample.data)
std.err <- sample.sd/sqrt(sample.size)    #standard error of the mean
t.stat <- (sample.mean-null.mean)/std.err #t statistic: (observed - null) / standard error

curve(dt(x,sample.size-1),-4,4)    #t distribution with n-1 degrees of freedom
abline(v=t.stat,col="green",lwd=3) #observed t statistic

t.crit <- abs(qt(0.025,sample.size-1))   #critical value for a 2-tailed test at alpha = 0.05
abline(v=t.crit, col="lightblue", lty=3, lwd=3)

p.val <- (1-pt(abs(t.stat),sample.size-1))*2   #2-tailed p-value
p.val
## [1] 0.03066376
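You can check this hand calculation against R's built-in one-sample test on the same simulated data; the t statistic and p-value should match:

t.test(sample.data, mu = null.mean) #one-sample t-test against the null mean of 10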

Evaluate Model Output Assumptions

We can see that the difference in means from the boxplot matches the t-test, but there is something else you should notice: both the high and low samples are skewed, with many small values and a few high values. This brings us back to the assumptions of the classic 2-sample t-test (Student's t-test) and how one may verify whether data meet those assumptions.

  1. Normally distributed residuals for each group;

  2. Constant variance among groups (homogeneity of variance).

The residuals are the variation that is left over after accounting for the main variables in the model (the “unexplained variance” after fitting the model). Residuals are calculated as the difference between each observed value and its predicted or fitted value from the model. For example, in a two-sample unpaired t-test, the fitted value for a measurement is the mean of the sample from which it came, so the residual is the observed value minus the sample mean.

Note that here we refer to “standard” residuals. You may read about “Pearson” or “Deviance” residuals which are calculated differently from the definition above.
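As a concrete illustration of the definition above, here is a sketch that computes the residuals for the elevation comparison by hand (the model-based version appears below when we fit the model with glm(); the object names here are mine):

group_means <- tapply(example_data$herbivory, example_data$elevation, mean) #mean herbivory in each elevation group
fitted_by_hand <- group_means[example_data$elevation]                       #fitted value = the mean of the group each observation came from
resid_by_hand <- example_data$herbivory - fitted_by_hand                    #residual = observed minus fitted
head(resid_by_hand)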

There are multiple methods available to check these assumptions (both statistical tests and visual diagnostics can be used), and these methods are not exclusive to a particular statistical technique: a method that tests for homogeneity of variance can do so for a t-test, ANOVA, regression, and other applications.

Table 3. T-test assumptions to consider and methods to test those assumptions. The list of methods for each assumption is not exhaustive and there may be other methods not mentioned.

So you may ask:

1) Which statistical test is best for evaluating whether your data meet the assumptions of your t-test, ANOVA, or linear regression? There is NOT one that reigns supreme; they all have different attributes, and those attributes influence, among other things, their sensitivity to departures from normality. You will have to be familiar with the attributes of your data and choose an appropriate test accordingly.

2) Which visual method is best for evaluating whether your data meet the assumptions of your t-test, ANOVA, or linear regression? Again, there is no “best”. However, residual analyses are crucially important for evaluating assumptions, and there are a few types of residual plots you can use:

  • residual vs. fitted (x-axis = fitted values; y-axis = residuals)

  • residual vs. predictor (x-axis = predictor values; y-axis = residuals)

  • residual vs. ordered predictor (x-axis = predictor values, ordered in the temporal or spatial order in which they were collected; y-axis = residuals)

Typically, residual vs. fitted plots are used. A residual vs. predictor plot does not yield any information beyond what you would already learn from the residual vs. fitted plot. However, residual vs. predictor plots can be extremely helpful if you replace the predictor values with a new predictor variable: if the new predictor explains variance in the residuals of the original model, that is evidence the new predictor should be added to the model. More on that in later labs.

Click here for a helpful explanation on the common assumptions of ANOVA, linear regression and t-test and very helpful visuals illustrating the ways we can think about variance

Click here for a helpful website on residuals

Test Assumption 1: Normality of Residuals

To test model assumption 1 (whether the residuals are normally distributed), you will use 2 pieces of information:

  1. a density plot of residuals for each group
  2. a Shapiro-Wilk test of normality for each group

First, a visual evaluation of the normality of residuals using a density plot. Because the t.test() function only reports summary statistics, we need to first run a linear model to get the residuals.

fit <- glm(herbivory ~ elevation, data = example_data) #glm() fits a generalized linear model, we'll get into why this is reasonable later 
plot(density(resid(fit))) #check for normality of residuals

Examine the density plot. If the residuals are normal, it should follow a more or less bell-shaped curve. Here the distribution of residuals is clearly skewed, so the residuals are unlikely to be normally distributed. If your density plot of residuals doesn’t look normal, you need to either transform your response variable or use a non-parametric test. Before we deal with that, let’s confirm with some other assumption tests.

Next, a statistical test to evaluate normality of residuals: the Shapiro-Wilk test. The null hypothesis of the Shapiro-Wilk test is that the distribution of the sample does not differ from a normal distribution. Thus, p < 0.05 supports the alternative hypothesis that the sample distribution does differ from a normal distribution, indicating our data break the assumption of normal residuals. Some would say breaking the normality assumption is less important than breaking the assumption of constant variance, so we will evaluate constant variance next.

shapiro.test(resid(fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit)
## W = 0.71003, p-value < 2.2e-16
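Another common visual check, not shown in the original output, is a normal Q-Q plot of the residuals; points falling close to the reference line are consistent with normality:

qqnorm(resid(fit)) #sample quantiles of the residuals vs. theoretical normal quantiles
qqline(resid(fit)) #reference line through the first and third quartiles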

Test Assumption 2: Constant Variance/homogeneity of variance

Like before, let’s start with a visual evaluation of constant variance using a residual vs. fitted plot.

plot(fitted.values(fit), resid(fit))

If our predictor variable were continuous, as in a linear regression, a residual vs. fitted plot should appear as a cloud of points around zero (i.e. a “shotgun blast”). You will see this later in the code when we explore linear regression. In the case of a categorical predictor with two levels, there are only two unique fitted values (the mean of each of the two samples), so the graph of residuals against fitted values will consist of two vertical “stacks” of residuals. If the residuals have constant variance, we would expect the stacks to be about the same length and at about the same level; in other words, the spread of residuals along the y-axis is similar between the two groups. We can also evaluate normality here (i.e. the cluster of points in both groupings should gather around zero and have equal spread about zero).

Now let’s use a statistical test for homogeneity of variance: an F-test. You will see this test statistic again in linear regression to evaluate model performance. It is important to remember that the F statistic is a ratio of variances. var.test() performs an F-test; here the F statistic is obtained by computing the ratio of variances in the high and low elevation groups. The F-test is known to be sensitive to deviations from normality. Note that the function defaults to a two-sided test.

Table 4. Comparison of one and two-tailed F tests. Taken from http://www.vias.org/tmdatanaleng/cc_test_2sample_ftest.html
var.test(herbivory ~ elevation, data = example_data) 
## 
##  F test to compare two variances
## 
## data:  herbivory by elevation
## F = 1.4749, num df = 88, denom df = 89, p-value = 0.06908
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9698925 2.2438722
## sample estimates:
## ratio of variances 
##           1.474863

A high p-value here means we fail to reject the null hypothesis that the variances are equal, so these data are consistent with homogeneous variance. So far we can conclude that the residuals are not normally distributed, but the variances of the two samples from high and low elevation appear constant.
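To connect this output to the idea that the F statistic is a ratio of variances, you can compute the ratio by hand; it should match the “ratio of variances” reported above:

with(example_data, var(herbivory[elevation == "high"]) / var(herbivory[elevation == "low"])) #sample variance of the high group over the low group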

It’s also worth mentioning that many of these visual checks can be conducted using just the plot() function:

par(mfrow = c(2, 2))
plot(fit) #plot() will return statistical summaries for certain model outputs

par(mfrow = c(1, 1))

Details on how to interpret these plots can be found here.

Options when data do not meet assumption of normal residuals: Transform Data

Transforming your data means modifying the response variable so that the assumptions are met. There are plenty of types of transformations and these can be categorized as linear or non-linear transformations:

Types of transformation:

  1. linear: adding a constant or multiplying by a constant (does not change the results of statistical tests). The z-score falls in this category.
  2. non-linear: log transformation, square-root transformation, etc. (results of statistical tests differ from tests on the untransformed variable)

This often means taking the square root or the log of the response variable and running the test again. The reason transformations work is that the rank order of the replicates stays the same (sample 1 is still lower than sample 2, for example) but the absolute distances between them are compressed. In Figure 3, we see that square-root and log transformations generally shrink the largest values relative to the smaller ones.

Figure 3: A. heuristic representation of how transformations affect outliers. B, C. how logarithmic and square-root transformations change the distribution and spread of raw data.

Fig 3a Taken from here. Figs b & c Taken from here

We will square-root transform the response variable and see if this improves the normality and variance. Square-root transformations can make a non-linear relationship between X and Y more linear and can reduce unequal variances. It is important to remember that after doing this, our effect size is on the square-root scale, which can make interpretation of the results more difficult. You still present your data using the raw numbers, not the transformed numbers.

#Revisit untransformed data
plot(density(resid(fit)))

shapiro.test(residuals(fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit)
## W = 0.71003, p-value < 2.2e-16
#Compare to transformed data
example_data$herb_sqrt <- sqrt(example_data$herbivory)

#Re-run density plot
fit_sqrt <- glm(herb_sqrt ~ elevation, data = example_data)

plot(density(resid(fit_sqrt)))

#Re-run Shapiro-Wilks test
shapiro.test(residuals(fit_sqrt))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit_sqrt)
## W = 0.92432, p-value = 5.034e-08

That’s a little better. You can see how sensitive the statistical tests are to departures from normality: even with the transformation, the Shapiro-Wilk test indicates the residuals are not normal. This is part of why visual examination of the data is so important.

Let’s compare the transformed and untransformed distributions for each elevation. ggplot2 is a tidyverse package that makes certain plotting operations easier. Everything you can do in ggplot can be done in base R, but the more complex your figures are, the more useful ggplot will become. For example, Figures 1, 2, & 5 in this markdown were made in ggplot.

library(ggplot2) 

ggplot(example_data)+  
  stat_density(aes(herb_sqrt, fill = 'herb_sqrt'), alpha = .5)+
  stat_density(aes(herbivory, fill = 'herbivory'), alpha = .5)+
  facet_grid(cols = vars(elevation))

And finally, let’s compare the elevations:

#conduct t test on transformed data
herb_sqrt_t.test <- t.test(herb_sqrt ~ elevation, data = example_data) #We save the results of the t-test as "herb_sqrt_t.test"
herb_sqrt_t.test #Examine the results of the t-test.
## 
##  Welch Two Sample t-test
## 
## data:  herb_sqrt by elevation
## t = 0.91792, df = 169.95, p-value = 0.36
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
##  -0.3515798  0.9627352
## sample estimates:
## mean in group high  mean in group low 
##           2.845379           2.539801

Check for Understanding

6) Why do the means differ from the untransformed data? How does our interpretation change from the first t-test?

7) Perform a t-test to compare herbivory between Piper and the other plants. Hint: you will need to make a new variable that categorizes plants as “piper” and “not piper.” The ifelse() function might come in handy, but there are many ways of doing this.

ANOVA

ANOVA stands for analysis of variance and is used for comparing means across the levels of a categorical variable with three or more levels. Identify the column(s) in these data that are categorical with three levels.

Let’s examine the relationship between plant genus and herbivory. The assumptions of an ANOVA:

  1. The residuals are normally distributed
  2. There is equal variance among groups (homogeneity of variance)
  3. The residuals are independent
Table 5. ANOVA assumptions to consider and methods to test those assumptions. The list of methods for each assumption is not exhaustive and there may be other methods not mentioned.
boxplot(herbivory ~ genus, data = example_data)

Here we will examine the differences between the three plant genera. If you need a refresher on how to interpret ANOVA outputs, you can use the document in the Google folder “Interpreting R Outputs”

herb_aov <- aov(herbivory ~ genus, data = example_data) #We save the results of the ANOVA as "herb_aov."
summary(herb_aov) #Examine the results of the ANOVA.
##              Df Sum Sq Mean Sq F value  Pr(>F)   
## genus         2   3511  1755.6   5.804 0.00362 **
## Residuals   176  53236   302.5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unlike the t.test() function, the aov() function is a bit less verbose and does not output effect sizes, which are the coefficients of the model. We can use the coef() function for this.

coef(herb_aov)
##     (Intercept)      genusPiper genusPsychotria 
##       12.303279        5.239094       -5.667685

With the ANOVA, we are essentially comparing the means of all of the groups. To do this, one group becomes the intercept (usually the first level alphabetically; see ?factor for how to change this), and the other coefficients are expressed as differences from the intercept. So in this example the intercept is the mean herbivory on Miconia, and genusPiper and genusPsychotria are the differences in herbivory of each genus compared to Miconia.

  1. Mean herbivory Miconia = 12.303
  2. Mean herbivory Piper = 12.303 + 5.239 = 17.54
  3. Mean herbivory Psychotria = 12.303 - 5.668 = 6.64
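To check this interpretation yourself, you can compute the group means directly; they should match the values above:

aggregate(herbivory ~ genus, data = example_data, FUN = mean) #mean herbivory for each genus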

You may have noticed by now that functions like coef() and glm(), which might seem appropriate only for continuous predictors in regression analysis, are being used with the categorical predictors of t.test() and aov(). It is important to understand why: ANOVA and the t-test ARE essentially linear regressions. If you are unsure how to make sense of that, click here.

In all three techniques, the variance explained by the fitted model is compared to the total variance. To do this for categorical predictors, as in the t-test and ANOVA, R codes the categorical variables into dummy (0/1) variables and puts them into a design matrix (review here).
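You can see this dummy coding for yourself by looking at the design matrix R builds from the genus column (a quick sketch; the intercept column corresponds to the reference level, Miconia, and the other columns are 0/1 indicators):

head(model.matrix(~ genus, data = example_data)) #first 6 rows of the design matrix used behind the scenes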

(Some) people care about p-values, so let’s look at those too. You can see that the ANOVA (herb_aov) is “significant”, but what does that mean? Remember that a p-value tells you the probability of obtaining group differences at least this extreme if the null hypothesis is true. The null hypothesis in this case is that all group means are equal. How can we examine which groups are different? We can use a Tukey post-hoc test.

TukeyHSD(herb_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = herbivory ~ genus, data = example_data)
## 
## $genus
##                          diff        lwr       upr     p adj
## Piper-Miconia        5.239094  -2.267465 12.745653 0.2277095
## Psychotria-Miconia  -5.667685 -13.174245  1.838874 0.1776925
## Psychotria-Piper   -10.906780 -18.475635 -3.337924 0.0023435
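TukeyHSD() results also have a plot method if you prefer to see the pairwise comparisons graphically; intervals that do not cross zero correspond to significant pairwise differences:

plot(TukeyHSD(herb_aov)) #confidence intervals for each pairwise difference in means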

Let’s check for the assumption of normality. For this we will quickly evaluate visually using a density plot.

plot(density(resid(herb_aov)))

Now we will evaluate homogeneity of variance using a statistical test. Let’s try a different test from the F-test we used previously: the Bartlett test.

bartlett.test(herbivory ~ genus, data = example_data)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  herbivory by genus
## Bartlett's K-squared = 13.576, df = 2, p-value = 0.001127

The null hypothesis of the Bartlett test is that there is no difference in variances between groups. Here our results show that we reject this null hypothesis, indicating the variances are not constant. As before, you can transform the data to try to meet the assumptions of the test.

Check for Understanding:

8) Transform these data, re run the ANOVA, and interpret the results. Keep in mind that there are many types of transformations. Try a few.

The ANOVAs you have run so far are called one-way ANOVAs, because you are looking at the effect of one variable (with more than two levels) on a response variable. A two-way ANOVA examines the relationship between two categorical predictor variables and a response variable (review here: Brandon Fultz playlist 13 on one-way and two-way ANOVA).

Here is an example:

herb_aov.2way <- aov(herbivory ~ genus*elevation, data = example_data) #We save the results of the ANOVA as "herb_aov.2way."
summary(herb_aov.2way) #Examine the results of the ANOVA.
##                  Df Sum Sq Mean Sq F value  Pr(>F)   
## genus             2   3511  1755.6   5.879 0.00339 **
## elevation         1    558   557.6   1.867 0.17356   
## genus:elevation   2   1021   510.6   1.710 0.18389   
## Residuals       173  51657   298.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(herb_aov.2way)
##                  (Intercept)                   genusPiper              genusPsychotria                 elevationlow 
##                    13.564516                     8.987208                    -7.840378                    -2.564516 
##      genusPiper:elevationlow genusPsychotria:elevationlow 
##                    -7.287208                     4.357045

Plot:

boxplot(herbivory ~ elevation * genus, data = example_data)
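Another way to visualize a two-way design (a sketch using base R's interaction.plot(); the labels are mine) is to plot the group means and look for non-parallel lines, which would suggest an interaction:

interaction.plot(factor(example_data$elevation), factor(example_data$genus), example_data$herbivory,
                 xlab = "elevation", trace.label = "genus", ylab = "mean herbivory") #non-parallel lines suggest a genus-by-elevation interaction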

Basic linear regression

A linear regression is a technique for modeling the linear relationship between a continuous response variable and one or more continuous predictor variables.

Assumptions of linear regressions:

  1. normally distributed residuals
  2. homogeneity of variance of residuals
  3. independent samples
  4. linear relationship between variables
  5. no multicollinearity among predictors
Table 6. Basic linear regression assumptions to consider and methods to test those assumptions. The list of methods for each assumption is not exhaustive and there may be other methods not mentioned.

Let’s first make sure we know how to interpret a linear model output. Below is an example of a single-predictor model examining the effect of plant area on percent herbivory. If you are having trouble making sense of this output or the under-the-hood math, I recommend the following resources.

Some very approachable YouTube tutorials:

A mathier version with an online course format:

herb_lm <- lm(herbivory ~ area, data = example_data)

plot(herbivory ~ area, data = example_data)
abline(herb_lm, col = 'red') #abline adds the modeled line to the active plot

summary(herb_lm)
## 
## Call:
## lm(formula = herbivory ~ area, data = example_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.259  -9.932  -7.107   4.058  77.120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.16521    2.11106   4.342 2.38e-05 ***
## area         0.01211    0.00664   1.824   0.0698 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.74 on 177 degrees of freedom
## Multiple R-squared:  0.01846,    Adjusted R-squared:  0.01291 
## F-statistic: 3.328 on 1 and 177 DF,  p-value: 0.06979

Let’s see if a new variable, diversity, explains some variance in the residuals. If so, we should add it to the model.

plot(example_data$diversity, resid(herb_lm)) #residual vs. predictor plot: candidate predictor (diversity) on the x-axis, residuals on the y-axis

Yes, diversity seems to have a positive linear relationship with the residuals, so let’s add that variable to the model.

herb_lm.div <- lm(herbivory ~ area + diversity, data = example_data)
summary(herb_lm.div)
## 
## Call:
## lm(formula = herbivory ~ area + diversity, data = example_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.544  -8.543  -3.041   3.332  68.538 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.57502    2.24489  -1.593    0.113    
## area         0.00169    0.00562   0.301    0.764    
## diversity    7.31228    0.80787   9.051 2.61e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.7 on 176 degrees of freedom
## Multiple R-squared:  0.3302, Adjusted R-squared:  0.3226 
## F-statistic: 43.39 on 2 and 176 DF,  p-value: 4.805e-16
par(mfrow = c(2, 2))
plot(herb_lm.div)

par(mfrow = c(1, 1))

Like before, the coefficient is the effect size. This is what we care about most. It tells us the predicted change in our response variable (herbivory) for every 1-unit change in our predictor variable (area). In this case we expect a change of 0.002 percent herbivory for every 1 cm^2 of leaf area. This seems small, but remember the scale of leaf area: this variable ranges from about 12 to over 1000. Diversity, on the other hand, only ranges from 0 to 6. Its effect size is bigger, but is that just because the two predictors are measured on vastly different scales? How can we compare them?

This is where standardization comes in. By standardizing variables we put them on the same scale: the scale of standard deviations. Standardization takes all of the data in a column and transforms it to have a mean of zero and a standard deviation of 1. How? The default behavior of scale() is to compute the z-score, which centers the data (subtracts the mean from each value) and then scales it (divides by the standard deviation). You can modify the arguments of this function to your needs; for example, you may want to only center the data. Click here for a more detailed explanation of how to modify the scale() function. Click here for a more detailed explanation of the z-score.

Figure 4. The formula behind the z-score transformation
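For reference, the z-score transformation in Figure 4 is

$$ z = \frac{x - \bar{x}}{s} $$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation of the variable.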

Thus, units of each variable are now in terms of standard deviations. This is very important when it comes to comparing variables measured on different scales.

We can standardize area and diversity like this:

example_data$scale_div <- scale(example_data$diversity, center = FALSE) #note: with center = FALSE, scale() does not subtract the mean; it divides by the root mean square rather than producing a true z-score
example_data$scale_area <- scale(example_data$area, center = FALSE)

Let’s take a quick look at what you just did:

hist(example_data$area, main = "Area before scaling")

hist(example_data$scale_area, main = "Area after scaling")

hist(example_data$diversity, main = "Diversity before scaling")

hist(example_data$scale_div, main = "Diversity after scaling")

You can see that the shapes of the distributions are the same, but the range of values has changed (because they are now standardized). You should also notice that the data are not normal; do you think that is important?

Below we will focus on assumptions not yet discussed: multicollinearity and linearity. We will then revisit the concept of homogeneity of variance (homoscedasticity) as it appears in regression plots.

Multi-collinearity

(Review here, time 30:35.) Multicollinearity occurs when you run a multiple regression (a regression with multiple predictor variables) and the predictor variables are “too” correlated. This can cause the model to have trouble separating the effects of each predictor. Checking correlations among variables or using a variance inflation factor (VIF) is a good idea.

We will explore multicollinearity in more detail in a future lab. There are lots of useful correlation plot functions to choose from in R, and these can be helpful for visualizing correlations among multiple variables. Since we only have two continuous predictors in example_data, here I simply use cor(). Forget how correlation coefficients are calculated? Click here.

cor(example_data[c("area", "diversity")])
##                area diversity
## area      1.0000000 0.2049222
## diversity 0.2049222 1.0000000
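The variance inflation factor mentioned above can also be computed directly; a minimal sketch, assuming the car package is installed (vif() comes from car, not base R):

library(car)
vif(lm(herbivory ~ area + diversity, data = example_data)) #values near 1 indicate little collinearity; values above roughly 5-10 are often taken as a warning sign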

Linearity

(Review here, time 8:12.) Linearity means that the mean of the response variable at each predictor value should be a linear function of the predictor. Visual inspection of a plot of the response against the predictor can help you evaluate this assumption. Let’s plot the raw data for herbivory against the two predictors separately.

plot(herbivory ~ area, example_data)

plot(herbivory ~ diversity, example_data)

These data do not seem to exhibit a non-linear pattern (e.g. a polynomial curve) that would rule out the use of a linear regression, although the relationship between area and herbivory, while roughly linear, is weak. If you don’t trust your eye, you can fit a loess (locally weighted) regression using the loess smoother in ggplot. If the loess curve approximates the linear fit for the most part, then we can be more confident the linearity assumption is not broken.

ggplot(example_data, aes(area, herbivory)) + 
  stat_smooth(method="loess") +
  stat_smooth(method="lm", color="red", fill="red", alpha=.25) + 
  geom_point()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

ggplot(example_data, aes(diversity, herbivory)) + 
  stat_smooth(method="loess") + 
  stat_smooth(method="lm", color="red", fill="red", alpha=.25) + 
  geom_point()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Homogeneity of Variance

Let’s revisit the concept of homogeneity of variance (aka “homoscedasticity”). Residual vs. fitted plots for continuous predictors will look different from the residual vs. fitted plots for categorical predictors, like the one we made above for the t-test. A “shotgun blast” pattern in the residual vs. fitted plot is characteristic of constant variance. Here we simulate data to illustrate how the two scenarios would appear (a sketch of this simulation is shown below).
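The simulation code is not shown in this document, so here is a minimal sketch of the kind of simulation described (all object names and parameter values here are my own):

set.seed(42)
x <- runif(200, 0, 10)
y_homo <- 2 + 3*x + rnorm(200, sd = 2)       #homoscedastic: error variance constant across x
y_hetero <- 2 + 3*x + rnorm(200, sd = 0.5*x) #heteroscedastic: error variance grows with x

fit_homo <- lm(y_homo ~ x)
fit_hetero <- lm(y_hetero ~ x)

par(mfrow = c(1, 2))
plot(fitted(fit_homo), resid(fit_homo), main = "Constant variance"); abline(h = 0, lty = 2)
plot(fitted(fit_hetero), resid(fit_hetero), main = "Non-constant variance"); abline(h = 0, lty = 2)
par(mfrow = c(1, 1))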

You can see that when the data are homoscedastic, the variance does not change along the line. There are some residual points with larger spread at smaller fitted values, but generally the residuals have equal spread around y = 0.

When the data are heteroscedastic, the variance changes along the line. Here you will notice the characteristic pattern of non-constant variance: a “fan” (or horn) shape.

Check for Understanding

9) Match the indicated points in the raw data (Figure A) with their corresponding points in the residual plot (Figure B).

Figure 5. A) Regression plot and its corresponding B) residual plot. Adapted from: https://online.stat.psu.edu/stat462/node/117/

10. Your turn: a) Re-run the linear regression with scaled predictor variables. b) Which variable has a larger effect? c) Also, what is the p-value telling you here? Use my explanation from before and some research on your own to properly interpret it. P-values are tricky and often do not mean what the researcher thinks they do.

Answers

1. Why are some columns summarized with means and quantiles and others not? The columns not summarized with quantiles and means (elevation and genus) are character/categorical rather than numeric variables; the numeric variables (area, herbivory, diversity, and plotID) get means and quantiles.

2. Which column(s) in these data are categorical with two levels?

elevation

3. Why are outputs from summary(example_data$herbivory) and boxplot(example_data$herbivory)$stats different?

You will notice that the boxplot(example_data$herbivory)$stats reports:

You will notice summary(example_data$herbivory) reports:

4) Verify that you are able to match the following terms and values with their location on boxplot(example_data$herbivory)

Answer 4.

5) You should understand what ‘quartiles’ represent. Considering the quartiles for example_data$herbivory, how do you interpret Q1 = 1, Q2 = 5, and Q3 = 15?

6) Why are the means different? How does our interpretation change from the first t-test?

The magnitude of the largest values (outliers) was decreased substantially. This shows how large outliers can lead to Type 1 error. The p-value has increased, and we should consider it likely that we would see a difference of this magnitude or greater even when there is no true difference between the two groups.

7) Perform a t-test to compare herbivory between Piper and the other plants. Hint: you will need to make a new variable that categorizes as “piper” and “not piper.” The ifelse() function might come in handy. Answer in R script

#Answer #7
#Here's a few ways to create a column that categorizes plants as "piper" and "not piper"
example_data$is_piper <- "not Piper"
example_data$is_piper[example_data$genus == 'Piper'] <- 'Piper'

#or
example_data$is_piper <- ifelse(example_data$genus != "Piper", "not Piper", "Piper" ) 

#or
example_data$is_piper <- c('not Piper', 'Piper')[(example_data$genus == "Piper")+1]

#Examine data
boxplot(herbivory ~ is_piper, data = example_data)

piper_t.test <- t.test(herbivory ~ is_piper, data = example_data) 
piper_t.test #Raw effect size: Piper plants have 8.03 cm^2 more herbivory on average than non piper plants
## 
##  Welch Two Sample t-test
## 
## data:  herbivory by is_piper
## t = -2.5671, df = 87.116, p-value = 0.01196
## alternative hypothesis: true difference in means between group not Piper and group Piper is not equal to 0
## 95 percent confidence interval:
##  -14.239517  -1.811895
## sample estimates:
## mean in group not Piper     mean in group Piper 
##                9.516667               17.542373
#Evaluate Model Assumptions
#Examine Normality of Residuals
fit <- glm(herbivory ~ is_piper, data = example_data)

plot(density(resid(fit)))

par(mfrow = c(2, 2))
plot(fit)

par(mfrow = c(1, 1))

shapiro.test(resid(fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit)
## W = 0.74596, p-value = 2.84e-16
#Conclusion: Residuals are not normally distributed

#Examine Constant Variance
#Note: the grouping variable for this model should arguably be is_piper (i.e. var.test(herbivory ~ is_piper, data = example_data));
#the output shown below compares variances between the elevation groups
var.test(herbivory ~ elevation, data = example_data) 
## 
##  F test to compare two variances
## 
## data:  herbivory by elevation
## F = 1.4749, num df = 88, denom df = 89, p-value = 0.06908
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9698925 2.2438722
## sample estimates:
## ratio of variances 
##           1.474863
#Variance is constant

#Transform Data
example_data$herb_sqrt <- sqrt(example_data$herbivory)
fit <- glm(herb_sqrt ~ is_piper, data = example_data)

plot(density(resid(fit)))

par(mfrow = c(2, 2))
plot(fit)

par(mfrow = c(1, 1))

shapiro.test(residuals(fit))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fit)
## W = 0.938, p-value = 5.601e-07
var.test(herb_sqrt ~ is_piper, data = example_data) 
## 
##  F test to compare two variances
## 
## data:  herb_sqrt by is_piper
## F = 0.63448, num df = 119, denom df = 58, p-value = 0.03805
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.398598 0.975348
## sample estimates:
## ratio of variances 
##          0.6344755
#Re-run t test
piper_t.test.sqrt <- t.test(herb_sqrt ~ is_piper, data = example_data) 
piper_t.test.sqrt
## 
##  Welch Two Sample t-test
## 
## data:  herb_sqrt by is_piper
## t = -2.7078, df = 95.31, p-value = 0.008028
## alternative hypothesis: true difference in means between group not Piper and group Piper is not equal to 0
## 95 percent confidence interval:
##  -1.7548935 -0.2702282
## sample estimates:
## mean in group not Piper     mean in group Piper 
##                2.357988                3.370548

8) Transform these data, re run the ANOVA, and interpret the results. Keep in mind that there are many types of transformations. Try a few.

#Answer #8
# sqrt transformation
aov.sqrt <- aov(herb_sqrt ~ genus, data = example_data) #We save the results of the ANOVA as "aov.sqrt"

summary(aov.sqrt)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## genus         2   75.3   37.63   8.228 0.000384 ***
## Residuals   176  804.8    4.57                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(aov.sqrt)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = herb_sqrt ~ genus, data = example_data)
## 
## $genus
##                          diff        lwr        upr     p adj
## Piper-Miconia       0.4837031 -0.4392614  1.4066676 0.4320028
## Psychotria-Miconia -1.0756429 -1.9986074 -0.1526784 0.0177505
## Psychotria-Piper   -1.5593460 -2.4899701 -0.6287219 0.0003181
hist(resid(aov.sqrt))

bartlett.test(herb_sqrt ~ genus, data = example_data)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  herb_sqrt by genus
## Bartlett's K-squared = 5.9434, df = 2, p-value = 0.05122
#logarithmic transformation
example_data$herb_log <- log(example_data$herbivory+1) #we add one here to avoid taking the log of zero
aov.log <- aov(herb_log ~ genus, data = example_data) #We save the results of the ANOVA as "aov.log"
summary(aov.log)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## genus         2  25.71  12.855   8.774 0.000233 ***
## Residuals   176 257.86   1.465                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(aov.log)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = herb_log ~ genus, data = example_data)
## 
## $genus
##                          diff        lwr        upr     p adj
## Piper-Miconia       0.1975807 -0.3248469  0.7200084 0.6448387
## Psychotria-Miconia -0.6905152 -1.2129429 -0.1680876 0.0058840
## Psychotria-Piper   -0.8880959 -1.4148591 -0.3613327 0.0002896
hist(resid(aov.log))

bartlett.test(herb_log ~ genus, data = example_data)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  herb_log by genus
## Bartlett's K-squared = 1.792, df = 2, p-value = 0.4082

9) Match the indicated points in the raw data (Figure A) with their corresponding points in the residual plot (Figure B).

Answer 9.

10) a) Re-run the linear regression with scaled predictor variables. b) Which variable has the largest effect? c) Also, what is the p-value telling you here? Use my explanation from before and some research on your own to properly interpret it. P-values are tricky and often do not mean what the researcher thinks they do.

Answer to 10A

lm.scaled <- lm(herb_sqrt ~ scale_area + scale_div, data = example_data)
summary(lm.scaled)
## 
## Call:
## lm(formula = herb_sqrt ~ scale_area + scale_div, data = example_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1268 -0.8597 -0.2618  0.4470  5.5529 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2738     0.2294   1.194    0.234    
## scale_area   -0.1006     0.1831  -0.549    0.584    
## scale_div     3.0032     0.2081  14.434   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.501 on 176 degrees of freedom
## Multiple R-squared:  0.5492, Adjusted R-squared:  0.5441 
## F-statistic: 107.2 on 2 and 176 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm.scaled)

par(mfrow = c(1, 1))

10B. Diversity has the larger effect size. These standardized coefficients can be interpreted as: holding everything else constant, on average, a one-standard-deviation change in the predictor is associated with a change in the response equal to the magnitude of the coefficient.

10C.From above: “Another note: Here, the p value is 0.187, meaning that if there is truly no difference between groups that we expect to find a difference at least this extreme 18.7% of the time.”

If there is truly no effect of area on herbivory, as stated by the null hypothesis, we still expect to find an effect of area on herbivory at least that extreme (-0.1006), 58.4% of the time.

If there is truly no effect of diversity on herbivory, as stated by the null hypothesis, we still expect to find an effect of diversity on herbivory at least that extreme (3.0032), <0.00001% of the time.