Throughout, I recommend links to online resources that review the concepts and under-the-hood math in more detail. Below I highlight some texts and online resources that may be helpful to you.
Recommended Texts:
Recommended Online resources:
YouTube Channels & Videos:
Online Courses & Tutorials:
A reference guide that may help you interpret R outputs:
This lab focuses on the following statistical techniques: t-test, ANOVA, and basic linear regression. The code focuses on testing the assumptions of these techniques and interpreting their outputs. In addition, we review transformations and standardization. Ideally, you will leave this lab understanding how the methods are related, especially with regard to how they each build on the concept of variance.
In addition to understanding the under-the-hood math behind these techniques, it is very important to remind ourselves of the bigger picture to which they belong; this will help develop your understanding of the methods. These techniques fall under the larger umbrella of hypothesis testing (review here) and maximum likelihood (review here & review here), which in turn fall under the larger umbrella of frequentist inference, one approach to statistical inference. Frequentist inference is often contrasted with an alternative approach: Bayesian inference. Bayesian inference will not be covered until later in the course, but if you are curious how it compares to frequentist statistics you can check out this video.
There are “checking for understanding” questions throughout. These do not need to be turned in, but if you are struggling to answer, please ask for help. Answers to these questions are at the end of the document.
Set working directory.
setwd("...your_working_directory/")
Import data.
example_data <- read.csv("intro_LM_data.csv")
Examine data.
colnames(example_data) #Returns column names
## [1] "elevation" "genus" "area" "herbivory" "diversity" "plotID"
head(example_data) #Returns first 6 entries for each column
## elevation genus area herbivory diversity plotID
## 1 high Piper 400.0000 76 4 1
## 2 high Piper 133.3000 5 2 2
## 3 high Piper 171.5278 3 2 3
## 4 high Piper 38.0000 0 0 4
## 5 high Piper 825.0000 1 2 5
## 6 high Piper 900.0000 5 3 6
Summarize data.
summary(example_data)
## elevation genus area herbivory diversity plotID
## Length:179 Length:179 Min. : 12.5 Min. : 0.00 Min. :0.000 Min. : 1.00
## Class :character Class :character 1st Qu.: 105.5 1st Qu.: 1.00 1st Qu.:1.000 1st Qu.: 8.00
## Mode :character Mode :character Median : 192.0 Median : 5.00 Median :2.000 Median :15.00
## Mean : 247.4 Mean :12.16 Mean :2.095 Mean :15.42
## 3rd Qu.: 355.6 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:23.00
## Max. :1093.8 Max. :87.00 Max. :6.000 Max. :31.00
Check for Understanding:
1) Why are some variables summarized with means and quantiles and others not?
2) Which column(s) in these data are categorical with two levels?
Explore a single variable: Herbivory
summary(example_data$herbivory)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 5.00 12.16 15.00 87.00
boxplot(example_data$herbivory)
boxplot(example_data$herbivory)$stats #returns the five boxplot statistics (lower whisker, Q1, median, Q3, upper whisker); see also ?quantile
## [,1]
## [1,] 0
## [2,] 1
## [3,] 5
## [4,] 15
## [5,] 35
Verify you understand the various descriptive statistics and how to interpret R outputs. Click here for review on descriptive statistics
Check for Understanding
3) Why are the outputs from summary(example_data$herbivory) and boxplot(example_data$herbivory)$stats different? You should be able to discern why the maximum values differ (i.e. 87 and 35, respectively) and why summary() returns 6 values while $stats returns 5 values.
4) Verify that you are able to match the following terms and values with their location on boxplot(example_data$herbivory):
5) You should understand what ‘quartiles’ represent. With consideration to the quartiles for example_data$herbivory, how do you interpret Q1 = 1, Q2 = 5 and Q3 = 15?
This section is taken from: https://rpubs.com/cwoods/t_test
A t-test is a parametric test that compares two means and tells you if they are significantly different from each other. A t-test has two assumptions that have to be met:
The residuals have to be normally distributed for each group;
There is equal variance among groups (homogeneity of variance).
The residuals are the leftover variation that is not accounted for by your explanatory variable. Residuals plots are useful for checking these two assumptions. Before we can interpret our statistical output, we should make sure the assumptions above are met.
A t-test depends on the t distribution; the t statistic measures the difference between two groups relative to the variation within the groups. In other words, it measures how different two means are relative to their variance. If the difference between your two groups is larger than the variation within groups (the difference between the means is large and the variance around each mean is small), then the groups will be significantly different from each other.
Here are some statistical terms you should understand to interpret your t-test.
Most of the time, you are only testing whether the values differ from each other, regardless of direction, so a two-tailed t-test is used; in that case the rejection area under each tail of the curve is 2.5% (0.025; Figure 1). If you are only interested in whether one mean is larger (or only smaller) than another, then you use a one-tailed t-test. If the t statistic generated by your test is larger than the critical t value, the area under the curve beyond it is smaller than 5%, so the p-value is < 0.05 and the means are significantly different: you have explained more variance than expected by random chance. Put another way, if you declare a significant effect when there really isn’t one, you will be wrong less than 5% of the time.
In the figure below, the t statistic can be conceptualized as the ratio of the difference in means (blue arrow) over the variance you can’t explain (red braces). If you divide the blue arrows in each figure by the red braces, the t statistic is much larger in the left figure than the right figure. That means you are explaining more variation than you aren’t explaining in the left figure (large difference between groups and small variation within groups) and the data in the left figure are statistically significant. The data in the right figure are not statistically significant because there is a small difference between groups (small blue arrow) and large variation within groups (large red braces).
It is important to note that there are a few types of t tests (e.g. Student’s and Welch’s t-test). They differ in their assumptions and therefore utility. When deciding which to use, consider the following comparisons.
t.test() allows you to perform both types of t-tests depending on how you set the var.equal argument. The default is var.equal = FALSE (i.e., a Welch’s t-test); you will notice “Welch Two Sample t-test” in the heading of the output. If you change the argument to var.equal = TRUE (i.e., a Student’s t-test), the heading reads “Two Sample t-test”. Try both scenarios to familiarize yourself with the t.test() function. Because Welch’s t-test is not derived under the assumption of equal variances, it allows users to compare the means of two populations without first having to test for equal variances. Even so, we do test assumptions in the code below, to reinforce understanding of model assumptions and how to evaluate residual plots.
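For reference (we will actually run these below), a minimal sketch of how those arguments change the call, using the same herbivory-by-elevation comparison from this lab; the one-sided alternative at the end is purely illustrative:
t.test(herbivory ~ elevation, data = example_data, var.equal = FALSE) #Welch's t-test (the default): variances not assumed equal
t.test(herbivory ~ elevation, data = example_data, var.equal = TRUE) #Student's t-test: assumes equal variances in the two groups
t.test(herbivory ~ elevation, data = example_data, alternative = "greater") #one-tailed test: is mean herbivory in the first group ("high") greater than in the second ("low")?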
Before running statistical tests I always find it useful to plot the data first. Let’s take a look at a plot of the raw data to see if the results of this t-test make sense.
boxplot(herbivory ~ elevation, data = example_data)
We see two groups (high and low elevation) with different levels of herbivory.
Okay let’s do a Welch’s t-test. Here we will examine the differences between herbivory in the high elevation and low elevation group.
herb_t.test <- t.test(herbivory ~ elevation, data = example_data) #We save the results as "herb_t.test."
herb_t.test #Examine the results of the t-test.
##
## Welch Two Sample t-test
##
## data: herbivory by elevation
## t = 1.3249, df = 170, p-value = 0.187
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
## -1.730763 8.796057
## sample estimates:
## mean in group high mean in group low
## 13.93820 10.40556
Here we see that the mean herbivory in the high group is 13.9 and 10.4 in the low group. This means the “high” group has about 3.5 cm^2 more herbivory on average. This is the effect size or raw coefficient. It is up to the researcher(s) to know if this is biologically significant or not.
Another note: Here, the p value is 0.187, meaning that if there is truly no difference between groups that we expect to find a difference at least this extreme 18.7% of the time.
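If you prefer to pull those numbers directly from the saved object rather than reading them off the printout, a quick sketch (the estimate and p.value components are standard parts of the object t.test() returns):
herb_t.test$estimate #the two group means
herb_t.test$estimate[1] - herb_t.test$estimate[2] #high minus low: the ~3.5 raw effect size
herb_t.test$p.value #the p-value reported above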
What R is doing under the hood (illustrated here with a one-sample t-test on simulated data):
sample.data <- rgamma(10,2,.1) #simulate a small sample (no seed is set, so your numbers will differ)
null.mean <- 10 #null hypothesis: the true mean is 10
sample.size <- length(sample.data)
sample.mean <- mean(sample.data)
sample.sd <- sd(sample.data)
std.err <- sample.sd/sqrt(sample.size) #standard error of the mean
t.stat <- (sample.mean-null.mean)/std.err #t statistic: observed difference over its standard error
curve(dt(x,sample.size-1),-4,4) #t distribution with n-1 degrees of freedom
abline(v=t.stat,col="green",lwd=3) #observed t statistic
t.crit <- abs(qt(0.025,sample.size-1)) #critical t value for a 2-tailed test at alpha = 0.05
abline(v=t.crit, col="lightblue", lty=3, lwd=3)
p.val <- (1-pt(abs(t.stat),sample.size-1))*2 #two-tailed p-value: area in both tails beyond |t|
p.val
## [1] 0.03066376
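The same logic extends to the two-sample Welch test we ran above. A minimal sketch, assuming example_data is loaded, that reproduces the t statistic, degrees of freedom, and p-value reported by t.test():
high <- example_data$herbivory[example_data$elevation == "high"]
low <- example_data$herbivory[example_data$elevation == "low"]
se2_high <- var(high)/length(high) #squared standard error of the high-elevation mean
se2_low <- var(low)/length(low) #squared standard error of the low-elevation mean
se_diff <- sqrt(se2_high + se2_low) #Welch standard error of the difference in means
t_stat <- (mean(high) - mean(low))/se_diff #should be ~1.32, as reported above
df_welch <- (se2_high + se2_low)^2 / (se2_high^2/(length(high)-1) + se2_low^2/(length(low)-1)) #Welch-Satterthwaite degrees of freedom, ~170
p_val <- 2*pt(-abs(t_stat), df_welch) #two-tailed p-value, ~0.187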
We can see that the difference in means from the boxplot matches the t-test, but there is something else you should notice. Both the high and low samples are skewed, with many small values and a few high values. This brings us back to the assumptions of the classic 2-sample t-test (Student’s t-test) and how one may verify whether the data meet those assumptions.
Normally distributed residuals for each group;
Constant variance among groups (homogeneity of variance).
The residuals are the variation that is left over after accounting for the main variables in the model (the “unexplained variance” after fitting the model). Residuals are calculated as the difference between each observed value and its predicted or fitted value from the model. For example, in a two-sample unpaired t-test, the fitted value for a measurement is the mean of the sample from which it came, so the residual is the observed value minus the sample mean.
Note that here we refer to “standard” residuals. You may read about “Pearson” or “Deviance” residuals which are calculated differently from the definition above.
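As a concrete illustration of that definition, here is a minimal sketch (assuming example_data is loaded) that computes the residuals by hand for the herbivory-by-elevation comparison:
group_means <- tapply(example_data$herbivory, example_data$elevation, mean) #mean herbivory in each elevation group
fitted_vals <- group_means[example_data$elevation] #each observation's fitted value is its group mean
manual_resid <- example_data$herbivory - fitted_vals #residual = observed minus fitted
head(manual_resid)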
There are multiple methods available to test the assumptions of statistical tests (both formal tests and visual checks can be used), and these methods are not exclusive to a particular statistical technique: a method that tests for homogeneity of variance can do so for a t-test, an ANOVA, a regression, or other applications.
So you may ask:
1) Which statistical test is best for evaluating whether your data meet the assumptions of your t-test, ANOVA, or linear regression? There is NOT one that reigns supreme; instead, they all have different attributes, and these attributes influence, among other things, their sensitivity to departures from normality. So you will have to be comfortable with the attributes of your data and choose the appropriate test accordingly.
2) Which visual method is best for evaluating whether your data meet the assumptions of your t-test, ANOVA, or linear regression? Again, there is no “best”. However, residual analyses are crucially important for evaluating assumptions, and there are a few types of residual plots you can use:
residual vs. fitted (x-axis = fitted values; y-axis = residuals)
residual vs. predictor (x-axis = predictor values; y-axis = residuals)
residual vs. ordered predictors (x-axis = predictor values, ordered in the temporal or spatial order in which they were collected; y-axis = residuals)
Typically, residual vs. fitted plots are used. A residual vs. predictor plot does not yield any additional information beyond what you would have already learned from a residual vs. fitted plot. However, residual vs. predictor plots can be extremely helpful if you replace the predictor values with a new predictor variable: if the new predictor explains variance in the residuals of the original model, that is evidence it should be added to the model. More on that in later labs.
To test model assumption 1 (whether the residuals are normally distributed), you will use 2 pieces of information:
First, a visual evaluation of the normality of the residuals using a density plot. Because the t.test() function only reports summary statistics, we need to first run a linear model to get the residuals.
fit <- glm(herbivory ~ elevation, data = example_data) #glm() fits a generalized linear model, we'll get into why this is reasonable later
plot(density(resid(fit))) #check for normality of residuals
Examine the density plot. If the residuals are normal, it should follow a more or less bell-shaped curve. Here the distribution of residuals is clearly skewed, so the residuals are unlikely to be normally distributed. If your density plot of residuals doesn’t look normal, you need to either transform your response variable or use a non-parametric test. Before we deal with that, let’s confirm with some other assumption tests.
Next, a statistical test to evaluate normality of the residuals: the Shapiro-Wilk test. The null hypothesis of the Shapiro-Wilk test is that the distribution of the sample does not differ from a normal distribution. Thus, p < 0.05 supports the alternative hypothesis that the sample distribution does differ from a normal distribution, indicating our data violate the assumption of normal residuals. Some may say violating the normality assumption is less important than violating the assumption of constant variance. Let’s evaluate constant variance next.
shapiro.test(resid(fit))
##
## Shapiro-Wilk normality test
##
## data: resid(fit)
## W = 0.71003, p-value < 2.2e-16
Like before, let’s start with a visual evaluation of constant variance using a residual vs. fitted plot.
plot(fitted.values(fit), resid(fit))
If our predictor variable were continuous, as in a linear regression, a residual vs. fitted plot should appear as a cloud of points about zero (i.e. a shotgun blast). You will see this later in the code when we explore linear regression. In the case of a categorical predictor with two levels, there are only two unique fitted values (the mean of each of the two samples). Therefore, the graph of residuals against fitted values will consist of two vertical “stacks” of residuals. If the residuals have constant variance, we would expect the stacks to be about the same length and at about the same level; in other words, the spread of residuals along the y-axis is similar between the two groups. We can also evaluate normality here (i.e. the points in both stacks should gather around zero and have equal spread above and below it).
Now let’s try a statistical test for homogeneity of variance: an F-test. You will see this test statistic again in linear regression, where it is used to evaluate model performance. It is important to remember that the F statistic is a ratio of variances. var.test() performs an F-test; here, the F statistic is obtained by computing the ratio of the variances in the high and low elevation groups. It is known to be sensitive to deviations from normality. Note that the function defaults to a two-sided test.
var.test(herbivory ~ elevation, data = example_data)
##
## F test to compare two variances
##
## data: herbivory by elevation
## F = 1.4749, num df = 88, denom df = 89, p-value = 0.06908
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9698925 2.2438722
## sample estimates:
## ratio of variances
## 1.474863
A high p-value here means there is no evidence against the null hypothesis that the variances are equal; we fail to reject it and treat the variance as homogeneous. We can conclude so far that the residuals are not normally distributed, but the variances of the two samples from high and low elevation are roughly constant.
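As a quick sanity check (assuming example_data is loaded), the F statistic reported above is simply the ratio of the two sample variances:
var_high <- var(example_data$herbivory[example_data$elevation == "high"])
var_low <- var(example_data$herbivory[example_data$elevation == "low"])
var_high/var_low #~1.47, matching the "ratio of variances" from var.test()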
It’s also worth mentioning that many of these visual checks can be conducted using just the plot() function:
par(mfrow = c(2, 2))
plot(fit) #plot() returns a set of diagnostic plots for certain model objects
par(mfrow = c(1, 1))
Details on how to interpret these plots can be found here.
Transforming your data means modifying the response variable so that the assumptions are met. There are plenty of types of transformations and these can be categorized as linear or non-linear transformations:
Types of transformation:
This often means taking the square root or the log of the response variable and running the test again. The reason transformations work is that the rank order of the replicates stays the same (sample 1 is still lower than sample 2, for example) but the absolute distances between them are reduced. In Figure 3, we see that square-root and log transformations generally reduce the largest values and inflate the smaller ones.
Fig 3a Taken from here. Figs b & c Taken from here
We will square-root transform the response variable and see if this improves the normality and variance. Square-root transformations can make a non-linear relationship between X and Y more linear and can reduce unequal variances. It is important to remember that after doing this, our effect size is on a square-root scale, which can make interpretation of the results more difficult. You still present your data using the raw numbers, not the transformed numbers.
#Revisit untransformed data
plot(density(resid(fit)))
shapiro.test(residuals(fit))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit)
## W = 0.71003, p-value < 2.2e-16
#Compare to transformed data
example_data$herb_sqrt <- sqrt(example_data$herbivory)
#Re-run density plot
fit_sqrt <- glm(herb_sqrt ~ elevation, data = example_data)
plot(density(resid(fit_sqrt)))
#Re-run Shapiro-Wilks test
shapiro.test(residuals(fit_sqrt))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit_sqrt)
## W = 0.92432, p-value = 5.034e-08
That’s a little better. You can see how sensitive the statistical tests are to departures from normality: even with the transformation, the Shapiro-Wilk test indicates the residuals are not normal. This is part of why visual examination of the data is so important.
Let’s compare the transformed and untransformed distributions for each elevation. ggplot2 is a tidyverse package that makes certain plotting operations easier. Everything you can do in ggplot can be done in base R, but the more complex your figures are, the more useful ggplot becomes. For example, Figures 1, 2, & 5 in this markdown were made in ggplot.
library(ggplot2)
ggplot(example_data)+
stat_density(aes(herb_sqrt, fill = 'herb_sqrt'), alpha = .5)+
stat_density(aes(herbivory, fill = 'herbivory'), alpha = .5)+
facet_grid(cols = vars(elevation))
And finally, let’s compare the elevations:
#conduct t test on transformed data
herb_sqrt_t.test <- t.test(herb_sqrt ~ elevation, data = example_data) #We save the results of the t-test as "herbsqrt_t.test"
herb_sqrt_t.test #Examine the results of the t-test.
##
## Welch Two Sample t-test
##
## data: herb_sqrt by elevation
## t = 0.91792, df = 169.95, p-value = 0.36
## alternative hypothesis: true difference in means between group high and group low is not equal to 0
## 95 percent confidence interval:
## -0.3515798 0.9627352
## sample estimates:
## mean in group high mean in group low
## 2.845379 2.539801
Check for Understanding
6) Why do the means differ from the untransformed data? How does our interpretation change from the first t-test?
7) Perform a t-test to compare herbivory between Piper and the other plants. Hint: you will need to make a new variable that categorizes as “piper” and “not piper.” The ifelse() function might come in handy, but there are many ways of doing this
ANOVA stands for analysis of variance and is used for comparing the means of a categorical variable with three or more levels. Identify the column(s) in these data that are categorical with three levels.
Let’s examine the relationship between plant genus and herbivory. The assumptions of an ANOVA are the same as those of the t-test: normally distributed residuals for each group and homogeneity of variance among groups.
boxplot(herbivory ~ genus, data = example_data)
Here we will examine the differences between the three plant genera. If you need a refresher on how to interpret ANOVA outputs, you can use the document in the Google folder “Interpreting R Outputs”
herb_aov <- aov(herbivory ~ genus, data = example_data) #We save the results of the ANOVA as "herb_aov."
summary(herb_aov) #Examine the results of the ANOVA.
## Df Sum Sq Mean Sq F value Pr(>F)
## genus 2 3511 1755.6 5.804 0.00362 **
## Residuals 176 53236 302.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unlike the t.test() function, the aov() function is a bit less verbose and does not output effect sizes, which are the coefficients of the model. We can use the coef() function for this.
coef(herb_aov)
## (Intercept) genusPiper genusPsychotria
## 12.303279 5.239094 -5.667685
With the ANOVA, we are essentially comparing the means of all of the groups. To do this, one group becomes the intercept (usually the first level alphabetically; see ?factor for how to change this), and the other coefficients are compared to the intercept. So in this example the intercept is the mean herbivory on Miconia, and genusPiper and genusPsychotria are the differences in herbivory of each genus compared to Miconia.
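As a quick check of that interpretation, a minimal sketch (assuming herb_aov and example_data exist as above) showing that the group means can be rebuilt from the intercept plus each offset:
tapply(example_data$herbivory, example_data$genus, mean) #observed group means
cf <- coef(herb_aov)
c(Miconia = unname(cf["(Intercept)"]), Piper = unname(cf["(Intercept)"] + cf["genusPiper"]), Psychotria = unname(cf["(Intercept)"] + cf["genusPsychotria"])) #same values, rebuilt from the coefficients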
You may have noticed by now that functions like coef() and glm(), which seem appropriate for continuous predictors in regression analysis, are being used with categorical predictor variables alongside t.test() and aov(). It is important to understand why: the ANOVA and the t-test ARE essentially linear regressions. If you are unsure how to make sense of that, click here.
In all three techniques, the variance explained by the fitted model is compared to the total variance. To do this for categorical predictors, as in the t-test and ANOVA, R codes the categorical variables into dummy variables and enters them into a design matrix (review here).
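To see that dummy coding for yourself, a minimal sketch (assuming example_data is loaded); model.matrix() returns the design matrix R builds behind the scenes:
head(model.matrix(~ genus, data = example_data)) #Miconia is absorbed into the intercept; Piper and Psychotria get 0/1 indicator columns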
(Some) people care about p-values, so let’s look at those too. You can see that the ANOVA (herb_aov) is “significant”, but what does that mean? Remember that a p-value tells you the probability of obtaining group differences at least this extreme if the null hypothesis is true. The null hypothesis in this case is that all groups are equal. How can we examine which groups are different? We can use a Tukey post-hoc test.
TukeyHSD(herb_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = herbivory ~ genus, data = example_data)
##
## $genus
## diff lwr upr p adj
## Piper-Miconia 5.239094 -2.267465 12.745653 0.2277095
## Psychotria-Miconia -5.667685 -13.174245 1.838874 0.1776925
## Psychotria-Piper -10.906780 -18.475635 -3.337924 0.0023435
Let’s check for the assumption of normality. For this we will quickly evaluate visually using a density plot.
plot(density(resid(herb_aov)))
Now, we will evaluate homogeneity of variance using a statistical test. Let’s try a different test from the F-test we used previously. We will use the Bartlett test:
bartlett.test(herbivory ~ genus, data = example_data)
##
## Bartlett test of homogeneity of variances
##
## data: herbivory by genus
## Bartlett's K-squared = 13.576, df = 2, p-value = 0.001127
The null hypothesis of the Bartlett test is that there’s no difference in variances between groups. Here our results show that we reject this null hypothesis, indicating variances are not constant. Like before, you can transform the data to try and meet the assumptions of the test.
Check for Understanding:
8) Transform these data, re-run the ANOVA, and interpret the results. Keep in mind that there are many types of transformations. Try a few.
The ANOVAs you have run so far are called one-way ANOVAs. This is because you are looking at the effect of one predictor variable (with more than two levels) on a response variable. A two-way ANOVA examines the relationship between two categorical predictor variables and a response variable. Review here: Brandon Fultz playlist 13 on one-way and two-way ANOVA.
Here is an example:
herb_aov.2way <- aov(herbivory ~ genus*elevation, data = example_data) #We save the results of the ANOVA as "herb_aov.2way."
summary(herb_aov.2way) #Examine the results of the ANOVA.
## Df Sum Sq Mean Sq F value Pr(>F)
## genus 2 3511 1755.6 5.879 0.00339 **
## elevation 1 558 557.6 1.867 0.17356
## genus:elevation 2 1021 510.6 1.710 0.18389
## Residuals 173 51657 298.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(herb_aov.2way)
## (Intercept) genusPiper genusPsychotria elevationlow
## 13.564516 8.987208 -7.840378 -2.564516
## genusPiper:elevationlow genusPsychotria:elevationlow
## -7.287208 4.357045
Plot:
boxplot(herbivory ~ elevation * genus, data = example_data)
A linear regression is a technique used for examining the linear relationship between a continuous response variable and one or more continuous predictor variables.
Assumptions of linear regressions:
Let’s first make sure we know how to interpret a linear model output. Below is an example of a single-predictor model examining the effect of plant area on percent herbivory. If you are having trouble making sense of this output or the under-the-hood math, I recommend the following resources.
Some very approachable YouTube tutorials:
Mathier version with an online course format:
herb_lm <- lm(herbivory ~ area, data = example_data)
plot(herbivory ~ area, data = example_data)
abline(herb_lm, col = 'red') #abline adds the modeled line to the active plot
summary(herb_lm)
##
## Call:
## lm(formula = herbivory ~ area, data = example_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.259 -9.932 -7.107 4.058 77.120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.16521 2.11106 4.342 2.38e-05 ***
## area 0.01211 0.00664 1.824 0.0698 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.74 on 177 degrees of freedom
## Multiple R-squared: 0.01846, Adjusted R-squared: 0.01291
## F-statistic: 3.328 on 1 and 177 DF, p-value: 0.06979
Let’s see if a new variable, diversity, explains some variance in the residuals. If so, we should add it to the model.
plot(resid(herb_lm), example_data$diversity) #note: the residuals are on the x-axis here; swap the arguments to match the residual vs. predictor convention described above
Yep - seems to have a positive linear relationship. So let’s add that variable to the model.
herb_lm.div <- lm(herbivory ~ area + diversity, data = example_data)
summary(herb_lm.div)
##
## Call:
## lm(formula = herbivory ~ area + diversity, data = example_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.544 -8.543 -3.041 3.332 68.538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.57502 2.24489 -1.593 0.113
## area 0.00169 0.00562 0.301 0.764
## diversity 7.31228 0.80787 9.051 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.7 on 176 degrees of freedom
## Multiple R-squared: 0.3302, Adjusted R-squared: 0.3226
## F-statistic: 43.39 on 2 and 176 DF, p-value: 4.805e-16
par(mfrow = c(2, 2))
plot(herb_lm.div)
par(mfrow = c(1, 1))
Like before, the coefficient is the effect size. This is what we care about most: it tells us the amount of change predicted in our response variable (herbivory) for every 1-unit change in a predictor variable. In this case we expect a change of about 0.002 percent herbivory for every 1 cm^2 of leaf area. This seems small, but remember the scale of leaf area: this variable ranges from about 12 to over 1000. On the other hand, diversity only ranges from 0 to 6. Its effect size is bigger, but is that just because these variables are measured on vastly different scales? How can we compare them?
This is the use of standardization. By standardizing variables we put them on the same scale: the scale of standard deviations. This takes all of the data in a column and transforms it to have a mean of zero and a standard deviation of 1. How does it do this? The default of scale() is to take the z-score, which centers the data (subtracts the mean from each value) and then scales it by dividing by the standard deviation. You can modify the arguments of this function to your needs; for example, you may want to only center the data. Click here for a more detailed explanation of how to modify the scale() function, and click here for a more detailed explanation of the z-score.
Thus, units of each variable are now in terms of standard deviations. This is very important when it comes to comparing variables measured on different scales.
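As a quick check of what scale() computes with its default arguments, a minimal sketch (assuming example_data is loaded):
z_area <- (example_data$area - mean(example_data$area))/sd(example_data$area) #z-score by hand: center on the mean, then divide by the standard deviation
all.equal(z_area, as.numeric(scale(example_data$area))) #TRUE: scale() with defaults (center = TRUE, scale = TRUE) gives the same values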
We can standardize area and diversity like this:
example_data$scale_div <- scale(example_data$diversity, center = FALSE) #note: center = FALSE only rescales (divides by the root mean square) without centering, so these are not full z-scores
example_data$scale_area <- scale(example_data$area, center = FALSE)
Let’s take a quick look at what you just did:
hist(example_data$area, main = "Area before scaling")
hist(example_data$scale_area, main = "Area after scaling")
hist(example_data$diversity, main = "Diversity before scaling")
hist(example_data$scale_div, main = "Diversity after scaling")
You can see that distributions are the same, but the range of values has changed (because they are now standardized). You should also notice that the data are not normal, do you think that is important?
Below we will focus on assumptions not yet discussed: multicollinearity and linearity. We will first revisit the concept of homogeneity of variance (homoscedasticity) as it will appear on regression plots.
Multi-collinearity
review here - time 30:35min. Multicollinearity occurs when you run a multiple regression (which has multiple predictor variables) and the predictor variables are “too” correlated. This can cause the model to have trouble separating the effects of each predictor. Checking correlations among variables or using a variance inflation factor (VIF) is a good idea.
We will explore multicollinearity in more detail in a future lab.
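If you want to preview that check now, here is a minimal sketch using the vif() function from the car package (this assumes the car package is installed; it is not used elsewhere in this lab):
library(car) #provides vif(); install.packages("car") if needed
vif(lm(herbivory ~ area + diversity, data = example_data)) #values near 1 indicate little collinearity between area and diversity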
There are lots of useful correlation plot functions to choose from in R, and these can be helpful for visualizing correlations among multiple variables. Since we only have two continuous predictors in example_data, here I simply use cor(). Forget how correlation coefficients are calculated? Click here.
cor(example_data[c("area", "diversity")])
## area diversity
## area 1.0000000 0.2049222
## diversity 0.2049222 1.0000000
Linearity
review here - time 8:12min. Linearity means that the mean of the response variable at each predictor value should be a linear function of the predictor. Visual inspection of a plot of the response against the predictor can help you evaluate this assumption. Let’s plot the raw data for herbivory against the two predictors separately.
plot(herbivory ~ area, example_data)
plot(herbivory ~ diversity, example_data)
These data do not seem to exhibit a non-linear pattern (e.g. a polynomial curve) that would rule out the use of a linear regression, although the relationship between area and herbivory, while linear, is not a strong one. If you don’t trust your eye, try fitting a loess (local polynomial) regression using method = "loess" in ggplot. If the loess curve approximates the linear fit for the most part, then we can be more confident that the linearity assumption is not violated.
ggplot(example_data, aes(area, herbivory)) +
stat_smooth(method="loess") +
stat_smooth(method="lm", color="red", fill="red", alpha=.25) +
geom_point()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
ggplot(example_data, aes(diversity, herbivory)) +
stat_smooth(method="loess") +
stat_smooth(method="lm", color="red", fill="red", alpha=.25) +
geom_point()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Homogeneity of Variance Let’s revisit the concept of homogeneity of variance (aka “homoscedasticity”). The residual vs. fitted plots for linear models will look different than the residual vs. fitted plots for categorical predictor variables, like the ones we made above for the t-test. A “shotgun blast” pattern in the residual vs. fitted plot is characteristic of constant variance. Here we simulate data to illustrate how the two scenarios would appear with raw data.
You can see that when data are homoscedastic the variance does not change along the line. There are some residual points with larger variance for smaller fitted values, but generally the residuals have equal spread around y = 0.
When data are heteroscedastic the variance changes along the line. Here you will notice the characteristic pattern of data with non-constant variance: a “fan” (or horn) shape.
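The figures above come from simulated data; here is a minimal sketch of that idea (the exact parameters are illustrative, not the ones used to make the figures):
set.seed(1)
x <- runif(200, 0, 10)
y_homo <- 2 + 1.5*x + rnorm(200, sd = 2) #constant residual variance
y_hetero <- 2 + 1.5*x + rnorm(200, sd = 0.5*x) #residual variance grows with x
par(mfrow = c(1, 2))
plot(fitted(lm(y_homo ~ x)), resid(lm(y_homo ~ x)), main = "Homoscedastic")
plot(fitted(lm(y_hetero ~ x)), resid(lm(y_hetero ~ x)), main = "Heteroscedastic (fan shape)")
par(mfrow = c(1, 1))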
Check for Understanding
9) Match the indicated points in the raw data (Figure A) with their corresponding points in the residual plot (Figure B).
10) Your turn: a) Re-run the linear regression with scaled predictor variables. b) Which variable has a larger effect? c) Also, what is the p-value telling you here? Use my explanation from before and some research on your own to properly interpret it. P-values are tricky and often do not mean what the researcher thinks they do.
1. Why are some columns summarized with means and quantiles and others not? The columns that are not summarized with means and quantiles (elevation and genus) are character (non-numeric) variables; the numeric variables (area, herbivory, diversity and plotID) are the ones summarized with means and quantiles.
2. Which column(s) in these data are categorical with two levels?
elevation
3) Why are outputs from summary(example_data$herbivory) and boxplot(example_data$herbivory)$stats different?
You will notice that boxplot(example_data$herbivory)$stats reports 5 values:
lower whisker
Q1
Median (aka Q2)
Q3
upper whisker
You will notice summary(example_data$herbivory) reports 6 values:
Min
Q1
Median (aka Q2)
Mean
Q3
Max
The summary() output includes the mean and the true minimum and maximum observations (here the maximum, 87, is an outlier that lies beyond the upper whisker), while $stats excludes the mean and caps its extreme values at the whiskers (here 0 and 35).
Some explanation on the whiskers: According to the R function, the upper whisker value is determined by: Q3 +1.5 * (Q3 - Q1). (Q3 - Q1) is referred to as the interquartile range (IQR) and is represented as the gray area on the box and whisker plot. In this example, Q1 = 1; Q3 = 15. Therefore, 15 + 1.5 * (15 - 1) = 36. If you were to calculate the lower whisker, use Q1 - 1.5 * (Q3 - Q1). You will notice that calculated value (36) may not match the value reported in the R output (35). This is because the output reports the maximum observed data point that is not greater than 36.
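Here is a minimal sketch of that calculation in R (assuming example_data is loaded; quantile() with its default settings matches the quartiles reported by summary() here):
q <- quantile(example_data$herbivory, c(0.25, 0.75)) #Q1 = 1, Q3 = 15
upper_fence <- q[2] + 1.5*diff(q) #15 + 1.5 * 14 = 36
max(example_data$herbivory[example_data$herbivory <= upper_fence]) #the whisker is drawn at the largest observation not exceeding the fence: 35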
It is important to note that box and whisker plots illustrate the median (not the mean) while the t-test is calculated from the means. I think this is often a point of confusion. However, you can see that by illustrating the median, boxplots illustrate important information with regard to the degree of skew in the distribution.
If this was challenging for you, check out: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm
4) Verify that you are able to match the following terms and values with their location on boxplot(example_data$herbivory)
5) You should understand what ‘quartiles’ represent. With consideration to the quartiles for example_data$herbivory, how do you interpret Q1 = 1, Q2 = 5 and Q3 = 15?
6) Why are the means different? How does our interpretation change from the first t-test?
The magnitude of the greatest values (outliers) was decreased substantially. It shows how large outliers can lead to Type I error. The p-value has increased, and we should consider it likely that we would see an effect of this magnitude or greater even when there is no difference between the two groups.
7) Perform a t-test to compare herbivory between Piper and the other plants. Hint: you will need to make a new variable that categorizes as “piper” and “not piper.” The ifelse() function might come in handy. Answer in R script
#Answer #7
#Here's a few ways to create a column that categorizes plants as "piper" and "not piper"
example_data$is_piper <- "not Piper"
example_data$is_piper[example_data$genus == 'Piper'] <- 'Piper'
#or
example_data$is_piper <- ifelse(example_data$genus != "Piper", "not Piper", "Piper" )
#or
example_data$is_piper <- c('not Piper', 'Piper')[(example_data$genus == "Piper")+1]
#Examine data
boxplot(herbivory ~ is_piper, data = example_data)
piper_t.test <- t.test(herbivory ~ is_piper, data = example_data)
piper_t.test #Raw effect size: Piper plants have 8.03 cm^2 more herbivory on average than non piper plants
##
## Welch Two Sample t-test
##
## data: herbivory by is_piper
## t = -2.5671, df = 87.116, p-value = 0.01196
## alternative hypothesis: true difference in means between group not Piper and group Piper is not equal to 0
## 95 percent confidence interval:
## -14.239517 -1.811895
## sample estimates:
## mean in group not Piper mean in group Piper
## 9.516667 17.542373
#Evaluate Model Assumptions
#Examine Normality of Residuals
fit <- glm(herbivory ~ is_piper, data = example_data)
plot(density(resid(fit)))
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
shapiro.test(resid(fit))
##
## Shapiro-Wilk normality test
##
## data: resid(fit)
## W = 0.74596, p-value = 2.84e-16
#Conclusion: Residuals are not normally distributed
#Examine Constant Variance
var.test(herbivory ~ elevation, data = example_data)
##
## F test to compare two variances
##
## data: herbivory by elevation
## F = 1.4749, num df = 88, denom df = 89, p-value = 0.06908
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9698925 2.2438722
## sample estimates:
## ratio of variances
## 1.474863
#Variance is constant
#Transform Data
example_data$herb_sqrt <- sqrt(example_data$herbivory)
fit <- glm(herb_sqrt ~ is_piper, data = example_data)
plot(density(resid(fit)))
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
shapiro.test(residuals(fit))
##
## Shapiro-Wilk normality test
##
## data: residuals(fit)
## W = 0.938, p-value = 5.601e-07
var.test(herb_sqrt ~ is_piper, data = example_data)
##
## F test to compare two variances
##
## data: herb_sqrt by is_piper
## F = 0.63448, num df = 119, denom df = 58, p-value = 0.03805
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.398598 0.975348
## sample estimates:
## ratio of variances
## 0.6344755
#Re-run t test
piper_t.test.sqrt <- t.test(herb_sqrt ~ is_piper, data = example_data)
piper_t.test.sqrt
##
## Welch Two Sample t-test
##
## data: herb_sqrt by is_piper
## t = -2.7078, df = 95.31, p-value = 0.008028
## alternative hypothesis: true difference in means between group not Piper and group Piper is not equal to 0
## 95 percent confidence interval:
## -1.7548935 -0.2702282
## sample estimates:
## mean in group not Piper mean in group Piper
## 2.357988 3.370548
8) Transform these data, re-run the ANOVA, and interpret the results. Keep in mind that there are many types of transformations. Try a few.
#Answer #8
# sqrt transformation
aov.sqrt <- aov(herb_sqrt ~ genus, data = example_data) #We save the results of the ANOVA as "aov.sqrt"
summary(aov.sqrt)
## Df Sum Sq Mean Sq F value Pr(>F)
## genus 2 75.3 37.63 8.228 0.000384 ***
## Residuals 176 804.8 4.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(aov.sqrt)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = herb_sqrt ~ genus, data = example_data)
##
## $genus
## diff lwr upr p adj
## Piper-Miconia 0.4837031 -0.4392614 1.4066676 0.4320028
## Psychotria-Miconia -1.0756429 -1.9986074 -0.1526784 0.0177505
## Psychotria-Piper -1.5593460 -2.4899701 -0.6287219 0.0003181
hist(resid(aov.sqrt))
bartlett.test(herb_sqrt ~ genus, data = example_data)
##
## Bartlett test of homogeneity of variances
##
## data: herb_sqrt by genus
## Bartlett's K-squared = 5.9434, df = 2, p-value = 0.05122
#logarthmic transformation
example_data$herb_log <- log(example_data$herbivory+1) #we add one here to avoid taking the log of zero
aov.log <- aov(herb_log ~ genus, data = example_data) #We save the results of the ANOVA as "aov.log"
summary(aov.log)
## Df Sum Sq Mean Sq F value Pr(>F)
## genus 2 25.71 12.855 8.774 0.000233 ***
## Residuals 176 257.86 1.465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(aov.log)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = herb_log ~ genus, data = example_data)
##
## $genus
## diff lwr upr p adj
## Piper-Miconia 0.1975807 -0.3248469 0.7200084 0.6448387
## Psychotria-Miconia -0.6905152 -1.2129429 -0.1680876 0.0058840
## Psychotria-Piper -0.8880959 -1.4148591 -0.3613327 0.0002896
hist(resid(aov.log))
bartlett.test(herb_log ~ genus, data = example_data)
##
## Bartlett test of homogeneity of variances
##
## data: herb_log by genus
## Bartlett's K-squared = 1.792, df = 2, p-value = 0.4082
9) Match the indicated points in the raw data (Figure A) with their corresponding points in the residual plot (Figure B).
10) a) Re-run the linear regression with scaled predictor variables. b) Which variable has the largest effect? c) Also, what is the p-value telling you here? Use my explanation from before and some research on your own to properly interpret it. P-values are tricky and often do not mean what the researcher thinks they do.
Answer to 10A
lm.scaled <- lm(herb_sqrt ~ scale_area + scale_div, data = example_data)
summary(lm.scaled)
##
## Call:
## lm(formula = herb_sqrt ~ scale_area + scale_div, data = example_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1268 -0.8597 -0.2618 0.4470 5.5529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2738 0.2294 1.194 0.234
## scale_area -0.1006 0.1831 -0.549 0.584
## scale_div 3.0032 0.2081 14.434 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.501 on 176 degrees of freedom
## Multiple R-squared: 0.5492, Adjusted R-squared: 0.5441
## F-statistic: 107.2 on 2 and 176 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(lm.scaled)
par(mfrow = c(1, 1))
10B. Diversity has the larger effect size. These standardized coefficients can be interpreted as: holding everything else constant, on average, one standard deviation of change in the predictor is associated with a change in the response equal to the magnitude of the coefficient.
10C.From above: “Another note: Here, the p value is 0.187, meaning that if there is truly no difference between groups that we expect to find a difference at least this extreme 18.7% of the time.”
If there is truly no effect of area on herbivory, as stated by the null hypothesis, we still expect to find an effect of area on herbivory at least that extreme (-0.1006), 58.4% of the time.
If there is truly no effect of diversity on herbivory, as stated by the null hypothesis, we still expect to find an effect of diversity on herbivory at least that extreme (3.0032), <0.00001% of the time.