Welcome to R! R is a programming language used globally for data analysis across many fields of study. It can seem daunting at first, but the upfront work is totally worth it. R makes it easy to share work with collaborators and to perform tasks on large datasets that would take forever in Excel or Google Sheets. This intro is in no way comprehensive; it is meant to give some basic code that will be handy for this course. Feel free to reference it throughout the class.
You will notice that in R there are often many ways to accomplish the same goal. In that spirit, some of the code below uses base R and some uses the tidyverse; they will usually accomplish the same thing. That said, it can be rewarding to minimize the packages you rely on, so that you really understand what is going on.
For those interested, I learned most of my R knowledge from the book “R for Data Science” by Hadley Wickham. The book can be found here: https://r4ds.had.co.nz/index.html. Full disclosure: it is a tidyverse book.
OK! Let’s get started!
Libraries are collections of functions that you can call. When you fire up R, some are already loaded, and it is possible to do a ton in R without ever deviating from these base libraries. That said, we often want to run functions that are not in base R, and for that we need to install packages. To do this we use the function install.packages. Here is an example of using install.packages to install the package tidyverse, an especially useful package in data science. The tidyverse is a set of packages used for data processing, plotting, and more. Most of what it does can also be accomplished with base functions, but I like the tidyverse.
install.packages("tidyverse") #sorry, tidyverse is kinda a lot
You only have to install a package once (although you may need to update it occasionally). Once a package has been installed you can load it using the function library. require is a similar function; it returns FALSE rather than throwing an error if the package is missing. (attach, despite the name, is mainly used to attach data frames, not packages.)
library("tidyverse", verbose = FALSE)
#require("tidyverse")
This is a critical step, and EVERY script you write should include it. The working directory is where R will look for files you want to use and where it will save files. After my libraries, the next line in all of my scripts is setwd. Here I am setting my working directory to the folder “test_folder” within my “Documents” folder. Set your working directory to wherever you saved the “intro_r_data.csv” file. You can also find out which directory you are working in using getwd.
setwd("...your_working_directory/")
getwd()
R stores different types of data in variables. We can assign data to a new variable using “<-”, “->”, and “=”.
x <- 1 #arrow assignment. also works as 1 -> x
y = 2 #equals assignment
z <- (a <- 1) + 2 #compound assignment. This sets "a" equal to 1, and "z" equal to "a + 2"
A vector stores data of the same type. Since x, y, z, and a are all numbers, we can create a numeric vector using the “c” function.
c(x,y,z,a)
## [1] 1 2 3 1
We can compare data using the operators ==, >, >=, <, <=, and !=. R uses ! to mean “not”, so != means “not equals”, and !TRUE returns FALSE.
a == x
## [1] TRUE
a >= 1
## [1] TRUE
a <= 0
## [1] FALSE
a != x
## [1] FALSE
We can use if/else statements to run code when certain conditions are met.
if (a == x) {print("a equals x")}
## [1] "a equals x"
if (a < 0) {print("a is negative")} else {print("a is not negative")}
## [1] "a is not negative"
Loops allow us to iterate tasks. They are a little tricky to get the hang of but are SUPER useful, and the best part is that you can set one running, go off to grab a beverage, and still feel like you are working. The basic loops in R are for, while, and repeat, but for is by far the most commonly used.
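For completeness, here is a quick sketch of the other two loop types (the counter variable is just made up for illustration):

```r
# while repeats as long as its condition is TRUE
counter <- 1
while (counter <= 3) {
  print(counter)
  counter <- counter + 1
}

# repeat runs forever until you break out explicitly
counter <- 1
repeat {
  counter <- counter + 1
  if (counter > 3) break
}
```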
A for loop starts off like this; it loops through the vector you provide.
for (i in 1:30){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
## [1] 21
## [1] 22
## [1] 23
## [1] 24
## [1] 25
## [1] 26
## [1] 27
## [1] 28
## [1] 29
## [1] 30
In this, the for loop will run through the vector 1:30, which is just 1,2,3,… 30. In each iteration i changes value. In the first run i=1, on the second run i=2, and so on. It is important to note that i is a standard notation, but this can be anything. You could say “for (FrankHerbert in 1:30)” and it would still work.
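This is easy to check for yourself:

```r
# the loop variable name is arbitrary; this works exactly like "for (i in 1:3)"
for (FrankHerbert in 1:3) {
  print(FrankHerbert)
}
```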
The vector you provide doesn’t have to be numbers.
vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in vec) {
  print(i)
}
## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"
In this case i is not a number but rather each element of the character vector in turn.
Each item of a vector can also be accessed using an index number. It can be useful to loop through a character vector like this.
vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in 1:length(vec)) {
  print(vec[i])
}
## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"
While for loops have their uses, if you need your loop to return data it’s almost always better to use one of the suite of functions centered around “apply”. These functions are typically faster than for loops, but can be a little harder to write and read.
Compare this for loop, which generates a list of vectors, each containing a number and its square:
data <- list()
for (i in 1:10){
  data[[i]] <- c(i, i^2)
}
data
## [[1]]
## [1] 1 1
##
## [[2]]
## [1] 2 4
##
## [[3]]
## [1] 3 9
##
## [[4]]
## [1] 4 16
##
## [[5]]
## [1] 5 25
##
## [[6]]
## [1] 6 36
##
## [[7]]
## [1] 7 49
##
## [[8]]
## [1] 8 64
##
## [[9]]
## [1] 9 81
##
## [[10]]
## [1] 10 100
to the function “lapply”:
data <- lapply(1:10, \(i) c(i, i^2))
data
## [[1]]
## [1] 1 1
##
## [[2]]
## [1] 2 4
##
## [[3]]
## [1] 3 9
##
## [[4]]
## [1] 4 16
##
## [[5]]
## [1] 5 25
##
## [[6]]
## [1] 6 36
##
## [[7]]
## [1] 7 49
##
## [[8]]
## [1] 8 64
##
## [[9]]
## [1] 9 81
##
## [[10]]
## [1] 10 100
This may seem a little convoluted for this example, but when working with large data and complex functions I find it very handy.
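A related function, sapply, simplifies its result into a vector or matrix where possible, which is often what you want:

```r
# each call returns a single number, so sapply returns a numeric vector
squares <- sapply(1:10, function(i) i^2)
squares
## [1]   1   4   9  16  25  36  49  64  81 100
```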
Data to be read into R should be a csv. If you are saving data from Excel with the intention of using it in R, save it as a csv. We can import data using the read.csv function. Like before, you can use tab completion to find files more easily; it matches file names in your working directory.
dat <- read.csv("intro_r_data.csv")
We need a name for the data we imported, so I called it “dat”. You can save objects using <- or =.
R saves objects in different formats. The object “dat” is a data frame, which is essentially an array of vectors. We can access those vectors using the “$” operator.
Col1 is an integer vector.
dat$Col1
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Col2 is a character vector.
dat$Col2
## [1] "Species A" "Species A" "Species A" "Species A" "Species A" "Species A" "Species A" "Species A" "Species A"
## [10] "Species A" "Species B" "Species B" "Species B" "Species B" "Species B" "Species B" "Species B" "Species B"
## [19] "Species B" "Species B" "Species C" "Species C" "Species C" "Species C" "Species C" "Species C" "Species C"
## [28] "Species C" "Species C" "Species C"
You can access elements of a vector using brackets. This will access the second element of the vector.
dat$Col2[2]
## [1] "Species A"
These brackets also apply to the entire data frame. Since it is two-dimensional we need two indices: dat[1,2] accesses the value in the first row and the second column.
dat[1,2]
## [1] "Species A"
If you want to access an entire row or column, just leave the other index blank (but keep the comma).
dat[1,] #first row
## Col1 Col2 Col3
## 1 1 Species A 49.90378
dat[,1] #first column
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
Now that we have imported the data, let’s take a look at it! There are many useful ways to look at data, too many to describe here, so I have chosen a few I use all the time. These are: colnames, length, nrow, ncol, unique, head, tail, as.character, as.factor, and as.numeric.
Before going any further, it’s worth pointing out that functions require arguments. These differ for every function depending on what it does. If you ever need help with a function, put a question mark in front of its name. Tab completion is also useful.
?colnames
colnames returns the names of all of the columns in the data.
colnames(dat)
## [1] "Col1" "Col2" "Col3"
This function can also be used to rename columns. We do this by assigning a vector of names. To make a vector of values we need to enclose them in c().
colnames(dat) <- c("plotID", "species", "height")
colnames(dat)
## [1] "plotID" "species" "height"
length returns the number of elements in a vector. (A vector is a basic data structure that contains values of the same type.) The code below creates a new vector named “spec” that contains the values in the species column; length returns the number of values in that vector.
spec <- dat$species
length(spec)
## [1] 30
nrow and ncol return the number of rows and the number of columns, respectively.
nrow(dat)
## [1] 30
ncol(dat)
## [1] 3
unique returns all of the unique values in a vector (or column). I often pair it with length() to determine the number of levels in a categorical variable. Let’s examine the species column.
unique(dat$species)
## [1] "Species A" "Species B" "Species C"
length(unique(dat$species))
## [1] 3
head and tail return the first 6 rows and the last 6 rows of a data frame, respectively. Very handy for taking a quick look at data. You can modify the n argument to change the number of rows shown.
head(dat)
## plotID species height
## 1 1 Species A 49.90378
## 2 2 Species A 47.31901
## 3 3 Species A 50.87232
## 4 4 Species A 49.41498
## 5 5 Species A 51.29793
## 6 6 Species A 50.82202
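tail works the same way from the other end. A quick sketch with a small made-up data frame, using the n argument:

```r
# a small made-up data frame, just to demonstrate head() and tail()
toy <- data.frame(plotID = 1:5, height = c(10, 12, 11, 14, 13))
head(toy, n = 2)  # first 2 rows
tail(toy, n = 3)  # last 3 rows
```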
These functions convert columns to a different data type. Sometimes data are read in with the wrong type, and sometimes stats functions need data in a certain format. We can use these to move between types.
dat$species <- as.character(dat$species) #makes species a character
dat$species <- as.factor(dat$species) #makes it a factor again
Data wrangling is the term for reformatting your data. There are many reasons you might want to do this: maybe a particular function needs your data in a certain shape, maybe you want to remove NAs. There are too many to count. In EECB-related fields with lots of data, this is often the part of analysis that takes the most time. Here are some useful functions for data wrangling:
with, within, and transform
filter and/or subset
select
gather and spread
na.omit
R stores information in different environments. Right now we’re in an environment where R knows about the variable “dat”, but not about the columns it contains; that’s why we need the “$” operator to access the environment within “dat”.
Functions like with, within, and transform let us jump into the data frame’s environment so we don’t have to use dat$ to access its variables:
test_data <- transform(dat, species = as.character(species)) #this sets species to a character vector
#within allows any amount of code to be written between the curly brackets. This is very useful
test_data <- within(dat, {species <- as.factor(species)})
with(test_data, aggregate(height, list(species), mean)) #returns the mean height for each species
## Group.1 x
## 1 Species A 49.83775
## 2 Species B 69.97579
## 3 Species C 90.09803
filter and subset are two ways of accomplishing the same thing. filter is a tidyverse function, while subset is a base R function; filter will only work if you have loaded tidyverse or dplyr. They allow you to look at portions of your data. For example, let’s only look at Species A.
SpeciesA.1 <- filter(dat, species == "Species A")
SpeciesA.2 <- subset(dat, species == "Species A")
select is used to keep or remove columns. Let’s remove “plotID” from dat. You can do this either by naming the columns you want to keep or by removing the ones you don’t.
dat2 <- select(dat, species, height)
dat2 <- select(dat, -plotID)
The base R equivalents can be a little clunkier.
test_data.select <- dat[c('species', 'height')]
test_data.select <- dat[-which(names(dat) == 'plotID')]
You can also remove columns using the brackets we used earlier. Since Plot ID is the first column we can also do this.
dat2 <- dat[,-1] #would remove the first column.
dat_data_removed <- dat[-1,] #would remove the first row.
gather and spread are very handy for reformatting data frames. Essentially these functions transition data between “long” and “wide” forms. Long data frames tend to have more rows and fewer columns, while wide data frames have more columns and fewer rows. Long data frames are nice for data wrangling and for running some models, while wide data frames are needed for analyses like Hill numbers and ordination. The functions melt and cast in the reshape package are similar. Let’s take a look at what these look like.
dat
These data are long (although since there is only a small amount of data it isn’t a great example). Imagine you had 300 or 3,000 observations (rows); then the term “long” would make more sense.
We can use spread to make the data wider. We want a column for each species (the “key”) and those columns to be filled with the height measurement (the “value” column).
dat_spread <- spread(dat, key = species, value = height)
head(dat_spread)
## plotID Species A Species B Species C
## 1 1 49.90378 67.43697 90.37914
## 2 2 47.31901 70.33240 91.13306
## 3 3 50.87232 70.89353 92.75686
## 4 4 49.41498 70.39009 89.08342
## 5 5 51.29793 71.12169 89.56514
## 6 6 50.82202 71.29156 91.64644
Now “Species A”, “Species B”, and “Species C” each have their own column.
We can reverse this using gather. We need to specify that we only want to gather columns 2:4. In gather, key is the name for the new column created from the former column names, and value is the name for the column of values.
dat_gathered <- gather(dat_spread, 2:4, key = "Species", value = "height")
head(dat_gathered)
## plotID Species height
## 1 1 Species A 49.90378
## 2 2 Species A 47.31901
## 3 3 Species A 50.87232
## 4 4 Species A 49.41498
## 5 5 Species A 51.29793
## 6 6 Species A 50.82202
A base R equivalent is reshape, although there are some differences in output.
test_data.spread <- reshape(test_data, idvar = "plotID", timevar = "species", direction = "wide")
head(test_data.spread)
## plotID height.Species A height.Species B height.Species C
## 1 1 49.90378 67.43697 90.37914
## 2 2 47.31901 70.33240 91.13306
## 3 3 50.87232 70.89353 92.75686
## 4 4 49.41498 70.39009 89.08342
## 5 5 51.29793 71.12169 89.56514
## 6 6 50.82202 71.29156 91.64644
test_data.gathered <- reshape(test_data.spread, direction = "long")
head(test_data.gathered)
## plotID species height.Species A
## 1.Species A 1 Species A 49.90378
## 2.Species A 2 Species A 47.31901
## 3.Species A 3 Species A 50.87232
## 4.Species A 4 Species A 49.41498
## 5.Species A 5 Species A 51.29793
## 6.Species A 6 Species A 50.82202
More recent versions of R allow us to string together a chain of commands using pipes. A pipe takes input data, runs a function, then passes the output to another function, and so on. The pipe in base R is |> and the tidyverse pipe is %>%. I tend to use the tidyverse pipe.
In this example we will take “dat”, first filter it, and then remove a column. We specify the data at the beginning, so we do not need to include it in the subsequent functions. We first keep only rows where height is greater than or equal to 50, and then remove the plotID column.
piped_dat <- dat %>%
  filter(height >= 50) %>%
  select(-plotID)
head(piped_dat)
## species height
## 1 Species A 50.87232
## 2 Species A 51.29793
## 3 Species A 50.82202
## 4 Species A 50.86813
## 5 Species A 51.78562
## 6 Species B 67.43697
group_by and summarise are used in conjunction to summarize data based on a grouping variable, and they are best used with pipes. In the example below we calculate the mean and standard deviation of height for each species, making the new columns “av” and “std_dev”.
summ <- dat %>%
  group_by(species) %>%
  summarise(av = mean(height), std_dev = sd(height))
head(summ)
## # A tibble: 3 × 3
## species av std_dev
## <fct> <dbl> <dbl>
## 1 Species A 49.8 1.57
## 2 Species B 70.0 1.36
## 3 Species C 90.1 1.54
Here you can see that a new data frame is created. It has a species column, a column with the mean height of each species and a column for the standard deviation of height in each species.
na.omit is used to remove NAs that appear in your data. It is important to know that it works by row: if an NA appears in any column of a row, that whole row is removed. Sometimes this is totally fine; other times you may only care about NAs in a particular column.
dat_noNA <- na.omit(dat)
You can see that there is no difference, because there were no NAs in the original data.
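If you only care about NAs in one column, you can filter on is.na for that column instead; a sketch with a made-up data frame that actually contains NAs:

```r
# made-up example: one NA in plotID, one in height
toy <- data.frame(plotID = c(1, 2, NA, 4),
                  height = c(10, NA, 12, 13))
nrow(na.omit(toy))               # 2: drops any row with an NA anywhere
nrow(toy[!is.na(toy$height), ])  # 3: drops only rows where height is NA
```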
R is also super handy for simulating data, which is an important part of statistics. To make sure your code and analyses make sense, it is good practice to simulate data first and check: if you know the “answer”, because you generated the data yourself, your statistical test should recover it. If it doesn’t, something is wrong.
When simulating data in R, you draw values from a distribution. There are many types of distributions, which can be grouped into continuous and discrete. The normal and gamma distributions are continuous because they generate continuous values, while the Poisson and binomial distributions are discrete because they generate whole numbers. The Wikipedia pages for these are pretty good!
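Each distribution has its own r* function for drawing random values; the parameter values here are arbitrary, just to show the syntax:

```r
set.seed(1)  # makes the random draws reproducible
rnorm(3, mean = 0, sd = 1)        # continuous: normal
rgamma(3, shape = 2, rate = 1)    # continuous: gamma
rpois(3, lambda = 4)              # discrete: Poisson (counts)
rbinom(3, size = 10, prob = 0.5)  # discrete: binomial (successes out of 10)
```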
Let’s focus on the normal distribution because it is the easiest.
We can simulate data using the rnorm function, which draws points from a normal distribution. The first argument is the number of points, the second is the mean of the distribution, and the third is the standard deviation.
Here is a distribution with 1,000 points, a mean of 10, and a standard deviation of 2.
sam <- rnorm(1000, 10, 2)
plot(density(sam))
Let’s walk through the process of simulating data to check a statistical test. The t-test compares the distributions of two groups. Let’s simulate two distributions:
sam1 <- rnorm(50, 20, 1)
sam2 <- rnorm(50, 15, 1)
plot(density(sam1), xlim = c(10,25) ,ylim = c(0,1))
lines(density(sam2), col ="red")
You’ll notice that these distributions are bumpier; this is because we simulated fewer points. The more points we simulate, the more individual outliers even out. Your distributions will also probably look different from these, because the random points you got are different from mine! We can use the function set.seed to fix this. I’ll use it later.
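A quick sketch of what set.seed does: resetting the seed makes the random number generator produce the same draws again.

```r
set.seed(42)
a <- rnorm(3)
set.seed(42)  # reset to the same seed...
b <- rnorm(3) # ...and we get identical draws
identical(a, b)
## [1] TRUE
```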
Now let’s look at a t-test.
t.test(sam1, sam2)
##
## Welch Two Sample t-test
##
## data: sam1 and sam2
## t = 25.618, df = 96.219, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.819461 5.629030
## sample estimates:
## mean of x mean of y
## 20.13174 14.90749
This looks right! The means match what we simulated, and it is very unlikely we obtained these differences by chance (we did simulate them to be different, after all). Let’s look at what happens if we simulate them to be the same.
set.seed(666)
sam1 <- rnorm(50, 15, 1)
sam2 <- rnorm(50, 15, 1)
plot(density(sam1), xlim = c(10,25) ,ylim = c(0,1))
lines(density(sam2), col ="red")
t.test(sam1, sam2)
##
## Welch Two Sample t-test
##
## data: sam1 and sam2
## t = 0.018033, df = 94.266, p-value = 0.9857
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4071033 0.4145660
## sample estimates:
## mean of x mean of y
## 14.93520 14.93147
This also looks right! The distributions were simulated to be the same, and the t-test supports this. Remember that they are not identical because they are composed of random numbers; the more values we generate, the closer they will appear. Also notice that your distributions should look the same as mine this time, because we set the seed first, so our different computers generated the same random numbers.
We can also use rnorm to simulate relationships between variables. This is super handy for confirming that analyses are doing what you think they are. Say we want to simulate the effect of height (from the test dataset) on water use, a variable we will make up.
We can do something like this. Here we are specifying the relationship between water use and height: each one-unit increase in plant height increases water use by 3 units. We use rnorm to simulate the data with and without noise. Real data are messy, so we should simulate our data with noise (I use n = 30 for rnorm because we have 30 data points).
water_growth_nonoise <- 3 * dat$height
water_growth_noise <- 3 * dat$height + rnorm(30, 0, 10)
Let’s look at the differences
plot(dat$height, water_growth_nonoise, main = "No Noise")
plot(dat$height, water_growth_noise, main = "Noise")
Let’s quickly check whether we recover the relationship we simulated.
coef(glm(water_growth_noise ~ dat$height))
## (Intercept) dat$height
## -10.792341 3.129803
We do! The estimated slope is close to the 3 we simulated (it is not exact because of the noise we added).