Welcome to R! R is one of several programming languages used globally for data analysis across many fields of study. It can seem daunting at first, but the upfront work is totally worth it: R makes it easy to share work with collaborators and to perform tasks on large datasets that would take forever in Excel or Google Sheets. This intro is in no way comprehensive; it is meant to give some basic code that will be handy for this course. Feel free to reference it throughout the class.

You will notice that in R there are often many ways to accomplish the same goal. In that spirit, some of the code below uses base R and some uses the tidyverse; they will usually accomplish the same thing. That said, it can be rewarding to minimize packages, since that way you really understand what is going on.

For those interested, I learned most of my R knowledge from the book “R for Data Science” by Hadley Wickham. The book can be found here: https://r4ds.had.co.nz/index.html. Full disclosure: it is a tidyverse book.

OK! Let’s get started!

Installing packages and calling libraries

Libraries are collections of functions that you can call. When you fire up R, some are already loaded, and it is possible to do a ton in R without ever deviating from these base libraries. That said, we often want to run functions that are not included in base R, and to get them we need to install packages with the function install.packages. Here is an example of using install.packages to install tidyverse, an especially useful package for data science. The tidyverse is a set of packages used for data processing, plotting, and more. Most of these tasks can also be performed with base functions, but I like the tidyverse.

install.packages("tidyverse") #sorry, tidyverse is kinda a lot

You only have to install a package once (although you may need to update it occasionally). Once a package has been installed you can load it with the function library. require is a similar function (it returns FALSE instead of throwing an error if the package is missing).

library("tidyverse", verbose = FALSE)
#require("tidyverse")

Setting your working directory

This is a critical step and EVERY script you write should include it. The working directory is where R will look for files you want to use and where it will save files. After my library calls, the next line in all of my scripts is setwd. In the example below, replace the path with the folder where you saved the "intro_r_data.csv" file.

You can also find out which directory you are working in using getwd.

setwd("...your_working_directory/")
getwd()

Some R basics

R stores different types of data in variables. We can assign data to a new variable using “<-”, “->”, and “=”.

x <- 1            #arrow assignment. also works as 1 -> x
y = 2             #equals assignment
z <- (a <- 1) + 2 #compound assignment. This sets "a" equal to 1, and "z" equal to "a + 2" 

A vector stores data of the same type. As x, y, z, and a are all numbers, we can create a numeric vector using the "c" (combine) function.

c(x,y,z,a)
## [1] 1 2 3 1

We can compare data using the operators ==, >, >=, <, <=, and !=. R uses ! to mean “not”, so != means “not equals”, and !TRUE returns FALSE.

a == x
## [1] TRUE
a >= 1
## [1] TRUE
a <= 0
## [1] FALSE
a != x
## [1] FALSE

We can use if/else statements to run code when certain conditions are met.

if (a == x) {print("a equals x")}
## [1] "a equals x"
if (a <  0) {print("a is negative")} else {print("a is not negative")}
## [1] "a is not negative"

For loop

These are a little tricky to get the hang of but are SUPER useful. Loops allow us to iterate tasks. The best part about this is you can set them to run and then go off to grab a beverage and still feel like you are working. The basic loops in R are for, while, and repeat, but for is by far the most commonly used.

A for loop starts off like this; it loops through the vector you provide.

for (i in 1:30){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
## [1] 21
## [1] 22
## [1] 23
## [1] 24
## [1] 25
## [1] 26
## [1] 27
## [1] 28
## [1] 29
## [1] 30

In this, the for loop runs through the vector 1:30, which is just 1, 2, 3, … 30. On each iteration i changes value: on the first run i = 1, on the second run i = 2, and so on. It is important to note that i is just the standard notation; the loop variable can be named anything. You could write "for (FrankHerbert in 1:30)" and it would still work.

The vector you provide doesn’t have to be numbers.

vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in vec) {
  print(i)
}
## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"

In this case i does not equal a number, but rather the elements of the character vector.

Each item of a vector can also be accessed using an index number. It can be useful to loop through a character vector like this.

vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in 1:length(vec)) {
  print(vec[i])
}
## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"

While for loops have their uses, if you need your loop to return data it’s almost always better to use one of the suite of functions centered around “apply”. These functions are typically faster than for loops, but can be a little harder to write and read.

Compare this for loop, which builds a list of vectors, each containing a number and its square:

data <- list()
for (i in 1:10){
  data[[i]] <- c(i, i^2)
}
data
## [[1]]
## [1] 1 1
## 
## [[2]]
## [1] 2 4
## 
## [[3]]
## [1] 3 9
## 
## [[4]]
## [1]  4 16
## 
## [[5]]
## [1]  5 25
## 
## [[6]]
## [1]  6 36
## 
## [[7]]
## [1]  7 49
## 
## [[8]]
## [1]  8 64
## 
## [[9]]
## [1]  9 81
## 
## [[10]]
## [1]  10 100

to the function "lapply" (the \(i) here is shorthand for function(i), available in R 4.1 and later):

data <- lapply(1:10, \(i) c(i, i^2))
data
## [[1]]
## [1] 1 1
## 
## [[2]]
## [1] 2 4
## 
## [[3]]
## [1] 3 9
## 
## [[4]]
## [1]  4 16
## 
## [[5]]
## [1]  5 25
## 
## [[6]]
## [1]  6 36
## 
## [[7]]
## [1]  7 49
## 
## [[8]]
## [1]  8 64
## 
## [[9]]
## [1]  9 81
## 
## [[10]]
## [1]  10 100

This may seem a little convoluted for this example, but when working with large data and complex functions I find it very handy.
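
A close relative of lapply is sapply, which simplifies the result to a vector or matrix when it can, instead of always returning a list:

squares <- sapply(1:10, \(i) i^2)       #a numeric vector of squares
pairs   <- sapply(1:10, \(i) c(i, i^2)) #a 2 x 10 matrix (each result becomes a column)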

Importing data

Data to be read into R should be in CSV format. If you are saving data from Excel with the intention of using it in R, save it as a .csv. We can import the data using the read.csv function.

As before, you can use tab completion to more easily find files; it matches file names in your working directory.

dat <- read.csv("intro_r_data.csv")

We need a name for the data we imported, so I called it "dat". As before, you can assign objects using <- or =.
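
read.csv also has some optional arguments worth knowing about. As a sketch (the na.strings values here are just common conventions; adjust them for your own file):

dat <- read.csv("intro_r_data.csv",
                header = TRUE,            #the first row contains column names (the default)
                stringsAsFactors = FALSE, #keep text columns as character vectors
                na.strings = c("", "NA")) #treat blank cells and "NA" as missing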

Some data basics

R saves objects in different formats. The object "dat" is a data frame, which is essentially a collection of equal-length vectors (the columns). We can access those vectors using the "$" operator.

Col1 is an integer vector.

dat$Col1
##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10

Col2 is a character vector.

dat$Col2
##  [1] "Species A" "Species A" "Species A" "Species A" "Species A" "Species A" "Species A" "Species A" "Species A"
## [10] "Species A" "Species B" "Species B" "Species B" "Species B" "Species B" "Species B" "Species B" "Species B"
## [19] "Species B" "Species B" "Species C" "Species C" "Species C" "Species C" "Species C" "Species C" "Species C"
## [28] "Species C" "Species C" "Species C"

You can access elements of a vector using brackets. This will access the second element of the vector.

dat$Col2[2]
## [1] "Species A"

These brackets also apply to the entire data frame. Since it is two dimensional we need two indices separated by a comma: dat[1,2] accesses the value in the first row and the second column.

dat[1,2]
## [1] "Species A"

If you want to access an entire row or an entire column, just leave the other index blank (but keep the comma).

dat[1,] #first row
##   Col1      Col2     Col3
## 1    1 Species A 49.90378
dat[,1] #first column
##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10

Examining data

Now that we have imported the data, let's take a look at it! There are many useful ways to examine data, too many to describe here, so I have chosen a few that I use all the time.

Before going any further it's worth pointing out that functions require arguments, and these differ for every function depending on what it does. If you ever need help with a function, put a question mark in front of its name. Tab completion is also useful.

?colnames
colnames()

Returns the names of all of the columns in the data.

colnames(dat)
## [1] "Col1" "Col2" "Col3"

This function can also be used to rename columns. We do this by assigning a vector of new names. To make a vector of values we enclose them in c().

colnames(dat) <- c("plotID", "species", "height")
colnames(dat)
## [1] "plotID"  "species" "height"
length()

Returns the number of elements in a vector. (A vector is a basic data structure that contains values of the same type.) The code below creates a new vector named "spec" that contains the values in the species column; length then returns the number of values in that vector.

spec <- dat$species
length(spec)
## [1] 30
nrow() and ncol()

Return the number of rows and the number of columns, respectively.

nrow(dat)
## [1] 30
ncol(dat)
## [1] 3
unique()

Returns all of the unique values in a vector (or column). I often pair it with length() to determine the number of levels in a categorical variable. Let's examine the species column.

unique(dat$species)
## [1] "Species A" "Species B" "Species C"
length(unique(dat$species))
## [1] 3
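
A related function is table, which counts how many times each value appears; it is a quick way to check whether your groups are balanced. Here it should produce something like:

table(dat$species)
## 
## Species A Species B Species C 
##        10        10        10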
head() and tail()

Return the first 6 rows and the last 6 rows of a data frame, respectively. Very handy for taking a quick look at data. You can modify the n argument to change the number of rows shown.

head(dat)
##   plotID   species   height
## 1      1 Species A 49.90378
## 2      2 Species A 47.31901
## 3      3 Species A 50.87232
## 4      4 Species A 49.41498
## 5      5 Species A 51.29793
## 6      6 Species A 50.82202
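
For example, to see just the first three rows:

head(dat, n = 3)
##   plotID   species   height
## 1      1 Species A 49.90378
## 2      2 Species A 47.31901
## 3      3 Species A 50.87232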
as.character(), as.factor(), as.numeric()

These functions are used to convert columns to a different data type. Sometimes data are read in with the wrong type, and sometimes stats functions need data in a certain format. We can use these functions to move between types.

dat$species <- as.character(dat$species) #makes species a character
dat$species <- as.factor(dat$species) #makes it a factor again
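
One classic gotcha: calling as.numeric directly on a factor returns the underlying level codes, not the original values. Convert to character first:

f <- as.factor(c("10", "20", "30"))
as.numeric(f)               #returns the level codes: 1 2 3
as.numeric(as.character(f)) #returns the actual numbers: 10 20 30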

Wrangle data

Data wrangling is the term for reformatting your data. There are many reasons you might want to do this: maybe a particular function needs your data in a certain shape, or maybe you want to remove NAs. There are too many to count. Often in EECB-related fields with lots of data, this is the part of an analysis that takes the most time. Here are some useful functions for data wrangling.

with, within, and transform

R stores information in different environments. Right now we're in an environment where R knows about the variable "dat" but doesn't know about the columns it contains. That's why we need the "$" operator to access the environment within "dat".

We can jump into the environment of the data frame so we don't have to use dat$ to access variables:

test_data <- transform(dat, species = as.character(species)) #this sets species to a character vector

#within allows any amount of code to be written between the curly brackets. This is very useful
test_data <- within(dat, {species <- as.factor(species)}) 

with(test_data, aggregate(height, list(species), mean)) #returns the mean height for each species
##     Group.1        x
## 1 Species A 49.83775
## 2 Species B 69.97579
## 3 Species C 90.09803
filter() and/or subset()

Filter and subset are two ways of accomplishing the same thing: they allow you to look at portions of your data. Filter is a tidyverse function while subset is a base R function, so filter will only work if you have loaded tidyverse or dplyr. For example, let's look only at Species A.

SpeciesA.1 <- filter(dat, species == "Species A")
SpeciesA.2 <- subset(dat, species == "Species A")
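
Both functions also accept compound conditions. For example, to keep only Species A plants that are at least 50 units tall:

tallA.1 <- filter(dat, species == "Species A" & height >= 50)
tallA.2 <- subset(dat, species == "Species A" & height >= 50)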
select()

Select is used to choose or remove columns. Let's remove "plotID" from dat. You can do this either by keeping the columns you want or by removing the ones you don't.

dat2 <- select(dat, species, height)
dat2 <- select(dat, -plotID)

The base R equivalents can be a little clunkier.

test_data.select <- dat[c('species', 'height')]
test_data.select <- dat[-which(names(dat) == 'plotID')]

You can also remove columns using the brackets we used earlier. Since plotID is the first column, we can do this.

dat2 <- dat[,-1] #would remove the first column.
dat_data_removed <- dat[-1,] #would remove the first row.
gather() and spread()

Gather and spread are very handy for reformatting data frames. Essentially, these functions transition data between "long" and "wide" forms: long data frames tend to have more rows and fewer columns, while wide data frames have more columns and fewer rows. Long data frames are nice for data wrangling and for fitting some models, while wide data frames are needed for analyses like Hill numbers and ordination. Let's take a look at what these look like. (The functions melt and cast in the reshape package are similar.)

dat

These data are long (although since it is only a small amount of data it isn’t a great example). Imagine you have 300 or 3000 observations (rows). Then the term “long” would make more sense.

We can use spread to make the data wider. We want a column for each species (the "key") and those columns to be filled with the height measurements (the "value").

dat_spread <- spread(dat, key = species, value = height)
head(dat_spread)
##   plotID Species A Species B Species C
## 1      1  49.90378  67.43697  90.37914
## 2      2  47.31901  70.33240  91.13306
## 3      3  50.87232  70.89353  92.75686
## 4      4  49.41498  70.39009  89.08342
## 5      5  51.29793  71.12169  89.56514
## 6      6  50.82202  71.29156  91.64644

Now “Species A”, “Species B”, and “Species C” each have their own column.

We can reverse this using gather. We need to specify that we only want to gather columns 2:4. In gather, key is the name for the new column created from the former column names, and value is the name for the column of values.

dat_gathered <- gather(dat_spread, 2:4, key = "Species", value = "height")
head(dat_gathered)
##   plotID   Species   height
## 1      1 Species A 49.90378
## 2      2 Species A 47.31901
## 3      3 Species A 50.87232
## 4      4 Species A 49.41498
## 5      5 Species A 51.29793
## 6      6 Species A 50.82202
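
One note: in current versions of tidyr, gather and spread have been retired in favor of pivot_longer and pivot_wider, which do the same job with (arguably) clearer argument names:

dat_wide <- pivot_wider(dat, names_from = species, values_from = height)
dat_long <- pivot_longer(dat_wide, cols = -plotID,
                         names_to = "Species", values_to = "height")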

A base R equivalent is reshape, although there are some differences in output.

test_data.spread <- reshape(test_data, idvar = "plotID", timevar = "species", direction = "wide")
head(test_data.spread)
##   plotID height.Species A height.Species B height.Species C
## 1      1         49.90378         67.43697         90.37914
## 2      2         47.31901         70.33240         91.13306
## 3      3         50.87232         70.89353         92.75686
## 4      4         49.41498         70.39009         89.08342
## 5      5         51.29793         71.12169         89.56514
## 6      6         50.82202         71.29156         91.64644
test_data.gathered <- reshape(test_data.spread, direction = "long")
head(test_data.gathered)
##             plotID   species height.Species A
## 1.Species A      1 Species A         49.90378
## 2.Species A      2 Species A         47.31901
## 3.Species A      3 Species A         50.87232
## 4.Species A      4 Species A         49.41498
## 5.Species A      5 Species A         51.29793
## 6.Species A      6 Species A         50.82202
Pipes

More recent versions of R allow us to string together a chain of commands using pipes. A pipe takes input data, runs a function, then passes the output to the next function, and so on. The base R pipe (added in R 4.1) is |> and the tidyverse pipe is %>%. I tend to use the tidyverse pipe.

In this example we take "dat", first filter it, and then remove a column. We specify the data at the beginning of the chain, so we do not need to include it in the subsequent functions. We first keep only the rows where height is greater than or equal to 50 and then remove the plotID column.

piped_dat <- dat %>% 
  filter(height >= 50) %>% 
  select(-plotID)

head(piped_dat)
##     species   height
## 1 Species A 50.87232
## 2 Species A 51.29793
## 3 Species A 50.82202
## 4 Species A 50.86813
## 5 Species A 51.78562
## 6 Species B 67.43697
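
The same chain with the base R pipe might look like this, using subset (which conveniently has a select argument):

piped_dat2 <- dat |> 
  subset(height >= 50, select = -plotID)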
group_by() and summarise()

These functions are used in conjunction to summarize data based on a grouping variable, and they are best used with pipes. In the example below we calculate the mean and standard deviation of height for each species, putting them in the new columns "av" and "std_dev".

summ <- dat %>% 
  group_by(species) %>% 
  summarise(av = mean(height), std_dev = sd(height))
head(summ)
## # A tibble: 3 × 3
##   species      av std_dev
##   <fct>     <dbl>   <dbl>
## 1 Species A  49.8    1.57
## 2 Species B  70.0    1.36
## 3 Species C  90.1    1.54

Here you can see that a new data frame is created. It has a species column, a column with the mean height of each species and a column for the standard deviation of height in each species.
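
summarise can compute any number of statistics at once, and the helper n() counts the rows in each group. Something like this should give the group sizes alongside the means:

dat %>% 
  group_by(species) %>% 
  summarise(n = n(), av = mean(height))
## # A tibble: 3 × 3
##   species       n    av
##   <fct>     <int> <dbl>
## 1 Species A    10  49.8
## 2 Species B    10  70.0
## 3 Species C    10  90.1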

na.omit()

na.omit is used to remove NAs that appear in your data. It is important to know that it works by row: if an NA appears in any column of a row, that whole row is removed. Sometimes this is totally fine; other times you may only care about NAs in a particular column.

dat_noNA <- na.omit(dat)

You can see that there is no difference here, because there were no NAs in the original data.
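
If you only care about NAs in one column, you can drop rows based on is.na instead. Here is a sketch for the height column:

dat_noNA <- dat[!is.na(dat$height), ]   #base R
dat_noNA <- filter(dat, !is.na(height)) #tidyverse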

Simulation

R is also super handy for simulating data! Simulating data is a super important part of statistics. To make sure your code and analyses make sense, it is important to simulate data first and check: if you know the "answer", because you generated the data yourself, your statistical test should recover it. If it doesn't, something is wrong.

When simulating data in R, you simulate from a distribution, and there are many distributions to choose from. They can be grouped into discrete and continuous: the normal and gamma distributions are continuous because they generate continuous values, while the Poisson and binomial distributions are discrete because they generate whole numbers. The Wikipedia pages for these are pretty good!
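
For a taste, each of these distributions has an r* function in R for generating random draws (more on these functions in a moment):

rnorm(5, mean = 10, sd = 2)      #normal: continuous values around 10
rgamma(5, shape = 2, rate = 1)   #gamma: continuous, positive values
rpois(5, lambda = 3)             #Poisson: discrete counts
rbinom(5, size = 10, prob = 0.5) #binomial: successes out of 10 trials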

Let’s focus on the normal distribution because it is the easiest.

We can simulate data using the rnorm function, which draws random points from a normal distribution. The first argument is the number of points, the second is the mean of the distribution, and the third is the standard deviation.

Here is a sample of 1000 points from a distribution with a mean of 10 and a standard deviation of 2.

sam <- rnorm(1000, 10, 2)
plot(density(sam))

Let’s look at the process of simulating data to check a statistical test. The t-test compares the means of two groups. Let’s simulate two distributions with different means:

sam1 <- rnorm(50, 20, 1)
sam2 <- rnorm(50, 15, 1)

plot(density(sam1), xlim = c(10,25) ,ylim = c(0,1))
lines(density(sam2), col ="red")

You’ll notice that these distributions are bumpier; this is because we simulated fewer points. The more points we simulate, the more individual outliers even out. Your distributions will also probably look different from these, because the random numbers you got are different from mine! We can use the function set.seed to fix this. I’ll use it later.

Now let’s look at a t-test.

t.test(sam1, sam2)
## 
##  Welch Two Sample t-test
## 
## data:  sam1 and sam2
## t = 25.618, df = 96.219, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.819461 5.629030
## sample estimates:
## mean of x mean of y 
##  20.13174  14.90749

This looks right! The means match what we simulated, and it is very unlikely we obtained these differences by chance (we did simulate them to be different, after all). Let’s look at what happens if we simulate them to be the same.

set.seed(666)

sam1 <- rnorm(50, 15, 1)
sam2 <- rnorm(50, 15, 1)

plot(density(sam1), xlim = c(10,25) ,ylim = c(0,1))
lines(density(sam2), col ="red")

t.test(sam1, sam2)
## 
##  Welch Two Sample t-test
## 
## data:  sam1 and sam2
## t = 0.018033, df = 94.266, p-value = 0.9857
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4071033  0.4145660
## sample estimates:
## mean of x mean of y 
##  14.93520  14.93147

This also looks right! The distributions were simulated to be the same and the t-test supports this! Remember that they are not identical because they are composed of random numbers; the more values we generate, the closer they will appear. Also, notice that your distributions should look the same as mine this time. That is because we set the seed first, so our different computers generated the same random numbers.

We can also use rnorm to simulate relationships between variables. This is super handy for confirming that analyses are doing what you think they are. Say we want to simulate the effect of height (from the test dataset) on water use, a variable we will make up.

We can do something like this. Here we are specifying the relationship between water use and height: each one-unit increase in plant height increases water use by 3 units. We use rnorm to simulate the data with and without noise. Real data are messy, so we should simulate ours with noise (I use n = 30 for rnorm because we have 30 data points).

water_use_nonoise <- 3 * dat$height
water_use_noise <- 3 * dat$height + rnorm(30, 0, 10)

Let’s look at the differences

plot(dat$height, water_use_nonoise, main = "No Noise")

plot(dat$height, water_use_noise, main = "Noise")

Let’s quickly check that the relationship we simulated is recovered.

coef(glm(water_use_noise ~ dat$height))
## (Intercept)  dat$height 
##  -10.792341    3.129803

It is!
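
To push this one step further, you can repeat the simulation many times and check that the estimated slope is centered on the true value of 3. A minimal sketch:

slopes <- replicate(1000, {
  y <- 3 * dat$height + rnorm(30, 0, 10) #resimulate water use with new noise
  coef(glm(y ~ dat$height))[2]           #extract the estimated slope
})
mean(slopes) #should land very close to 3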