library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Introduction

This code demonstrates Principal Component Analysis (PCA) in two ways: with the prcomp function, and by building it from its component parts using eigen decomposition and matrix multiplication.

Data Preparation

This example uses the Iris dataset that is built into R. It contains various plant measurements (which are correlated) from different species. I subset to one plant species and show the correlations below.

dat <- iris %>% 
  filter(Species == "versicolor") %>% 
  select(-Species)

pairs(dat) # variables are correlated

PCA

Using prcomp

First you want to center each variable and scale it by its standard deviation (scale does both by default).

dat <- data.frame(apply(dat, 2, scale))

This runs a PCA using the prcomp function.

pc <- prcomp(dat)
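As an aside, prcomp can do the centering and scaling itself via its center and scale. arguments, which is equivalent to pre-scaling the data as above. A quick sketch (rebuilding the versicolor subset so it runs on its own):

```r
library(dplyr)

# Rebuild the versicolor subset so this snippet is self-contained
dat <- iris %>%
  filter(Species == "versicolor") %>%
  select(-Species)

# Let prcomp center and scale internally...
pc_raw <- prcomp(dat, center = TRUE, scale. = TRUE)
# ...or pre-scale the data ourselves, as in the text
pc_pre <- prcomp(data.frame(apply(dat, 2, scale)))

# The rotations match (abs() guards against arbitrary sign flips)
all.equal(abs(pc_raw$rotation), abs(pc_pre$rotation))
```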

This is the rotation matrix (also called the loadings). You use these data to describe how the original data relate to the new PC axes.

pc$rotation
##                     PC1        PC2        PC3        PC4
## Sepal.Length -0.4823284 -0.6107980 -0.4906296 -0.3918772
## Sepal.Width  -0.4648460  0.6727830 -0.5399025  0.1994658
## Petal.Length -0.5345136 -0.3068495  0.3402185  0.7102042
## Petal.Width  -0.5153375  0.2830765  0.5933290 -0.5497778
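One property worth checking: the columns of the rotation matrix are orthonormal, so their cross-product is the identity matrix. A self-contained sketch:

```r
# Rebuild the scaled versicolor subset and its PCA
pc <- prcomp(scale(iris[iris$Species == "versicolor", 1:4]))

# Rotation columns are orthonormal: t(R) %*% R is the 4x4 identity
round(crossprod(pc$rotation), 10)
```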

Screeplot

The screeplot tells us how much of the total variation in the original dataset is explained by each PC axis.

screeplot(pc)
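screeplot plots the variance of each PC (pc$sdev^2). To get the proportion of variance explained as numbers rather than bars, you can compute it directly (a self-contained sketch):

```r
# Rebuild the scaled versicolor subset and its PCA
pc <- prcomp(scale(iris[iris$Species == "versicolor", 1:4]))

# Variance of each PC is sdev^2; screeplot() plots these as bar heights
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var_explained, 3)
```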

PCA Scores Plot

This is the plot we are often interested in. It shows the new PC1 and PC2 axes. Keep in mind that it is often informative to look at PC3, PC4, and so on. It depends on the size of your dataset and the amount of variance explained by each PC axis.

pca_dat <- data.frame(pc1 = pc$x[,1], pc2 = pc$x[,2])

ggplot(data = pca_dat) +
  geom_point(aes(pc1, pc2), pch = 21, fill = "darkblue", alpha = 0.5, size = 3) +
  theme_classic()

PCA by Hand (using eigen decomposition)

Data Preparation

First we need to create a variance/covariance matrix. Since we are running var on the scaled dataset, this is equivalent to the correlation matrix of the original data.

mat <- var(dat)
mat
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.5259107    0.7540490   0.5464611
## Sepal.Width     0.5259107   1.0000000    0.5605221   0.6639987
## Petal.Length    0.7540490   0.5605221    1.0000000   0.7866681
## Petal.Width     0.5464611   0.6639987    0.7866681   1.0000000
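We can confirm that claim directly: var of the standardized data equals cor of the raw data. A self-contained check:

```r
# Rebuild the versicolor subset, raw and standardized
dat_raw <- iris[iris$Species == "versicolor", 1:4]
dat <- data.frame(apply(dat_raw, 2, scale))

# var() of standardized data is the correlation matrix of the raw data
all.equal(var(dat), cor(dat_raw), check.attributes = FALSE)
```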

Now we perform the eigen decomposition. In linear algebra terms (where this comes from), we are identifying the axes in our data that do not change direction when a linear transformation is applied. The eigenvectors are those axes (position and direction), and the eigenvalues describe how much each axis is stretched. Since our matrix is composed of variances, the eigenvectors are the axes of greatest explained variance and the eigenvalues give the amount of variance explained by each one. In general, eigenvectors need not be orthogonal; however, for a symmetric matrix (like a variance/covariance matrix) they are. For more information see the associated slides.

e <- eigen(mat)
e$vectors
##            [,1]       [,2]       [,3]       [,4]
## [1,] -0.4823284  0.6107980 -0.4906296  0.3918772
## [2,] -0.4648460 -0.6727830 -0.5399025 -0.1994658
## [3,] -0.5345136  0.3068495  0.3402185 -0.7102042
## [4,] -0.5153375 -0.2830765  0.5933290  0.5497778
e$values
## [1] 2.9263407 0.5462747 0.3949976 0.1323871
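The defining property of an eigenpair is mat %*% v = lambda * v. A quick self-contained check:

```r
# Correlation matrix of the versicolor subset and its eigen decomposition
mat <- cor(iris[iris$Species == "versicolor", 1:4])
e <- eigen(mat)

# For every eigenpair: mat %*% v equals lambda * v
all.equal(unname(mat %*% e$vectors), e$vectors %*% diag(e$values))
```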

Screeplot from scratch

We can re-create the same screeplot from above using the eigenvalues. Here I repeat each PC label in proportion to its eigenvalue so that the bar counts from geom_bar reflect the variance explained; this is a quick trick rather than standard practice.

# Repeat each label ~100x its eigenvalue so bar heights track variance explained
scree_dat <- data.frame(type = c(rep("PC1", round(e$values*100)[1]), 
                                 rep("PC2", round(e$values*100)[2]),
                                 rep("PC3", round(e$values*100)[3]), 
                                 rep("PC4", round(e$values*100)[4])))

ggplot(data = scree_dat) +
  geom_bar(aes(x=type), color="black", fill = "gray80", linewidth = 0.5) +
  theme_classic()
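The eigenvalues should match the PC variances reported by prcomp. A one-line check, rebuilt so it runs on its own:

```r
# Standardized versicolor subset
dat <- data.frame(apply(iris[iris$Species == "versicolor", 1:4], 2, scale))

# Eigenvalues of the var/cov matrix equal prcomp's PC variances (sdev^2)
all.equal(eigen(var(dat))$values, prcomp(dat)$sdev^2)
```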

Computing PCA Scores Manually

We can re-create the PC plot from above by multiplying our (scaled) data by the eigenvectors (i.e., the rotation matrix). The eigenvectors describe the relationship between the original axes and the new PC axes, so this matrix product converts one coordinate system into the other. Note: I perform the matrix multiplication by hand and using %*%. They are the same, though the signs of some axes may be flipped relative to prcomp, since eigenvector signs are arbitrary.

pc1 <- dat[,1] * e$vectors[1,1] + dat[,2] * e$vectors[2,1] + dat[,3] * e$vectors[3,1] + dat[,4] * e$vectors[4,1]
pc2 <- dat[,1] * e$vectors[1,2] + dat[,2] * e$vectors[2,2] + dat[,3] * e$vectors[3,2] + dat[,4] * e$vectors[4,2]
pc3 <- dat[,1] * e$vectors[1,3] + dat[,2] * e$vectors[2,3] + dat[,3] * e$vectors[3,3] + dat[,4] * e$vectors[4,3]
pc4 <- dat[,1] * e$vectors[1,4] + dat[,2] * e$vectors[2,4] + dat[,3] * e$vectors[3,4] + dat[,4] * e$vectors[4,4]

out <- data.frame(X1= pc1, X2 = pc2, X3 = pc3, X4 = pc4)

# Using matrix multiplication
out1 <- data.frame(as.matrix(dat) %*% as.matrix(e$vectors))

ggplot(data = out1) +
  geom_point(aes(X1, X2), pch = 21, fill = "darkblue", alpha = 0.5, size = 3) +
  theme_classic()
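To close the loop, the hand-computed scores match prcomp's scores up to per-axis sign flips. A self-contained sketch:

```r
# Standardized versicolor subset and its eigen decomposition
dat <- data.frame(apply(iris[iris$Species == "versicolor", 1:4], 2, scale))
e <- eigen(var(dat))

# Project the data onto the eigenvectors by hand...
scores_manual <- as.matrix(dat) %*% e$vectors
# ...and compare to prcomp's scores (abs() absorbs arbitrary sign flips)
pc <- prcomp(dat)
all.equal(abs(unname(scores_manual)), abs(unname(pc$x)))
```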