library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
This code demonstrates Principal Component Analysis (PCA) using both
prcomp
and breaking it apart into its component parts
(using eigen
and matrix multiplication).
This example uses the Iris dataset that is built into R. It contains various plant measurements (which are correlated) from different species. I subset to one plant species and show the correlations below.
dat <- iris %>%
filter(Species == "versicolor") %>%
select(-Species)
pairs(dat) #variables are correlated
prcomp
First you want to scale each variable by its variance.
dat <- data.frame(apply(dat, 2, scale))
This runs a PCA using the prcomp
function.
pc <- prcomp(dat)
This is the rotation matrix (also called the loadings). You use these data to describe how the original data relate to the new PC axes.
pc$rotation
## PC1 PC2 PC3 PC4
## Sepal.Length -0.4823284 -0.6107980 -0.4906296 -0.3918772
## Sepal.Width -0.4648460 0.6727830 -0.5399025 0.1994658
## Petal.Length -0.5345136 -0.3068495 0.3402185 0.7102042
## Petal.Width -0.5153375 0.2830765 0.5933290 -0.5497778
The screeplot tells us how much of the total variation in the original dataset is explained by each PC axis.
screeplot(pc)
This the plot we are often interested in. It shows the new PC1 and PC2 axes. Keep in mind that it is often informative to look at PC3, PC4, and so on. It depends on the size of your dataset and the amount of variance explained by each PC axis.
pca_dat <- data.frame(pc1 = pc$x[,1], pc2 = pc$x[,2])
ggplot(data = pca_dat) +
geom_point(aes(pc1, pc2), pch = 21, fill = "darkblue", alpha = 0.5, size = 3) +
theme_classic()
First we need to create a variance/covariance matrix. Keep in mind
that since we are running var
on the scaled dataset it is
equivalent to a correlation matrix.
mat <- var(dat)
mat
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 0.5259107 0.7540490 0.5464611
## Sepal.Width 0.5259107 1.0000000 0.5605221 0.6639987
## Petal.Length 0.7540490 0.5605221 1.0000000 0.7866681
## Petal.Width 0.5464611 0.6639987 0.7866681 1.0000000
Now we perform eigen decomposition. In the terms of linear algebra (where this comes from), we are identifying the axes in our data that do not change their direction when a linear transformation is applied. The vectors are the axes (position and direction) and the values describe the amount the axis is displaced. Since our dataset is composed of variances, the eigenvectors are the axes of most explained variance and the eigenvalues describe the amount of variance explained by each one. Generally, eigenvectors do not need to be orthogonal, however when applied to a square matrix (like a var/cov matrix), they are. For more information see the associated slides.
e <- eigen(mat)
e$vectors
## [,1] [,2] [,3] [,4]
## [1,] -0.4823284 0.6107980 -0.4906296 0.3918772
## [2,] -0.4648460 -0.6727830 -0.5399025 -0.1994658
## [3,] -0.5345136 0.3068495 0.3402185 -0.7102042
## [4,] -0.5153375 -0.2830765 0.5933290 0.5497778
e$values
## [1] 2.9263407 0.5462747 0.3949976 0.1323871
We can re-create the same screeplot from above using the eigenvalues.
scree_dat <- data.frame(type = c(rep("PC1", round(e$values*100)[1]),
rep("PC2", round(e$values*100)[2]),
rep("PC3", round(e$values*100)[3]),
rep("PC4", round(e$values*100)[4])))
ggplot(data = scree_dat) +
geom_bar(aes(x=type), color="black", fill = "gray80", linewidth = 0.5) +
theme_classic()
We can re-create the PC plot from above by multiplying the
eigenvectors (or rotation matrix) by our original data. Essentially the
eigenvectors describe the relationship between our original data and the
transformed data and by multiplying them we can convert one to the
other. Note: I perform the matrix multiplication by hand and using
%*%
. They are the same.
pc1 <- dat[,1] * e$vectors[1,1] + dat[,2] * e$vectors[2,1] + dat[,3] * e$vectors[3,1] + dat[,4] * e$vectors[4,1]
pc2 <- dat[,1] * e$vectors[1,2] + dat[,2] * e$vectors[2,2] + dat[,3] * e$vectors[3,2] + dat[,4] * e$vectors[4,2]
pc3 <- dat[,1] * e$vectors[1,3] + dat[,2] * e$vectors[2,3] + dat[,3] * e$vectors[3,3] + dat[,4] * e$vectors[4,3]
pc4 <- dat[,1] * e$vectors[1,4] + dat[,2] * e$vectors[2,4] + dat[,3] * e$vectors[3,4] + dat[,4] * e$vectors[4,4]
out <- data.frame(X1= pc1, X2 = pc2, X3 = pc3, X4 = pc4)
# Using matrix multiplication
out1 <- data.frame(as.matrix(dat) %*% as.matrix(e$vectors))
ggplot(data = out1) +
geom_point(aes(X1, X2), pch = 21, fill = "darkblue", alpha = 0.5, size = 3) +
theme_classic()