PCA – Part 3: In the Trenches

Now that we have an intuition of what principal component analysis (PCA) is and understand some of the mathematics behind it, it is time we make PCA work for us.

Practical examples of PCA typically use Ronald Fisher’s famous “Iris” data set, which contains four measurements (sepal and petal lengths and widths) of three species of Iris flowers. To mix things up a little bit, I will use a data set that is closer to what you would encounter in the real world.

The “Human Activity Recognition Using Smartphones Data Set”, available from the UCI Machine Learning Repository, contains a total of 561 triaxial acceleration and angular velocity measurements of 30 subjects performing different movements such as sitting, standing, and walking. The researchers collected this data set to ask whether those measurements are sufficient to identify which activity a person is performing. Instead of focusing on this classification problem, we will look at the structure of the data and use PCA to investigate whether we can express the information contained in the 561 different measurements in a more compact form.

I will be using a subset of the data containing the measurements of only three subjects. As always, the code used for the pre-processing steps of the raw data can be found on GitHub.

Step 1: Explore the data

Let’s first load the pre-processed subset of the data into our R session.

# "getURL" comes from the "RCurl" package
library(RCurl)
# read data from GitHub
measurements <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_measurements.txt"))
description <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_description.txt"))

It’s always a good idea to check for a couple of basic things first. The big three I usually check are:

  • What are the dimensions of the data, i.e. how many rows and columns?
  • What type of features are we dealing with, i.e. categorical, ordinal, continuous?
  • Are there any missing values?

The answer to those three questions will determine the amount of additional data munging we have to do before we can use the data for PCA.

# what are the dimensions of the data?
dim(measurements)
# what type of data are the features?
table(sapply(measurements, class))
# are there missing values?
any(is.na(measurements))

The data contains 990 samples (rows) with 561 measurements (columns) each, clearly too many measurements to visualize in a scatterplot. The measurements are all of type “numeric”, which means we are dealing with continuous variables. This is great because categorical and ordinal variables are not handled well by PCA; those would need to be “dummy coded”. We also don’t have to worry about missing values. Strategies for handling missing values are a topic of their own.
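
As an aside, if you do run into categorical features, R’s “model.matrix” can produce the 0/1 indicator (“dummy”) columns for you. The following is a minimal sketch on a made-up data frame, not part of this data set.

# hypothetical example: dummy code a categorical feature
toy <- data.frame(activity = factor(c("sitting", "standing", "walking")))
# one 0/1 indicator column per factor level ("+ 0" drops the intercept)
dummies <- model.matrix(~ activity + 0, data = toy)
head(dummies)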

Before we run PCA on the data, we should look at the correlation structure of the features. If there are features, i.e. measurements in our case, that are highly correlated (or anti-correlated), there is redundancy within the data set and PCA will be able to find a more compact representation of the data.

# "image.plot" comes from the "fields" package
library(fields)
# feature correlation before PCA
cor_m <- cor(measurements, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_m <- cor_m[upper.tri(cor_m)]
# plot correlations as histogram
hist(upt_m, prob = TRUE)
# plot correlations as image
image.plot(cor_m, axes = FALSE)

The code was simplified for clarity. The full version can be found in the script.

pca_part3_fig1

We see in the histogram on the left that there is a considerable number of highly correlated features, most of them positively correlated. Those features show up as yellow in the image representation to the right. PCA will likely be able to provide us with a good lower-dimensional approximation of this data set.

Step 2: Run PCA

After all the preparation, running PCA is just one line of code. Remember that we need to at least center the data before using PCA (why? see Part 2). Scaling is technically only necessary if the magnitudes of the features are vastly different. Note that this data set appears to be centered and scaled from the get-go.
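
If you want to verify that the data really are close to centered and scaled, a quick look at the column means and standard deviations is enough. This is just a sanity check and not part of the original analysis.

# sanity check: if the data are already centered and scaled,
# column means should be close to 0 and standard deviations close to 1
summary(colMeans(measurements))
summary(apply(measurements, 2, sd))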

# run PCA
pc <- prcomp(measurements, center = TRUE, scale. = TRUE)

Depending on your system and the number of features of your data this may take a couple of seconds.

The call to “prcomp” has constructed new features as linear combinations of the old features and sorted them by the amount of variance they explain. Because the new features are projections onto the eigenvectors of the feature covariance matrix, which are orthogonal by definition, they should be uncorrelated. Let’s visualize this directly.
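
One way to check the orthogonality directly is to verify that the eigenvectors returned by “prcomp” (stored in “rotation”, see the Addendum below) form an orthonormal basis. This is a quick sanity check rather than part of the analysis.

# the eigenvectors in "rotation" should be orthonormal,
# i.e. t(W) %*% W should be (numerically) the identity matrix
W <- pc$rotation
all.equal(crossprod(W), diag(ncol(W)), check.attributes = FALSE)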

The new representation of the data is stored as a matrix named “x” in the list object we get back from “prcomp”. In our case, the matrix would be stored as “pc$x”.

# feature correlation after PCA
cor_r <- cor(pc$x, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_r <- cor_r[upper.tri(cor_r)]
# plot correlations as histogram
hist(upt_r, prob = TRUE)
# plot correlations as image
image.plot(cor_r, axes = FALSE)

pca_part3_fig2

The new features are clearly no longer correlated to each other. As everything seems to be in order, we can now focus on the interpretation of the results.

Step 3: Interpret the results

The first thing you will want to check is how much variance is explained by each component. In PCA speak, this can be visualized with a “scree plot”. R conveniently has a built-in function to draw such a plot.

# draw a scree plot
screeplot(pc, npcs = 10, type = "lines")

pca_part3_fig3

This is about as good as it gets. A large amount of the variance is captured by the first principal component followed by a sharp decline as the remaining components gradually explain less and less variance approaching zero.

The decision of how many components we should use to get a good approximation of the data has to be made on a case-by-case basis. The cut-off for the percentage of explained variance depends on the kind of data you are working with and its inherent covariance structure. The majority of the data sets you will encounter are not nearly as well behaved as this one, meaning that the decline in explained variance is much shallower. Common cut-offs range from 80% to 95% of explained variance.

Let’s look at how many components we would need to explain a given amount of variance. In the R implementation of PCA, the standard deviations of the principal components are stored in a vector called “sdev”. As the name implies, these are standard deviations, i.e. the square roots of the variances, which in turn are scaled versions of the eigenvalues. We therefore need to square “sdev” to get back the variances.

# calculate explained variance as cumulative sum
# sdev are the square roots of the variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# plot explained variance
plot(0:length(var_expl), c(0, var_expl), type = "l", lwd = 2, ylim = c(0, 1),
     xlab = "Principal Components", ylab = "Variance explained")
# mark the number of components needed for common cut-offs of explained variance
vars <- c(0.8, 0.9, 0.95, 0.99)
for (v in vars) {
    npc <- which(var_expl > v)[1]
    lines(x = c(0, npc, npc), y = c(v, v, 0), lty = 3)
    text(x = npc, y = v - 0.05, labels = npc, pos = 4)
    points(x = npc, y = v)
}

pca_part3_fig4

The first principal component on its own explains more than 50% of the variance, and we need only 20 components to get up to 80% of explained variance. Fewer than 30% of the components (162 out of 561) are needed to capture 99% of the variance in the data set. This is a dramatic reduction in complexity. Being able to approximate the data set with a much smaller number of features can greatly speed up downstream analysis and can help with visualizing the data graphically.
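
If you later need an approximation of the original measurements rather than the compact representation itself, you can back-project the first k components onto the original feature space. A minimal sketch, assuming the data were centered and scaled as in the call to “prcomp” above.

# rank-k approximation of the (centered and scaled) data
k <- 20
approx_k <- pc$x[, 1:k] %*% t(pc$rotation[, 1:k])
# mean squared error of the approximation
mean((scale(measurements) - approx_k)^2)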

Finally, let’s investigate whether “variance” translates to “information”. In other words, do the principal components associated with the largest eigenvalues discriminate between the different human activities?

If the class labels (“activities” in our case) are known, a good way to assess the “information content” of the principal components is to look at scatter plots of the first couple of components and color-code the samples by class label. This code gives you a bare-bones version of the figure shown below. The complete code can be found on GitHub.

# plot the first 8 principal components against each other
for(p in seq(1, 8, by = 2)) {
  plot(pc$x[, p:(p+1)], pch = 16, 
       col = as.numeric(description$activity_name))
}

pca_part3_fig5

We have seen previously that the first component alone explains about half of the variance, and in this figure we see why: it almost perfectly separates non-moving “activities” (“laying”, “sitting”, “standing”) from moving activities (various types of “walking”). The second component does a reasonable job of telling the difference between walking and walking upstairs. As we move down the list, there remains visible structure, but the distinctions become less clear. One conclusion we can draw from this visualization is that it will likely be most difficult to tell “sitting” apart from “standing”, as none of the dimensions seems able to distinguish the red and green samples. Oddly enough, the fifth component does a pretty good job of separating “laying” from “sitting” and “standing”.

Recap

PCA can be a powerful technique to obtain low dimensional approximations of data with lots of redundant features. The “Human Activity Recognition Using Smartphones Data Set” used in this tutorial is a particularly good example of that. Most real data sets will not be reduced to a few components so easily while retaining most of the information. But even cutting the number of features in half can lead to considerable time savings when using machine learning algorithms.

Here are a couple of useful questions when approaching a new data set to apply PCA to:

  1. Are the features numerical or do I have to convert categorical features?
  2. Are there missing values and if yes, which strategy do I apply to deal with them?
  3. What is the correlation structure of the data? Will PCA be effective in this case?
  4. What is the distribution of variances after PCA? Do I see a steep or shallow decline in explained variance?
  5. How much “explained variance” is a good enough approximation of the data? This is usually a trade-off between how much potential information I am willing to sacrifice and how much computation time I save in follow-up analyses.

In the final part of this series, we will discuss some of the limitations of PCA.


Addendum: Understanding “prcomp”

The “prcomp” function is very convenient because it calculates all the numbers we could possibly want from our PCA analysis in one line. However, it is useful to know how those numbers were generated.

The three most frequently used objects returned by “prcomp” are

  • “rotation”: right eigenvectors (“feature eigenvectors”)
  • “sdev”: square roots of scaled eigenvalues
  • “x”: projection of original data onto the new features
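
A quick way to get an overview of everything “prcomp” returns is to inspect the structure of the result object.

# components of the list returned by "prcomp"
names(pc)
# compact overview of their types and dimensions
str(pc, max.level = 1)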

Rotation

In Part 2, I mentioned that software implementations of PCA usually compute the eigenvectors of the data matrix using singular value decomposition (SVD) rather than eigendecomposition of the covariance matrix. In fact, R’s “prcomp” is no exception.

“Rotation” is a matrix whose columns are the right eigenvectors (right singular vectors) of the original data. We can reconstruct “rotation” using SVD.

# perform singular value decomposition on centered and scaled data
sv <- svd(scale(measurements))
# "prcomp" stores right eigenvectors in "rotation"
w <- pc$rotation
dimnames(w) = NULL
# "svd" stores right eigenvectors in matrix "v"
v <- sv$v
# check if the two matrices are equal
all.equal(w, v)

Sdev

Singular values are the square roots of the eigenvalues, as we have seen in Part 2. “sdev” stands for standard deviation and thus stores the square roots of the variances. The squares of “sdev” and the squares of the singular values are therefore directly proportional to each other; the scaling factor is the number of rows of the original data matrix minus 1.

# relationship between singular values and "sdev"
all.equal(sv$d^2/(nrow(sv$u)-1), pc$sdev^2)

x

The projection of the original data (“measurements”) onto its eigenbasis is automatically calculated by “prcomp” through its default argument “retx = TRUE” and stored in “x”. We can manually recreate the projection using matrix multiplication.

# manual projection of data
all.equal(pc$x, scale(measurements) %*% pc$rotation)

If we wanted to obtain a projection of the data onto a lower dimensional subspace, we just determine the number of components needed and subset the columns of matrix “x”. For example, if we wanted to get an approximation of the original data preserving 90% of the variance, we take the first 52 columns of “x”.

# projection of original data preserving 90% of variance
y90 <- pc$x[, 1:52]
# note that this is equivalent matrix multiplication with 
# the first 52 eigenvectors
all.equal(y90, scale(measurements) %*% pc$rotation[, 1:52])


Reproducibility

The full R code is available on Github.


Further reading

Iris data set

Sebastian Raschka – Principal Component Analysis in 3 Simple Steps


PCA SERIES

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets
