PCA – Part 4: Potential Pitfalls

In the first three parts of this series on principal component analysis (PCA), we have talked about what PCA can do for us, what it is mathematically, and how to apply it in practice. Today, I will briefly discuss some of the potential caveats of PCA.

Information and Noise

PCA looks for the dimensions with the highest variance within the data and assumes that high variance is a proxy for “information”. This assumption is usually warranted; otherwise, PCA would not be useful.

In cases of unsupervised learning, that is, if we have no class labels for the data, looking for structure within the data based on the data itself is our only choice. In a sense, we cannot tell what parts of the data are information and what parts are noise.

If we have class labels available (supervised learning), we could in principle look for dimensions of variance that optimally separate the classes from each other. PCA does not do that. It is “class agnostic” and thus treats “information”-variance and “noise”-variance the same way.

It is possible that principal components associated with small eigenvalues nevertheless carry the most information. In other words, the size of the eigenvalue and the information content are not necessarily correlated. When choosing the number of components onto which to project our data, we could thus lose important information. Luckily, such situations rarely happen in practice. Or we just never realize …
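To make this concrete, here is a minimal sketch using simulated data (the data, the variable names, and the choice of standard deviations are my own assumptions for illustration, not part of any real data set). One feature has a large variance but is unrelated to the class label, the other has a small variance but separates the two classes.

```r
# simulated two-class data (hypothetical, for illustration only)
set.seed(1)
n <- 200
labels <- rep(c(1, 2), each = n / 2)
# feature 1: large variance, unrelated to the class label ("noise")
noise <- rnorm(n, mean = 0, sd = 10)
# feature 2: small variance, but its mean depends on the class ("information")
signal <- rnorm(n, mean = ifelse(labels == 1, -1, 1), sd = 0.3)
toy <- cbind(noise = noise, signal = signal)

# PCA without scaling, so the raw variances decide the order of the components
toy_pc <- prcomp(toy, center = TRUE, scale. = FALSE)

# fraction of variance explained by each component
toy_var <- toy_pc$sdev^2 / sum(toy_pc$sdev^2)
print(round(toy_var, 3))

# distance between the class means along each component,
# measured in units of that component's standard deviation
class_gap <- abs(colMeans(toy_pc$x[labels == 1, ]) -
                 colMeans(toy_pc$x[labels == 2, ]))
separation <- class_gap / toy_pc$sdev
print(round(separation, 2))
```

In this constructed case the first component explains almost all of the variance, yet the classes are separated only along the second one. Keeping just the first component would throw away the class information.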

There are other techniques related to PCA that attempt to find dimensions of the data that optimally separate the data based on class labels. The most famous is Fisher’s “Linear Discriminant Analysis” (LDA) and its non-linear cousin “Quadratic Discriminant Analysis” (QDA).
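As a rough illustration of the difference, the following sketch computes Fisher’s discriminant direction for two classes by hand and compares it with the first principal component. The simulated data and all variable names are assumptions made for this example, and the hand-rolled formula (inverse of the pooled within-class covariance times the difference of the class means) is only the two-class special case, not a full LDA implementation.

```r
# simulated two-class data (hypothetical): a high-variance feature without
# class information and a low-variance feature that separates the classes
set.seed(1)
n <- 200
labels <- rep(c(1, 2), each = n / 2)
toy <- cbind(noise  = rnorm(n, sd = 10),
             signal = rnorm(n, mean = ifelse(labels == 1, -1, 1), sd = 0.3))

x1 <- toy[labels == 1, ]
x2 <- toy[labels == 2, ]

# pooled within-class covariance matrix
s_within <- ((nrow(x1) - 1) * cov(x1) + (nrow(x2) - 1) * cov(x2)) / (n - 2)

# Fisher's discriminant direction: S_w^{-1} (m1 - m2), scaled to unit length
w_lda <- solve(s_within, colMeans(x1) - colMeans(x2))
w_lda <- w_lda / sqrt(sum(w_lda^2))

# first principal component direction for comparison
w_pca <- prcomp(toy, center = TRUE, scale. = FALSE)$rotation[, 1]

print(round(rbind(lda = w_lda, pca = w_pca), 3))
```

The discriminant direction points almost entirely along the class-separating feature, while the first principal component points along the high-variance feature. The two methods simply optimize different criteria.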


In Part 3 of this series, we have looked at a data set containing a multitude of motion detection measurements of humans doing various activities. We used PCA to find a lower dimensional representation of those measurements that approximate the data well.

Each of the original measurements was quite tangible (despite their sometimes cryptic names) and therefore interpretable. After PCA, we are left with linear combinations of those original features, which may or may not be interpretable. It is far from guaranteed that the eigenvectors correspond to “real” entities; they may just be convenient summaries of the data.

We will rarely be able to say the first principal component means “X” and the second principal component means “Y”, however tempting it may be based on our preconceived notions of the data. A good example of that is mentioned in Cosma Shalizi’s excellent notes on PCA. Cavalli-Sforza et al. analyzed the distribution of human genes using PCA and interpreted the principal components as patterns of human migration and population expansion. Later, Novembre and Stephens showed that similar patterns could be obtained using simulated data with spatial correlation. As humans are genetically more similar to humans they live close to (at least historically), genetic data is necessarily spatially correlated and thus PCA will uncover such structures, even if they do not represent “real” events or are liable to misinterpretation.


Linear algebra tells us that the eigenvectors of a symmetric matrix, such as a covariance matrix, can be chosen to be orthogonal to each other. A set of n orthogonal vectors forms a basis of an n-dimensional subspace. The principal components are eigenvectors of the covariance matrix, and the set of principal components forms a basis for our data. We also say that the principal components are “uncorrelated”. This becomes obvious when we remember that eigendecomposition is sometimes called “diagonalization”. In the eigendecomposition, the matrix containing the eigenvalues has zeros everywhere but on its diagonal, which means that all covariances between the new features are zero.
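Both statements are easy to check numerically. The sketch below uses simulated data with three features (my own assumption, purely for illustration) and verifies that the eigenvectors returned by “prcomp” are orthonormal and that the covariance matrix of the projected data is diagonal.

```r
set.seed(42)
# hypothetical data: two correlated features and one independent feature
z <- rnorm(500)
dat <- cbind(a = z + rnorm(500, sd = 0.5),
             b = 2 * z + rnorm(500, sd = 0.5),
             c = rnorm(500))
pc_demo <- prcomp(dat, center = TRUE, scale. = FALSE)

# eigenvectors are orthonormal: t(W) %*% W is the identity matrix
wtw <- crossprod(pc_demo$rotation)
print(round(wtw, 10))

# the covariance matrix of the projected data is diagonal,
# with the variances (sdev^2) on the diagonal
cov_scores <- cov(pc_demo$x)
print(round(cov_scores, 10))
```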

Variance and covariance are second-order statistics, which means that they involve the second moment or square. Being uncorrelated at second order does not mean that there is no dependence at higher orders; in other words, the absence of correlation does not imply independence. In statistics, the higher-order moments include skewness (asymmetry, third moment) and kurtosis (“tailedness”, fourth moment). Techniques related to PCA such as Independent Component Analysis (ICA) can be used to separate two mixed signals (“independent components”) from each other based on higher-order statistics.
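A classic toy example of “uncorrelated but dependent” takes only a few lines. The variables below are made up for illustration: y is a deterministic function of x, yet their Pearson correlation vanishes because x is symmetric around zero.

```r
# x is symmetric around zero; y is completely determined by x
x <- seq(-1, 1, length.out = 201)
y <- x^2

# the Pearson correlation between x and y is numerically zero ...
r_xy <- cor(x, y)
print(r_xy)

# ... but the dependence becomes visible as soon as we look at
# a non-linear function of x, here its square
r_x2y <- cor(x^2, y)
print(r_x2y)
```

PCA only looks at correlations, so it would treat x and y as unrelated.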

The distinction between correlation and independence is a technical point when it comes to the practical application of PCA but certainly worth being aware of.

Further reading

Cosma Shalizi – Principal Components: Mathematics, Example, Interpretation

Cavalli-Sforza et al. – The History and Geography of Human Genes (1994)

Novembre & Stephens – Interpreting principal component analyses of spatial genetic variation (2008)


Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets


PCA – Part 3: In the Trenches

Now that we have an intuition of what principal component analysis (PCA) is and understand some of the mathematics behind it, it is time we make PCA work for us.

Practical examples of PCA typically use Ronald Fisher’s famous “Iris” data set, which contains four measurements (sepal and petal lengths and widths) of three species of Iris flowers. To mix things up a little bit, I will use a data set that is closer to what you would encounter in the real world.

The “Human Activity Recognition Using Smartphones Data Set” available from the UCI Machine Learning Repository contains a total of 561 triaxial acceleration and angular velocity measurements of 30 subjects performing different movements such as sitting, standing, and walking. The researchers collected this data set to ask whether those measurements would be sufficient to tell the type of activity of the person. Instead of focusing on this classification problem, we will look at the structure of the data and investigate using PCA whether we can express the information contained in the 561 different measurements in a more compact form.

I will be using a subset of the data containing the measurements of only three subjects. As always, the code used for the pre-processing steps of the raw data can be found on GitHub.

Step 1: Explore the data

Let’s first load the pre-processed subset of the data into our R session.

# getURL() comes from the RCurl package
library(RCurl)
# read data from Github
measurements <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_measurements.txt"))
description <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_description.txt"))

It’s always a good idea to check for a couple of basic things first. The big three I usually check are:

  • What are the dimensions of the data, i.e. how many rows and columns?
  • What type of features are we dealing with, i.e. categorical, ordinal, continuous?
  • Are there any missing values?

The answer to those three questions will determine the amount of additional data munging we have to do before we can use the data for PCA.

# what are the dimensions of the data?
dim(measurements)
# what type of data are the features?
table(sapply(measurements, class))
# are there missing values?
sum(is.na(measurements))

The data contains 990 samples (rows) with 561 measurements (columns) each, clearly too many measurements to visualize on a scatterplot. The measurements are all of type “numeric”, which means we are dealing with continuous variables. This is great because categorical and ordinal variables are not handled well by PCA. Those need to be “dummy coded”. We also don’t have to worry about missing values. Strategies for handling missing values are a topic of their own.
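Our data set does not need it, but for completeness, here is a small sketch of what dummy coding could look like in base R. The data frame and its column names are invented for this example; “model.matrix” expands a factor into one indicator column per level when the intercept is dropped.

```r
# hypothetical data frame with one continuous and one categorical feature
df <- data.frame(
  height = c(1.2, 3.4, 2.2, 0.7),
  color  = factor(c("red", "green", "blue", "green"))
)

# "~ color - 1" drops the intercept, so every level of the factor
# gets its own 0/1 indicator column
dummy <- model.matrix(~ color - 1, data = df)

# combine the continuous feature with the indicator columns
numeric_df <- cbind(height = df$height, dummy)
print(numeric_df)
```

The resulting matrix is purely numeric and could be passed on to “prcomp”. Whether PCA on indicator columns is meaningful for a given data set is a separate question.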

Before we run PCA on the data, we should look at the correlation structure of the features. If there are features, i.e. measurements in our case, that are highly correlated (or anti-correlated), there is redundancy within the data set and PCA will be able to find a more compact representation of the data.

# image.plot() comes from the fields package
library(fields)
# feature correlation before PCA
cor_m <- cor(measurements, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_m <- cor_m[upper.tri(cor_m)]
# plot correlations as histogram
hist(upt_m, prob = TRUE)
# plot correlations as image
image.plot(cor_m, axes = FALSE)

The code was simplified for clarity. The full version can be found in the script.

We see in the histogram on the left that there is a considerable number of highly correlated features, most of them positively correlated. Those features show up as yellow in the image representation to the right. PCA will likely be able to provide us with a good lower dimensional approximation of this data set.

Step 2: Run PCA

After all the preparation, running PCA is just one line of code. Remember that we need to at least center the data before using PCA (Why? See Part 2). Scaling is technically only necessary if the magnitudes of the features are vastly different. Note that the data appears to be already centered and scaled from the get-go.

# run PCA
pc <- prcomp(measurements, center = TRUE, scale. = TRUE)

Depending on your system and the number of features of your data this may take a couple of seconds.
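To see why scaling matters when feature magnitudes differ, consider this small sketch with two simulated, unrelated features on very different scales (the data and names are assumptions for illustration only).

```r
set.seed(7)
# hypothetical features on very different scales
feat <- cbind(small = rnorm(100, sd = 1),
              large = rnorm(100, sd = 1000))

# without scaling, the first component is dominated by the large-scale feature
pc_raw <- prcomp(feat, center = TRUE, scale. = FALSE)
print(round(pc_raw$rotation[, 1], 3))

# with scaling, both features enter the first component with equal weight
pc_std <- prcomp(feat, center = TRUE, scale. = TRUE)
print(round(pc_std$rotation[, 1], 3))
```

Without scaling, PCA mostly rediscovers the unit of measurement. With scaling, each feature contributes on an equal footing.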

The call to “prcomp” has constructed new features by linear combinations of the old features and sorted them by the amount of variance they explain. Because the new features are projections of the data onto the eigenvectors of the feature covariance matrix, which are orthogonal, they should be uncorrelated by definition. Let’s visualize this directly.

The new representation of the data is stored as a matrix named “x” in the list object we get back from “prcomp”. In our case, the matrix would be stored as “pc$x”.

# feature correlation after PCA
cor_r <- cor(pc$x, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_r <- cor_r[upper.tri(cor_r)]
# plot correlations as histogram
hist(upt_r, prob = TRUE)
# plot correlations as image
image.plot(cor_r, axes = FALSE)


The new features are clearly no longer correlated to each other. As everything seems to be in order, we can now focus on the interpretation of the results.

Step 3: Interpret the results

The first thing you will want to check is how much variance is explained by each component. In PCA speak, this can be visualized with a “scree plot”. R conveniently has a built-in function to draw such a plot.

# draw a scree plot
screeplot(pc, npc = 10, type = "line")


This is about as good as it gets. A large amount of the variance is captured by the first principal component followed by a sharp decline as the remaining components gradually explain less and less variance approaching zero.

The decision of how many components we should use to get a good approximation of the data has to be made on a case-by-case basis. The cut-offs for the percent explained variance depend on the kind of data you are working with and its inherent covariance structure. The majority of the data sets you will encounter are not nearly as well behaved as this one, meaning that the decline in explained variance is much more shallow. Common cut-offs range from 80% to 95% of explained variance.

Let’s look at how many components we would need to explain a given amount of variance. In the R implementation of PCA, the standard deviations of the principal components are stored in a vector called “sdev”. As the name implies, these are the square roots of the variances, which in turn are scaled versions of the eigenvalues. We will need to take the squares of “sdev” to get back the variances.

# calculate explained variance as cumulative sum
# sdev are the square roots of the variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# plot explained variance
# start the x axis at 0 components so that x matches the component count
plot(0:length(var_expl), c(0, var_expl), type = "l", lwd = 2, ylim = c(0, 1), 
     xlab = "Principal Components", ylab = "Variance explained")
# plot number of components needed for common cut-offs of variance explained
vars <- c(0.8, 0.9, 0.95, 0.99)
for (v in vars) {
    npc <- which(var_expl > v)[1]
    lines(x = c(0, npc, npc), y = c(v, v, 0), lty = 3)
    text(x = npc, y = v - 0.05, labels = npc, pos = 4)
    points(x = npc, y = v)
}


The first principal component on its own explains more than 50% of the variance and we need only 20 components to get up to 80% of the explained variance. Fewer than 30% of the components (162 out of 561) are needed to capture 99% of the variance in the data set. This is a dramatic reduction of complexity. Being able to approximate the data set with a much smaller number of features can greatly speed up downstream analysis and can help to visualize the data graphically.

Finally, let’s investigate whether “variance” translates to “information”. In other words, do the principal components associated with the largest eigenvalues discriminate between the different human activities?

If the class labels (“activities” in our case) are known, a good way to look at the “information content” of the principal components is to look at scatter plots of the first couple of components and color-code the samples by class label. This code gives you a bare bones version of the figure shown below. The complete code can be found on Github.

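# plot the first 8 principal components against each other
# convert the labels to a factor so that as.numeric() yields color codes
activity <- factor(description$activity_name)
for (p in seq(1, 8, by = 2)) {
  plot(pc$x[, p:(p+1)], pch = 16, col = as.numeric(activity))
}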


We have seen previously that the first component alone explains about half of the variance and in this figure we see why. It almost perfectly separates non-moving “activities” (“laying”, “sitting”, “standing”) from moving activities (various types of “walking”). The second component does a reasonable job at telling the difference between walking and walking upstairs. As we move down the list, there remains visible structure but distinctions become somewhat less clear. One conclusion we can draw from this visualization is that it will most likely be most difficult to tell “sitting” apart from “standing” as none of the dimensions seems to be able to distinguish red and green samples. Oddly enough, the fifth component does a pretty good job of separating “laying” from “sitting” and “standing”.


PCA can be a powerful technique to obtain low dimensional approximations of data with lots of redundant features. The “Human Activity Recognition Using Smartphones Data Set” used in this tutorial is a particularly good example of that. Most real data sets will not be reduced to a few components so easily while retaining most of the information. But even cutting the number of features in half can lead to considerable time savings when using machine learning algorithms.

Here are a couple of useful questions when approaching a new data set to apply PCA to:

  1. Are the features numerical or do I have to convert categorical features?
  2. Are there missing values and if yes, which strategy do I apply to deal with them?
  3. What is the correlation structure of the data? Will PCA be effective in this case?
  4. What is the distribution of variances after PCA? Do I see a steep or shallow decline in explained variance?
  5. How much “explained variance” is a good enough approximation of the data? This is usually a trade-off between how much potential information I am willing to sacrifice and how much I want to cut down the computation time of follow-up analyses.

In the final part of this series, we will discuss some of the limitations of PCA.

Addendum: Understanding “prcomp”

The “prcomp” function is very convenient because it calculates all the numbers we could possibly want from our PCA in one line. However, it is useful to know how those numbers were generated.

The three most frequently used objects returned by “prcomp” are

  • “rotation”: right eigenvectors (“feature eigenvectors”)
  • “sdev”: square roots of scaled eigenvalues
  • “x”: projection of original data onto the new features


In Part 2, I mentioned that software implementations of PCA usually compute the eigenvectors of the data matrix using singular value decomposition (SVD) rather than eigendecomposition of the covariance matrix. In fact, R’s “prcomp” is no exception.

“Rotation” is a matrix whose columns are the right eigenvectors of the original data. We can reconstruct “rotation” using SVD.

# perform singular value decomposition on centered and scaled data
sv <- svd(scale(measurements))
# "prcomp" stores right eigenvectors in "rotation"
w <- pc$rotation
dimnames(w) = NULL
# "svd" stores right eigenvectors in matrix "v"
v <- sv$v
# check if the two matrices are equal
all.equal(w, v)


Singular values are the square roots of the eigenvalues as we have seen in Part 2. “sdev” stands for standard deviation and thus stores the square roots of the variances. Thus, the squares of “sdev” and the squares of the singular values are directly proportional to each other and the scaling factor is the number of rows of the original data matrix minus 1.

# relationship between singular values and "sdev"
all.equal(sv$d^2/(nrow(sv$u)-1), pc$sdev^2)


The projection of the original data (“measurements”) onto its eigenbasis is automatically calculated by “prcomp” through its default argument “retx = TRUE” and stored in “x”. We can manually recreate the projection using matrix-matrix multiplication.

# manual projection of data
all.equal(pc$x, scale(measurements) %*% pc$rotation)

If we wanted to obtain a projection of the data onto a lower dimensional subspace, we just determine the number of components needed and subset the columns of matrix “x”. For example, if we wanted to get an approximation of the original data preserving 90% of the variance, we take the first 52 columns of “x”.

# projection of original data preserving 90% of variance
y90 <- pc$x[, 1:52]
# note that this is equivalent to matrix multiplication with 
# the first 52 eigenvectors
all.equal(y90, scale(measurements) %*% pc$rotation[, 1:52])


The full R code is available on Github.

Further reading

IRIS data set

Sebastian Raschka – Principal Component Analysis in 3 Simple Steps



PCA – Part 2: A look behind the curtain

In Part 1 of this series, I provided a non-technical intuition about what principal component analysis (PCA) is and how we can use it to find lower dimensional projections of our data that conserve the majority of the signal.

In doing so, I have explicitly or implicitly made a number of assertions:

  1. PCA is a projection of the data onto a new set of uncorrelated (orthogonal) basis vectors, which turn out to be the eigenvectors of the covariance matrix of the data
  2. The new coordinate system is found by looking for the directions of maximum variance in the data
  3. The eigenvector associated with the largest eigenvalue is the direction of maximum variance
  4. The new basis vectors can be used to project the data onto a lower dimensional subspace that captures the majority of the signal of the data

In the second installment of this series on PCA, I will justify those assertions mathematically. For this I am relying heavily on this excellent resource. Less technical explanations and visual representations can be found in the reference section.

Finding the ideal basis by maximizing the variance

We will first investigate how to find a new set of basis vectors that allows us to re-express the data taking into account its structure. For simplicity, we start with the projection of a single data vector \vec{x}, which we want to project onto a unit vector \hat{w}.

The component of the data vector \vec{x} along the unit vector \hat{w} is given by their dot product \vec{x} \cdot \hat{w}. The result of the dot product is a scalar, so the projected vector \vec{x_{p}} along \hat{w} is a scaled version of \hat{w}.

\vec{x_{p}} = (\vec{x} \cdot \hat{w}) \hat{w}

We can measure the error of the projection as the squared distance of the projected vector \vec{x_{p}} from the original vector \vec{x}. This is also called the residual sum of squares (RSS).

RSS(\hat{w}) = \| \vec{x} - \vec{x_p} \|^2 = \| \vec{x} - (\vec{x} \cdot \hat{w}) \hat{w} \|^2= \| \vec{x} \|^2 - 2 (\vec{x} \cdot \hat{w})^2 + (\vec{x} \cdot \hat{w})^2 \| \hat{w}\|^2

The length of the unit vector \hat{w} is 1 by definition, so the last two terms combine and the equation simplifies to

RSS(\hat{w}) = \| \vec{x} \|^2 - (\vec{x} \cdot \hat{w})^2

If we want to find the unit vector \hat{w} that minimizes the RSS for n vectors \vec{x}, we just sum up the errors from the projections of all \vec{x_{i}} onto \hat{w}.

RSS(\hat{w}) = \sum \limits_{i=1}^{n} \left( \| \vec{x_{i}} \|^2 - (\vec{x_{i}} \cdot \hat{w})^2 \right)

For minimizing RSS with respect to \hat{w} we only care about the terms of the equation that depend on \hat{w}. Separating them from the terms that do not depend on \hat{w} simplifies the problem considerably.

RSS(\hat{w}) = \sum \limits_{i=1}^{n} \| \vec{x_{i}} \|^2 - \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

Notice that the component that depends on \hat{w} has a minus sign. This means that in order to minimize RSS, we need to maximize

\sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

Since n does not depend on \hat{w} it is equivalent to maximize

\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

which is the sample mean of the square of the dot product between \vec{x} and \hat{w}. As we are not dealing with weighted averages, the sample mean is the same as the expected value (E). Recall that the mean of the square minus the square of the mean equals the variance

Var(z) = E(z^2) - [E(z)]^2

If we substitute (\vec{x_{i}} \cdot \hat{w}) for z and rearrange, we get

\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2 = [\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})]^2 + Var(\vec{x_{i}} \cdot \hat{w})

Now, let’s assume that all the vectors \vec{x} are centered, so that their mean is 0. (If you have ever asked yourself why it is necessary to center the data before using PCA, this is why!) Centering means that their projections onto \hat{w} also sum up to 0 and we are left with

Var(\vec{x_{i}} \cdot \hat{w}) = \frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

Finding the unit vector \hat{w} that minimizes the projection error of the data vectors \vec{x_{i}} onto \hat{w} can be achieved by maximizing the variance of the dot products of \vec{x_{i}} and \hat{w}.
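This equivalence can be checked numerically. The following sketch uses simulated two-dimensional data (an assumption made purely for illustration) and scans unit vectors at different angles. For each of them, the residual sum of squares and the sum of squared projections add up to the same constant, so minimizing one maximizes the other.

```r
set.seed(3)
# hypothetical centered data: 100 observations of 2 correlated features
X <- cbind(rnorm(100), rnorm(100))
X[, 2] <- X[, 1] + 0.5 * X[, 2]
X <- scale(X, center = TRUE, scale = FALSE)

# residual sum of squares and sum of squared projections for a unit vector w
rss <- function(w) sum((X - (X %*% w) %*% t(w))^2)
proj_ss <- function(w) sum((X %*% w)^2)

# unit vectors at angles between 0 and 180 degrees (one per column)
angles <- seq(0, pi, length.out = 181)
W <- rbind(cos(angles), sin(angles))
rss_vals <- apply(W, 2, rss)
proj_vals <- apply(W, 2, proj_ss)

# RSS plus the sum of squared projections is constant ...
total_ss <- sum(X^2)
print(range(rss_vals + proj_vals - total_ss))

# ... so the angle that minimizes the RSS maximizes the projected variance
print(which.min(rss_vals) == which.max(proj_vals))
```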


It makes intuitive sense that the dimension associated with the highest variance within the data has the potential to contain the most “information”. To be fair, there is also the most potential for noise. The math tells us that maximizing the variance is equivalent to minimizing the projection error, which makes even more intuitive sense.

The ideal basis for projection is the eigenbasis

The previous section established that we need to maximize the variance of \vec{x} \cdot \hat{w} to find the vector \hat{w} that minimizes the projection error. In this section we will go through the steps of the optimization problem and see that the set of unit vectors \hat{w} that maximize the variance upon projection of the data matrix \boldsymbol X are the eigenvectors of the feature covariance matrix \boldsymbol Q of \boldsymbol X.

It is convenient to transition from summation notation to matrix notation at this point, so let’s express the variance in matrix notation. If we stack up all our data vectors \vec{x} as rows into an n \times p matrix \boldsymbol X, the matrix-vector product \boldsymbol X \boldsymbol w gives us the projections of all vectors onto \hat{w}.

\sigma_{\hat{w}}^2 = \frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2=\frac{1}{n} (\boldsymbol X \boldsymbol w)^T \boldsymbol X \boldsymbol w

The p \times p feature covariance matrix \boldsymbol Q is

\boldsymbol Q=\frac{1}{n}\boldsymbol X^T\boldsymbol X

We can see that simplification and rearrangement yields the following expression for the variance.

\sigma_{\hat{w}}^2=\frac{1}{n}\boldsymbol w^T\boldsymbol X^T\boldsymbol X\boldsymbol w=\boldsymbol w^T\frac{\boldsymbol X^T\boldsymbol X}{n}\boldsymbol w=\boldsymbol w^T\boldsymbol Q\boldsymbol w

In order to set up the optimization problem, we define a function f we want to maximize. In our case it is just the variance.

f(\boldsymbol w) = \boldsymbol w^T\boldsymbol Q\boldsymbol w

As we are not looking for all possible vectors \boldsymbol w but only unit vectors \boldsymbol w we set up the following constraint \|\boldsymbol w\|^2 = \boldsymbol w^T\boldsymbol w=1 and define the constraint function g.

g(\boldsymbol w) = \boldsymbol w^T \boldsymbol w = 1

Rearrangement and addition of the Lagrange multiplier \lambda results in the Lagrange function u(\boldsymbol w, \lambda) = f(\boldsymbol w) - \lambda(g(\boldsymbol w) - 1), which we want to optimize.

u=\boldsymbol w^T\boldsymbol Q\boldsymbol w - \lambda (\boldsymbol w^T\boldsymbol w - 1)

Partial differentiation of u with respect to \boldsymbol w results in

\frac{\partial u}{\partial \boldsymbol w} = 2 \boldsymbol Q\boldsymbol w - 2 \lambda \boldsymbol w = 0

\boldsymbol Q\boldsymbol w =\lambda\boldsymbol w

This is the eigenvector equation for the feature covariance matrix \boldsymbol Q.


The desired vector \boldsymbol w is an eigenvector of the feature covariance matrix \boldsymbol Q. Among the p eigenvectors, the eigenvector with the largest eigenvalue will be the vector that captures the maximum variance or, equivalently, minimizes the residual sum of squares. It is the first principal component.
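As a quick sanity check, the sketch below builds a covariance matrix from simulated data (all names and the data are assumptions for illustration), confirms the eigenvector equation for the first eigenpair, and compares the variance along the first eigenvector with the variance along 1000 random unit vectors.

```r
set.seed(11)
# hypothetical centered data matrix with 4 correlated features
X <- matrix(rnorm(200 * 4), ncol = 4) %*% matrix(runif(16), ncol = 4)
X <- scale(X, center = TRUE, scale = FALSE)
n <- nrow(X)
Q <- crossprod(X) / n   # feature covariance matrix, using 1/n as in the text

eig <- eigen(Q, symmetric = TRUE)
w1 <- eig$vectors[, 1]

# Q w = lambda w holds for the first eigenpair
print(all.equal(as.vector(Q %*% w1), eig$values[1] * w1))

# the variance along w1 equals the largest eigenvalue
var_w1 <- as.numeric(t(w1) %*% Q %*% w1)

# random unit vectors (one per column) never capture more variance
rand_w <- matrix(rnorm(4 * 1000), nrow = 4)
rand_w <- sweep(rand_w, 2, sqrt(colSums(rand_w^2)), "/")
var_rand <- colSums(rand_w * (Q %*% rand_w))
print(c(first_eigenvector = var_w1, best_random = max(var_rand)))
```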

Sorting the eigenvector/eigenvalue pairs by eigenvalue from largest to smallest, we obtain all p principal components. Because \boldsymbol Q is symmetric, its eigenvectors are orthogonal to each other and form a basis of the p-dimensional feature space.

The sum of the eigenvalues is the sum of the variance described by each component

\sigma_{total}^2 = \sum \limits_{i=1}^p \lambda_{i}

We can now determine the fraction of variance explained by q principal components where q \leq p

\sigma_{explained}^2 / \sigma_{total}^2 = \sum \limits_{i=1}^q \lambda_{i} / \sum \limits_{i=1}^p \lambda_{i}
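In code, the fraction of explained variance is just a cumulative sum over the sorted eigenvalues. Here is a minimal sketch on simulated data (the data and the 90% cut-off are arbitrary choices for illustration).

```r
set.seed(5)
# hypothetical data matrix with 5 correlated features
X <- matrix(rnorm(300 * 5), ncol = 5) %*% matrix(runif(25), ncol = 5)

# eigenvalues of the covariance matrix, sorted from largest to smallest
lambda <- eigen(cov(X), symmetric = TRUE)$values

# fraction of variance explained by the first q components, for q = 1..p
frac_explained <- cumsum(lambda) / sum(lambda)
print(round(frac_explained, 3))

# smallest number of components q that explains at least 90% of the variance
q90 <- which(frac_explained >= 0.9)[1]
print(q90)

# the total variance equals the sum of the individual feature variances
total_var <- sum(diag(cov(X)))
```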

Stacking our eigenvectors \boldsymbol w as columns into a p \times p matrix gives us the weight matrix \boldsymbol W. The connection to the eigendecomposition can be seen clearly now.

\boldsymbol \Lambda =\boldsymbol W^T\boldsymbol Q\boldsymbol W\propto \boldsymbol W^T \boldsymbol X^T \boldsymbol X \boldsymbol W

Rearranging yields the classical form of the eigendecomposition of a square matrix.

\boldsymbol X^T \boldsymbol X = \boldsymbol W \boldsymbol \Lambda \boldsymbol W^T

Relationship to singular value decomposition

Software implementations of PCA often use a matrix factorization called the singular value decomposition (SVD) to compute the eigenvector/eigenvalue pairs of a matrix. SVD is closely related to eigendecomposition but offers several advantages: it is numerically more accurate, it can operate on matrices that are not square, and it does not require the computation of the feature covariance matrix \boldsymbol Q.

SVD decomposes an n \times p matrix \boldsymbol X into three components: an n \times n matrix \boldsymbol U, a rectangular diagonal n \times p matrix \boldsymbol \Sigma, and a p \times p matrix \boldsymbol V. \boldsymbol U and \boldsymbol V are both unitary (for real-valued data, orthogonal) matrices, meaning that their transpose is also their inverse. The columns of the matrix \boldsymbol U contain the “left singular vectors” and the columns of \boldsymbol V the “right singular vectors”. The diagonal elements of \boldsymbol \Sigma are called the “singular values”, which are related to the eigenvalues.

\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T

We can think of SVD as breaking down the linear transformation performed by \boldsymbol X into a rotation (\boldsymbol V^T), a scaling (\boldsymbol \Sigma), and another rotation (\boldsymbol U).

The relationship to the eigendecomposition becomes clear when we substitute the SVD of \boldsymbol X into \boldsymbol X^T \boldsymbol X

\boldsymbol X^T \boldsymbol X = (\boldsymbol U \boldsymbol \Sigma \boldsymbol V^T)^T \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T = \boldsymbol V \boldsymbol \Sigma^T \boldsymbol U^T \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T= \boldsymbol V \boldsymbol \Sigma^2 \boldsymbol V^T

Recall the eigendecomposition \boldsymbol X^T \boldsymbol X = \boldsymbol W \boldsymbol \Lambda \boldsymbol W^T. It is evident that the eigenvalues are the squares of the singular values

\boldsymbol \Lambda = \boldsymbol \Sigma^2

or, equivalently, the singular values are the square roots of the eigenvalues.

\boldsymbol \Sigma = \sqrt{\boldsymbol \Lambda}


Eigendecomposition and SVD are highly related techniques. Both can be used to calculate eigenvectors and eigenvalues of a matrix with the exception that eigendecomposition requires a square matrix and SVD does not.
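The relationship is easy to confirm in R with the base functions “eigen” and “svd”. The data matrix below is simulated and only serves as an illustration. Note that eigenvectors and singular vectors are only defined up to their sign, which is why the comparison uses absolute values.

```r
set.seed(2)
# hypothetical centered data matrix: 50 observations, 3 features
X <- scale(matrix(rnorm(50 * 3), ncol = 3), center = TRUE, scale = FALSE)

# eigendecomposition of the square matrix t(X) %*% X
eig <- eigen(crossprod(X), symmetric = TRUE)
# singular value decomposition of the rectangular matrix X
sv <- svd(X)

# the eigenvalues are the squares of the singular values
print(all.equal(eig$values, sv$d^2))

# eigenvectors and right singular vectors agree up to their sign
print(all.equal(abs(eig$vectors), abs(sv$v)))
```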

Projection of the data onto the new basis

One of the promises of PCA was that it finds a rotation of the data that is more “natural”. So far we have only identified the new coordinate system (the eigenvectors) and which directions are the most variable (based on the eigenvalues). We have not yet put the data into the new frame of reference. This process is called “projection” or “transformation”.

The projection of the data matrix \boldsymbol X onto its eigenbasis \boldsymbol W is given by

\boldsymbol Y_{n \times p} = \boldsymbol X_{n \times p} \boldsymbol W_{p \times p}

If we want to project the data on a subspace of \boldsymbol W, we just use the first q columns of \boldsymbol W. Again we choose q by how much of the variance we want to conserve.

\boldsymbol Y_{n \times q} = \boldsymbol X_{n \times p} \boldsymbol W_{p \times q}

Maybe you have determined the eigenvectors and eigenvalues of \boldsymbol X using SVD. If we substitute \boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T into the previous equation, we get

\boldsymbol Y = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T \boldsymbol W

\boldsymbol V and \boldsymbol W are both p \times p matrices of the right eigenvectors of \boldsymbol X.

\boldsymbol Y_{n \times p} = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T \boldsymbol V = \boldsymbol U_{n \times n} \boldsymbol \Sigma_{n \times p}

Alternatively, we could just use \boldsymbol V in place of \boldsymbol W

\boldsymbol Y_{n \times p} = \boldsymbol X_{n \times p} \boldsymbol V_{p \times p}

Using only q components to project onto a lower dimensional subspace yields

\boldsymbol Y_{n \times q} = \boldsymbol U_{n \times n} \boldsymbol \Sigma_{n \times q}

\boldsymbol Y_{n \times q} = \boldsymbol X_{n \times p} \boldsymbol V_{p \times q}
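Both routes to the projection can be compared directly. The sketch below uses a simulated data matrix (an assumption for illustration) and R’s “svd”, which by default returns the “thin” version of \boldsymbol U with as many columns as there are singular values.

```r
set.seed(9)
# hypothetical centered data matrix: 40 observations, 4 features
X <- scale(matrix(rnorm(40 * 4), ncol = 4), center = TRUE, scale = FALSE)
sv <- svd(X)

# full projection: Y = X V is the same as Y = U Sigma
Y_xv <- X %*% sv$v
Y_us <- sv$u %*% diag(sv$d)
print(all.equal(Y_xv, Y_us, check.attributes = FALSE))

# projection onto the first q components only
q <- 2
Y_q_xv <- X %*% sv$v[, 1:q]
Y_q_us <- sv$u[, 1:q] %*% diag(sv$d[1:q])
print(all.equal(Y_q_xv, Y_q_us, check.attributes = FALSE))
```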


When applying PCA in practice, we let the software do the heavy lifting, i.e. let it figure out the eigenvectors and eigenvalues of the data matrix \boldsymbol X. Once we have the eigenvector/eigenvalue pairs, obtaining the projection onto the new basis is a simple matrix-matrix product.

Choosing how many components q to use for the projection of your data is the real work for you as a user of PCA. We will face this challenge in the next part of this series, which will cover the practical application of PCA using R.

Further reading


Lindsay Smith – A Tutorial on Principal Components Analysis

Jeff Jauregui – Principal component analysis with linear algebra

Cosma Shalizi – Principal Components – Mathematics, Example, Interpretation

Visual Explanation

George Dallas – Principal Component Analysis 4 Dummies

Sebastian Raschka – Principal Component Analysis in 3 Simple Steps


PCA – Part 1: An intuition

Pee-See-Ay. Just the sound of these three letters inspires awe in an experimental biologist. I am speaking from personal experience. Let’s grab a hammer and chisel and get to work on that pedestal.

In this first post of a series on principal component analysis (PCA), my goal is to avoid any mathematical formalism and provide an intuition about what PCA is and how this technique can be useful when analyzing data.

In follow-up posts, I will explain the mathematical basis of PCA, show practical examples of applying PCA using R, and discuss some points to keep in mind when using PCA.

When to do PCA?

Most explanations of PCA start out with what it is mathematically and how it can be calculated. We will get to that later. I usually find it more instructive to first understand in which situations a method like PCA should be applied. Once I know what the practical applications are, the mathematics tend to make a lot more sense.

My own work is in genomics, so the first example I can think of is generally related to gene expression. Imagine a number of patient samples for which you have measured the transcript levels of all expressed genes. The number of patient samples (<100) is usually much lower than the number of transcripts you measure (>10000). In other words, the number of observations (samples) is smaller than the number of features (transcripts).

For most types of analyses, especially machine learning algorithms, this is not a favorable starting point. When faced with a large number of features, it is safe to assume that a good number of them contain very little information, such as measurements on genes that are not differentially expressed, or that there is redundancy (collinearity) between features, e.g. genes that vary together. In any event, too many features increase computation time and collinear features can even produce unstable results in some algorithms.

In such a situation, PCA can be extremely useful because one of its strengths is to optimize the features based on the “hidden” structure of the data. PCA aims to find a new representation of the data by extracting combinations of features (components) from the data that are uncorrelated with each other and ordered by “importance”. This allows us to approximate the data with a reduced number of features. Dimensionality reduction makes PCA a valuable tool not only for preparing data for machine learning but also for exploratory data analysis and data visualization.

What does PCA do?

Mathematically, the new features PCA provides us with are linear combinations of the old features. That means each of the old features weighs in – to different extents – to make up the new features. In PCA jargon, the weights of the original features are called “loadings”. It is a little bit like saying, take 1 cup of flour, 1 egg, 1 cup of milk, 3 teaspoons of baking powder, 1 teaspoon of sugar, and 1 teaspoon of salt and call it “pancake”. So our new feature (pancake) is made up of a combination of different amounts (the “loadings”) of the old features (the ingredients).

An important aspect of PCA is that the new features are “orthogonal” to each other, which is a more general way of expressing what I have called “uncorrelated” before. Orthogonality is a concept of linear algebra, which defines two vectors as orthogonal if their dot product is 0. It is related to “perpendicular”, which is commonly understood as having a “right” angle between two entities, such as the sides of a square or the x- and y-axis of our Cartesian coordinate system. In fact, you can think of PCA as a rotation of the data into a new coordinate system that “fits” the data more naturally. The axes of the new coordinate system are the eigenvectors of the covariance (or correlation) matrix of the data. Eigenvectors are vectors particular (“eigen”) to a given matrix that do not change direction under linear transformation by that matrix. They may, however, change length and/or orientation. The amount of scaling under linear transformation is called the eigenvalue. Each eigenvector of a matrix is associated with its own eigenvalue.
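To make the definition a little more tangible, here is a small sketch in R using a made-up symmetric 2 x 2 matrix (the matrix is purely illustrative). It checks that multiplying an eigenvector by the matrix only scales it by its eigenvalue, and that the two eigenvectors are orthogonal.

```r
# A small symmetric matrix (illustrative)
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)
v1 <- e$vectors[, 1]      # first eigenvector
lambda1 <- e$values[1]    # its eigenvalue

A %*% v1                  # transformed vector ...
lambda1 * v1              # ... equals the eigenvector scaled by its eigenvalue

# eigenvectors of a symmetric matrix are orthogonal: dot product is 0
dot <- sum(e$vectors[, 1] * e$vectors[, 2])
dot
```

Covariance and correlation matrices are always symmetric, which is why the components PCA returns are orthogonal to each other.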

How does PCA work?

In general, a square matrix with m rows and columns has m eigenvector/eigenvalue pairs. How does PCA know which ones are the most important? The key assumption of PCA is that the directions within the data that show the most variance contain the most information and, thus, are likely the most important. We find the directions of highest variance by eigendecomposition of the covariance or correlation matrix of the data. The eigenvector with the largest eigenvalue explains the most variance. It is also called the first “principal component”. Further components explain progressively less variance and are generally considered less “important”. Together, all components explain the total variance of the data, so nothing is lost by applying the PCA itself. The original data can be reconstituted by rotating back to the original feature space.
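The following sketch illustrates these statements on R's built-in iris measurements. The data set is my choice for illustration and not one used in this series. It compares a manual eigendecomposition of the covariance matrix with the result of prcomp(), checks that the eigenvalues add up to the total variance, and rotates the data back.

```r
X <- as.matrix(iris[, 1:4])        # four numeric features of a built-in data set

e <- eigen(cov(X))                 # eigendecomposition of the covariance matrix
pca <- prcomp(X, center = TRUE, scale. = FALSE)

e$values                           # variance along each component (sorted)
pca$sdev^2                         # same numbers from prcomp()

sum(e$values)                      # total variance captured by all components
sum(diag(cov(X)))                  # total variance of the original features

# rotate into the new coordinate system and back again
Xc <- scale(X, center = TRUE, scale = FALSE)
Y <- Xc %*% e$vectors
X_back <- Y %*% t(e$vectors)       # recovers the centered data
```

The signs of individual eigenvectors may differ between eigen() and prcomp(). This does not matter, because an eigenvector multiplied by -1 is still an eigenvector.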

Now we are in a position to understand why PCA can be used for dimensionality reduction. If the first couple of eigenvalues are much larger than the rest, they capture the majority of the variance, and a projection of the data onto the first couple of components will result in a good approximation of the original data. We implicitly assume that we have concentrated the “signal” in the first couple of components, and the rest capture merely the “noise” of the data. There is no strict cut-off for what constitutes the majority of variance. Depending on the field of research, people use numbers between 60% and 95%.
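As a rough sketch of how such a cut-off could be applied, the code below computes the cumulative proportion of variance explained and picks the smallest number of components q that reaches a threshold. Both the built-in USArrests data set and the 90% threshold are assumptions chosen for illustration.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # proportion per component
cum_var <- cumsum(var_explained)                # cumulative proportion

threshold <- 0.9                                # hypothetical cut-off
q <- which(cum_var >= threshold)[1]             # smallest q reaching the cut-off

Y_q <- pca$x[, 1:q, drop = FALSE]               # data projected onto q components
```

Calling summary(pca) prints the same proportions in tabular form, which is often the quickest way to eyeball where the variance levels off.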


  • PCA is a data analysis method that uses linear combinations of existing feature vectors to re-express your data in terms of uncorrelated, orthogonal components.
  • The components are selected and sorted based on how much variance they explain within the data. Together, all components explain the total variance.
  • PCA can be used to reduce the number of features within a data set by choosing only those components that explain the majority of variance and projecting the data onto a lower dimensional subspace. This can result in a good approximation of the original data.


Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets


A closer look at the fisherman’s dilemma

In my previous post I defined the dangers of candidate selection in high-throughput screens as the “fisherman’s dilemma”. I have argued that our preconceived notions of how a biological system “should” behave, i.e. our inherent scientific bias, and a disproportionate focus on p-values contribute to the frequent failure of high-throughput screens to yield tangible or reproducible results.

Be conservative about p-values

Today, I would like to take a closer look at the relationship between the p-value and the positive predictive value (PPV), also known as precision or the posterior probability that a hit is a true hit. Despite its fancy name, the PPV is just the ratio of true positives (TP) over the sum of true positives and false positives (FP):

PPV = TP / (TP + FP)

The number of true positives is determined by the prior probability of there being a hit \pi and the statistical power 1-\beta, which is our ability to detect such a hit. Power is the complement of the type II error rate \beta.

TP = (1 - \beta) \pi

The number of false positives depends on the false positive rate (type I error rate) \alpha and the prior probability of there being no hit 1-\pi.

FP = \alpha (1 - \pi)

Putting these two equations together, we get:

PPV = (1 - \beta) \pi / [ (1 - \beta) \pi + \alpha (1 - \pi) ]
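The equation translates directly into a few lines of R. The function name ppv and the example values below are my own choices for illustration and are not taken from the code on GitHub.

```r
# PPV as a function of significance level, power and prior probability
ppv <- function(alpha, power, prior) {
  tp <- power * prior          # TP = (1 - beta) * pi
  fp <- alpha * (1 - prior)    # FP = alpha * (1 - pi)
  tp / (tp + fp)
}

# same significance level, very different PPVs
ppv(alpha = 0.05, power = 0.8, prior = 0.5)   # well-powered, likely hypothesis
ppv(alpha = 0.05, power = 0.2, prior = 0.05)  # under-powered long shot
```

The first call gives a PPV of about 0.94, the second one of about 0.17, even though both use the same significance level.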

From this equation it is evident that just focusing on the significance level \alpha can lead to vastly different PPVs depending on where on the spectrum of prior probability and power we operate.

For the purpose of illustration, I have plotted the PPV for four commonly used significance levels 0.1, 0.05, 0.01, and 0.001. Green means higher PPV, and red means lower PPV. The black contour line shows where the PPV is 0.5, that is half of our hits are predicted to be false positives. For the optimists among us, half of our hits would likely be true positives.

From this figure it is clear that a p-value cut-off of 0.05 only works in situations of high prior probability and high power. I have marked the domain of high-throughput screens (HTS) rather generously at up to 0.25 prior probability and 0.25 power. Due to small sample sizes (low power) and the fact that any given perturbation is unlikely to have an effect on the majority of cellular components (low prior), most high-throughput screens operate in a space even closer to the origin in the deeply red area.

On the flip-side, this analysis tells us that if we are a little more conservative in what we call a hit, in other words if we lower the p-value cut-off to let’s say 0.001 or lower, we improve our chances of identifying true positives quite dramatically. Unless the high-throughput screen is plagued by terribly low power and prior probability, we actually have a chance that the majority of hits are true positives.
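To put numbers on this, the sketch below evaluates the PPV equation from above for the four significance levels at one point deep inside the HTS domain. The choice of 10% power and 10% prior probability is a hypothetical example, and ppv is just a helper name I made up.

```r
ppv <- function(alpha, power, prior) {
  power * prior / (power * prior + alpha * (1 - prior))
}

alphas <- c(0.1, 0.05, 0.01, 0.001)

# hypothetical screen with 10% power and 10% prior probability
res <- data.frame(alpha = alphas,
                  ppv = round(ppv(alphas, power = 0.1, prior = 0.1), 2))
res
```

Under these assumptions the PPV climbs from 0.1 at a cut-off of 0.1 to 0.92 at a cut-off of 0.001, while the customary 0.05 yields only 0.18.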

Keep your guard up

In genomics, p-values often originate from statistical tests like t-tests comparing a number of control samples to a number of treatment samples. If the values obtained upon treatment have a low probability of having originated from the control (null) distribution, we say the treatment has a “significant” effect. The t-statistic takes into account the difference of the means between control and treatment distributions and their variances and combines them into a single value, the t-value, from which the p-value is calculated.

In situations when we have small sample sizes, such as in high-throughput screens, it can happen that by chance either the control or treatment values cluster together closely. This results in either a very narrow control or treatment distribution. Due to the confounding of effect size and precision in the t-value, even tiny effects can end up called “significant” as long as the variance is small enough. This is a common problem in microarrays with small sample numbers.
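A toy example with made-up numbers shows the effect. The first comparison has a tiny difference between means but replicates that happen to lie very close together. The second has a twenty-fold larger difference with ordinary spread between replicates.

```r
# tiny effect, but the three replicates happen to cluster very tightly
control_a   <- c(1.00, 1.01, 0.99)
treatment_a <- c(1.05, 1.06, 1.04)

# twenty-fold larger effect, but ordinary spread between replicates
control_b   <- c(0.5, 1.0, 1.5)
treatment_b <- c(1.5, 2.0, 2.5)

p_a <- t.test(treatment_a, control_a)$p.value
p_b <- t.test(treatment_b, control_b)$p.value

c(tiny_effect = p_a, large_effect = p_b)
```

With these values the tiny effect passes a cut-off of 0.05 while the much larger effect does not, which is exactly the kind of hit we should be suspicious of.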

Fortunately, there are quite effective solutions for this problem. Genes with low p-values and small effect sizes can be identified using a volcano plot, which displays effect size against the negative logarithm of the p-value.
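A bare-bones volcano plot can be sketched in base R as follows. The data is simulated noise, and the cut-offs used to flag suspicious genes (p-value below 0.05, absolute effect below 1) are arbitrary assumptions for illustration only.

```r
set.seed(42)
n_genes <- 500

# simulated log2 expression: 3 control and 3 treatment samples per gene
control   <- matrix(rnorm(n_genes * 3), nrow = n_genes)
treatment <- matrix(rnorm(n_genes * 3), nrow = n_genes)

effect <- rowMeans(treatment) - rowMeans(control)          # effect size
p_val  <- sapply(seq_len(n_genes), function(i) {
  t.test(treatment[i, ], control[i, ])$p.value
})

# flag "significant" genes whose effect size is small (cut-offs are arbitrary)
suspicious <- p_val < 0.05 & abs(effect) < 1

plot(effect, -log10(p_val),
     col = ifelse(suspicious, "red", "grey40"), pch = 16,
     xlab = "effect size (difference of means)",
     ylab = "-log10(p-value)")
abline(h = -log10(0.05), lty = 2)
```

Genes that sit high up but close to the vertical center line of the plot combine a low p-value with a small effect and deserve a second look.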

Increasing the sample size and/or using a Bayesian correction of the standard deviation as it is implemented in Bioconductor’s “limma” package for microarray analysis can help to ameliorate this problem.

Possible ways out of the fisherman’s dilemma

  • High-throughput screens in biomedical research usually operate in a domain of low power and low prior probability. Based on your estimate of power and prior probability, use a more conservative p-value cut-off than 0.05.
  • In addition to choosing the significance level \alpha based on the power and prior probability of your study, be wary of low p-values of hits with small effect sizes or apply corrections if possible.
  • Try to increase the power of your experiment by increasing sample size, or better by decreasing measurement error if possible.
  • The prior probability of having an effect is determined by nature and out of our control. We need to be aware of the possibility, however, that the prior probability is very low or even zero. In that case, it would be very hard or impossible to find a true positive.
  • Ditch the p-value and use a Bayesian approach.


The full R code can be found on GitHub.


On the dangers of fishing expeditions

Wherever you look, people do high-throughput screens. They are commonly referred to as hypothesis-generating experiments or, somewhat affectionately, “fishing expeditions”. The concept is alluring: Answer a broad scientific question in one fell swoop making use of recent technological advances in mass spectrometry, next-generation sequencing, and automated microscopy.

Months or even years later people present the results of their high-throughput screens and more often than not they struggle to distill concrete findings from their data.

A high-throughput screen may fail to produce interpretable results for a number of reasons.

  • Was the scientific question too broad to be answered? (Did we cast the fishing net too widely?)
  • Was the technology sensitive enough to pick up the expected differences? (Was the mesh of our fishing net too coarse?)
  • Should we have used a counterscreen? (What are all those dolphins doing in our net?)
  • Was the experimental design flawed? (Should we have packed a second net?)
  • Is there something to find out at all? (Are there any fish to catch?)

All of those questions are project-specific and some are beyond our control. Oftentimes, we realize what we should have done post factum.

So what are we supposed to do once the painstakingly acquired data awaits analysis on our hard drives? High-throughput screens typically yield intriguing clues by the bucket but the challenge lies in weeding out the dead ends. Or to stay with our fishing metaphor: What do we call a fish and what an old boot?

The standard approach followed by most people is to rely on a combination of “statistical significance” and “domain knowledge”. This sounds like objective theory married to a healthy dose of good old common sense. What could possibly go wrong?

Despite being appealing in theory, in practice this combination often fails to identify the right candidates for follow-up experiments. Or worse, it prevents you from realizing that the screen itself was a failure and you should spend your precious time and money on something else.

The reason for this phenomenon is partly to be found in the misuse of statistical theory and in the conscious or unconscious application of scientific bias. On top of that, their combination can lead to reinforcement of wrong choices that quickly send you on your way to no-man’s land.

The overvalued p-value

John Ioannidis’ piece on “Why most published research findings are false” caused quite a stir in the scientific community and, maybe even more so, in the pharmaceutical industry. His main point of contention is that the post-study probability that a hit is true does not only depend on the type I (false positive) error rate α (the significance level), but also on the type II (false negative) error rate β and the prevalence R of true hits. In Bayesian statistics, the prevalence would be called the prior probability.

Most statistics software packages are p-value generating machines. You feed them data, and they spit out p-values. If the p-value is below a certain threshold, commonly set at 0.05, we accept it as a hit. We all know that. Simple enough.

The p-value is the probability that a value as extreme or more extreme is generated by a statistical model of our choice. The argumentation that a low p-value supports our hypothesis follows the classical straw man fallacy. We construct a straw man called the null hypothesis H0, show that our data is unlikely to be generated from H0 (the p-value reflects the probability), and conclude that by getting rid of our straw man, the alternative hypothesis H1, which happens to be our pet hypothesis, must be true.

The significance level α is the cut-off we somewhat arbitrarily set to 0.05. This means that under the null hypothesis H0 you still obtain a value as extreme or more extreme just by chance 5% of the time. Even in the best of cases, you would be wrong one out of twenty times. When hundreds or thousands of hypothesis tests are performed, which is ordinarily the case in modern genomics, this problem becomes so severe that it has to be addressed by a multiple testing correction. The fact that the specificity or true negative rate of an experiment is 1 – α further hints at the fact that the significance level has less to do with true hits but more with true “misses”. It is a little bit like asking what the probability is that you catch nothing if there is no fish. On its own it is certainly not a good predictor of whether your hit is true or not.
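The multiple testing problem is easy to simulate. In the sketch below every test compares two samples drawn from the same distribution, so any hit is a false positive by construction. The Benjamini-Hochberg correction is just one common choice that I picked for illustration, the numbers of tests and replicates are arbitrary.

```r
set.seed(123)
n_tests <- 1000

# every test compares two samples drawn from the same distribution,
# so each "hit" is a false positive by construction
p_val <- replicate(n_tests, t.test(rnorm(5), rnorm(5))$p.value)

sum(p_val < 0.05)                          # hits without correction

p_bh <- p.adjust(p_val, method = "BH")     # Benjamini-Hochberg correction
sum(p_bh < 0.05)                           # hits after correction
```

Without correction we would expect somewhere around 50 of the 1000 tests to fall below 0.05 by chance alone. The correction can only reduce that number, never increase it.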

So, what is the function of the other two components that influence the posterior probability that a hit is true?

The complement of the type II error rate β is called the statistical power (1 – β) of a study design. It determines our ability to detect a hit if it is true (true positive). In other words, it is the probability that you catch something if there is a fish. We traditionally aim for a power of 0.8, which means that we expect to detect 80% of the true hits and to miss the remaining 20% (false negatives). Ideally, we would want the power to be even closer to 1 but as power depends on the sample number, it is often too expensive or too time consuming to have arbitrarily high power. Conversely, if an experiment has low power, few true hits are detected and a larger share of what we call hits are likely to be false positives. Statistical power is the same as the sensitivity or true positive rate of the experiment. In machine learning, it is known as recall.

Prevalence is the probability of there being a true hit before you even start the experiment. It is the probability that there are fish where you choose to cast your net. Intuitively, it makes sense that this number could make the difference between success and failure. In the realm of frequentist statistics, prevalence is doomed to live a life in the nether world. The reason for this is that prevalence is not a quantity that can be estimated from the data at hand but must either be derived from experience or “guessed”. However, the influence on the posterior probability that a hit is true can be huge. Even in a situation of relatively high prevalence, let’s say 50%, a p-value of 0.05 corresponds to a probability of about 0.29 that the hit is a false positive. This means that in about 1/3 of the cases called significant, we are dealing with false positives.

How does all of this relate to high-throughput screens? By focusing exclusively on p-values we implicitly assume high power and high prevalence. Neither is typically true in high-throughput settings in modern biological research. Due to the high costs of such experiments, sample sizes are typically low and the differences we aim to detect are small. Both negatively affect statistical power. The prevalence is typically much less than 50%, more likely to be in the range of around 10%. We would not necessarily expect that upon some treatment more than 10% of genes are differentially expressed or that more than 10% of the phosphorylation events within a cell change, would we? A prevalence of 10% means that a p-value of 0.05 has an 89% chance of being a false positive. That is scary!

Conscious and unconscious bias

As human beings we all have preformed opinions and preferences that originate from our very own set of experiences. As scientists are humans too, we are no exception. Here is a fun experiment to try out for yourself. Generate a random list of 100 proteins or genes, take it to three principal investigators from different fields of biology, and tell them that they have the list of hits from your latest high-throughput screen fresh from the printer. It is not unlikely that you will walk out of their offices with three coherent but completely different stories of how the result of your screen could be interpreted in their respective field of research.

Modern biological research has only recently transitioned from a data-poor to a data-rich field. Most of us are trained to make decisions on limited information, fill in the blank spots creatively, and test the resulting hypothesis experimentally. How we frame our hypothesis critically depends on our own experience as a scientist and on the beliefs of the field. If a hypothesis is in line with our own experience and with the current thinking in the field, it is usually considered a “good” hypothesis. A value judgment based on subjectivity is the essence of bias. It happens all the time, consciously and unconsciously, and there is not much we can do about it.

In a high-throughput setting, we are very likely to encounter genes or proteins we have heard of before, worked with before, or simply feel sympathetic towards for whatever reason. I would wager that we are more likely to spot them on a list and select them for follow-up experiments, sometimes even despite contrary statistical evidence. It is called having a hunch.

Reality check

If we think about the combination of the shaky (or should I say nonexistent) foundation of the p-value as a predictor of finding a true hit and our intrinsic scientific biases, we should expect nothing else but a lack of tangible results of high-throughput screening. It is like giving a list of quasi-random names to a creative mind and asking for a story. You will get one, but whether it has anything to do with reality is a different question entirely.

If you look at published papers that include some form of high-throughput screens, you typically observe that one or two instances were “cherry picked” from a list and followed up the old-fashioned way. What happened to the great promises of systems biology, the understanding of complex regulatory patterns and emergent properties of biological networks?

It seems to me that this is another instance of “no free lunch”. You can’t have coverage and confidence at the same time. At least not at the moment.

In the meantime, have a close look at what dangles from your fishing rod. It might be an old boot masquerading as a fish. Don’t be fooled!

How to fish safely?

There are ways out of the fisherman’s / fisherwoman’s dilemma and I have listed some of them in a follow-up post. More information can be found in the articles listed below and the references therein.

Further reading

Three links to very accessible articles on the subject of p-values, statistical power, and prevalence:


Heatmaps – Part 3: How to create a microarray heatmap with R?

It is time to deal with some real data. I have hinted in Part 1 of this series that gene expression profiling using microarrays is a prime application for heatmaps. Today, we will look at the differences of gene expression in Acute Lymphoblastic Leukemia (ALL) samples that have either no cytogenetic abnormalities or the famous BCR/ABL chromosomal translocation (“Philadelphia chromosome”). Treatment of patients with the BCR/ABL translocation was the first big success of targeted chemotherapy using the small molecule kinase inhibitor Imatinib (Gleevec) around the turn of the century.

We will investigate whether the gene expression profiles of the two types of ALL are different, and if yes, how well hierarchical clustering can detect the type of ALL from the microarray data. An important follow-up to such an analysis would be to determine the genes that contribute to a gene expression “fingerprint” that predicts the type of ALL simply based on the gene expression profile of a patient sample so that targeted therapy can be administered if available.

For this tutorial, I am assuming that you have a reasonable familiarity with R. You should know about the basic data types, be comfortable with subsetting, and be able to write simple functions.

This analysis is inspired by an example in the slightly dated but excellent book Bioconductor Case Studies.

Step 1: Prepare the data

The data itself is conveniently available in an R package called “ALL”.

# load the package and the data set
library(ALL)
data(ALL)

Let’s look at what exactly we are dealing with here.

# look at help page associated with "ALL"
?ALL
# determine class of "ALL"
class(ALL)
# how much data are we dealing with?
dim(ALL)

There are several pieces of important information:

  1. The data is not a “data.frame” or “matrix” but an ExpressionSet. ExpressionSets are the go-to data representation for microarray data in a bundle of R libraries called “Bioconductor”. It not only makes it easy to extract the actual data as a “matrix” but also contains useful annotation. In our case “ALL” is an ExpressionSet with 12625 genes and 128 cancer samples.
  2. The information on the cytogenetic phenotype is stored in a variable called “mol.biol”. This will be useful to get a subset of the data.
  3. Annotation on whether the disease is B-cell or T-cell based can be found in the variable “BT”. Again, we will use this for extracting a subset of the data.

Heatmaps as a tool for data visualization work best if the data is not too diverse and not too large. Therefore, we will generate a subset of the “ALL” data that focuses on two types of ALL (“NEG” and “BCR/ABL”) that originate from B-cells.

# get samples with either no cytogenetic abnormalities (NEG)
# or the BCR-ABL translocation (BCR/ABL)
neg_bcrabl <- ALL$mol.biol %in% c("NEG", "BCR/ABL")
# get indices of cancers originating from B-cells
bcell <- grepl("^B", ALL$BT)
# subset the ALL data set
all <- ALL[, bcell & neg_bcrabl]
# adjust the factor levels to reflect the subset
all$mol.biol <- droplevels(all$mol.biol)
all$mol.biol <- relevel(all$mol.biol, ref = "NEG")
# how much data are we left with?
dim(all)

We were able to reduce the number of cancer samples from 128 to 79. Good enough for now.

Let’s deal with the number of genes. A common approach is to assume that genes that do not display much variation across the samples are unlikely to be important for the analysis. They either did not hybridize to the microarray, are not expressed, or simply did not change upon treatment. We will determine the most variable genes and use them for plotting a heatmap visualization of the data set.

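# determine the standard deviation for all genes across the samples
# note that this is essentially an optimized version of
# apply(exprs(all), 1, sd)
# rowSds() is assumed to come from the "genefilter" package here
# ("matrixStats" offers a function of the same name)
library(genefilter)
all_sd <- rowSds(exprs(all))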
# get the 200 most variable genes
top200 <- names(sort(all_sd, decreasing = TRUE))[1:200]
all_var <- all[top200, ]

Step 2: Decide on a distance metric

In our previous example, we used Euclidean distance. Euclidean distance is the square root of the sum of the squared differences between corresponding elements of two vectors i and j:

d_{ij}=\sqrt{\sum_{k=1}^{n}{(x_{ik} - x_{jk})^2}}

You can think of it as the “as the crow flies” distance between two vectors i and j in n dimensions.

One important aspect to consider about Euclidean distance is that it is dominated by the absolute value of a feature x_k, not the shape of the overall vector. In gene expression studies, we are particularly interested in how genes of different expression levels co-vary across different conditions, genotypes or treatments. The most established metric to calculate the distance between samples in gene expression data is the complement of the correlation coefficient.

d_{ij}=1 - cor(\vec{x_i}, \vec{x_j})

Note that we use the complement of the correlation coefficient because the correlation coefficient by itself is a measure of similarity, not distance. The correlation coefficient is invariant under linear transformation, i.e. invariant to scale and location and takes into account the similarity of the shapes of two vectors. In most cases we would use Pearson correlation, unless we have reason to assume that there is a non-linear relationship of the expression levels between samples. Then we would use the rank-based Spearman correlation coefficient.
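The difference between the two metrics is easy to see on a few made-up vectors. The profiles below are hypothetical and only serve to contrast “level” with “shape”.

```r
# two expression profiles with the same shape but a ten-fold different level
x_i <- c(1, 2, 3, 4, 5)
x_j <- 10 * x_i
# a profile at the same level as x_i but with the opposite trend
x_k <- rev(x_i)

sqrt(sum((x_i - x_j)^2))   # Euclidean distance: large, driven by the level
sqrt(sum((x_i - x_k)^2))   # Euclidean distance: small, although the shape is opposite

1 - cor(x_i, x_j)          # correlation distance: 0, identical shape
1 - cor(x_i, x_k)          # correlation distance: 2, opposite shape

# rank-based alternative for monotonic but non-linear relationships
1 - cor(x_i, x_i^3, method = "spearman")
```

Euclidean distance considers x_i and x_k to be close neighbors, whereas the correlation-based distance puts them as far apart as possible.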

Let’s set up a distance function in R that we will use later in our call to the “heatmap” function.

dist_cor <- function(x) {
    as.dist(1 - cor(t(x), method = "pearson"))
}

One little quirk of the “cor” function is that it calculates correlations on columns. Distances however are calculated on rows. A quick fix is to feed the transpose of the matrix to “cor”.
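A quick check on a toy matrix makes the quirk visible. The three “genes” below are invented for illustration, and dist_cor is repeated so that the snippet runs on its own.

```r
dist_cor <- function(x) {
  as.dist(1 - cor(t(x), method = "pearson"))
}

# toy matrix: 3 genes (rows) measured in 4 samples (columns)
m <- rbind(gene_a = c(1, 2, 3, 4),
           gene_b = c(2, 4, 6, 8),
           gene_c = c(4, 3, 2, 1))

dim(cor(m))      # 4 x 4: without transposing, "cor" compares the samples
dim(cor(t(m)))   # 3 x 3: with transposing, it compares the genes

d <- dist_cor(m) # pairwise distances between the three genes
d
```

The result is a regular “dist” object, which is the format clustering functions such as “hclust” expect.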

Step 3: Decide on a clustering method

There are many ways to cluster data but I will focus on one method commonly used in heatmaps: agglomerative hierarchical clustering. You can think of this as a bottom-up approach, in which all vectors start out as their own cluster and the algorithm iteratively merges the clusters that it determines to be the most similar until all clusters are merged into one. This results in a tree-like structure called a dendrogram, which depicts the distance between vectors as the length of the branches. One important aspect of agglomerative hierarchical clustering is that it is deterministic, i.e. it always ends up producing the same result on the same data no matter how many times you re-run the algorithm. This is different from k-means clustering, which can produce different clusterings depending on the initial conditions. One disadvantage of agglomerative clustering is that if one vector gets mis-assigned to some cluster early on, it will stay in that cluster until the end. K-means clustering can change cluster assignment at any time before convergence. This is why the way agglomerative hierarchical clustering determines the distance between clusters is of great importance to the final outcome.

In Part 2 of this tutorial we used the default method “complete” linkage, which determines the distance between two clusters A and B by determining the maximum absolute distance between two vectors \vec{x} \in A and \vec{y} \in B.

d(A, B) = \max_{\vec{x} \in A, \vec{y} \in B} \parallel \vec{x} - \vec{y} \parallel

Other methods use the minimum distance (“single”) or the average distance (“average”) to determine the distance between the clusters A and B. Single-link clustering tends to cluster via a “friends of friends” pattern, which typically results in a “stringy” clustering. As the distance depends on a single pair of vectors, it can handle irregular cluster shapes but it is sensitive to noise and outliers. At the opposite extreme, the complete-link clustering prefers to cluster vectors that are equally close together, which means it prefers globular clusters. It is less susceptible to noise and outliers but tends to break up big clusters into little ones. As you can imagine, the average-link method is somewhere in between. If you don’t already have an idea of which method to use based on experience or theoretical considerations, try which one works best for your problem.
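Trying out different linkage methods takes only a few lines. The sketch below uses simulated two-dimensional points (not the ALL data) and compares the cluster sizes obtained after cutting each tree into two clusters.

```r
set.seed(7)
# toy data: two groups of 10 points each in two dimensions
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 4), ncol = 2))
d <- dist(pts)                       # Euclidean distances between points

methods <- c("single", "complete", "average")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# cut each tree into two clusters and compare the cluster sizes
sapply(trees, function(tr) table(cutree(tr, k = 2)))

# re-running the algorithm gives the identical tree (deterministic)
same_tree <- identical(hclust(d, method = "complete")$merge, trees$complete$merge)
same_tree
```

On well-separated groups like these the methods tend to agree. Differences usually show up once the clusters overlap, are elongated, or contain outliers.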

The clustering method I will be using today is called Ward’s method. It determines the similarity between two clusters A and B based on the increase of the squared error upon merging the two clusters. This increase of variance \Delta is called the “merging cost”.

\Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \parallel \vec{m}_A - \vec{m}_B \parallel ^{2}

where \vec{m}_k is the center (centroid) of cluster k and n_k is the number of elements in cluster k.

Ward’s method uses cluster centroids and thus tends to be similar to the average-linkage method. In R, Ward’s method is implemented as “ward.D2”.
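The merging cost can be verified by hand on a tiny example. The helper functions merging_cost and sse and the four points below are my own illustration. They are not part of how “hclust” computes the clustering internally.

```r
# merging cost of Ward's method, following the formula above
merging_cost <- function(A, B) {
  n_A <- nrow(A)
  n_B <- nrow(B)
  m_A <- colMeans(A)                       # centroid of cluster A
  m_B <- colMeans(B)                       # centroid of cluster B
  n_A * n_B / (n_A + n_B) * sum((m_A - m_B)^2)
}

# sum of squared distances of all points to their cluster centroid
sse <- function(M) sum(sweep(M, 2, colMeans(M))^2)

A <- rbind(c(0, 0), c(0, 2))
B <- rbind(c(4, 0), c(4, 2))

merging_cost(A, B)                         # 16
sse(rbind(A, B)) - (sse(A) + sse(B))       # 16, the increase in squared error
```

Both numbers agree, which confirms that \Delta measures exactly how much the total squared error grows when the two clusters are merged.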

clus_wd2 <- function(x) {
    hclust(x, method = "ward.D2")
}

Step 4: Plot a microarray heatmap

It is customary in microarray heatmaps to use a “red-black-green” color scheme, where “green” signifies down-regulated genes, “black” unchanged genes, and “red” up-regulated genes. Let’s implement a custom color scheme using the “colorRampPalette” function, which is part of R’s “grDevices” package:

redblackgreen <- colorRampPalette(c("green", "black", "red"))(n = 100)

When available it is often instructive to plot the class labels of the samples we are attempting to cluster as a color code. It is an important sanity check to see if we are on the right track or have made a careless mistake. In our case, the samples either show no abnormal cytogenetics (“NEG”) or have the BCR-ABL translocation (“BCR/ABL”).

class_labels <- ifelse(all_var$mol.biol == "NEG", "grey80", "grey20")

We will use the “heatmap.2” function implemented in the “gplots” package. It functions the same way as R’s built-in “heatmap” function but offers more functionality.

Both the “heatmap” and the “heatmap.2” functions require you to feed them your data as a “matrix” object. We can extract the gene expression data as a matrix from the ExpressionSet using the “exprs” function.

library(gplots)
heatmap.2(exprs(all_var),
          # clustering
          distfun = dist_cor, 
          hclustfun = clus_wd2,
          # scaling (genes are in rows)
          scale = "row",
          # color
          col = redblackgreen, 
          # labels
          labRow = "", 
          ColSideColors = class_labels, 
          # tweaking
          trace = "none",
          density.info = "none")


Not as bad as it looks at first glance. If you look at the columns, the first two large clusters clearly separate a subpopulation of “NEG” samples (first cluster) and “BCR/ABL” samples (second cluster). The following smaller clusters are pretty homogeneous too; only the last couple are more or less random. Also, remember that the branches can be rotated at the nodes without changing the topology of the dendrogram.

At the gene level we can likewise see clear patterns of down-regulated (green) and up-regulated genes (red) emerging, especially within the first two homogeneous clusters.

Can we do better? Absolutely! We threw away most of the information by just taking the 200 most variable genes. Some might be just noisy genes, some might vary in response to factors other than the cytogenetic classification. We also have additional information on the patients such as sex, age, or whether the cancer went into remission. We would ideally make use of all of this information if we wanted to build a machine learning algorithm that distinguishes between different types of ALL. In this exercise our main purpose is visualization rather than analysis of the data, so let’s take a more straightforward way to select genes that distinguish the two types of ALL.

Step 5: A “better” way of selecting genes

In the “ALL” data set each cancer sample is already classified by its cytogenetic properties. This is a luxurious situation because it allows us to tune the selection of genes we want to display based on the cancer type classification. We will use statistical tests to determine the differentially expressed genes and use them for our heatmap.

Note that this approach is fine if our purpose is to generate a visual summary of our data at hand, but it is technically cheating. Why? Because you use the cancer type information to select the genes that are used for clustering the cancer types. It is a type of circular reasoning, or “data snooping” as it is called in machine learning jargon. This is why I took a truly unsupervised learning approach in the previous section and pretended that we did not know the class labels beforehand. Data snooping is a big problem in data science because it makes you think your model is better than it actually is. In reality, your model overfits your data at hand and it will likely not generalize well to future data.
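The effect of data snooping can be illustrated with pure noise. The following sketch uses simulated data and made-up names (“noise”, “labels”, “cluster_samples”); none of it comes from the “ALL” data set. No gene truly differs between the two groups, yet after selecting the genes with the largest t statistics, the clustering will typically appear to follow the labels far better than it does on all genes.

```r
set.seed(1)
n_genes <- 1000
n_samples <- 20

# pure noise: no gene is truly different between the groups
noise <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
labels <- factor(rep(c("A", "B"), each = n_samples / 2))

# equal-variance t statistic for every gene (row)
t_stat <- unname(apply(noise, 1, function(g) {
  t.test(g ~ labels, var.equal = TRUE)$statistic
}))

# "snooping": keep the 50 genes that best separate the labels
top <- order(abs(t_stat), decreasing = TRUE)[1:50]

# cluster the samples (columns) into two groups with Ward's method
cluster_samples <- function(m) {
  cutree(hclust(dist(t(m)), method = "ward.D2"), k = 2)
}

# agreement between labels and clusters, without and with snooping
table(labels, cluster_samples(noise))
table(labels, cluster_samples(noise[top, ]))
```

Any separation in the second table is an artifact of the selection step, since the data contain no real signal.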

Let’s start out by finding the genes that are differentially expressed between “NEG” and “BCR/ABL” samples. We will perform nonspecific filtering on the data first to remove genes that are either not expressed or don’t vary between the samples. This will increase the power of the t-tests later on.

# "shorth" and "rowttests" are part of the "genefilter" package
library(genefilter)
# the shortest interval containing half of the data
# reasonable estimate of the "peak" of the distribution
sh <- shorth(all_sd)
# we take only genes that have a standard deviation
# greater than or equal to "sh"
all_sh <- all[all_sd >= sh, ]
# how many genes do we have left?
dim(all_sh)

The distribution of standard deviations (“all_sd”) has a long tail towards the right (large values). This is typical for gene expression data. The “shorth” function is a simple and unbiased way to get an estimate of the peak of such a distribution to use as a cut-off to exclude genes with low variance. Using this approach, we were able to remove about 1/3 of the genes that are likely not relevant for our analysis. For more details, see the Bioconductor Case Studies.
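To show the idea behind the cut-off, here is a simplified base R stand-in for “shorth”. The function name “shorth_sketch” and the toy values are my own; the real genefilter implementation may differ in details such as tie handling and the exact window size.

```r
# Simplified sketch: mean of the shortest interval holding half of the data
shorth_sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  k <- ceiling(n / 2)                      # window size: half of the data
  starts <- seq_len(n - k + 1)
  widths <- x[starts + k - 1] - x[starts]  # width of each candidate window
  i <- which.min(widths)                   # position of the shortest window
  mean(x[i:(i + k - 1)])                   # its mean estimates the peak
}

# toy standard deviations with a long right tail (illustrative values)
toy_sd <- c(1, 1.1, 1.2, 1.3, 5, 9, 20, 40)
peak <- shorth_sketch(toy_sd)

# keep only the "genes" at or above the estimated peak
toy_sd[toy_sd >= peak]
```

Because the estimate sits at the bulk of the distribution rather than at the mean, the long right tail does not drag the cut-off upwards.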

Next, we will perform row-wise t-tests on all genes that are left. The cytogenetic classification “mol.biol” tells us which sample belongs to which group.

tt <- rowttests(all_sh, all_sh$mol.biol) 
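For readers without Bioconductor at hand, a row-wise t-test can be sketched in base R with “apply” and “t.test”. I am assuming an equal-variance two-sample test here, which is my understanding of what “rowttests” computes; the toy matrix and group labels are made up.

```r
# two groups of four samples each (illustrative labels)
group <- factor(rep(c("NEG", "BCR/ABL"), each = 4))

# toy expression matrix: genes in rows, samples in columns
toy <- rbind(
  gene_a = c(1, 2, 3, 4, 11, 12, 13, 14),  # clearly different between groups
  gene_b = c(1, 2, 3, 4, 1, 2, 3, 4)       # identical in both groups
)

# one equal-variance t-test per row; collect the p-values
p_values <- apply(toy, 1, function(g) {
  t.test(g ~ group, var.equal = TRUE)$p.value
})
p_values
```

This loop is far slower than “rowttests” on thousands of genes, so it is meant as an illustration only.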

This code performs 8812 separate t-tests. If we now took all genes that have a p-value less than or equal to 0.05, we would expect around 440 genes (8812 × 0.05) to fall into that category just by chance, even if no gene were truly differentially expressed. This is an unacceptable number of false positives. The most common solution to this problem is to adjust the p-values for multiple testing, so that among the genes we choose the false discovery rate (FDR) is controlled at about 5%.

# use the Benjamini-Hochberg method to adjust
tt$p.adj <- p.adjust(tt$p.value, method = "BH")
# subset the pre-filtered "all_sh" for genes
# with an adjusted p-value less than or equal to 0.05
all_sig <- all_sh[tt$p.adj <= 0.05, ]
# how many genes are we left with?
dim(all_sig)
We end up with 201 genes that are candidates for differential expression between the two types of ALL. As this number is very close to the number of genes we used for our variance-based filtering, we can plug the results directly into the “heatmap.2” function to compare the performance with our previous attempt.

heatmap.2(exprs(all_sig),
          # clustering
          distfun = dist_cor,
          hclustfun = clus_wd2,
          # scaling (genes are in rows)
          scale = "row",
          # color
          col = redblackgreen,
          # labels
          labRow = "",
          ColSideColors = class_labels,
          # tweaking
          trace = "none",
          density.info = "none")

This will result in the following heatmap.


The two types of ALL segregate nicely into two distinct clusters (with a few exceptions). Note that the last four samples of the dark grey “BCR/ABL” bar actually cluster with the “NEG” samples. They just happen to be next to the other dark grey samples in this particular topology of the dendrogram.

When we look at the differentially expressed genes, we see something interesting. The “BCR/ABL” samples appear to have many more genes that are up-regulated (red) compared to the “NEG” samples. Only about 20% of the significantly different genes are down-regulated (green). The Bcr-Abl chimeric kinase is thought to be constitutively active, so one could rationalize such an outcome by suggesting that the kinase inappropriately drives pathways that lead to turning on transcription factors, which in turn up-regulate the expression of certain genes.

It is not surprising that we did better than in our previous attempt. We used the cancer type class labels to inform our choice of genes, so the hierarchical clustering gives us back some of what we put in. However, to summarize the data visually, such an approach is OK.

Step 6: Have mercy on the color-challenged

A surprisingly large percentage of the population, mostly men because the responsible genes are X-linked, suffers from red-green color blindness. If you want to be nice, use a different color palette, such as yellow-blue:

yellowblackblue <- colorRampPalette(c("dodgerblue", "black", "gold"))(n = 100)

Plotting the same heatmap with the altered color scheme looks like this. If this is clearer to you than the previous one, you might not only have learned something about heatmaps but also something about yourself today.



  • Data preparation and feature selection (e.g. genes) are critical for the outcome of any data visualization
  • Understand which distance and clustering methods work best for your data
  • Be mindful about data snooping when it comes to the application of any machine learning algorithm (hierarchical clustering is an unsupervised machine learning algorithm)


The full R script can be found on GitHub.


This post is part 3 of a series on heatmaps:

Part 1: What is a heatmap?

Part 2: How to create a simple heatmap with R?

Part 3: How to create a microarray heatmap with R?
