Analyzing quantitative PCR data the tidy way

Previously in this series on tidy data: Taking up the cudgels for tidy data

One of the most challenging aspects of working with data is how easy it is to get lost. Even if the data sets are small. Multiple levels of hierarchy and grouping quickly confuse our human brains (at least mine). Recording such data in two dimensional spreadsheets naturally leads to blurring of the distinction between observation and variable. Such data requires constant reformatting and its structure may not be intuitive to your fellow researcher.

Here are the two main rules about tidy data as defined in Hadley Wickham’s paper:

  1. Each variable forms a column
  2. Each observation forms a row

A variable is an “attribute” of a given data point that describes the conditions when it was taken. Variables often are categorical (but they don’t need to be). For example, the gene tested or the genotype associated with a given measurement would be a categorical variables.

An observation is a measurement associated with an arbitrary number of variables. There are no measurements that are taken under identical conditions. Each observation is uniquely described by the variables and should form its own row.

Let’s look at an example of a typical recording of quantitative PCR data in Excel.


We have measurements for three different genotypes (“control”, “mutant1”, “mutant2”) from three separate experiments (“exp1”, “exp2”, “exp3”) with three replicates each (“rep1”, “rep2”, “rep3”).

Looking at the columns, we see that information on “experiment” and “replicate” are stored in the names of the columns rather than the entries of the columns. This will have to be changed.

There clearly are multiple measurements per row. More precisely, it looks like we have a set of nine measurements for each genotype. But this is not entirely true. Experiments are considered statistically independent as they are typically performed at different times and with different cells. They capture the full biological variability and we call them “biological replicates”. The repeated measurements done in each experiment are not statistically independent because they come from the same sample preparation and thus only capture sources of variance that originate from sample handling or instrumentation. We call them “technical replicates”. Technical replicates cannot be used for statistical inference that requires “statistical independence”, such as a t-test. As you can see, we have an implicit hierarchy in our data that is not expressed in the structure of the data representation shown above.

We will untangle all those complications one by one using R tools developed by Hadley Wickham and others to represent the same data in a tidy format suitable for statistical analysis and visualization. For details about how the code works, please consult the many excellent tutorials on dplyr, tidyr, ggplot2, and broom.

messy <- read.csv("qpcr_messy.csv", row.names = 1)

This is the original data read into R. Let’s get started.


Row names should form their own column

The “genotype” information is recorded as row names. “Genotype” clearly is a variable, so we should make “genotype” a full column.

tidy <- data.frame(messy) %>%
# make row names a column
mutate(genotype = rownames(messy))

qpcr_R_messy2What are our variables?

Next, we need to think about what are our variables. We have already identified “genotype” but what are the other ones? The way we do this is to ask ourselves what kind of information we would need to uniquely describe each observation. The experiment and replicate number are essential to differentiate each quantitative PCR measurement, so we need to create separate columns for “experiment” and “replicate”. We will do this in two steps. First we use “gather” to convert tabular data from wide to long format (we could have also used the more general “melt” function from the “reshape2” package). The former column names (e.g. “exp1_rep1”) are saved into a temporary column called “sample”. As this column contains information about two variables (“experiment” and “replicate”), we need to separate it into two columns to conform with the “each variable forms a column” rule. To do this, we use “separate” to split “sample” into the two columns “experiment” and “replicate”.

tidy <- tidy %>%
    # make each row a single measurement 
    gather(key = sample, value = measurement, -genotype) %>%
    # make each column a single variable
    separate(col = sample, into = c("experiment", "replicate"), sep = "_")

qpcr_R_messy3Here are the first 10 columns of the “tidy” representation of the initial Excel table. Before we can do statistical tests and visualization, we have to take care of one more thing.

Untangling implicit Domain specific hierarchies

Remember what we said before about the two different kind of replicates. Only data from biological replicates (“experiments”) are considered statistically independent samples, while technical replicates (“replicate”) are not. One common approach is to average the technical replicates (“replicate”) before any statistical test is applied. With tidy data, this is simple.

data <- tidy %>%
    # calculate mean of technical replicates by genotype and experiment
    group_by(genotype, experiment) %>%
    summarise(measurement = mean(measurement)) %>%

Having each variable as its own column makes the application of the same operation onto different groups straightforward. In our case, we calculate the mean of technical replicates for each genotype and experiment combination.

Now, the data is ready for analysis.

TIdy Statistical analysis of quantitative pcr data

The scientific rational for a quantitative PCR experiment is to find out whether the number of transcripts for a given gene is different between two or more conditions. We have measurements for one transcript in three distinct genotypes (“control”, “mutant1”, “mutant2”). Biological replicates are considered independent and measurements are assumed to be normally distributed around a “true” mean value. A t-test would be an appropriate choice for the comparison of two genotypes. In this case, we have three genotypes, so we will use one-way anova followed by Tukey’s post-hoc test.

mod <- data %>%
    # set "control" as reference
    mutate(genotype = relevel(factor(genotype), ref = "control")) %>%
    # one-way anova and Tukey's post hoc test
    do(tidy(TukeyHSD(aov(measurement ~ genotype, data = .))))

We generally want to compare the effect of a genetic mutation to a “control” condition. We therefore set the reference of “genotype” to “control”.

Using base R statistics functions like “aov” and “TukeyHSD” in a tidy data analysis workflow can pose problems because they were not created with the idea of “dplyr”-style piping (“%>%”) in mind. Piping requires that the input and output of each function is a data frame and that the input is the first argument of the function. The “aov” function neither takes the input data frame as its first argument, nor does it return a data frame but a specialized “aov” object. To add insult to injury, the “TukeyHSD” function only works with such a specialized “aov” object as input.

In situations like this, the “do” function comes in handy. Within the “do” function, the input of the previous line is accessible through the dot character, so we can use an arbitrary function within “do” and just refer to the input data at the appropriate place with “.”. As a final clean-up, the “tidy” function from the “broom” package makes sure that the output of the line is a data frame.

qpcr_tukeyTukey’s post hoc test thinks “mutant1” is different from “control” but “mutant2” is not. Let’s visualize the results to get a better idea of how the data looks like.

tidy Visualization of quantitative PCR data

We are dealing with few replicates, three in our case, so a bar graph is not the most efficient representation of our data. Plotting the individual data points and the confidence intervals gives us more information using less ink. We will use the “ggplot2” package because it is designed to work with data in the tidy format.

# genotype will be on the x-axis, measurements on the y-axis
ggplot(data, aes(x = genotype, y = measurement, col = experiment)) + 
    # plot the mean of each genotype as a cross
    stat_summary(fun.y = "mean", geom = "point", color = "black", shape = 3, size = 5) +
    # plot the 95% confidence interval for each genotype
    stat_summary( = "mean_cl_normal", geom = "errorbar", color = "black", width = 0.1) + 
    # we we add the averaged measurements for each experiment
    geom_point(shape = 16, size = 5) +

qpcr_visualizationWe can see why the first mutant is different from the “control” sample and the second is not. More replicates would be needed to test whether the small difference in means between “control” and “mutant2” is a true difference or not.

What I have shown here is just the tip of the iceberg. There are many more tools and functions to discover. The more data analysis you do, the more you will realize how important it is not to waste time formatting and reformatting the the data for each step of the analysis. Learning about how to tidy up your data is an important step towards that goal.

The R code can be found on Github.


Analyzing quantitative PCR data the tidy way

PCA – Part 5: Eigenpets

In this post scriptum to my series on Principal Component Analysis (PCA) I will show how PCA can be applied to image analysis. Given a number of images of faces, PCA decomposes those images into “eigenfaces” that are the basis of some facial recognition algorithms. Eigenfaces are the eigenvectors of the image data matrix. I have shown in Part 3 of this series that the eigenvectors that capture a given amount of variance of the data can be used to obtain an approximation of the original data using fewer dimensions. Likewise, we can approximate the image of a human face by a weighted combination of eigenfaces.

From images to matrices

A digital image is just a matrix of numbers. Fair game for PCA. You might ask yourself, though, how to coerce many matrices into a single data matrix with samples in the rows and features in the columns? Just stack the columns of the image matrices to get a single long vector and then stack the image vectors to obtain the data matrix. For example, a 64 by 64 image matrix would result in a 4096-element image vector, and 100 such image vectors would be stacked into a 100 by 4096 data matrix.

The more elements a matrix has, the more computationally expensive it becomes to do the matrix factorization that yields the eigenvectors and eigenvalues. As the number of images (samples) is usually much smaller than the number of pixels (features), it is more efficient to compute the eigenvectors of the transpose of the data matrix with the pixels in the rows and the images in the columns.

Fortunately, there is an easy way to get from the eigenvectors of the covariance matrix \boldsymbol A^T \boldsymbol A to those of the covariance matrix \boldsymbol A \boldsymbol A^T. The eigenvector equation of \boldsymbol A^T \boldsymbol A is

\boldsymbol A^T \boldsymbol A \vec{v} = \lambda \vec{v}

Multiplying \boldsymbol A to the left on both sides gives us important clues about the relationship between the eigenvectors and eigenvalues between the two matrices.

\boldsymbol A \boldsymbol A^T (\boldsymbol A \vec{v}) = \lambda (\boldsymbol A \vec{v})

The eigenvalues of the covariance matrices \boldsymbol A^T \boldsymbol A and \boldsymbol A \boldsymbol A^T are identical. If \boldsymbol A is an m by n matrix and m < n, there will be m nonzero eigenvalues and n-m zero eigenvalues.

To get from an eigenvector \vec{v} of \boldsymbol A^T \boldsymbol A to an eigenvector of \boldsymbol A \boldsymbol A^T, we just need to multiply by \boldsymbol A on the left.

What does an “eigenpet” look like?

For this demonstration, I will not be using images of human faces but, in line with the predominant interests of the internet, faces of cats and dogs. I obtained this data set when I took the course “Computational Methods for Data Analysis” on Coursera and converted the original data to text files to make them more accessible to R. The data can be found on Github.

# read data from Github
cats <- read.table(text = getURL(""), sep = ",")
dogs <- read.table(text = getURL(""), sep = ",")
# combine cats and dogs into single data frame
pets <- cbind(cats, dogs)

The data matrix already is in a convenient format. Each 64 by 64 pixel image has been converted into a 4096 pixel vector and each of the 160 image vector is stacked vertically to obtain a data matrix with dimensions 4096 by 160. As you might have guessed, there are 80 cats and 80 dogs in the data set.

The distinction between what is a sample and what is a feature becomes a little blurred in this analysis. Technically, the 160 images are the samples and the 4096 pixels are the features, so we should do PCA on a 160 by 4096 matrix. However, as discussed above it is more convenient to operate with the transpose of this matrix, which is why image data usually is prepared in its transposed form with features in rows and samples in columns.

As we have seen in theoretically in Part 2 of this series, it is necessary to center (and scale) the features before performing PCA. We are dealing with 8-bit greyscale pixel values that are naturally bounded between 0 and 255, so they already are on the same scale. Moreover, all features measure the same quantity (brightness), so it is customary to center the data by subtracting the “average” image from the data.

# compute "average" pet
pet0 <- rowMeans(pets)
cat0 <- rowMeans(pets[, 1:80])
dog0 <- rowMeans(pets[, 81:160])

So what does the average pet, the average cat, average dog look like?

# create grey scale color map
greys <- gray.colors(256, start = 0, end = 1)
# convenience function to plot pets
show_image <- function(v, n = 64, col = greys) {
    # transform image vector back to matrix
    m <- matrix(v, ncol = n, byrow = TRUE)
    # invert columns to obtain right orientation
    # plot using "image"
    image(m[, nrow(m):1], col = col, axes = FALSE)
# plot average pets
for (i in list(pet0, cat0, dog0)) {

pca_part5_fig1As expected, the average pet has features of both cats and dogs, while the average cat and dog are quite recognizable as members of their respective species.

Let’s run PCA on the data set and see what an “eigenpet” looks like.

# subtract average pet
pets0 <- pets - pet0
# run pca
pc <- prcomp(pets0, center = FALSE, scale. = FALSE)

Ordinarily, we would find the eigenvector matrix in the “rotation” object return to us by “prcomp”. Remember, that we did PCA on the transpose of the data matrix, so we have to do some additional work to get to the eigenvectors of the original data. I have shown above that to get from the eigenvectors of \boldsymbol A^T \boldsymbol A to the eigenvectors of \boldsymbol A \boldsymbol A^T, we just need to multiply by \boldsymbol A on the left. \boldsymbol A, in this case, is our centered data matrix “pets0”.

Note that the unscaled eigenvectors of the eigenfaces are equivalent to the projection of the data onto the eigenvectors of the transposed data matrix. Therefore, we do not have to explicitly compute them but can use the “pc$x” object.

# obtain unscaled eigenvectors of eigenfaces
u_unscaled <- as.matrix(pets0) %*% pc$rotation
# this turns out to be the same as the projection of the data 
# stored in "pc$x"
all.equal(u_unscaled, pc$x)

Both “u_unscaled” and “pc$x” are unscaled versions of the eigenvectors, which means that they are not unitary matrices. For plotting, this does not matter because the images will be scaled automatically. If scaled eigenvectors are important, it is more convenient to use singular value decomposition and use the left eigenvalues stored in the matrix “u”.

# singular value decomposition of data
sv <- svd(pets0)
# left eigenvalues are stored in matrix "u"
u <- sv$u

Let’s look at the first couple of “eigenpets” (eigenfaces).

# display first 6 eigenfaces
for (i in seq(6)) {
    show_image(pc$x[, i])

pca_part5_fig2Some eigenfaces are definitely more cat-like and others more dog-like. We also see a common issue with eigenfaces. The directions of highest variance are usually associated with the lighting conditions of the original images. Whether they are predominantly light or dark dominates the first couple of components. In this particular case, it appears that the images of cats have stronger contrasts between foreground and background. If we were to do face recognition, we would either preprocess the images to have comparable lighting conditions or exclude the first couple of eigenfaces.

Reconstruction of Pets using “Eigenpets”

Using the eigenfaces we can approximate or completely reconstruct the original images. Let’s see how this looks like with a couple of different variance cut-offs.

# a number of variance cut-offs
vars <- c(0.2, 0.5, 0.8, 0.9, 0.98, 1)
# calculate cumulative explained variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# get the number of components that explain a given amount of variance
npc <- sapply(vars, function(v) which(var_expl >= v)[1])
# reconstruct four cats and four dogs
for (i in seq(79, 82)) {
    for (j in npc) {
        # project data using "j" principal components
        r <- pc$x[, 1:j] %*% t(pc$rotation[, 1:j])
        show_image(r[, i])
        text(x = 0.01, y = 0.05, pos = 4, labels = paste("v =", round(var_expl[j], 2)))
        text(x = 0.99, y = 0.05, pos = 2, labels = paste("pc =", j))

pca_part5_fig3Every pet starts out looking like a cat due to the dominance of lighting conditions. Somewhere between 80% and 90% of captured variance, every pet is clearly identifiable as either a cat or a dog. Even at 90% of capture variance, we use less than one third of all components for our approximation of the original image. This may be enough for a machine learning algorithm to tell apart a cat from a dog with high confidence. We see, however, that a lot of the detail of the image is contained in the last two thirds of the components.


The full R code is available on Github.

Further Reading

Scholarpedia – Eigenfaces

Jeff Jauregui – Principal Component Analysis with Linear Algebra


Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets

PCA – Part 5: Eigenpets

PCA – Part 3: In the Trenches

Now that we have an intuition of what principal component analysis (PCA) is and understand some of the mathematics behind it, it is time we make PCA work for us.

Practical examples of PCA typically use Ronald Fisher’s famous “Iris” data set, which contains four measurements of leaf lenghts and widths of three subspecies of Iris flowers. To mix things up a little bit, I will use a data set that is closer to what you would encounter in the real world.

The “Human Activity Recognition Using Smartphones Data Set” available from the UCI Machine Learning Repository contains a total of 561 triaxial acceleration and angular velocity measurements of 30 subjects performing different movements such as sitting, standing, and walking. The researchers collected this data set to ask whether those measurements would be sufficient to tell the type of activity of the person. Instead of focusing on this classification problem, we will look at the structure of the data and investigate using PCA whether we can express the information contained in the 561 different measurements in a more compact form.

I will be using a subset of the data containing the measurements of only three subjects. As always, the code used for the pre-processing steps of the raw data can be found on GitHub.

Step 1: Explore the data

Let’s first load the pre-processed subset of the data into our R session.

# read data from Github
measurements <- read.table(text = getURL(""))
description <- read.table(text = getURL(""))

It’s always a good idea to check for a couple of basic things first. The big three I usually check are:

  • What are dimensions of the data, i.e. how many rows and columns?
  • What type of features are we dealing with, i.e. categorical, ordinal, continuous?
  • Are there any missing values?

The answer to those three questions will determine the amount of additional data munging we have to do before we can use the data for PCA.

# what are the dimensions of the data?
# what type of data are the features?
table(sapply(measurements, class))
# are there missing values?

The data contains 990 samples (rows) with 561 measurements (columns) each. Clearly too many measurements for visualizing on a scatterplot. The measurements are all of type “numeric”, which means we are dealing with continuous variables. This is great because categorical and ordinal variable are not handled well by PCA. Those need to be “dummy coded“. We also don’t have to worry about missing values. Strategies for handling missing values are a topic on its own.

Before we run PCA on the data, we should look at the correlation structure of the features. If there are features, i.e. measurements in our case, that are highly correlated (or anti-correlated), there is redundancy within the data set and PCA will be able to find a more compact representation of the data.

# feature correlation before PCA
cor_m <- cor(measurements, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_m <- cor_m[upper.tri(cor_m)]
# plot correlations as histogram
hist(upt_m, prob = TRUE)
# plot correlations as image
image.plot(cor_m, axes = FALSE)

The code was simplified for clarity. The full version can be found in the script.

pca_part3_fig1We see in the histogram on the left that there is a considerable number of highly correlated features, most of them positively correlated. Those features show up as yellow in the image representation to the right. PCA will likely be able to provide us with a good lower dimensional approximation of this data set.

Step 2: Run PCA

After all the preparation, running PCA is just one line of code. Remember, that we need to at least center the data before using PCA (Why? see Part 2). Scaling is technically only necessary if the magnitude of the features are vastly different. Note, that the data appears to be already centered and scaled from the get-go.

# run PCA
pc <- prcomp(measurements, center = TRUE, scale. = TRUE)

Depending on your system and the number of features of your data this may take a couple of seconds.

The call to “prcomp” has constructed new features by linear combinations of the old features and sorted them by their and weighted by the amount of variance they explain. Because the new features are the eigenvectors of the feature covariance matrix, they should be orthogonal, and hence uncorrelated, by definition. Let’s visualize this directly.

The new representation of the data is stored as a matrix named “x” in the list object we get back from “prcomp”. In our case, the matrix would be stored as “pc$x”.

# feature correlation before PCA
cor_r <- cor(pc$x, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_r <- cor_r[upper.tri(cor_r)]
# plot correlations as histogram
hist(upt_r, prob = TRUE)
# plot correlations as image
image.plot(cor_r, axes = FALSE)


The new features are clearly no longer correlated to each other. As everything seems to be in order, we can now focus on the interpretation of the results.

Step 3: Interpret the results

The first thing you will want to check is how much variance is explained by each component. In PCA speak, this can be visualized with a “scree plot”. R conveniently has a built-in function to draw such a plot.

# draw a scree plot
screeplot(pc, npc = 10, type = "line")


This is about as good as it gets. A large amount of the variance is captured by the first principal component followed by a sharp decline as the remaining components gradually explain less and less variance approaching zero.

The decision of how many components we should use to get a good approximation of the data has to be made on a case-by-case basis. The cut-offs for the percent explained variance depends on the kind of data you are working with and its inherent covariance structure. The majority of the data sets you will encounter are not nearly as well behaved as this one, meaning that the decline in explained variance is much more shallow. Common cut-offs range from 80% to 95% of explained variance.

Let’s look at how many components we would need to explain a given amount of variance. In the R implementation of PCA, the variances explained by each principle component are stored in a vector called “sdev”. As the name implies, these are standard deviations or the square roots of the variances, which in turn are scaled versions of the eigenvalues. We will need to take the squares “sdev” to get back the variances.

# calculate explained variance as cumulative sum
# sdev are the square roots of the variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# plot explained variance
plot(c(0, var_expl), type = "l", lwd = 2, ylim = c(0, 1), 
     xlab = "Principal Components", ylab = "Variance explained")
# plot number of components needed to for common cut-offs of variance explained
vars <- c(0.8, 0.9, 0.95, 0.99)
for (v in vars) {
npc <- which(var_expl > v)[1]
    lines(x = c(0, npc, npc), y = c(v, v, 0), lty = 3)
    text(x = npc, y = v - 0.05, labels = npc, pos = 4)
    points(x = npc, y = v)


The first principle component on its own explains more than 50% of the variance and we need only 20 components to get up to 80% of the explained variance. Fewer than 30% of the components (162 out of 561) are needed to capture 99% of the variance in the data set. This is a dramatic reduction of complexity. Being able to approximate the data set with a much smaller number of features can greatly speed up downstream analysis and can help to visualize the data graphically.

Finally, let’s investigate whether “variance” translates to “information”. In other words, do the prinicipal components associated with the largest eigenvalues discriminate between the different human activities?

If the class labels (“activities” in our case) are known, a good way to do look at the “information content” of the principal components is to look at scatter plots of the first couple of components and color-code the samples by class label. This code gives you a bare bones version of the figure shown below. The complete code can be found on Github.

# plot the first 8 principal components against each other
for(p in seq(1, 8, by = 2)) {
  plot(pc$x[, p:(p+1)], pch = 16, 
       col = as.numeric(description$activity_name))


We have seen previously that the first component alone explains about half of the variance and in this figure we see why. It almost perfectly separates non-moving “activities” (“laying”, “sitting”, “standing”) from moving activities (various types of “walking”). The second component does a reasonable job at telling the difference between walking and walking upstairs. As we move down the list, there remains visible structure but distinctions become somewhat less clear. One conclusion we can draw from this visualization is that it will most likely be most difficult to tell “sitting” apart from “standing” as none of the dimensions seems to be able to distinguish red and green samples. Oddly enough, the fifth component does a pretty good job of separating “laying” from “sitting” and “standing”.


PCA can be a powerful technique to obtain low dimensional approximations of data with lots of redundant features. The “Human Activity Recognition Using Smartphones Data Set” used in this tutorial is a particularly good example of that. Most real data sets will not be reduced to a few components so easily while retaining most of the information. But even cutting the number of features in half can lead to considerable time savings when using machine learning algorithms.

Here are a couple of useful questions when approaching a new data set to apply PCA to:

  1. Are the features numerical or do I have to convert categorial features?
  2. Are there missing values and if yes, which strategy do I apply to deal with them?
  3. What is the correlation structure of the data? Will PCA be effective in this case?
  4. What is the distribution of variances after PCA? Do I see a steep or shallow decline in explained variance?
  5. How much “explained variance” is a good enough approximation of the data? This is usually a compromise between how much potential information I am willing to sacrifice for cutting down computation time of follow-up analyses.

In the final part of this series, we will discuss some of the limitations of PCA.

Addendum: Understanding “prcomp”

The “prcomp” function is very convenient because it caclulates all the numbers we could possible want from our PCA analysis in one line. However, it is useful to know how those number were generated.

The three most frequently used objects returned by “prcomp” are

  • “rotation”: right eigenvectors (“feature eigenvectors”)
  • “sdev”: square roots of scaled eigenvalues
  • “x”: projection of original data onto the new features


In Part 2, I mentioned that software implementations of PCA usually compute the eigenvectors of the data matrix using singular value decomposition (SVD) rather than eigendecomposition of the covariance matrix. In fact, R’s “prcomp” is no exception.

“Rotation” is a matrix whose columns are the right eigenvalues of the original data. We can reconstruct “rotation” using SVD.

# perform singular value decomposition on centered and scaled data
sv <- svd(scale(measurements))
# "prcomp" stores right eigenvectors in "rotation"
w <- pc$rotation
dimnames(w) = NULL
# "svd" stores right eigenvectors in matrix "v"
v <- sv$v
# check if the two matrices are equal
all.equal(w, v)


Singular values are the square roots of the eigenvalues as we have seen in Part 2. “sdev” stands for standard deviation and thus stores the square roots of the variances. Thus, the squares of “sdev” and the squares of the singular values are directly proportional to each other and the scaling factor is the number of rows of the original data matrix minus 1.

# relationship between singular values and "sdev"
all.equal(sv$d^2/(nrow(sv$u)-1), pc$sdev^2)


The projection of the orignal data (“measurements”) onto its eigenbasis is automatically calculated by “prcomp” through its default argument “retx = TRUE” and stored in “x”. We can manually recreate the projection using matrix-matrix multiplication.

# manual projection of data
all.equal(pc$x, scale(measurements) %*% pc$rotation)

If we wanted to obtain a projection of the data onto a lower dimensional subspace, we just determine the number of components needed and subset the columns of matrix “x”. For example, if we wanted to get an approximation of the original data preserving 90% of the variance, we take the first 52 columns of “x”.

# projection of original data preserving 90% of variance
y90 <- pc$x[, 1:52]
# note that this is equivalent matrix multiplication with 
# the first 52 eigenvectors
all.equal(y90, scale(measurements) %*% pc$rotation[, 1:52])


The full R code is available on Github.

Further reading

IRIS data set

Sebastian Raschka – Principle Component Analysis in 3 Simple Steps


Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets

PCA – Part 3: In the Trenches

Heatmaps – Part 3: How to create a microarray heatmap with R?

It is time to deal with some real data. I have hinted in Part 1 of this series that gene expression profiling using microarrays is a prime application for heatmaps. Today, we will look at the differences of gene expression in Acute Lymphoblastic Leukemia (ALL) samples that have either no cytogenetic abnormalities or the famous BCR/ABL chromosomal translocation (“Philadelphia chromosome”). Treatment of patients with the BCR/ABL translocation was the first big success of targeted chemotherapy using the small molecule kinase inhibitor Imatinib (Gleevec) around the turn of the century.

We will investigate whether the gene expression profile between the two types of ALL are different, and if yes, how well hierarchical clustering can detect the type of ALL from the microarray data. An important follow-up to such an analysis would be to determine the genes that contribute to a gene expression “fingerprint” that predicts the type of ALL simply based on the gene expression profile of a patient sample so that targeted therapy can be administered if available.

For this tutorial, I am assuming that you have a reasonable familiarity with R. You should know about the basic data types, be comfortable with subsetting, and be able to write simple functions.

This analysis is inspired by an example in the slightly dated but excellent book Bioconductor Case Studies.

Step 1: Prepare the data

The data itself is conveniently available in an R package called “ALL”.


Let’s look at what exactly we are dealing with here.

# look at help page associated with "ALL"
# determine class of "ALL"
# how much data are we dealing with?

There are several pieces of important information:

  1. The data is not a “data.frame” or “matrix” but an ExpressionSet. ExpressionSets are the go-to data representation for microarray data in a bundle of R libraries called “Bioconductor“. It not only makes it easy to extract the actual data as a “matrix” but also contains useful annotation. In our case “ALL” is an ExpressionSet with 12625 genes and 128 cancer samples.
  2. The information on the cytogenetic phenotype is stored in a variable called “mol.biol”. This will be useful to get a subset of the data.
  3. Annotation on whether the disease is B-cell or T-cell based can be found in the variable “BT”. Again, we will use this for extracting a subset of the data.

Heatmaps as a tool for data visualization works best if the data is not too diverse and not too large. Therefore, we will generate a subset of the “ALL” data that focuses on two types of ALL (“NEG” and “BCR/ABL”) that originate from B-cells.

# get samples with either no cytogenetic abnormalities (NEG)
# or the BCR-ABL translocation (BCR/ABL)
neg_bcrabl <- ALL$mol.biol %in% c("NEG", "BCR/ABL")
# get indices cancers originating from B-cells
bcell <- grepl("^B", ALL$BT)
# subset the ALL data set
all <- ALL[, bcell & neg_bcrabl]
# adjust the factor levels to reflect the subset
all$mol.biol <- droplevels(all$mol.biol)
all$mol.biol <- relevel(all$mol.biol, ref = "NEG")
# how much data are we left with?

We were able to reduce the number of cancer samples from 128 to 79. Good enough for now.

Let’s deal with the number of genes. A common approach is to assume that genes that do not display much variation across the samples are unlikely to be important for the analysis. They are either did not hybridize to the microarray, are not expressed, or simply did not change upon treatment. We will determine the most variable genes and use them for plotting a heatmap visualization of the data set.

# determine the standard deviation for all genes across the samples
# note that this is essentially an optimized version of
# apply(exprs(all), 1, sd)
all_sd <- rowSds(exprs(all))
# get the 200 most variable genes
top200 <- names(sort(all_sd, decreasing = TRUE))[1:200]
all_var <- all[top200, ]

Step 2: Decide on a distance metric

In our previous example, we used euclidean distance. Euclidean distance is the square root of the sum of the squared distance between each pair of elements of two vectors i and j

d_{ij}=\sqrt{\sum_{k=1}^{n}{(x_{ik} - x_{jk})^2}}

You can think of it as the “as the crow flies” distance between two vectors i and j in n dimensions.

One important aspect to consider about euclidean distance is that it is dominated by the absolute value of a feature x_k, not the shape of the overall vector. In gene expression studies, we are particularly interested in how genes of different expression levels co-vary across different conditions, genotypes or treatments. The most established metric to calculate the distance between samples in gene expression data is the complement of the correlation coefficient.

d_{ij}=1 - cor(\vec{x_i}, \vec{x_j})

Note that we use the complement of the correlation coefficient because the correlation coefficient by itself is a measure of similarity, not distance. The correlation coefficient is invariant under linear transformation, i.e. invariant to scale and location and takes into account the similarity of the shapes of two vectors. In most cases we would use Pearson correlation, unless we have reason to assume that there is a non-linear relationship of the expression levels between samples. Then we would use the rank-based Spearman correlation coefficient.

Let’s set up a distance function in R that will use later in our call to the “heatmap” function.

dist_cor <- function(x) {
    as.dist(1 - cor(t(x), method = "pearson"))

One little quirk of the “cor” function is that it calculates correlations on columns. Distances however are calculated on rows. A quick fix is to feed the transpose of the matrix to “cor”.

Step 3: Decide on a clustering method

There are many ways to cluster data but I will focus on one method commonly used in heatmaps: agglomerative hierarchical clustering. You can think of this as a bottom-up approach, in which all vectors start out as their own cluster and the algorithm iteratively merges the clusters that it determines the most similar until all clusters are merged into one. This results in a tree-like structure called a dendrogram, which depicts the distance between vectors as the length of the branches. One important aspect of agglomerative hierarchical clustering is that it is deterministic, i.e. it always ends up producing the same result on the same data no matter how many times you re-run the algorithm. This is different from k-means clustering, which produces different clustering dependent on an initial condition. One disadvantage of agglomerative clustering is that if one vector gets mis-assigned to some cluster early on, it will stay in that cluster until the end. K-means clustering can change cluster assignment at any time before convergence. This is why the way agglomerative hierarchical clustering determines the distance between clusters is of great importance to the final outcome.

In Part 2 of this tutorial we used the default method “complete” linkage, which determines the distance between to clusters A and B by determining the the maximum absolute distance between two vectors \vec{x} \in A and \vec{y} \in B.

d(A, B) = max \parallel (\vec{x} - \vec{y}) \parallel

Other methods use the minimum distance (“single”) or the average distance (“average”) to determine the distance between the clusters A and B. Single-link clustering tends to cluster via a “friends of friends” pattern, which typically results in a “stringy” clustering. As the distance depends on a single pair of vectors, it can handle irregular cluster shapes but it is sensitive to noise and outliers. At the opposite extreme, the complete-link clustering prefers to cluster vectors that are equally close together, which means it prefers globular clusters. It is less susceptible to noise and outliers but tends to break up big clusters into little ones. As you can imagine, the average-link method is somewhere in between. If you don’t already have an idea of which method to use based on experience or theoretical considerations, try which one works best for your problem.

The clustering method I will be using today is called Ward’s method. It determines the similarity between two clusters A and B based on the increase of the squared error upon merging the two clusters. This increase of variance \Delta is called the “merging cost”.

\Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \parallel \vec{m}_A - \vec{m}_B \parallel ^{2}

where \vec{m}_k is center (centroid) of cluster k and n_k is the number of elements in cluster k.

Ward’s method uses cluster centroids and thus tends to be similar to the average-linkage method. In R, Ward’s method is implemented as “ward.D2”.

clus_wd2 <- function(x) {
    hclust(x, method = "ward.D2")

Step 4: Plot a microarray heatmap

It is customary in microarray heatmaps to use a “red-black-green” color scheme, where “green” signifies down-regulated genes, “black” unchanged genes, and “red” up-regulated” genes. Let’s implement a custom color scheme using the “RColorBrewer” package

redblackgreen <- colorRampPalette(c("green", "black", "red"))(n = 100)

When available it is often instructive to plot the class labels of the samples we are attempting to cluster as a color code. It is an important sanity check to see if we are on the right track or have made a careless mistake. In our case, the samples either show no abnormal cytogenetics (“NEG”) or have the BCR-ABL translocation (“BCR/ABL”).

class_labels <- ifelse(all_var$mol.biol == "NEG", "grey80", "grey20")

We will use the “heatmap.2” function implemented in the “gplots” package. It functions the same way as R’s in-built “heatmap” function but offers more functionality.

Both the “heatmap” and the “heatmap.2” functions require you to feed them your data as a “matrix” object. We can extract the gene expression data as a matrix from the ExpressionSet using the “exprs” function.

          # clustering
          distfun = dist_cor, 
          hclust = clus_wd2,
          # scaling (genes are in rows)
          scale = "row",
          # color
          col = redblackgreen, 
          # labels
          labRow = "", 
          ColSideColors = class_labels, 
          # tweaking
          trace = "none",
 = "none")


Not as bad as it looks at first glance. If you look at the columns, the first two large clusters clearly separate a subpopulation of “NEG” samples (first cluster) and “BCR/ABL” samples (second cluster). The following smaller clusters are pretty homogenous too, just the last couple are more or less random. Also, remember that the branches can be rotated at the nodes without changing the topology of the dendrogram.

At the gene level we can likewise see clear patterns of down-regulated (green) and up-regulated genes (red) emerging, especially within the first two homogenous clusters.

Can we do better? Absolutely! We threw away most of the information by just taking the 200 most variable genes. Some might be just noisy genes, some might vary in response to other factors than the cytogenetic classification. We also have additional information on the patients such as sex, age, or whether the cancer went into remission. We would ideally make use of all of this information if we wanted to build a machine learning algorithm to distinguishes between different types of ALL. In this exercise our main purpose is visualization rather than analysis of the data, so let’s take a more straightforward way to select genes that distinguishes the two types of ALL.

Step 5: A “better” way of selecting genes

In the “ALL” data set each cancer sample is already classified by it cytogenetic properties. This is a luxurious situation because it allows us to tune the selection of genes we want to display based on the cancer type classification. We will use statistical tests to determine the differentially expressed genes and use them for our heatmap.

Note that this approach is fine if our purpose is to generate a visual summary of our data at hand but it is technically cheating. Why? Because you use the cancer type information to select the genes that are used for clustering the cancer types. It is a type of circular reasoning, or “data snooping” as it is called in machine learning jargon. This is why I took a truly unsupervised learning approach in the previous section and pretend that we did not know the class labels beforehand. Data snooping is a big problem in data science because it makes you think your model is better than it actually is. In reality, your model overfits your data at hand and it will likely not generalize well to future data.

Let’s start out by finding the genes that are differentially expressed between “NEG” and “BCR/ABL” samples. We will perform nonspecific filtering on the data first to remove genes that are either not expressed or don’t vary between the samples. This will increase the power of the t-tests later on.

# the shortest interval containing half of the data
# reasonable estimate of the "peak" of the distribution
sh <- shorth(all_sd)
# we take only genes that have a standard deviation
# greater than "sh"
all_sh <- all[all_sd >= sh, ]
# how many genes do we have left?

The distribution of standard deviations (“all_sd”) has a long tail towards the right (large values). This is typical for gene expression data. The “shorth” function is a simple and unbiased way to get an estimate of the peak of such a distribution to use as a cut-off to exclude genes with low variance. Using this approach, we were able to remove about 1/3 of the genes that are likely not relevant for our analysis. For more details, see the Bioconductor Case Studies.

Next, we will perform row-wise t-tests on all genes that are left. The cytogenetic classification “mol.biol” tells us which sample belongs to which group.

tt <- rowttests(all_sh, all_sh$mol.biol) 

This code performs 8812 separate t-tests. If we now took all genes that have a p-value smaller or equal to 0.05, we would expect around 440 genes to be in that category just by chance. This is an unacceptable number of false positives. The most common solution to this problem is to adjust the p-values for multiple testing, so that among the genes we chose our false discovery rate (FDR) is around 5%.

# use the Benjamini-Hochberg method to adjust 
tt$p.adj <- p.adjust(tt$p.value, method = "BH")
# subset the pre-filtered "all_sh" for genes
# with an adjusted p-value smaller or equal to 0.05
all_sig <- all_sh[tt$p.adj <= 0.05, ]
# how many genes are we left with?

We end up with 201 genes that are candidates for differential expression between the two types of ALL. As this number is very close to the number of genes we using for our variance-based filtering, we can plug the results directly into the “heatmap.2” function to compare the performance with our previous attempt.

          # clustering
          distfun = dist_cor, 
          hclust = clus_wd2,
          # scaling (genes are in rows)
          scale = "row",
          # color
          col = redblackgreen, 
          # labels
          labRow = "", 
          ColSideColors = class_labels, 
          # tweaking
          trace = "none",
 = "none")

This will result in the following heatmap.


The two types of ALL segregate nicely into two distinct clusters (with a few exceptions). Note that the last four samples of the dark grey “BCR/ABL” bar actually cluster with the “NEG” samples. They just happen to be next to the other dark grey samples in this particular topology of the dendrogram.

When we look at the differentially expressed genes, we see something interesting. The “BCR/ABL” samples appear to have many more genes that are up-regulated (red) compared to the “NEG” samples. Only about 20% of the significantly different genes are down-regulated (green). The Bcr-Abl chimeric kinase is thought to be constitutively active, so one could rationalize such an outcome by suggesting that the kinase inappropriately drives pathways that lead to turning on transcription factors, which in turn up-regulate the expression of certain genes.

It is not surprising that we did better than in our previous attempt. We used the cancer type class labels to inform our choice of genes. The hierarchical clustering gives us back some of what we put in. However, to summarize the data visually, such an approach is ok.

Step 6: Have mercy with the color-challenged

A surprisingly large percentage of the population, mostly men because the responsible genes are X-linked, suffer from red-green color blindness. If you want to be nice, use a different color palette, such as yellow-blue

yellowblackblue <- colorRampPalette(c("dodgerblue", "black", "gold"))(n = 100)

Plotting the same heatmap with the altered color scheme looks like this. If this is clearer to you than the previous one, you might not only have learned something about heatmaps but also something about yourself today.



  • Data preparation and feature selection (e.g. genes) is critical for the outcome of any data visualization
  • Understand which distance and clustering method works best for your data
  • Be mindful about data snooping when it comes to the application of any machine learning algorithm (hierarchical clustering is an unsupervised machine learning algorithm)


The full R script can be found on Github.


This post is part 3 of a series on heatmaps:

Part 1: What is a heatmap?

Part 2: How to create a simple heatmap with R?

Part 3: How to create a microarray heatmap with R?

Heatmaps – Part 3: How to create a microarray heatmap with R?

Heatmaps – Part 2: How to create a simple heatmap with R?

I will be using R to demonstrate how to create a simple heatmap and show the most important parameters of R’s build-in “heatmap” function.

R is a excellent programming language for statistical computing, bioinformatics, and data science. There are multiple high-quality tutorials and courses available online targeted at the beginner and intermediate level. For this tutorial, I am assuming that you have the basics down already.

Step 1: Get to know your data

To illustrate the major points of how to create a heatmap, I will be using a toy example that most of us can relate to (positively or negatively): cars. R has a build-in data set called “mtcars”, which contains 11 attributes such as miles per gallon and the number of cylinders for 32 car models from 1974. As it is good practice for every data analysis to familiarize yourself with the data set you are working with, you can do that in R using


Now, it is time to load the data into memory using the “data” function.


Next, we take a brief look at the data itself by asking for the dimension using “dim” and the data type using “class”.


So far so good. The data is a “data.frame” object with dimensions 32 rows and 11 columns. This is well within the range of “the two digit by two digit” rule we established earlier.

Step 2: Get to know the “heatmap” function

For plotting a heatmap in R, the data has to be a “matrix”. The “heatmap” function will complain, if we feed it a “data.frame” object. So let’s get that out of the way.

mat <- as.matrix(mtcars)

Now it’s time to try out the heatmap function.


heatmap_part2_fig1It looks like we have produced a reasonable heatmap but the colors look strange. One side is all red, and the other all yellow. What is going on here?

The danger of using the default parameters of any function is that we don’t get a chance to understand what is going on behind the scenes. As always, if you want information on a any R function, type “?heatmap”. In our case, “heatmap” assumes two things by default:

1) That we want clustering of rows and columns. This can be seen by the row and column dendrograms in the figure above. This is a reasonable default setting but we need to be aware of it. Especially because there are multiple different ways to perform the clustering.

2) That the attributes (features) of the cars such as “mpg” (miles per gallon) and “cyl” (cylinders) are in the rows and we want scaling of those features. The “mtcars” data set has the features in the columns, not the rows, which is why the scaling did not work. However, scaling will be necessary because “hp” (horse power) and “disp” (displacement) are about one to two orders of magnitude larger than the rest of the features. There are two potential solutions to this problem: either transpose the data matrix or scale by “column”. We will use the latter.

Let’s break down the creation of our heatmap step by step. To get a vanilla version, we will disable clustering by preventing reordering of rows and columns (Rowv = NA, Colv = NA) and ask for no scaling (scale = “none”).

heatmap(mat, Rowv = NA, Colv = NA, scale = "none")

heatmap_part2_fig2This is just the false-colored image representation of the data itself. We cannot make out the structure of the data because the image is dominated by the two features “disp” and “hp” that have high values. Let’s remedy that by scaling the features, which are located in the columns (scale = “column”).

heatmap(mat, Rowv = NA, Colv = NA, scale = "column")


Now that the features are at the same scale, we can clearly see that there is structure in the data. We are ready to add clustering to the heatmap. By default, the “heatmap” function uses “euclidean” distance and “complete” linkage for clustering. If this is gibberish to you, don’t worry. We will use the defaults for today and leave an explanation of the different distance measures and clustering algorithms for another time. Nevertheless, in the interest of code transparency it is always a good idea to be explicit about the clustering parameters you use. Your co-workers and your future self will be grateful.

# clustering of car attributes (rows)
row_dist <- dist(mat, method = "euclidean")
row_clus <- hclust(row_dist, method = "complete")

# clustering of cars (columns)
col_dist <- dist(t(mat), method = "euclidean")
col_clus <- hclust(col_dist, method = "complete")

        # clustering
        Rowv = as.dendrogram(row_clus), 
        Colv = as.dendrogram(col_clus), 
        # scaling
        scale = "column")

heatmap_part2_fig4Note that the “dist” function calculates the pair-wise distances between the rows. To calculate the distances between columns, we need to use the transpose of the matrix, which we get with the “t” function in R.

Also, the “heatmap” function expects a “dendrogram” object for ordering the rows and columns. We can coerce the clustered data into a dendrogram object by using “as.dendrogram”.

Step 3: Final tweaks

By default, “heatmap” uses “heat.colors” as its false-color palette. Beauty obviously is in the eye of the beholder, nevertheless I would like to make a counter-proposal for our color scheme. R easily lets you create your own color palettes using the “RColorBrewer” library. Let’s create a custom-made color palette, which I found here and liked.

yellowred <- colorRampPalette(c("lightyellow", "red"), space = "rgb")(100)

A function to plot a simple heatmap would look like this. Note that we are using the pre-computed row and column clustering from above.

        # clustering
        Rowv = as.dendrogram(row_clus), 
        Colv = as.dendrogram(col_clus), 
        # scaling
        scale = "column",
        # color
        col = yellowred)

heatmap_part2_fig5We can see that the clustering of cars found two main clusters, smaller utility cars on top and sports cars at the bottom. It makes sense that those two different types of cars can be distinguished by their attributes. Likewise, we can see that sports cars tend to be heavier (wt), have more cylinders (cyl), higher displacement (disp), and more horse power (hp). Conversely, they drive fewer miles per gallon of gas (mpg) and take less time on the first 1/4 mile (qsec).

A word of caution. The two features “mpg” and “qsec” cluster together purely based on their numeric values, not because of some qualitative criterion we might have a bias about. We would probably associate a low “mpg” score with something bad and a fast time on the first 1/4 mile as something good. The heatmap does not care.


  • For simple heatmaps, use the built-in R function “heatmap
  • Make sure your data is a matrix not a data.frame object
  • Be aware whether you want to scale rows or columns depending on what makes sense in your analysis
  • Be explicit about which distance measure and clustering algorithm you use


The full R script can be found on GitHub.

Heatmap series

This post is part 2 of a series on heatmaps:

Part 1: What is a heatmap?

Part 2: How to create a simple heatmap with R?

Part 3: How to create a microarray heatmap with R?

Heatmaps – Part 2: How to create a simple heatmap with R?

Heatmaps – Part 1: What is a heatmap?

What is a heatmap?

A heatmaps is a visual representation of a multidimensional data set using a false-colored image. The data itself is often in a matrix-like format such as an Excel spread sheet. In genetics, a common application for heatmaps is the visualization of microarray data with rows containing a list of gene transcripts and columns containing samples. The values themselves are measurements of the abundance of a particular transcript in a particular sample.

Why use a Heatmap?

There are two key motivations for using a heatmap to represent your data:

1) A heatmap allows the visualization of multidimensional data sets and thus provides an overview of the data set in a single figure. Other visualizations such as scatter plots are limited to two or three dimensions.

2) Heatmaps are almost exclusively used in conjunction with a machine learning technique called clustering. Clustering aims to identify structure within the data based on a measure of distance. Taking up the microarray example from before, we might be interested in which samples are more closely related to each other based on their transcriptional profile. In an ideal world, a given treatment would cause a dramatically different gene expression profile compared to control samples and clustering would separate control samples from treatment samples completely based on the structure within the data (i.e. the gene expression profile) without knowing what the samples are. In machine learning jargon such a technique is called unsupervised learning.

What do i need to be careful about with heatmaps?

The two theoretical advantages just mentioned come with their two corresponding practical shortcomings:

1) You might be tempted to just throw all the data in your Excel spread sheet into a heatmap and be done with it. In reality, it is an art to choose the right amount of complexity (i.e. number of rows and columns) for your visualization to convey the message you want to deliver. Too little data might be better represented with other visualizations such as box plots or scatter plots, too much data will make it hard to see the structure in the data and interferes with labeling of rows and columns. From my experience the sweet spot for labeled heatmaps is somewhere in the two digit by two digit realm, such as 50 genes for 10 samples. If labels can be omitted, 500 genes for 10 samples works just fine.

2) The keen reader might have noticed that I brushed over the details of clustering. There are not only multiple ways to cluster data but also multiple ways to calculate distances within the data. The specific choice of distance metric and clustering algorithm can lead to vastly different results and one should be aware of the idiosyncrasies and assumptions of each technique. Even once you have clustered your data and you are happy with the result, there are questions you should ask yourself: How many clusters do I choose to represent my data? Are the clusters stable if I add more data points or are they just haphazard occurrences? Do the clusters have meaningful interpretations in the system I am working with?

Heatmap series

This post is part 1 of a series on heatmaps:

Part 1: What is a heatmap?

Part 2: How to create a heatmap with R (in an ideal world)?

Part 3: How to create a microarray heatmap with R?

Heatmaps – Part 1: What is a heatmap?