Heatmaps – Part 3: How to create a microarray heatmap with R?

It is time to deal with some real data. I have hinted in Part 1 of this series that gene expression profiling using microarrays is a prime application for heatmaps. Today, we will look at the differences of gene expression in Acute Lymphoblastic Leukemia (ALL) samples that have either no cytogenetic abnormalities or the famous BCR/ABL chromosomal translocation (“Philadelphia chromosome”). Treatment of patients with the BCR/ABL translocation was the first big success of targeted chemotherapy using the small molecule kinase inhibitor Imatinib (Gleevec) around the turn of the century.

We will investigate whether the gene expression profile between the two types of ALL are different, and if yes, how well hierarchical clustering can detect the type of ALL from the microarray data. An important follow-up to such an analysis would be to determine the genes that contribute to a gene expression “fingerprint” that predicts the type of ALL simply based on the gene expression profile of a patient sample so that targeted therapy can be administered if available.

For this tutorial, I am assuming that you have a reasonable familiarity with R. You should know about the basic data types, be comfortable with subsetting, and be able to write simple functions.

This analysis is inspired by an example in the slightly dated but excellent book Bioconductor Case Studies.

Step 1: Prepare the data

The data itself is conveniently available in an R package called “ALL”.


Let’s look at what exactly we are dealing with here.

# look at help page associated with "ALL"
# determine class of "ALL"
# how much data are we dealing with?

There are several pieces of important information:

  1. The data is not a “data.frame” or “matrix” but an ExpressionSet. ExpressionSets are the go-to data representation for microarray data in a bundle of R libraries called “Bioconductor“. It not only makes it easy to extract the actual data as a “matrix” but also contains useful annotation. In our case “ALL” is an ExpressionSet with 12625 genes and 128 cancer samples.
  2. The information on the cytogenetic phenotype is stored in a variable called “mol.biol”. This will be useful to get a subset of the data.
  3. Annotation on whether the disease is B-cell or T-cell based can be found in the variable “BT”. Again, we will use this for extracting a subset of the data.

Heatmaps as a tool for data visualization works best if the data is not too diverse and not too large. Therefore, we will generate a subset of the “ALL” data that focuses on two types of ALL (“NEG” and “BCR/ABL”) that originate from B-cells.

# get samples with either no cytogenetic abnormalities (NEG)
# or the BCR-ABL translocation (BCR/ABL)
neg_bcrabl <- ALL$mol.biol %in% c("NEG", "BCR/ABL")
# get indices cancers originating from B-cells
bcell <- grepl("^B", ALL$BT)
# subset the ALL data set
all <- ALL[, bcell & neg_bcrabl]
# adjust the factor levels to reflect the subset
all$mol.biol <- droplevels(all$mol.biol)
all$mol.biol <- relevel(all$mol.biol, ref = "NEG")
# how much data are we left with?

We were able to reduce the number of cancer samples from 128 to 79. Good enough for now.

Let’s deal with the number of genes. A common approach is to assume that genes that do not display much variation across the samples are unlikely to be important for the analysis. They are either did not hybridize to the microarray, are not expressed, or simply did not change upon treatment. We will determine the most variable genes and use them for plotting a heatmap visualization of the data set.

# determine the standard deviation for all genes across the samples
# note that this is essentially an optimized version of
# apply(exprs(all), 1, sd)
all_sd <- rowSds(exprs(all))
# get the 200 most variable genes
top200 <- names(sort(all_sd, decreasing = TRUE))[1:200]
all_var <- all[top200, ]

Step 2: Decide on a distance metric

In our previous example, we used euclidean distance. Euclidean distance is the square root of the sum of the squared distance between each pair of elements of two vectors i and j

d_{ij}=\sqrt{\sum_{k=1}^{n}{(x_{ik} - x_{jk})^2}}

You can think of it as the “as the crow flies” distance between two vectors i and j in n dimensions.

One important aspect to consider about euclidean distance is that it is dominated by the absolute value of a feature x_k, not the shape of the overall vector. In gene expression studies, we are particularly interested in how genes of different expression levels co-vary across different conditions, genotypes or treatments. The most established metric to calculate the distance between samples in gene expression data is the complement of the correlation coefficient.

d_{ij}=1 - cor(\vec{x_i}, \vec{x_j})

Note that we use the complement of the correlation coefficient because the correlation coefficient by itself is a measure of similarity, not distance. The correlation coefficient is invariant under linear transformation, i.e. invariant to scale and location and takes into account the similarity of the shapes of two vectors. In most cases we would use Pearson correlation, unless we have reason to assume that there is a non-linear relationship of the expression levels between samples. Then we would use the rank-based Spearman correlation coefficient.

Let’s set up a distance function in R that will use later in our call to the “heatmap” function.

dist_cor <- function(x) {
    as.dist(1 - cor(t(x), method = "pearson"))

One little quirk of the “cor” function is that it calculates correlations on columns. Distances however are calculated on rows. A quick fix is to feed the transpose of the matrix to “cor”.

Step 3: Decide on a clustering method

There are many ways to cluster data but I will focus on one method commonly used in heatmaps: agglomerative hierarchical clustering. You can think of this as a bottom-up approach, in which all vectors start out as their own cluster and the algorithm iteratively merges the clusters that it determines the most similar until all clusters are merged into one. This results in a tree-like structure called a dendrogram, which depicts the distance between vectors as the length of the branches. One important aspect of agglomerative hierarchical clustering is that it is deterministic, i.e. it always ends up producing the same result on the same data no matter how many times you re-run the algorithm. This is different from k-means clustering, which produces different clustering dependent on an initial condition. One disadvantage of agglomerative clustering is that if one vector gets mis-assigned to some cluster early on, it will stay in that cluster until the end. K-means clustering can change cluster assignment at any time before convergence. This is why the way agglomerative hierarchical clustering determines the distance between clusters is of great importance to the final outcome.

In Part 2 of this tutorial we used the default method “complete” linkage, which determines the distance between to clusters A and B by determining the the maximum absolute distance between two vectors \vec{x} \in A and \vec{y} \in B.

d(A, B) = max \parallel (\vec{x} - \vec{y}) \parallel

Other methods use the minimum distance (“single”) or the average distance (“average”) to determine the distance between the clusters A and B. Single-link clustering tends to cluster via a “friends of friends” pattern, which typically results in a “stringy” clustering. As the distance depends on a single pair of vectors, it can handle irregular cluster shapes but it is sensitive to noise and outliers. At the opposite extreme, the complete-link clustering prefers to cluster vectors that are equally close together, which means it prefers globular clusters. It is less susceptible to noise and outliers but tends to break up big clusters into little ones. As you can imagine, the average-link method is somewhere in between. If you don’t already have an idea of which method to use based on experience or theoretical considerations, try which one works best for your problem.

The clustering method I will be using today is called Ward’s method. It determines the similarity between two clusters A and B based on the increase of the squared error upon merging the two clusters. This increase of variance \Delta is called the “merging cost”.

\Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \parallel \vec{m}_A - \vec{m}_B \parallel ^{2}

where \vec{m}_k is center (centroid) of cluster k and n_k is the number of elements in cluster k.

Ward’s method uses cluster centroids and thus tends to be similar to the average-linkage method. In R, Ward’s method is implemented as “ward.D2”.

clus_wd2 <- function(x) {
    hclust(x, method = "ward.D2")

Step 4: Plot a microarray heatmap

It is customary in microarray heatmaps to use a “red-black-green” color scheme, where “green” signifies down-regulated genes, “black” unchanged genes, and “red” up-regulated” genes. Let’s implement a custom color scheme using the “RColorBrewer” package

redblackgreen <- colorRampPalette(c("green", "black", "red"))(n = 100)

When available it is often instructive to plot the class labels of the samples we are attempting to cluster as a color code. It is an important sanity check to see if we are on the right track or have made a careless mistake. In our case, the samples either show no abnormal cytogenetics (“NEG”) or have the BCR-ABL translocation (“BCR/ABL”).

class_labels <- ifelse(all_var$mol.biol == "NEG", "grey80", "grey20")

We will use the “heatmap.2” function implemented in the “gplots” package. It functions the same way as R’s in-built “heatmap” function but offers more functionality.

Both the “heatmap” and the “heatmap.2” functions require you to feed them your data as a “matrix” object. We can extract the gene expression data as a matrix from the ExpressionSet using the “exprs” function.

          # clustering
          distfun = dist_cor, 
          hclust = clus_wd2,
          # scaling (genes are in rows)
          scale = "row",
          # color
          col = redblackgreen, 
          # labels
          labRow = "", 
          ColSideColors = class_labels, 
          # tweaking
          trace = "none",
          density.info = "none")


Not as bad as it looks at first glance. If you look at the columns, the first two large clusters clearly separate a subpopulation of “NEG” samples (first cluster) and “BCR/ABL” samples (second cluster). The following smaller clusters are pretty homogenous too, just the last couple are more or less random. Also, remember that the branches can be rotated at the nodes without changing the topology of the dendrogram.

At the gene level we can likewise see clear patterns of down-regulated (green) and up-regulated genes (red) emerging, especially within the first two homogenous clusters.

Can we do better? Absolutely! We threw away most of the information by just taking the 200 most variable genes. Some might be just noisy genes, some might vary in response to other factors than the cytogenetic classification. We also have additional information on the patients such as sex, age, or whether the cancer went into remission. We would ideally make use of all of this information if we wanted to build a machine learning algorithm to distinguishes between different types of ALL. In this exercise our main purpose is visualization rather than analysis of the data, so let’s take a more straightforward way to select genes that distinguishes the two types of ALL.

Step 5: A “better” way of selecting genes

In the “ALL” data set each cancer sample is already classified by it cytogenetic properties. This is a luxurious situation because it allows us to tune the selection of genes we want to display based on the cancer type classification. We will use statistical tests to determine the differentially expressed genes and use them for our heatmap.

Note that this approach is fine if our purpose is to generate a visual summary of our data at hand but it is technically cheating. Why? Because you use the cancer type information to select the genes that are used for clustering the cancer types. It is a type of circular reasoning, or “data snooping” as it is called in machine learning jargon. This is why I took a truly unsupervised learning approach in the previous section and pretend that we did not know the class labels beforehand. Data snooping is a big problem in data science because it makes you think your model is better than it actually is. In reality, your model overfits your data at hand and it will likely not generalize well to future data.

Let’s start out by finding the genes that are differentially expressed between “NEG” and “BCR/ABL” samples. We will perform nonspecific filtering on the data first to remove genes that are either not expressed or don’t vary between the samples. This will increase the power of the t-tests later on.

# the shortest interval containing half of the data
# reasonable estimate of the "peak" of the distribution
sh <- shorth(all_sd)
# we take only genes that have a standard deviation
# greater than "sh"
all_sh <- all[all_sd >= sh, ]
# how many genes do we have left?

The distribution of standard deviations (“all_sd”) has a long tail towards the right (large values). This is typical for gene expression data. The “shorth” function is a simple and unbiased way to get an estimate of the peak of such a distribution to use as a cut-off to exclude genes with low variance. Using this approach, we were able to remove about 1/3 of the genes that are likely not relevant for our analysis. For more details, see the Bioconductor Case Studies.

Next, we will perform row-wise t-tests on all genes that are left. The cytogenetic classification “mol.biol” tells us which sample belongs to which group.

tt <- rowttests(all_sh, all_sh$mol.biol) 

This code performs 8812 separate t-tests. If we now took all genes that have a p-value smaller or equal to 0.05, we would expect around 440 genes to be in that category just by chance. This is an unacceptable number of false positives. The most common solution to this problem is to adjust the p-values for multiple testing, so that among the genes we chose our false discovery rate (FDR) is around 5%.

# use the Benjamini-Hochberg method to adjust 
tt$p.adj <- p.adjust(tt$p.value, method = "BH")
# subset the pre-filtered "all_sh" for genes
# with an adjusted p-value smaller or equal to 0.05
all_sig <- all_sh[tt$p.adj <= 0.05, ]
# how many genes are we left with?

We end up with 201 genes that are candidates for differential expression between the two types of ALL. As this number is very close to the number of genes we using for our variance-based filtering, we can plug the results directly into the “heatmap.2” function to compare the performance with our previous attempt.

          # clustering
          distfun = dist_cor, 
          hclust = clus_wd2,
          # scaling (genes are in rows)
          scale = "row",
          # color
          col = redblackgreen, 
          # labels
          labRow = "", 
          ColSideColors = class_labels, 
          # tweaking
          trace = "none",
          density.info = "none")

This will result in the following heatmap.


The two types of ALL segregate nicely into two distinct clusters (with a few exceptions). Note that the last four samples of the dark grey “BCR/ABL” bar actually cluster with the “NEG” samples. They just happen to be next to the other dark grey samples in this particular topology of the dendrogram.

When we look at the differentially expressed genes, we see something interesting. The “BCR/ABL” samples appear to have many more genes that are up-regulated (red) compared to the “NEG” samples. Only about 20% of the significantly different genes are down-regulated (green). The Bcr-Abl chimeric kinase is thought to be constitutively active, so one could rationalize such an outcome by suggesting that the kinase inappropriately drives pathways that lead to turning on transcription factors, which in turn up-regulate the expression of certain genes.

It is not surprising that we did better than in our previous attempt. We used the cancer type class labels to inform our choice of genes. The hierarchical clustering gives us back some of what we put in. However, to summarize the data visually, such an approach is ok.

Step 6: Have mercy with the color-challenged

A surprisingly large percentage of the population, mostly men because the responsible genes are X-linked, suffer from red-green color blindness. If you want to be nice, use a different color palette, such as yellow-blue

yellowblackblue <- colorRampPalette(c("dodgerblue", "black", "gold"))(n = 100)

Plotting the same heatmap with the altered color scheme looks like this. If this is clearer to you than the previous one, you might not only have learned something about heatmaps but also something about yourself today.



  • Data preparation and feature selection (e.g. genes) is critical for the outcome of any data visualization
  • Understand which distance and clustering method works best for your data
  • Be mindful about data snooping when it comes to the application of any machine learning algorithm (hierarchical clustering is an unsupervised machine learning algorithm)


The full R script can be found on Github.


This post is part 3 of a series on heatmaps:

Part 1: What is a heatmap?

Part 2: How to create a simple heatmap with R?

Part 3: How to create a microarray heatmap with R?

Heatmaps – Part 3: How to create a microarray heatmap with R?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s