Heatmaps – Part 2: How to create a simple heatmap with R?

I will be using R to demonstrate how to create a simple heatmap and show the most important parameters of R’s built-in “heatmap” function.

R is an excellent programming language for statistical computing, bioinformatics, and data science. There are multiple high-quality tutorials and courses available online targeted at the beginner and intermediate level. For this tutorial, I am assuming that you have the basics down already.

Step 1: Get to know your data

To illustrate the major points of how to create a heatmap, I will be using a toy example that most of us can relate to (positively or negatively): cars. R has a built-in data set called “mtcars”, which contains 11 attributes such as miles per gallon and the number of cylinders for 32 car models from 1974. It is good practice in every data analysis to familiarize yourself with the data set you are working with. In R, you can pull up the documentation of the data set using

?mtcars

Now, it is time to load the data into memory using the “data” function.

data(mtcars)

Next, we take a brief look at the data itself by asking for the dimension using “dim” and the data type using “class”.

dim(mtcars)
class(mtcars)
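As a complement to “dim” and “class”, here is a minimal sketch of a few more base R inspection calls. Which of them you find useful is a matter of taste, and the variable name “all_numeric” is just an illustration.

```r
data(mtcars)

# Structure: column names, types, and the first few values of each column
str(mtcars)

# First three rows, with the car models as row names
head(mtcars, n = 3)

# Check that every column is numeric, which matters for the matrix conversion later
all_numeric <- all(sapply(mtcars, is.numeric))
all_numeric
```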

So far so good. The data is a “data.frame” object with dimensions 32 rows and 11 columns. This is well within the range of “the two digit by two digit” rule we established earlier.

Step 2: Get to know the “heatmap” function

To plot a heatmap with “heatmap”, the data has to be a numeric “matrix”. The “heatmap” function will complain if we feed it a “data.frame” object. So let’s get that out of the way.

mat <- as.matrix(mtcars)
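A quick sanity check of the conversion might look like the sketch below. It relies on all columns of “mtcars” being numeric; the small data frame “df_mixed” is a made-up counter-example to show what happens otherwise.

```r
data(mtcars)
mat <- as.matrix(mtcars)

# The result is a numeric matrix that keeps the car models and feature names
is.matrix(mat)
is.numeric(mat)
identical(rownames(mat), rownames(mtcars))
identical(colnames(mat), colnames(mtcars))

# Hypothetical counter-example: one character column turns the whole matrix into text
df_mixed <- data.frame(x = 1:2, label = c("a", "b"))
mat_mixed <- as.matrix(df_mixed)
is.numeric(mat_mixed)  # FALSE, so heatmap() would refuse this matrix
```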

Now it’s time to try out the heatmap function.

heatmap(mat)

[Figure: heatmap_part2_fig1]

It looks like we have produced a reasonable heatmap, but the colors look strange. One side is all red, and the other all yellow. What is going on here?

The danger of using the default parameters of any function is that we don’t get a chance to understand what is going on behind the scenes. As always, if you want information on any R function, type “?” followed by its name, for example “?heatmap”. In our case, “heatmap” assumes two things by default:

1) That we want clustering of rows and columns. This can be seen by the row and column dendrograms in the figure above. This is a reasonable default setting, but we need to be aware of it, especially because there are multiple different ways to perform the clustering.

2) That the attributes (features) of the cars such as “mpg” (miles per gallon) and “cyl” (cylinders) are in the rows and that we want scaling of those features (scale = “row”). The “mtcars” data set has the features in the columns, not the rows, so the default scaled each car across its features instead of each feature across the cars, which is why the scaling did not work. However, scaling will be necessary because “hp” (horsepower) and “disp” (displacement) are about one to two orders of magnitude larger than the rest of the features. There are two potential solutions to this problem: either transpose the data matrix or scale by “column”. We will use the latter.
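To make the difference in magnitude concrete, the following sketch summarizes the range of each feature. The helper names “feature_range” and “largest” are made up for this illustration.

```r
data(mtcars)
mat <- as.matrix(mtcars)

# Minimum and maximum of every feature (features are the columns of mat)
feature_range <- apply(mat, 2, range)
rownames(feature_range) <- c("min", "max")
round(feature_range, 1)

# Features sorted by their largest value: disp and hp come out on top
largest <- sort(feature_range["max", ], decreasing = TRUE)
names(largest)[1:2]
```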

Let’s break down the creation of our heatmap step by step. To get a vanilla version, we will disable clustering by preventing reordering of rows and columns (Rowv = NA, Colv = NA) and ask for no scaling (scale = “none”).

heatmap(mat, Rowv = NA, Colv = NA, scale = "none")

[Figure: heatmap_part2_fig2]

This is just the false-colored image representation of the data itself. We cannot make out the structure of the data because the image is dominated by the two features “disp” and “hp” that have high values. Let’s remedy that by scaling the features, which are located in the columns (scale = “column”).

heatmap(mat, Rowv = NA, Colv = NA, scale = "column")

[Figure: heatmap_part2_fig3]
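According to the documentation of “heatmap”, scaling means that the values are centered and scaled in the chosen direction. The sketch below reproduces that idea by hand with the base R function “scale”; it is meant as an illustration of column-wise z-scores, not as a copy of the internal code of “heatmap”. The names “mat_z” and “hp_z” are made up.

```r
data(mtcars)
mat <- as.matrix(mtcars)

# Center each column to mean 0 and scale it to standard deviation 1 (z-scores)
mat_z <- scale(mat, center = TRUE, scale = TRUE)

# After scaling, every feature has a mean of about 0 and a standard deviation of 1
round(colMeans(mat_z), 10)
apply(mat_z, 2, sd)

# The same numbers computed by hand for a single feature
hp_z <- (mat[, "hp"] - mean(mat[, "hp"])) / sd(mat[, "hp"])
all.equal(unname(mat_z[, "hp"]), unname(hp_z))
```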

Now that the features are on the same scale, we can clearly see that there is structure in the data. We are ready to add clustering to the heatmap. By default, the “heatmap” function uses “euclidean” distance and “complete” linkage for clustering. If this is gibberish to you, don’t worry. We will use the defaults for today and leave an explanation of the different distance measures and clustering algorithms for another time. Nevertheless, in the interest of code transparency, it is always a good idea to be explicit about the clustering parameters you use. Your co-workers and your future self will be grateful.

# clustering of cars (rows)
row_dist <- dist(mat, method = "euclidean")
row_clus <- hclust(row_dist, method = "complete")

# clustering of car attributes (columns)
col_dist <- dist(t(mat), method = "euclidean")
col_clus <- hclust(col_dist, method = "complete")

heatmap(mat, 
        # clustering
        Rowv = as.dendrogram(row_clus), 
        Colv = as.dendrogram(col_clus), 
        # scaling
        scale = "column")

[Figure: heatmap_part2_fig4]

Note that the “dist” function calculates the pair-wise distances between the rows. To calculate the distances between columns, we need to use the transpose of the matrix, which we get with the “t” function in R.
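The row versus column behavior of “dist” can be checked directly. This is a minimal sketch using the same distance settings as above; in “mtcars”, the rows are the 32 cars and the columns are the 11 features.

```r
data(mtcars)
mat <- as.matrix(mtcars)

row_dist <- dist(mat, method = "euclidean")     # distances between the 32 cars
col_dist <- dist(t(mat), method = "euclidean")  # distances between the 11 features

# The "Size" attribute of a dist object is the number of items that were compared
attr(row_dist, "Size")
attr(col_dist, "Size")

# A dist object stores each pair once: n * (n - 1) / 2 values
length(row_dist) == 32 * 31 / 2
length(col_dist) == 11 * 10 / 2
```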

Also, the “heatmap” function expects a “dendrogram” object for ordering the rows and columns. We can coerce the clustered data into a dendrogram object by using “as.dendrogram”.
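A short sketch of the coercion step, shown here for the columns only. The name “col_dend” is made up for this example.

```r
data(mtcars)
mat <- as.matrix(mtcars)

col_clus <- hclust(dist(t(mat), method = "euclidean"), method = "complete")
class(col_clus)   # "hclust"

col_dend <- as.dendrogram(col_clus)
class(col_dend)   # "dendrogram"

# Leaf labels in plotting order, i.e. the order in which the features are arranged
labels(col_dend)

# The coercion keeps the ordering found by hclust()
identical(labels(col_dend), col_clus$labels[col_clus$order])
```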

Step 3: Final tweaks

By default, “heatmap” uses “heat.colors” as its false-color palette. Beauty obviously is in the eye of the beholder; nevertheless, I would like to make a counter-proposal for our color scheme. R easily lets you create your own color palettes with the “colorRampPalette” function, which ships with base R (the “RColorBrewer” library offers further ready-made palettes but is not required here). Let’s create a custom-made color palette, which I found here and liked.

# colorRampPalette() is part of grDevices, which R loads by default
yellowred <- colorRampPalette(c("lightyellow", "red"), space = "rgb")(100)
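To see what the palette actually contains, it can be inspected like any other character vector. This is a small sketch using only base R functions.

```r
# colorRampPalette() returns a function; calling it with 100 yields 100 colors
yellowred <- colorRampPalette(c("lightyellow", "red"), space = "rgb")(100)

length(yellowred)
head(yellowred, 3)

# The ramp starts at light yellow and ends at red (RGB components of both ends)
col2rgb(yellowred[1])
col2rgb(yellowred[100])
```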

The call to plot a simple heatmap then looks like this. Note that we are using the pre-computed row and column clustering from above.

heatmap(mat, 
        # clustering
        Rowv = as.dendrogram(row_clus), 
        Colv = as.dendrogram(col_clus), 
        # scaling
        scale = "column",
        # color
        col = yellowred)

[Figure: heatmap_part2_fig5]

We can see that the clustering of cars found two main clusters, smaller utility cars on top and sports cars at the bottom. It makes sense that those two different types of cars can be distinguished by their attributes. Likewise, we can see that sports cars tend to be heavier (wt), have more cylinders (cyl), higher displacement (disp), and more horsepower (hp). Conversely, they drive fewer miles per gallon of gas (mpg) and take less time on the first 1/4 mile (qsec).

A word of caution. The two features “mpg” and “qsec” cluster together purely based on their numeric values, not because of some qualitative criterion we might have a bias about. We would probably associate a low “mpg” score with something bad and a fast time on the first 1/4 mile with something good. The heatmap does not care.
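One way to see this is to look at the column distances themselves. The sketch below assumes the same setup as above, that is, Euclidean distance on the unscaled matrix; the name “nearest_to_mpg” is made up. With these settings, closeness mostly reflects that two features take values of similar magnitude.

```r
data(mtcars)
mat <- as.matrix(mtcars)

# Same column distances as used for the column dendrogram (unscaled values)
col_dist <- as.matrix(dist(t(mat), method = "euclidean"))

# Features ordered by their distance to "mpg"; the nearest one comes first
nearest_to_mpg <- sort(col_dist["mpg", colnames(col_dist) != "mpg"])
round(nearest_to_mpg, 1)

# Typical magnitudes of the two features for comparison
colMeans(mat[, c("mpg", "qsec")])
```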

Recap

  • For simple heatmaps, use the built-in R function “heatmap”
  • Make sure your data is a matrix, not a data.frame object
  • Be aware whether you want to scale rows or columns depending on what makes sense in your analysis
  • Be explicit about which distance measure and clustering algorithm you use

Reproducibility

The full R script can be found on GitHub.


Heatmap series

This post is part 2 of a series on heatmaps:

Part 1: What is a heatmap?

Part 2: How to create a simple heatmap with R?

Part 3: How to create a microarray heatmap with R?

Heatmaps – Part 1: What is a heatmap?

What is a heatmap?

A heatmap is a visual representation of a multidimensional data set using a false-colored image. The data itself is often in a matrix-like format such as an Excel spreadsheet. In genetics, a common application for heatmaps is the visualization of microarray data with rows containing a list of gene transcripts and columns containing samples. The values themselves are measurements of the abundance of a particular transcript in a particular sample.

Why use a heatmap?

There are two key motivations for using a heatmap to represent your data:

1) A heatmap allows the visualization of multidimensional data sets and thus provides an overview of the data set in a single figure. Other visualizations such as scatter plots are limited to two or three dimensions.

2) Heatmaps are almost exclusively used in conjunction with a machine learning technique called clustering. Clustering aims to identify structure within the data based on a measure of distance. Taking up the microarray example from before, we might be interested in which samples are more closely related to each other based on their transcriptional profile. In an ideal world, a given treatment would cause a dramatically different gene expression profile compared to control samples and clustering would separate control samples from treatment samples completely based on the structure within the data (i.e. the gene expression profile) without knowing what the samples are. In machine learning jargon such a technique is called unsupervised learning.
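The idea of unsupervised clustering can be sketched in a few lines of R. The data below is entirely made up (three control and three treatment samples with clearly separated values), and the sample names are hypothetical; real expression data is rarely this clean.

```r
# Hypothetical toy data: 6 samples (rows) x 4 transcripts (columns)
set.seed(1)
control   <- matrix(rnorm(12, mean = 0, sd = 0.5), nrow = 3)
treatment <- matrix(rnorm(12, mean = 5, sd = 0.5), nrow = 3)
expr <- rbind(control, treatment)
rownames(expr) <- c("ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3")

# Cluster the samples without telling the algorithm which group they belong to
sample_clus <- hclust(dist(expr, method = "euclidean"), method = "complete")

# Cut the tree into two groups and compare with the known labels
groups <- cutree(sample_clus, k = 2)
table(groups, known = rep(c("control", "treatment"), each = 3))
```

The known labels are only used afterwards to check the result; the clustering itself never sees them.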

What do I need to be careful about with heatmaps?

The two theoretical advantages just mentioned come with their two corresponding practical shortcomings:

1) You might be tempted to just throw all the data in your Excel spreadsheet into a heatmap and be done with it. In reality, it is an art to choose the right amount of complexity (i.e. number of rows and columns) for your visualization to convey the message you want to deliver. Too little data might be better represented with other visualizations such as box plots or scatter plots. Too much data will make it hard to see the structure in the data and interferes with labeling of rows and columns. In my experience, the sweet spot for labeled heatmaps is somewhere in the two digit by two digit realm, such as 50 genes for 10 samples. If labels can be omitted, 500 genes for 10 samples works just fine.

2) The keen reader might have noticed that I brushed over the details of clustering. There are not only multiple ways to cluster data but also multiple ways to calculate distances within the data. The specific choice of distance metric and clustering algorithm can lead to vastly different results and one should be aware of the idiosyncrasies and assumptions of each technique. Even once you have clustered your data and you are happy with the result, there are questions you should ask yourself: How many clusters do I choose to represent my data? Are the clusters stable if I add more data points or are they just haphazard occurrences? Do the clusters have meaningful interpretations in the system I am working with?
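How much the choices matter can be explored with a small experiment. The sketch below uses R's built-in “mtcars” data set purely as a stand-in; the two method combinations and the value of “k” are arbitrary picks for illustration, not recommendations.

```r
data(mtcars)
mat <- scale(as.matrix(mtcars))  # put all features on a common scale first

# Two of many possible combinations of distance measure and linkage
clus_a <- hclust(dist(mat, method = "euclidean"), method = "complete")
clus_b <- hclust(dist(mat, method = "manhattan"), method = "average")

# Cut both trees into the same number of clusters
k <- 3  # arbitrary choice for illustration
groups_a <- cutree(clus_a, k = k)
groups_b <- cutree(clus_b, k = k)

# Cross-tabulate the assignments: if every row has a single non-zero cell,
# the two methods agree up to relabeling of the clusters
table(euclidean_complete = groups_a, manhattan_average = groups_b)

# Cluster sizes for several choices of k under the first method
lapply(2:5, function(k) table(cutree(clus_a, k = k)))
```

Looking at how the cluster sizes change with the number of clusters gives a first impression of how stable the grouping is.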


Heatmap series

This post is part 1 of a series on heatmaps:

Part 1: What is a heatmap?

Part 2: How to create a heatmap with R (in an ideal world)?

Part 3: How to create a microarray heatmap with R?
