I will be using R to demonstrate how to create a simple heatmap and show the most important parameters of R’s built-in “heatmap” function.
R is an excellent programming language for statistical computing, bioinformatics, and data science. There are multiple high-quality tutorials and courses available online, targeted at the beginner and intermediate level. For this tutorial, I am assuming that you have the basics down already.
Step 1: Get to know your data
To illustrate the major points of how to create a heatmap, I will be using a toy example that most of us can relate to (positively or negatively): cars. R has a built-in data set called “mtcars”, which contains 11 attributes such as miles per gallon and the number of cylinders for 32 car models from 1974. As it is good practice for every data analysis to familiarize yourself with the data set you are working with, you can do that in R by opening its help page with “?mtcars”.
Now, it is time to load the data into memory using the “data” function.
Next, we take a brief look at the data itself by asking for the dimension using “dim” and the data type using “class”.
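The calls described above are not shown in the text. A minimal sketch of what they would look like, using only base R functions (the final “head” call is an optional extra, not something the text asks for):

```r
# load the built-in data set into the workspace
data("mtcars")

# dimensions: number of rows (cars) and columns (attributes)
dim(mtcars)

# data type of the object
class(mtcars)

# optional: peek at the first few rows
head(mtcars, n = 3)
```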
So far so good. The data is a “data.frame” object with dimensions 32 rows and 11 columns. This is well within the range of “the two digit by two digit” rule we established earlier.
Step 2: Get to know the “heatmap” function
For plotting a heatmap in R, the data has to be a numeric “matrix”. The “heatmap” function will complain if we feed it a “data.frame” object. So let’s get that out of the way.
mat <- as.matrix(mtcars)
Now it’s time to try out the heatmap function.
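A minimal sketch of that first attempt, assuming the call relies entirely on the defaults. The matrix conversion is repeated so the snippet runs on its own, and the name “hm” is only illustrative:

```r
mat <- as.matrix(mtcars)

# default call: clusters rows and columns and scales by row
hm <- heatmap(mat)

# the invisible return value holds the row and column order used in the plot
str(hm[c("rowInd", "colInd")])
```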
The danger of using the default parameters of any function is that we don’t get a chance to understand what is going on behind the scenes. As always, if you want information on any R function, type “?” followed by its name, in our case “?heatmap”. Here, “heatmap” assumes two things by default:
1) That we want clustering of rows and columns. This can be seen by the row and column dendrograms in the figure above. This is a reasonable default setting, but we need to be aware of it, especially because there are multiple different ways to perform the clustering.
2) That the attributes (features) of the cars such as “mpg” (miles per gallon) and “cyl” (cylinders) are in the rows and we want scaling of those features (the default is scale = "row"). The “mtcars” data set has the features in the columns, not the rows, so the default scaled each car across its 11 unrelated features instead of each feature across the 32 cars, which is why the scaling did not work. However, scaling will be necessary because “hp” (horsepower) and “disp” (displacement) are about one to two orders of magnitude larger than the rest of the features. There are two potential solutions to this problem: either transpose the data matrix or scale by “column”. We will use the latter.
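The difference in magnitude between the features is easy to check before deciding on a scaling direction. A small sketch using base R (the object name “col_ranges” is just for illustration):

```r
mat <- as.matrix(mtcars)

# minimum and maximum of every feature (column)
col_ranges <- apply(mat, 2, range)
rownames(col_ranges) <- c("min", "max")
round(col_ranges, 1)

# features ordered by their largest value; "disp" and "hp" come out on top
sort(col_ranges["max", ], decreasing = TRUE)
```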
Let’s break down the creation of our heatmap step by step. To get a vanilla version, we will disable clustering by preventing reordering of rows and columns (Rowv = NA, Colv = NA) and ask for no scaling (scale = "none").
heatmap(mat, Rowv = NA, Colv = NA, scale = "none")
This is just the false-colored image representation of the data itself. We cannot make out the structure of the data because the image is dominated by the two features “disp” and “hp”, which have much higher values than the rest. Let’s remedy that by scaling the features, which are located in the columns (scale = "column").
heatmap(mat, Rowv = NA, Colv = NA, scale = "column")
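To see what the column scaling does to the numbers, the same transformation can be sketched by hand. As far as I can tell from the function’s documentation, scale = "column" centers each column and divides it by its standard deviation, which is what base R’s “scale” function does by default. Treat that equivalence as an assumption rather than a guarantee; the name “mat_scaled” is illustrative.

```r
mat <- as.matrix(mtcars)

# center each feature to mean 0 and scale it to standard deviation 1
mat_scaled <- scale(mat)

# after scaling, all features live on a comparable range
round(colMeans(mat_scaled), 10)
apply(mat_scaled, 2, sd)
range(mat_scaled)
```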
Now that the features are at the same scale, we can clearly see that there is structure in the data. We are ready to add clustering to the heatmap. By default, the “heatmap” function uses “euclidean” distance and “complete” linkage for clustering. If this is gibberish to you, don’t worry. We will use the defaults for today and leave an explanation of the different distance measures and clustering algorithms for another time. Nevertheless, in the interest of code transparency it is always a good idea to be explicit about the clustering parameters you use. Your co-workers and your future self will be grateful.
# clustering of cars (rows)
row_dist <- dist(mat, method = "euclidean")
row_clus <- hclust(row_dist, method = "complete")

# clustering of car attributes (columns)
col_dist <- dist(t(mat), method = "euclidean")
col_clus <- hclust(col_dist, method = "complete")

heatmap(mat,
        # clustering
        Rowv = as.dendrogram(row_clus),
        Colv = as.dendrogram(col_clus),
        # scaling
        scale = "column")
Note that the “dist” function calculates the pair-wise distances between the rows. To calculate the distances between columns, we need to use the transpose of the matrix, which we get with the “t” function in R.
Also, the “heatmap” function expects a “dendrogram” object for ordering the rows and columns. We can coerce the clustered data into a dendrogram object by using “as.dendrogram”.
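These points can be checked directly in the console. The sketch below looks up the default methods of “dist” and “hclust”, confirms that “dist” compares rows, and shows the class change made by “as.dendrogram”. The names “n_rows” and “n_cols” are illustrative, and the clustering is recomputed so the snippet runs on its own.

```r
mat <- as.matrix(mtcars)

# default arguments of the two clustering building blocks
formals(dist)$method
formals(hclust)$method

# dist() compares rows: the cars here, the features after transposing
n_rows <- attr(dist(mat), "Size")
n_cols <- attr(dist(t(mat)), "Size")
c(rows = n_rows, columns = n_cols)

# hclust() returns an "hclust" object; as.dendrogram() converts it
row_clus <- hclust(dist(mat, method = "euclidean"), method = "complete")
class(row_clus)
class(as.dendrogram(row_clus))
```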
Step 3: Final tweaks
In older versions of R, “heatmap” uses “heat.colors” as its default false-color palette (since R 3.6.0 the default is based on “hcl.colors”). Beauty obviously is in the eye of the beholder; nevertheless, I would like to make a counter-proposal for our color scheme. R easily lets you create your own color palettes, for example with the “RColorBrewer” library or with the “colorRampPalette” function that ships with base R. Let’s create a custom-made color palette, which I found here and liked.
library(RColorBrewer)
yellowred <- colorRampPalette(c("lightyellow", "red"), space = "rgb")(100)
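Note that “colorRampPalette” is part of base R (package “grDevices”), so the palette above does not strictly depend on “RColorBrewer”. Here is a small sketch to inspect the result and, as an assumed alternative, build a similar palette from a ready-made ColorBrewer scheme. The name “brewer_pal” is illustrative, and the second part only runs if “RColorBrewer” is installed.

```r
# interpolate 100 colors between light yellow and red
yellowred <- colorRampPalette(c("lightyellow", "red"), space = "rgb")(100)
length(yellowred)
head(yellowred, n = 3)

# alternative: start from the 9-color "YlOrRd" ColorBrewer scheme
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  brewer_pal <- colorRampPalette(RColorBrewer::brewer.pal(9, "YlOrRd"))(100)
  head(brewer_pal, n = 3)
}
```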
The call to plot a simple heatmap then looks like this. Note that we are using the pre-computed row and column clustering from above.
heatmap(mat,
        # clustering
        Rowv = as.dendrogram(row_clus),
        Colv = as.dendrogram(col_clus),
        # scaling
        scale = "column",
        # color
        col = yellowred)
We can see that the clustering of cars found two main clusters, smaller utility cars on top and sports cars at the bottom. It makes sense that those two different types of cars can be distinguished by their attributes. Likewise, we can see that sports cars tend to be heavier (wt), have more cylinders (cyl), higher displacement (disp), and more horsepower (hp). Conversely, they drive fewer miles per gallon of gas (mpg) and take less time on the first 1/4 mile (qsec).
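The two main clusters can also be read off programmatically instead of by eye. This is a minimal sketch using “cutree” from base R to cut the row dendrogram into two groups. The clustering is recomputed so the snippet runs on its own, and the variable name “groups” is illustrative.

```r
mat <- as.matrix(mtcars)
row_clus <- hclust(dist(mat, method = "euclidean"), method = "complete")

# assign every car to one of the two top-level clusters
groups <- cutree(row_clus, k = 2)
table(groups)

# average feature values per cluster, for comparison with the heatmap
round(aggregate(mtcars, by = list(cluster = groups), FUN = mean), 1)
```

Because the clustering runs on the unscaled values, just like in the heatmap call above, the group means should be read alongside the plot rather than as an independent result.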
A word of caution. The two features “mpg” and “qsec” cluster together purely based on their numeric values, not because of some qualitative criterion we might have a bias about. We would probably associate a low “mpg” score with something bad and a fast time on the first 1/4 mile as something good. The heatmap does not care.
- For simple heatmaps, use the built-in R function “heatmap”
- Make sure your data is a matrix, not a data.frame object
- Be aware whether you want to scale rows or columns depending on what makes sense in your analysis
- Be explicit about which distance measure and clustering algorithm you use
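The points above can be bundled into one small helper. This is only a sketch: the function name “simple_heatmap” and its arguments are hypothetical and not part of the original script.

```r
# hypothetical helper that applies the checklist above in one place
simple_heatmap <- function(x,
                           dist_method = "euclidean",
                           clust_method = "complete",
                           scale = "column",
                           col = colorRampPalette(c("lightyellow", "red"))(100)) {
  # heatmap() needs a numeric matrix, not a data.frame
  mat <- as.matrix(x)
  stopifnot(is.numeric(mat))

  # explicit clustering of rows and columns
  row_clus <- hclust(dist(mat, method = dist_method), method = clust_method)
  col_clus <- hclust(dist(t(mat), method = dist_method), method = clust_method)

  heatmap(mat,
          Rowv = as.dendrogram(row_clus),
          Colv = as.dendrogram(col_clus),
          scale = scale,
          col = col)
}

# usage with the data set from this post
res <- simple_heatmap(mtcars)
```

Keeping the distance measure, linkage method, and scaling direction as named arguments makes every choice visible at the call site, which is the point of the last bullet.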
The full R script can be found on GitHub.
This post is part 2 of a series on heatmaps:
Part 1: What is a heatmap?