What is a heatmap?
A heatmap is a visual representation of a multidimensional data set using a false-colored image. The data itself is often in a matrix-like format such as an Excel spreadsheet. In genetics, a common application for heatmaps is the visualization of microarray data, with rows containing a list of gene transcripts and columns containing samples. The values themselves are measurements of the abundance of a particular transcript in a particular sample.
Why use a heatmap?
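To make the "false-colored image" idea concrete, here is a minimal sketch in plain Python. The toy matrix, the blue-to-red scale, and the helper name `to_color` are all assumptions made for illustration; real plotting libraries offer many more color scales.

```python
# Hypothetical toy data: rows are transcripts, columns are samples.
expression = [
    [2.0, 4.0, 6.0],
    [10.0, 8.0, 2.0],
]

def to_color(value, lo, hi):
    """Map a value linearly onto a blue (low) to red (high) RGB scale."""
    frac = (value - lo) / (hi - lo) if hi > lo else 0.5
    red = round(255 * frac)
    return (red, 0, 255 - red)

# Scale colors against the smallest and largest value in the whole matrix.
flat = [v for row in expression for v in row]
lo, hi = min(flat), max(flat)

# One RGB triple per cell: this grid of colors is the heatmap.
colors = [[to_color(v, lo, hi) for v in row] for row in expression]
```

Each cell of `colors` would be drawn as one colored tile, so the lowest abundance shows up as pure blue and the highest as pure red.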
There are two key motivations for using a heatmap to represent your data:
1) A heatmap allows the visualization of multidimensional data sets and thus provides an overview of the data set in a single figure. Other visualizations such as scatter plots are limited to two or three dimensions.
2) Heatmaps are almost exclusively used in conjunction with a machine learning technique called clustering. Clustering aims to identify structure within the data based on a measure of distance. Taking up the microarray example from before, we might be interested in which samples are more closely related to each other based on their transcriptional profile. In an ideal world, a given treatment would cause a dramatically different gene expression profile compared to control samples, and clustering would separate control samples from treatment samples completely, based only on the structure within the data (i.e., the gene expression profile) and without knowing what the samples are. In machine learning jargon, such a technique is called unsupervised learning.
What do I need to be careful about with heatmaps?
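The "ideal world" scenario can be sketched with a few lines of plain Python. This is only an illustration under stated assumptions: the sample names and expression values are made up, and single-linkage agglomerative clustering with Euclidean distance is just one of many possible choices.

```python
import math
from itertools import combinations

# Hypothetical expression profiles (one value per transcript) for four samples.
samples = {
    "ctrl_1": [1.0, 1.2, 0.9],
    "ctrl_2": [1.1, 1.0, 1.0],
    "treat_1": [5.0, 4.8, 5.2],
    "treat_2": [5.1, 5.0, 4.9],
}

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(profiles, n_clusters):
    """Single-linkage agglomerative clustering down to n_clusters groups."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose closest members are nearest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(
                euclidean(profiles[a], profiles[b])
                for a in clusters[p[0]]
                for b in clusters[p[1]]
            ),
        )
        clusters[i] += clusters[j]  # merge cluster j into cluster i (i < j)
        del clusters[j]
    return [sorted(c) for c in clusters]

# The sample labels are never used for grouping, only the numbers.
groups = cluster(samples, 2)
```

With these toy values the two control samples end up in one group and the two treated samples in the other, even though the algorithm only ever sees the expression profiles.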
The two theoretical advantages just mentioned come with their two corresponding practical shortcomings:
1) You might be tempted to just throw all the data in your Excel spreadsheet into a heatmap and be done with it. In reality, it is an art to choose the right amount of complexity (i.e., the number of rows and columns) for your visualization to convey the message you want to deliver. Too little data might be better represented with other visualizations such as box plots or scatter plots; too much data makes it hard to see the structure in the data and interferes with the labeling of rows and columns. In my experience, the sweet spot for labeled heatmaps is somewhere in the two-digit by two-digit realm, such as 50 genes for 10 samples. If labels can be omitted, 500 genes for 10 samples works just fine.
2) The keen reader might have noticed that I brushed over the details of clustering. There are not only multiple ways to cluster data but also multiple ways to calculate distances within the data. The specific choice of distance metric and clustering algorithm can lead to vastly different results and one should be aware of the idiosyncrasies and assumptions of each technique. Even once you have clustered your data and you are happy with the result, there are questions you should ask yourself: How many clusters do I choose to represent my data? Are the clusters stable if I add more data points or are they just haphazard occurrences? Do the clusters have meaningful interpretations in the system I am working with?
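How much the distance metric matters can be seen in a small sketch. The three gene profiles below are invented for illustration, and Euclidean distance and correlation distance (one minus the Pearson correlation) are just two of the available options.

```python
import math

# Hypothetical profiles across three samples.
profiles = {
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [10.0, 20.0, 30.0],  # same trend as gene_a, larger magnitude
    "gene_c": [3.0, 2.0, 1.0],     # similar magnitude, opposite trend
}

def euclidean(a, b):
    """Distance driven by absolute differences in magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: driven by the shape of the profile."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return 1 - cov / norm

def nearest(name, metric):
    """Return the other profile closest to `name` under the given metric."""
    others = (k for k in profiles if k != name)
    return min(others, key=lambda k: metric(profiles[name], profiles[k]))

by_euclidean = nearest("gene_a", euclidean)
by_correlation = nearest("gene_a", correlation_distance)
```

Here Euclidean distance pairs `gene_a` with `gene_c` because their values are close in magnitude, while correlation distance pairs it with `gene_b` because they rise together. Neither answer is wrong; they answer different questions, which is why the choice should be made deliberately.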
This post is part 1 of a series on heatmaps:
Part 1: What is a heatmap?