Why sequencing data is modeled as negative binomial

The goal of most sequencing experiments is to identify differences in gene expression between biological conditions such as the influence of a disease-linked genetic mutation or drug treatment. Fitting the correct statistical model to the data is an essential step before making inferences about differentially expressed genes. The negative binomial (NB) distribution has emerged as the model of choice to fit sequencing data. While the NB distribution is bread-and-butter to a statistician, the average experimental biologist may not be very familiar with it.

A first intuition

In a standard sequencing experiment (RNA-Seq), we map the sequencing reads to the reference genome and count how many reads fall within a given gene (or exon). This means that the input for the statistical analysis are discrete non-negative integers (“counts”) for each gene in each sample. The total number of reads for each sample tends to be in the millions, while the counts per gene vary considerably but tend to be in the tens, hundreds or thousands. Therefore, the chance of a given read to be mapped to any specific gene is rather small. Discrete events that are sampled out of a large pool with low probability sounds very much like a Poisson process. And indeed it is. In fact, earlier iterations of RNA-Seq analysis modeled sequencing data as a Poisson distribution. There is one problem, however. The variability of read counts in sequencing experiments tends to be larger than the Poisson distribution allows.

A fundamental property of the Poisson distribution is that its variance is equal to the mean. Here I plotted the gene-wise means versus their variance of the “bottomly” experiment provided by the ReCount project. The code to produce this plot can be found on Github.


It is obvious that the variance of counts is generally greater than their mean, especially for genes expressed at a higher level. This phenomenon is called “overdispersion“. The NB distribution is similar to a Poisson distribution but has an extra parameter called the “clumping” or “dispersion” parameter. It is like a Poisson distribution with more variance. Note, how the NB estimates of the mean-variance relationship (blue line) fits the observed values quite well. Thus, a reasonable first intuition of why the NB distribution is a proper way of fitting count data is that the dispersion parameter allows the extra wiggle room to model the “extra” variance that we empirically observe in RNA-Seq experiments.

A more rigorous justification

There are two mathematically equivalent formulations of the NB distribution. In its traditional form, which I will mention only for the sake of completion, the NB distribution estimates the probability of having a number of failures until a specified number of successes occur. An example for an application would be the expected number of games a striker goes without a goal (“failure”) before scoring (“success”). Note that, “success” and “failure” are not value judgements but just the two outcomes of a Bernoulli process and therefore interchangeable. Whenever you see the NB distribution used in this form, pay close attention to what is defined as a “success” and a “failure”. In is a common point of notational confusion. This definition is not terribly useful for understanding how the NB distribution relates to RNA-Seq count data.

The second definition sounds more intimidating but is much more useful. The NB distribution can be defined as a Poisson-Gamma mixture distribution. This means that the NB distribution is a weighted mixture of Poisson distributions where the rate parameter \lambda (i.e. the expected counts) is itself associated with uncertainty following a Gamma distribution. This sounds very similar to our earlier definition as a “Poisson distribution with extra variance”.

While it is convenient to have a distribution that fits our empirical observations it is not quite satisfying without a more theoretical justification. When comparing samples of different conditions we usually have multiple replicates of each condition. Those replicates need to be independent for statistical inference to be valid. Such replicates are called “biological” replicates because they come from independent animals, dishes, or cultures. In contrast, splitting a sample in two and running it through the sequencer twice would be a “technical” replicate. In general, there is more variance associated with biological replicates than technical replicates. If we assume that our samples are biological replicates, it is not surprising that the same transcript is present at slightly different levels in each sample, even under the same conditions. In other words, the Poisson process in each sample has a slightly different expected count parameter. This is the source of the “extra” variance (overdispersion) we observe in sequencing data. In the framework of the NB distribution, it is accounted for by allowing Gamma-distributed uncertainty about the expected counts (the Poisson rate) for each gene. Conversely, if we were to deal with technical replicates, there should be no overdispersion and a simple Poisson model would be adequate.

The variance (dispersion) \sigma^2 of a NB distribution can be expressed as function of the mean \mu and the dispersion parameter \alpha.

\sigma^2 = \mu + \alpha \mu^2

From this formula it is evident that the dispersion is always greater than the mean for \alpha > 0. If \alpha \rightarrow 0, the NB distribution is a Poisson distribution.

Dispersion estimates

Finally, a short note on the practical implications of estimating the dispersion of sequencing data. In a standard sequencing experiment, we have to be content with few biological replicates per condition due to the high costs associated with sequencing experiments and the large amount of time that goes into library preparations. This makes the gene-wise estimates of dispersion rather unreliable. Modern RNA-Seq analysis tools such as DESeq2 and edgeR combine the gene-wise dispersion estimate with an estimate of the expected dispersion rate based on all genes. This Bayesian “shrinkage” of the variance has already been applied successfully in microarray analysis. Although the implementation of this method varies between analysis tools, the concept of using information from the whole data set has emerged as a powerful technique to mitigate the shortcomings of having few replicates.

Why sequencing data is modeled as negative binomial

Analyzing quantitative PCR data the tidy way

Previously in this series on tidy data: Taking up the cudgels for tidy data

One of the most challenging aspects of working with data is how easy it is to get lost. Even if the data sets are small. Multiple levels of hierarchy and grouping quickly confuse our human brains (at least mine). Recording such data in two dimensional spreadsheets naturally leads to blurring of the distinction between observation and variable. Such data requires constant reformatting and its structure may not be intuitive to your fellow researcher.

Here are the two main rules about tidy data as defined in Hadley Wickham’s paper:

  1. Each variable forms a column
  2. Each observation forms a row

A variable is an “attribute” of a given data point that describes the conditions when it was taken. Variables often are categorical (but they don’t need to be). For example, the gene tested or the genotype associated with a given measurement would be a categorical variables.

An observation is a measurement associated with an arbitrary number of variables. There are no measurements that are taken under identical conditions. Each observation is uniquely described by the variables and should form its own row.

Let’s look at an example of a typical recording of quantitative PCR data in Excel.


We have measurements for three different genotypes (“control”, “mutant1”, “mutant2”) from three separate experiments (“exp1”, “exp2”, “exp3”) with three replicates each (“rep1”, “rep2”, “rep3”).

Looking at the columns, we see that information on “experiment” and “replicate” are stored in the names of the columns rather than the entries of the columns. This will have to be changed.

There clearly are multiple measurements per row. More precisely, it looks like we have a set of nine measurements for each genotype. But this is not entirely true. Experiments are considered statistically independent as they are typically performed at different times and with different cells. They capture the full biological variability and we call them “biological replicates”. The repeated measurements done in each experiment are not statistically independent because they come from the same sample preparation and thus only capture sources of variance that originate from sample handling or instrumentation. We call them “technical replicates”. Technical replicates cannot be used for statistical inference that requires “statistical independence”, such as a t-test. As you can see, we have an implicit hierarchy in our data that is not expressed in the structure of the data representation shown above.

We will untangle all those complications one by one using R tools developed by Hadley Wickham and others to represent the same data in a tidy format suitable for statistical analysis and visualization. For details about how the code works, please consult the many excellent tutorials on dplyr, tidyr, ggplot2, and broom.

messy <- read.csv("qpcr_messy.csv", row.names = 1)

This is the original data read into R. Let’s get started.


Row names should form their own column

The “genotype” information is recorded as row names. “Genotype” clearly is a variable, so we should make “genotype” a full column.

tidy <- data.frame(messy) %>%
# make row names a column
mutate(genotype = rownames(messy))

qpcr_R_messy2What are our variables?

Next, we need to think about what are our variables. We have already identified “genotype” but what are the other ones? The way we do this is to ask ourselves what kind of information we would need to uniquely describe each observation. The experiment and replicate number are essential to differentiate each quantitative PCR measurement, so we need to create separate columns for “experiment” and “replicate”. We will do this in two steps. First we use “gather” to convert tabular data from wide to long format (we could have also used the more general “melt” function from the “reshape2” package). The former column names (e.g. “exp1_rep1”) are saved into a temporary column called “sample”. As this column contains information about two variables (“experiment” and “replicate”), we need to separate it into two columns to conform with the “each variable forms a column” rule. To do this, we use “separate” to split “sample” into the two columns “experiment” and “replicate”.

tidy <- tidy %>%
    # make each row a single measurement 
    gather(key = sample, value = measurement, -genotype) %>%
    # make each column a single variable
    separate(col = sample, into = c("experiment", "replicate"), sep = "_")

qpcr_R_messy3Here are the first 10 columns of the “tidy” representation of the initial Excel table. Before we can do statistical tests and visualization, we have to take care of one more thing.

Untangling implicit Domain specific hierarchies

Remember what we said before about the two different kind of replicates. Only data from biological replicates (“experiments”) are considered statistically independent samples, while technical replicates (“replicate”) are not. One common approach is to average the technical replicates (“replicate”) before any statistical test is applied. With tidy data, this is simple.

data <- tidy %>%
    # calculate mean of technical replicates by genotype and experiment
    group_by(genotype, experiment) %>%
    summarise(measurement = mean(measurement)) %>%

Having each variable as its own column makes the application of the same operation onto different groups straightforward. In our case, we calculate the mean of technical replicates for each genotype and experiment combination.

Now, the data is ready for analysis.

TIdy Statistical analysis of quantitative pcr data

The scientific rational for a quantitative PCR experiment is to find out whether the number of transcripts for a given gene is different between two or more conditions. We have measurements for one transcript in three distinct genotypes (“control”, “mutant1”, “mutant2”). Biological replicates are considered independent and measurements are assumed to be normally distributed around a “true” mean value. A t-test would be an appropriate choice for the comparison of two genotypes. In this case, we have three genotypes, so we will use one-way anova followed by Tukey’s post-hoc test.

mod <- data %>%
    # set "control" as reference
    mutate(genotype = relevel(factor(genotype), ref = "control")) %>%
    # one-way anova and Tukey's post hoc test
    do(tidy(TukeyHSD(aov(measurement ~ genotype, data = .))))

We generally want to compare the effect of a genetic mutation to a “control” condition. We therefore set the reference of “genotype” to “control”.

Using base R statistics functions like “aov” and “TukeyHSD” in a tidy data analysis workflow can pose problems because they were not created with the idea of “dplyr”-style piping (“%>%”) in mind. Piping requires that the input and output of each function is a data frame and that the input is the first argument of the function. The “aov” function neither takes the input data frame as its first argument, nor does it return a data frame but a specialized “aov” object. To add insult to injury, the “TukeyHSD” function only works with such a specialized “aov” object as input.

In situations like this, the “do” function comes in handy. Within the “do” function, the input of the previous line is accessible through the dot character, so we can use an arbitrary function within “do” and just refer to the input data at the appropriate place with “.”. As a final clean-up, the “tidy” function from the “broom” package makes sure that the output of the line is a data frame.

qpcr_tukeyTukey’s post hoc test thinks “mutant1” is different from “control” but “mutant2” is not. Let’s visualize the results to get a better idea of how the data looks like.

tidy Visualization of quantitative PCR data

We are dealing with few replicates, three in our case, so a bar graph is not the most efficient representation of our data. Plotting the individual data points and the confidence intervals gives us more information using less ink. We will use the “ggplot2” package because it is designed to work with data in the tidy format.

# genotype will be on the x-axis, measurements on the y-axis
ggplot(data, aes(x = genotype, y = measurement, col = experiment)) + 
    # plot the mean of each genotype as a cross
    stat_summary(fun.y = "mean", geom = "point", color = "black", shape = 3, size = 5) +
    # plot the 95% confidence interval for each genotype
    stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", color = "black", width = 0.1) + 
    # we we add the averaged measurements for each experiment
    geom_point(shape = 16, size = 5) +

qpcr_visualizationWe can see why the first mutant is different from the “control” sample and the second is not. More replicates would be needed to test whether the small difference in means between “control” and “mutant2” is a true difference or not.

What I have shown here is just the tip of the iceberg. There are many more tools and functions to discover. The more data analysis you do, the more you will realize how important it is not to waste time formatting and reformatting the the data for each step of the analysis. Learning about how to tidy up your data is an important step towards that goal.

The R code can be found on Github.


Analyzing quantitative PCR data the tidy way

What is a large enough sample?

In my previous entry, I tried to clear up some of my own confusion about the Central Limit Theorem (CLT) and explained why it is such a valuable theoretical concept in statistics. To recap, the CLT describes how the means of a random sample of an unknown sampling distribution approach a normal distribution as the sample size n approaches \infty. The uncertainty about our estimate of the mean of the original sampling distribution is given by \sigma / \sqrt{n}, where \sigma is the standard deviation of the sampling distribution. We can see that the larger the sample size, the more certain we are about our estimate of the true mean.

The obvious practical question is what is a large enough sample size? The short answer is, it depends. A sample size of 30 is a pretty save bet for most real life applications.

To investigate the influence of sample size on the convergence of the distribution of the means, I will use simulated sampling from three different sampling distributions. All simulations were done using R. The code can be found on Github.

CLT in (simulated) action

Let’s consider a normal sampling distribution to start with. This is useful to illustrate the idea of how the uncertainty of our estimate of the true mean depends on the sample size n. Here is our normal sampling distribution with \mu = 4 and \sigma = 2.


Now we generate a large number m of random samples each with sample size n and calculate their means. If this confuses you, you are not alone. For now, understand that the only variable we are changing is the sample size n. m will just be a “large number”, such as 10000 in our case, so that we can draw a histogram of 10000 simulated means. We will do this four time, each time with a different sample size of n being either 2, 5, 15, or 30.

The histogram shows the distribution of simulated means and the blue curve illustrates the normal distribution predicted by the CLT with a mean of \mu and a standard deviation of \sigma / \sqrt{n}. In the lower panel, I show quantile-quantile plots to investigate the how well the distribution of the means fits a theoretical normal distribution.

clt_part2_normal_sampling Unsurprisingly, the means of random samples drawn from a perfect normal distribution are themselves normally distributed. Even with a sample size as small as 2. It is intuitive that small sample sizes have more uncertainty associated with our estimate of the true mean, which is reflected by the relatively broad normal distribution of the means. As we increase the sample size the distribution of the means becomes more pointy and narrow, indicating that our estimate of the true mean \mu becomes more and more accurate. Note also, that the y-axis changes as we increase the sample size. This is a visual confirmation that the standard deviation of the distribution of the means is given by \sigma / \sqrt{n}.

Let’s turn to an exponential sampling distribution with \lambda = 1/4 next. Recall that both the mean and standard deviation of an exponential distribution is 1 / \lambda. This one is clearly not normal.


I simulated random samples for different sample sizes as described above for the normal distribtion and calculated the means.


At smaller sample sizes, the deviation of the actual distribution of the means from the theoretical distribution of the means is obvious. It clearly retains some characteristics of an exponential distribution. As we increase the sample size, the fit becomes better and better, until it eventually morphes into a normal distribution.

Does the CLT hold for an arbitrary distribution? Well, let’s consider this crazy sampling distribution I made up using a combination of normal, exponential and uniform distributions.


Simulation of random samples using different sample sizes as before.


As predicted, the CLT holds even for a non-standard sampling distribution. Granted, I did not challenge the assumptions of the CLT too much using for example an extreme tail (skew). I trust this is good enough to convince you that it would just take a few more samples before convergence.

Why is a sample size of 30 large enough?

Back to our original question: what is a large enough sample? We have seen that the major determinant is the shape of the sampling distribution. The more normal it is to begin with, the fewer samples we will need to reach convergence towards a normal distribution of the means.

In practice we do not generate 10000 random samples (10000 experiments!) to get a distribution of the means. We estimate the mean and standard deviation from a single random sample. The larger the random sample, the better will be our estimate of the true mean \mu and the standard deviation \sigma. This follows directly from the Law of large numbers. It is often recommended in statistics textbooks that as a rule of thumb a sample size of 30 can be considered “large”. But why exactly 30? I think there is a practical and a pragmatic argument to be made.

In the simulations we saw that the distribution of the means of a random sample drawn from a (not too crazy) non-normal sampling distribution will be very close to normal. This means that our estimates of the mean and standard deviation of that distribution will be sufficient to describe the distribution of the means and we can use them in hypothesis testing with some confidence (no pun intended).

A more pragmatic argument would make use of the relationship between the sample size and our uncertainty about the true mean of the sampling distribution. Irrespective of the standard deviation \sigma of the sampling distribution, the standard error \sigma / \sqrt{n} decreases proportional to \sqrt{n}. Common sense dictates that increasing the sample size beyond a certain point will result in ever diminishing gains in precision. Here is a graphical representation of the relationship between the standard error and sample size.


As you can see, a sample size of 30 sits right at the point where the curve stops to have an exponential and starts to have a linear decrease. In other words, a sample size of 30 represents the sweet spot in terms of the most “bang for the buck”, no matter the magnitude of the standard deviation of the original sampling distribution.

You might ask, what if the standard deviation is a large value? Well, then our estimate of the true mean will be pretty bad. We will have to increase the sample size and deal with the fact that gains in precision will be ever smaller as n goes beyond 30.

In biomedical research we often face the situation that even a sample size of 30 is unattainable in terms of time or money. Fortunately, there is a solution for that dilemma: Student’s t-distribution. I will investigate how the CLT relates to the t-distribution and hypothesis testing in the next post.


The full R code is available on Github.

What is a large enough sample?

Unlimited confusion: The central limit theorem

Open any statistics textbook, and it won’t be long until you encounter the Central Limit Theorem (CLT). You will learn that it is the basis of key concepts of inferential statistics such as the well-known t-test. In my experience the CLT is also a source of great confusion as it is surprisingly hard to wrap your head around it.

If statisticians had their way, everything would have a normal (Gaussian) distribution. And in some ways, maybe that would be a fairer world to live in. Take salaries for example. However as it stands, not everything is normally distributed. We have a lot of small earthquakes and few strong ones, which is typical for an exponential distribution (or a Pareto distribution in this particular case).

The reason why the CLT is so important in statistics is that it allows us to describe the means of random samples of (virtually) any distribution with the parameters of a normal distribution (mean \mu and standard deviation \sigma) given that the sample size n of those random samples is large enough.

In more precise language, the CLT states that the mean \bar{x} of a random sample x_1, x_2, ..., x_n taken from the sampling distribution S follows a normal distribution centered at \mu with standard deviation \sigma / \sqrt{n} as n \rightarrow \infty.

There is a lot of information in this sentence. Let’s deconstruct it bit by bit and discuss the implications.

We need (virtually) no knowledge of the sampling distribution

The CLT ensures us that irrespective of the shape of the original distribution we are sampling from, the means of random samples will approach a normal distribution. In practice, this means that we do not need any information on how the random samples are generated. We just need to take a large enough sample and analyze it using well-developed statistical methods. In other words, every statistician’s dream.

Why do I say virtually no knowledge? In fact it is possible to construct sampling distributions that break the CLT. That happens if the sampling distribution has an infinite mean or an infinite standard deviation. You will never encounter such sampling distributions in your everyday experiments, so it is more of a technicality.

What are we sampling here?

For me, the major point of confusion about the CLT is that there are apparently two forms of sampling going on. First, each random sample drawn from the sampling distribution has a certain number of samples n. This random sample X will give us exactly one mean \bar{X}. How do we get to a distribution from exactly one number? The CLT says that if we take m random samples, each with a sampling size of n, the m means of the random samples will approach a normal distribution.

So does the statement “given that the sample size is large” refer to the number of instances per random samples n or the number of random samples m? Common sense says that it has to refer to n to be practical. If we had to conduct m experiments each with sample size n it would be either too time consuming or too expensive, especially if both n and m need to be large.

The key to understanding the CLT is that we can estimate the parameters of the distribution of the means from a single random sample of sample size n because we know that if we took m more such random samples, their means would be distributed normally.

Earlier, we established that the mean of the distribution of the means will approach the mean of the sampling distribution. So, we can estimate this parameter \hat{\mu} from the mean \bar{X} of our random sample. Our estimate will most likely not be completely accurate, but the Law of large numbers tells us that if n is not too small, our estimate will be reasonably good. But how good? What about the uncertainty of our measurement? The CLT tells us that the spread of the distribution of the means will be \sigma / \sqrt{n}.

The variance of the mean of a set of random variables is given by

Var(\bar{X}) = Var(\frac{1}{n} \sum_{i=1}^{n} X_i)

According to the Bienayme formula the variance of the sum of uncorrelated random variables is the sum of their variances.

Var(\frac{1}{n} \sum_{i=1}^{n} X_i) = \frac{1}{n^2} Var(\sum_{i=1}^{n} X_i) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i)

The variances of the samples are identical variances because they samples come from the same distribution.

\frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2} n Var(X_i) = \frac{1}{n}Var(X_i)

Thus, the variance of the mean is \sigma^2 / n and, accordingly, the standard deviation is \sigma / \sqrt{n}.

To add more confusion, \sigma / \sqrt{n} is called the standard error of the mean but it is just the standard deviation of the distribution of the means. It relates the standard deviation of the sampling distribution to the sample size and quantifies our uncertainty about our estimate of the true mean. The larger the sample size n, the closer the estimate \hat{\mu} will be to the true mean \mu. The formula for the standard error of the mean tells us the precision of our estimate increases with the square root of the sample size. In practical terms, if we want to decrease the standard error by a factor of two, we need to increase the sample size by a factor of 4.

We typically do not know the true variance \sigma^2 of the sampling distribution. Again, we estimate it from our random sample using the unbiased estimator s^2

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

The CLT demonstrates that the means of random samples drawn from an unknown sampling distribution will have a normal distribution. That alone would be interesting but not particularly useful. The fact that we can estimate the mean and standard deviation of the distribution of those means from a single sample makes it a cornerstone of many key concepts of inferential statistics such as hypothesis testing and and confidence intervals. The one reservation of the CLT is that the sample size needs to large enough. Fortunately, we can turn to the t-distribution of we don’t quite meet the criteria of a “large enough” sample size.

What is a large enough random sample?

The CLT states that the number of instances n that make up the random sample should approach \infty. That is clearly not practical. For most sampling distributions, especially if they already are close to a Gaussian distribution to start with, convergence will happen much sooner. A rough rule of thumb is that a sample size of n = 30 can be considered “large enough”. But it very much depends on the shape of the original sampling distribution. We will explore that in a follow-up post using simulation.

Unlimited confusion: The central limit theorem

PCA – Part 4: Potential Pitfalls

In the first three parts of this series on principal component analysis (PCA), we have talked about what PCA can do for us, what it is mathematically, and how to apply it in practice. Today, I will briefly discuss some of the potential caveats of PCA.

INformation and Noise

PCA looks for the dimensions with highest variance within the data and assumes that high variance is a proxy for “information”. This assumption is usually warranted otherwise PCA would not be useful.

In cases of unsupervised learning, that is if we have no class labels of the data available, looking for structure within the data based on the data itself is our only choice. In a sense, we cannot tell what parts of the data are information and what parts are noise.

If we have class labels available (supervised learning), we could in principle look for dimensions of variance that optimally separate the classes from each other. PCA does not do that. It is “class agnostic” and thus treats “information”-variance and “noise”-variance the same way.

It is possible that principle components associated with small eigenvalues nevertheless carry the most information. In other words, the size of the eigenvalue and the information content are not necessarily correlated. When choosing the number of components to project our data, we could thus lose important information. Luckily, such situations rarely happen in practice. Or we just never realize …

There are other techniques related to PCA that attempt to find dimensions of the data that optimally separate the data based on class labels. The most famous is Fisher’s “Linear Discriminant Analysis” (LDA) and its non-linear cousins “Quadratic Discriminant Analysis” (QDA).


In Part 3 of this series, we have looked at a data set containing a multitude of motion detection measurements of humans doing various activities. We used PCA to find a lower dimensional representation of those measurements that approximate the data well.

Each of the original measurements were quite tangible (despite their sometimes cryptic names) and therefore interpretable. After PCA, we are left with linear combinations of those original features, which may or may not be interpretable. It is far from guaranteed that the eigenvectors correspond to “real” entities, they may just be convenient summaries of the data.

We will rarely be able to say the first principle component means “X” and the second principle component means “Y”, however tempting it may be based on our preconceived notions of the data. A good example of that is mentioned in Cosma Shalizi’s excellent notes on PCA. Cavalli-Sforza et al. analyzed the distribution of human genes using PCA and interpreted the principal components as patterns of human migration and population expansion. Later, November and Stephens showed that similar patterns could be obtained using simulated data with spatial correlation. As humans are genetically more similar to humans they close to (at least historically), genetic data is necessarily spatially correlated and thus PCA will uncover such structures, even if they do not represent “real” events or are liable to misinterpretation.


Linear algebra tells us that eigenvectors are orthogonal to each other. A set of n orthogonal vectors form a basis of an n-dimensional subspace. The principle components are eigenvectors of the covariance matrix and the set of principle components form a basis for our data. We also say that the principle components are “uncorrelated”. This becomes obvious when we remember that matrix decomposition is sometimes called “diagonalization”. In the eigendecomposition, the matrix containing the eigenvalues has zeros everywhere but on its diagonal, which contains the eigenvalues.

Variance and covariance are measures in the L2 norm, which means that they involve the second moment or square. Being uncorrelated in the L2 norm does not mean that there is no “correlation” in higher norms, in other words the absence of correlation does not imply independence. In statistics, higher order norms are skew (“tailedness” or third moment) and kurtosis (“peakedness” or fourth moment). Techniques related to PCA such as Independent Component Analysis (IDA) can be used to extract two separate, but convolved signals (“independent components”) from each other based on higher order norms.

The distinction between correlation and independence is a technical point when it comes to the practical application of PCA but certainly worth being aware of.

Further reading

Cosma Shalizi – Principal Components: Mathematics, Example, Interpretation

Cavalli-Sforza et al. – The History and Geography of Human Genes (1994)

Novembre & Stephens – Interpreting principal component analyses of spatial genetic variation (2008)


Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets

PCA – Part 4: Potential Pitfalls

PCA – Part 3: In the Trenches

Now that we have an intuition of what principal component analysis (PCA) is and understand some of the mathematics behind it, it is time we make PCA work for us.

Practical examples of PCA typically use Ronald Fisher’s famous “Iris” data set, which contains four measurements of leaf lenghts and widths of three subspecies of Iris flowers. To mix things up a little bit, I will use a data set that is closer to what you would encounter in the real world.

The “Human Activity Recognition Using Smartphones Data Set” available from the UCI Machine Learning Repository contains a total of 561 triaxial acceleration and angular velocity measurements of 30 subjects performing different movements such as sitting, standing, and walking. The researchers collected this data set to ask whether those measurements would be sufficient to tell the type of activity of the person. Instead of focusing on this classification problem, we will look at the structure of the data and investigate using PCA whether we can express the information contained in the 561 different measurements in a more compact form.

I will be using a subset of the data containing the measurements of only three subjects. As always, the code used for the pre-processing steps of the raw data can be found on GitHub.

Step 1: Explore the data

Let’s first load the pre-processed subset of the data into our R session.

# read data from Github
measurements <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_measurements.txt"))
description <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_description.txt"))

It’s always a good idea to check for a couple of basic things first. The big three I usually check are:

  • What are dimensions of the data, i.e. how many rows and columns?
  • What type of features are we dealing with, i.e. categorical, ordinal, continuous?
  • Are there any missing values?

The answer to those three questions will determine the amount of additional data munging we have to do before we can use the data for PCA.

# what are the dimensions of the data?
# what type of data are the features?
table(sapply(measurements, class))
# are there missing values?

The data contains 990 samples (rows) with 561 measurements (columns) each. Clearly too many measurements for visualizing on a scatterplot. The measurements are all of type “numeric”, which means we are dealing with continuous variables. This is great because categorical and ordinal variable are not handled well by PCA. Those need to be “dummy coded“. We also don’t have to worry about missing values. Strategies for handling missing values are a topic on its own.

Before we run PCA on the data, we should look at the correlation structure of the features. If there are features, i.e. measurements in our case, that are highly correlated (or anti-correlated), there is redundancy within the data set and PCA will be able to find a more compact representation of the data.

# feature correlation before PCA
cor_m <- cor(measurements, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_m <- cor_m[upper.tri(cor_m)]
# plot correlations as histogram
hist(upt_m, prob = TRUE)
# plot correlations as image
image.plot(cor_m, axes = FALSE)

The code was simplified for clarity. The full version can be found in the script.

pca_part3_fig1We see in the histogram on the left that there is a considerable number of highly correlated features, most of them positively correlated. Those features show up as yellow in the image representation to the right. PCA will likely be able to provide us with a good lower dimensional approximation of this data set.

Step 2: Run PCA

After all the preparation, running PCA is just one line of code. Remember, that we need to at least center the data before using PCA (Why? see Part 2). Scaling is technically only necessary if the magnitude of the features are vastly different. Note, that the data appears to be already centered and scaled from the get-go.

# run PCA
pc <- prcomp(measurements, center = TRUE, scale. = TRUE)

Depending on your system and the number of features of your data this may take a couple of seconds.

The call to “prcomp” has constructed new features by linear combinations of the old features and sorted them by their and weighted by the amount of variance they explain. Because the new features are the eigenvectors of the feature covariance matrix, they should be orthogonal, and hence uncorrelated, by definition. Let’s visualize this directly.

The new representation of the data is stored as a matrix named “x” in the list object we get back from “prcomp”. In our case, the matrix would be stored as “pc$x”.

# feature correlation before PCA
cor_r <- cor(pc$x, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_r <- cor_r[upper.tri(cor_r)]
# plot correlations as histogram
hist(upt_r, prob = TRUE)
# plot correlations as image
image.plot(cor_r, axes = FALSE)


The new features are clearly no longer correlated to each other. As everything seems to be in order, we can now focus on the interpretation of the results.

Step 3: Interpret the results

The first thing you will want to check is how much variance is explained by each component. In PCA speak, this can be visualized with a “scree plot”. R conveniently has a built-in function to draw such a plot.

# draw a scree plot
screeplot(pc, npc = 10, type = "line")


This is about as good as it gets. A large amount of the variance is captured by the first principal component followed by a sharp decline as the remaining components gradually explain less and less variance approaching zero.

The decision of how many components we should use to get a good approximation of the data has to be made on a case-by-case basis. The cut-offs for the percent explained variance depends on the kind of data you are working with and its inherent covariance structure. The majority of the data sets you will encounter are not nearly as well behaved as this one, meaning that the decline in explained variance is much more shallow. Common cut-offs range from 80% to 95% of explained variance.

Let’s look at how many components we would need to explain a given amount of variance. In the R implementation of PCA, the variances explained by each principle component are stored in a vector called “sdev”. As the name implies, these are standard deviations or the square roots of the variances, which in turn are scaled versions of the eigenvalues. We will need to take the squares “sdev” to get back the variances.

# calculate explained variance as cumulative sum
# sdev are the square roots of the variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# plot explained variance
plot(c(0, var_expl), type = "l", lwd = 2, ylim = c(0, 1), 
     xlab = "Principal Components", ylab = "Variance explained")
# plot number of components needed to for common cut-offs of variance explained
vars <- c(0.8, 0.9, 0.95, 0.99)
for (v in vars) {
npc <- which(var_expl > v)[1]
    lines(x = c(0, npc, npc), y = c(v, v, 0), lty = 3)
    text(x = npc, y = v - 0.05, labels = npc, pos = 4)
    points(x = npc, y = v)


The first principle component on its own explains more than 50% of the variance and we need only 20 components to get up to 80% of the explained variance. Fewer than 30% of the components (162 out of 561) are needed to capture 99% of the variance in the data set. This is a dramatic reduction of complexity. Being able to approximate the data set with a much smaller number of features can greatly speed up downstream analysis and can help to visualize the data graphically.

Finally, let’s investigate whether “variance” translates to “information”. In other words, do the prinicipal components associated with the largest eigenvalues discriminate between the different human activities?

If the class labels (“activities” in our case) are known, a good way to do look at the “information content” of the principal components is to look at scatter plots of the first couple of components and color-code the samples by class label. This code gives you a bare bones version of the figure shown below. The complete code can be found on Github.

# plot the first 8 principal components against each other
for(p in seq(1, 8, by = 2)) {
  plot(pc$x[, p:(p+1)], pch = 16, 
       col = as.numeric(description$activity_name))


We have seen previously that the first component alone explains about half of the variance and in this figure we see why. It almost perfectly separates non-moving “activities” (“laying”, “sitting”, “standing”) from moving activities (various types of “walking”). The second component does a reasonable job at telling the difference between walking and walking upstairs. As we move down the list, there remains visible structure but distinctions become somewhat less clear. One conclusion we can draw from this visualization is that it will most likely be most difficult to tell “sitting” apart from “standing” as none of the dimensions seems to be able to distinguish red and green samples. Oddly enough, the fifth component does a pretty good job of separating “laying” from “sitting” and “standing”.


PCA can be a powerful technique to obtain low dimensional approximations of data with lots of redundant features. The “Human Activity Recognition Using Smartphones Data Set” used in this tutorial is a particularly good example of that. Most real data sets will not be reduced to a few components so easily while retaining most of the information. But even cutting the number of features in half can lead to considerable time savings when using machine learning algorithms.

Here are a couple of useful questions when approaching a new data set to apply PCA to:

  1. Are the features numerical or do I have to convert categorial features?
  2. Are there missing values and if yes, which strategy do I apply to deal with them?
  3. What is the correlation structure of the data? Will PCA be effective in this case?
  4. What is the distribution of variances after PCA? Do I see a steep or shallow decline in explained variance?
  5. How much “explained variance” is a good enough approximation of the data? This is usually a compromise between how much potential information I am willing to sacrifice for cutting down computation time of follow-up analyses.

In the final part of this series, we will discuss some of the limitations of PCA.

Addendum: Understanding “prcomp”

The “prcomp” function is very convenient because it caclulates all the numbers we could possible want from our PCA analysis in one line. However, it is useful to know how those number were generated.

The three most frequently used objects returned by “prcomp” are

  • “rotation”: right eigenvectors (“feature eigenvectors”)
  • “sdev”: square roots of scaled eigenvalues
  • “x”: projection of original data onto the new features


In Part 2, I mentioned that software implementations of PCA usually compute the eigenvectors of the data matrix using singular value decomposition (SVD) rather than eigendecomposition of the covariance matrix. In fact, R’s “prcomp” is no exception.

“Rotation” is a matrix whose columns are the right eigenvalues of the original data. We can reconstruct “rotation” using SVD.

# perform singular value decomposition on centered and scaled data
sv <- svd(scale(measurements))
# "prcomp" stores right eigenvectors in "rotation"
w <- pc$rotation
dimnames(w) = NULL
# "svd" stores right eigenvectors in matrix "v"
v <- sv$v
# check if the two matrices are equal
all.equal(w, v)


Singular values are the square roots of the eigenvalues as we have seen in Part 2. “sdev” stands for standard deviation and thus stores the square roots of the variances. Thus, the squares of “sdev” and the squares of the singular values are directly proportional to each other and the scaling factor is the number of rows of the original data matrix minus 1.

# relationship between singular values and "sdev"
all.equal(sv$d^2/(nrow(sv$u)-1), pc$sdev^2)


The projection of the orignal data (“measurements”) onto its eigenbasis is automatically calculated by “prcomp” through its default argument “retx = TRUE” and stored in “x”. We can manually recreate the projection using matrix-matrix multiplication.

# manual projection of data
all.equal(pc$x, scale(measurements) %*% pc$rotation)

If we wanted to obtain a projection of the data onto a lower dimensional subspace, we just determine the number of components needed and subset the columns of matrix “x”. For example, if we wanted to get an approximation of the original data preserving 90% of the variance, we take the first 52 columns of “x”.

# projection of original data preserving 90% of variance
y90 <- pc$x[, 1:52]
# note that this is equivalent matrix multiplication with 
# the first 52 eigenvectors
all.equal(y90, scale(measurements) %*% pc$rotation[, 1:52])


The full R code is available on Github.

Further reading

IRIS data set

Sebastian Raschka – Principle Component Analysis in 3 Simple Steps


Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets

PCA – Part 3: In the Trenches

A closer look at the fisherman’s dilemma

In my previous post I defined the dangers of candidate selection in high-throughput screens as the “fisherman’s dilemma”. I have argued that our preconceived notions of how a biological system “should” behave, i.e. our inherent scientific bias, and a disproportional focus on p-values contribute to the frequent failure of high-throughput screens to yield tangible or reproducible results.

be conservative about p-values

Today, I would like to take a closer look at the relationship between the p-value and the positive predictive value (PPV), also known as the true positive rate or the posterior probability that a hit is a true hit. Despite its fancy name, the PPV is just the ratio true positives (TP) over the sum of true positives and false positives (FP):

PPV = TP / (TP + FP)

The number of true positives is the determined by the prior probability of there being a hit \pi and the statistical power 1-\beta, which is our ability to detect such a hit. Power is the complement of the type II error rate \beta.

TP = (1 - \beta) \pi

The number of false positives depends on the false positive rate (type I error rate) \alpha and the prior probability of there being no hit 1-\pi.

FP = \alpha (1 - \pi)

Putting these two equations together, we get:

PPV = (1 - \beta) \pi / [ (1 - \beta) \pi + \alpha (1 - \pi) ]

From this equation it is evident that just focusing on the significance level \alpha can lead to vastly different PPVs depending on where on the spectrum of prior probability and power we operate.

For the purpose of illustration, I have plotted the PPV for four commonly used significance levels 0.1, 0.05, 0.01, and 0.001. Green means higher PPV, and red means lower PPV. The black contour line shows where the PPV is 0.5, that is half of our hits are predicted to be false positives. For the optimists among us, half of our hits would likely be true positives.

fishing_part2_fig1From this figure it is clear that a p-value of 0.05 only works in situations of high prior probability and high power. I have marked the domain of high-throughput screens (HTS) rather generously at up to 0.25 prior probability and 0.25 power. Due to small sample sizes (low power) and the fact that any given perturbation is unlikely to have an effect on the majority of cellular components (low prior), most high-throughput screens operate in a space even closer to the origin in the deeply red area.

On the flip-side, this analysis tells us that if we are a little more conservative in what we call a hit, in other words if we lower the p-value cut-off to let’s say 0.001 or lower, we improve our chances of identifying true positives quite dramatically. Unless the high-throughput screen is plagued by terribly low power and prior probability, we actually have a chance that the majority of hits are true positives.

keep your guard up

In genomics, p-values often originate from statistical tests like t-tests comparing a number of control samples to a number of treatment samples. If the values obtained upon treatment have a low probability to have originated from the control (null) distribution, we say the treatment has a “significantly” effect. The t-statistics takes into account the difference of the means between control and treatment distributions and their variances and combines them into a single value, the t-value, from which the p-value is calculated.

In situations when we small sample sizes, such as in high-throughput screens, it can happen that by chance either the control or treatment values cluster together closely. This results in either a very narrow control or treatment distribution. Due to the confounding of effect size and precision in the t-value, even tiny effects can end up called “significant” as long as the variance is small enough. This is a common problem in microarrays with small sample numbers.

Fortunately, there are quite effective solutions for this problem. Genes with low p-values and small effect sizes can be identified using a volcano plot, which displays effect size against the negative logarithm of the p-value.

Increasing the sample size and/or using a Bayesian correction of the standard deviation as it is implemented in Bioconductor’s “limma” package for microarray analysis can help to ameliorate this problem.

possible Ways out of the fisherman’s dilemma

  • High-throughput screens in biomedical research usually operate in a domain of low power and low prior probability. Based on your estimate of power and prior probability, use a more conservative p-value cut-off than 0.05.
  • In addition to choosing the significance level \alpha based on the power and prior probability of your study, be wary of low p-values of hits with small effect sizes or apply corrections if possible.
  • Try to increase the power of your experiment by increasing sample size, or better by decreasing measurement error if possible.
  • The prior probability of having an effect is determined by nature and out of our control. We need to be aware of the possibility, however, that the prior probability is very low or even zero. In that case, it would be very hard or impossible to find a true positive.
  • Ditch the p-value and use a Bayesian approach.


The full R code can be found on Github.

Further reading

A closer look at the fisherman’s dilemma