Why is NGS read alignment so slow (fast)?

Modern sequencing machines produce a wealth of information. Unfortunately, it comes in the form of a jumble of millions of short sequence fragments (“reads”) whose origin on the genome is unknown. The job of read alignment tools like Bowtie2, BWA, or STAR is to bring order to chaos and tell us the likely location of a read on a reference genome.

Even if we have access to a high performance computer cluster, the alignment step is computationally expensive and thus rate-limiting in many next-generation sequencing pipelines. In perfect wet-lab biologist jargon, I have heard it referred to as the “overnight step”. Why does it take so long to determine the location of a short fragment on a reference sequence?

The bioinformatician’s question of how to map a read to a genome translates to the computer scientist’s question of how to match a pattern to a string. Luckily, the latter is a well-studied problem.

A naive approach

The simplest solution would be to take a read (the “pattern”) and slide it along the genomic sequence (the “string”) and compare at each position whether the read matches the genome. This “brute force” approach is perfectly valid and works just fine. Let’s take a look at some numbers to see if it is feasible in our context.

The human genome has approximately 3 billion bases, the typical length of a read is around 100 bases, and a typical experiment has about 10 million reads per sample. This means that a single read has roughly 3 billion possible positions, and for each position we would have to make up to 100 comparisons. Those 300 billion comparisons must be made for each of the 10 million reads, so we would end up with the fantastically large number of 3 quintillion comparisons per sample. Usually we have more than one sample. If we allow for even a single mismatch, things get completely out of hand. The brute force approach is clearly not an option.
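The arithmetic above can be checked in a few lines of R. The genome size, read length, and read count are the round numbers from the text; the throughput of one billion comparisons per second is a purely hypothetical figure, included only to give a sense of scale.

```r
genome_length    <- 3e9  # approximate number of bases in the human genome
read_length      <- 100  # typical read length in bases
reads_per_sample <- 1e7  # typical number of reads per sample

# worst case: compare every base of the read at every genome position
comparisons_per_read   <- genome_length * read_length
comparisons_per_sample <- comparisons_per_read * reads_per_sample

# hypothetical throughput (an assumption, not a benchmark)
comparisons_per_second <- 1e9
years_per_sample <- comparisons_per_sample / comparisons_per_second /
  (60 * 60 * 24 * 365)

comparisons_per_read    # 3e+11, i.e. 300 billion
comparisons_per_sample  # 3e+18, i.e. 3 quintillion
years_per_sample        # roughly 95 years under the assumed throughput
```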

Note that this is the worst case scenario and that there are better ways of searching for a short pattern in a string. Even so, the main reason why the brute force approach is slow remains: it makes many unnecessary comparisons. In other words, it searches for matches in areas where there is no chance of finding anything useful.

The power of indexing

When phones still had cords, people used to either memorize numbers or look them up in a phone book. If your friend’s last name started with an “S”, you wouldn’t look for his or her number in the “T” section. The implicit assumption was that all last names were ordered alphabetically and there was no chance that Mrs. Smith was to be found next to Mr. Taylor. By listing the names of people in alphabetical order, a phone book effectively limits the search space and allows you to find any number relatively quickly. A structure that facilitates lookup of large volumes of data using keys is called an “index”.

Like phone books, suffix arrays are structures that are designed for efficient searching of large bodies of text. To construct a suffix array, you sort all substrings of the original string that contain the last character (“suffixes”) in lexicographical order and record their start positions in the original string. The effect is similar to a phone book. Suffixes starting with the identical character end up next to each other in the array. With a suffix array to guide the search, looking up the location of a pattern in a string is extremely fast because every occurrence of the pattern corresponds to a suffix that begins with the pattern, and all such suffixes sit next to each other in the sorted array. Two binary searches, one for the first and one for the last of these suffixes, yield every location of the pattern within the string. There are two problems, however.
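A minimal sketch of this idea in R, for illustration only: the function names and the toy “genome” are made up, and sorting full suffix strings is far too wasteful for a real genome. The sort uses `method = "radix"` so that the order is plain byte order regardless of locale, and the helper `comes_before` reuses the same ordering for the binary search.

```r
# Suffix array: start positions of all suffixes, in lexicographic order
build_suffix_array <- function(text) {
  n <- nchar(text)
  suffixes <- substring(text, seq_len(n), n)
  order(suffixes, method = "radix")
}

# TRUE if string a sorts strictly before string b (same order as above)
comes_before <- function(a, b) {
  a != b && order(c(a, b), method = "radix")[1] == 1
}

# All start positions of `pattern` in `text`, found by two binary searches
find_pattern <- function(text, sa, pattern) {
  n <- nchar(text)
  m <- nchar(pattern)
  # first m characters of the suffix stored at row i of the suffix array
  prefix_at <- function(i) substr(text, sa[i], sa[i] + m - 1)
  # smallest row in 1..(n + 1) for which `beyond` is TRUE
  first_where <- function(beyond) {
    lo <- 1
    hi <- n + 1
    while (lo < hi) {
      mid <- (lo + hi) %/% 2
      if (beyond(prefix_at(mid))) hi <- mid else lo <- mid + 1
    }
    lo
  }
  start <- first_where(function(p) !comes_before(p, pattern))  # prefix >= pattern
  end   <- first_where(function(p) comes_before(pattern, p))   # prefix >  pattern
  if (start >= end) return(integer(0))
  sort(sa[start:(end - 1)])
}

genome <- "ACGTACGTGACG"           # toy example, not real sequence data
sa <- build_suffix_array(genome)
find_pattern(genome, sa, "ACG")    # positions 1, 5 and 10
find_pattern(genome, sa, "TTT")    # no match: integer(0)
```

Each lookup inspects only about log2(n) rows of the array instead of every position in the text, which is where the speed-up over the brute force approach comes from.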

The first issue is that we need to invest time to construct the suffix array from the original sequence of characters. Usually that’s not a serious problem because there exist algorithms to build suffix arrays that scale well to extremely long strings such as the human genome. On top of that, re-using the index for multiple sequencing experiments amortizes the time that went into building it.

The second problem is more serious. It takes multiple times the space of the raw string to store a suffix array. This remains true even if we only implicitly store the order of the suffixes rather than the suffixes themselves. The memory footprint of the raw sequence of the human genome is on the order of several gigabytes. Working with a suffix array multiple times that size becomes troublesome if we want to keep it in RAM. This is a good example of the fact that, besides accuracy and speed, memory consumption is also an important consideration when determining how useful an algorithm or data structure is.

What we are looking for is a data structure that combines the fast lookup times of a suffix array with the low memory footprint of the brute-force approach.

The Burrows-Wheeler transform

The Burrows-Wheeler transform (BWT) is a reversible permutation of a sequence of characters that is more “compressible” than the original sequence. The BWT involves lexicographical sorting of all rotations (cyclic shifts) of a string, which tends to place identical characters next to each other. This facilitates compact encoding, as a run of ten A’s can be stored as “10A” instead of “AAAAAAAAAA” without loss of information.

There are important similarities between the suffix array and the BWT. Recall that a suffix array involves lexicographical sorting of all suffixes of a string. The BWT is the last column of a matrix whose rows are the lexicographically sorted rotations of the string. As every rotation begins with one of the suffixes and both are lexicographically sorted, the order of the elements in a suffix array and the order of the rows in the Burrows-Wheeler matrix must be identical (given a unique end-of-string marker). In fact, the BWT can be efficiently calculated from a suffix array and the BWT implicitly encodes a suffix array.
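The relationship between the two structures can be made concrete with a small R sketch. It computes the BWT twice, once from the sorted cyclic rotations of the string and once from the suffix array, and both routes give the same result. The function names are made up for this example, and the `$` end-of-string marker is an assumption of the sketch (a common convention, chosen because it sorts before the letters A, C, G and T in byte order).

```r
# BWT as the last column of the sorted rotation matrix
bwt_from_rotations <- function(text) {
  s <- paste0(text, "$")
  n <- nchar(s)
  rotations <- vapply(
    seq_len(n),
    function(i) paste0(substr(s, i, n), substr(s, 1, i - 1)),
    character(1)
  )
  sorted <- rotations[order(rotations, method = "radix")]
  paste(substr(sorted, n, n), collapse = "")
}

# BWT from the suffix array: the character preceding each sorted suffix
bwt_from_suffix_array <- function(text) {
  s <- paste0(text, "$")
  n <- nchar(s)
  sa <- order(substring(s, seq_len(n), n), method = "radix")
  preceding <- ifelse(sa == 1, n, sa - 1)  # wrap around for the full string
  paste(substring(s, preceding, preceding), collapse = "")
}

bwt_from_rotations("ACAACG")     # "GC$AAAC"
bwt_from_suffix_array("ACAACG")  # "GC$AAAC"
```

Note how the three A’s end up in a single run in the transformed string, although they are scattered in the original.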

The BWT allows for a compact representation of the original string but it is by itself not very well suited for fast lookup of the location of a pattern. By augmenting the BWT with a table of ranks of each character in the BWT and a partial suffix array, we obtain a data structure that gets very close to the fast pattern matching found in suffix arrays while maintaining a memory footprint close to the raw string. Such a data structure is called an FM-index and is the basis of alignment tools like Bowtie2 and BWA (Burrows-Wheeler Aligner). The commands “bowtie2-build” and “bwa index” build the respective software-specific FM-indices that are used for running the alignment.
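To give a flavour of how pattern matching on the BWT works, here is a hedged sketch of the “backward search” that an FM-index performs, applied to the BWT of the toy string “ACAACG$”. It is a teaching sketch, not how Bowtie2 or BWA are implemented: the function name is made up, and the rank lookup `occ` simply counts characters with `sum()` each time, whereas a real FM-index reads these counts from a precomputed (sampled) rank table. The sketch only counts matches; reporting their positions is what the partial suffix array is for.

```r
# Count occurrences of `pattern` in the original text, given only its BWT
fm_count <- function(bwt, pattern) {
  last_col  <- strsplit(bwt, "")[[1]]
  first_col <- last_col[order(last_col, method = "radix")]
  n <- length(last_col)
  # number of characters in the text that sort before `ch`
  smaller <- function(ch) match(ch, first_col) - 1
  # rank: occurrences of `ch` among the first i characters of the BWT
  occ <- function(ch, i) sum(last_col[seq_len(i)] == ch)

  lo <- 1  # current range of rows in the sorted rotation matrix
  hi <- n
  # process the pattern from its last character to its first
  for (ch in rev(strsplit(pattern, "")[[1]])) {
    if (!ch %in% last_col) return(0)
    lo <- smaller(ch) + occ(ch, lo - 1) + 1
    hi <- smaller(ch) + occ(ch, hi)
    if (lo > hi) return(0)
  }
  hi - lo + 1
}

bwt <- "GC$AAAC"      # BWT of "ACAACG$"
fm_count(bwt, "AC")   # 2
fm_count(bwt, "CA")   # 1
fm_count(bwt, "GG")   # 0
```

The number of loop iterations depends only on the length of the pattern, not on the length of the genome.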

How to handle mismatches?

The FM-index based mapping of reads to the genome is only fast for exact matches. In practice, read mapping must be tolerant to mismatches. Mismatches can occur due to technical reasons such as PCR artifacts or incorrect base calling during the sequencing process. At the same time, true variations between sequences (SNPs, CNVs, indels) are among the most interesting biological results we have obtained from having sequenced thousands of human genomes. We definitely do not want an alignment tool to discard all reads with mismatches by default just because they aren’t perfect matches to the reference genome.

One strategy that is used in modern sequence alignment tools like Bowtie2 is to split the reads into “seeds”. Exact matches of seeds to the genome are found using the FM-index and then extended using variants of more sensitive sequence alignment algorithms like Needleman-Wunsch or Smith-Waterman. In this way, Bowtie2 balances the speed of finding the exact location of reads on the genome with a certain error tolerance that allows the identification of possibly interesting sequence variants.

What constitutes a valid alignment and how it scores can be tuned by the user with the help of command line arguments such as the number of allowed mismatches and gap penalties. This is where the biggest differences between alignment tools are observed. Which alignment tool to use is ultimately a matter of personal preference. In general, BWA is thought to have higher precision and is thus favored in variant calling, while Bowtie2 appears to be faster and more sensitive but may lack some of BWA’s precision.

The development of fast and accurate read alignment tools was an essential contribution to the current boom in genomics research. Without decades of research and algorithm development in computer science, we would be waiting for days or weeks for our read alignments to finish. So what are a few hours?


Why sequencing data is modeled as negative binomial

The goal of most sequencing experiments is to identify differences in gene expression between biological conditions such as the influence of a disease-linked genetic mutation or drug treatment. Fitting the correct statistical model to the data is an essential step before making inferences about differentially expressed genes. The negative binomial (NB) distribution has emerged as the model of choice to fit sequencing data. While the NB distribution is bread-and-butter to a statistician, the average experimental biologist may not be very familiar with it.

A first intuition

In a standard sequencing experiment (RNA-Seq), we map the sequencing reads to the reference genome and count how many reads fall within a given gene (or exon). This means that the input for the statistical analysis consists of discrete non-negative integers (“counts”) for each gene in each sample. The total number of reads for each sample tends to be in the millions, while the counts per gene vary considerably but tend to be in the tens, hundreds or thousands. Therefore, the chance that a given read maps to any specific gene is rather small. Discrete events that are sampled out of a large pool with low probability sound very much like a Poisson process. And indeed they are. In fact, earlier iterations of RNA-Seq analysis modeled sequencing data with a Poisson distribution. There is one problem, however. The variability of read counts in sequencing experiments tends to be larger than the Poisson distribution allows.

A fundamental property of the Poisson distribution is that its variance is equal to the mean. Here I plotted the gene-wise means versus the variances for the “bottomly” experiment provided by the ReCount project. The code to produce this plot can be found on Github. It is obvious that the variance of counts is generally greater than their mean, especially for genes expressed at a higher level. This phenomenon is called “overdispersion”. The NB distribution is similar to a Poisson distribution but has an extra parameter called the “clumping” or “dispersion” parameter. It is like a Poisson distribution with more variance. Note how the NB estimate of the mean-variance relationship (blue line) fits the observed values quite well. Thus, a reasonable first intuition of why the NB distribution is a proper way of fitting count data is that the dispersion parameter allows the extra wiggle room to model the “extra” variance that we empirically observe in RNA-Seq experiments.

A more rigorous justification

There are two mathematically equivalent formulations of the NB distribution. In its traditional form, which I will mention only for the sake of completeness, the NB distribution describes the probability of observing a given number of failures before a specified number of successes occurs. An example application would be the number of games a striker goes without a goal (“failure”) before scoring (“success”). Note that “success” and “failure” are not value judgements but just the two outcomes of a Bernoulli process and therefore interchangeable. Whenever you see the NB distribution used in this form, pay close attention to what is defined as a “success” and a “failure”. It is a common point of notational confusion. This definition is not terribly useful for understanding how the NB distribution relates to RNA-Seq count data.

The second definition sounds more intimidating but is much more useful. The NB distribution can be defined as a Poisson-Gamma mixture distribution. This means that the NB distribution is a weighted mixture of Poisson distributions where the rate parameter $\lambda$ (i.e. the expected counts) is itself associated with uncertainty following a Gamma distribution. This sounds very similar to our earlier definition as a “Poisson distribution with extra variance”.
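The mixture definition is easy to try out by simulation. The sketch below first draws a Gamma-distributed rate for every observation and then a Poisson count with that rate, and compares the result with counts drawn directly from R’s `rnbinom`. The values of `mu` and `alpha` are arbitrary choices for illustration; the Gamma is parameterized with shape $1/\alpha$ and scale $\alpha \mu$ so that its mean is $\mu$.

```r
set.seed(1)
mu    <- 100    # expected count (arbitrary illustrative value)
alpha <- 0.2    # dispersion parameter (arbitrary illustrative value)
n     <- 1e5    # number of simulated counts

# Poisson-Gamma mixture: the Poisson rate itself is a random variable
lambda  <- rgamma(n, shape = 1 / alpha, scale = alpha * mu)
mixture <- rpois(n, lambda)

# the same distribution drawn directly as a negative binomial
direct <- rnbinom(n, size = 1 / alpha, mu = mu)

# a plain Poisson with the same mean, for comparison
poisson_only <- rpois(n, mu)

c(mean = mean(mixture),      variance = var(mixture))
c(mean = mean(direct),       variance = var(direct))
c(mean = mean(poisson_only), variance = var(poisson_only))
```

Both the mixture and the direct NB draws should show a variance far above the mean (close to $\mu + \alpha \mu^2 = 2100$ for these values), while the plain Poisson counts have a variance close to their mean of 100.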

While it is convenient to have a distribution that fits our empirical observations it is not quite satisfying without a more theoretical justification. When comparing samples of different conditions we usually have multiple replicates of each condition. Those replicates need to be independent for statistical inference to be valid. Such replicates are called “biological” replicates because they come from independent animals, dishes, or cultures. In contrast, splitting a sample in two and running it through the sequencer twice would be a “technical” replicate. In general, there is more variance associated with biological replicates than technical replicates. If we assume that our samples are biological replicates, it is not surprising that the same transcript is present at slightly different levels in each sample, even under the same conditions. In other words, the Poisson process in each sample has a slightly different expected count parameter. This is the source of the “extra” variance (overdispersion) we observe in sequencing data. In the framework of the NB distribution, it is accounted for by allowing Gamma-distributed uncertainty about the expected counts (the Poisson rate) for each gene. Conversely, if we were to deal with technical replicates, there should be no overdispersion and a simple Poisson model would be adequate.

The variance $\sigma^2$ of an NB distribution can be expressed as a function of the mean $\mu$ and the dispersion parameter $\alpha$:

$\sigma^2 = \mu + \alpha \mu^2$

From this formula it is evident that the variance is always greater than the mean for $\alpha > 0$. As $\alpha \rightarrow 0$, the NB distribution converges to a Poisson distribution.
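For readers who want to see where the formula comes from, it follows from the Poisson-Gamma mixture via the law of total variance. The sketch below assumes a Gamma distribution with shape $1/\alpha$ and scale $\alpha \mu$, which is one common way to parameterize the mixture so that the mean rate is $\mu$; $Y$ denotes the count and $\lambda$ the Poisson rate.

```latex
\begin{aligned}
\lambda &\sim \mathrm{Gamma}(\text{shape} = 1/\alpha,\ \text{scale} = \alpha\mu),
  \qquad Y \mid \lambda \sim \mathrm{Poisson}(\lambda) \\
E[\lambda] &= \tfrac{1}{\alpha} \cdot \alpha\mu = \mu,
  \qquad \mathrm{Var}(\lambda) = \tfrac{1}{\alpha} \cdot (\alpha\mu)^2 = \alpha\mu^2 \\
\sigma^2 = \mathrm{Var}(Y)
  &= E[\mathrm{Var}(Y \mid \lambda)] + \mathrm{Var}(E[Y \mid \lambda]) \\
  &= E[\lambda] + \mathrm{Var}(\lambda) \\
  &= \mu + \alpha\mu^2
\end{aligned}
```

The first term is the Poisson variance within a sample, the second the variance of the rate between samples.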

Dispersion estimates

Finally, a short note on the practical implications of estimating the dispersion of sequencing data. In a standard sequencing experiment, we have to be content with few biological replicates per condition due to the high costs associated with sequencing experiments and the large amount of time that goes into library preparations. This makes the gene-wise estimates of dispersion rather unreliable. Modern RNA-Seq analysis tools such as DESeq2 and edgeR combine the gene-wise dispersion estimate with an estimate of the expected dispersion rate based on all genes. This Bayesian “shrinkage” of the variance has already been applied successfully in microarray analysis. Although the implementation of this method varies between analysis tools, the concept of using information from the whole data set has emerged as a powerful technique to mitigate the shortcomings of having few replicates.

Analyzing quantitative PCR data the tidy way

Previously in this series on tidy data: Taking up the cudgels for tidy data

One of the most challenging aspects of working with data is how easy it is to get lost, even if the data sets are small. Multiple levels of hierarchy and grouping quickly confuse our human brains (at least mine). Recording such data in two-dimensional spreadsheets naturally leads to blurring of the distinction between observation and variable. Such data requires constant reformatting and its structure may not be intuitive to your fellow researcher.

Here are the two main rules about tidy data as defined in Hadley Wickham’s paper:

1. Each variable forms a column
2. Each observation forms a row

A variable is an “attribute” of a given data point that describes the conditions when it was taken. Variables often are categorical (but they don’t need to be). For example, the gene tested or the genotype associated with a given measurement would be categorical variables.

An observation is a measurement associated with an arbitrary number of variables. There are no measurements that are taken under identical conditions. Each observation is uniquely described by the variables and should form its own row.

Let’s look at an example of a typical recording of quantitative PCR data in Excel. We have measurements for three different genotypes (“control”, “mutant1”, “mutant2”) from three separate experiments (“exp1”, “exp2”, “exp3”) with three replicates each (“rep1”, “rep2”, “rep3”).
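To experiment with this layout without the original spreadsheet, a data frame of the same shape can be sketched in R. The numbers below are random and entirely made up, and the name `messy_example` is hypothetical; only the layout (genotypes as row names, one column per experiment and replicate, with column names such as “exp1_rep1”) follows the description in this post.

```r
set.seed(42)
genotypes <- c("control", "mutant1", "mutant2")
# column names encode experiment and replicate, e.g. "exp1_rep1"
samples <- paste0("exp", rep(1:3, each = 3), "_rep", rep(1:3, times = 3))

# made-up measurements: 3 genotypes x 9 samples
messy_example <- as.data.frame(
  matrix(
    round(rnorm(27, mean = 1, sd = 0.1), 2),
    nrow = 3,
    dimnames = list(genotypes, samples)
  )
)

dim(messy_example)        # 3 rows, 9 columns
colnames(messy_example)
```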

Looking at the columns, we see that information on “experiment” and “replicate” is stored in the names of the columns rather than the entries of the columns. This will have to be changed.

There clearly are multiple measurements per row. More precisely, it looks like we have a set of nine measurements for each genotype. But this is not entirely true. Experiments are considered statistically independent as they are typically performed at different times and with different cells. They capture the full biological variability and we call them “biological replicates”. The repeated measurements done in each experiment are not statistically independent because they come from the same sample preparation and thus only capture sources of variance that originate from sample handling or instrumentation. We call them “technical replicates”. Technical replicates cannot be used for statistical inference that requires “statistical independence”, such as a t-test. As you can see, we have an implicit hierarchy in our data that is not expressed in the structure of the data representation shown above.

We will untangle all those complications one by one using R tools developed by Hadley Wickham and others to represent the same data in a tidy format suitable for statistical analysis and visualization. For details about how the code works, please consult the many excellent tutorials on dplyr, tidyr, ggplot2, and broom.

messy <- read.csv("qpcr_messy.csv", row.names = 1)


This is the original data read into R. Let’s get started.

Row names should form their own column

The “genotype” information is recorded as row names. “Genotype” clearly is a variable, so we should make “genotype” a full column.

library(dplyr)

tidy <- data.frame(messy) %>%
  # make row names a column
  mutate(genotype = rownames(messy))

What are our variables?

Next, we need to think about what our variables are. We have already identified “genotype” but what are the other ones? The way we do this is to ask ourselves what kind of information we would need to uniquely describe each observation. The experiment and replicate number are essential to differentiate each quantitative PCR measurement, so we need to create separate columns for “experiment” and “replicate”. We will do this in two steps. First we use “gather” to convert tabular data from wide to long format (we could have also used the more general “melt” function from the “reshape2” package). The former column names (e.g. “exp1_rep1”) are saved into a temporary column called “sample”. As this column contains information about two variables (“experiment” and “replicate”), we need to separate it into two columns to conform with the “each variable forms a column” rule. To do this, we use “separate” to split “sample” into the two columns “experiment” and “replicate”.

library(tidyr)

tidy <- tidy %>%
  # make each row a single measurement
  gather(key = sample, value = measurement, -genotype) %>%
  # make each column a single variable
  separate(col = sample, into = c("experiment", "replicate"), sep = "_")

Here are the first 10 rows of the “tidy” representation of the initial Excel table. Before we can do statistical tests and visualization, we have to take care of one more thing.

Untangling implicit domain-specific hierarchies

Remember what we said before about the two different kinds of replicates. Only data from biological replicates (“experiments”) are considered statistically independent samples, while technical replicates (“replicate”) are not. One common approach is to average the technical replicates (“replicate”) before any statistical test is applied. With tidy data, this is simple.

data <- tidy %>%
  # calculate mean of technical replicates by genotype and experiment
  group_by(genotype, experiment) %>%
  summarise(measurement = mean(measurement)) %>%
  ungroup()


Having each variable as its own column makes the application of the same operation onto different groups straightforward. In our case, we calculate the mean of technical replicates for each genotype and experiment combination.

Now, the data is ready for analysis.

Tidy statistical analysis of quantitative PCR data

The scientific rationale for a quantitative PCR experiment is to find out whether the number of transcripts for a given gene is different between two or more conditions. We have measurements for one transcript in three distinct genotypes (“control”, “mutant1”, “mutant2”). Biological replicates are considered independent and measurements are assumed to be normally distributed around a “true” mean value. A t-test would be an appropriate choice for the comparison of two genotypes. In this case, we have three genotypes, so we will use one-way ANOVA followed by Tukey’s post-hoc test.

mod <- data %>%
  # set "control" as reference
  mutate(genotype = relevel(factor(genotype), ref = "control")) %>%
  # one-way anova and Tukey's post hoc test
  # (broom::tidy is spelled out because "tidy" is also the name of our data frame)
  do(broom::tidy(TukeyHSD(aov(measurement ~ genotype, data = .))))


We generally want to compare the effect of a genetic mutation to a “control” condition. We therefore set the reference of “genotype” to “control”.

Using base R statistics functions like “aov” and “TukeyHSD” in a tidy data analysis workflow can pose problems because they were not created with the idea of “dplyr”-style piping (“%>%”) in mind. Piping requires that the input and output of each function is a data frame and that the input is the first argument of the function. The “aov” function neither takes the input data frame as its first argument, nor does it return a data frame but a specialized “aov” object. To add insult to injury, the “TukeyHSD” function only works with such a specialized “aov” object as input.
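The mismatch is easy to see when the two functions are called on their own. The following base R sketch uses a small made-up data set (the name `toy` and all numbers are hypothetical) to show which kind of object each step returns, and how the Tukey results can be turned into a data frame by hand.

```r
set.seed(7)
# made-up data: three genotypes, three independent experiments each
toy <- data.frame(
  genotype = factor(
    rep(c("control", "mutant1", "mutant2"), each = 3),
    levels = c("control", "mutant1", "mutant2")
  ),
  measurement = c(rnorm(3, 1.00, 0.05), rnorm(3, 1.50, 0.05), rnorm(3, 1.05, 0.05))
)

# aov takes the formula first and the data second ...
fit <- aov(measurement ~ genotype, data = toy)
class(fit)               # "aov" "lm", not a data frame

# ... and TukeyHSD expects such a fitted model, not a data frame
posthoc <- TukeyHSD(fit)
is.data.frame(posthoc)   # FALSE

# the pairwise comparisons are stored as a matrix, one per factor
posthoc_df <- as.data.frame(posthoc$genotype)
posthoc_df
```

Each row of `posthoc_df` is one pairwise comparison (for example “mutant1-control”), with the estimated difference, the confidence bounds, and the adjusted p-value as columns.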

In situations like this, the “do” function comes in handy. Within the “do” function, the input of the previous line is accessible through the dot character, so we can use an arbitrary function within “do” and just refer to the input data at the appropriate place with “.”. As a final clean-up, the “tidy” function from the “broom” package makes sure that the output of the line is a data frame. Tukey’s post hoc test thinks “mutant1” is different from “control” but “mutant2” is not. Let’s visualize the results to get a better idea of what the data looks like.

Tidy visualization of quantitative PCR data

We are dealing with few replicates, three in our case, so a bar graph is not the most efficient representation of our data. Plotting the individual data points and the confidence intervals gives us more information using less ink. We will use the “ggplot2” package because it is designed to work with data in the tidy format.

library(ggplot2)

# genotype will be on the x-axis, measurements on the y-axis
ggplot(data, aes(x = genotype, y = measurement, col = experiment)) +
  # plot the mean of each genotype as a cross
  # (`fun` replaces the `fun.y` argument, deprecated since ggplot2 3.3.0)
  stat_summary(fun = "mean", geom = "point", color = "black", shape = 3, size = 5) +
  # plot the 95% confidence interval for each genotype
  # (mean_cl_normal needs the Hmisc package to be installed)
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", color = "black", width = 0.1) +
  # add the averaged measurements for each experiment
  geom_point(shape = 16, size = 5) +
  theme_classic()

We can see why the first mutant is different from the “control” sample and the second is not. More replicates would be needed to test whether the small difference in means between “control” and “mutant2” is a true difference or not.

What I have shown here is just the tip of the iceberg. There are many more tools and functions to discover. The more data analysis you do, the more you will realize how important it is not to waste time formatting and reformatting the data for each step of the analysis. Learning how to tidy up your data is an important step towards that goal.

The R code can be found on Github.

Taking up the cudgels for tidy data

The abundance of data has led to a revolution in marketing and advertisement as well as in biomedical research. A decade ago, the emerging field of “systems biology” (no pun intended) promised to take basic research to the next level through the use of high-throughput screens and “big data”. Institutes were built, huge projects were funded, but surprisingly little of substance has been accomplished since.

There are three main reasons, I think, two of which are under our control and one is not.

During our first forays into understanding biology as a “system” we underestimated the complexity of even an isolated cell, let alone a multi-cellular organism. A somewhat complete description of even the most basic regulatory mechanisms and pathways remains a dream to this day due to myriads of adaptive mechanisms and cross-talks preventing the formulation of a coherent view. Unfortunately, this problem has to be overcome with improved scientific methodology or analysis and will take time.

There are things we can do right now, however.

Overly optimistic or incorrect interpretation of statistical results and “cherry-picking” of hits of high-throughput screens has led to a surprising number of publications that cannot be replicated or even reproduced. I have written more extensively about this particular problem, which I refer to as the “Fisherman’s dilemma”.

The lack of standards for structuring data is another reason that prevents the use of existing data and makes merging data sets from different studies or sources unnecessarily painful and time-consuming. A common saying is that data science is 80% data cleaning, 20% data analysis. The same is true for bioinformatics, where one needs to wrestle with incomplete and messy datasets or, if it’s your lucky day, just different data formats. This problem is especially prevalent in meta-analysis of scientific data. Arguably, the integration of datasets from different sources is where we would expect to find some of the most important and universal results. Why else spend the time to generate expensive datasets if we don’t use them to compare and cross-reference?

If we were to spend less time on the arduous task of “cleaning” data, we could focus our attention on the question itself and the implementation of the analysis. In recent years, Hadley Wickham and others have developed a suite of R tools that help to “tidy up” messy data and establish and enforce a “grammar of data” that allows easy visualization, statistical modeling, and data merging without the need to “translate” the data for each step of the analysis. Hadley deservedly gets a lot of credit for his dplyr, reshape2, tidyr, and ggplot2 packages, but not nearly enough. At this point David Robinson’s excellent broom package for cleaning results from statistical models should also be mentioned.

The idea of tidy data is surprisingly simple. Here are the two most basic rules (see Hadley’s paper for more details).

1. Each variable forms a column.
2. Each observation forms a row.

Here is an example. This is a standard form of recording biological research data such as data from a PCR experiment with three replicates. At first glance, the data looks pretty tidy. Genes in rows, replicates in columns. What’s wrong here?

In fact, this way of recording data violates both basic rules of a tidy dataset. First, the “replicates” are not distinct variables but instances of the same variable, which violates the first rule. Second, each measurement is a distinct observation and should have its own row. Clearly, this is not the case either.

This is what the same data looks like once cleaned up. Each variable is a column, each observation is a row.

Representing data “the tidy way” is not novel. It has been called the “long” format previously, as opposed to the “wide” (“messy”) format. Although “tidy” and “messy” imply a value judgement, it is important to note that while the tidy/long format has distinct advantages for data analysis, the wide format is often seen as the more intuitive and almost always is the more concise.

The most important advantage of tidy data in data analysis is that there is one way of representing the data in a tidy format, while there are many possible ways of having a messy data structure. Take the example of the messy data from above. Storing replicates in rows and genes in columns (the transpose) would have been an equivalent representation to the one shown above. However, cleaning up both representations results in the same tidy data representation shown. This advantage becomes even more important with datasets that contain multiple variables.
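This “many messy layouts, one tidy layout” point can be illustrated with a small base R sketch. The gene names, replicate names and values are made up, and `to_tidy` is a hypothetical helper written for this example; it relies on the rows and columns of the matrix being labelled as “gene” and “replicate”.

```r
# messy layout 1: genes in rows, replicates in columns (made-up values)
wide <- matrix(
  c(1.0, 2.1, 1.1, 1.9, 0.9, 2.0),
  nrow = 2,
  dimnames = list(gene = c("geneA", "geneB"),
                  replicate = c("rep1", "rep2", "rep3"))
)
# messy layout 2: the transpose, replicates in rows and genes in columns
wide_transposed <- t(wide)

# convert either layout to one row per measurement
to_tidy <- function(x) {
  long <- as.data.frame(as.table(x), responseName = "measurement",
                        stringsAsFactors = FALSE)
  long <- long[order(long$gene, long$replicate),
               c("gene", "replicate", "measurement")]
  rownames(long) <- NULL
  long
}

to_tidy(wide)
all.equal(to_tidy(wide), to_tidy(wide_transposed))  # TRUE
```

Whichever way the spreadsheet was oriented, the tidy result has the same three columns and the same six rows.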

A related, but more technical advantage of the tidy format is that it simplifies the use of loops and vectorized programming (implicit loops) because the “one variable, one column – one observation, one row” structure enforces a “linearization” of the data that is more easily dealt with from a programming perspective.

Having data in a consistent format allows feeding data into visualization and modeling tools without spending time on getting the data in the right shape. Similarly, tidy datasets from different sources can be more easily merged and analyzed together.

While data in marketing is sometimes called “cheap”, research data in science is generally very expensive, both in terms of time and money. Taking the extra step of recording and sharing data in a “tidy” format would make data analysis in biomedical research and clinical trials more effective and potentially more productive.

In a follow-up post, I will cover the practical application of some of the R tools developed to work with tidy data using an example most experimental biologists are familiar with: statistical analysis and visualization of quantitative PCR.

What is a large enough sample?

In my previous entry, I tried to clear up some of my own confusion about the Central Limit Theorem (CLT) and explained why it is such a valuable theoretical concept in statistics. To recap, the CLT describes how the means of random samples drawn from an unknown sampling distribution approach a normal distribution as the sample size $n$ approaches $\infty$. The uncertainty about our estimate of the mean of the original sampling distribution is given by $\sigma / \sqrt{n}$, where $\sigma$ is the standard deviation of the sampling distribution. We can see that the larger the sample size, the more certain we are about our estimate of the true mean.

The obvious practical question is what is a large enough sample size? The short answer is, it depends. A sample size of 30 is a pretty safe bet for most real life applications.

To investigate the influence of sample size on the convergence of the distribution of the means, I will use simulated sampling from three different sampling distributions. All simulations were done using R. The code can be found on Github.

CLT in (simulated) action

Let’s consider a normal sampling distribution to start with. This is useful to illustrate the idea of how the uncertainty of our estimate of the true mean depends on the sample size $n$. Here is our normal sampling distribution with $\mu$ = 4 and $\sigma$ = 2. Now we generate a large number $m$ of random samples each with sample size $n$ and calculate their means. If this confuses you, you are not alone. For now, understand that the only variable we are changing is the sample size $n$. $m$ will just be a “large number”, such as 10000 in our case, so that we can draw a histogram of 10000 simulated means. We will do this four times, each time with a different sample size $n$ of either 2, 5, 15, or 30.
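The procedure just described can be sketched in a few lines of R. This is a minimal version written for this paragraph, not necessarily the code used for the figures; the function name `simulate_means` and the seed are arbitrary.

```r
set.seed(123)

# draw m samples of size n and return the m sample means
simulate_means <- function(n, m = 10000,
                           draw = function(k) rnorm(k, mean = 4, sd = 2)) {
  replicate(m, mean(draw(n)))
}

sample_sizes <- c(2, 5, 15, 30)
simulated_sd <- sapply(sample_sizes, function(n) sd(simulate_means(n)))
predicted_sd <- 2 / sqrt(sample_sizes)  # sigma / sqrt(n) as stated by the CLT

round(rbind(n = sample_sizes, simulated_sd, predicted_sd), 3)
```

The simulated standard deviations of the means should agree closely with $\sigma / \sqrt{n}$ for every sample size. A call such as `hist(simulate_means(5))` shows the corresponding histogram.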

The histogram shows the distribution of simulated means and the blue curve illustrates the normal distribution predicted by the CLT with a mean of $\mu$ and a standard deviation of $\sigma / \sqrt{n}$. In the lower panel, I show quantile-quantile plots to investigate how well the distribution of the means fits a theoretical normal distribution. Unsurprisingly, the means of random samples drawn from a perfect normal distribution are themselves normally distributed, even with a sample size as small as 2. It is intuitive that small sample sizes leave more uncertainty in our estimate of the true mean, which is reflected by the relatively broad normal distribution of the means. As we increase the sample size, the distribution of the means becomes narrower and more peaked, indicating that our estimate of the true mean $\mu$ becomes more and more precise. Note also that the scale of the y-axis changes as we increase the sample size. This is a visual confirmation that the standard deviation of the distribution of the means is given by $\sigma / \sqrt{n}$.
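The simulation described here can be sketched in a few lines of base R. This is a minimal reconstruction under my own assumptions (the seed and variable names are arbitrary), not the code from the GitHub repository. It compares the observed spread of the simulated means with the $\sigma / \sqrt{n}$ predicted by the CLT.

```r
set.seed(42)            # arbitrary seed, only for repeatability

mu    <- 4              # mean of the sampling distribution (from the text)
sigma <- 2              # standard deviation of the sampling distribution
m     <- 10000          # number of simulated random samples
sizes <- c(2, 5, 15, 30)

# For each sample size n, draw m random samples and keep their means
sim_means <- lapply(sizes, function(n) {
  replicate(m, mean(rnorm(n, mean = mu, sd = sigma)))
})
names(sim_means) <- sizes

# Compare the observed spread of the means with the CLT prediction
result <- data.frame(
  n            = sizes,
  observed_sd  = sapply(sim_means, sd),
  predicted_sd = sigma / sqrt(sizes)
)
print(result)
```

Calling `hist(sim_means[["30"]])` or `qqnorm(sim_means[["30"]])` on any element gives a rough version of the panels discussed above.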

Let’s turn to an exponential sampling distribution with $\lambda$ = 1/4 next. Recall that both the mean and the standard deviation of an exponential distribution are $1 / \lambda$. This one is clearly not normal. I simulated random samples for different sample sizes as described above for the normal distribution and calculated the means. At smaller sample sizes, the deviation of the actual distribution of the means from the theoretical distribution of the means is obvious. It clearly retains some characteristics of an exponential distribution. As we increase the sample size, the fit becomes better and better, until the distribution of the means eventually morphs into a normal distribution.
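One way to put a number on “retains some characteristics of an exponential distribution” is the skewness of the simulated means, which is 0 for a normal distribution. The following is a rough sketch, not the original analysis; the helper `skewness` is my own definition. For means of exponential samples the theoretical skewness is $2 / \sqrt{n}$, so it should shrink as $n$ grows.

```r
set.seed(1)             # arbitrary seed

lambda <- 1 / 4         # rate of the exponential distribution (from the text)
m      <- 10000         # number of simulated random samples
sizes  <- c(2, 5, 15, 30)

# Sample skewness: close to 0 for a symmetric distribution such as the normal
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Skewness of the distribution of the means for each sample size
skew_of_means <- sapply(sizes, function(n) {
  means <- replicate(m, mean(rexp(n, rate = lambda)))
  skewness(means)
})
names(skew_of_means) <- sizes
print(round(skew_of_means, 2))
```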

Does the CLT hold for an arbitrary distribution? Well, let’s consider this crazy sampling distribution I made up using a combination of normal, exponential and uniform distributions. I simulated random samples using different sample sizes as before. As predicted, the CLT holds even for a non-standard sampling distribution. Granted, I did not challenge the assumptions of the CLT too much, for example with an extreme tail (skew). I trust this is good enough to convince you that it would just take a few more samples before convergence.
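To try this yourself, here is a sketch that samples from a made-up mixture. The components and weights below are purely hypothetical stand-ins and not the distribution used for the figures in this post. Since the mixture has no convenient closed form here, $\mu$ and $\sigma$ are approximated from one very large draw.

```r
set.seed(7)             # arbitrary seed

# Hypothetical mixture (NOT the one from the post):
# 50% normal, 30% exponential, 20% uniform
r_mix <- function(n) {
  component <- sample(1:3, n, replace = TRUE, prob = c(0.5, 0.3, 0.2))
  ifelse(component == 1, rnorm(n, mean = 2, sd = 0.5),
  ifelse(component == 2, rexp(n, rate = 1 / 4),
                         runif(n, min = 8, max = 12)))
}

# Approximate mu and sigma of the mixture from one very large draw
big   <- r_mix(1e6)
mu    <- mean(big)
sigma <- sd(big)

# Distribution of the means for sample size n = 30
m     <- 10000
n     <- 30
means <- replicate(m, mean(r_mix(n)))

print(c(mean_of_means = mean(means), mu = mu))
print(c(sd_of_means = sd(means), predicted = sigma / sqrt(n)))
```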

Why is a sample size of 30 large enough?

Back to our original question: what is a large enough sample? We have seen that the major determinant is the shape of the sampling distribution. The more normal it is to begin with, the fewer samples we will need to reach convergence towards a normal distribution of the means.

In practice we do not generate 10000 random samples (10000 experiments!) to get a distribution of the means. We estimate the mean and standard deviation from a single random sample. The larger the random sample, the better will be our estimate of the true mean $\mu$ and the standard deviation $\sigma$. This follows directly from the Law of large numbers. It is often recommended in statistics textbooks that as a rule of thumb a sample size of 30 can be considered “large”. But why exactly 30? I think there is a practical and a pragmatic argument to be made.

In the simulations we saw that, at a sample size of 30, the distribution of the means of random samples drawn from a (not too crazy) non-normal sampling distribution is very close to normal. This means that our estimates of the mean and standard deviation of that distribution will be sufficient to describe the distribution of the means and we can use them in hypothesis testing with some confidence (no pun intended).

A more pragmatic argument would make use of the relationship between the sample size and our uncertainty about the true mean of the sampling distribution. Irrespective of the standard deviation $\sigma$ of the sampling distribution, the standard error $\sigma / \sqrt{n}$ decreases in proportion to $1 / \sqrt{n}$. Common sense dictates that increasing the sample size beyond a certain point will result in ever diminishing gains in precision. Here is a graphical representation of the relationship between the standard error and sample size. As you can see, a sample size of 30 sits roughly where the curve changes from its steep initial drop to a slow, almost linear decrease. In other words, a sample size of 30 represents the sweet spot in terms of the most “bang for the buck”, no matter the magnitude of the standard deviation of the original sampling distribution.
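The diminishing returns are easy to tabulate. The sketch below uses an arbitrary $\sigma$ of 2 (it only scales the curve) and sample sizes I picked for illustration. It reports the standard error and how much it shrinks with each step up in $n$.

```r
sigma <- 2                          # arbitrary; only scales the curve
n     <- c(5, 10, 30, 60, 120)      # illustrative sample sizes
se    <- sigma / sqrt(n)            # standard error of the mean

# Relative reduction in standard error compared with the previous row
se_table <- data.frame(
  n                     = n,
  standard_error        = round(se, 3),
  reduction_vs_previous = c(NA, round(1 - se[-1] / se[-length(se)], 3))
)
print(se_table)
```

Going from 30 to 120 observations, four times the effort, only halves the standard error.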

You might ask, what if the standard deviation is a large value? Well, then our estimate of the true mean will be pretty bad. We will have to increase the sample size and deal with the fact that gains in precision will be ever smaller as $n$ goes beyond 30.

In biomedical research we often face the situation that even a sample size of 30 is unattainable in terms of time or money. Fortunately, there is a solution for that dilemma: Student’s t-distribution. I will investigate how the CLT relates to the t-distribution and hypothesis testing in the next post.

Reproducibility

The full R code is available on GitHub.

Unlimited confusion: The central limit theorem

Open any statistics textbook, and it won’t be long until you encounter the Central Limit Theorem (CLT). You will learn that it is the basis of key concepts of inferential statistics such as the well-known t-test. In my experience the CLT is also a source of great confusion as it is surprisingly hard to wrap your head around it.

If statisticians had their way, everything would have a normal (Gaussian) distribution. And in some ways, maybe that would be a fairer world to live in. Take salaries for example. However as it stands, not everything is normally distributed. We have a lot of small earthquakes and few strong ones, which is typical for an exponential distribution (or a Pareto distribution in this particular case).

The reason why the CLT is so important in statistics is that it allows us to describe the means of random samples of (virtually) any distribution with the parameters of a normal distribution (mean $\mu$ and standard deviation $\sigma$) given that the sample size $n$ of those random samples is large enough.

In more precise language, the CLT states that the mean $\bar{x}$ of a random sample $x_1, x_2, ..., x_n$ taken from the sampling distribution $S$ follows a normal distribution centered at $\mu$ with standard deviation $\sigma / \sqrt{n}$ as $n \rightarrow \infty$.
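For reference, one common textbook way to write this down (the classical Lindeberg-Lévy form, which assumes that $X_1, X_2, \ldots, X_n$ are independent, identically distributed, and have a finite mean $\mu$ and a finite variance $\sigma^2$) is the following. The symbol $\bar{X}_n$ for the mean of the first $n$ observations is my notation, not the post's.

$$\sqrt{n}\,\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2) \quad \text{as } n \rightarrow \infty$$

Read informally, for large $n$ the sample mean $\bar{X}_n$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$.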

There is a lot of information in this sentence. Let’s deconstruct it bit by bit and discuss the implications.

We need (virtually) no knowledge of the sampling distribution

The CLT assures us that irrespective of the shape of the original distribution we are sampling from, the means of random samples will approach a normal distribution. In practice, this means that we do not need any information on how the random samples are generated. We just need to take a large enough sample and analyze it using well-developed statistical methods. In other words, every statistician’s dream.

Why do I say virtually no knowledge? In fact it is possible to construct sampling distributions that break the CLT. That happens if the sampling distribution has an infinite mean or an infinite standard deviation. You will never encounter such sampling distributions in your everyday experiments, so it is more of a technicality.
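A standard example of a distribution that does not meet the assumptions of the CLT is the Cauchy distribution, whose mean and variance are not defined. The sketch below is my own illustration, with arbitrary seed and sample sizes. It compares how the spread of the simulated means, measured by the interquartile range, behaves for normal and Cauchy samples.

```r
set.seed(3)             # arbitrary seed
m <- 5000               # number of simulated random samples

# Interquartile range of m simulated means for a given sample size n
iqr_of_means <- function(n, rdist) IQR(replicate(m, mean(rdist(n))))

sizes  <- c(2, 30, 500)
spread <- data.frame(
  n      = sizes,
  normal = sapply(sizes, iqr_of_means, rdist = rnorm),
  cauchy = sapply(sizes, iqr_of_means, rdist = rcauchy)
)
print(round(spread, 2))
```

For the normal distribution the spread of the means shrinks with $n$. For the Cauchy distribution it does not, because the mean of $n$ standard Cauchy variables is again standard Cauchy.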

What are we sampling here?

For me, the major point of confusion about the CLT is that there are apparently two forms of sampling going on. First, each random sample drawn from the sampling distribution consists of a certain number of observations $n$. This random sample $X$ will give us exactly one mean $\bar{X}$. How do we get to a distribution from exactly one number? The CLT says that if we take $m$ random samples, each with a sample size of $n$, the $m$ means of the random samples will approach a normal distribution.

So does the statement “given that the sample size is large” refer to the number of instances per random sample $n$ or the number of random samples $m$? Common sense says that it has to refer to $n$ to be practical. If we had to conduct $m$ experiments each with sample size $n$, it would be either too time-consuming or too expensive, especially if both $n$ and $m$ need to be large.

The key to understanding the CLT is that we can estimate the parameters of the distribution of the means from a single random sample of sample size $n$ because we know that if we took $m$ more such random samples, their means would be distributed normally.

Earlier, we established that the mean of the distribution of the means will approach the mean of the sampling distribution. So, we can estimate this parameter $\hat{\mu}$ from the mean $\bar{X}$ of our random sample. Our estimate will most likely not be completely accurate, but the Law of large numbers tells us that if $n$ is not too small, our estimate will be reasonably good. But how good? What about the uncertainty of our measurement? The CLT tells us that the spread of the distribution of the means will be $\sigma / \sqrt{n}$.

The variance of the mean of a set of random variables is given by $Var(\bar{X}) = Var(\frac{1}{n} \sum_{i=1}^{n} X_i)$.

A constant factor comes out of the variance squared, and according to the Bienaymé formula the variance of the sum of uncorrelated random variables is the sum of their variances: $Var(\frac{1}{n} \sum_{i=1}^{n} X_i) = \frac{1}{n^2} Var(\sum_{i=1}^{n} X_i) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i)$.

The variances of the individual observations are identical because they all come from the same distribution: $\frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2} n Var(X_i) = \frac{1}{n}Var(X_i)$.

Thus, the variance of the mean is $\sigma^2 / n$ and, accordingly, the standard deviation is $\sigma / \sqrt{n}$.

To add more confusion, $\sigma / \sqrt{n}$ is called the standard error of the mean but it is just the standard deviation of the distribution of the means. It relates the standard deviation of the sampling distribution to the sample size and quantifies our uncertainty about our estimate of the true mean. The larger the sample size $n$, the closer the estimate $\hat{\mu}$ will be to the true mean $\mu$. The formula for the standard error of the mean tells us the precision of our estimate increases with the square root of the sample size. In practical terms, if we want to decrease the standard error by a factor of two, we need to increase the sample size by a factor of 4.

We typically do not know the true variance $\sigma^2$ of the sampling distribution. Again, we estimate it from our random sample using the unbiased estimator $s^2$: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
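In R, these estimates take only a few lines. The sketch below uses one hypothetical random sample of size 30 from an exponential distribution. The choice of distribution and the seed are arbitrary assumptions for illustration.

```r
set.seed(11)                              # arbitrary seed
x <- rexp(30, rate = 1 / 4)               # one hypothetical random sample, n = 30

n     <- length(x)
x_bar <- mean(x)                          # estimate of mu
s2    <- sum((x - x_bar)^2) / (n - 1)     # unbiased estimate of sigma^2
se    <- sqrt(s2 / n)                     # estimated standard error of the mean

# var() uses the same n - 1 denominator, so the two must agree
stopifnot(isTRUE(all.equal(s2, var(x))))
print(c(mean = x_bar, s2 = s2, se = se))
```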

The CLT demonstrates that the means of random samples drawn from an unknown sampling distribution will have a normal distribution. That alone would be interesting but not particularly useful. The fact that we can estimate the mean and standard deviation of the distribution of those means from a single sample makes it a cornerstone of many key concepts of inferential statistics such as hypothesis testing and confidence intervals. The one reservation of the CLT is that the sample size needs to be large enough. Fortunately, we can turn to the t-distribution if we don’t quite meet the criteria of a “large enough” sample size.

What is a large enough random sample?

The CLT states that the number of instances $n$ that make up the random sample should approach $\infty$. That is clearly not practical. For most sampling distributions, especially if they already are close to a Gaussian distribution to start with, convergence will happen much sooner. A rough rule of thumb is that a sample size of n = 30 can be considered “large enough”. But it very much depends on the shape of the original sampling distribution. We will explore that in a follow-up post using simulation.

Tidy unnesting

At least once a week, I have to work with a data set that has “nested” measurements in one of its columns. Such data violates rule #2 of Hadley Wickham’s definition of tidy data: not all observations are in separate rows. In order to work with this data set, we generally want to “unnest” those measurements to make each a separate row.

The “untidy” way

Ordinarily, I would convert the untidy data to tidy data using a cumbersome sequence of messy commands that involve splitting the entries in “value” into lists, then counting the elements of each list, and replicating each row of the data frame accordingly.

library(stringr)
# split "value" into lists
values <- str_split(data$value, ";")
# count number of elements for each list
n <- sapply(values, length)
# replicate rownames of data based on elements for each list
row_rep <- unlist(mapply(rep, rownames(data), n))
# replicate rows of original data
data_tidy <- data[row_rep, ]
# replace nested measurements with unnested measurements
data_tidy$value <- unlist(values)
# reformat row names
rownames(data_tidy) <- seq(nrow(data_tidy))
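To make the transformation concrete, here is a self-contained sketch on a made-up data frame. The `id` column and the values are hypothetical. It uses base R's `strsplit` as a stand-in for `str_split` so that it runs without extra packages. The logic is the same as above: split, count, replicate rows, overwrite.

```r
# Hypothetical toy data; only the "value" column name follows the code above
data <- data.frame(
  id    = c("a", "b", "c"),
  value = c("1;2;3", "4", "5;6"),
  stringsAsFactors = FALSE
)

values <- strsplit(data$value, ";", fixed = TRUE)  # base R stand-in for str_split
n      <- lengths(values)                          # number of elements per row

# repeat each row index n times, then overwrite the nested column
data_tidy <- data[rep(seq_len(nrow(data)), n), ]
data_tidy$value <- unlist(values)
rownames(data_tidy) <- NULL                        # reset row names to 1..nrow

print(data_tidy)
```

Each of the six measurements ends up in its own row, with the `id` repeated as needed.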


The “tidy” way

Is it really necessary to go through all those (untidy!) steps to tidy up that data set? It turns out, it is not. In comes the “unnest” function in the “tidyr” package.

library(tidyr)
# use dplyr/magrittr style piping
data_tidy2 <- data %>%
# split "value" into lists
transform(value = str_split(value, ";")) %>%
# unnest magic
unnest(value)


Much tidier! Just split and “unnest”. On top of that, “unnest” fits nicely into a “dplyr”-style data processing workflow using “magrittr” piping (“%>%”).

all.equal(data_tidy, data_tidy2)


The results of both methods are equivalent. Your choice, I have made mine!

Synonymous fission yeast gene names

This is what the workflow would look like using an ID mapping file of synonymous gene names of the fission yeast Schizosaccharomyces pombe. The file can be obtained from the “Pombase” website.

# read data