Modern sequencing machines produce a wealth of information. Unfortunately, it comes in the form of a jumble of millions of short sequence fragments (“reads”), for which we don’t know where on the genome they originated from. The job of read alignment tools like Bowtie2, BWA, or STAR is to bring order to chaos and tell us the likely location of a read on a reference genome.
Even if we have access to a high performance computer cluster, the alignment step is computationally expensive and thus rate-limiting in many next-generation sequencing pipelines. In perfect wet-lab biologist jargon, I have heard it referred to as the “overnight step”. Why does it take so long to determine the location of a short fragment to a reference sequence?
The bioinformatician’s question of how to map a read to a genome translates to the computer scientist’s question of how to match a pattern to a string. Luckily, the latter is a well-studied problem.
A naive approach
The simplest solution would be to take a read (the “pattern”) and slide it along the genomic sequence (the “string”) and compare at each position whether the read matches the genome. This “brute force” approach is perfectly valid and works just fine. Let’s take a look at some numbers to see if it is feasible in our context.
The human genome has approximately 3 billion bases, the typical length of a read is around 100 bases, and a typical experiment has about 10 million reads per sample. This means that a single read has roughly 3 billion possible position and for each position we would have to make 100 comparisons. Those 300 billion comparisons must be made for each of the 10 million reads, so we would end up with the fantastically large number of 3 quintillion comparisons per sample. Usually we have more than one sample. If we allow for even a single mismatch, things get completely out of hand. The brute force approach is clearly not an option.
Note that this is the worst case scenario and that there are better ways of searching a short pattern in a string. Even so, the major problem why the the brute force approach is slow remains: It makes many unnecessary comparisons. In other words, it searches for matches in areas where there is no chance of finding anything useful.
The power of indexing
When phones still had cords, people used to either memorize numbers or look them up in a phone book. If your friend’s last name started with an “S”, you wouldn’t look for his or her number in the “T” section. The implicit assumption was that all last names were ordered alphabetically and there was no chance that Mrs. Smith was to be found next to Mr. Taylor. By listing the names of people in alphabetical order, a phone book effectively limits the search space and allows you to find any number relatively quickly. A structure that facilitates lookup of large volumes of data using keys is called an “index“.
Like phone books, suffix arrays are structures that are designed for efficient searching of large bodies of text. To construct a suffix array, you sort all substrings of the original string that contain the last character (“suffixes”) in lexicographical order and record the positions relative to the unsorted suffixes. The effect is similar to a phone book. Suffixes starting with the identical character end up next to each other in the array. With a suffix array to guide the search looking up the location of a pattern in a string is extremely fast because every occurrence of the pattern is equivalent to locating the suffixes that begin with the pattern. Two binary searches for the start and end positions of the pattern within the suffix array result in the location of the pattern within the string. There are two problems, however.
The first issue is that we need to invest time to construct the suffix array from the original sequence of characters. Usually that’s not a serious problem because there exist algorithms to build suffix arrays that scale well to extremely long strings such as the human genome. On top, re-use of the index for multiple sequencing experiments will amortize the cost in time that went into building the index.
The second problem is more serious. It takes multiple times the space of the raw string to store a suffix array. This remains true even if we only implicitly store the order of the suffixes rather than the suffixes themselves. The memory footprint of the raw sequence of the human genome is on the order of several gigabytes. Working with a suffix array multiple times that size becomes troublesome if we want to keep it in RAM. This is a good examples that besides accuracy and speed, memory consumption is also an important consideration when determining how useful an algorithm or data structure is.
What we are looking for is a data structure that combines the fast lookup times of a suffix array with the low memory footprint of the brute-force approach.
The Burrows-Wheeler transform
The Burrows-Wheeler transform (BWT) is a reversible permutation of a sequence of characters that is more “compressible” than the original sequence. The BWT involves lexicographical sorting of all permutations of a string so that identical characters end up next to each other. This facilitates compact encoding as a sequence of ten A’s can be stored as “AAAAAAAAAA” or “10A” without loss of information.
There are important similarities between the suffix array and the BWT. Recall that a suffix array involves lexicographical sorting of all suffixes of a string. The BWT is the last column of a matrix of all lexicographical sorting of all permutations of a string. As the suffixes are necessarily contained in the string permutations and both are lexicographically sorted, the order of the elements in a suffix array and the Burrows-Wheeler permutation matrix must be identical. In fact, the BWT can be efficiently calculated from a suffix array and the BWT implicitly encodes a suffix array.
The BWT allows for a compact representation of the original string but it is by itself not very well suited for fast lookup of the location of a pattern. By augmenting the BWT with a table of ranks of each character in the BWT and a partial suffix arrays, we obtain a data structure that gets very close to the fast pattern matching found in suffix arrays while maintaining a memory footprint close to the raw string. Such a data structure is called an FM-index and is the basis of alignment tools like Bowtie2 and BWA (Burrows-Wheeler Aligner). The commands “bowtie2-build” and “bwa index” build the respective software-specific FM-indices that are used for running the alignment.
How to handle mismatches?
The FM-index based mapping of reads to the genome is only fast for exact matches. In practice, read mapping must be tolerant to mismatches. Mismatches can occur due to technical reasons such as PCR artifacts or incorrect base calling during the sequencing process. Conversely, true variations between sequences (SNPs, CNV, Indels) are among the most interesting biological results we have obtained from having sequenced thousands of human genomes. We definitely do not want an alignment tool to discard all reads with mismatches by default just because they aren’t perfect matches to the reference genome.
One strategy that is used in modern sequence alignment tools like Bowtie2 is to split the reads into “seeds”. Exact matches of seeds to the genome are found using the FM-index and then extended using variants of more sensitive sequence alignment algorithms like Needleman-Wunsch or Smith-Waterman. In this way, Bowtie2 balances the speed of finding the exact location of reads on the genome with a certain error tolerance that allows the identification of possibly interesting sequence variants.
What constitutes a valid alignment and how it scores can be tuned by the user with the help of command line arguments such as the number of allowed mismatches and gap penalties. This is where the biggest differences between alignment tools is observed. Which alignment tool to use is ultimately a matter of personal preference. In general, BWA is thought to have higher precision and is thus favored in variant calling, while Bowtie2 appears to be faster and more sensitive but may lack some of BWA’s precision.
The development of fast and accurate read alignment tools was an essential contribution to the current boom in genomics research. Without decades of research and algorithm development in computer science, we would be waiting for days or weeks for our read alignments to finish. So what are a few hours?
Other posts on next-generation sequence analysis: