The abundance of data has led to a revolution in marketing and advertisement as well as in biomedical research. A decade ago, the emerging field of “systems biology” (no pun intended) promised to take basic research to the next level through the use of high-throughput screens and “big data”. Institutes were built, huge projects were funded, but surprisingly little of substance has been accomplished since.
There are three main reasons, I think, two of which are under our control and one is not.
During our first forays into understanding biology as a “system” we underestimated the complexity of even an isolated cell, let alone a multi-cellular organism. A somewhat complete description of even the most basic regulatory mechanisms and pathways remains a dream to this day due to myriads of adaptive mechanisms and cross-talks preventing the formulation of a coherent view. Unfortunately, this problem has to be overcome with improved scientific methodology or analysis and will take time.
There are things we can do right now, however.
Overly optimistic or incorrect interpretation of statistical results and “cherry-picking” of hits of high-throughput screens has led to a surprising number of publications that cannot be replicated or even be reproduced. I have written more extensively about this particular problem I refer to as the “Fisherman’s dilemma“.
The lack of standards for structuring data is another reason that prevents the use of existing data and makes merging data sets from different studies or sources unnecessary painful and time consuming. A common saying is that data science is 80% data cleaning, 20% data analysis. The same is true for bioinformatics, where one needs to wrestle with incomplete and messy datasets or, if it’s your lucky day, just different data formats. This problem is especially prevalent in meta-analysis of scientific data. Arguably, the integration of datasets from different sources is where we would predict to find some of the most important and universal results. Why else spend the time to generate expensive datasets if we don’t use them to compare and cross-reference?
If we were to spent less time on the arduous task of “cleaning” data, we could focus our attention on the question itself and the implementation of the analysis. In recent years, Hadley Wickham and others have developed a suite of R tools that help to “tidy up” messy data and establish and enforce a “grammar of data” that allows easy visualization, statistical modeling, and data merging without the need to “translate” the data for each step of the analysis. Hadley deservedly gets a lot of credit for his dplyr, reshape2, tidyr, and ggplot2 packages, but not nearly enough. At this point David Robinson’s excellent broom package for cleaning results from statistical models should also be mentioned.
The idea of tidy data is surprisingly simple. Here are the two most basic rules (see Hadley’s paper for more details).
- Each variable forms a column.
- Each observation forms a row.
Here is an example.
This is a standard form of recording biological research data such as data from a PCR experiment with three replicates. At first glance, the data looks pretty tidy. Genes in rows, replicates in columns. What’s wrong here?
In fact, this way of recording data violates both basic rules of a tidy dataset. First, the “replicates” are not distinct variables but instances of the same variable, which violates the first rule. Second, each measurement is a distinct observation and should have its own row. Clearly, this is not the case either.
This is how the same data looks like once cleaned-up.
Each variable is a column, each observation is a row.
Representing data “the tidy way” is not novel. It has been called the “long” format previously, as opposed to the “wide” (“messy”) format. Although “tidy” and “messy” imply a value judgement, it is important to note that while the tidy/long format has distinct advantages for data analysis, the wide format is often seen as the more intuitive and almost always is the more concise.
The most important advantage of tidy data in data analysis is that there is one way of representing the data in a tidy format, while there are many possible ways of having a messy data structure. Take the example of the messy data from above. Storing replicates in rows and genes in columns (the transpose) would have been an equivalent representation to the one shown above. However, cleaning up both representations results in the same tidy data representation shown. This advantage becomes even more important with datasets that contain multiple variables.
A related, but more technical advantage of the tidy format is that it simplifies the use of loops and vectorized programming (implicit loops) because the “one variable, one column – one observation, one row” structure enforces a “linearization” of the data that is more easily dealt with from a programming perspective.
Having data in a consistent format allows feeding data into visualization and modeling tools without spending time on getting the data in the right shape. Similarly, tidy dataset from different sources can be more easily merged and analyzed together.
While data in marketing is sometimes called “cheap”, research data in science often is generally very expensive, both in terms of time and money. Taking the extra step of recording and sharing data in a “tidy” format, would make data analysis in biomedical research and clinical trials more effective and potentially more productive.
In a follow-up post, I will cover the practical application of some of the R tools developed to work with tidy data using an example most experimental biologists are familiar with: statistical analysis and visualization of quantitative PCR.