Taking up the cudgels for tidy data

The abundance of data has led to a revolution in marketing and advertisement as well as in biomedical research. A decade ago, the emerging field of “systems biology” (no pun intended) promised to take basic research to the next level through the use of high-throughput screens and “big data”. Institutes were built, huge projects were funded, but surprisingly little of substance has been accomplished since.

There are three main reasons, I think, two of which are under our control and one of which is not.

During our first forays into understanding biology as a “system”, we underestimated the complexity of even an isolated cell, let alone a multi-cellular organism. A reasonably complete description of even the most basic regulatory mechanisms and pathways remains a dream to this day, because a myriad of adaptive mechanisms and cross-talk prevents the formulation of a coherent view. Unfortunately, this problem can only be overcome with improved scientific methodology and analysis, and that will take time.

There are things we can do right now, however.

Overly optimistic or incorrect interpretation of statistical results and “cherry-picking” of hits from high-throughput screens have led to a surprising number of publications that cannot be replicated or even reproduced. I have written more extensively about this particular problem, which I refer to as the “Fisherman’s dilemma”.

The lack of standards for structuring data is another obstacle: it prevents the reuse of existing data and makes merging datasets from different studies or sources unnecessarily painful and time consuming. A common saying is that data science is 80% data cleaning and 20% data analysis. The same is true for bioinformatics, where one needs to wrestle with incomplete and messy datasets or, on a lucky day, merely different data formats. This problem is especially prevalent in meta-analyses of scientific data. Arguably, the integration of datasets from different sources is where we would expect to find some of the most important and universal results. Why else spend the time to generate expensive datasets if we don’t use them to compare and cross-reference?

If we were to spend less time on the arduous task of “cleaning” data, we could focus our attention on the question itself and on the implementation of the analysis. In recent years, Hadley Wickham and others have developed a suite of R tools that help to “tidy up” messy data and that establish and enforce a “grammar of data”, allowing easy visualization, statistical modeling, and data merging without the need to “translate” the data for each step of the analysis. Hadley deservedly gets a lot of credit for his dplyr, reshape2, tidyr, and ggplot2 packages, but not nearly enough. David Robinson’s excellent broom package for cleaning up the results of statistical models also deserves a mention here.

The idea of tidy data is surprisingly simple. Here are the two most basic rules (see Hadley’s paper for more details).

  1. Each variable forms a column.
  2. Each observation forms a row.

Here is an example.


This is a standard form of recording biological research data such as data from a PCR experiment with three replicates. At first glance, the data looks pretty tidy. Genes in rows, replicates in columns. What’s wrong here?

In fact, this way of recording data violates both basic rules of a tidy dataset. First, the “replicates” are not distinct variables but instances of the same variable, which violates the first rule. Second, each measurement is a distinct observation and should have its own row. Clearly, this is not the case either.
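The conversion from the messy layout to a tidy one can be sketched in a few lines. The post’s tools are R packages, but the idea is language-agnostic, so here is a minimal plain-Python sketch; the gene names and Ct values are hypothetical, made up purely for illustration:

```python
# Hypothetical qPCR data in wide ("messy") format:
# one row per gene, one column per replicate.
wide = {
    "geneA": [21.3, 21.9, 21.5],
    "geneB": [28.1, 27.6, 28.4],
}

# Tidy ("long") format: each measurement becomes its own row,
# with 'gene' and 'replicate' as explicit variables.
tidy = [
    {"gene": gene, "replicate": i + 1, "ct": value}
    for gene, values in wide.items()
    for i, value in enumerate(values)
]

for row in tidy:
    print(row)
```

The six measurements that were packed into two rows now occupy six rows, one per observation, which is exactly what the two rules above demand.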

This is how the same data looks once cleaned up.


Each variable is a column, each observation is a row.

Representing data “the tidy way” is not novel. It has previously been called the “long” format, as opposed to the “wide” (“messy”) format. Although “tidy” and “messy” imply a value judgment, it is important to note that while the tidy/long format has distinct advantages for data analysis, the wide format is often the more intuitive and almost always the more concise of the two.

The most important advantage of tidy data for analysis is that there is only one way of representing the data in a tidy format, while there are many possible ways of structuring messy data. Take the messy example from above: storing replicates in rows and genes in columns (the transpose) would have been an equivalent representation to the one shown. However, cleaning up either representation results in the same tidy structure. This advantage becomes even more important with datasets that contain multiple variables.
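This “many messy forms, one tidy form” point can be demonstrated directly. The following plain-Python sketch (values hypothetical, as before) tidies both the genes-in-rows layout and its transpose and shows that they collapse to the identical set of observations:

```python
# Two equivalent messy layouts of the same hypothetical measurements:
# genes in rows vs. replicates in rows (the transpose).
genes_in_rows = {"geneA": [21.3, 21.9], "geneB": [28.1, 27.6]}
replicates_in_rows = {
    1: {"geneA": 21.3, "geneB": 28.1},
    2: {"geneA": 21.9, "geneB": 27.6},
}

def tidy_from_gene_rows(d):
    # Each (gene, replicate, value) triple is one observation.
    return {(g, i + 1, v) for g, vals in d.items() for i, v in enumerate(vals)}

def tidy_from_replicate_rows(d):
    return {(g, rep, v) for rep, row in d.items() for g, v in row.items()}

# Both messy layouts yield the exact same tidy set of observations.
assert tidy_from_gene_rows(genes_in_rows) == tidy_from_replicate_rows(replicates_in_rows)
```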

A related, but more technical advantage of the tidy format is that it simplifies the use of loops and vectorized programming (implicit loops) because the “one variable, one column – one observation, one row” structure enforces a “linearization” of the data that is more easily dealt with from a programming perspective.
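As a concrete illustration of that “linearization”, a grouped summary over tidy rows is a single generic loop, no matter how many genes or replicates there are. A small plain-Python sketch with hypothetical values:

```python
from collections import defaultdict
from statistics import mean

# Tidy rows: (gene, replicate, ct), one observation per row.
# Values are hypothetical, for illustration only.
tidy = [
    ("geneA", 1, 21.3), ("geneA", 2, 21.9), ("geneA", 3, 21.5),
    ("geneB", 1, 28.1), ("geneB", 2, 27.6), ("geneB", 3, 28.4),
]

# One generic pass groups observations by gene...
groups = defaultdict(list)
for gene, _rep, ct in tidy:
    groups[gene].append(ct)

# ...and a second computes any per-group summary we like.
means = {gene: mean(vals) for gene, vals in groups.items()}
print(means)
```

The same two-step pattern (group, then summarize) works unchanged for any tidy dataset; with a wide layout, the loop would have to know how many replicate columns exist and what they are called.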

Having data in a consistent format allows feeding it into visualization and modeling tools without spending time on getting it into the right shape. Similarly, tidy datasets from different sources can be merged and analyzed together more easily.

While data in marketing is sometimes called “cheap”, research data in science is generally very expensive, in terms of both time and money. Taking the extra step of recording and sharing data in a “tidy” format would make data analysis in biomedical research and clinical trials more effective and potentially more productive.

In a follow-up post, I will cover the practical application of some of the R tools developed to work with tidy data using an example most experimental biologists are familiar with: statistical analysis and visualization of quantitative PCR.


On the dangers of fishing expeditions

Wherever you look, people do high-throughput screens. They are commonly referred to as hypothesis-generating experiments or, somewhat affectionately, “fishing expeditions”. The concept is alluring: Answer a broad scientific question in one fell swoop making use of recent technological advances in mass spectrometry, next-generation sequencing, and automated microscopy.

Months or even years later people present the results of their high-throughput screens and more often than not they struggle to distill concrete findings from their data.

A high-throughput screen may fail to produce interpretable results for a number of reasons.

  • Was the scientific question too broad to be answered? (Did we cast the fishing net too widely?)
  • Was the technology sensitive enough to pick up the expected differences? (Was the mesh of our fishing net too coarse?)
  • Should we have used a counterscreen? (What are all those dolphins doing in our net?)
  • Was the experimental design flawed? (Should we have packed a second net?)
  • Is there something to find out at all? (Are there any fish to catch?)

All of those questions are project-specific and some are beyond our control. Oftentimes, we realize what we should have done post factum.

So what are we supposed to do once the painstakingly acquired data awaits analysis on our hard drives? High-throughput screens typically yield intriguing clues by the bucket but the challenge lies in weeding out the dead ends. Or to stay with our fishing metaphor: What do we call a fish and what an old boot?

The standard approach followed by most people is to rely on a combination of “statistical significance” and “domain knowledge”. This sounds like objective theory married to a healthy dose of good old common sense. What could possibly go wrong?

Despite being appealing in theory, in practice this combination often fails to identify the right candidates for follow-up experiments. Or worse, it prevents you from realizing that the screen itself was a failure and that you should spend your precious time and money on something else.

The reason for this phenomenon lies partly in the misuse of statistical theory and partly in the conscious or unconscious application of scientific bias. On top of that, their combination can lead to a reinforcement of wrong choices that quickly sends you on your way to no-man’s land.

The overvalued p-value

John Ioannidis’ piece on “Why most published research findings are false” caused quite a stir in the scientific community and, maybe even more so, in the pharmaceutical industry. His main point of contention is that the post-study probability that a hit is true depends not only on the type I (false positive) error rate α (the significance level), but also on the type II (false negative) error rate β and on the prevalence R of true hits. In Bayesian statistics, the prevalence would be called the prior probability.

Most statistics software packages are p-value generating machines. You feed them data, and they spit out p-values. If the p-value is below a certain threshold, commonly set at 0.05, we accept it as a hit. We all know that. Simple enough.

The p-value is the probability of obtaining a value as extreme or more extreme under a statistical model of our choice. The argument that a low p-value supports our hypothesis follows the classical straw man fallacy: we construct a straw man called the null hypothesis H0, show that our data is unlikely to have been generated under H0 (the p-value reflects this probability), and conclude that by knocking down our straw man, the alternative hypothesis H1, which happens to be our pet hypothesis, must be true.

The significance level α is the cut-off we somewhat arbitrarily set to 0.05. This means that under the null hypothesis H0 we still obtain a value as extreme or more extreme purely by chance in 5% of cases: even in the best of cases, we would be wrong one out of twenty times. When hundreds or thousands of hypothesis tests are performed, as is ordinarily the case in modern genomics, this problem becomes so severe that it has to be addressed by a multiple testing correction. The fact that the specificity, or true negative rate, of an experiment is 1 – α further hints that the significance level has less to do with true hits than with true “misses”. It is a bit like asking how often we come up with an empty net when there are no fish to be caught. On its own, it is certainly not a good predictor of whether a hit is true or not.

So, what is the function of the other two components that influence the posterior probability that a hit is true?

The complement of the type II error rate β is called the statistical power (1 – β) of a study design. It determines our ability to detect a hit if it is true (a true positive); in other words, the probability of catching the fish that are actually there. We traditionally aim for a power of 0.8, which means that we detect 80% of the true effects and miss the remaining 20% (false negatives). Ideally, we would want the power to be even closer to 1, but because power depends on the sample size, it is often too expensive or too time consuming to achieve arbitrarily high power. Conversely, if an experiment has low power, the majority of what we call hits are likely to be false positives. Statistical power is related to the sensitivity, or true positive rate, of the experiment. In machine learning, it is known as recall.

Prevalence is the probability of there being a true hit before you even start the experiment. It is the probability that there are fish where you choose to cast your net. Intuitively, it makes sense that this number could make the difference between success and failure. In the realm of frequentist statistics, prevalence is doomed to live a life in the nether world. The reason for this is that prevalence is not a quantity that can be estimated from the data at hand but must either be derived from experience or “guessed”. However, its influence on the posterior probability that a hit is true can be huge. Even in a situation of relatively high prevalence, let’s say 50%, a p-value of 0.05 corresponds to a posterior probability of a false positive of about 0.29. This means that in about a third of the cases called significant, we are dealing with false positives.

How does all of this relate to high-throughput screens? By focusing exclusively on p-values, we implicitly assume high power and high prevalence, neither of which is typically true in high-throughput settings in modern biological research. Due to the high costs of such experiments, sample sizes are typically low and the differences we aim to detect are small; both negatively affect statistical power. The prevalence is typically much less than 50%, more likely in the range of 10%. We would not expect that upon some treatment more than 10% of genes are differentially expressed or that more than 10% of the phosphorylation events within a cell change, would we? A prevalence of 10% means that a hit with a p-value of 0.05 has an 89% chance of being a false positive. That is scary!
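The interplay of α, power, and prevalence can be sketched with the simple screening formula PPV = power·prevalence / (power·prevalence + α·(1 − prevalence)). This is a back-of-the-envelope approximation that treats the significance level as the false-positive rate; it does not reproduce the exact figures quoted above, which rely on stricter p-value calibrations, but it shows the direction and size of the effect:

```python
def ppv(alpha, power, prevalence):
    """Posterior probability that a 'significant' hit is a true positive,
    under the simple screening approximation."""
    true_pos = power * prevalence        # expected fraction of true hits found
    false_pos = alpha * (1 - prevalence) # expected fraction of false alarms
    return true_pos / (true_pos + false_pos)

# Well-powered study with 50% prevalence: most hits are real.
print(round(ppv(alpha=0.05, power=0.8, prevalence=0.5), 2))  # 0.94

# Underpowered screen with 10% prevalence: most hits are false.
print(round(ppv(alpha=0.05, power=0.2, prevalence=0.1), 2))  # 0.31
```

Dropping power and prevalence together flips the odds: in the second scenario, roughly two out of three “significant” hits are false positives before any p-value calibration is even considered.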

Conscious and unconscious bias

As human beings we all have preformed opinions and preferences that originate from our very own set of experiences. As scientists are humans too, we are no exception. Here is a fun experiment to try out for yourself. Generate a random list of 100 proteins or genes, take it to three principal investigators from different fields of biology, and tell them that they have the list of hits from your latest high-throughput screen, fresh from the printer. It is not unlikely that you will walk out of their offices with three coherent but completely different stories of how the results of your screen could be interpreted in their respective fields of research.

Modern biological research has only recently transitioned from a data-poor to a data-rich field. Most of us are trained to make decisions on limited information, fill in the blank spots creatively, and test the resulting hypothesis experimentally. How we frame our hypothesis depends critically on our own experience as scientists and on the beliefs of the field. If a hypothesis coincides with our own experience and with the current thinking in the field, it is usually considered a “good” hypothesis. A value judgment based on subjectivity is the essence of bias. It happens all the time, consciously and unconsciously, and there is not much we can do about it.

In a high-throughput setting, we are very likely to encounter genes or proteins we have heard of before, worked with before, or simply feel sympathetic towards for whatever reason. I would wager that we are more likely to spot them on a list and select them for follow-up experiments, sometimes even despite contrary statistical evidence. It is called having a hunch.

Reality check

If we think about the combination of the shaky (or should I say nonexistent) foundation of the p-value as a predictor of finding a true hit and our intrinsic scientific biases, we should expect nothing but a lack of tangible results from high-throughput screening. It is like giving a list of quasi-random names to a creative mind and asking for a story. You will get one, but whether it has anything to do with reality is a different question entirely.

If you look at published papers that include some form of high-throughput screen, you typically observe that one or two instances were “cherry-picked” from a list and followed up the old-fashioned way. What happened to the great promises of systems biology, the understanding of complex regulatory patterns and emerging properties of biological networks?

It seems to me that this is another instance of “no free lunch”. You can’t have coverage and confidence at the same time. At least not at the moment.

In the meantime, have a close look at what dangles from your fishing rod. It might be an old boot masquerading as a fish. Don’t be fooled!

How to fish safely?

There are ways out of the fisherman’s / fisherwoman’s dilemma and I have listed some of them in a follow-up post. More information can be found in the articles listed below and the references therein.

Further reading

Three links to very accessible articles on the subject of p-values, statistical power, and prevalence:
