Wherever you look, people do high-throughput screens. They are commonly referred to as hypothesis-generating experiments or, somewhat affectionately, “fishing expeditions”. The concept is alluring: Answer a broad scientific question in one fell swoop, making use of recent technological advances in mass spectrometry, next-generation sequencing, and automated microscopy.
Months or even years later people present the results of their high-throughput screens and more often than not they struggle to distill concrete findings from their data.
A high-throughput screen may fail to produce interpretable results for a number of reasons.
- Was the scientific question too broad to be answered? (Did we cast the fishing net too widely?)
- Was the technology sensitive enough to pick up the expected differences? (Was the mesh of our fishing net too coarse?)
- Should we have used a counterscreen? (What are all those dolphins doing in our net?)
- Was the experimental design flawed? (Should we have packed a second net?)
- Is there something to find out at all? (Are there any fish to catch?)
All of those questions are project-specific and some are beyond our control. Oftentimes, we realize what we should have done post factum.
So what are we supposed to do once the painstakingly acquired data awaits analysis on our hard drives? High-throughput screens typically yield intriguing clues by the bucket, but the challenge lies in weeding out the dead ends. Or, to stay with our fishing metaphor: What do we call a fish and what an old boot?
The standard approach followed by most people is to rely on a combination of “statistical significance” and “domain knowledge”. This sounds like objective theory married to a healthy dose of good old common sense. What could possibly go wrong?
Despite being appealing in theory, in practice this combination often fails to identify the right candidates for follow-up experiments. Or worse, it prevents you from realizing that the screen itself was a failure and that you should spend your precious time and money on something else.
The reason for this phenomenon lies partly in the misuse of statistical theory and partly in the conscious or unconscious application of scientific bias. On top of that, their combination can reinforce wrong choices and quickly send you on your way to no-man’s land.
The overvalued p-value
John Ioannidis’ piece on “Why most published research findings are false” caused quite a stir in the scientific community and, maybe even more so, in the pharmaceutical industry. His main point of contention is that the post-study probability that a hit is true depends not only on the type I (false positive) error rate α (the significance level), but also on the type II (false negative) error rate β and the prevalence R of true hits. In Bayesian statistics, the prevalence would be called the prior probability.
Most statistics software packages are p-value generating machines. You feed them data, and they spit out p-values. If the p-value is below a certain threshold, commonly set at 0.05, we accept it as a hit. We all know that. Simple enough.
The p-value is the probability that a statistical model of our choice generates a value as extreme as, or more extreme than, the one observed. The argumentation that a low p-value supports our hypothesis follows the classical straw man fallacy. We construct a straw man called the null hypothesis H0, show that our data is unlikely to have been generated under H0 (the p-value reflects this probability), and conclude that, having knocked down our straw man, the alternative hypothesis H1, which happens to be our pet hypothesis, must be true.
The significance level α is the cut-off we somewhat arbitrarily set to 0.05. This means that, under the null hypothesis H0, a value as extreme or more extreme still occurs purely by chance in one out of twenty tests; even in the best of cases, you would be wrong that often. When hundreds or thousands of hypothesis tests are performed, which is ordinarily the case in modern genomics, this problem becomes so severe that it has to be addressed by a multiple testing correction. The fact that the specificity or true negative rate of an experiment is 1 – α further hints that the significance level has less to do with true hits and more with true “misses”. It is a little bit like asking for the probability of catching nothing when there are no fish. On its own, it is certainly not a good predictor of whether your hit is true or not.
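The scale of the multiple testing problem is easy to demonstrate. Under a true null hypothesis, p-values are uniformly distributed, so a screen in which there is nothing to find can be sketched by drawing uniform random numbers; the screen size of 1,000 tests below is an arbitrary choice for illustration:

```python
import random

random.seed(0)

# Under a true null hypothesis, p-values are uniform on [0, 1].
# Simulate a screen of 1,000 tests in which there is nothing to find.
p_values = [random.random() for _ in range(1000)]

hits = sum(p < 0.05 for p in p_values)
print(hits)  # on the order of 50 "hits", every one a false positive

# A Bonferroni-style correction divides alpha by the number of tests.
corrected_hits = sum(p < 0.05 / 1000 for p in p_values)
print(corrected_hits)  # almost always 0
```

Roughly 5% of the null tests clear the uncorrected threshold, which is exactly what α promises; the correction buys back specificity at the price of power.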
So, what is the function of the other two components that influence the posterior probability that a hit is true?
The complement of the type II error rate β is called the statistical power (1 – β) of a study design. It determines our ability to detect a hit if it is true (true positive); in other words, the probability of catching the fish if it is there. We traditionally aim for a power of 0.8, which means that we expect to detect 80% of the true effects and miss the remaining 20%. Ideally, we would want the power to be even closer to 1, but as power depends on the sample size, it is often too expensive or too time-consuming to achieve arbitrarily high power. Conversely, if an experiment has low power, the majority of what we call hits are likely to be false positives. Statistical power is related to the sensitivity or true positive rate of the experiment. In machine learning, it is known as recall.
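Power can be estimated by simulation: plant a known effect, run the test many times, and count how often it is detected. The sketch below uses a two-sample z-test with known variance and made-up numbers (an effect of 1 standard deviation, 16 samples per group) purely for illustration:

```python
import math
import random

random.seed(1)

def two_sample_z_pvalue(a, b):
    """Two-sided p-value of a z-test for a difference in means,
    assuming known unit variance in both groups (a simplification)."""
    n = len(a)
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2 / n)
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# Hypothetical screen: a true effect of 1 standard deviation,
# 16 samples per group, tested at alpha = 0.05.
n, effect, sims = 16, 1.0, 2000
detected = sum(
    two_sample_z_pvalue(
        [random.gauss(0, 1) for _ in range(n)],
        [random.gauss(effect, 1) for _ in range(n)],
    ) < 0.05
    for _ in range(sims)
)
print(detected / sims)  # close to the textbook power of ~0.8
```

Shrink the effect or the group size in this sketch and the detected fraction drops quickly, which is exactly the situation of most high-throughput screens.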
Prevalence is the probability of there being a true hit before you even start the experiment. It is the probability that there are fish where you choose to cast your net. Intuitively, it makes sense that this number could make the difference between success and failure. In the realm of frequentist statistics, prevalence is doomed to live a life in the nether world. The reason for this is that prevalence is not a quantity that can be estimated from the data at hand but must either be derived from experience or “guessed”. However, its influence on the posterior probability that a hit is true can be huge. Even in a situation of relatively high prevalence, let’s say 50%, a p-value of 0.05 corresponds to a posterior probability of a false positive of 0.29. This means that in about 1/3 of the cases called significant, we are dealing with false positives.
How does all of this relate to high-throughput screens? By focusing exclusively on p-values, we implicitly assume high power and high prevalence, neither of which is typically true in high-throughput settings in modern biological research. Due to the high costs of such experiments, sample sizes are typically low, and the differences we aim to detect are small. Both negatively affect statistical power. The prevalence is typically much less than 50%, more likely in the range of 10%. We would not necessarily expect that upon some treatment more than 10% of genes are differentially expressed or that more than 10% of the phosphorylation events within a cell change, would we? A prevalence of 10% means that a p-value of 0.05 has an 89% chance of being a false positive. That is scary!
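The post-study probability that a hit is true can be written down in a few lines. The sketch below is the screen-wide version of the calculation with an assumed power of 0.8; it counts everything below α as a hit rather than conditioning on a p-value of exactly 0.05, so its figures come out milder than the 89% quoted above:

```python
def prob_true_hit(alpha, power, prevalence):
    """Post-study probability that a declared hit is a true positive:
    expected true positives divided by all expected positives."""
    true_pos = power * prevalence
    false_pos = alpha * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed settings for illustration: alpha = 0.05, power = 0.8.
for prevalence in (0.5, 0.1, 0.01):
    p_true = prob_true_hit(0.05, 0.8, prevalence)
    print(f"prevalence {prevalence:4.2f} -> P(true hit) = {p_true:.2f}")
    # prevalence 0.10 -> P(true hit) = 0.64
```

Even under these generous assumptions, a prevalence of 10% leaves roughly a third of the declared hits as false positives; push the prevalence towards 1% and the false positives dominate.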
Conscious and unconscious bias
As human beings we all have preformed opinions and preferences that originate from our very own set of experiences. As scientists are humans too, we are no exception. Here is a fun experiment to try for yourself. Generate a random list of 100 proteins or genes, take it to three principal investigators from different fields of biology, and tell them that they have the list of hits from your latest high-throughput screen fresh from the printer. It is not unlikely that you will walk out of their offices with three coherent but completely different stories of how the result of your screen could be interpreted in their respective field of research.
Modern biological research has only recently transitioned from a data-poor to a data-rich field. Most of us are trained to make decisions on limited information, fill in the blank spots creatively, and test the resulting hypothesis experimentally. How we frame our hypotheses critically depends on our own experience as scientists and on the beliefs of the field. If a hypothesis coincides with our own experience and with the current thinking in the field, it is usually considered a “good” hypothesis. A value judgment based on subjectivity is the essence of bias. It happens all the time, consciously and unconsciously, and there is not much we can do about it.
In a high-throughput setting, we are very likely to encounter genes or proteins we have heard of before, worked with before, or simply feel sympathetic towards for whatever reason. I would wager that we are more likely to spot them on a list and select them for follow-up experiments, sometimes even despite contrary statistical evidence. It is called having a hunch.
If we think about the combination of the shaky (or should I say nonexistent) foundation of the p-value as a predictor of finding a true hit and our intrinsic scientific biases, we should expect nothing but a lack of tangible results from high-throughput screening. It is like giving a list of quasi-random names to a creative mind and asking for a story. You will get one, but whether it has anything to do with reality is a different question entirely.
If you look at published papers that include some form of high-throughput screen, you typically observe that one or two instances were “cherry-picked” from a list and followed up the old-fashioned way. What happened to the great promises of systems biology, the understanding of complex regulatory patterns, and the emergent properties of biological networks?
It seems to me that this is another instance of “no free lunch”. You can’t have coverage and confidence at the same time. At least not at the moment.
In the meantime, have a close look at what dangles from your fishing rod. It might be an old boot masquerading as a fish. Don’t be fooled!
How to fish safely?
There are ways out of the fisherman’s / fisherwoman’s dilemma and I have listed some of them in a follow-up post. More information can be found in the articles listed below and the references therein.
Three links to very accessible articles on the subject of p-values, statistical power, and prevalence:
- John Ioannidis – Why most published research findings are false
- Regina Nuzzo – Scientific method: Statistical errors
- Nature Methods – Points of Significance