In my previous post I defined the dangers of candidate selection in high-throughput screens as the “fisherman’s dilemma”. I have argued that our preconceived notions of how a biological system “should” behave, i.e. our inherent scientific bias, and a disproportional focus on p-values contribute to the frequent failure of high-throughput screens to yield tangible or reproducible results.
be conservative about p-values
Today, I would like to take a closer look at the relationship between the p-value and the positive predictive value (PPV), also known as the true positive rate or the posterior probability that a hit is a true hit. Despite its fancy name, the PPV is just the ratio true positives (TP) over the sum of true positives and false positives (FP):
The number of true positives is the determined by the prior probability of there being a hit and the statistical power , which is our ability to detect such a hit. Power is the complement of the type II error rate .
The number of false positives depends on the false positive rate (type I error rate) and the prior probability of there being no hit .
Putting these two equations together, we get:
From this equation it is evident that just focusing on the significance level can lead to vastly different PPVs depending on where on the spectrum of prior probability and power we operate.
For the purpose of illustration, I have plotted the PPV for four commonly used significance levels 0.1, 0.05, 0.01, and 0.001. Green means higher PPV, and red means lower PPV. The black contour line shows where the PPV is 0.5, that is half of our hits are predicted to be false positives. For the optimists among us, half of our hits would likely be true positives.
From this figure it is clear that a p-value of 0.05 only works in situations of high prior probability and high power. I have marked the domain of high-throughput screens (HTS) rather generously at up to 0.25 prior probability and 0.25 power. Due to small sample sizes (low power) and the fact that any given perturbation is unlikely to have an effect on the majority of cellular components (low prior), most high-throughput screens operate in a space even closer to the origin in the deeply red area.
On the flip-side, this analysis tells us that if we are a little more conservative in what we call a hit, in other words if we lower the p-value cut-off to let’s say 0.001 or lower, we improve our chances of identifying true positives quite dramatically. Unless the high-throughput screen is plagued by terribly low power and prior probability, we actually have a chance that the majority of hits are true positives.
keep your guard up
In genomics, p-values often originate from statistical tests like t-tests comparing a number of control samples to a number of treatment samples. If the values obtained upon treatment have a low probability to have originated from the control (null) distribution, we say the treatment has a “significantly” effect. The t-statistics takes into account the difference of the means between control and treatment distributions and their variances and combines them into a single value, the t-value, from which the p-value is calculated.
In situations when we small sample sizes, such as in high-throughput screens, it can happen that by chance either the control or treatment values cluster together closely. This results in either a very narrow control or treatment distribution. Due to the confounding of effect size and precision in the t-value, even tiny effects can end up called “significant” as long as the variance is small enough. This is a common problem in microarrays with small sample numbers.
Fortunately, there are quite effective solutions for this problem. Genes with low p-values and small effect sizes can be identified using a volcano plot, which displays effect size against the negative logarithm of the p-value.
Increasing the sample size and/or using a Bayesian correction of the standard deviation as it is implemented in Bioconductor’s “limma” package for microarray analysis can help to ameliorate this problem.
possible Ways out of the fisherman’s dilemma
- High-throughput screens in biomedical research usually operate in a domain of low power and low prior probability. Based on your estimate of power and prior probability, use a more conservative p-value cut-off than 0.05.
- In addition to choosing the significance level based on the power and prior probability of your study, be wary of low p-values of hits with small effect sizes or apply corrections if possible.
- Try to increase the power of your experiment by increasing sample size, or better by decreasing measurement error if possible.
- The prior probability of having an effect is determined by nature and out of our control. We need to be aware of the possibility, however, that the prior probability is very low or even zero. In that case, it would be very hard or impossible to find a true positive.
- Ditch the p-value and use a Bayesian approach.
The full R code can be found on Github.
- John Ioannidis – Why most published research findings are false
- Wacholder et al. – Assessing the probability that a positive report is false: an approach for molecular epidemiology studies