In my previous post I described the dangers of candidate selection in high-throughput screens as the “fisherman’s dilemma”. I argued that our preconceived notions of how a biological system “should” behave, i.e. our inherent scientific bias, and a disproportionate focus on p-values contribute to the frequent failure of high-throughput screens to yield tangible or reproducible results.

### be conservative about p-values

Today, I would like to take a closer look at the relationship between the p-value and the positive predictive value (PPV), also known as precision or the posterior probability that a hit is a true hit. Despite its fancy name, the PPV is just the ratio of true positives (TP) to the sum of true positives and false positives (FP):

$$\mathrm{PPV} = \frac{TP}{TP + FP}$$

The number of true positives (expressed as a fraction of all tests) is determined by the prior probability of there being a hit, $\pi$, and the statistical power, $1-\beta$, which is our ability to detect such a hit. Power is the complement of the type II error rate $\beta$.

$$TP = \pi\,(1-\beta)$$

The number of false positives depends on the false positive rate (type I error rate) $\alpha$ and the prior probability of there being no hit, $1-\pi$.

$$FP = \alpha\,(1-\pi)$$

Putting these two equations together, we get:

$$\mathrm{PPV} = \frac{\pi\,(1-\beta)}{\pi\,(1-\beta) + \alpha\,(1-\pi)}$$

From this equation it is evident that just focusing on the significance level can lead to vastly different PPVs depending on where on the spectrum of prior probability and power we operate.
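To make this concrete, here is a minimal sketch of the calculation in Python (not the R code used for the post's figures). The function name `ppv` and the three prior/power pairs are illustrative choices of mine, not values from any particular screen.

```python
def ppv(prior: float, power: float, alpha: float) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    true_pos = prior * power            # fraction of tests that are real hits and detected
    false_pos = alpha * (1.0 - prior)   # fraction of tests that are null but called hits
    if true_pos + false_pos == 0.0:
        raise ValueError("no positives expected; PPV is undefined")
    return true_pos / (true_pos + false_pos)


# Same significance level, very different PPVs (illustrative values only).
for prior, power in [(0.5, 0.8), (0.1, 0.25), (0.01, 0.1)]:
    print(f"prior={prior}, power={power}: PPV={ppv(prior, power, alpha=0.05):.3f}")
```

At a fixed significance level of 0.05, these three settings give PPVs of roughly 0.94, 0.36, and 0.02.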

For the purpose of illustration, I have plotted the PPV for four commonly used significance levels: 0.1, 0.05, 0.01, and 0.001. Green means higher PPV, and red means lower PPV. The black contour line shows where the PPV is 0.5, that is, where half of our hits are predicted to be false positives. For the optimists among us, half of our hits would likely be true positives.
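Where the black contour falls can be worked out from the definition of the PPV. Writing $\pi$ for the prior probability, $1-\beta$ for the power, and $\alpha$ for the significance level (notation assumed here for illustration), the PPV equals 0.5 exactly when true and false positives balance:

$$
\pi\,(1-\beta) = \alpha\,(1-\pi)
\quad\Longleftrightarrow\quad
1-\beta = \alpha\,\frac{1-\pi}{\pi}
$$

For example, at $\alpha = 0.05$ and a prior of 0.1, the power must exceed $0.05 \times 0.9 / 0.1 = 0.45$ for more than half of the hits to be true. At $\alpha = 0.001$, the same prior requires a power of only 0.009.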

From this figure it is clear that a p-value cut-off of 0.05 only works in situations of high prior probability and high power. I have marked the domain of high-throughput screens (HTS) rather generously at up to 0.25 prior probability and 0.25 power. Due to small sample sizes (low power) and the fact that any given perturbation is unlikely to have an effect on the majority of cellular components (low prior), most high-throughput screens operate in a space even closer to the origin, in the deep red area.

On the flip side, this analysis tells us that if we are a little more conservative in what we call a hit, in other words if we lower the p-value cut-off to, say, 0.001 or lower, we improve our chances of identifying true positives quite dramatically. Unless the high-throughput screen is plagued by terribly low power and prior probability, we actually have a chance that the majority of hits are true positives.
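One way to turn this into a rule of thumb is to solve the PPV relationship (true positives = prior × power, false positives = significance level × (1 − prior)) for the significance level. The sketch below does that in Python. The function name `alpha_for_target_ppv` and the example prior and power are hypothetical, and in practice both inputs are rough estimates at best.

```python
def alpha_for_target_ppv(prior: float, power: float, target_ppv: float) -> float:
    """Largest significance level at which the expected PPV still reaches target_ppv.

    Solves target_ppv = prior*power / (prior*power + alpha*(1 - prior)) for alpha.
    """
    if not (0.0 < prior < 1.0 and 0.0 < power <= 1.0 and 0.0 < target_ppv < 1.0):
        raise ValueError("prior and target_ppv must lie in (0, 1), power in (0, 1]")
    alpha = prior * power * (1.0 - target_ppv) / (target_ppv * (1.0 - prior))
    return min(1.0, alpha)  # a result of 1.0 means any cut-off reaches the target


# Hypothetical screen: 5% of perturbations have an effect, 25% power.
for target in (0.5, 0.9):
    cutoff = alpha_for_target_ppv(prior=0.05, power=0.25, target_ppv=target)
    print(f"target PPV {target}: alpha <= {cutoff:.4f}")
```

Under these assumed inputs, a cut-off of about 0.013 is needed for half of the hits to be true, and about 0.0015 for nine out of ten.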

### keep your guard up

In genomics, p-values often originate from statistical tests such as t-tests that compare a number of control samples to a number of treatment samples. If the values obtained upon treatment have a low probability of having originated from the control (null) distribution, we say the treatment has a “significant” effect. The t-statistic takes into account the difference between the means of the control and treatment distributions as well as their variances, and combines them into a single value, the t-value, from which the p-value is calculated.

In situations where we have small sample sizes, such as in high-throughput screens, it can happen that by chance either the control or the treatment values cluster together closely. This results in a very narrow control or treatment distribution. Because the t-value confounds effect size and precision, even tiny effects can end up being called “significant” as long as the variance is small enough. This is a common problem in microarray experiments with small sample numbers.
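The effect is easy to reproduce with made-up numbers. The sketch below uses Welch's form of the t-statistic, which is my assumption (the text does not say which t-test variant is meant), and only the Python standard library. Both comparisons have the same mean difference of 0.3, but in the first one the three replicates happen to cluster tightly.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control: list[float], treatment: list[float]) -> float:
    """Welch's t-statistic: difference of means divided by its standard error."""
    standard_error = sqrt(variance(control) / len(control)
                          + variance(treatment) / len(treatment))
    return (mean(treatment) - mean(control)) / standard_error


# Invented values: identical effect size (0.3), different spread.
tight = welch_t([9.9, 10.0, 10.1], [10.2, 10.3, 10.4])
wide = welch_t([9.0, 10.0, 11.0], [9.3, 10.3, 11.3])
print(f"tight replicates: t = {tight:.2f}; wide replicates: t = {wide:.2f}")
```

The tenfold smaller standard deviation inflates the t-value tenfold (about 3.7 versus 0.37) even though the effect itself has not changed.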

Fortunately, there are quite effective solutions for this problem. Genes with low p-values and small effect sizes can be identified using a volcano plot, which displays effect size against the negative logarithm of the p-value.

Increasing the sample size and/or using a Bayesian correction of the standard deviation, as implemented in Bioconductor’s “limma” package for microarray analysis, can help to ameliorate this problem.
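The idea behind this kind of correction is to shrink each gene's sample variance toward a prior variance shared across genes, weighted by degrees of freedom. The following Python sketch shows only that shrinkage step. It is not limma's code: `prior_var` and `prior_df` are passed in as hypothetical fixed numbers, whereas limma estimates them from all genes by empirical Bayes.

```python
def moderated_variance(sample_var: float, df: float,
                       prior_var: float, prior_df: float) -> float:
    """Average of a gene's own variance and a prior variance,
    weighted by their respective degrees of freedom."""
    return (prior_df * prior_var + df * sample_var) / (prior_df + df)


# Invented numbers: a gene whose replicates clustered tightly by chance.
shrunk = moderated_variance(sample_var=0.0001, df=2, prior_var=0.04, prior_df=4)
print(f"raw variance 0.0001 -> moderated variance {shrunk:.4f}")
```

With these inputs the implausibly small variance is pulled up to about 0.027, so a tiny effect no longer produces an extreme t-value.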

### possible ways out of the fisherman’s dilemma

- High-throughput screens in biomedical research usually operate in a domain of low power and low prior probability. Based on your estimate of power and prior probability, use a more conservative p-value cut-off than 0.05.
- In addition to choosing the significance level based on the power and prior probability of your study, be wary of low p-values of hits with small effect sizes or apply corrections if possible.
- Try to increase the power of your experiment by increasing sample size, or better by decreasing measurement error if possible.
- The prior probability of having an effect is determined by nature and out of our control. We need to be aware of the possibility, however, that the prior probability is very low or even zero. In that case, it would be very hard or impossible to find a true positive.
- Ditch the p-value and use a Bayesian approach.

#### Reproducibility

The **full R code** can be found on GitHub.

#### Further reading

- John Ioannidis – Why most published research findings are false
- Wacholder et al. – Assessing the probability that a positive report is false: an approach for molecular epidemiology studies

Dear Dr. Lipp,

I very much enjoyed your discussion about false discovery rates and misinterpretation of p-values. Much of the literature in my area (environmental chemistry) has its share of underpowered studies showing very small effects that might be significant in at least that one study, and that are nonetheless of questionable importance in the scheme of things.

I have been following an effort to conduct high-throughput in vitro screening for chemical toxicity with respect to human and environmental contaminants. Applying the false discovery rate concept to a recent article on endocrine testing using in vitro assays (Rotroff, et al.) yields some interesting results. (This article was chosen only because it appeared just after my reading about false discovery rates, and contained information about the sensitivity and specificity for both a small set of in vitro data and a broader set of in vivo data.) With sensitivity as the probability of detecting true positives and specificity as the probability of detecting true negatives, the only further assumption needed concerned the prevalence of active chemicals in the population of chemicals to be screened, which I estimated as 1%. Even the reference chemical screening (at 100% sensitivity and 87.5% specificity) generated a large number of false positives, and a true positive discovery rate of about 7.5%. The broader screening, at lower sensitivity and specificity (91% and 65%, respectively), had a true positive discovery rate of only 2.6%.

This led me to an interesting, though likely very naïve, question: what would happen if one were to iteratively apply a high-throughput screening assay, using the true and false positives from each previous iteration? Re-screening the true and false positives should [slightly] mimic reproducing studies (in some highly biased manner). Calculations show that for 91% sensitivity and 65% specificity, iterative screening would require only about 58,000 tests (true positives plus false positives recycled) in addition to the original 100,000 tests to reach the point where about 90% of the remaining positives are true positives. This would identify only about one-half of the active chemicals with which one started, because the remainder are dropped as false negatives along the way. Specificity is more important than sensitivity for getting the greatest number of true positives while minimizing the number of false positives.

Since this seems too easy, I am wondering what additional factors about the distributions of true and false positive samples that I haven’t considered and that would preclude doing an iterative approach to high-throughput screening? If you have the time and inclination, I would like to learn more about the shortcomings of an iterative approach.

Sincerely,

Robert

Rotroff, D., M. Martin, D. Dix, D. Filer, K. Houck, T. Knudsen, N. Sipes, D. Reif, M. Xia, R. Huang, and R. Judson. Predictive Endocrine Testing in the 21st Century Using In Vitro Assays of Estrogen Receptor Signaling Responses. Environ. Sci. Technol. 2014, 48, 8706-8716.
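The arithmetic in the comment above can be checked with a short sketch. It is written in Python, uses the figures quoted in the comment (100,000 chemicals, 1% prevalence, 91% sensitivity, 65% specificity), and makes one strong assumption that is mine: every round is an independent draw with the same sensitivity and specificity. The function name `iterative_screen` is hypothetical.

```python
def iterative_screen(n_chemicals: int, prevalence: float,
                     sensitivity: float, specificity: float, rounds: int):
    """Expected counts when only the positives of each round are re-screened.

    Assumes independent rounds with constant sensitivity and specificity.
    Returns one tuple per round: (round, tested, TP, FP, fraction of true hits).
    """
    actives = n_chemicals * prevalence
    inactives = n_chemicals - actives
    history = []
    for round_no in range(1, rounds + 1):
        tested = actives + inactives
        actives *= sensitivity             # true positives carried forward
        inactives *= (1.0 - specificity)   # false positives carried forward
        history.append((round_no, tested, actives, inactives,
                        actives / (actives + inactives)))
    return history


history = iterative_screen(100_000, 0.01, 0.91, 0.65, rounds=7)
for round_no, tested, tp, fp, frac_true in history:
    print(f"round {round_no}: tested {tested:8.0f}, TP {tp:6.1f}, "
          f"FP {fp:8.1f}, true fraction {frac_true:.3f}")
```

Under the independence assumption, the first round gives a true fraction of about 2.6%, and after seven rounds roughly 517 of the original 1,000 actives remain with a true fraction of about 89%, at a cost of roughly 57,600 re-tests. False positives caused by a systematic assay artifact would recur in every round, so the real gain from repeating the same assay would be smaller.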


Dear Robert,

thank you very much for your thoughtful reply. I don’t think there is anything logically wrong with your line of thinking on iterative testing, especially because you describe a situation of high sensitivity. This means that you would not throw out many true hits as false negatives in each round.

I am not familiar with the field of environmental chemistry, but from my experience in biomedical research, iterative testing would be prohibitively expensive most of the time. This is also part of the reason why high-throughput screens are usually not replicated by another lab. Most high-throughput screens employ secondary screening of the hits of the first round using an orthogonal assay system, rather than the same assay system. For example, if the primary screen was a biochemical assay, the secondary screen would be imaging-based. This maximizes the cost/benefit relationship and yields the most “robust” hits because it minimizes the influence of assay-induced bias.

A more subtle point that I did not talk about in the original post is that power calculations hold only for a single p-value, or in practice an exceedingly small interval around it, such as all p-values between 0.049 and 0.051. High-throughput screens typically yield a range of p-values, each of which is associated with its own positive predictive value. That means that a hit with a p-value of 1e-10 is much less likely to be a false positive than one with a p-value of 0.049. This is especially true if orthogonal screening is applied to eliminate assay bias.

I hope this helps and good luck with your research!

Best,

Jesse
