Backtest Methodology
PhaseFolio validates probability-of-success predictions against historical drug outcomes using held-out cohorts whose fates are now known. The first published cohort is 16 rheumatoid-arthritis programs with full outcome history: AUC is 0.625, and accuracy is reported at both the conventional ≥50% cutoff and the optimal Youden ≥40% cutoff, each with a 95% Wilson confidence interval.
What a backtest measures
Discrimination, not point accuracy.
A drug-stage backtest takes a fixed cohort of drugs that entered a clinical phase by a cutoff date and asks: did the model rank the drugs that ultimately succeeded above the drugs that ultimately failed? This is a discrimination question, not a point-accuracy question. A model that always predicts 30% PoS for every drug in a cohort with a 30% base rate has perfect average calibration and zero discrimination — useful for budget-setting, useless for picking winners.
The primary headline metric is therefore the area under the ROC curve (AUC), which measures the probability that a randomly chosen approved drug received a higher predicted PoS than a randomly chosen failed drug. An AUC of 0.5 means no skill (random ranking) and 1.0 means a perfect ranking; values below 0.5 would indicate a systematically inverted ranking.
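The pairwise interpretation of AUC can be made concrete with a short sketch. The scores below are toy numbers, not cohort data, and `pairwise_auc` is an illustrative helper rather than the engine's implementation; ties are counted as half a win, per the standard Hanley–McNeil interpretation.

```python
from itertools import product

def pairwise_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))

# Toy cohort: predicted PoS for approved vs. failed drugs.
approved = [0.45, 0.38, 0.52]
failed = [0.30, 0.41, 0.28, 0.35]
print(round(pairwise_auc(approved, failed), 3))  # 0.917 — 11 of 12 pairs ranked correctly
```

A constant predictor (the "always 30%" model above) would tie every pair and score exactly 0.5 here, which is why AUC captures discrimination rather than calibration.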
Wilson score interval on accuracy
Why a small-sample binary accuracy needs a confidence interval.
Beyond ranking, we report a binary call accuracy: the model “calls approved” whenever its predicted cumulative PoS at entry meets or exceeds a cutoff, and the call is correct if and only if it matches the observed outcome. We report this at two cutoffs: the conventional ≥50% classifier midpoint, and the ≥40% optimal-Youden cutoff identified by the threshold sweep.
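The threshold sweep that selects the optimal Youden cutoff maximizes Youden's J = sensitivity + specificity − 1 over candidate cutoffs. A minimal sketch, using synthetic scores and outcomes (the real per-drug data lives on the dashboard; `youden_sweep` is a hypothetical helper name):

```python
def youden_sweep(scores, outcomes, cutoffs):
    """Return the cutoff maximizing Youden's J = sensitivity + specificity - 1."""
    pos = sum(outcomes)            # number of approved drugs (outcome == 1)
    neg = len(outcomes) - pos      # number of failed drugs (outcome == 0)
    best_cut, best_j = None, -1.0
    for c in cutoffs:
        tp = sum(1 for s, y in zip(scores, outcomes) if s >= c and y == 1)
        tn = sum(1 for s, y in zip(scores, outcomes) if s < c and y == 0)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut, best_j

# Toy data: 5 drugs, 2 approved.
cut, j = youden_sweep([0.2, 0.3, 0.45, 0.5, 0.6], [0, 0, 1, 0, 1], [0.3, 0.4, 0.5])
# On this toy data the sweep picks the 0.4 cutoff.
```

At n = 16, the sweep-selected cutoff is itself a noisy estimate, which is one reason the conventional ≥50% cutoff is reported alongside it.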
At small cohort sizes (the rheumatoid-arthritis cohort is 16 drugs) a raw percentage like “75% accurate” is misleading because it implies a precision the data cannot support. We therefore wrap every accuracy figure in a Wilson score interval at 95% confidence. The Wilson interval is the standard binomial-proportion CI for small n — it is asymmetric, never crosses 0 or 1, and behaves correctly when the observed proportion is at the boundary (which it often is for small biotech cohorts).
Worked example. With 11 of 16 calls correct, the point accuracy is 68.8%; the 95% Wilson interval is roughly (44%, 86%). That width is the honest signal. Anyone treating “68.8% accurate” as a precise claim is over-reading 16 data points.
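The worked example can be reproduced directly from the standard Wilson score formula (z = 1.96 for 95% confidence); `wilson_interval` below is an illustrative helper:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(11, 16)
print(f"({lo:.1%}, {hi:.1%})")  # (44.4%, 85.8%)
```

Note the asymmetry around the 68.8% point estimate: the interval extends farther below than above, which a naive normal-approximation interval would not capture at this n.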
Calibration plot
Predicted vs. observed by deciles — a separate question from discrimination.
Calibration asks a different question than discrimination: when the model says “30% PoS,” do roughly 30% of those drugs ultimately get approved? The backtest plot bins predicted probabilities into deciles and overlays observed approval frequency. Perfect calibration sits on the diagonal; under-prediction sits above; over-prediction sits below.
At the rheumatoid-arthritis cohort size, individual decile bins are too small to draw firm conclusions about systematic mis-calibration. The plot is published honestly — including the noisy bins — rather than smoothed away.
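The decile binning behind the plot is straightforward to sketch. The data below is synthetic and `decile_calibration` is a hypothetical helper; notice that with few drugs most bins are empty or near-empty, which is exactly the small-sample caveat above.

```python
def decile_calibration(preds, outcomes):
    """Group predictions into ten equal-width probability bins; return
    (decile, count, mean predicted PoS, observed approval rate) per bin."""
    bins = {}
    for p, y in zip(preds, outcomes):
        d = min(int(p * 10), 9)  # decile index 0..9; p == 1.0 goes in bin 9
        bins.setdefault(d, []).append((p, y))
    rows = []
    for d in sorted(bins):
        pts = bins[d]
        mean_pred = sum(p for p, _ in pts) / len(pts)
        obs_rate = sum(y for _, y in pts) / len(pts)
        rows.append((d, len(pts), mean_pred, obs_rate))
    return rows

# Toy data: four drugs, three approved.
rows = decile_calibration([0.25, 0.28, 0.45, 0.95], [0, 1, 1, 1])
# Three occupied bins; the singleton bins are exactly the noisy bins
# a small cohort produces.
```

Perfect calibration would put mean predicted PoS equal to observed rate in every occupied bin, i.e. every row on the diagonal.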
Rheumatoid-arthritis cohort
The first published validation cohort.
The first published backtest is rheumatoid arthritis: 16 drugs that entered Phase II by the cohort cutoff, scored at entry by the engine, and tracked to either regulatory approval or program termination. The full per-drug ledger lives on the Intelligence dashboard backtest page; the headline numbers are below.
| Metric | Value | Note |
|---|---|---|
| Cohort size | 16 drugs | Phase II entrants, RA, full outcome history |
| AUC | 0.625 | Above no-skill (0.500); modest discriminatory power |
| Accuracy at ≥50% cutoff | 11/16 = 68.8% | 95% Wilson CI (44%, 86%) |
| Accuracy at ≥40% cutoff | 12/16 = 75.0% | Optimal Youden threshold from sweep; 95% Wilson CI (51%, 90%) |
| Engine version | v1.0 | Static BIO/QLS 2021 base rates |
Source: PhaseFolio rheumatoid-arthritis backtest run; per-drug ledger and decile calibration plot at /dashboard/intelligence/rheumatoid-arthritis/backtest.
Sample limitations
What 16 drugs can and cannot tell you.
- Single-indication. RA is one therapeutic area. AUC of 0.625 in RA does not generalize to oncology, neurology, or rare disease. NSCLC is the next cohort and is in progress.
- Wide confidence band. The 95% Wilson interval on accuracy spans roughly (44%, 86%) at this cohort size. The point estimate alone is not a trustworthy summary; the interval is the right object to cite.
- Survivor bias in the source data. The cohort is built from drugs whose Phase II entry could be reliably identified in public registries. Programs that died before public disclosure are unrepresented; this biases the base rate upward by an unknown amount.
- Modifier sparsity. Within 16 drugs, several modifier combinations appear once or zero times. The backtest cannot distinguish whether the genetic-validation modifier or the orphan-designation modifier is doing more work; the cohort is too small for sub-stratification.
- Calibration vs. discrimination. AUC is a ranking metric. A model with poor calibration can still have respectable AUC. Use AUC for pick-the-winner questions; use the calibration plot for size-the-bet questions.
References
1. Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209–212.
2. Brown, L.D., Cai, T.T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101–133.
3. Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
4. Thomas, D.W., Burns, J., Audette, J., Carroll, A., Dow-Hygelund, C., & Hay, M. (2021). Clinical Development Success Rates and Contributing Factors 2011–2020. BIO, QLS Advisors, Informa Pharma Intelligence.
Methodology version: methodology@2026-04 · Last updated: 2026-04-30