Backtest Methodology
This is methodology@2026-05-16 as it was on 2026-05-16: the immutable record any signed export stamped methodology@2026-05-16 was computed under. It is intentionally the citation prose, not the current presentation, and it will never change. The methodology has since advanced (current is methodology@2026-05-21).
PhaseFolio validates probability-of-success predictions against historical drug outcomes using held-out cohorts whose fates are now known. Three cohorts are published: rheumatoid arthritis (n=16, AUC 0.625), non-small-cell lung cancer (n=59, AUC 0.709), and antimicrobial (n=36, AUC 0.629). Each page leads with the strongest signal in its cohort — pairwise AUC where discrimination is solid, Wilson-CI accuracy at the optimal Youden cutoff at small n. The antimicrobial cohort's pre-Sprint-1 discrimination gap (AUC 0.524; engine PoS identical for successes and failures) was closed to 0.629 by one cohort-validatable scored multiplier (single-asset sponsor fragility); two other candidate multipliers were deliberately demoted to non-scored risk flags after a pre-publication ablation, and the full ablation (baseline 0.524 / M3-only 0.631 / M1+M2-only 0.797 / all-three 0.782) is published rather than only the largest number. Cohorts differ in decision anchor (Phase 2 for RA/NSCLC, Phase 3 for antimicrobial) by a single disclosed rule: anchor at the earliest decision point at which the cohort's failure population is registry-observable. Full antimicrobial Sprint-1 forensics, the C. difficile sub-cohort, and per-drug ledgers are published at /research/backtest-antimicrobial.
1. What a backtest measures
A drug-stage backtest takes a fixed cohort of drugs that entered a clinical phase by a cutoff date and asks two distinct questions: did the model rank successes above failures (discrimination), and when the model said "30% PoS," did roughly 30% of those drugs actually succeed (absolute calibration)? These are different questions, answered by different metrics, with different sensitivities to cohort construction.
2. Wilson score interval on accuracy
Beyond ranking, we report a binary call accuracy: the model "calls approved" whenever its predicted cumulative PoS at entry meets or exceeds a cutoff, and the call is correct if and only if it matches the observed outcome. We report this at two cutoffs: the conventional ≥50% classifier midpoint, and the ≥40% optimal-Youden cutoff identified by the threshold sweep.
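The call rule and the threshold sweep can be sketched as follows. This is a minimal illustration, not PhaseFolio's implementation: the function names, the toy PoS values, the candidate cutoffs, and the use of `>=` at the cutoff are all assumptions. Youden's J is computed in its textbook form, sensitivity + specificity − 1.

```python
from typing import Sequence


def call_accuracy(pos: Sequence[float], approved: Sequence[bool], cutoff: float) -> float:
    """Fraction of drugs whose call (PoS >= cutoff) matches the observed outcome."""
    correct = sum((p >= cutoff) == a for p, a in zip(pos, approved))
    return correct / len(pos)


def youden_j(pos: Sequence[float], approved: Sequence[bool], cutoff: float) -> float:
    """Youden's J = sensitivity + specificity - 1 at the given cutoff."""
    true_pos = sum(1 for p, a in zip(pos, approved) if a and p >= cutoff)
    true_neg = sum(1 for p, a in zip(pos, approved) if not a and p < cutoff)
    n_approved = sum(1 for a in approved if a)
    n_failed = len(approved) - n_approved
    return true_pos / n_approved + true_neg / n_failed - 1.0


def best_cutoff(pos: Sequence[float], approved: Sequence[bool],
                candidates: Sequence[float]) -> float:
    """Candidate cutoff with the highest Youden's J (first one wins on ties)."""
    return max(candidates, key=lambda c: youden_j(pos, approved, c))


# Hypothetical toy cohort: four approved drugs followed by four failed drugs.
toy_pos = [0.55, 0.45, 0.42, 0.41, 0.43, 0.35, 0.25, 0.20]
toy_approved = [True, True, True, True, False, False, False, False]

chosen = best_cutoff(toy_pos, toy_approved, [0.30, 0.40, 0.50])
accuracy_at_chosen = call_accuracy(toy_pos, toy_approved, chosen)
accuracy_at_midpoint = call_accuracy(toy_pos, toy_approved, 0.50)
```

On this toy data the sweep selects 0.40, where accuracy is 7/8, compared with 5/8 at the 0.50 midpoint. The numbers are illustrative only and are unrelated to the published cohorts.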
At small cohort sizes a raw percentage like "75% accurate" is misleading because it implies a precision the data cannot support. We therefore wrap every accuracy figure in a Wilson score interval at 95% confidence. The Wilson interval is the standard binomial-proportion CI for small n — it is asymmetric, never crosses 0 or 1, and behaves correctly when the observed proportion is at the boundary (which it often is for small biotech cohorts).
Equation 1 — Wilson score interval: CI = ( p̂ + z²/(2n) ± z × √[ p̂(1−p̂)/n + z²/(4n²) ] ) / ( 1 + z²/n ), where p̂ = observed accuracy (correct calls / cohort size), n = cohort size, and z = 1.96 for a 95% CI.
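Equation 1 translates directly into a few lines of code. The sketch below is an illustration under the definitions given above; the function name `wilson_interval` is an assumption, not an identifier from the PhaseFolio engine.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (Equation 1)."""
    if n <= 0:
        raise ValueError("cohort size n must be positive")
    p_hat = correct / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# RA cohort: 12 correct calls out of 16.
ra_low, ra_high = wilson_interval(12, 16)    # roughly (0.505, 0.898)

# Antimicrobial cohort: 25 out of 36.
amr_low, amr_high = wilson_interval(25, 36)  # roughly (0.531, 0.820)
```

Both results agree with the published 51–90% and 53–82% intervals after rounding to whole percentage points. Note that the interval is not centered on p̂: for 12/16 the center is about 0.70 rather than 0.75, which is the asymmetry described above.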
3. Discrimination vs. absolute calibration
Discrimination (AUC) measures pairwise ranking. The area under the ROC curve is the probability that a randomly chosen approved drug received a higher predicted PoS than a randomly chosen failed drug. An AUC of 0.5 indicates no skill and 1.0 indicates perfect ranking; values below 0.5 indicate ordering worse than chance. AUC is invariant to the absolute level of predicted probabilities: a model that systematically under-predicts by 20 percentage points across the board can still achieve perfect AUC if its ordering is correct.
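The pairwise definition can be computed by direct enumeration. The sketch below is illustrative: the function name and toy scores are assumptions, and counting tied pairs as half is the standard Mann-Whitney convention, which this document does not confirm as the engine's tie rule.

```python
from typing import Sequence


def pairwise_auc(success_scores: Sequence[float], failure_scores: Sequence[float]) -> float:
    """Share of (approved, failed) pairs in which the approved drug scored higher.

    Tied pairs count as half a concordant pair (assumed convention).
    """
    concordant = 0.0
    for s in success_scores:
        for f in failure_scores:
            if s > f:
                concordant += 1.0
            elif s == f:
                concordant += 0.5
    return concordant / (len(success_scores) * len(failure_scores))


# Hypothetical toy scores: 3 approved drugs, 2 failed drugs, 6 pairs in total.
approved_scores = [0.45, 0.40, 0.30]
failed_scores = [0.35, 0.25]
auc = pairwise_auc(approved_scores, failed_scores)  # 5 of 6 pairs concordant

# Under-predicting every drug by 20 percentage points leaves the ordering intact.
auc_shifted = pairwise_auc([s - 0.20 for s in approved_scores],
                           [f - 0.20 for f in failed_scores])
```

Here `auc` and `auc_shifted` are equal, which is the invariance to absolute level described above: only the ordering of scores enters the calculation.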
Absolute calibration measures whether predicted probabilities match observed frequencies. When the model says 30% PoS, do roughly 30% of those drugs ultimately succeed? This is the metric a valuation tool is judged on most directly — calibration drives sizing and discounting decisions, not ranking. Calibration is sensitive to cohort selection: a cohort weighted toward registry-visible survivors shows points above the diagonal independent of engine accuracy.
Both questions matter. Each backtest page leads with whichever metric is the strongest signal in its cohort: AUC when discrimination is solid (AUC ≥ 0.70), Wilson-CI accuracy at the optimal Youden cutoff when cohort size makes a single-number AUC less informative.
4. Calibration plot
The calibration plot bins predicted probabilities into quintiles and overlays observed approval frequency. Perfect calibration sits on the diagonal; points above the diagonal indicate the cohort's observed approval rate exceeds what the engine predicts at that bucket.
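A minimal sketch of the binning step follows. It assumes equal-count quintiles formed by sorting on predicted PoS; the exact binning rule used for the published plots is not stated here. The function name and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def quintile_calibration(pos: Sequence[float],
                         approved: Sequence[bool]) -> List[Tuple[float, float, int]]:
    """Return (mean predicted PoS, observed approval rate, count) per quintile."""
    pairs = sorted(zip(pos, approved))  # ascending by predicted PoS
    n = len(pairs)
    rows = []
    for k in range(5):
        bucket = pairs[k * n // 5:(k + 1) * n // 5]
        if not bucket:  # fewer than 5 drugs can leave a bucket empty
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(1 for _, a in bucket if a) / len(bucket)
        rows.append((mean_pred, observed, len(bucket)))
    return rows


# Hypothetical toy cohort of 10 drugs.
toy_pos = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
toy_approved = [False, False, False, True, False, True, True, True, True, True]

rows = quintile_calibration(toy_pos, toy_approved)
# A bucket sits above the diagonal when its observed rate exceeds its mean prediction.
above_diagonal = [observed > mean_pred for mean_pred, observed, _ in rows]
```

With two drugs per bucket the observed rate can only be 0, 0.5, or 1, which shows why quintile plots at small n are coarse and should be read alongside the interval estimates.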
Calibration plots inherit cohort selection bias. PhaseFolio's cohorts are built from drugs whose Phase 2 entry could be reliably identified in public registries — a survivor-biased subset of the universe of all programs that ever entered Phase 2. Engine PoS values are calibrated to the population base rate (BIO/QLS 2021); the cohort's observed approval rate is higher than the population because invisibly-failed programs are absent. The vertical gap between the engine's prediction and the cohort's observed rate therefore reflects cohort survivorship at least as much as engine miscalibration. Pairwise AUC, by contrast, is invariant to this bias provided successes and failures are equally well-represented in the cohort.
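The two effects can be separated with a small deterministic example. All numbers below are hypothetical and chosen for easy arithmetic: a population in which half of the failures never become registry-visible, removed uniformly across score levels so that the remaining failures stay representative.

```python
def pairwise_auc(successes, failures):
    """Share of (success, failure) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if s > f else 0.5 if s == f else 0.0
               for s in successes for f in failures)
    return wins / (len(successes) * len(failures))


# Hypothetical population: every program is scored either 0.30 or 0.20.
successes = [0.30] * 12 + [0.20] * 8
failures_all = [0.30] * 20 + [0.20] * 60

# Registry-visible cohort: half of the failures are missing at each score level.
failures_visible = [0.30] * 10 + [0.20] * 30

rate_population = len(successes) / (len(successes) + len(failures_all))
rate_cohort = len(successes) / (len(successes) + len(failures_visible))

auc_population = pairwise_auc(successes, failures_all)
auc_cohort = pairwise_auc(successes, failures_visible)
```

The observed approval rate rises from 20% in the population to about 33% in the visible cohort, while the AUC is the same in both. If the missing failures were concentrated at particular score levels instead, the AUC would shift as well, which is the proviso stated above.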
5. Published cohorts
Three held-out cohorts are published; the success criterion is indication-specific FDA approval for all three (drug approved for the named indication, regardless of prior approvals in other indications). They differ in decision anchor by design — see §6.
| Metric | Rheumatoid arthritis | NSCLC | Antimicrobial |
|---|---|---|---|
| Cohort size | 16 (Phase 2 entrants) | 59 (41 approved / 18 failed) | 36 (25 approved / 11 not) |
| Decision anchor | Phase 2 entry | Phase 2 entry | Phase 3 entry |
| Lead signal | Wilson-CI accuracy | Pairwise AUC | Pairwise AUC + gap disclosure |
| Pairwise AUC | 0.625 | 0.709 (738 pairs, 523 concordant) | 0.629 (was 0.524 pre-Sprint-1) |
| Secondary metric | 12/16 = 75.0% accuracy, 95% Wilson CI 51–90% (≥40% Youden cutoff) | mean predicted PoS 8.3% vs 3.6% (separation gap 4.7pp) | 25/36 = 69.4%, 95% Wilson CI 53–82% (separation gap 0.7pp) |
| Status | Directional at small n | PASS (≥0.70 good-discrimination) | PASS (≥0.60) post-Sprint-1 |
| Engine | v1.0 (BIO/QLS 2021 base rates) | v1.0 (BIO/QLS 2021 base rates) | v1.0 + antibacterial multipliers + Sprint-1 M3 scored multiplier |
| Full results | /research/backtest-ra | /research/backtest-nsclc | /research/backtest-antimicrobial |
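The pair counts and gaps in the table can be reproduced from the cohort splits. The sketch below is a plain arithmetic check; it assumes the reported AUC equals concordant pairs divided by total pairs, which holds only if tied pairs contribute nothing beyond the stated concordant count.

```python
# Cohort splits as published in the table above.
nsclc_approved, nsclc_failed = 41, 18
amr_approved, amr_failed = 25, 11

# Every approved drug is compared against every failed drug.
nsclc_pairs = nsclc_approved * nsclc_failed   # 738
amr_pairs = amr_approved * amr_failed         # 275

# NSCLC: 523 concordant pairs out of 738.
nsclc_auc = 523 / nsclc_pairs

# Separation gap: mean predicted PoS for successes minus failures, in percentage points.
nsclc_gap_pp = round(8.3 - 3.6, 1)
```

`nsclc_auc` rounds to the published 0.709 and `nsclc_gap_pp` is the published 4.7pp.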
Per-drug ledgers and quintile calibration plots are in the intelligence dashboard. The antimicrobial cohort is LLM CMO-grade verified (Claude Opus 4.7 acting in a chief-medical-officer reviewer role, not a human medical officer) against ClinicalTrials.gov NCT records, FDA approval letters, and SEC 8-K filings. RA (0.625) and NSCLC (0.709) were re-run as a regression post-Sprint-1 and are number-identical — the antibacterial multipliers are no-ops outside the antimicrobial cohort.
Antimicrobial Sprint-1 — substrate-honest summary (2026-05-16). Pre-Sprint-1 the antibacterial Phase 3-entry PoS was well-calibrated as a point estimate (mean ~0.91 vs observed 25/36 = 69.4%) but did not discriminate approved from failed (AUC 0.524; mean PoS identical for successes and failures). Sprint-1 tested three candidate antibacterial multipliers, each pre-registered on mechanism-class / endpoint-design / financial-structure evidence dated before each drug's decision date (never on retrospective outcome). A pre-publication ablation then decided which could legitimately score the engine: baseline (no Sprint-1) 0.524; M3-only 0.631 (cohort-validatable → scores); M1+M2-only 0.797 (fires only on failures → does not score); all three 0.782 (not shipped).
Decision: only M3 scores. Single-asset sponsor fragility (M3, 0.80× odds-ratio on phase_3 + nda_bla) is the sole scored Sprint-1 signal because it is the only one the cohort can validate — it fires on three approvals (plazomicin, eravacycline cIAI, lefamulin) as well as failures, so a skeptic can independently check it does not merely track outcomes. Final shipped scored AUC = 0.629 (PASS ≥0.60), after a same-day LPAD-gate fix (−0.002, immaterial; the M3-only scoring-decision ablation point was 0.631). M1 (hepatotoxicity mechanism-class) and M2 (sustained-clinical-response endpoint fragility) fire only on this cohort's failures with no approved counterexample (no approved ketolide/DHFR drug; no approved SCR-fragile CDI design), so the cohort structurally cannot self-validate them; they were demoted to non-scored risk flags (HEPATOTOX_CLASS_PRIOR / SCR_ENDPOINT_FRAGILITY), raising risk-flag sensitivity 72.7% → 90.9% (enrichment ratio 1.06 → 1.37) without inflating the scored AUC with a prior the test set cannot test. The full ablation is published, not just the largest number (0.782), because a headline that is mostly an unvalidatable imported prior is not one a CMO advisor should be asked to trust. Full Sprint-1 forensics — per-multiplier rationale, the C. difficile sub-cohort that does not separate (the M2-scored spurious 1.000 the ablation was designed to catch), the LPAD-gate fix, and the per-drug ledger — are published at /research/backtest-antimicrobial.
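To make the size of a 0.80× odds-ratio multiplier concrete, the sketch below applies it to a single probability. This is the generic odds-scaling formula, not the engine's code: the function name is hypothetical, and how the engine composes the multiplier across the phase_3 and nda_bla stages is not specified here.

```python
def apply_odds_multiplier(p: float, multiplier: float) -> float:
    """Scale the odds p / (1 - p) by `multiplier` and convert back to a probability."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    odds = p / (1.0 - p)
    scaled = odds * multiplier
    return scaled / (1.0 + scaled)


# Illustrative input only: a PoS of 0.92 with the M3 multiplier of 0.80.
adjusted = apply_odds_multiplier(0.92, 0.80)  # odds 11.5 -> 9.2, PoS about 0.902
```

Working on the odds scale keeps the result strictly between 0 and 1 for any positive multiplier, and at high PoS values a 0.80× multiplier moves the probability by only about two percentage points.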
6. Sample limitations
- Cross-cohort comparability and anchor selection. All three cohorts use indication-specific FDA approval as the success criterion but differ in decision anchor by design: RA and NSCLC at Phase 2 entry, the antimicrobial cohort at Phase 3. One rule drives this — anchor at the earliest decision point at which the cohort's failure population is observable in public registries, so the cohort is not survivorship-truncated on the failure side. Oncology and RA Phase-2 failures are densely registered, so Phase-2 anchoring is unbiased there; antibacterial Phase-2 deaths are mostly small-biotech business discontinuations that are not registry-observable. A reproducible scan of the antimicrobial substrate (4,102 trials / 81 distinct drugs) finds only 7 Phase-2-terminal programs, ≤4 outside the cohort, none registry-flagged as failed — effectively zero clean Phase-2 antibacterial failures, so a Phase-2-anchored antibacterial cohort would be survivorship-fatal. Phase 3 is the earliest anchor at which that universe is small, bounded and FDA-traceable (hence primary-source-complete at n=36). The per-indication anchor difference is a disclosed consequence of data observability, not an inconsistency; cohorts are published per-indication, not aggregated into a single calibration plot. Full treatment at /research/backtest-antimicrobial.
- Survivor bias in source data. Cohorts are built from drugs whose Phase 2 entry could be reliably identified in public registries. Programs that died before public disclosure are unrepresented; this biases observed approval rates upward by an unknown amount and inflates points above the diagonal in calibration plots independent of engine accuracy.
- Wide confidence bands at small n. The 95% Wilson interval on RA accuracy spans roughly (51%, 90%). The point estimate alone is not a trustworthy summary; the interval is the right object to cite. NSCLC at n=59 supports tighter intervals.
- Modifier sparsity. Within each cohort, several modifier combinations appear once or zero times. The backtest cannot distinguish whether the genetic-validation modifier or the orphan-designation modifier is doing more work; cohorts are too small for sub-stratification.
- Discrimination ≠ calibration. AUC is a ranking metric; absolute calibration is a sizing metric. A model with poor calibration can still have respectable AUC. Use AUC for pick-the-winner questions; use the calibration plot for size-the-bet questions, with the survivorship caveat.
- Engine evolution & remediation. The antimicrobial build used a corrected trial-duration computation; an audit then found the prior method had under-recorded durations in the RA enrichment substrate (NSCLC was checked and was unaffected). The RA substrate has been recomputed and the root cause fixed at source. This field is not consumed by the scored backtest path, so published AUCs (RA 0.625 / NSCLC 0.709 / antimicrobial 0.629), which derive from cohort-level stage assumptions, are unchanged — the correction affects only customer-facing duration figures in exports. The build also added first-class antibacterial support (an infectious-disease endpoint-tier taxonomy and the correct FDA review-division mapping). Named here rather than applied silently — a methodology worth trusting names its own corrections.
Key facts
| Fact | Value |
|---|---|
| RA cohort size | n = 16 RA Phase 2 entrants with terminal outcomes |
| RA AUC | 0.625 |
| RA accuracy (optimal ≥40% cutoff) | 12/16 = 75.0% (95% Wilson CI 51–90%) |
| NSCLC cohort size | n = 59 NSCLC Phase 2 entrants |
| NSCLC AUC | 0.709 (738 pairs, 523 concordant) |
| NSCLC mean predicted PoS, successes / failures | 8.3% / 3.6% (separation gap 4.7pp) |
| Antimicrobial cohort size | n = 36 antibacterial Phase 3 entrants 2004-2019 (25 approved / 11 not approved) |
| Antimicrobial AUC | 0.629 post-Sprint-1 + same-day LPAD-gate fix (was 0.524; +0.105 from the M3 scored multiplier; M3-only scoring-decision ablation point 0.631, LPAD fix −0.002; PASS ≥0.60) |
| Antimicrobial Sprint-1 ablation | scoring-decision ablation (pre-LPAD-fix): baseline 0.524 / M3-only 0.631 / M1+M2-only 0.797 / all-three 0.782. Only M3 scores (cohort-validatable: fires on 3 approvals too); M1/M2 demoted to non-scored flags. Final shipped 0.629 after same-day LPAD-gate fix. |
| Antimicrobial separation gap | 0.7pp (rank-discrimination, not confident probability separation — the engine ranks better than it separates; disclosed) |
| Antimicrobial risk-flag sensitivity | 90.9% (10/11 failures flagged, up from 72.7%; enrichment ratio 1.37 up from 1.06 — M1/M2 demotion strengthened the flag layer) |
| Antimicrobial C. diff sub-cohort | n=5; scored engine does NOT separate it (cadazolid-FAIL 0.922 > bezlotoxumab-APPR 0.918) — honest, vs the spurious 1.000 M2-scored would produce; 3 failures carry SCR_ENDPOINT_FRAGILITY flag. Full detail at /research/backtest-antimicrobial |
| Antimicrobial cohort verification | LLM CMO-grade — Claude Opus 4.7 acting in a chief-medical-officer reviewer role (not a human medical officer), against ClinicalTrials.gov NCT records, FDA approval letters, SEC 8-K filings |
| Antimicrobial regression | RA 0.625 / NSCLC 0.709 number-identical post-Sprint-1 (multiplier no-op outside AMR cohort) |
| Engine version | v1.0 + antibacterial multipliers + Sprint-1 M3 scored multiplier (2026-05-16). Pre-existing LPAD phase_3-gate no-op disclosed; isolated fix following. |
References
01. Wilson, E.B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209–212.
02. Brown, L.D., Cai, T.T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2), 101–133.
03. Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
04. Thomas, D.W., Burns, J., Audette, J., Carroll, A., Dow-Hygelund, C., & Hay, M. (2021). Clinical Development Success Rates and Contributing Factors 2011–2020. BIO, QLS Advisors, Informa Pharma Intelligence.
05. Youden, W.J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32–35.
Frozen snapshot · methodology version: methodology@2026-05-16 · Last updated: 2026-05-16