Every backtest evaluates the production rNPV engine against a held-out cohort of historical drugs whose real-world fate is now known, using only information available before each drug’s decision point. No future data leaks into the model. Three cohorts are published; each leads with the strongest signal its sample can support.
Early calibration cohort. Directional signal at small n — Wilson 95% accuracy intervals span chance level, so it is read as direction, not confirmation.
Strongest discrimination signal in the published cohorts: 738 ranking pairs, with an AUC that clears the conventional ≥0.70 good-discrimination bar.
Pre-Sprint-1, the engine did not discriminate (AUC 0.524). One cohort-validatable scored multiplier raised the AUC to 0.629; the full ablation is published, not just the largest number.
Every cohort uses the same production engine, the same scoring discipline (pairwise AUC, Wilson-CI accuracy, risk-flag sensitivity), and one disclosed rule for the per-indication decision anchor: anchor at the earliest decision point at which the cohort’s failure population is observable in public registries. The methodology page carries the side-by-side cross-cohort comparison, the discrimination-vs-calibration framing, and the full antimicrobial Sprint-1 ablation.
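For readers who want to see what the three scoring measures compute, here is a minimal Python sketch using their textbook definitions. It is illustrative only and is not the production scoring code: the function names (`pairwise_auc`, `wilson_interval`, `flag_sensitivity`) and the toy inputs are assumptions made for this example, and the engine's handling of ties or exclusions may differ.

```python
import math
from itertools import product


def pairwise_auc(success_scores, failure_scores):
    """Share of (success, failure) pairs in which the success scored higher.

    Ties count as half a win. Under this definition the number of ranking
    pairs is len(success_scores) * len(failure_scores).
    """
    pairs = list(product(success_scores, failure_scores))
    if not pairs:
        raise ValueError("need at least one success and one failure")
    wins = sum(1.0 if s > f else 0.5 if s == f else 0.0 for s, f in pairs)
    return wins / len(pairs)


def wilson_interval(correct, n, z=1.959964):
    """Wilson score interval for a proportion (z = 1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def flag_sensitivity(flagged, failed):
    """Share of drugs that actually failed that carried a risk flag."""
    flags_on_failures = [f for f, did_fail in zip(flagged, failed) if did_fail]
    if not flags_on_failures:
        raise ValueError("no failures in the cohort")
    return sum(flags_on_failures) / len(flags_on_failures)


if __name__ == "__main__":
    # Toy inputs, invented for illustration; not taken from any cohort.
    auc = pairwise_auc([0.9, 0.8, 0.4], [0.5, 0.3])
    low, high = wilson_interval(correct=7, n=10)
    sens = flag_sensitivity([True, False, True, False],
                            [True, True, False, False])
    print(f"pairwise AUC = {auc:.3f}")
    print(f"Wilson 95% interval for 7/10 = ({low:.3f}, {high:.3f})")
    print(f"risk-flag sensitivity = {sens:.2f}")
```

With only 10 toy cases, the Wilson interval for 7 correct calls is wide enough to include 0.5. That is the sense in which a small cohort reads as direction, not confirmation.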