PhaseFolio
PhaseFolio Signal Evaluations — Full Report

Drug-Specific Predictive Signals in Oncology: What Held and What Didn't

Tamal Adebisi (PhaseFolio); Claude Opus 4.7
2026-05-28 · evaluations@2026-05-28

Abstract

We tested whether early-phase objective response rate (ORR) magnitude — a drug-development signal several commercial vendors market as predictive — adds predictive value beyond molecular biomarker quality for oncology programs entering Phase II/III. On a held-out cohort of 85 non-small-cell lung cancer (NSCLC) programs with now-known outcomes, a joint biomarker-quality × ORR-bucket model improved held-out AUC over a biomarker-only baseline by only 0.5 percentage points (paired DeLong p = 0.48), and the comparison is statistically unpowerable: detecting even a 3-point gain at this baseline would require roughly 830 programs. Biomarker quality itself was validated and is scored in the production engine (+5.2 percentage points of held-out AUC; genomic-validated cohort odds ratio 5.59 against a conservative literature anchor of 1.35). We conclude that early-phase ORR magnitude carries essentially no predictive signal independent of biomarker quality, and we publish the full ablation and power analysis as a transparent negative result.

1. Background

PhaseFolio's probability-of-success engine scores a drug-specific multiplier for biomarker quality — whether a program enrolls patients by a genomic-grade molecular alteration (genomic_validated), a protein-only marker (protein_only), or no molecular selection (unselected). A second candidate signal, early-phase ORR magnitude, was carried as a non-scored informational flag because, in earlier work, combining it with biomarker quality reduced held-out accuracy (the combined-signal AUC of 0.615 fell below the biomarker-only 0.670 at 50% cohort coverage) — a classic double-counting effect, since the two signals are correlated.
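To make the multiplier mechanics concrete, here is a minimal sketch of how an odds multiplier moves a probability. It follows the same odds-scaling logic as `_logit_apply` in Appendix A; the function name `apply_odds_multiplier` and the 0.62 prior are assumptions chosen only for illustration, and the three multipliers are the shipped anchors quoted in the Discussion.

```python
def apply_odds_multiplier(p: float, m: float) -> float:
    """Scale the odds of probability p by m and convert back to a probability."""
    if p <= 0.0 or p >= 1.0:
        return p  # degenerate probabilities are left unchanged
    odds = p / (1.0 - p)
    return (odds * m) / (1.0 + odds * m)


# Hypothetical prior probability, used only for illustration.
prior = 0.62

# Shipped biomarker-quality anchors as quoted in this report.
anchors = {"genomic_validated": 1.35, "protein_only": 0.85, "unselected": 1.00}

adjusted = {label: apply_odds_multiplier(prior, m) for label, m in anchors.items()}

for label, p in adjusted.items():
    print(f"{label:<18} {p:.3f}")
```

Because the multiplier acts on odds rather than on the probability itself, the same anchor moves a mid-range prior more than a prior already close to 0 or 1.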

A proposed framework sought to recover ORR's contribution through a joint biomarker × ORR odds-ratio table with modality-conditional thresholds. An adversarial design review raised three blocking concerns: (1) the proposed +3-percentage-point validation gate was underpowered at the available cohort size; (2) the joint grid would collapse to mostly-empty cells under a governance rule requiring at least three distinct sponsors per cell; and (3) the claim that a joint table could "never regress" against the baseline is not logically guaranteed. We therefore ran a pre-specified feasibility gate — sharpened to per-cell counts and an explicit power calculation — before committing to any framework.

2. Methods

Cohort. 85 NSCLC programs that reached a Phase II/III decision, comprising both approvals and failures, each classified by biomarker quality and (where available) early-phase ORR.

Joint-cell construction. Programs were cross-classified into a biomarker-quality × ORR-bucket grid, evaluated under two interpretations: a pooled grid (9 possible cells) and a modality-conditional grid (36 possible cells across four modalities). Cell odds ratios were fit on a 70% training split and applied to the 30% held-out split. Global ORR thresholds were used as the optimistic case; modality-conditional thresholds only subdivide cells further.
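The per-cell fit can be sketched as follows. This is a simplified illustration of the gate and odds-ratio step, not the production code: the `Cell` class, the cell names, and all counts are hypothetical, while the gate itself (at least three distinct sponsors and both outcomes present) is the governance rule described in the Background.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    approved: int
    failed: int
    sponsors: int  # number of distinct sponsors contributing to the cell


def cell_odds_ratio(cell: Cell, base_rate: float, min_sponsors: int = 3) -> Optional[float]:
    """Odds ratio of the cell's approval rate against the training base rate.

    Returns None when the cell fails the governance gate. A one-sided cell
    (no approvals or no failures) also fails, which avoids a zero or
    infinite odds ratio.
    """
    if cell.sponsors < min_sponsors or cell.approved < 1 or cell.failed < 1:
        return None
    rate = cell.approved / (cell.approved + cell.failed)
    base_odds = base_rate / (1.0 - base_rate)
    return (rate / (1.0 - rate)) / base_odds


# Hypothetical training cells, for illustration only.
cells = {
    "populated": Cell(approved=6, failed=2, sponsors=4),
    "one_sided": Cell(approved=5, failed=0, sponsors=3),
    "few_sponsors": Cell(approved=3, failed=1, sponsors=2),
}
ratios = {name: cell_odds_ratio(c, base_rate=0.5) for name, c in cells.items()}
print(ratios)
```

Cells that return None fall back to the biomarker-only multiplier at scoring time, so the joint model can only differ from the baseline in cells that clear the gate.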

Comparison. The joint model's held-out AUC was compared to the production (engine 2.6.0) biomarker-only baseline using a paired DeLong test, which accounts for the covariance between two AUCs measured on the same cases. This is a more favorable test for the joint model than the independent approximation used elsewhere, which overstates the variance of the difference and so understates power. The joint model still fails it.

Power. We computed the minimum detectable AUC difference at 80% power (α = 0.05), scaling the standard error by approximately 1/√N from the observed held-out biomarker-vs-baseline standard error as an achievable-signal proxy. Random seed 42 throughout.
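The scaling described above can be written as a short function. This is a sketch under the stated 1/√N assumption; the function name and the example standard error of 0.03 are illustrative placeholders, not values taken from the analysis.

```python
import math
from statistics import NormalDist


def min_detectable_delta(se_observed: float, n_observed: int, n_projected: int,
                         alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable AUC difference at a projected held-out size.

    Assumes the standard error of the AUC difference shrinks as 1/sqrt(N),
    so the observed SE is rescaled by sqrt(n_observed / n_projected).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * se_observed * math.sqrt(n_observed / n_projected)


# Hypothetical SE of 0.03 observed on 27 held-out programs.
mdd_now = min_detectable_delta(0.03, 27, 27)
mdd_4x = min_detectable_delta(0.03, 27, 108)
print(f"{mdd_now:.3f} {mdd_4x:.3f}")
```

Quadrupling the held-out size only halves the detectable difference, which is why small AUC gains demand very large cohorts.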

3. Results

3.1 The joint cells cannot be populated

Only 28 of 85 programs carried both a biomarker classification and an extractable early-phase ORR, split 23 approved / 5 failed. Failures reported an extractable early-phase ORR at roughly 12% (5 of 41) versus approvals at roughly 52% (23 of 44). This is intrinsic survivorship bias: failed programs disproportionately never publish an early-phase ORR, so the cells that do populate are approval-dominated and ORR cannot discriminate outcome within them.
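The arithmetic behind the survivorship claim is short enough to check directly. The sketch below uses only the counts quoted in the paragraph above; the variable names are illustrative.

```python
# Programs with an extractable early-phase ORR, by eventual outcome.
reported = {"approved": 23, "failed": 5}
# All programs in the cohort, by eventual outcome.
totals = {"approved": 44, "failed": 41}

# Share of each outcome group that reported an ORR.
reporting_rate = {k: reported[k] / totals[k] for k in totals}

# Approval share in the full cohort versus among ORR-reporting programs.
cohort_approval_share = totals["approved"] / sum(totals.values())
reporter_approval_share = reported["approved"] / sum(reported.values())

print(f"reporting rate, approved: {reporting_rate['approved']:.0%}")
print(f"reporting rate, failed:   {reporting_rate['failed']:.0%}")
print(f"approval share, cohort:    {cohort_approval_share:.0%}")
print(f"approval share, reporters: {reporter_approval_share:.0%}")
```

Conditioning on an extractable ORR raises the approval share from about 52% to about 82%, before ORR magnitude is even considered.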

| Grid interpretation | Possible cells | Populated | Clear governance gate (≥3 sponsors + both outcomes) | Clear stricter gate (N ≥ 6) |
|---|---|---|---|---|
| Pooled (biomarker × ORR) | 9 | 6 | 3 | 1 |
| Per-modality (× 4 modalities) | 36 | 8 | 3 | 1 |

The modality-conditional interpretation leaves 28 of 36 cells empty. Only the genomic-validated × high-ORR cell is robustly populated (N = 15) — but at 14 approved / 1 failed it is near-perfectly separated, which itself defeats stable odds-ratio estimation.
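One way to see why a 14 / 1 cell gives an unstable estimate is to put a rough interval around its odds. The sketch below uses a Wald interval on the log-odds, which is an assumption made for illustration and not a method used in this evaluation; the function name is hypothetical.

```python
import math


def odds_with_wald_interval(approved: int, failed: int, z: float = 1.96):
    """Cell odds with an approximate 95% Wald interval on the log-odds scale."""
    odds = approved / failed
    se_log = math.sqrt(1 / approved + 1 / failed)
    lower = math.exp(math.log(odds) - z * se_log)
    upper = math.exp(math.log(odds) + z * se_log)
    return odds, lower, upper


# The genomic-validated x high-ORR cell: 14 approved, 1 failed.
odds, lower, upper = odds_with_wald_interval(14, 1)

# Reclassifying a single program (13 approved, 2 failed) for comparison.
flipped_odds, _, _ = odds_with_wald_interval(13, 2)

print(f"odds={odds:.1f}  interval=({lower:.1f}, {upper:.1f})  one flip -> {flipped_odds:.1f}")
```

Under this approximation the interval spans well over an order of magnitude, and moving one program between outcomes roughly halves the point estimate.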

3.2 No marginal signal beyond biomarker quality

| Model | Held-out AUC | Δ vs biomarker-only | Paired DeLong p |
|---|---|---|---|
| Structural baseline | 0.618 | | |
| Biomarker-only (engine 2.6.0, shipped) | 0.670 | | |
| Joint table, governance-gated | 0.676 | +0.005 | 0.480 |
| Joint table, no gate (every cell fit) | 0.687 | +0.016 | 0.377 |

Under the governance gate, only one joint cell clears the fit on the training split, so the joint table is essentially the biomarker-only result with one earned cell. It does correctly remove the earlier double-counting drag (recovering the combined signal from 0.615 back to 0.676 ≈ the biomarker-only 0.670) — but that only returns to where the production engine already sits. Even maximally overfit with no governance gate, it reaches +1.6 points, never the +3-point gate and never close to statistical significance. Early-phase ORR carries almost no predictive signal independent of biomarker quality, because the programs with an extractable ORR are the genomic-validated successes that biomarker quality already flags.
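The validation gate can be applied to the reported numbers mechanically. The sketch below mirrors the pass condition printed by the script in Appendix A (a gain of at least 0.030 with p < 0.05); the dictionary layout and constant names are illustrative.

```python
GATE_DELTA = 0.030  # required AUC gain over the biomarker-only baseline
GATE_ALPHA = 0.05   # required significance level

# (delta vs biomarker-only, paired DeLong p) as reported in Section 3.2.
results = {
    "Joint table, governance-gated": (0.005, 0.480),
    "Joint table, no gate (every cell fit)": (0.016, 0.377),
}

verdicts = {
    name: "PASS" if (delta >= GATE_DELTA and p < GATE_ALPHA) else "FAIL"
    for name, (delta, p) in results.items()
}

for name, verdict in verdicts.items():
    print(f"{name:<40} {verdict}")
```

Both configurations fail on both conditions at once: the gain is too small and the p-value is far from significance.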

3.3 Unpowerable at reachable scale

| Scenario | Held-out N | Minimum detectable Δ |
|---|---|---|
| NSCLC, current | 27 | 0.088 |
| NSCLC, full (N = 85) | ~30 | 0.083 |
| Pooled four-indication | ~100 | 0.046 |
| RA-regime target | ~250 | 0.029 |

Detecting the +3-point gate requires roughly 250 held-out programs ≈ 830 total — an order of magnitude beyond the 85 available, and consistent with our prior power analysis. Even a pooled four-indication cohort (~100 held-out) detects only ~4.6 points. The observed ORR signal (§3.2) is +0.5 points; the power floor (§3.3) is +8.8 points. The two gaps are independent and each is individually disqualifying.
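The cohort-size figures can be cross-checked by inverting the same 1/√N scaling. This is a back-of-envelope sketch using the rounded values from the table above, so it will not match the script to the last digit; the function name is illustrative, and the 0.30 held-out fraction is the split stated in the Methods.

```python
import math


def held_out_needed(mdd_observed: float, n_observed: int, target_delta: float) -> float:
    """Held-out size at which the minimum detectable delta reaches target_delta,
    assuming the detectable delta scales as 1/sqrt(N)."""
    return n_observed * (mdd_observed / target_delta) ** 2


# Current held-out set: N = 27 with a minimum detectable delta of 0.088.
needed = held_out_needed(0.088, 27, 0.030)

# Minimum detectable delta at the ~250 held-out scenario.
mdd_at_250 = 0.088 * math.sqrt(27 / 250)

# Total programs implied by 250 held-out at a 30% held-out fraction.
total_for_250 = 250 / 0.30

print(f"held-out needed for 0.030: ~{needed:.0f}")
print(f"MDD at 250 held-out: {mdd_at_250:.3f}")
print(f"total programs for 250 held-out: ~{total_for_250:.0f}")
```

With these rounded inputs the break-even point falls a little above 230 held-out programs, so the ~250 scenario is the first row of the table that clears the 0.030 gate.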

4. Discussion

Early-phase ORR scoring fails on two independent grounds: the marginal signal is approximately zero, and the validation is roughly 10× out of power. Neither is fixable with more data extraction — the survivorship bias (failures do not report ORR) is structural, and the required scale is an order of magnitude away.

The result is still positive for the underlying signal the engine already scores. Biomarker quality (genomic_validated 1.35× / protein_only 0.85× / unselected 1.00×, applied as odds multipliers at Phase II/III) is real and stable: +5.2 percentage points of held-out AUC, validated on NSCLC, with a genomic-validated cohort odds ratio of 5.59 against a conservative literature anchor of 1.35 (N = 21, clears the governance gate). We ship the conservative anchor and disclose the overshoot rather than overfit to our own cohort. Publishing that we tested a competitor-marketed signal, early-phase ORR magnitude, and found it adds no predictive value beyond biomarker quality, with the ablation and power analysis attached, is a form of differentiation that an opacity-based vendor cannot match: they cannot publish negative results on the signals they sell.

5. Limitations

The biomarker-quality validation is single-indication (NSCLC, N = 85); generalization to other solid tumors is under active multi-indication validation and will be published as it completes. The ORR analysis is bounded by survivorship bias (failed programs under-report early-phase ORR) and by statistical power at this cohort size. This is a transparent methods evaluation, not a peer-reviewed publication.

6. Conclusion

Early-phase ORR magnitude remains a surfaced informational flag in the production engine; it is not scored. Phase 2 is reframed as multi-indication validation of the already-shipped biomarker-quality multiplier (expanding the cohort to breast, melanoma, and colorectal cancer, testing whether the shipped anchors generalize beyond NSCLC, and scoping per-indication where they do not) and publication of this negative result.

Appendix A

Analysis Code (Python)

View-only; results are reproducible from this listing. No download is provided — the analysis script is presented here for full transparency.

"""Gate-0 Phase 2 feasibility analysis (2026-05-28).

Decides, BEFORE any Phase 2 build, whether promoting `phase1_orr` to a scored
signal via a joint (biomarker_quality x ORR-bucket) odds-ratio table is viable.
Produced the data behind the reframe of Phase 2 away from phase1_orr scoring
and toward multi-indication validation of the already-shipped
`biomarker_quality` multiplier. The published methodology kill of phase1_orr
cites this script's output.

Two decisive questions:

  PART 1 - Per-cell feasibility: can the joint grid be populated to the
    governance gate (>=3 distinct sponsors AND >=1 approval AND >=1 failure)?
    Computed on the real NSCLC v4 extractions, under the pooled-9-cell and
    per-modality-36-cell interpretations.

  PART 2 - Marginal signal + power: does a joint table beat the 2.6.0
    biomarker-only baseline on held-out data, and is +3pp even detectable at
    the available N? Uses a PROPER paired DeLong covariance test (not the
    independent approximation in backtest_drug_specific_phase1.py, which
    overstates variance and understates power).

Run:
    # Activate your Python virtualenv, then:
    python gate0_phase2_joint_feasibility.py

Inputs:
    the v4 drug-signal extractions dataset (43-drug signals)
    the NSCLC cohort definition (N=85 cohort)
"""
from __future__ import annotations

import json
import math
import os
import sys
from collections import defaultdict
from pathlib import Path

import numpy as np

# Put the directory holding the `cohorts` package on the import path, then
# import the NSCLC cohort definition. Adjust COHORTS_ROOT to wherever you keep it.
COHORTS_ROOT = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, COHORTS_ROOT)
from cohorts import nsclc  # noqa: E402

# Point DATA_DIR at wherever you keep the extraction/cohort data. The drug-signal
# extractions are a newline-delimited JSON file (one record per line).
DATA_DIR = Path(__file__).resolve().parent
EXTRACTIONS_FILE = "phase0-merged-v4-extractions.ndjson"
V4 = DATA_DIR / EXTRACTIONS_FILE

# Global ORR buckets (production backtest thresholds). Modality-conditional
# thresholds (per the original Phase 2 spec) would only SPLIT these further,
# so global buckets are the OPTIMISTIC case for cell population.
ORR_BUCKETS = [("low", 0, 25), ("mid", 25, 40), ("high", 40, 101)]

# 2.6.0 biomarker-only literature anchors (the Phase 2 baseline).
BIOMARKER_ANCHOR = {
    "genomic_validated": 1.35,
    "protein_only": 0.85,
    "unselected": 1.00,
    "unknown": 1.00,
}

SEED = 42


def orr_bucket(v):
    if v is None:
        return None
    for label, lo, hi in ORR_BUCKETS:
        if lo <= v < hi:
            return label
    return None


def sponsor_norm(s: str) -> str:
    """Normalize sponsor for distinct-count (strips parentheticals / co-marketing).
    Over-counting distinct sponsors is OPTIMISTIC for the gate."""
    return (s or "Unknown").split("(")[0].split("/")[0].strip().lower()


# ----------------------------- Load + join -----------------------------
def load():
    signals = {}
    for line in V4.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = json.loads(line)
        if "_error" in row:
            continue
        nm = row.get("drug_name", "").lower()
        if nm:
            signals[nm] = row

    recs = []
    for d in nsclc.COHORT:
        sig = signals.get(d["drug"].lower())
        bm = orr_val = orr_mod = orr_b = None
        if sig:
            bm = (sig.get("biomarker_quality") or {}).get("value")
            orr = sig.get("phase1_orr") or {}
            orr_val = orr.get("orr_percent")
            orr_mod = orr.get("orr_modality_bucket")
            if orr_val is not None and orr_mod:
                orr_b = orr_bucket(orr_val)
        recs.append({
            "drug": d["drug"],
            "sponsor": sponsor_norm(d.get("sponsor", "Unknown")),
            "approved": 1 if d.get("outcome") == "approved" else 0,
            "modality": d.get("modality"),
            "has_sig": bool(sig),
            "biomarker": bm,
            "orr_val": orr_val,
            "orr_mod": orr_mod,
            "orr_bucket": orr_b,
            "rec": d,
        })
    return recs


# ----------------------------- PART 1: cell feasibility -----------------------------
def cell_table(rows, keyfn):
    cells = defaultdict(lambda: {"n": 0, "app": 0, "fail": 0, "sponsors": set()})
    for r in rows:
        k = keyfn(r)
        if k is None:
            continue
        c = cells[k]
        c["n"] += 1
        c["app"] += r["approved"]
        c["fail"] += 1 - r["approved"]
        c["sponsors"].add(r["sponsor"])
    return cells


def key_pooled(r):
    if r["biomarker"] and r["orr_bucket"]:
        return f"{r['biomarker']:>17} | {r['orr_bucket']:>4}"
    return None


def key_permod(r):
    if r["biomarker"] and r["orr_bucket"] and r["orr_mod"]:
        return f"{r['biomarker']:>17} | {r['orr_bucket']:>4} | {r['orr_mod']}"
    return None


def report_cells(title, cells, n_grid):
    print(f"\n{'='*72}\nPART 1 — {title}\n{'='*72}")
    print(f"{'cell':<48} {'N':>2} {'app':>3} {'fail':>4} {'spons':>5} gate")
    spec_pass = nge6_pass = 0
    for k in sorted(cells):
        c = cells[k]
        nsp = len(c["sponsors"])
        both = c["app"] >= 1 and c["fail"] >= 1
        spec_gate = both and nsp >= 3
        nge6_gate = spec_gate and c["n"] >= 6
        spec_pass += spec_gate
        nge6_pass += nge6_gate
        tag = "PASS(N>=6)" if nge6_gate else (
            "PASS" if spec_gate else ("both-only" if both else "1-sided"))
        print(f"{k:<48} {c['n']:>2} {c['app']:>3} {c['fail']:>4} {nsp:>5} {tag}")
    print(f"\nGrid: {n_grid} possible | populated {len(cells)} | empty {n_grid-len(cells)}")
    print(f"Clearing SPEC gate (>=3 sponsors + both outcomes): {spec_pass}")
    print(f"Clearing STRICTER gate (+ N>=6): {nge6_pass}")
    return spec_pass, nge6_pass


# ----------------------------- Scoring helpers -----------------------------
def _logit_apply(p, m):
    if p <= 0 or p >= 1 or m == 1.0:
        return p
    o = p / (1 - p)
    return (o * m) / (1 + o * m)


def baseline_score(d):
    """Portable engine-2.5.0 proxy baseline (matches production backtest)."""
    base = 1.0
    mod = d.get("modality", "small_molecule")
    p3 = {"small_molecule": 0.62, "monoclonal_antibody": 0.65,
          "fusion_protein": 0.62, "bispecific": 0.55, "adc": 0.60,
          "peptide": 0.45, "cell_therapy": 0.50}
    for stage in d.get("remaining_stages", []):
        nm = stage.get("name", "")
        if "Phase 3" in nm:
            base *= p3.get(mod, 0.60)
        elif "NDA" in nm or "BLA" in nm:
            base *= 0.86
    if d.get("first_in_class"):
        base = _logit_apply(base, 0.90)
    if d.get("orphan_designation"):
        base = _logit_apply(base, 1.15)
    if d.get("biomarker_strategy") == "companion_dx":
        base = _logit_apply(base, 1.20)
    if d.get("genetic_validation"):
        base = _logit_apply(base, 1.10)
    return base


def stratified_split(rows, seed):
    import random
    rng = random.Random(seed)
    app = [r for r in rows if r["approved"] == 1]
    fail = [r for r in rows if r["approved"] == 0]
    rng.shuffle(app)
    rng.shuffle(fail)
    na, nf = int(0.70 * len(app)), int(0.70 * len(fail))
    return app[:na] + fail[:nf], app[na:] + fail[nf:]


# ----------------------------- Proper paired DeLong -----------------------------
def _placement(pos, neg):
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    m, n = len(pos), len(neg)
    v10 = np.array([(np.sum(pos[i] > neg) + 0.5 * np.sum(pos[i] == neg)) / n
                    for i in range(m)])
    v01 = np.array([(np.sum(pos > neg[j]) + 0.5 * np.sum(pos == neg[j])) / m
                    for j in range(n)])
    return v10.mean(), v10, v01


def delong_paired(scores_a, scores_b, labels):
    labels = np.asarray(labels)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    pm = labels == 1
    m, n = int(pm.sum()), int((~pm).sum())
    auc_a, v10a, v01a = _placement(a[pm], a[~pm])
    auc_b, v10b, v01b = _placement(b[pm], b[~pm])
    S = np.cov(np.vstack([v10a, v10b])) / m + np.cov(np.vstack([v01a, v01b])) / n
    var_diff = S[0, 0] + S[1, 1] - 2 * S[0, 1]
    se = math.sqrt(max(var_diff, 1e-12))
    diff = auc_a - auc_b
    z = diff / se if se > 0 else 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return auc_a, auc_b, diff, se, z, p


def auc_only(scores, labels):
    labels = np.asarray(labels)
    a, _, _ = _placement(np.asarray(scores)[labels == 1],
                         np.asarray(scores)[labels == 0])
    return a


def fit_joint(train_rows, keyfn, gate=True):
    base_rate = sum(r["approved"] for r in train_rows) / len(train_rows)
    base_odds = base_rate / (1 - base_rate)
    cells = cell_table([r for r in train_rows if r["biomarker"] and r["orr_bucket"]], keyfn)
    ors, cleared = {}, 0
    for k, c in cells.items():
        both = c["app"] >= 1 and c["fail"] >= 1
        if gate and not (both and len(c["sponsors"]) >= 3):
            continue
        rate = c["app"] / c["n"]
        if rate <= 0 or rate >= 1:  # perfect separation -> can't compute OR
            continue
        ors[k] = (rate / (1 - rate)) / base_odds
        cleared += 1
    return ors, cleared


def score_model(rows, mode, joint_ors=None, keyfn=None):
    out = []
    for r in rows:
        p = baseline_score(r["rec"])
        if mode == "biomarker_only" and r["biomarker"]:
            p = _logit_apply(p, BIOMARKER_ANCHOR.get(r["biomarker"], 1.0))
        elif mode == "joint":
            k = keyfn(r) if (r["biomarker"] and r["orr_bucket"]) else None
            if k is not None and k in joint_ors:
                p = _logit_apply(p, joint_ors[k])
            elif r["biomarker"]:
                p = _logit_apply(p, BIOMARKER_ANCHOR.get(r["biomarker"], 1.0))  # FLOOR
        out.append(p)
    return out


def main():
    recs = load()
    n_total = len(recs)
    n_app = sum(r["approved"] for r in recs)
    covered = [r for r in recs if r["has_sig"]]
    with_bm = [r for r in recs if r["biomarker"]]
    with_both = [r for r in recs if r["biomarker"] and r["orr_bucket"]]

    print("=" * 72 + "\nPART 0 — Coverage\n" + "=" * 72)
    print(f"Cohort N={n_total} ({n_app} approved / {n_total - n_app} failed)")
    print(f"Drugs with ANY signal: {len(covered)} ({len(covered)/n_total:.0%})")
    print(f"Drugs with biomarker: {len(with_bm)}")
    print(f"Drugs with biomarker+ORR:{len(with_both)} (joint-cell population)")
    print(f" approved/failed: {sum(r['approved'] for r in with_both)}/"
          f"{sum(1-r['approved'] for r in with_both)} <-- survivorship: failures underreport ORR")

    report_cells("Pooled 9-cell joint grid (biomarker x ORR)",
                 cell_table(with_both, key_pooled), 9)
    report_cells("Per-modality 36-cell joint grid",
                 cell_table(with_both, key_permod), 36)

    print("\n" + "=" * 72 + "\nPART 2 — Joint table vs biomarker-only (held-out 30%)\n" + "=" * 72)
    train, held = stratified_split(recs, SEED)
    held_labels = [r["approved"] for r in held]
    for label, keyfn, gate in [
        ("JOINT pooled-9, governance-gated", key_pooled, True),
        ("JOINT per-mod-36, governance-gated", key_permod, True),
        ("JOINT pooled-9, NO gate (overfit every cell)", key_pooled, False),
    ]:
        joint_ors, cleared = fit_joint(train, keyfn, gate=gate)
        aj, ab, diff, se, z, p = delong_paired(
            score_model(held, "joint", joint_ors, keyfn),
            score_model(held, "biomarker_only"),
            held_labels)
        ab2 = auc_only(score_model(held, "baseline"), held_labels)
        print(f"\n{label}\n train cells cleared: {cleared} | held N={len(held)} "
              f"({sum(held_labels)}/{len(held)-sum(held_labels)}) | structural baseline AUC={ab2:.3f}")
        print(f" biomarker-only={ab:.3f} joint={aj:.3f} diff={diff:+.3f} "
              f"(paired DeLong SE={se:.3f}, z={z:+.2f}, p={p:.3f})")
        print(f" GATE (+0.030 @ p<0.05): {'PASS' if (diff >= 0.03 and p < 0.05) else 'FAIL'}")

    print("\n" + "=" * 72 + "\nPART 2b — Minimum detectable AUC delta (paired DeLong)\n" + "=" * 72)
    s_base = score_model(held, "baseline")
    s_bm = score_model(held, "biomarker_only")
    _, _, _, se_layer, _, _ = delong_paired(s_bm, s_base, held_labels)
    Z = 1.959963985 + 0.8416212336  # 80% power, two-sided alpha 0.05
    nh = len(held)
    for n_proj, tag in [(nh, f"NSCLC now (held={nh})"),
                        (30, "NSCLC full N=85 (~30 held)"),
                        (100, "pooled 4-indication (~100 held)"),
                        (250, "RA-regime (~250 held)")]:
        mdd = Z * se_layer * math.sqrt(nh / n_proj)
        print(f" {tag:<38} min detectable AUC delta @80% power = {mdd:.3f}")
    print("\n Target to clear gate = 0.030. Research §8.3: +3pp at ~0.67 baseline needs")
    print(" N~100-120/cohort; 0.625 baseline ~250 'out of reach'. Joint-vs-biomarker")
    print(" isolates a SMALLER delta -> needs MORE N. ~250 held = ~830 drugs.")


if __name__ == "__main__":
    main()