1. Background
PhaseFolio's probability-of-success engine scores a drug-specific multiplier for biomarker quality — whether a program enrolls patients by a genomic-grade molecular alteration (genomic_validated), a protein-only marker (protein_only), or no molecular selection (unselected). A second candidate signal, early-phase ORR magnitude, was carried as a non-scored informational flag because, in earlier work, combining it with biomarker quality reduced held-out accuracy (the combined-signal AUC of 0.615 fell below the biomarker-only 0.670 at 50% cohort coverage) — a classic double-counting effect, since the two signals are correlated.
A proposed framework sought to recover ORR's contribution through a joint biomarker × ORR odds-ratio table with modality-conditional thresholds. An adversarial design review raised three blocking concerns: (1) the proposed +3-percentage-point validation gate was underpowered at the available cohort size; (2) the joint grid would collapse to mostly-empty cells under a governance rule requiring at least three distinct sponsors per cell; and (3) the claim that a joint table could "never regress" against the baseline is not logically guaranteed. We therefore ran a pre-specified feasibility gate — sharpened to per-cell counts and an explicit power calculation — before committing to any framework.
2. Methods
Cohort. 85 NSCLC programs that reached a Phase II/III decision, comprising both approvals and failures, each classified by biomarker quality and (where available) early-phase ORR.
Joint-cell construction. Programs were cross-classified into a biomarker-quality × ORR-bucket grid, evaluated under two interpretations: a pooled grid (9 possible cells) and a modality-conditional grid (36 possible cells across four modalities). Cell odds ratios were fit on a 70% training split and applied to the 30% held-out split. Global ORR thresholds were used as the optimistic case; modality-conditional thresholds only subdivide cells further.
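The cross-classification and the governance check can be illustrated with a minimal sketch. The records below are hypothetical toy data, not NSCLC cohort entries, and the names `TOY_PROGRAMS`, `build_cells`, and `clears_gate` are introduced here for illustration only; the gate follows the rule described above (at least three distinct sponsors, with both outcomes observed in the cell).

```python
from collections import defaultdict

# Hypothetical toy records (NOT from the NSCLC cohort): biomarker class,
# ORR bucket, normalized sponsor, and outcome (1 = approved, 0 = failed).
TOY_PROGRAMS = [
    ("genomic_validated", "high", "sponsor_a", 1),
    ("genomic_validated", "high", "sponsor_b", 1),
    ("genomic_validated", "high", "sponsor_c", 0),
    ("protein_only", "mid", "sponsor_a", 1),
    ("protein_only", "mid", "sponsor_a", 0),
    ("unselected", "low", "sponsor_d", 0),
]


def build_cells(programs):
    """Cross-classify programs into (biomarker, ORR bucket) cells."""
    cells = defaultdict(lambda: {"n": 0, "approved": 0, "failed": 0, "sponsors": set()})
    for biomarker, bucket, sponsor, approved in programs:
        cell = cells[(biomarker, bucket)]
        cell["n"] += 1
        cell["approved"] += approved
        cell["failed"] += 1 - approved
        cell["sponsors"].add(sponsor)
    return dict(cells)


def clears_gate(cell, min_sponsors=3):
    """Governance gate: enough distinct sponsors AND both outcomes observed."""
    both_outcomes = cell["approved"] >= 1 and cell["failed"] >= 1
    return both_outcomes and len(cell["sponsors"]) >= min_sponsors


cells = build_cells(TOY_PROGRAMS)
gate_results = {key: clears_gate(cell) for key, cell in cells.items()}
for key, passed in sorted(gate_results.items()):
    print(key, "PASS" if passed else "FAIL")
```

In this toy grid only the genomic-validated, high-ORR cell passes: the protein-only cell has both outcomes but a single sponsor, and the unselected cell has a single outcome.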
Comparison. The joint model's held-out AUC was compared to the production (engine 2.6.0) biomarker-only baseline using a paired DeLong test, which accounts for the covariance between two AUCs measured on the same cases. This is a more favorable test for the joint model than the independent approximation used elsewhere, because ignoring the covariance overstates the variance of the difference and understates power. The joint model still fails it.
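For reference, the paired DeLong statistic can be written as below. This is the standard formulation and is consistent with the appendix listing; the symbols ($\psi$, $V_{10}$, $V_{01}$, $S$) are notation introduced for this sketch, with $m$ approved programs $X_i$, $n$ failed programs $Y_j$, and $k \in \{a, b\}$ indexing the two models being compared.

```latex
\begin{align}
\psi(x, y) &= \begin{cases} 1 & x > y \\ \tfrac{1}{2} & x = y \\ 0 & x < y \end{cases} \\
V_{10}^{k}(X_i) &= \frac{1}{n} \sum_{j=1}^{n} \psi\!\left(X_i^{k}, Y_j^{k}\right), \qquad
V_{01}^{k}(Y_j) = \frac{1}{m} \sum_{i=1}^{m} \psi\!\left(X_i^{k}, Y_j^{k}\right) \\
\widehat{\mathrm{AUC}}_{k} &= \frac{1}{m} \sum_{i=1}^{m} V_{10}^{k}(X_i) \\
S &= \frac{1}{m}\, \widehat{\mathrm{Cov}}\!\left(V_{10}^{a}, V_{10}^{b}\right)
   + \frac{1}{n}\, \widehat{\mathrm{Cov}}\!\left(V_{01}^{a}, V_{01}^{b}\right) \\
\widehat{\mathrm{Var}}(\hat{\Delta}) &= S_{aa} + S_{bb} - 2 S_{ab}, \qquad
z = \frac{\widehat{\mathrm{AUC}}_{a} - \widehat{\mathrm{AUC}}_{b}}{\sqrt{\widehat{\mathrm{Var}}(\hat{\Delta})}}
\end{align}
```

The independent approximation drops the $-2 S_{ab}$ term. When the two models' scores are positively correlated on the same cases, that term is what shrinks the variance of the difference.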
Power. We computed the minimum detectable AUC difference at 80% power (two-sided α = 0.05). As a proxy for the achievable standard error, we took the paired DeLong standard error of the held-out biomarker-only versus structural-baseline comparison and scaled it by approximately 1/√N to project other held-out sample sizes. Random seed 42 was used throughout.
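The projection can be summarized in one expression. This is a sketch of the scaling described above, not an additional analysis; $\mathrm{SE}_{\text{obs}}$, $N_{\text{held}}$, and $N'$ are labels introduced here for the observed standard error, the current held-out size, and the projected held-out size.

```latex
\begin{equation}
\Delta_{\min}(N') = \left(z_{1-\alpha/2} + z_{1-\beta}\right)\,
\mathrm{SE}_{\text{obs}}\, \sqrt{\frac{N_{\text{held}}}{N'}},
\qquad z_{0.975} + z_{0.80} \approx 1.960 + 0.842 = 2.802
\end{equation}
```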
3. Results
3.1 The joint cells cannot be populated
Only 28 of 85 programs carried both a biomarker classification and an extractable early-phase ORR, split 23 approved / 5 failed. Failures reported an extractable early-phase ORR at roughly 12% (5 of 41) versus approvals at roughly 52% (23 of 44). This is intrinsic survivorship bias: failed programs disproportionately never publish an early-phase ORR, so the cells that do populate are approval-dominated and ORR cannot discriminate outcome within them.
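The reporting rates quoted above follow directly from the counts. The short check below reproduces them and also shows how approval-heavy the resulting joint-cell population is; the variable names are introduced here for illustration.

```python
# Counts reported in section 3.1.
failed_total, failed_with_orr = 41, 5
approved_total, approved_with_orr = 44, 23

# Share of each outcome group with an extractable early-phase ORR.
failed_rate = failed_with_orr / failed_total
approved_rate = approved_with_orr / approved_total

# Composition of the programs that can enter a joint cell at all.
joint_population = failed_with_orr + approved_with_orr
approved_share_joint = approved_with_orr / joint_population
approved_share_cohort = approved_total / (approved_total + failed_total)

print(f"failures reporting ORR:  {failed_rate:.0%}")
print(f"approvals reporting ORR: {approved_rate:.0%}")
print(f"approved share, joint population (N={joint_population}): {approved_share_joint:.0%}")
print(f"approved share, full cohort: {approved_share_cohort:.0%}")
```

Approvals make up about 52% of the full cohort but about 82% of the programs that carry both signals, which is the imbalance the paragraph above describes.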
| Grid interpretation | Possible cells | Populated cells | Cells clearing governance gate (≥3 sponsors + both outcomes) | Cells clearing stricter gate (N ≥ 6) |
| --- | --- | --- | --- | --- |
| Pooled (biomarker × ORR) | 9 | 6 | 3 | 1 |
| Per-modality (× 4 modalities) | 36 | 8 | 3 | 1 |
The modality-conditional interpretation leaves 28 of 36 cells empty. Only the genomic-validated × high-ORR cell is robustly populated (N = 15) — but at 14 approved / 1 failed it is near-perfectly separated, which itself defeats stable odds-ratio estimation.
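A small worked example shows why near-perfect separation destabilizes the estimate. It is illustrative only: it uses the full-cohort approval odds (44 approved / 41 failed) as the reference, which is an assumption made here for simplicity, whereas the appendix script fits cell odds ratios against the training-split base rate. The function name `cell_odds_ratio` is hypothetical.

```python
def cell_odds_ratio(approved, failed, base_approved, base_failed):
    """Cell approval odds divided by reference approval odds.

    Returns None when the cell is perfectly separated (no finite odds ratio).
    """
    if approved == 0 or failed == 0:
        return None
    cell_odds = approved / failed
    base_odds = base_approved / base_failed
    return cell_odds / base_odds


# Assumption for illustration: full-cohort counts as the reference odds.
BASE = (44, 41)

observed = cell_odds_ratio(14, 1, *BASE)          # the cell as reported
one_more_failure = cell_odds_ratio(13, 2, *BASE)  # one outcome flipped
no_failures = cell_odds_ratio(15, 0, *BASE)       # the single failure removed

print(f"14/1 -> OR {observed:.2f}")
print(f"13/2 -> OR {one_more_failure:.2f}")
print(f"15/0 -> OR {no_failures}")
```

Flipping a single outcome in a 15-program cell roughly halves the odds ratio, and removing the lone failure leaves no finite estimate at all.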
3.2 No marginal signal beyond biomarker quality
| Model | Held-out AUC | Δ vs biomarker-only | Paired DeLong p |
| --- | --- | --- | --- |
| Structural baseline | 0.618 | n/a | n/a |
| Biomarker-only (engine 2.6.0, shipped) | 0.670 | n/a | n/a |
| Joint table, governance-gated | 0.676 | +0.005 | 0.480 |
| Joint table, no gate (every cell fit) | 0.687 | +0.016 | 0.377 |
Under the governance gate, only one joint cell clears the fit on the training split, so the joint table is essentially the biomarker-only result with one earned cell. It does correctly remove the earlier double-counting drag (recovering the combined signal from 0.615 back to 0.676 ≈ the biomarker-only 0.670) — but that only returns to where the production engine already sits. Even maximally overfit with no governance gate, it reaches +1.6 points, never the +3-point gate and never close to statistical significance. Early-phase ORR carries almost no predictive signal independent of biomarker quality, because the programs with an extractable ORR are the genomic-validated successes that biomarker quality already flags.
3.3 Unpowerable at reachable scale
| Scenario | Held-out N | Minimum detectable Δ |
| --- | --- | --- |
| NSCLC, current | 27 | 0.088 |
| NSCLC, full (N = 85) | ~30 | 0.083 |
| Pooled four-indication | ~100 | 0.046 |
| RA-regime target | ~250 | 0.029 |
Detecting the +3-point gate requires roughly 250 held-out programs ≈ 830 total — an order of magnitude beyond the 85 available, and consistent with our prior power analysis. Even a pooled four-indication cohort (~100 held-out) detects only ~4.6 points. The observed ORR signal (§3.2) is +0.5 points; the power floor (§3.3) is +8.8 points. The two gaps are independent and each is individually disqualifying.
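The projected rows of the table and the total-cohort figure can be reproduced from the current-cohort row alone. The sketch below starts from the rounded values reported above (held-out N = 27, minimum detectable Δ = 0.088) and assumes the 30% held-out fraction from Section 2; the constant and function names are introduced here for illustration.

```python
import math

HELD_NOW = 27         # current held-out N (section 3.3)
MDD_NOW = 0.088       # minimum detectable AUC delta at HELD_NOW (section 3.3)
HELD_FRACTION = 0.30  # held-out share of the cohort (section 2)
GATE = 0.030          # the +3-point validation gate


def projected_mdd(held_n):
    """Scale the current minimum detectable delta by sqrt(N_now / N_projected)."""
    return MDD_NOW * math.sqrt(HELD_NOW / held_n)


projections = {n: round(projected_mdd(n), 3) for n in (30, 100, 250)}
total_needed = 250 / HELD_FRACTION  # total programs implied by 250 held-out

for n, mdd in projections.items():
    verdict = "detects gate" if mdd <= GATE else "cannot detect gate"
    print(f"held-out N={n:>3}: min detectable delta = {mdd:.3f} ({verdict})")
print(f"total programs for 250 held-out: ~{total_needed:.0f}")
```

Only the 250-program scenario brings the detectable difference under the 0.030 gate, and 250 held-out programs at a 30% split implies a cohort of roughly 830.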
4. Discussion
Early-phase ORR scoring fails on two independent grounds: the marginal signal is approximately zero, and validating a +3-point gain would need roughly 10× the available cohort. Neither is fixable with more data extraction. The survivorship bias (failures do not report ORR) is structural, and the required scale is an order of magnitude away.
The result is still substrate-positive: the signal the engine already ships holds up. The biomarker-quality multiplier (genomic_validated 1.35×, protein_only 0.85×, unselected 1.00×, applied as odds multipliers at Phase II/III) is real and stable. It adds +5.2 percentage points of held-out AUC on NSCLC (0.618 to 0.670), and the genomic-validated cohort odds ratio of 5.59 (N = 21, clears the governance gate) exceeds the conservative literature anchor of 1.35. We ship the conservative anchor and disclose the overshoot rather than overfit to our own cohort. We also publish that we tested a competitor-marketed signal, early-phase ORR magnitude, and found that it adds no predictive value beyond biomarker quality, with the ablation and power analysis attached. That is a form of differentiation an opacity-based vendor cannot match, because such a vendor cannot publish negative results on the signals it sells.
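To make the multiplier concrete, the sketch below applies each shipped anchor to a probability on the odds scale, mirroring the `_logit_apply` helper in the appendix. The starting probability of 0.60 is a hypothetical value chosen for illustration, and `apply_odds_multiplier` is a name introduced here.

```python
def apply_odds_multiplier(p, multiplier):
    """Convert p to odds, scale the odds, and convert back to a probability."""
    if p <= 0 or p >= 1:
        return p  # degenerate probabilities are left unchanged
    odds = p / (1 - p)
    adjusted = odds * multiplier
    return adjusted / (1 + adjusted)


# Shipped biomarker-quality anchors, as listed in the appendix.
ANCHORS = {"genomic_validated": 1.35, "protein_only": 0.85, "unselected": 1.00}

BASE_P = 0.60  # hypothetical baseline probability, for illustration only

adjusted = {name: apply_odds_multiplier(BASE_P, m) for name, m in ANCHORS.items()}
for name, p in adjusted.items():
    print(f"{name:<18} {BASE_P:.3f} -> {p:.3f}")
```

Because the adjustment acts on odds rather than on the probability directly, the result always stays strictly between 0 and 1, and a multiplier of 1.00 leaves the baseline unchanged.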
5. Limitations
The biomarker-quality validation is single-indication (NSCLC, N = 85); generalization to other solid tumors is under active multi-indication validation and will be published as it completes. The ORR analysis is bounded by survivorship bias (failed programs under-report early-phase ORR) and by statistical power at this cohort size. This is a transparent methods evaluation, not a peer-reviewed publication.
6. Conclusion
Early-phase ORR magnitude remains a surfaced but non-scored flag in the production engine. Phase 2 is reframed around two goals: multi-indication validation of the already-shipped biomarker-quality multiplier, and publication of this negative result. The validation expands the cohort to breast, melanoma, and colorectal cancer, tests whether the shipped anchors generalize beyond NSCLC, and scopes them per indication where they do not.
References
- Schwaederle M., et al. (2016). Association of biomarker-based treatment strategies with response rates and progression-free survival in refractory malignancies: a meta-analysis.
- Vreman R.A., et al. (2020). Phase 2-to-Phase 3 attrition and winner's-curse correction in oncology development.
- Zhang J., et al. (2022). Investigator-assessed versus blinded independent central review of objective response rate.
Appendix A
Analysis Code (Python)
View-only; results are reproducible from this listing. No download is provided — the analysis script is presented here for full transparency.
"""Gate-0 Phase 2 feasibility analysis (2026-05-28).

Decides, BEFORE any Phase 2 build, whether promoting `phase1_orr` to a
scored signal via a joint (biomarker_quality x ORR-bucket) odds-ratio table
is viable. Produced the data behind the reframe of Phase 2 away from
phase1_orr scoring and toward multi-indication validation of the
already-shipped `biomarker_quality` multiplier. The published methodology
kill of phase1_orr cites this script's output.

Two decisive questions:
  PART 1: Per-cell feasibility. Can the joint grid be populated to the
          governance gate (>=3 distinct sponsors AND >=1 approval AND
          >=1 failure)? Computed on the real NSCLC v4 extractions, under
          the pooled-9-cell and per-modality-36-cell interpretations.
  PART 2: Marginal signal + power. Does a joint table beat the 2.6.0
          biomarker-only baseline on held-out data, and is +3pp even
          detectable at the available N? Uses a PROPER paired DeLong
          covariance test (not the independent approximation in
          backtest_drug_specific_phase1.py, which overstates variance
          and understates power).

Run:
    # Activate your Python virtualenv, then:
    python gate0_phase2_joint_feasibility.py

Inputs:
    the v4 drug-signal extractions dataset (43-drug signals)
    the NSCLC cohort definition (N=85 cohort)
"""
from __future__ import annotations
import json
import math
import os
import sys
from collections import defaultdict
from pathlib import Path
import numpy as np
# Put the directory holding the `cohorts` package on the import path, then
# import the NSCLC cohort definition. Adjust COHORTS_ROOT to wherever you keep it.
COHORTS_ROOT = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, COHORTS_ROOT)
from cohorts import nsclc # noqa: E402
# Point DATA_DIR at wherever you keep the extraction/cohort data. The drug-signal
# extractions are a newline-delimited JSON file (one record per line).
DATA_DIR = Path(__file__).resolve().parent
EXTRACTIONS_FILE = "phase0-merged-v4-extractions.ndjson"
V4 = DATA_DIR / EXTRACTIONS_FILE
# Global ORR buckets (production backtest thresholds). Modality-conditional
# thresholds (per the original Phase 2 spec) would only SPLIT these further,
# so global buckets are the OPTIMISTIC case for cell population.
ORR_BUCKETS = [("low", 0, 25), ("mid", 25, 40), ("high", 40, 101)]
# 2.6.0 biomarker-only literature anchors (the Phase 2 baseline).
BIOMARKER_ANCHOR = {
    "genomic_validated": 1.35,
    "protein_only": 0.85,
    "unselected": 1.00,
    "unknown": 1.00,
}
SEED = 42


def orr_bucket(v):
    if v is None:
        return None
    for label, lo, hi in ORR_BUCKETS:
        if lo <= v < hi:
            return label
    return None


def sponsor_norm(s: str) -> str:
    """Normalize sponsor for distinct-count (strips parentheticals / co-marketing).

    Over-counting distinct sponsors is OPTIMISTIC for the gate."""
    return (s or "Unknown").split("(")[0].split("/")[0].strip().lower()


# ----------------------------- Load + join -----------------------------
def load():
    """Join the v4 signal extractions onto the NSCLC cohort, one record per drug."""
    # Index extraction rows by lower-cased drug name, skipping blanks,
    # comment lines, and rows flagged with an extraction error.
    signals = {}
    for line in V4.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = json.loads(line)
        if "_error" in row:
            continue
        nm = row.get("drug_name", "").lower()
        if nm:
            signals[nm] = row
    recs = []
    for d in nsclc.COHORT:
        sig = signals.get(d["drug"].lower())
        bm = orr_val = orr_mod = orr_b = None
        if sig:
            bm = (sig.get("biomarker_quality") or {}).get("value")
            orr = sig.get("phase1_orr") or {}
            orr_val = orr.get("orr_percent")
            orr_mod = orr.get("orr_modality_bucket")
            # An ORR bucket is assigned only when both the value and its
            # modality bucket were extracted.
            if orr_val is not None and orr_mod:
                orr_b = orr_bucket(orr_val)
        recs.append({
            "drug": d["drug"], "sponsor": sponsor_norm(d.get("sponsor", "Unknown")),
            "approved": 1 if d.get("outcome") == "approved" else 0,
            "modality": d.get("modality"), "has_sig": bool(sig), "biomarker": bm,
            "orr_val": orr_val, "orr_mod": orr_mod, "orr_bucket": orr_b, "rec": d,
        })
    return recs


# ----------------------------- PART 1: cell feasibility -----------------------------
def cell_table(rows, keyfn):
    """Tally N, approvals, failures, and distinct sponsors per joint cell."""
    cells = defaultdict(lambda: {"n": 0, "app": 0, "fail": 0, "sponsors": set()})
    for r in rows:
        k = keyfn(r)
        if k is None:
            continue
        c = cells[k]
        c["n"] += 1
        c["app"] += r["approved"]
        c["fail"] += 1 - r["approved"]
        c["sponsors"].add(r["sponsor"])
    return cells


def key_pooled(r):
    if r["biomarker"] and r["orr_bucket"]:
        return f"{r['biomarker']:>17} | {r['orr_bucket']:>4}"
    return None


def key_permod(r):
    if r["biomarker"] and r["orr_bucket"] and r["orr_mod"]:
        return f"{r['biomarker']:>17} | {r['orr_bucket']:>4} | {r['orr_mod']}"
    return None


def report_cells(title, cells, n_grid):
    """Print the per-cell table and return how many cells clear each gate."""
    print(f"\n{'='*72}\nPART 1: {title}\n{'='*72}")
    print(f"{'cell':<48} {'N':>2} {'app':>3} {'fail':>4} {'spons':>5} gate")
    spec_pass = nge6_pass = 0
    for k in sorted(cells):
        c = cells[k]
        nsp = len(c["sponsors"])
        both = c["app"] >= 1 and c["fail"] >= 1
        spec_gate = both and nsp >= 3
        nge6_gate = spec_gate and c["n"] >= 6
        spec_pass += spec_gate
        nge6_pass += nge6_gate
        tag = "PASS(N>=6)" if nge6_gate else ("PASS" if spec_gate else ("both-only" if both else "1-sided"))
        print(f"{k:<48} {c['n']:>2} {c['app']:>3} {c['fail']:>4} {nsp:>5} {tag}")
    print(f"\nGrid: {n_grid} possible | populated {len(cells)} | empty {n_grid-len(cells)}")
    print(f"Clearing SPEC gate (>=3 sponsors + both outcomes): {spec_pass}")
    print(f"Clearing STRICTER gate (+ N>=6): {nge6_pass}")
    return spec_pass, nge6_pass


# ----------------------------- Scoring helpers -----------------------------
def _logit_apply(p, m):
    """Apply odds multiplier m to probability p (no-op at the boundaries or m == 1)."""
    if p <= 0 or p >= 1 or m == 1.0:
        return p
    o = p / (1 - p)
    return (o * m) / (1 + o * m)


def baseline_score(d):
    """Portable engine-2.5.0 proxy baseline (matches production backtest)."""
    base = 1.0
    mod = d.get("modality", "small_molecule")
    p3 = {"small_molecule": 0.62, "monoclonal_antibody": 0.65, "fusion_protein": 0.62,
          "bispecific": 0.55, "adc": 0.60, "peptide": 0.45, "cell_therapy": 0.50}
    for stage in d.get("remaining_stages", []):
        nm = stage.get("name", "")
        if "Phase 3" in nm:
            base *= p3.get(mod, 0.60)
        elif "NDA" in nm or "BLA" in nm:
            base *= 0.86
    if d.get("first_in_class"):
        base = _logit_apply(base, 0.90)
    if d.get("orphan_designation"):
        base = _logit_apply(base, 1.15)
    if d.get("biomarker_strategy") == "companion_dx":
        base = _logit_apply(base, 1.20)
    if d.get("genetic_validation"):
        base = _logit_apply(base, 1.10)
    return base


def stratified_split(rows, seed):
    """Seeded 70/30 train/held-out split, stratified by outcome."""
    import random
    rng = random.Random(seed)
    app = [r for r in rows if r["approved"] == 1]
    fail = [r for r in rows if r["approved"] == 0]
    rng.shuffle(app)
    rng.shuffle(fail)
    na, nf = int(0.70 * len(app)), int(0.70 * len(fail))
    return app[:na] + fail[:nf], app[na:] + fail[nf:]


# ----------------------------- Proper paired DeLong -----------------------------
def _placement(pos, neg):
    """Return the AUC and the DeLong placement values (V10 per positive, V01 per negative)."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    m, n = len(pos), len(neg)
    v10 = np.array([(np.sum(pos[i] > neg) + 0.5 * np.sum(pos[i] == neg)) / n for i in range(m)])
    v01 = np.array([(np.sum(pos > neg[j]) + 0.5 * np.sum(pos == neg[j])) / m for j in range(n)])
    return v10.mean(), v10, v01


def delong_paired(scores_a, scores_b, labels):
    """Paired DeLong test for two score vectors evaluated on the same cases."""
    labels = np.asarray(labels)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    pm = labels == 1
    m, n = int(pm.sum()), int((~pm).sum())
    auc_a, v10a, v01a = _placement(a[pm], a[~pm])
    auc_b, v10b, v01b = _placement(b[pm], b[~pm])
    # 2x2 covariance of the two AUC estimates; the off-diagonal term is what
    # the independent approximation ignores.
    S = np.cov(np.vstack([v10a, v10b])) / m + np.cov(np.vstack([v01a, v01b])) / n
    var_diff = S[0, 0] + S[1, 1] - 2 * S[0, 1]
    se = math.sqrt(max(var_diff, 1e-12))
    diff = auc_a - auc_b
    z = diff / se if se > 0 else 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return auc_a, auc_b, diff, se, z, p


def auc_only(scores, labels):
    labels = np.asarray(labels)
    a, _, _ = _placement(np.asarray(scores)[labels == 1], np.asarray(scores)[labels == 0])
    return a


def fit_joint(train_rows, keyfn, gate=True):
    """Fit per-cell odds ratios against the training base odds.

    With gate=True, only cells clearing the governance gate are fit."""
    base_rate = sum(r["approved"] for r in train_rows) / len(train_rows)
    base_odds = base_rate / (1 - base_rate)
    cells = cell_table([r for r in train_rows if r["biomarker"] and r["orr_bucket"]], keyfn)
    ors, cleared = {}, 0
    for k, c in cells.items():
        both = c["app"] >= 1 and c["fail"] >= 1
        if gate and not (both and len(c["sponsors"]) >= 3):
            continue
        rate = c["app"] / c["n"]
        if rate <= 0 or rate >= 1:  # perfect separation -> can't compute OR
            continue
        ors[k] = (rate / (1 - rate)) / base_odds
        cleared += 1
    return ors, cleared


def score_model(rows, mode, joint_ors=None, keyfn=None):
    """Score rows under `mode`; any mode other than the two below returns the structural baseline."""
    out = []
    for r in rows:
        p = baseline_score(r["rec"])
        if mode == "biomarker_only" and r["biomarker"]:
            p = _logit_apply(p, BIOMARKER_ANCHOR.get(r["biomarker"], 1.0))
        elif mode == "joint":
            k = keyfn(r) if (r["biomarker"] and r["orr_bucket"]) else None
            if k is not None and k in joint_ors:
                p = _logit_apply(p, joint_ors[k])
            elif r["biomarker"]:
                # FLOOR: fall back to the biomarker-only anchor when no joint cell was fit.
                p = _logit_apply(p, BIOMARKER_ANCHOR.get(r["biomarker"], 1.0))
        out.append(p)
    return out


def main():
    recs = load()
    n_total = len(recs)
    n_app = sum(r["approved"] for r in recs)
    covered = [r for r in recs if r["has_sig"]]
    with_bm = [r for r in recs if r["biomarker"]]
    with_both = [r for r in recs if r["biomarker"] and r["orr_bucket"]]
    print("=" * 72 + "\nPART 0: Coverage\n" + "=" * 72)
    print(f"Cohort N={n_total} ({n_app} approved / {n_total - n_app} failed)")
    print(f"Drugs with ANY signal: {len(covered)} ({len(covered)/n_total:.0%})")
    print(f"Drugs with biomarker: {len(with_bm)}")
    print(f"Drugs with biomarker+ORR:{len(with_both)} (joint-cell population)")
    print(f" approved/failed: {sum(r['approved'] for r in with_both)}/"
          f"{sum(1-r['approved'] for r in with_both)} <-- survivorship: failures underreport ORR")
    report_cells("Pooled 9-cell joint grid (biomarker x ORR)", cell_table(with_both, key_pooled), 9)
    report_cells("Per-modality 36-cell joint grid", cell_table(with_both, key_permod), 36)

    print("\n" + "=" * 72 + "\nPART 2: Joint table vs biomarker-only (held-out 30%)\n" + "=" * 72)
    train, held = stratified_split(recs, SEED)
    held_labels = [r["approved"] for r in held]
    for label, keyfn, gate in [
        ("JOINT pooled-9, governance-gated", key_pooled, True),
        ("JOINT per-mod-36, governance-gated", key_permod, True),
        ("JOINT pooled-9, NO gate (overfit every cell)", key_pooled, False),
    ]:
        joint_ors, cleared = fit_joint(train, keyfn, gate=gate)
        aj, ab, diff, se, z, p = delong_paired(
            score_model(held, "joint", joint_ors, keyfn),
            score_model(held, "biomarker_only"), held_labels)
        ab2 = auc_only(score_model(held, "baseline"), held_labels)
        print(f"\n{label}\n train cells cleared: {cleared} | held N={len(held)} "
              f"({sum(held_labels)}/{len(held)-sum(held_labels)}) | structural baseline AUC={ab2:.3f}")
        print(f" biomarker-only={ab:.3f} joint={aj:.3f} diff={diff:+.3f} "
              f"(paired DeLong SE={se:.3f}, z={z:+.2f}, p={p:.3f})")
        print(f" GATE (+0.030 @ p<0.05): {'PASS' if (diff >= 0.03 and p < 0.05) else 'FAIL'}")

    print("\n" + "=" * 72 + "\nPART 2b: Minimum detectable AUC delta (paired DeLong)\n" + "=" * 72)
    s_base = score_model(held, "baseline")
    s_bm = score_model(held, "biomarker_only")
    _, _, _, se_layer, _, _ = delong_paired(s_bm, s_base, held_labels)
    Z = 1.959963985 + 0.8416212336  # 80% power, two-sided alpha 0.05
    nh = len(held)
    for n_proj, tag in [(nh, f"NSCLC now (held={nh})"), (30, "NSCLC full N=85 (~30 held)"),
                        (100, "pooled 4-indication (~100 held)"), (250, "RA-regime (~250 held)")]:
        mdd = Z * se_layer * math.sqrt(nh / n_proj)
        print(f" {tag:<38} min detectable AUC delta @80% power = {mdd:.3f}")
    print("\n Target to clear gate = 0.030. Research §8.3: +3pp at ~0.67 baseline needs")
    print(" N~100-120/cohort; 0.625 baseline ~250 'out of reach'. Joint-vs-biomarker")
    print(" isolates a SMALLER delta -> needs MORE N. ~250 held = ~830 drugs.")


if __name__ == "__main__":
    main()