PhaseFolio — Drug-Specific Predictive Signals in Oncology: What Held and What Didn't — Full Report

1. Background

PhaseFolio's probability-of-success engine scores a drug-specific multiplier for biomarker quality — whether a program enrolls patients by a genomic-grade molecular alteration (genomic_validated), a protein-only marker (protein_only), or no molecular selection (unselected). A second candidate signal, early-phase ORR magnitude, was carried as a non-scored informational flag because, in earlier work, combining it with biomarker quality reduced held-out accuracy (the combined-signal AUC of 0.615 fell below the biomarker-only 0.670 at 50% cohort coverage) — a classic double-counting effect, since the two signals are correlated.

A proposed framework sought to recover ORR's contribution through a joint biomarker × ORR odds-ratio table with modality-conditional thresholds. An adversarial design review raised three blocking concerns: (1) the proposed +3-percentage-point validation gate was underpowered at the available cohort size; (2) the joint grid would collapse to mostly-empty cells under a governance rule requiring at least three distinct sponsors per cell; and (3) the claim that a joint table could "never regress" against the baseline is not logically guaranteed. We therefore ran a pre-specified feasibility gate — sharpened to per-cell counts and an explicit power calculation — before committing to any framework.

2. Methods

Cohort. 85 NSCLC programs that reached a Phase II/III decision, comprising both approvals and failures, each classified by biomarker quality and (where available) early-phase ORR.

Joint-cell construction. Programs were cross-classified into a biomarker-quality × ORR-bucket grid, evaluated under two interpretations: a pooled grid (9 possible cells) and a modality-conditional grid (36 possible cells across four modalities). Cell odds ratios were fit on a 70% training split and applied to the 30% held-out split. Global ORR thresholds were used as the optimistic case for cell population, because modality-conditional thresholds can only subdivide cells further.

Comparison. The joint model's held-out AUC was compared to the production (engine 2.6.0) biomarker-only baseline using a proper paired DeLong test, which accounts for the covariance between two AUCs measured on the same cases. This test is more favorable to the joint model than the independent approximation used elsewhere, which ignores that covariance and so overstates the variance of the difference. The joint model still fails it.
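The effect of pairing can be illustrated with a minimal sketch. The scores below are invented toy values, not cohort data, and the helper names (placements, sample_cov, auc_diff_variances) are hypothetical. The sketch computes the DeLong variance of an AUC difference twice, once with the covariance term and once without, for two models scored on the same eight cases.

```python
from statistics import mean


def placements(pos, neg):
    """Per-positive and per-negative placement values; ties count 0.5."""
    def win(p, q):
        return 1.0 if p > q else 0.5 if p == q else 0.0
    v10 = [mean(win(p, q) for q in neg) for p in pos]
    v01 = [mean(win(p, q) for p in pos) for q in neg]
    return v10, v01


def sample_cov(x, y):
    """Unbiased sample covariance of two equal-length lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def auc_diff_variances(a_pos, a_neg, b_pos, b_neg):
    """Return (paired, independent) variance of AUC_a - AUC_b."""
    m, n = len(a_pos), len(a_neg)
    v10a, v01a = placements(a_pos, a_neg)
    v10b, v01b = placements(b_pos, b_neg)
    var_a = sample_cov(v10a, v10a) / m + sample_cov(v01a, v01a) / n
    var_b = sample_cov(v10b, v10b) / m + sample_cov(v01b, v01b) / n
    cov_ab = sample_cov(v10a, v10b) / m + sample_cov(v01a, v01b) / n
    return var_a + var_b - 2 * cov_ab, var_a + var_b


# Toy scores (assumed for illustration): model B differs from model A on one case.
a_pos, a_neg = [0.9, 0.8, 0.6, 0.4], [0.7, 0.5, 0.3, 0.2]
b_pos, b_neg = [0.9, 0.8, 0.6, 0.55], [0.7, 0.5, 0.3, 0.2]

paired_var, indep_var = auc_diff_variances(a_pos, a_neg, b_pos, b_neg)
print(f"paired SE      = {paired_var ** 0.5:.3f}")
print(f"independent SE = {indep_var ** 0.5:.3f}")
```

Because the two models agree on most cases, their placement values are positively correlated and the paired standard error comes out well below the independent one. That is why the paired test gives a small difference its best chance of reaching significance.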

Power. We computed the minimum detectable AUC difference at 80% power (two-sided α = 0.05) as (z₀.₉₇₅ + z₀.₈₀) × SE ≈ 2.80 × SE. As an achievable-signal proxy, SE was taken from the observed held-out paired DeLong comparison of biomarker-only against the structural baseline, then scaled by approximately 1/√N to project larger held-out sets. Random seed 42 throughout.

3. Results

3.1 The joint cells cannot be populated

Only 28 of 85 programs carried both a biomarker classification and an extractable early-phase ORR, split 23 approved / 5 failed. Only about 12% of failures (5 of 41) reported an extractable early-phase ORR, versus about 52% of approvals (23 of 44). This is intrinsic survivorship bias: failed programs disproportionately never publish an early-phase ORR, so the cells that do populate are approval-dominated and ORR cannot discriminate outcome within them.
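The reporting-rate arithmetic above can be checked with a short sketch. It uses only the counts quoted in this section; the variable names are illustrative.

```python
# Counts quoted in Section 3.1 of this report.
approved_total, failed_total = 44, 41
approved_with_orr, failed_with_orr = 23, 5

rate_approved = approved_with_orr / approved_total   # share of approvals reporting ORR
rate_failed = failed_with_orr / failed_total         # share of failures reporting ORR
approval_share = approved_with_orr / (approved_with_orr + failed_with_orr)

print(f"approvals reporting ORR: {rate_approved:.0%}")
print(f"failures reporting ORR:  {rate_failed:.0%}")
print(f"approvals among programs with biomarker + ORR: {approval_share:.0%}")
```

The last figure is the approval share among the 28 jointly classified programs, compared with 44 of 85 in the full cohort.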

Grid interpretation	Possible cells	Populated cells	Cells clearing governance gate (≥3 sponsors + both outcomes)	Cells clearing stricter gate (governance gate + N ≥ 6)
Pooled (biomarker × ORR)	9	6	3	1
Per-modality (× 4 modalities)	36	8	3	1

The modality-conditional interpretation leaves 28 of 36 cells empty. Only the genomic-validated × high-ORR cell is robustly populated (N = 15) — but at 14 approved / 1 failed it is near-perfectly separated, which itself defeats stable odds-ratio estimation.
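A minimal sketch of why a 14 / 1 cell gives an unstable estimate, assuming a standard Wald interval on the log-odds (standard error sqrt(1/approved + 1/failed)). The function name is hypothetical, and this is not the production fitting code, which forms odds ratios against the training base rate.

```python
import math


def odds_with_wald_ci(approved, failed, z=1.959963985):
    """Cell odds of approval with an approximate 95% Wald interval."""
    log_odds = math.log(approved / failed)
    se = math.sqrt(1 / approved + 1 / failed)
    return (math.exp(log_odds),
            math.exp(log_odds - z * se),
            math.exp(log_odds + z * se))


odds, lo, hi = odds_with_wald_ci(14, 1)
print(f"14 approved / 1 failed: odds = {odds:.1f}, 95% CI {lo:.1f} to {hi:.0f}")

# Hypothetical sensitivity check: reclassify a single program as a failure.
odds_flip, _, _ = odds_with_wald_ci(13, 2)
print(f"13 approved / 2 failed: odds = {odds_flip:.1f}")
```

With a single failure in the cell, the interval spans well over an order of magnitude, and one reclassified program roughly halves the point estimate.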

3.2 No marginal signal beyond biomarker quality

Model	Held-out AUC	Δ vs biomarker-only	Paired DeLong p
Structural baseline	0.618	—	—
Biomarker-only (engine 2.6.0, shipped)	0.670	—	—
Joint table, governance-gated	0.676	+0.005	0.480
Joint table, no gate (every cell fit)	0.687	+0.016	0.377

Under the governance gate, only one joint cell clears the fit on the training split, so the joint table is essentially the biomarker-only result with one earned cell. It does correctly remove the earlier double-counting drag (recovering the combined signal from 0.615 to 0.676, approximately the biomarker-only 0.670), but that only returns it to where the production engine already sits. Even maximally overfit with no governance gate, the joint table gains only +1.6 points (p = 0.377), short of the +3-point gate and nowhere near statistical significance. Early-phase ORR carries almost no predictive signal independent of biomarker quality, because the programs with an extractable ORR are predominantly the genomic-validated successes that biomarker quality already flags.

3.3 Unpowerable at reachable scale

Scenario	Held-out N	Minimum detectable Δ
NSCLC, current	27	0.088
NSCLC, full (N = 85)	~30	0.083
Pooled four-indication	~100	0.046
RA-regime target	~250	0.029

Detecting the +3-point gate requires roughly 250 held-out programs, or about 830 in total at a 30% held-out fraction. That is an order of magnitude beyond the 85 available, and consistent with our prior power analysis. Even a pooled four-indication cohort (~100 held-out) detects only ~4.6 points. The observed ORR signal (§3.2) is +0.5 points; the power floor at the current held-out N (§3.3) is +8.8 points. The two gaps are independent and each is individually disqualifying.
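The projections in the table follow from the 1/√N scaling described in Methods. A small sketch, taking the reported 0.088 at 27 held-out programs as the starting point (the function and variable names are illustrative):

```python
import math

MDD_NOW = 0.088      # minimum detectable AUC difference reported at the current held-out N
HELD_NOW = 27        # current held-out N
HELD_FRACTION = 0.30 # held-out share of the cohort


def projected_mdd(held_n):
    """Scale the current minimum detectable difference by sqrt(N_now / N_projected)."""
    return MDD_NOW * math.sqrt(HELD_NOW / held_n)


for held_n in (30, 100, 250):
    print(f"held-out N = {held_n:>3}: minimum detectable delta = {projected_mdd(held_n):.3f}")

# Held-out N at which the minimum detectable delta shrinks to the 0.030 gate,
# and the total cohort implied by 250 held-out programs.
held_for_gate = HELD_NOW * (MDD_NOW / 0.030) ** 2
total_for_250 = 250 / HELD_FRACTION
print(f"held-out N for a 0.030 gate: about {held_for_gate:.0f}")
print(f"total programs for 250 held-out: about {total_for_250:.0f}")
```

Under this scaling the gate becomes detectable at a little over 230 held-out programs, the same order as the ~250 row in the table.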

4. Discussion

Early-phase ORR scoring fails on two independent grounds: the marginal signal is approximately zero, and the validation is roughly 10× out of power. Neither is fixable with more data extraction — the survivorship bias (failures do not report ORR) is structural, and the required scale is an order of magnitude away.

The result is substrate-positive: the signal the engine already ships holds up. Biomarker quality (genomic_validated 1.35× / protein_only 0.85× / unselected 1.00×, applied as odds multipliers at Phase II/III) is real and stable: +5.2 percentage points of held-out AUC over the structural baseline (0.670 vs 0.618), validated on NSCLC, with a genomic-validated cohort odds ratio of 5.59 against a conservative literature anchor of 1.35 (N = 21, clears the governance gate). We ship the conservative anchor and disclose the overshoot rather than overfit to our own cohort.

Publishing that we tested a competitor-marketed signal (early-phase ORR magnitude) and found it adds no predictive value beyond biomarker quality, with the ablation and power analysis attached, is a form of differentiation that an opacity-based vendor cannot match: they cannot publish negative results on the signals they sell.
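For readers unfamiliar with odds-scale multipliers, the following sketch shows how the three shipped anchors move a probability. The 0.62 starting value is illustrative only (it matches the small-molecule Phase 3 prior in the analysis listing), and the function name is hypothetical.

```python
def apply_odds_multiplier(p, multiplier):
    """Convert p to odds, scale the odds, convert back to a probability."""
    odds = p / (1 - p)
    scaled = odds * multiplier
    return scaled / (1 + scaled)


base_p = 0.62  # illustrative starting probability, not a cohort result
for label, mult in [("genomic_validated", 1.35),
                    ("protein_only", 0.85),
                    ("unselected", 1.00)]:
    print(f"{label:<18} x{mult:.2f}: {base_p:.3f} -> {apply_odds_multiplier(base_p, mult):.3f}")
```

Working on the odds scale keeps the adjusted value strictly between 0 and 1, which a direct multiplication of the probability would not guarantee.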

5. Limitations

The biomarker-quality validation is single-indication (NSCLC, N = 85); generalization to other solid tumors is under active multi-indication validation and will be published as it completes. The ORR analysis is bounded by survivorship bias (failed programs under-report early-phase ORR) and by statistical power at this cohort size. This is a transparent methods evaluation, not a peer-reviewed publication.

6. Conclusion

Early-phase ORR magnitude remains a surfaced, non-scored flag in the production engine. Phase 2 is reframed in two directions. The first is multi-indication validation of the already-shipped biomarker-quality multiplier: expanding the cohort to breast, melanoma, and colorectal cancer, testing whether the shipped anchors generalize beyond NSCLC, and scoping per-indication where they do not. The second is publication of this negative result.

Analysis Code (Python)

View-only; results are reproducible from this listing. No download is provided — the analysis script is presented here for full transparency.

"""Gate-0 Phase 2 feasibility analysis (2026-05-28).

Decides — BEFORE any Phase 2 build — whether promoting `phase1_orr` to a
scored signal via a joint (biomarker_quality x ORR-bucket) odds-ratio table
is viable. Produced the data behind the reframe of Phase 2 away from
phase1_orr scoring and toward multi-indication validation of the already-
shipped `biomarker_quality` multiplier. The published methodology kill of
phase1_orr cites this script's output.

Two decisive questions:

  PART 1 — Per-cell feasibility: can the joint grid be populated to the
           governance gate (>=3 distinct sponsors AND >=1 approval AND
           >=1 failure)? Computed on the real NSCLC v4 extractions, under
           the pooled-9-cell and per-modality-36-cell interpretations.

  PART 2 — Marginal signal + power: does a joint table beat the 2.6.0
           biomarker-only baseline on held-out data, and is +3pp even
           detectable at the available N? Uses a PROPER paired DeLong
           covariance test (not the independent approximation in
           backtest_drug_specific_phase1.py, which overstates variance
           and understates power).

Run:
    # Activate your Python virtualenv, then:
    python gate0_phase2_joint_feasibility.py

Inputs:
    the v4 drug-signal extractions dataset  (43-drug signals)
    the NSCLC cohort definition             (N=85 cohort)
"""
from __future__ import annotations

import json
import math
import os
import sys
from collections import defaultdict
from pathlib import Path

import numpy as np

# Put the directory holding the `cohorts` package on the import path, then
# import the NSCLC cohort definition. Adjust COHORTS_ROOT to wherever you keep it.
COHORTS_ROOT = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, COHORTS_ROOT)
from cohorts import nsclc  # noqa: E402

# Point DATA_DIR at wherever you keep the extraction/cohort data. The drug-signal
# extractions are a newline-delimited JSON file (one record per line).
DATA_DIR = Path(__file__).resolve().parent
EXTRACTIONS_FILE = "phase0-merged-v4-extractions.ndjson"
V4 = DATA_DIR / EXTRACTIONS_FILE

# Global ORR buckets (production backtest thresholds). Modality-conditional
# thresholds (per the original Phase 2 spec) would only SPLIT these further,
# so global buckets are the OPTIMISTIC case for cell population.
ORR_BUCKETS = [("low", 0, 25), ("mid", 25, 40), ("high", 40, 101)]

# 2.6.0 biomarker-only literature anchors (the Phase 2 baseline).
BIOMARKER_ANCHOR = {
    "genomic_validated": 1.35,
    "protein_only": 0.85,
    "unselected": 1.00,
    "unknown": 1.00,
}
SEED = 42


def orr_bucket(v):
    if v is None:
        return None
    for label, lo, hi in ORR_BUCKETS:
        if lo <= v < hi:
            return label
    return None


def sponsor_norm(s: str) -> str:
    """Normalize sponsor for distinct-count (strips parentheticals / co-marketing).
    Over-counting distinct sponsors is OPTIMISTIC for the gate."""
    return (s or "Unknown").split("(")[0].split("/")[0].strip().lower()


# ----------------------------- Load + join -----------------------------

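def load():
    """Join the v4 signal extractions onto the NSCLC cohort, one record per cohort drug."""
    signals = {}
    for line in V4.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blank and comment lines
            continue
        row = json.loads(line)
        if "_error" in row:                    # rows flagged with _error are skipped
            continue
        nm = (row.get("drug_name") or "").lower()   # tolerate a missing or null drug_name
        if nm:
            signals[nm] = row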

    recs = []
    for d in nsclc.COHORT:
        sig = signals.get(d["drug"].lower())
        bm = orr_val = orr_mod = orr_b = None
        if sig:
            bm = (sig.get("biomarker_quality") or {}).get("value")
            orr = sig.get("phase1_orr") or {}
            orr_val = orr.get("orr_percent")
            orr_mod = orr.get("orr_modality_bucket")
            if orr_val is not None and orr_mod:
                orr_b = orr_bucket(orr_val)
        recs.append({
            "drug": d["drug"], "sponsor": sponsor_norm(d.get("sponsor", "Unknown")),
            "approved": 1 if d.get("outcome") == "approved" else 0,
            "modality": d.get("modality"), "has_sig": bool(sig), "biomarker": bm,
            "orr_val": orr_val, "orr_mod": orr_mod, "orr_bucket": orr_b, "rec": d,
        })
    return recs


# ----------------------------- PART 1 — cell feasibility -----------------------------

def cell_table(rows, keyfn):
    cells = defaultdict(lambda: {"n": 0, "app": 0, "fail": 0, "sponsors": set()})
    for r in rows:
        k = keyfn(r)
        if k is None:
            continue
        c = cells[k]
        c["n"] += 1
        c["app"] += r["approved"]
        c["fail"] += 1 - r["approved"]
        c["sponsors"].add(r["sponsor"])
    return cells


def key_pooled(r):
    if r["biomarker"] and r["orr_bucket"]:
        return f"{r['biomarker']:>17} | {r['orr_bucket']:>4}"
    return None


def key_permod(r):
    if r["biomarker"] and r["orr_bucket"] and r["orr_mod"]:
        return f"{r['biomarker']:>17} | {r['orr_bucket']:>4} | {r['orr_mod']}"
    return None


def report_cells(title, cells, n_grid):
    print(f"\n{'='*72}\nPART 1 — {title}\n{'='*72}")
    print(f"{'cell':<48} {'N':>2} {'app':>3} {'fail':>4} {'spons':>5}  gate")
    spec_pass = nge6_pass = 0
    for k in sorted(cells):
        c = cells[k]
        nsp = len(c["sponsors"])
        both = c["app"] >= 1 and c["fail"] >= 1
        spec_gate = both and nsp >= 3
        nge6_gate = spec_gate and c["n"] >= 6
        spec_pass += spec_gate
        nge6_pass += nge6_gate
        tag = "PASS(N>=6)" if nge6_gate else ("PASS" if spec_gate else ("both-only" if both else "1-sided"))
        print(f"{k:<48} {c['n']:>2} {c['app']:>3} {c['fail']:>4} {nsp:>5}  {tag}")
    print(f"\nGrid: {n_grid} possible | populated {len(cells)} | empty {n_grid-len(cells)}")
    print(f"Clearing SPEC gate (>=3 sponsors + both outcomes): {spec_pass}")
    print(f"Clearing STRICTER gate (+ N>=6):                   {nge6_pass}")
    return spec_pass, nge6_pass


# ----------------------------- Scoring helpers -----------------------------

def _logit_apply(p, m):
    if p <= 0 or p >= 1 or m == 1.0:
        return p
    o = p / (1 - p)
    return (o * m) / (1 + o * m)


def baseline_score(d):
    """Portable engine-2.5.0 proxy baseline (matches production backtest)."""
    base = 1.0
    mod = d.get("modality", "small_molecule")
    p3 = {"small_molecule": 0.62, "monoclonal_antibody": 0.65, "fusion_protein": 0.62,
          "bispecific": 0.55, "adc": 0.60, "peptide": 0.45, "cell_therapy": 0.50}
    for stage in d.get("remaining_stages", []):
        nm = stage.get("name", "")
        if "Phase 3" in nm:
            base *= p3.get(mod, 0.60)
        elif "NDA" in nm or "BLA" in nm:
            base *= 0.86
    if d.get("first_in_class"):
        base = _logit_apply(base, 0.90)
    if d.get("orphan_designation"):
        base = _logit_apply(base, 1.15)
    if d.get("biomarker_strategy") == "companion_dx":
        base = _logit_apply(base, 1.20)
    if d.get("genetic_validation"):
        base = _logit_apply(base, 1.10)
    return base


def stratified_split(rows, seed):
    import random
    rng = random.Random(seed)
    app = [r for r in rows if r["approved"] == 1]
    fail = [r for r in rows if r["approved"] == 0]
    rng.shuffle(app)
    rng.shuffle(fail)
    na, nf = int(0.70 * len(app)), int(0.70 * len(fail))
    return app[:na] + fail[:nf], app[na:] + fail[nf:]


# ----------------------------- Proper paired DeLong -----------------------------

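def _placement(pos, neg):
    """Return (AUC, V10, V01): the AUC plus per-positive and per-negative
    placement values, counting ties as 0.5."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    m, n = len(pos), len(neg)
    v10 = np.array([(np.sum(pos[i] > neg) + 0.5 * np.sum(pos[i] == neg)) / n for i in range(m)])
    v01 = np.array([(np.sum(pos > neg[j]) + 0.5 * np.sum(pos == neg[j])) / m for j in range(n)])
    return v10.mean(), v10, v01


def delong_paired(scores_a, scores_b, labels):
    """Paired DeLong test for two score vectors on the SAME cases.

    Returns (auc_a, auc_b, diff, se, z, p) where diff = auc_a - auc_b and
    p is the two-sided normal p-value. Needs at least two positives and two
    negatives, otherwise the sample covariances are undefined.
    """
    labels = np.asarray(labels)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    pm = labels == 1
    m, n = int(pm.sum()), int((~pm).sum())
    auc_a, v10a, v01a = _placement(a[pm], a[~pm])
    auc_b, v10b, v01b = _placement(b[pm], b[~pm])
    # 2x2 covariance matrix of (AUC_a, AUC_b); the off-diagonal term is what
    # the independent approximation drops.
    S = np.cov(np.vstack([v10a, v10b])) / m + np.cov(np.vstack([v01a, v01b])) / n
    var_diff = S[0, 0] + S[1, 1] - 2 * S[0, 1]
    se = math.sqrt(max(var_diff, 1e-12))   # floor avoids division by zero for identical scores
    diff = auc_a - auc_b
    z = diff / se if se > 0 else 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return auc_a, auc_b, diff, se, z, p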


def auc_only(scores, labels):
    labels = np.asarray(labels)
    a, _, _ = _placement(np.asarray(scores)[labels == 1], np.asarray(scores)[labels == 0])
    return a


def fit_joint(train_rows, keyfn, gate=True):
    base_rate = sum(r["approved"] for r in train_rows) / len(train_rows)
    base_odds = base_rate / (1 - base_rate)
    cells = cell_table([r for r in train_rows if r["biomarker"] and r["orr_bucket"]], keyfn)
    ors, cleared = {}, 0
    for k, c in cells.items():
        both = c["app"] >= 1 and c["fail"] >= 1
        if gate and not (both and len(c["sponsors"]) >= 3):
            continue
        rate = c["app"] / c["n"]
        if rate <= 0 or rate >= 1:   # perfect separation -> can't compute OR
            continue
        ors[k] = (rate / (1 - rate)) / base_odds
        cleared += 1
    return ors, cleared


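def score_model(rows, mode, joint_ors=None, keyfn=None):
    """Score each row under one of three modes.

    "biomarker_only": structural baseline times the biomarker anchor.
    "joint":          structural baseline times the fitted joint-cell odds ratio,
                      falling back to the biomarker anchor when no cell was fit.
    any other mode:   structural baseline only (main() passes "baseline").
    """
    out = []
    for r in rows:
        p = baseline_score(r["rec"])
        if mode == "biomarker_only" and r["biomarker"]:
            p = _logit_apply(p, BIOMARKER_ANCHOR.get(r["biomarker"], 1.0))
        elif mode == "joint":
            k = keyfn(r) if (r["biomarker"] and r["orr_bucket"]) else None
            if k is not None and k in joint_ors:
                p = _logit_apply(p, joint_ors[k])
            elif r["biomarker"]:
                # FLOOR: no fitted joint cell, so use the biomarker-only anchor.
                p = _logit_apply(p, BIOMARKER_ANCHOR.get(r["biomarker"], 1.0))
        out.append(p)
    return out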


def main():
    recs = load()
    n_total = len(recs)
    n_app = sum(r["approved"] for r in recs)
    covered = [r for r in recs if r["has_sig"]]
    with_bm = [r for r in recs if r["biomarker"]]
    with_both = [r for r in recs if r["biomarker"] and r["orr_bucket"]]

    print("=" * 72 + "\nPART 0 — Coverage\n" + "=" * 72)
    print(f"Cohort N={n_total}  ({n_app} approved / {n_total - n_app} failed)")
    print(f"Drugs with ANY signal:   {len(covered)} ({len(covered)/n_total:.0%})")
    print(f"Drugs with biomarker:    {len(with_bm)}")
    print(f"Drugs with biomarker+ORR:{len(with_both)}  (joint-cell population)")
    print(f"  approved/failed:       {sum(r['approved'] for r in with_both)}/"
          f"{sum(1-r['approved'] for r in with_both)}  <-- survivorship: failures underreport ORR")

    report_cells("Pooled 9-cell joint grid (biomarker x ORR)", cell_table(with_both, key_pooled), 9)
    report_cells("Per-modality 36-cell joint grid", cell_table(with_both, key_permod), 36)

    print("\n" + "=" * 72 + "\nPART 2 — Joint table vs biomarker-only (held-out 30%)\n" + "=" * 72)
    train, held = stratified_split(recs, SEED)
    held_labels = [r["approved"] for r in held]
    for label, keyfn, gate in [
        ("JOINT pooled-9, governance-gated", key_pooled, True),
        ("JOINT per-mod-36, governance-gated", key_permod, True),
        ("JOINT pooled-9, NO gate (overfit every cell)", key_pooled, False),
    ]:
        joint_ors, cleared = fit_joint(train, keyfn, gate=gate)
        aj, ab, diff, se, z, p = delong_paired(
            score_model(held, "joint", joint_ors, keyfn),
            score_model(held, "biomarker_only"), held_labels)
        ab2 = auc_only(score_model(held, "baseline"), held_labels)
        print(f"\n{label}\n  train cells cleared: {cleared} | held N={len(held)} "
              f"({sum(held_labels)}/{len(held)-sum(held_labels)}) | structural baseline AUC={ab2:.3f}")
        print(f"  biomarker-only={ab:.3f}  joint={aj:.3f}  diff={diff:+.3f} "
              f"(paired DeLong SE={se:.3f}, z={z:+.2f}, p={p:.3f})")
        print(f"  GATE (+0.030 @ p<0.05): {'PASS' if (diff >= 0.03 and p < 0.05) else 'FAIL'}")

    print("\n" + "=" * 72 + "\nPART 2b — Minimum detectable AUC delta (paired DeLong)\n" + "=" * 72)
    s_base = score_model(held, "baseline")
    s_bm = score_model(held, "biomarker_only")
    _, _, _, se_layer, _, _ = delong_paired(s_bm, s_base, held_labels)
    Z = 1.959963985 + 0.8416212336  # 80% power, two-sided alpha 0.05
    nh = len(held)
    for n_proj, tag in [(nh, f"NSCLC now (held={nh})"), (30, "NSCLC full N=85 (~30 held)"),
                        (100, "pooled 4-indication (~100 held)"), (250, "RA-regime (~250 held)")]:
        mdd = Z * se_layer * math.sqrt(nh / n_proj)
        print(f"  {tag:<38} min detectable AUC delta @80% power = {mdd:.3f}")
    print("\n  Target to clear gate = 0.030. Research §8.3: +3pp at ~0.67 baseline needs")
    print("  N~100-120/cohort; 0.625 baseline ~250 'out of reach'. Joint-vs-biomarker")
    print("  isolates a SMALLER delta -> needs MORE N. ~250 held = ~830 drugs.")


if __name__ == "__main__":
    main()