Network Benchmarks

PhaseFolio commits to publishing an anonymized, machine-parseable benchmark dump on an annual cadence; the first dump ships in Q4 2026. Cohort cells are the unit of aggregation (indication × modality × biomarker × stage). A cell is suppressed unless it contains at least five distinct scenarios (k = 5) from at least three distinct orgs; asset-level rows are never published. Both JSON and CSV artifacts ship with identical content under a stable schema version.

Public Commitment

The first annual dump is targeted for Q4 2026. Subsequent dumps ship on a 12-month cadence. The methodology on this page is the contract; if any element changes before publication, the change is documented here with a dated note rather than silently revised.

What the dump contains

Aggregate distributions, not asset-level rows.

The dump is a set of aggregate distributions across PhaseFolio scenarios that were entered into the platform during the trailing window. It is published as two parallel artifacts — a JSON file for programmatic consumers (AI agents, academic pipelines) and a CSV file for spreadsheet workflows. Both files carry identical content and an identical schema version.

Each row in the aggregate is a cohort cell: a combination of indication, modality, biomarker strategy, and stage at entry. For each cell, the dump publishes summary statistics for the rNPV inputs and outputs — quartiles and means for peak revenue, WACC, per-stage cost, ramp years, exclusivity years, MFP discount assumption, and resulting rNPV — plus the cohort size.
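
As an illustration of the per-cell summary described above, here is a minimal sketch of computing quartiles and a mean for one numeric field within one cohort cell. The function name, the output keys, and the sample values are hypothetical, and the quartile interpolation method (`method="inclusive"`) is an assumption; this page does not specify which method the dump uses.

```python
from statistics import mean, quantiles

def summarize(values):
    """Quartiles, mean, and cohort size for one numeric field in one cell.

    The "inclusive" interpolation method is an assumption, not the
    documented PhaseFolio method.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": mean(values),
        "cohort_size": len(values),
    }

# Hypothetical rNPV values (USD millions) for a single cohort cell.
rnpv_values = [120.0, 80.0, 200.0, 150.0, 95.0]
summary = summarize(rnpv_values)
print(summary)
```

The same function would be applied to each published field (peak revenue, WACC, per-stage cost, and so on) independently within a cell.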

Asset-level rows are never published. Only aggregate cells with cohort sizes at or above the suppression threshold (see Anonymization approach) appear in the file. An IC memo can cite “the median PhaseFolio rNPV for Phase II oncology monoclonal antibodies in 2026 was $X with IQR $Y–$Z”; no individual asset can be recovered from the file.

Anonymization approach

K-anonymity, suppression thresholds, and the explicit ban on asset-level rows.

The anonymization regime is designed so that no individual asset, scenario, or org is recoverable from the published file. Three protections combine:

Cell suppression at k = 5. A cohort cell ships only if it contains at least five distinct scenarios from at least three distinct orgs. Cells below the threshold are suppressed entirely — they do not appear in the file under any aggregate label.
No org identifiers. The dump carries no org-level totals, no org-level rankings, and no anonymized org IDs of any kind. The unit of aggregation is the cohort cell, not the org.
No free-text fields. Asset names, sponsor names, internal notes, and Evidence Register excerpts are excluded from the dump regardless of cohort size. Only the categorical scenario inputs and numeric outputs survive.
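
The suppression rule in the first protection can be sketched as a filter over scenarios grouped by cell. This is a minimal sketch, not the production pipeline: the record keys (`scenario_id`, `org_id`, `cell`) are hypothetical, and the org identifier is used only internally to count distinct orgs; it never reaches the output.

```python
from collections import defaultdict

K_MIN_SCENARIOS = 5  # from the text: at least five distinct scenarios
MIN_ORGS = 3         # from the text: at least three distinct orgs

def publishable_cells(scenarios):
    """Return the set of cells that clear both suppression thresholds.

    Each scenario is a dict with hypothetical keys:
    scenario_id, org_id, cell.
    """
    scenario_ids = defaultdict(set)
    org_ids = defaultdict(set)
    for s in scenarios:
        scenario_ids[s["cell"]].add(s["scenario_id"])
        org_ids[s["cell"]].add(s["org_id"])
    return {
        cell
        for cell in scenario_ids
        if len(scenario_ids[cell]) >= K_MIN_SCENARIOS
        and len(org_ids[cell]) >= MIN_ORGS
    }

# Hypothetical input: A passes; B has too few orgs; C has too few scenarios.
scenarios = (
    [{"scenario_id": f"a{i}", "org_id": f"org{i % 3}", "cell": "A"} for i in range(5)]
    + [{"scenario_id": f"b{i}", "org_id": f"org{i % 2}", "cell": "B"} for i in range(6)]
    + [{"scenario_id": f"c{i}", "org_id": f"org{i}", "cell": "C"} for i in range(4)]
)
print(publishable_cells(scenarios))  # {'A'}
```

Both conditions must hold at once: six scenarios from two orgs are suppressed, and so are four scenarios from four orgs.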

Orgs may opt their data out of the dump entirely. Orgs are included by default; a setting on the org page lets an admin remove the org's contribution to future dumps without affecting the live product. Past dumps are immutable: we do not retroactively redact a published file, but every subsequent dump will respect the opt-out.

Cohort cuts

The categorical axes used to define cells.

The same categorical axes that drive the engine's base-rate matrix drive the dump cohorts. This makes the published file directly comparable to the methodology referenced in this hub:

Indication — 11 therapeutic areas (oncology solid, oncology hematologic, rare disease, neurology, immunology, infectious disease, cardiovascular, metabolic, respiratory, dermatology, ophthalmology).
Modality — 8 wizard modalities (small molecule, peptide, monoclonal antibody, bispecific, ADC, cell therapy, gene therapy, other).
Biomarker strategy — 3 levels (none, enrichment, companion diagnostic).
Stage at entry — the engine's entry stage for the scenario (Preclinical, Phase I, Phase II, Phase III, NDA/BLA).

The full cross-product is 11 × 8 × 3 × 5 = 1,320 possible cells. After k = 5 suppression, the published file will carry substantially fewer; the suppressed-cell count will itself be reported in the dump preamble so consumers can see the coverage.
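
The cell count can be checked by enumerating the cross-product of the four axes listed above. The axis values are taken from this page; the variable names and the `coverage` helper are illustrative only.

```python
from itertools import product

INDICATIONS = [
    "oncology solid", "oncology hematologic", "rare disease", "neurology",
    "immunology", "infectious disease", "cardiovascular", "metabolic",
    "respiratory", "dermatology", "ophthalmology",
]
MODALITIES = [
    "small molecule", "peptide", "monoclonal antibody", "bispecific",
    "ADC", "cell therapy", "gene therapy", "other",
]
BIOMARKER_STRATEGIES = ["none", "enrichment", "companion diagnostic"]
STAGES = ["Preclinical", "Phase I", "Phase II", "Phase III", "NDA/BLA"]

# Every possible cohort cell, before any suppression is applied.
ALL_CELLS = list(product(INDICATIONS, MODALITIES, BIOMARKER_STRATEGIES, STAGES))
print(len(ALL_CELLS))  # 1320

def coverage(published_count):
    """Fraction of possible cells that survive suppression (hypothetical helper)."""
    return published_count / len(ALL_CELLS)
```

A consumer could pass the number of cells actually present in the file to `coverage` to see how much of the grid is represented.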

Lag and timing

Why the dump trails the live product by 12–18 months.

The dump publishes scenarios from a trailing 12–18 month window, ending no later than the cutoff date stated in the file preamble. This lag is not incidental; it is a deliberate moat-thickener. Real-time benchmarking is paid (L3); the lagged dump (L4) is an open public good.

The lag also lets us settle three messy issues before publication: outcomes for scenarios that resolve during the window get joined back to the inputs; suppression decisions for borderline cells get re-checked; opt-out requests filed during the window are honored.

Citation format

Stable URLs, schema versions, and a recommended attribution string.

Each annual dump is published at a stable URL of the form /benchmarks/<year> with a content hash and a schema version. The URL is permanent — once a dump ships, that URL never breaks.
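
A consumer can use the published content hash to confirm that a downloaded file is byte-identical to what shipped. The sketch below assumes the hash is a SHA-256 hex digest over the raw file bytes; this page does not name the algorithm, so treat that choice, the function name, and the sample payload as assumptions.

```python
import hashlib

def matches_content_hash(payload: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 digest of the payload with the published hash.

    SHA-256 is an assumption; substitute whatever algorithm the dump
    preamble actually declares.
    """
    actual = hashlib.sha256(payload).hexdigest()
    return actual == expected_hex.strip().lower()

# Hypothetical payload standing in for the downloaded file bytes.
payload = b'{"schema_version": "1.0", "cells": []}'
published_hash = hashlib.sha256(payload).hexdigest()

print(matches_content_hash(payload, published_hash))         # True
print(matches_content_hash(payload + b" ", published_hash))  # False
```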

The recommended attribution string for academic and IC use:

PhaseFolio Network Benchmarks (Q4 2026). Anonymized aggregate biotech investment scenarios. Schema v1.0. Available at phasefolio.com/benchmarks/2026.

Programmatic consumers (AI agents, academic pipelines) should pin the schema version they consumed and check it on each fetch — future dumps may add fields, but no field will ever change semantics within a major schema version. The schema itself is documented at /api/v1/benchmarks/schema.
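
One way a consumer might implement the pin-and-check advice is to compare major versions on every fetch and to ignore fields it does not recognize. This is a sketch under assumptions: the `major.minor` version format is inferred from the "Schema v1.0" attribution string, and the function names are hypothetical rather than part of any PhaseFolio API.

```python
PINNED_SCHEMA = "1.0"  # the version this consumer was written against

def major(version: str) -> int:
    """Extract the major component from strings like '1.0' or 'v1.3'."""
    return int(version.strip().lstrip("vV").split(".")[0])

def is_compatible(pinned: str, fetched: str) -> bool:
    """Same major version: fields may have been added, none changed meaning."""
    return major(pinned) == major(fetched)

def check_schema(fetched: str) -> None:
    """Raise if the fetched dump uses a different major schema version."""
    if not is_compatible(PINNED_SCHEMA, fetched):
        raise ValueError(
            f"schema {fetched} is not compatible with pinned {PINNED_SCHEMA}"
        )

print(is_compatible("1.0", "v1.3"))  # True
print(is_compatible("1.0", "2.0"))   # False
```

Because fields may be added within a major version, a parser built this way should read the fields it knows by name and skip the rest rather than fail on unexpected keys.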

What the dump does not represent

Selection bias, opt-outs, and the non-random nature of the user base.

The PhaseFolio user base is not the universe of biotech investors. Aggregate statistics describe what users chose to model in PhaseFolio — not the global pipeline. Cohort size in the dump reflects modeling activity, not real-world prevalence.
Opt-out is non-random. Orgs that opt out of the dump may do so for reasons that correlate with their portfolio composition. The dump cannot correct for this; users should not treat the file as a representative cross-section of all PhaseFolio activity.
Inputs are user assumptions. A median peak-revenue figure in the dump is the median of what users assumed, not the median of any downstream realized outcome. Treat the file as a benchmark of analyst belief, not of realized economics.
The dump is not the engine. Citing a benchmark cell does not substitute for citing the underlying methodology. Both should appear in any serious analysis.