SEC EDGAR Deal-Term Extraction
Structured deal terms are extracted from public SEC EDGAR filings for a curated biotech sponsor cohort, retained with source provenance, and used as aggregate-eligible comparator medians to pre-fill the rNPV wizard's deal-structure form (deal-suggestion behavior unchanged since engine 2.4.0; current engine 2.6.0). Each deal carries an independent cohort classification (rule + LLM with disagreement-gated review) so the engine cascade reads from a clean per-row cohort indication. Cohort classifier v2 (2026-05-26) broadens the AMR cohort to anti-infectives — antibacterial, antifungal, antiviral, and topical-antibacterial deals are admitted under the same engine taxonomy. The analyst sees each suggestion with source attribution and can accept, edit, or clear any field before computing.
1. What this section covers
PhaseFolio ingests deal-term data from SEC EDGAR for a curated cohort of public biotech sponsors. The extracted fields include upfront cash, near-term and total milestones, royalty ranges, equity consideration, counterparty, asset, indication, effective date, territory, and exclusivity.
v1 extends the v0 ingestion substrate into a canonical read layer and wizard integration. On non-Full-Ownership deal-type selection, the rNPV wizard can now suggest deal terms sourced from aggregate SEC comparator medians when enough high-quality comparables exist (wizard-suggest behavior unchanged since engine 2.3.2; current engine 2.6.0). The analyst sees the suggestion, reviews it, and can accept, edit, or clear any field before computing.
2. Data source and legal posture
The source is SEC EDGAR via the SEC's official endpoints (data.sec.gov and www.sec.gov). SEC expressly permits free access and reuse of EDGAR public filing content; PhaseFolio extracts factual deal terms from public filings and surfaces them with source links, short quotes, extraction-risk disclaimers, and no SEC endorsement language.
No AGPL- or other copyleft-licensed code sits in the ingestion path. The pipeline does not depend on the edgartools or sec-edgar-mcp projects.
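The SEC Fair Access Policy listed in the references asks callers to send a User-Agent and to stay under 10 requests per second. The following is a minimal sketch of how a client could honor both, not PhaseFolio's actual implementation. `Throttle` and `build_request` are hypothetical names, and the User-Agent value is supplied by the caller.

```python
import time
import urllib.request


class Throttle:
    """Space calls at least 1/max_per_second seconds apart (0.1 s for 10 per second)."""

    def __init__(self, max_per_second=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous permitted call

    def wait(self):
        """Block until the next request is allowed, then record the call time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url, user_agent):
    """Build (but do not send) a request that identifies the caller."""
    if not user_agent.strip():
        raise ValueError("a descriptive User-Agent is required")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Calling `wait()` before each fetch keeps a single-threaded client under the cap; a multi-process crawler would need a shared limiter instead.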
3. Cohort and forms covered
The current dataset is the AMR-anchored biotech sponsor cohort resolved from PhaseFolio's enriched antimicrobial trial dataset. The v1 ingestion path supports Form 8-K and Form 6-K filings filed on or after 2020-01-01. Customer-facing foreign-private-issuer cohort claims and 10-K/10-Q Exhibit 10.x extraction remain separate counsel-review events.
- 8-K Item 1.01: material definitive agreement summaries, used for the counsel-approved wizard-suggest disclosure.
- 6-K: ingestion-enabled for foreign private issuer coverage; any broader customer-facing FPI claims require supplemental counsel review.
- 10-K/10-Q Exhibit 10.x: out of v1 scope because operative agreement text, redactions, and deduplication require a separate design.
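The form and date scope above can be sketched as a filter over one sponsor's parsed SEC submissions JSON. This is an illustration only. It assumes the column-oriented `filings.recent` layout (parallel lists keyed by `accessionNumber`, `form`, and `filingDate`), and `candidate_filings` is a hypothetical name. Whether amended forms such as 8-K/A are in scope is not stated in this methodology, so the sketch matches the two form names exactly.

```python
from datetime import date

SUPPORTED_FORMS = {"8-K", "6-K"}   # v1 ingestion scope
CUTOFF = date(2020, 1, 1)          # filings on or after this date


def candidate_filings(submissions, forms=SUPPORTED_FORMS, cutoff=CUTOFF):
    """Return supported filings filed on or after the cutoff.

    Assumes index i across the lists under filings.recent describes one filing.
    """
    recent = submissions.get("filings", {}).get("recent", {})
    rows = zip(
        recent.get("accessionNumber", []),
        recent.get("form", []),
        recent.get("filingDate", []),
    )
    selected = []
    for accession, form, filed in rows:
        if form in forms and date.fromisoformat(filed) >= cutoff:
            selected.append({"accession": accession, "form": form, "filed": filed})
    return selected
```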
Cohort definition (v2, 2026-05-26 — anti-infective broadening). The cohort engine-taxonomy name is unchanged (the cohort's indication classification remains "antimicrobial") but the inclusion definition expands from AMR-narrow (systemic antibacterial focus) to anti-infectives more broadly: antibacterial, antifungal, and antiviral agents — including influenza and RSV antivirals — and topical antibacterial agents (minocycline, mupirocin, fusidic acid, retapamulin) admitted by drug-class match regardless of dermatology indication framing. HIV antiretrovirals are flagged ambiguous and routed to human review because HIV licensing economics differ materially from the broader anti-infective cohort and warrant a separate cohort definition (deferred to v2.1). The broadening was deferred at v1 protocol freeze (see the AMR deal-value backtest pre-registration protocol §1, "a v2 protocol may admit them after methodology bump") and adopted via the directive on 2026-05-26 to ship a sequential A-then-B cohort expansion. Exclusion patterns are unchanged: oncology, immunology, metabolic disease, and pure gene-editing platform deals remain out of cohort.
4. Extraction pipeline
- Per-CIK submissions fetch. For each cohort sponsor, the pipeline pulls SEC submissions JSON and filters to supported forms on or after 2020-01-01.
- Cheap deterministic classifier. Each candidate filing is passed through classifier@2026-05-25-v1 to drop obvious non-deal filings before LLM extraction. The Item 1.01 short-circuit applies to 8-K only because 6-K filings do not use the same item numbering.
- LLM-assisted structured extraction. Deal candidates are sent to deal_terms_extractor@2026-05-25-v2, which separates equity dollars from equity share counts and gives platform-broad deals an indication-null path for cohort fallback.
- Provenance capture. Every extraction stores field values, source section, a source quote capped at 500 characters, extractor version, raw response, and confidence.
- Review gating. Rows below the confidence threshold are flagged with needs_review. Aggregate medians exclude unreviewed flagged rows until reviewed_at is populated by the admin review route.
- Cohort classification (engine cascade unchanged since 2.4.0; current engine 2.6.0). Each ingested deal is also classified for cohort membership at extraction time. Two passes run on every filing: a deterministic include/exclude/ambiguous keyword pass and a context-aware LLM verdict using the filing's own source quote. Deals where the two passes disagree, or where an ambiguous pattern fires, are excluded from engine medians until a human reviewer clears the classification via the admin cohort-review route. Cohort assignments and their rule/LLM verdicts are stored alongside each deal as provenance. The engine cascade reads from the per-row cohort indication classification, decoupling per-cohort scoping from the LLM-raw indication string.
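The disagreement gate in the cohort classification step can be illustrated with a small record that keeps both verdicts and derives eligibility from them. This is a sketch under stated assumptions: the class name, field names, and verdict strings are hypothetical, and treating an ambiguous LLM verdict the same way as an ambiguous rule pattern is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

INCLUDE, EXCLUDE, AMBIGUOUS = "include", "exclude", "ambiguous"


@dataclass
class CohortDecision:
    rule_verdict: str                       # deterministic keyword pass
    llm_verdict: str                        # context-aware LLM pass
    reviewer_verdict: Optional[str] = None  # set once a human clears the row

    @property
    def needs_cohort_review(self):
        """True when the passes disagree or either verdict is ambiguous."""
        return (
            self.rule_verdict != self.llm_verdict
            or AMBIGUOUS in (self.rule_verdict, self.llm_verdict)
        )

    @property
    def in_cohort_for_medians(self):
        """Whether the deal may feed engine medians for this cohort."""
        if self.reviewer_verdict is not None:
            # A reviewer decision overrides both automated passes.
            return self.reviewer_verdict == INCLUDE
        return not self.needs_cohort_review and self.rule_verdict == INCLUDE
```

Storing both verdicts next to the derived flag keeps the provenance that the step describes, so a reviewer can see why a row was held back.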
5. Fields extracted
- Upfront cash in USD millions.
- Near-term milestones and total milestones in USD millions.
- Royalty rate range, including low and high percentages when disclosed.
- Equity component, split into USD value and share count when filings disclose shares rather than dollars.
- Counterparty, licensor, licensee, asset or program, indication, effective date, territory, and exclusivity.
Royalty band conventions such as "low single-digit" and "mid teens" are mapped to numeric ranges using industry anchors that approximate the BIO Industry Analysis 2021 and Tufts CSDD deal-term tables. Re-anchoring requires an extractor version bump.
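The fields listed in this section could be held in a record like the one below. It is a sketch, not the stored schema: every attribute name is hypothetical, the unit of the equity dollar value is an assumption, and truncating an over-long quote (rather than rejecting it) is one possible reading of the 500-character cap.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

MAX_QUOTE_CHARS = 500  # source-quote cap stated in the pipeline description


@dataclass
class DealTerms:
    # Economics; None means the value was not disclosed or was redacted.
    upfront_cash_usd_m: Optional[float] = None
    near_term_milestones_usd_m: Optional[float] = None
    total_milestones_usd_m: Optional[float] = None
    royalty_low_pct: Optional[float] = None
    royalty_high_pct: Optional[float] = None
    equity_value_usd: Optional[float] = None   # unit is an assumption
    equity_share_count: Optional[int] = None   # used when shares are disclosed
    # Parties and scope.
    counterparty: Optional[str] = None
    licensor: Optional[str] = None
    licensee: Optional[str] = None
    asset: Optional[str] = None
    indication: Optional[str] = None           # None for platform-broad deals
    effective_date: Optional[date] = None
    territory: Optional[str] = None
    exclusivity: Optional[str] = None
    # Provenance.
    source_quote: str = ""

    def __post_init__(self):
        low, high = self.royalty_low_pct, self.royalty_high_pct
        if low is not None and high is not None and low > high:
            raise ValueError("royalty_low_pct must not exceed royalty_high_pct")
        # Enforce the quote cap by truncation (assumed behavior).
        self.source_quote = self.source_quote[:MAX_QUOTE_CHARS]
```

Keeping the equity dollar value and the share count as separate optional fields mirrors the split described above and avoids converting shares to dollars at extraction time.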
6. Wizard integration (behavior unchanged since engine 2.4.0; current engine 2.6.0)
When an analyst selects a non-Full-Ownership deal type in the rNPV wizard (Royalty Only, Milestone + Royalty, or Profit Split), PhaseFolio queries the SEC EDGAR substrate for comparable deal medians and pre-fills the wizard's Step 4 deal-structure form with suggested values. The analyst sees a gold-bordered banner naming the source cohort, the number of comparables, and a link to this methodology page, and can accept, edit, or clear any field before computing the rNPV.
Cascading match (unchanged from v1):
- Try indication × deal type. If at least 3 aggregate-eligible comparable deals exist, use their median economics.
- Otherwise fall back to indication only. If at least 3 aggregate-eligible comparable deals exist, use their median economics.
- Otherwise no suggestion is offered, and the analyst enters deal terms from scratch.
Only aggregate-eligible deal terms feed the medians: rows where needs_review = false, or rows that were flagged but later reviewer-cleared. Per-deal administrative display can include flagged rows, but engine medians cannot.
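The cascade and the eligibility rule can be sketched together as follows. This is an illustration under assumptions, not the engine's code: the row keys, field names, and function names are hypothetical, a populated reviewed_at is taken to mean reviewer-cleared, and medians are computed per field over the non-null values.

```python
from statistics import median

MIN_COMPARABLES = 3
FIELDS = ("upfront_usd_m", "total_milestones_usd_m", "royalty_low_pct", "royalty_high_pct")


def is_aggregate_eligible(row):
    """Unflagged rows, or flagged rows that a reviewer has since cleared."""
    return (not row["needs_review"]) or row.get("reviewed_at") is not None


def field_medians(rows, fields=FIELDS):
    """Median of each field over the rows that disclose it; None if none do."""
    result = {}
    for name in fields:
        values = [r[name] for r in rows if r.get(name) is not None]
        result[name] = median(values) if values else None
    return result


def suggest_deal_terms(rows, indication, deal_type):
    """Return a suggestion dict, or None when no suggestion should be offered."""
    if deal_type == "Full Ownership":
        return None  # explicit "no deal" signal
    by_indication = [
        r for r in rows
        if is_aggregate_eligible(r) and r["cohort_indication"] == indication
    ]
    exact = [r for r in by_indication if r["deal_type"] == deal_type]
    # Try indication x deal type first, then indication only.
    for level, pool in (("indication_x_deal_type", exact), ("indication_only", by_indication)):
        if len(pool) >= MIN_COMPARABLES:
            return {"match_level": level, "n": len(pool), "medians": field_medians(pool)}
    return None
```

Returning the match level and comparable count alongside the medians gives the wizard banner what it needs to name the source cohort and the number of comparables.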
Full Ownership is treated as an explicit "no deal" signal — no suggestions, no pre-fill, no overlay. The engine computes the full-product rNPV under the analyst's explicit ownership choice.
Why this design: the prior engine 2.3.0 / 2.3.1 design auto-populated empty deal structures server-side, which silently overrode the analyst's intent when Full Ownership had been selected (the wizard default produces an empty deal_structure over the wire). 2.3.2 moves the suggestion to the wizard so the analyst has explicit visibility and consent before any SEC-sourced value enters the model. The results-page disclosure card discloses N comparables, match level, cohort, medians, methodology link, and source accession links — same fields as v1, with the framing corrected to "suggested + accepted" rather than "auto-populated".
7. Storage and access
Extracted deal terms, their source provenance, and the underlying filing records are stored privately and exposed only through a gated read layer; raw filing text is never redistributed. Customer-facing surfaces use source links, accession numbers, and short quotes rather than the raw filing.
The canonical public read layer is the query_sec_deal_terms tool, which returns per-deal provenance; aggregate comparator medians that feed the engine are served through the same gated layer.
8. Limitations and non-claims
- 8-K and 6-K summaries are not full contracts. They can omit true-up provisions, anti-stacking, step-downs, change-of-control terms, and other operative clauses.
- Redactions are not reconstructed. Current Reg S-K Item 601(b)(10)(iv) permits redaction of specific provisions or terms where the information is customarily and actually treated as private or confidential and is not material. Redacted fields surface as null unless another public filing discloses the same value.
- LLM extraction can be wrong. Values may be incomplete or incorrect, and the source filing controls.
- The cohort is curated, not exhaustive. Coverage is limited to the AMR-anchored cohort, which cohort classifier v2 broadens to anti-infectives. NSCLC, RA, modality matching, and Exhibit 10.x extraction are future scopes.
- No SEC endorsement. PhaseFolio is not affiliated with, endorsed by, or certified by the SEC.
9. Versioning
| Artifact | Version | Reason |
|---|---|---|
| Dataset | sec_deal_terms@2026-Q3-2 | v1.5 schema migration + per-row cohort assignment; AMR cohort backfilled from CMO ground truth; 3 anti-infective deals (CD388 influenza antiviral, AVCs influenza antiviral, AMZEEQ topical minocycline) reclassified under v2 on 2026-05-26. |
| Extractor prompt | deal_terms_extractor@2026-05-25-v2 | Unchanged — re-extraction is a v1.6+ scope. |
| Deal-deal classifier | classifier@2026-05-25-v1 | Deterministic deal pre-filter unchanged. |
| Cohort classifier | cohort_classifier@2026-05-26-v2 | Anti-infective broadening: admits antifungal + antiviral (influenza, RSV) + topical antibacterial (minocycline, mupirocin) by drug-class match. HIV antivirals flagged ambiguous (deferred to v2.1). Exclude patterns unchanged. |
| Engine (behavior last changed) | 2.4.0 | Where the cascade behavior was last touched: the engine cascade reads from the per-row cohort indication classification. Unchanged since; the current engine is 2.6.0. v2 broadens which rows carry the "antimicrobial" taxonomy, not the cascade logic. |
| Methodology | methodology@2026-05-26-v3 | Documents cohort classifier v2 anti-infective broadening + the 3 reclassified rows. |
The outgoing methodology remains frozen at /methodology/sec-deal-terms?version=methodology@2026-05-26-v2, ?version=methodology@2026-05-26, ?version=methodology@2026-Q3, and ?version=methodology@2026-05-25 / ?version=2026-Q2.
Key Facts
| Fact | Value |
|---|---|
| Current methodology | methodology@2026-05-26-v3 |
| Dataset version | sec_deal_terms@2026-Q3-2 |
| Extractor prompt | deal_terms_extractor@2026-05-25-v2 |
| Deal-deal classifier | classifier@2026-05-25-v1 |
| Cohort classifier | cohort_classifier@2026-05-26-v2 (anti-infective broadening: antibacterial + antifungal + antiviral + topical antibacterial; HIV deferred to v2.1) |
| Engine integration | Behavior unchanged since engine 2.4.0 (current engine 2.6.0): the wizard pre-fills Step 4 from SEC comparator medians on non-Full-Ownership deal-type selection; the analyst can accept, edit, or clear before computing. The cascade reads from the per-row cohort indication classification. |
| Review gate | Aggregate medians exclude unreviewed needs_review rows AND deals where the rule and LLM cohort verdicts disagree until a reviewer clears them. |
| Analyst precedence | Full Ownership = hard 'no deal' signal — no suggestions, no overlay. Edit/Clear in the wizard fully governs what reaches the engine. |
| Legal posture | Factual extraction from public SEC EDGAR filings with source links, short quotes, extraction-risk disclaimers, and no SEC endorsement language. |
References
01 US Securities and Exchange Commission: EDGAR filing access.
02 SEC Fair Access Policy: User-Agent and 10 requests/second rate cap.
03 Regulation S-K Item 601: Exhibits and redaction provisions for material agreements.
04 BIO Industry Analysis 2021: Clinical Development Success Rates and Contributing Factors.
05 Tufts Center for the Study of Drug Development: deal-term and royalty-rate tables.
Methodology version: methodology@2026-05-26-v3 · Last updated: 2026-05-26