[Forge] Extract structured claims from papers missing claims

Goal

Extract structured scientific claims from papers that currently have no claim extraction output. Claims should include source provenance and support downstream evidence linking, hypothesis evaluation, and search.

Acceptance Criteria

☐ A concrete batch of papers has real structured claims extracted
☐ Each extracted claim includes PMID, DOI, URL, or local paper provenance
☐ claims_extracted is marked only after real extraction or a documented skip
☐ Before/after missing-claims counts are recorded
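
The provenance criterion above can be checked mechanically before claims are persisted. The following is a minimal sketch, assuming each claim is a dict; the key names (`pmid`, `doi`, `url`, `local_paper_id`) and both helper functions are hypothetical and not taken from the existing extraction utilities.

```python
from typing import Iterable, List, Mapping

# Provenance fields named in the acceptance criteria.
# The exact key names are assumptions made for this sketch.
PROVENANCE_FIELDS = ("pmid", "doi", "url", "local_paper_id")


def has_provenance(claim: Mapping[str, object]) -> bool:
    """Return True if at least one provenance field is present and non-blank."""
    return any(str(claim.get(field) or "").strip() for field in PROVENANCE_FIELDS)


def claims_missing_provenance(claims: Iterable[Mapping[str, object]]) -> List[Mapping[str, object]]:
    """Return the claims that carry no usable provenance at all."""
    return [claim for claim in claims if not has_provenance(claim)]
```

A batch would pass this criterion only if `claims_missing_provenance(batch)` returns an empty list.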

Approach

  • Query papers where COALESCE(claims_extracted, 0) = 0.
  • Prioritize papers with abstracts, full text, PMCID, or DOI.
  • Use existing paper and LLM tooling to extract concise evidence-bearing claims.
  • Persist claims and verify provenance and remaining backlog counts.
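
The first two steps of the approach can be sketched as a single prioritized query. This is an illustration only: it uses an in-memory SQLite database so that it runs standalone, and the columns `abstract`, `pmcid`, `doi` and `citation_count` are assumed names (the spec itself only confirms `paper_id`, `pmid` and `claims_extracted`).

```python
import sqlite3

# Papers with no extraction attempt yet, ordered so that papers with an
# abstract, then a PMCID, then a DOI come first; citation count breaks ties.
CANDIDATE_SQL = """
SELECT paper_id, pmid
FROM papers
WHERE COALESCE(claims_extracted, 0) = 0
ORDER BY
    (abstract IS NOT NULL AND abstract != '') DESC,
    (pmcid IS NOT NULL) DESC,
    (doi IS NOT NULL) DESC,
    COALESCE(citation_count, 0) DESC
LIMIT ?
"""


def select_candidates(conn, limit=30):
    """Return (paper_id, pmid) rows for the next extraction batch."""
    return conn.execute(CANDIDATE_SQL, (limit,)).fetchall()


# Demo data with placeholder identifiers (hypothetical minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, pmid TEXT, abstract TEXT, "
    "pmcid TEXT, doi TEXT, citation_count INTEGER, claims_extracted INTEGER)"
)
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "pmid-a", None, None, None, 500, 0),          # no abstract
        (2, "pmid-b", "text", None, "doi-b", 10, None),   # never attempted
        (3, "pmid-c", "text", "pmc-c", "doi-c", 5, 0),    # best-covered paper
        (4, "pmid-d", "text", None, None, 900, 1),        # already extracted
        (5, "pmid-e", "text", None, None, 1, -1),         # documented skip
    ],
)
demo_candidates = select_candidates(conn, limit=10)
```

In the demo, papers 4 and 5 are excluded because their `claims_extracted` value is non-zero, and the remaining papers are ordered by how much source material they carry rather than by citations alone.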
Dependencies

    • dd0487d3-38a - Forge quest
    • Paper cache, abstracts or full text, and claim extraction utilities

    Dependents

    • Hypothesis evidence support, KG extraction, and paper search

    Work Log

    2026-04-21 - Quest engine template

    • Created reusable spec for quest-engine generated paper claim extraction tasks.

    2026-04-22 18:30 UTC - Task 71e1300a execution

    • 30 papers processed from the highest-citation queue missing claims_extracted.
    • Claims extracted: 181 new paper_claims rows from 28 of 30 papers (2 papers had abstracts with no extractable mechanistic/causal claims — marked claims_extracted=-1).
    • Verified: 28 papers have 2+ claims each; 2 papers (22183410, 32424620) have -1 (no claims).
    • Note: The first batch hit a CHECK constraint violation when the LLM returned claim_type='comparative' (paper 32719508), which aborted the transaction at paper 21. Fixed by mapping 'comparative' → 'correlative' in post-processing; the remaining 9 papers were then processed successfully.
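
The post-processing fix in the note above, combined with per-paper isolation so that one bad row cannot abort a whole batch, might look roughly like the sketch below. It is not the code in scripts/extract_paper_claims.py. The allowed claim types are an assumption (the spec only confirms that 'correlative' is accepted and 'comparative' is not), the table schema is a hypothetical minimal one, and SQLite is used purely so the example runs standalone.

```python
import sqlite3

# Assumption: the spec does not list the values the CHECK constraint accepts;
# this set is hypothetical apart from 'correlative', which the work log names.
ALLOWED_CLAIM_TYPES = {"mechanistic", "causal", "correlative"}
# Mapping recorded in the work log for the label the LLM produced.
CLAIM_TYPE_ALIASES = {"comparative": "correlative"}


def normalize_claim_type(raw):
    """Return an allowed claim_type, or None if the label cannot be mapped."""
    value = (raw or "").strip().lower()
    value = CLAIM_TYPE_ALIASES.get(value, value)
    return value if value in ALLOWED_CLAIM_TYPES else None


def insert_claims_for_paper(conn, paper_id, claims):
    """Insert one paper's claims inside a savepoint; return the number stored.

    If any insert fails, only this paper's rows are rolled back.
    """
    conn.execute("SAVEPOINT one_paper")
    try:
        stored = 0
        for claim in claims:
            claim_type = normalize_claim_type(claim.get("claim_type"))
            if claim_type is None:
                continue  # drop labels with no known mapping
            conn.execute(
                "INSERT INTO paper_claims (paper_id, claim_text, claim_type) "
                "VALUES (?, ?, ?)",
                (paper_id, claim.get("claim_text"), claim_type),
            )
            stored += 1
    except sqlite3.Error:
        conn.execute("ROLLBACK TO SAVEPOINT one_paper")
        conn.execute("RELEASE SAVEPOINT one_paper")
        return 0
    conn.execute("RELEASE SAVEPOINT one_paper")
    return stored


# Demo against a throwaway in-memory table (hypothetical minimal schema).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE paper_claims ("
    "id INTEGER PRIMARY KEY, paper_id INTEGER NOT NULL, "
    "claim_text TEXT NOT NULL, "
    "claim_type TEXT CHECK (claim_type IN ('mechanistic', 'causal', 'correlative')))"
)
stored_ok = insert_claims_for_paper(conn, 1, [
    {"claim_text": "A is higher than B", "claim_type": "comparative"},
    {"claim_text": "X activates Y", "claim_type": "mechanistic"},
])
# Second paper: the missing claim_text violates NOT NULL, so it is rolled back.
stored_bad = insert_claims_for_paper(conn, 2, [
    {"claim_text": "P causes Q", "claim_type": "causal"},
    {"claim_text": None, "claim_type": "causal"},
])
stored_types = [r[0] for r in conn.execute(
    "SELECT claim_type FROM paper_claims ORDER BY id")]
```

Wrapping each paper in its own savepoint means a constraint violation costs one paper rather than the rest of the batch, and the failed paper can be retried or given a documented skip afterwards.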
    Results:
    • 30 papers targeted (top 30 by citation count with missing claims)
    • 28 papers with 2+ claims (28 × 2 = 56 ≥ 30 ✓)
    • 181 new paper_claims rows added (total paper_claims: 100 → 281)
    • Before: 18,969 papers had claims_extracted=0; After: 18,939
    Verification queries:

    SELECT COUNT(*) FROM paper_claims; -- 281
    SELECT COUNT(*) FROM papers WHERE COALESCE(claims_extracted, 0) != 0; -- 128
    SELECT COUNT(*) FROM papers WHERE COALESCE(claims_extracted, 0) = 0; -- 18,939
    SELECT p.pmid, p.claims_extracted FROM papers p WHERE p.pmid IN (...30 pmids...); -- all 30 marked
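
To record the before/after counts required by the acceptance criteria, the verification queries above could be wrapped in a small helper that is called once before and once after a batch. This is a sketch under assumptions: it treats a positive `claims_extracted` value as "extracted" (the spec only documents 0 as missing and -1 as a skip), and it runs against a hypothetical in-memory table rather than the real papers table.

```python
import sqlite3


def backlog_counts(conn):
    """Count papers that are missing claims, extracted, or skipped."""
    row = conn.execute(
        "SELECT "
        "COALESCE(SUM(COALESCE(claims_extracted, 0) = 0), 0), "
        "COALESCE(SUM(claims_extracted > 0), 0), "
        "COALESCE(SUM(claims_extracted < 0), 0) "
        "FROM papers"
    ).fetchone()
    return {"missing": row[0], "extracted": row[1], "skipped": row[2]}


# Demo with placeholder rows (hypothetical minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, pmid TEXT, "
    "claims_extracted INTEGER)"
)
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [(1, "p1", 0), (2, "p2", None), (3, "p3", 1), (4, "p4", -1)],
)

before = backlog_counts(conn)
# Simulate a batch: one real extraction and one documented skip.
conn.execute("UPDATE papers SET claims_extracted = 1 WHERE paper_id = 1")
conn.execute("UPDATE papers SET claims_extracted = -1 WHERE paper_id = 2")
after = backlog_counts(conn)
backlog_reduction = before["missing"] - after["missing"]
```

Logging `before`, `after` and `backlog_reduction` alongside the batch size gives a quick consistency check: the reduction should equal the number of papers the batch marked.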

    2026-04-22 22:55 UTC - Task 87a0c772 execution

    • paper_claims table: Created via migration 109 with full schema (PK, FK to papers, claim_type CHECK, confidence CHECK, history table, audit trigger).
    • Claims extracted: 100 claims from 17 of 20 papers (3 papers had abstracts with no extractable mechanistic/causal claims — marked claims_extracted=-1).
    • Hypotheses linked: 55 evidence_entries created via claim-to-hypothesis matching.
    • Verification: 17 papers have 2+ claims each; 3 papers (25242045, 29728651, 32296183) have -1 (no claims).
    Files created:
    • migrations/109_add_paper_claims_table.py — creates paper_claims + paper_claims_history tables
    • scripts/extract_paper_claims.py — LLM-based claim extraction + hypothesis linking
    Results:
    • 20 papers processed
    • 17 papers with 2+ claims (17 × 2 = 34 ≥ 20 ✓)
    • 55 evidence_entries added to link claims to hypotheses
    • Before: 18,952 papers had claims_extracted=0; After: 18,932
    Verification queries:

    SELECT COUNT(*) FROM paper_claims; -- 100
    SELECT COUNT(*) FROM evidence_entries WHERE methodology='claim_extraction'; -- 55
    SELECT p.pmid, COUNT(pc.id) as cnt FROM papers p JOIN paper_claims pc ON p.paper_id = pc.paper_id GROUP BY p.pmid HAVING COUNT(pc.id) >= 2; -- 17 rows

    Tasks using this spec (4)
    [Forge] Extract structured claims from 30 papers missing cla
    [Atlas] Extract structured scientific claims from 20 high-pr
    Atlas done P77
    [Forge] Extract structured claims from 30 papers missing cla
    Forge done P82
    [Forge] Build real data pipeline: extract structured finding
    Forge open P89
    File: quest_engine_paper_claim_extraction_spec.md
    Modified: 2026-04-24 07:15
    Size: 3.8 KB