[Forge] Extract structured claims from papers missing claims

Goal

Extract structured scientific claims from papers that currently have no claim extraction output. Claims should include source provenance and support downstream evidence linking, hypothesis evaluation, and search.

Acceptance Criteria

☐ A concrete batch of papers has real structured claims extracted

☐ Each extracted claim includes PMID, DOI, URL, or local paper provenance

☐ claims_extracted is marked only after real extraction or a documented skip

☐ Before/after missing-claims counts are recorded
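The provenance criterion above can be checked mechanically before claims are persisted. The following is a minimal sketch, not the project's actual validator: the field names `pmid`, `doi`, `url`, and `local_path` and both helper functions are hypothetical, chosen only to mirror the four provenance kinds the criterion lists.

```python
from typing import Iterable, List, Mapping, Optional

# Hypothetical field names: the spec only says a claim must carry a PMID,
# DOI, URL, or local paper provenance, not how those are stored.
PROVENANCE_FIELDS = ("pmid", "doi", "url", "local_path")


def provenance_of(claim: Mapping[str, object]) -> Optional[str]:
    """Return the name of the first non-empty provenance field, or None."""
    for field in PROVENANCE_FIELDS:
        value = claim.get(field)
        if value is not None and str(value).strip():
            return field
    return None


def claims_missing_provenance(claims: Iterable[Mapping[str, object]]) -> List[Mapping[str, object]]:
    """Return the claims that have no usable provenance field."""
    return [claim for claim in claims if provenance_of(claim) is None]
```

A batch would pass this criterion only if `claims_missing_provenance` returns an empty list for every claim about to be written.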

Approach

Query papers where COALESCE(claims_extracted, 0) = 0.

Prioritize papers with abstracts, full text, PMCID, or DOI.

Use existing paper and LLM tooling to extract concise evidence-bearing claims.

Persist claims and verify provenance and remaining backlog counts.
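The first two steps above (selecting the backlog and prioritizing papers with usable source text) could be sketched as a single query. This is an illustration under assumptions: it uses SQLite through Python's standard library, and the column names `abstract`, `full_text`, `pmcid`, and `doi`, the scoring rule, and the `select_batch` helper are hypothetical. Only the `COALESCE(claims_extracted, 0) = 0` filter comes from the spec.

```python
import sqlite3

# Rank backlog papers by how many usable sources they have. In SQLite each
# "IS NOT NULL" test evaluates to 0 or 1, so the four tests can be summed.
BACKLOG_QUERY = """
SELECT pmid
FROM papers
WHERE COALESCE(claims_extracted, 0) = 0
ORDER BY
    (abstract IS NOT NULL) + (full_text IS NOT NULL)
      + (pmcid IS NOT NULL) + (doi IS NOT NULL) DESC,
    pmid
LIMIT ?
"""


def select_batch(conn: sqlite3.Connection, limit: int) -> list:
    """Return up to `limit` PMIDs from the backlog, best-sourced first."""
    return [row[0] for row in conn.execute(BACKLOG_QUERY, (limit,))]
```

The work log below ranks by citation count instead, so the ordering clause here is only one possible reading of "prioritize".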

Dependencies

dd0487d3-38a - Forge quest
Paper cache, abstracts or full text, and claim extraction utilities

Dependents

Hypothesis evidence support, KG extraction, and paper search

Work Log

2026-04-21 - Quest engine template

Created reusable spec for quest-engine generated paper claim extraction tasks.

2026-04-22 18:30 UTC - Task 71e1300a execution

Processed 30 papers from the highest-citation queue of papers missing claims_extracted.
Claims extracted: 181 new paper_claims rows from 28 of 30 papers. The other 2 papers had abstracts with no extractable mechanistic or causal claims and were marked claims_extracted=-1.
Verified: 28 papers have 2+ claims each; 2 papers (22183410, 32424620) are marked claims_extracted=-1 (no claims).
Note: The first batch hit a CHECK constraint violation when the LLM returned claim_type='comparative' for paper 32719508, which aborted the transaction at paper 21. Fixed by mapping 'comparative' → 'correlative' in post-processing. The remaining 9 papers were then processed successfully.
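The post-processing fix described in the note can be sketched as a small normalization step that runs before any insert. This is a hedged illustration, not the code in scripts/extract_paper_claims.py: the allowed set is an assumption (the log confirms only that 'correlative' is accepted and 'comparative' is rejected), and `normalize_claim_type` is a hypothetical helper name.

```python
from typing import Optional

# Assumed values accepted by the claim_type CHECK constraint. Only
# 'correlative' is confirmed by the work log; the others are guesses.
ALLOWED_CLAIM_TYPES = {"mechanistic", "causal", "correlative"}

# Mapping taken from the work log: the LLM's 'comparative' label is
# stored as 'correlative'.
CLAIM_TYPE_ALIASES = {"comparative": "correlative"}


def normalize_claim_type(raw: str) -> Optional[str]:
    """Return a claim_type the table should accept, or None if unmappable."""
    cleaned = raw.strip().lower()
    cleaned = CLAIM_TYPE_ALIASES.get(cleaned, cleaned)
    return cleaned if cleaned in ALLOWED_CLAIM_TYPES else None
```

Returning None for unknown labels lets the caller skip or log a single claim instead of letting one bad value abort the whole batch.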

Results:

30 papers targeted (top 30 by citation count with missing claims)
28 papers with 2+ claims (28 × 2 = 56 ≥ 30 ✓)
181 new paper_claims rows added (total paper_claims: 100 → 281)
Before: 18,969 papers had claims_extracted=0; After: 18,939

Verification queries:

SELECT COUNT(*) FROM paper_claims; -- 281
SELECT COUNT(*) FROM papers WHERE COALESCE(claims_extracted, 0) != 0; -- 128
SELECT COUNT(*) FROM papers WHERE COALESCE(claims_extracted, 0) = 0; -- 18,939
SELECT p.pmid, p.claims_extracted FROM papers p WHERE p.pmid IN (...30 pmids...); -- all 30 marked
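The before/after bookkeeping behind these queries can be expressed as a small check. The sketch below assumes SQLite via Python's standard library; `missing_claims_count` and `backlog_delta_ok` are hypothetical helper names, and only the counting query itself is taken from the log.

```python
import sqlite3

# Same backlog filter as the verification query above.
MISSING_COUNT_SQL = (
    "SELECT COUNT(*) FROM papers WHERE COALESCE(claims_extracted, 0) = 0"
)


def missing_claims_count(conn: sqlite3.Connection) -> int:
    """Count papers that still have no claim extraction result."""
    return conn.execute(MISSING_COUNT_SQL).fetchone()[0]


def backlog_delta_ok(before: int, after: int, papers_marked: int) -> bool:
    """True when the backlog shrank by exactly the number of papers marked."""
    return before - after == papers_marked
```

With the figures recorded in this entry, `backlog_delta_ok(18969, 18939, 30)` holds: the backlog dropped by 30, matching the 30 papers marked (28 with claims, 2 with -1).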

2026-04-22 22:55 UTC - Task 87a0c772 execution

paper_claims table: Created via migration 109 with full schema (PK, FK to papers, claim_type CHECK, confidence CHECK, history table, audit trigger).
Claims extracted: 100 claims from 17 of 20 papers. The other 3 papers had abstracts with no extractable mechanistic or causal claims and were marked claims_extracted=-1.
Hypotheses linked: 55 evidence_entries created via claim-to-hypothesis matching.
Verification: 17 papers have 2+ claims each; 3 papers (25242045, 29728651, 32296183) are marked claims_extracted=-1 (no claims).

Files created:

migrations/109_add_paper_claims_table.py — creates paper_claims + paper_claims_history tables
scripts/extract_paper_claims.py — LLM-based claim extraction + hypothesis linking
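For orientation, a table with the features listed for migration 109 (primary key, foreign key to papers, CHECK constraints on claim_type and confidence, a history table, and an audit trigger) might look roughly like the sketch below. It is not the contents of migrations/109_add_paper_claims_table.py: SQLite is assumed as the engine, and every column name, the allowed claim_type values, the 0 to 1 confidence range, and the trigger name are hypothetical.

```python
import sqlite3

# Illustrative schema only; names and constraint values are assumptions.
SCHEMA = """
CREATE TABLE papers (paper_id TEXT PRIMARY KEY, pmid TEXT);

CREATE TABLE paper_claims (
    id INTEGER PRIMARY KEY,
    paper_id TEXT NOT NULL REFERENCES papers(paper_id),
    claim_text TEXT NOT NULL,
    claim_type TEXT NOT NULL
        CHECK (claim_type IN ('mechanistic', 'causal', 'correlative')),
    confidence REAL CHECK (confidence BETWEEN 0.0 AND 1.0)
);

CREATE TABLE paper_claims_history (
    history_id INTEGER PRIMARY KEY,
    claim_id INTEGER NOT NULL,
    claim_type TEXT,
    confidence REAL,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Audit trigger: keep the previous values whenever a claim is updated.
CREATE TRIGGER paper_claims_audit AFTER UPDATE ON paper_claims
BEGIN
    INSERT INTO paper_claims_history (claim_id, claim_type, confidence)
    VALUES (OLD.id, OLD.claim_type, OLD.confidence);
END;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(SCHEMA)
    return conn
```

Under this sketch, inserting a row with claim_type='comparative' raises `sqlite3.IntegrityError`, which is the kind of CHECK violation that makes a label-normalization step necessary before insert.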

Results:

20 papers processed
17 papers with 2+ claims (17 × 2 = 34 ≥ 20 ✓)
55 evidence_entries added to link claims to hypotheses
Before: 18,952 papers had claims_extracted=0; After: 18,932

Verification queries:

SELECT COUNT(*) FROM paper_claims; -- 100
SELECT COUNT(*) FROM evidence_entries WHERE methodology='claim_extraction'; -- 55
SELECT p.pmid, COUNT(pc.id) as cnt FROM papers p JOIN paper_claims pc ON p.paper_id = pc.paper_id GROUP BY p.pmid HAVING COUNT(pc.id) >= 2; -- 17 rows

Tasks using this spec (4)

[Forge] Extract structured claims from 30 papers missing cla (Agent Ecosystem, done, P82)
[Atlas] Extract structured scientific claims from 20 high-pr (Atlas, done, P77)
[Forge] Extract structured claims from 30 papers missing cla (Forge, done, P82)
[Forge] Build real data pipeline: extract structured finding (Forge, open, P89)

File: quest_engine_paper_claim_extraction_spec.md

Modified: 2026-04-24 07:15

Size: 3.8 KB