Goal
Backfill missing paper abstracts where PubMed or another trusted provider supplies real metadata. Abstract coverage improves search, evidence linking, KG extraction, and agent context.
Acceptance Criteria
☑ A concrete batch of papers gains non-empty abstracts
☑ Each abstract comes from PubMed, Semantic Scholar, OpenAlex, CrossRef, or another cited source
☑ Papers that genuinely lack abstracts are skipped with a note rather than filled with placeholders
☑ Before/after missing-abstract counts are recorded
Approach
Query papers where abstract IS NULL OR length(abstract) < 10.
Fetch metadata through paper_cache.get_paper or the existing multi-provider paper cache.
Update only rows where a real abstract is found.
Verify the updated rows and the remaining backlog count.
Dependencies
- dd0487d3-38a - Forge quest
- paper_cache metadata providers
Dependents
- Literature search, KG extraction, and hypothesis evidence pipelines
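The four Approach steps (query, fetch, update, verify) can be sketched as below. This is a minimal illustration rather than the actual backfill_abstracts.py: it assumes a SQLite connection, a papers table with pmid and abstract columns, and a caller-supplied fetch_abstract function standing in for paper_cache.get_paper, whose real signature is not given in this spec. The PMIDs in the usage lines are made up.

```python
import sqlite3
from typing import Callable, Optional

MIN_ABSTRACT_LEN = 10  # threshold taken from the Approach query
MISSING_WHERE = "abstract IS NULL OR length(abstract) < ?"


def count_missing(conn: sqlite3.Connection) -> int:
    """Return how many papers still have a missing or too-short abstract."""
    row = conn.execute(
        f"SELECT COUNT(*) FROM papers WHERE {MISSING_WHERE}",
        (MIN_ABSTRACT_LEN,),
    ).fetchone()
    return row[0]


def backfill(
    conn: sqlite3.Connection,
    fetch_abstract: Callable[[str], Optional[str]],  # hypothetical provider hook
    limit: int = 30,
) -> dict:
    """Update only rows for which the provider returns a real abstract."""
    before = count_missing(conn)
    targets = conn.execute(
        f"SELECT pmid FROM papers WHERE {MISSING_WHERE} LIMIT ?",
        (MIN_ABSTRACT_LEN, limit),
    ).fetchall()

    skipped = []
    for (pmid,) in targets:
        abstract = (fetch_abstract(pmid) or "").strip()
        if len(abstract) < MIN_ABSTRACT_LEN:
            skipped.append(pmid)  # recorded as skipped; no placeholder is written
            continue
        conn.execute(
            "UPDATE papers SET abstract = ? WHERE pmid = ?", (abstract, pmid)
        )
    conn.commit()
    return {"before": before, "after": count_missing(conn), "skipped": skipped}


# Usage with an in-memory database and a dict acting as the provider.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (pmid TEXT PRIMARY KEY, abstract TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("1", None), ("2", ""), ("3", "An existing abstract of adequate length.")],
)
provider = {"1": "A real abstract returned by the provider."}
result = backfill(conn, provider.get)
```

Returning the before and after counts together with the skipped identifiers keeps the acceptance criteria checkable from a single call.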
Work Log
2026-04-20 - Quest engine template
- Created reusable spec for quest-engine generated paper abstract backfill tasks.
2026-04-22 - Task ff601c3f-63ca-4821-b005-89255c68bfec
- Attempted to backfill abstracts for papers with abstract IS NULL OR LENGTH(abstract) < 10.
- Before count: 221 papers with missing/short abstracts
- After count: 219 papers (2 updated)
- Conclusion: Task asked for 30 papers; only 2 had retrievable abstracts from PubMed/CrossRef. Remaining papers are either hallucinated entries or papers whose journals don't provide abstracts.
2026-04-23 01:50 UTC - Slot 70
- Rebased onto main to resolve prior conflicts; applied upstream spec acceptance criteria
- Verified current state: 158 papers with numeric PMIDs are still missing abstracts
- Ran backfill script (backfill_abstracts.py from slot 72 commit b3572ca97):
- Updated 29 of 30 target papers with real abstracts via PubMed efetch retmode=text
- 1 paper skipped: PMID 32909228 ("Behind the Mask"), an editorial with no abstract body
- Before: 201 missing/short → After: 172 (29 papers cleared)
- Fixed minor SQL error in verification query (ORDER BY without aggregate on grouped query)
- Evidence: Selected papers after backfill have 1390–4011 char abstracts from PubMed
- Commit: b3572ca97 (script) + e461911b9 (fix + spec)
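For reference, a PubMed efetch request with retmode=text can be built roughly as follows. This is a sketch, not an excerpt from backfill_abstracts.py: the rettype=abstract parameter, the helper names, and the timeout are assumptions, since the log only records retmode=text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid: str) -> str:
    """Build the efetch URL for one PMID (rettype=abstract is an assumption)."""
    params = {
        "db": "pubmed",
        "id": pmid,
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_abstract_text(pmid: str, timeout: float = 30.0) -> str:
    """Download the plain-text record for one PMID (performs a network call)."""
    with urlopen(efetch_url(pmid), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


url = efetch_url("32909228")
```

The plain-text response is a formatted citation (title, authors, and so on) rather than the bare abstract, so a caller would still need to isolate the abstract body and treat records without one, such as editorials, as skips.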
2026-04-25 19:36 UTC - Task a9be7d2b
- Ran backfill_abstracts.py against 40 papers with numeric PMIDs
- First batch: 38 of 40 updated with real abstracts via PubMed efetch retmode=text; 2 skipped (PMID 25302784 "Echogenic: a poem" and PMID 29274069 "New Horizons", both genuine poems or short-form pieces with no abstract)
- Second batch (after rebase): 32 of 40 updated; 8 skipped (editorials, letters, poems, very short replies)
- Before count: 1724 missing/short → After count: 1654 (70 papers cleared across both runs)
- Skipped papers are genuine content types without abstracts (poems, editorials, letters to editor, news items)
- Fixed SQL error: removed ORDER BY created_at DESC LIMIT N from verification COUNT query
- Script updated to process 40 papers per run (was 30)
- Commit: (this run)
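The SQL fix above amounts to keeping the backlog count and the row sample as two separate queries. The sketch below illustrates that split under assumed names: an in-memory SQLite database, a papers table with pmid, abstract, and created_at columns, and invented rows. A bare COUNT(*) returns a single row, so ORDER BY and LIMIT add nothing to it, and stricter engines such as PostgreSQL reject ordering an ungrouped aggregate by a plain column.

```python
import sqlite3

MISSING = "abstract IS NULL OR length(abstract) < 10"

# Verification count: a plain aggregate with no ORDER BY or LIMIT.
COUNT_SQL = f"SELECT COUNT(*) FROM papers WHERE {MISSING}"

# Sampling rows is where ordering and limiting belong.
SAMPLE_SQL = (
    f"SELECT pmid, created_at FROM papers WHERE {MISSING} "
    "ORDER BY created_at DESC LIMIT ?"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (pmid TEXT, abstract TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?)",
    [
        ("101", None, "2026-04-01"),     # missing abstract
        ("102", "short", "2026-04-02"),  # shorter than 10 characters
        ("103", "A sufficiently long abstract.", "2026-04-03"),
    ],
)

remaining = conn.execute(COUNT_SQL).fetchone()[0]
newest_missing = conn.execute(SAMPLE_SQL, (1,)).fetchall()
```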