Goal
Backfill missing paper abstracts where PubMed or another trusted provider supplies real metadata. Abstract coverage improves search, evidence linking, KG extraction, and agent context.
Acceptance Criteria
☑ A concrete batch of papers gains non-empty abstracts
☑ Each abstract comes from PubMed, Semantic Scholar, OpenAlex, CrossRef, or another cited source
☑ Papers that genuinely lack abstracts are skipped with a note rather than filled with placeholders
☑ Before/after missing-abstract counts are recorded
Approach
Query papers where abstract IS NULL OR length(abstract) < 10.
Fetch metadata through paper_cache.get_paper or the existing multi-provider paper cache.
Update only rows where a real abstract is found.
Verify the updated rows and remaining backlog count.
Dependencies
- dd0487d3-38a - Forge quest
- paper_cache metadata providers
Dependents
- Literature search, KG extraction, and hypothesis evidence pipelines
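The steps under Approach can be sketched roughly as follows. This is a minimal illustration, not the actual backfill script: the `papers` table and `abstract` column come from the query above, while the `pmid` column, the `backfill` and `count_missing` helpers, and the injected `fetch_abstract` callable are assumptions made for the example. In practice `fetch_abstract` would wrap paper_cache.get_paper, whose exact signature is not recorded in this spec.

```python
import sqlite3
from typing import Callable, Dict, Optional

MIN_ABSTRACT_LEN = 10  # same threshold as the Approach query


def count_missing(conn: sqlite3.Connection) -> int:
    """Count papers whose abstract is NULL or shorter than the threshold."""
    row = conn.execute(
        "SELECT COUNT(*) FROM papers "
        "WHERE abstract IS NULL OR length(abstract) < ?",
        (MIN_ABSTRACT_LEN,),
    ).fetchone()
    return row[0]


def backfill(
    conn: sqlite3.Connection,
    fetch_abstract: Callable[[str], Optional[str]],  # hypothetical provider hook
    limit: int = 30,
) -> Dict[str, object]:
    """Fill abstracts for up to `limit` papers; skip papers with no real abstract."""
    before = count_missing(conn)
    targets = conn.execute(
        "SELECT pmid FROM papers "
        "WHERE abstract IS NULL OR length(abstract) < ? "
        "ORDER BY pmid LIMIT ?",
        (MIN_ABSTRACT_LEN, limit),
    ).fetchall()

    updated, skipped = [], []
    for (pmid,) in targets:
        abstract = (fetch_abstract(pmid) or "").strip()
        if len(abstract) >= MIN_ABSTRACT_LEN:
            # Only write when the provider returned a real abstract.
            conn.execute(
                "UPDATE papers SET abstract = ? WHERE pmid = ?", (abstract, pmid)
            )
            updated.append(pmid)
        else:
            # No placeholder text: the row is left untouched and noted as skipped.
            skipped.append(pmid)
    conn.commit()
    return {
        "before": before,
        "after": count_missing(conn),
        "updated": updated,
        "skipped": skipped,
    }


# Usage with an in-memory database and a stand-in provider (synthetic data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE papers (pmid TEXT PRIMARY KEY, abstract TEXT)")
demo.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("1", None), ("2", "n/a"), ("3", "An existing abstract that is long enough.")],
)
fake_provider = {"1": "A real abstract returned by the provider."}
print(backfill(demo, fake_provider.get))
```

Passing the provider in as a callable keeps the update rule (write only real abstracts, record before and after counts) testable without network access.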
Work Log
2026-04-20 - Quest engine template
- Created reusable spec for quest-engine generated paper abstract backfill tasks.
2026-04-22 - Task ff601c3f-63ca-4821-b005-89255c68bfec
- Attempted to backfill abstracts for papers with abstract IS NULL OR LENGTH(abstract) < 10.
- Before count: 221 papers with missing/short abstracts
- After count: 219 papers (2 updated)
- Conclusion: Task asked for 30 papers; only 2 had retrievable abstracts from PubMed/CrossRef. Remaining papers are either hallucinated entries or papers whose journals don't provide abstracts.
2026-04-23 01:50 UTC — Slot 70
- Rebased onto main to resolve prior conflicts; applied upstream spec acceptance criteria
- Verified current state: 158 papers still missing abstracts with numeric PMIDs
- Ran backfill script (backfill_abstracts.py from slot 72 commit b3572ca97):
- Updated 29 of 30 target papers with real abstracts via PubMed efetch retmode=text
- 1 paper skipped: PMID 32909228 ("Behind the Mask") — editorial with no abstract body
- Before: 201 missing/short → After: 172 (29 papers cleared)
- Fixed minor SQL error in verification query (ORDER BY without aggregate on grouped query)
- Evidence: Selected papers after backfill have 1390–4011 char abstracts from PubMed
- Commit: b3572ca97 (script) + e461911b9 (fix + spec)
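The verification query itself is not recorded in this log. As a hedged illustration of the kind of fix described, the sketch below groups papers by abstract status and orders by an aggregate, which is the form that stricter engines such as PostgreSQL require for a grouped query. The `status` buckets, the `VERIFY_SQL` constant, and the `abstract_status_counts` helper are assumptions for the example; only the `papers` table and the length threshold of 10 come from the spec.

```python
import sqlite3

# Every column in ORDER BY is either a grouped expression (status)
# or an aggregate (n), so the query is valid on strict SQL engines too.
VERIFY_SQL = """
SELECT
    CASE
        WHEN abstract IS NULL THEN 'missing'
        WHEN length(abstract) < 10 THEN 'short'
        ELSE 'ok'
    END AS status,
    COUNT(*) AS n
FROM papers
GROUP BY status
ORDER BY n DESC, status
"""


def abstract_status_counts(conn: sqlite3.Connection):
    """Return (status, count) rows, largest bucket first."""
    return conn.execute(VERIFY_SQL).fetchall()


# Usage with synthetic rows.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE papers (pmid TEXT PRIMARY KEY, abstract TEXT)")
demo.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [
        ("1", None),
        ("2", "n/a"),
        ("3", "A sufficiently long abstract."),
        ("4", "Another sufficiently long abstract."),
    ],
)
print(abstract_status_counts(demo))
```

Summing the 'missing' and 'short' buckets gives the remaining backlog count that the before/after figures report.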
2026-04-22 16:45 UTC — Slot 72
- Ran backfill_abstracts.py against 30 papers with numeric PMIDs
- Fixed broken XML-mode parsing by switching to retmode=text + custom text parser
- 29 of 30 papers updated with real abstracts from PubMed (efetch retmode=text)
- 1 paper skipped: PMID 32909228 ("Behind the Mask") — genuine editorial with no abstract body
- Before count: 229 missing/short → After count: 200 (29 papers cleared)
- Acceptable failure mode per spec: papers genuinely lacking abstracts are skipped with a note
- Commit: b3572ca97
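The custom text parser from this slot is not reproduced in the log. A minimal sketch of one way to handle efetch retmode=text output is shown below, assuming the response separates the citation line, title, authors, affiliations, abstract, and identifiers with blank lines. The helper names, the prefix list, and the 200-character minimum are assumptions for illustration, and the heuristic may misfire on records with unusually long author lists.

```python
import re
from typing import Optional
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Blocks starting with these prefixes are metadata, not abstract text (assumed list).
_NON_ABSTRACT_PREFIXES = (
    "Author information:",
    "DOI:",
    "PMID:",
    "PMCID:",
    "Copyright",
    "©",
    "Conflict of interest",
    "Comment in",
    "Erratum in",
)


def efetch_text_url(pmid: str) -> str:
    """Build the efetch URL for a plain-text abstract record."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def extract_abstract(record: str, min_len: int = 200) -> Optional[str]:
    """Return the longest non-metadata block, or None if nothing looks like an abstract."""
    blocks = [" ".join(b.split()) for b in re.split(r"\n\s*\n", record.strip())]
    candidates = [
        b
        for b in blocks[1:]  # the first block is the journal citation line
        if len(b) >= min_len and not b.startswith(_NON_ABSTRACT_PREFIXES)
    ]
    return max(candidates, key=len) if candidates else None


# Usage with a synthetic record (not real PubMed data).
SAMPLE = (
    "1. Example J. 2020;1(1):1-2. doi: 10.0000/example.\n\n"
    "A synthetic title for illustration.\n\n"
    "Doe J(1), Roe R(1).\n\n"
    "Author information:\n(1)Example University.\n\n"
    + "Synthetic abstract sentence. " * 10
    + "\n\nDOI: 10.0000/example\nPMID: 00000000"
)
print(efetch_text_url("00000000"))
print(extract_abstract(SAMPLE))
```

Returning None for records without an abstract-sized block matches the acceptable failure mode above: editorials are skipped with a note rather than filled with placeholder text.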