[Forge] Add PubMed abstracts to papers missing them

Goal

Backfill missing paper abstracts where PubMed or another trusted provider supplies real metadata. Abstract coverage improves search, evidence linking, KG extraction, and agent context.

Acceptance Criteria

☑ A concrete batch of papers gains non-empty abstracts
☑ Each abstract comes from PubMed, Semantic Scholar, OpenAlex, CrossRef, or another cited source
☑ Papers that genuinely lack abstracts are skipped with a note rather than filled with placeholders
☑ Before/after missing-abstract counts are recorded

Approach

  • Query papers where abstract IS NULL OR length(abstract) < 10.
  • Fetch metadata through paper_cache.get_paper or the existing multi-provider paper cache.
  • Update only rows where a real abstract is found.
  • Verify the updated rows and remaining backlog count.
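
The four steps above can be sketched as follows. This is a minimal illustration, not the project's script: it assumes a SQLite table named `papers` with `pmid` and `abstract` columns, and it takes a hypothetical `fetch_abstract` callable standing in for `paper_cache.get_paper`, whose real signature is not given in this spec.

```python
import sqlite3
from typing import Callable, Optional

MIN_ABSTRACT_LEN = 10  # mirrors the spec's length(abstract) < 10 threshold

# Assumed schema: papers(pmid TEXT, abstract TEXT)
MISSING_FILTER = "abstract IS NULL OR length(abstract) < ?"


def count_missing(conn: sqlite3.Connection) -> int:
    """Return the number of papers with a missing or too-short abstract."""
    row = conn.execute(
        f"SELECT COUNT(*) FROM papers WHERE {MISSING_FILTER}",
        (MIN_ABSTRACT_LEN,),
    ).fetchone()
    return row[0]


def backfill(
    conn: sqlite3.Connection,
    fetch_abstract: Callable[[str], Optional[str]],  # hypothetical provider hook
    limit: int = 30,
) -> dict:
    """Fill abstracts for up to `limit` papers and report before/after counts."""
    before = count_missing(conn)
    rows = conn.execute(
        f"SELECT pmid FROM papers WHERE {MISSING_FILTER} ORDER BY pmid LIMIT ?",
        (MIN_ABSTRACT_LEN, limit),
    ).fetchall()

    updated, skipped = [], []
    for (pmid,) in rows:
        abstract = (fetch_abstract(pmid) or "").strip()
        if len(abstract) >= MIN_ABSTRACT_LEN:
            # Only write when the provider returned a real abstract.
            conn.execute(
                "UPDATE papers SET abstract = ? WHERE pmid = ?", (abstract, pmid)
            )
            updated.append(pmid)
        else:
            # No placeholder is written; the paper is reported as skipped.
            skipped.append(pmid)
    conn.commit()

    return {
        "before": before,
        "after": count_missing(conn),
        "updated": updated,
        "skipped": skipped,
    }
```

The returned dictionary carries the before/after counts and the skipped PMIDs, so a work-log entry can be written directly from it.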

Dependencies

  • dd0487d3-38a - Forge quest
  • paper_cache metadata providers

Dependents

  • Literature search, KG extraction, and hypothesis evidence pipelines

Work Log

    2026-04-20 - Quest engine template

    • Created reusable spec for quest-engine generated paper abstract backfill tasks.

    2026-04-22 - Task ff601c3f-63ca-4821-b005-89255c68bfec

    • Attempted to backfill abstracts for papers with abstract IS NULL OR LENGTH(abstract) < 10.
    • Before count: 221 papers with missing/short abstracts
    • After count: 219 papers (2 updated)
    • Conclusion: The task asked for 30 papers, but only 2 had abstracts retrievable from PubMed/CrossRef. The remaining papers are either hallucinated entries or papers whose journals do not provide abstracts.

    2026-04-23 01:50 UTC - Slot 70

    • Rebased onto main to resolve prior conflicts; applied upstream spec acceptance criteria
    • Verified current state: 158 papers still missing abstracts with numeric PMIDs
    • Ran backfill script (backfill_abstracts.py from slot 72 commit b3572ca97):
      - Updated 29 of 30 target papers with real abstracts via PubMed efetch retmode=text
      - 1 paper skipped: PMID 32909228 ("Behind the Mask"), an editorial with no abstract body
      - Before: 201 missing/short → After: 172 (29 papers cleared)
    • Fixed minor SQL error in verification query (ORDER BY without aggregate on grouped query)
    • Evidence: Selected papers after backfill have 1390–4011 char abstracts from PubMed
    • Commit: b3572ca97 (script) + e461911b9 (fix + spec)

    2026-04-22 16:45 UTC - Slot 72

    • Ran backfill_abstracts.py against 30 papers with numeric PMIDs
    • Fixed broken XML-mode parsing by switching to retmode=text + custom text parser
    • 29 of 30 papers updated with real abstracts from PubMed (efetch retmode=text)
    • 1 paper skipped: PMID 32909228 ("Behind the Mask") — genuine editorial with no abstract body
    • Before count: 229 missing/short → After count: 200 (29 papers cleared)
    • Acceptable failure mode per spec: papers genuinely lacking abstracts are skipped with a note
    • Commit: b3572ca97
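
The retmode=text approach noted in the Slot 72 entry could look roughly like the sketch below. It is not the parser from backfill_abstracts.py, which is not reproduced in this spec. The `rettype=abstract` parameter, the metadata prefix list, and the 200-character minimum are all assumptions, and picking the longest remaining block is a heuristic that can misfire on records with very long author lists.

```python
from typing import Optional
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumed prefixes of blocks that hold metadata rather than abstract body.
NON_ABSTRACT_PREFIXES = (
    "Author information:",
    "DOI:",
    "PMID:",
    "PMCID:",
    "Copyright",
    "©",
    "Conflict of interest",
)


def build_efetch_url(pmid: str) -> str:
    """Build an E-utilities efetch URL that requests a plain-text abstract."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def extract_abstract(text: str, min_len: int = 200) -> Optional[str]:
    """Heuristically pick the abstract body out of efetch plain-text output.

    The text is split into blank-line-separated blocks, metadata blocks are
    dropped, and the longest remaining block is returned if it is at least
    `min_len` characters long. Returns None when nothing qualifies, which is
    the expected outcome for editorials without an abstract body.
    """
    normalized = text.replace("\r\n", "\n")
    # Collapse internal whitespace so wrapped lines become one paragraph.
    blocks = [" ".join(block.split()) for block in normalized.split("\n\n")]
    candidates = [
        b for b in blocks if b and not b.startswith(NON_ABSTRACT_PREFIXES)
    ]
    if not candidates:
        return None
    longest = max(candidates, key=len)
    return longest if len(longest) >= min_len else None
```

Returning `None` rather than an empty string keeps the skip path explicit, which matches the acceptance criterion that papers genuinely lacking abstracts are skipped with a note rather than filled with placeholders.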

    Tasks using this spec (5)
    [Forge] Add PubMed abstracts to 30 papers missing them
    Forge done P82
    [Forge] Add PubMed abstracts to 30 papers missing them
    Forge done P82
    [Forge] Add PubMed abstracts to 30 papers missing them
    Forge done P82
    [Forge] Add PubMed abstracts to 20 papers citing neurodegene
    Forge open P82
    [Atlas] Backfill PubMed abstracts for 40 papers missing them
    Atlas open P63
    File: quest_engine_paper_abstract_backfill_spec.md
    Modified: 2026-04-24 07:15
    Size: 3.0 KB