[Atlas] Backfill PubMed abstracts for 40 papers missing them
Status: done

Many papers in the papers table have a PMID but no abstract because the abstract fetch failed or was skipped. Without abstracts, LLM-based claim extraction and gap generation can't process these papers.

## Steps

1. Query: `SELECT id, pmid, title FROM papers WHERE (abstract IS NULL OR abstract = '') AND pmid IS NOT NULL LIMIT 40`
2. For each paper: call paper_cache.get_paper(pmid) to fetch the full record including abstract
3. Update: `UPDATE papers SET abstract = '', updated_at = NOW() WHERE id = ''`
4. Commit batch updates

## Acceptance Criteria

- [ ] 40 papers checked for missing abstracts
- [ ] All papers with retrievable abstracts now have abstract field populated
- [ ] Changes committed and pushed
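The steps in the description can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the project's actual script: it uses an in-process `sqlite3` connection (so `CURRENT_TIMESTAMP` stands in for `NOW()`), and the `fetch` callable is a hypothetical stand-in for `paper_cache.get_paper`, whose real return shape is not shown here; the `"abstract"` key is assumed.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical stand-in for paper_cache.get_paper: takes a PMID and returns
# a record dict or None. The "abstract" key is an assumption.
Fetcher = Callable[[str], Optional[dict]]

SELECT_MISSING = """
    SELECT id, pmid, title FROM papers
    WHERE (abstract IS NULL OR abstract = '') AND pmid IS NOT NULL
    LIMIT ?
"""


def backfill_abstracts(conn: sqlite3.Connection, fetch: Fetcher, limit: int = 40):
    """Fill missing abstracts; return (updated_ids, skipped_ids)."""
    updated, skipped = [], []
    # Materialize the batch first so updates do not interfere with the scan.
    rows = conn.execute(SELECT_MISSING, (limit,)).fetchall()
    for paper_id, pmid, _title in rows:
        record = fetch(pmid) or {}
        abstract = (record.get("abstract") or "").strip()
        if not abstract:
            # No real abstract found: skip rather than write a placeholder.
            skipped.append(paper_id)
            continue
        conn.execute(
            "UPDATE papers SET abstract = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE id = ?",
            (abstract, paper_id),
        )
        updated.append(paper_id)
    conn.commit()  # one commit for the whole batch
    return updated, skipped
```

Passing the fetcher in as a parameter keeps the loop testable without network access, and returning the skipped IDs makes it easy to note which papers had no retrievable abstract.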

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Atlas] Backfill PubMed abstracts for 70 papers across 2 runs; limit 40/run [task:a9be7d2b-e95d-4875-a45c-be6cbfabd8e6] 2026-04-25
Spec File

Goal

Backfill missing paper abstracts where PubMed or another trusted provider supplies real metadata. Abstract coverage improves search, evidence linking, KG extraction, and agent context.

Acceptance Criteria

☑ A concrete batch of papers gains non-empty abstracts
☑ Each abstract comes from PubMed, Semantic Scholar, OpenAlex, CrossRef, or another cited source
☑ Papers that genuinely lack abstracts are skipped with a note rather than filled with placeholders
☑ Before/after missing-abstract counts are recorded

Approach

  • Query papers where abstract IS NULL OR length(abstract) < 10.
  • Fetch metadata through paper_cache.get_paper or the existing multi-provider paper cache.
  • Update only rows where a real abstract is found.
  • Verify the updated rows and remaining backlog count.
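The before/after bookkeeping in the approach above can be sketched as follows. This is an illustrative sketch that assumes a `sqlite3` connection and a `papers` table with an `abstract` column; the helper names `count_missing` and `backlog_report` are hypothetical and not taken from the project's script.

```python
import sqlite3

# Predicate from the Approach list: NULL or shorter than 10 characters.
MISSING_PREDICATE = "abstract IS NULL OR length(abstract) < 10"


def count_missing(conn: sqlite3.Connection) -> int:
    """Count papers whose abstract is missing or too short to be real."""
    # A bare COUNT(*) returns one row, so it takes no ORDER BY or LIMIT.
    query = f"SELECT COUNT(*) FROM papers WHERE {MISSING_PREDICATE}"
    return conn.execute(query).fetchone()[0]


def backlog_report(before: int, after: int) -> str:
    """Format the before/after counts for a work-log entry."""
    cleared = before - after
    return f"Before: {before} missing/short, after: {after} ({cleared} papers cleared)"
```

Calling `count_missing` once before the batch and once after gives the two numbers that the acceptance criteria ask to be recorded.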

    Dependencies

    • dd0487d3-38a - Forge quest
    • paper_cache metadata providers

    Dependents

    • Literature search, KG extraction, and hypothesis evidence pipelines

    Work Log

    2026-04-20 - Quest engine template

    • Created reusable spec for quest-engine generated paper abstract backfill tasks.

    2026-04-22 - Task ff601c3f-63ca-4821-b005-89255c68bfec

    • Attempted to backfill abstracts for papers with abstract IS NULL OR LENGTH(abstract) < 10.
    • Before count: 221 papers with missing/short abstracts
    • After count: 219 papers (2 updated)
    • Conclusion: Task asked for 30 papers; only 2 had retrievable abstracts from PubMed/CrossRef. Remaining papers are either hallucinated entries or papers whose journals don't provide abstracts.

    2026-04-23 01:50 UTC — Slot 70

    • Rebased onto main to resolve prior conflicts; applied upstream spec acceptance criteria
    • Verified current state: 158 papers still missing abstracts with numeric PMIDs
    • Ran backfill script (backfill_abstracts.py from slot 72 commit b3572ca97):
    - Updated 29 of 30 target papers with real abstracts via PubMed efetch retmode=text
    - 1 paper skipped: PMID 32909228 ("Behind the Mask") — editorial with no abstract body
    - Before: 201 missing/short → After: 172 (29 papers cleared)
    • Fixed minor SQL error in verification query (ORDER BY without aggregate on grouped query)
    • Evidence: Selected papers after backfill have 1390–4011 char abstracts from PubMed
    • Commit: b3572ca97 (script) + e461911b9 (fix + spec)
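For reference, a PubMed efetch request with `retmode=text` can be built as sketched below. This is not the contents of backfill_abstracts.py, which is not shown in this log; the function names are hypothetical, and `rettype=abstract` is an assumption about which record type the script requests. The text response contains the citation, title, and authors as well as the abstract, so a caller would still need to extract the abstract body.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_abstract_url(pmid: str) -> str:
    """Build the efetch URL for one numeric PMID (plain-text abstract view)."""
    if not pmid.isdigit():
        raise ValueError(f"expected a numeric PMID, got {pmid!r}")
    params = {
        "db": "pubmed",
        "id": pmid,
        "rettype": "abstract",  # assumption: the script's rettype is not shown
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_abstract_text(pmid: str, timeout: float = 30.0) -> str:
    """Download the plain-text record for a PMID (requires network access)."""
    with urlopen(efetch_abstract_url(pmid), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Rejecting non-numeric PMIDs up front matches the log's focus on papers with numeric PMIDs and avoids sending malformed identifiers to the API.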

    2026-04-25 19:36 UTC — Task a9be7d2b

    • Ran backfill_abstracts.py against 40 papers with numeric PMIDs
    • First batch: 38 of 40 updated with real abstracts via PubMed efetch retmode=text; 2 skipped (PMID 25302784 "Echogenic: a poem" and PMID 29274069 "New Horizons" — genuine poems/short-form with no abstract)
    • Second batch (after rebase): 32 of 40 updated; 8 skipped (editorials, letters, poems, very short replies)
    • Before count: 1724 missing/short → After count: 1654 (70 papers cleared across both runs)
    • Skipped papers are genuine content types without abstracts (poems, editorials, letters to editor, news items)
    • Fixed SQL error: removed ORDER BY created_at DESC LIMIT N from verification COUNT query
    • Script updated to process 40 papers per run (was 30)
    • Commit: (this run)
