SciDEX — Task: [Forge] Add PubMed abstracts to 20 papers citing n

Papers on neurodegeneration targets that lack abstracts reduce the quality of hypothesis evidence linking and debate context.

Verification:

- 20 papers on AD, PD, ALS, or HD targets gain real PubMed abstracts
- Each abstract is fetched via paper_cache.get_paper or direct PubMed fetch
- No generated or placeholder abstracts are stored

Start by selecting papers from PostgreSQL (dbname=scidex user=scidex_app) where abstract IS NULL or length(abstract) < 50 and the paper appears in evidence_for/evidence_against of active hypotheses. Use paper_cache to fetch abstracts by PMID, DOI, or title. Update only rows with real abstracts found and verify before/after counts.

Spec File

Goal

Backfill missing paper abstracts where PubMed or another trusted provider supplies real metadata. Abstract coverage improves search, evidence linking, KG extraction, and agent context.

Acceptance Criteria

☑ A concrete batch of papers gains non-empty abstracts

☑ Each abstract comes from PubMed, Semantic Scholar, OpenAlex, CrossRef, or another cited source

☑ Papers that genuinely lack abstracts are skipped with a note rather than filled with placeholders

☑ Before/after missing-abstract counts are recorded

Approach

Query papers where abstract IS NULL OR length(abstract) < 10.

Fetch metadata through paper_cache.get_paper or the existing multi-provider paper cache.

Update only rows where a real abstract is found.

Verify the updated rows and remaining backlog count.
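
The steps above can be sketched as a small planning function. This is a minimal illustration, not the contents of backfill_abstracts.py: the row shape (dicts with "pmid" and "abstract" keys), the name plan_backfill, and the idea that the fetcher returns a dict with an "abstract" key are all assumptions. The fetcher is passed in as a callable so that paper_cache.get_paper, whose exact signature is not shown in this spec, can be wrapped by the caller.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

MIN_ABSTRACT_CHARS = 10  # mirrors the length(abstract) < 10 filter in the Approach


def needs_abstract(abstract: Optional[str]) -> bool:
    """True when an abstract is missing or too short to count as real."""
    return abstract is None or len(abstract) < MIN_ABSTRACT_CHARS


def plan_backfill(
    rows: Iterable[dict],
    fetch: Callable[[str], Optional[dict]],
) -> Tuple[Dict[str, str], List[str]]:
    """Decide which rows to update and which to skip.

    `rows` are hypothetical dicts with "pmid" and "abstract" keys.
    `fetch` is a caller-supplied lookup (assumed to return a dict with an
    "abstract" key, or None); it stands in for the paper cache.
    """
    updates: Dict[str, str] = {}
    skipped: List[str] = []
    for row in rows:
        if not needs_abstract(row.get("abstract")):
            continue  # already has a usable abstract
        record = fetch(row["pmid"]) or {}
        abstract = (record.get("abstract") or "").strip()
        if needs_abstract(abstract):
            skipped.append(row["pmid"])  # no real abstract found: skip, never fill
        else:
            updates[row["pmid"]] = abstract
    return updates, skipped
```

Keeping the decision separate from the database write means the skip list can be logged as the "skipped with a note" record, and only the entries in `updates` ever reach an UPDATE statement.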

Dependencies

dd0487d3-38a - Forge quest
paper_cache metadata providers

Dependents

Literature search, KG extraction, and hypothesis evidence pipelines

Work Log

2026-04-20 - Quest engine template

Created reusable spec for quest-engine generated paper abstract backfill tasks.

2026-04-22 - Task ff601c3f-63ca-4821-b005-89255c68bfec

Attempted to backfill abstracts for papers with abstract IS NULL OR LENGTH(abstract) < 10.
Before count: 221 papers with missing/short abstracts
After count: 219 papers (2 updated)
Conclusion: Task asked for 30 papers; only 2 had retrievable abstracts from PubMed/CrossRef. Remaining papers are either hallucinated entries or papers whose journals don't provide abstracts.

2026-04-23 01:50 UTC - Slot 70

Rebased onto main to resolve prior conflicts; applied upstream spec acceptance criteria
Verified current state: 158 papers with numeric PMIDs are still missing abstracts
Ran backfill script (backfill_abstracts.py from slot 72 commit b3572ca97):

- Updated 29 of 30 target papers with real abstracts via PubMed efetch retmode=text
- 1 paper skipped: PMID 32909228 ("Behind the Mask") — editorial with no abstract body
- Before: 201 missing/short → After: 172 (29 papers cleared)

Fixed minor SQL error in verification query (ORDER BY without aggregate on grouped query)
Evidence: Selected papers after backfill have 1390–4011 char abstracts from PubMed
Commit: b3572ca97 (script) + e461911b9 (fix + spec)
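
For reference, a PubMed efetch request with retmode=text can be built as in the sketch below. It is an illustration under stated assumptions rather than the code in backfill_abstracts.py: the function names are hypothetical, and rettype=abstract is assumed because the log only records retmode=text. With these parameters efetch returns a formatted citation that includes the abstract, so the caller still has to extract the abstract body from the text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_abstract_url(pmid: str) -> str:
    """Build the efetch URL for one numeric PMID (hypothetical helper)."""
    if not pmid.isdigit():
        raise ValueError(f"expected a numeric PMID, got {pmid!r}")
    params = {
        "db": "pubmed",
        "id": pmid,
        "rettype": "abstract",  # assumption: not stated in the work log
        "retmode": "text",      # as recorded in the work log
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_abstract_text(pmid: str, timeout: float = 30.0) -> str:
    """Fetch the plain-text record for a PMID; requires network access."""
    with urlopen(efetch_abstract_url(pmid), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Rejecting non-numeric PMIDs up front matches the log's restriction to papers with numeric PMIDs and avoids sending malformed identifiers to the service.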

2026-04-25 19:36 UTC - Task a9be7d2b

Ran backfill_abstracts.py against 40 papers with numeric PMIDs
First batch: 38 of 40 updated with real abstracts via PubMed efetch retmode=text; 2 skipped (PMID 25302784 "Echogenic: a poem" and PMID 29274069 "New Horizons"), both genuine poems or short-form pieces with no abstract
Second batch (after rebase): 32 of 40 updated; 8 skipped (editorials, letters, poems, very short replies)
Before count: 1724 missing/short → After count: 1654 (70 papers cleared across both runs)
Skipped papers are genuine content types without abstracts (poems, editorials, letters to editor, news items)
Fixed SQL error: removed ORDER BY created_at DESC LIMIT N from verification COUNT query
Script updated to process 40 papers per run (was 30)
Commit: (this run)
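
The verification fix noted in this entry amounts to running the count as a bare aggregate. In PostgreSQL, a SELECT COUNT(*) that also orders by a non-aggregated column such as created_at is rejected, because that column is neither grouped nor aggregated. The sketch below shows the corrected shape of the query. The table name papers and the pmid column are assumptions, and an in-memory SQLite database stands in for PostgreSQL only so that the example is self-contained; SQLite is more lenient and would not reproduce the original error.

```python
import sqlite3

# Corrected form: a bare aggregate with no ORDER BY or LIMIT.
COUNT_MISSING_SQL = (
    "SELECT COUNT(*) FROM papers "
    "WHERE abstract IS NULL OR length(abstract) < 10"
)


def count_missing(conn) -> int:
    """Count papers whose abstract is missing or shorter than 10 characters."""
    cur = conn.cursor()
    cur.execute(COUNT_MISSING_SQL)
    return cur.fetchone()[0]


def demo():
    """Record before/after counts around one update on a toy table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema; the real table has more columns.
    conn.execute("CREATE TABLE papers (pmid TEXT, abstract TEXT, created_at TEXT)")
    conn.executemany(
        "INSERT INTO papers VALUES (?, ?, ?)",
        [
            ("1", None, "2026-04-01"),
            ("2", "short", "2026-04-02"),
            ("3", "A real abstract comfortably longer than ten characters.", "2026-04-03"),
        ],
    )
    before = count_missing(conn)
    conn.execute(
        "UPDATE papers SET abstract = ? WHERE pmid = ?",
        ("Fetched abstract text of adequate length.", "1"),
    )
    after = count_missing(conn)
    conn.close()
    return before, after
```

Calling count_missing once before the batch and once after it gives the before/after figures that each work log entry records.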