[Atlas] Papers enrichment from PubMed API


Task ID: e543e406-4cca-45d0-a062-2fabc24dc2d3 | Priority: 83 | Status: In Progress

Goal

Populate the papers table with full metadata for every PMID referenced in the hypotheses' evidence_for and evidence_against fields. Fetch title, authors, journal, year, abstract, and DOI from the NCBI PubMed E-utilities API. When this task was written, the papers table had 0 rows even though 118 hypotheses carried evidence citations.

Acceptance Criteria

☑ All PMIDs from hypotheses evidence fields are extracted
☑ Paper metadata is fetched from NCBI PubMed E-utilities API
☑ Papers table is populated with: pmid, title, authors, journal, year, abstract, doi, url
☑ Citation relationships tracked (which hypotheses cite each paper)
☑ Script handles rate limiting (NCBI: 3 requests/second without API key)
☑ Error handling for missing/invalid PMIDs
☑ Papers table has >0 rows after execution
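
Two of the criteria above (PMID extraction and rate limiting) can be sketched in a few lines. This is a minimal illustration, not the code in literature_manager.py: it assumes each evidence field holds a JSON list of objects with a "pmid" key, which the spec does not confirm, and the names `extract_pmids` and `Throttle` are hypothetical.

```python
import json
import re
import time

NUMERIC_PMID = re.compile(r"[0-9]+")  # PMIDs are plain positive integers


def extract_pmids(evidence_json):
    """Return the set of numeric PMIDs found in one evidence field.

    Assumption: the field is a JSON list of objects with a "pmid" key.
    Anything that is not valid JSON, or not numeric, is skipped.
    """
    if not evidence_json:
        return set()
    try:
        items = json.loads(evidence_json)
    except (TypeError, ValueError):
        return set()
    if not isinstance(items, list):
        return set()
    pmids = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        raw = str(item.get("pmid", "")).strip()
        if NUMERIC_PMID.fullmatch(raw):
            pmids.add(raw)
    return pmids


class Throttle:
    """Space calls at least 1/rate seconds apart (3 per second by default)."""

    def __init__(self, rate_per_second=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_second
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

Calling `throttle.wait()` before each E-utilities request keeps the script at or below the 3 requests/second limit that applies without an API key.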

Approach

  • Fix literature_manager.py to match actual papers table schema
    - Current schema has: id, pmid, title, authors, journal, year, abstract, doi, url, cited_by_analyses, created_at, citation_count, cited_in_analysis_ids, first_cited_at
    - Script expects: cited_by_hypotheses, kg_edges_sourced, fetch_status, fetch_error, updated_at (mismatched)
  • Update extraction logic to use cited_by_analyses instead of cited_by_hypotheses
  • Remove references to non-existent columns (fetch_status, kg_edges_sourced, etc.)
  • Run the sync: python3 literature_manager.py sync
  • Verify papers table is populated
  • Test that papers are properly linked to hypotheses
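
The schema mismatch in the first step can be detected programmatically before running the sync. The sketch below assumes the database is SQLite, which this spec does not state, and the helper names `table_columns` and `missing_columns` are hypothetical. The column names are taken from the list above.

```python
import sqlite3

# Columns the old script wrote to, per the mismatch noted in the approach.
EXPECTED_BY_SCRIPT = {
    "cited_by_hypotheses", "kg_edges_sourced", "fetch_status",
    "fetch_error", "updated_at",
}


def table_columns(conn, table):
    """Return the set of column names for a table via PRAGMA table_info."""
    # PRAGMA does not accept bound parameters, so only pass trusted table names.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1] for row in rows}  # index 1 of each row is the column name


def missing_columns(conn, table, expected):
    """Return the expected columns that the table does not actually have."""
    return sorted(set(expected) - table_columns(conn, table))
```

A non-empty result from `missing_columns(conn, "papers", EXPECTED_BY_SCRIPT)` means the script still references columns that have to be removed or remapped.
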
Work Log

    2026-04-01 — Slot 8

    • Started task: Papers enrichment from PubMed API
    • Read AGENTS.md and understood five-layer architecture
    • Checked database: papers table exists but is empty (0 rows)
    • Found existing literature_manager.py with PubMed integration
    • Discovered schema mismatch: script expects different columns than actual schema
    • Creating spec file; will then update code to match actual schema

    2026-04-25 — Slot 76

    • Resumed task: Papers enrichment from PubMed API
    • Key finding: papers table already has 24,908 rows (not 0 as task stated)
    • 7,272 unique PMIDs in hypothesis evidence (6,946 numeric)
    • 6,904 of 6,946 already in papers table — 42 missing
    • The 42 missing PMIDs return an empty <PubmedArticleSet> from PubMed (likely retracted or invalid)
    • Created scripts/enrich_papers_from_hypothesis_pmids.py to populate missing papers
    • Fetched 251 papers from PubMed, inserted with full metadata (title, abstract, journal, year, doi, pmc_id, authors)
    • Updated cited_in_analysis_ids (TEXT/JSON) to track hypothesis → paper linkage
    • Final papers count: 25,159 (was 24,908 before run)
    • Committed and pushed: 6506a0988
    • Note: 42 PMIDs remain unfilled — these return no results from PubMed E-utilities (likely retracted papers or entry errors)
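
The empty <PubmedArticleSet> case noted in this log can be handled by parsing the efetch XML and comparing the PMIDs that came back with the ones requested. This is a sketch under assumptions, not the code in scripts/enrich_papers_from_hypothesis_pmids.py: the function names are hypothetical, and the element paths follow the standard PubMed efetch XML layout (`retmode=xml`).

```python
import xml.etree.ElementTree as ET


def _text(element):
    """Flatten an element's text, including inline markup such as <i>."""
    return "".join(element.itertext()).strip() if element is not None else None


def parse_pubmed_xml(xml_text):
    """Map PMID -> metadata dict for each <PubmedArticle> in an efetch response."""
    root = ET.fromstring(xml_text)
    papers = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        if not pmid:
            continue
        art = "MedlineCitation/Article"
        abstract = " ".join(
            _text(el) for el in article.findall(f"{art}/Abstract/AbstractText")
        )
        authors = []
        for author in article.findall(f"{art}/AuthorList/Author"):
            last = author.findtext("LastName")
            fore = author.findtext("ForeName")
            if last:
                authors.append(f"{fore} {last}" if fore else last)
        # The year can be absent (some records only carry a MedlineDate string).
        year_text = article.findtext(f"{art}/Journal/JournalIssue/PubDate/Year")
        doi = None
        for id_el in article.findall("PubmedData/ArticleIdList/ArticleId"):
            if id_el.get("IdType") == "doi":
                doi = (id_el.text or "").strip() or None
        papers[pmid.strip()] = {
            "title": _text(article.find(f"{art}/ArticleTitle")),
            "journal": article.findtext(f"{art}/Journal/Title"),
            "year": int(year_text) if year_text and year_text.isdigit() else None,
            "abstract": abstract or None,
            "authors": authors,
            "doi": doi,
        }
    return papers


def split_found_and_missing(requested_pmids, xml_text):
    """Return (found metadata, PMIDs that PubMed returned nothing for)."""
    found = parse_pubmed_xml(xml_text)
    missing = [p for p in requested_pmids if p not in found]
    return found, missing
```

An empty <PubmedArticleSet> yields no articles, so every requested PMID in that batch lands in the missing list and can be recorded as unfilled instead of raising an error.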

    File: e543e406-4cca-45d0-a062-2fabc24dc2d3_spec.md
    Modified: 2026-04-25 23:40
    Size: 2.9 KB