[Atlas] PubMed evidence update pipeline


Goal

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
> atlas — every principle is load-bearing. In particular:
> - LLMs for semantic judgment; rules for syntactic validation.
> - Gap-predicate driven, not calendar-driven.
> - Idempotent + version-stamped + observable.
> - No hardcoded entity lists, keyword lists, or canonical-name tables.
> - Three surfaces: FastAPI + orchestra + MCP.
> - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
> AG3 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
> Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
> docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
> BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

Fetch new PubMed abstracts and update evidence links for SciDEX hypotheses on a daily recurring schedule. The pipeline searches for recent papers related to the stalest hypotheses, classifies each paper as supporting or contradicting, and updates the hypothesis evidence_for/evidence_against JSON arrays and the hypothesis_papers junction table.

Acceptance Criteria

☑ Pipeline script pubmed_update_pipeline.py runs successfully against live DB
☑ Hypotheses get updated evidence with PMIDs added to papers table
☑ pubmed_update_log tracks each run with counts
☑ New papers appear in hypothesis evidence arrays (evidence_for/evidence_against)
☑ No duplicate PMIDs added for same hypothesis
☑ Last update timestamp updated on hypotheses
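
The duplicate-PMID criterion can be illustrated with a minimal sketch. It assumes each evidence entry is a dict carrying a "pmid" key; that shape and the name merge_evidence are hypothetical and are not taken from pubmed_update_pipeline.py.

```python
from __future__ import annotations

from typing import Iterable


def merge_evidence(existing: list[dict], candidates: Iterable[dict]) -> tuple[list[dict], int]:
    """Append candidate papers to an evidence array, skipping known PMIDs.

    Returns the merged list and the number of duplicates skipped.
    Assumption: every entry has a "pmid" key (hypothetical schema).
    """
    # Compare PMIDs as strings so 12345 and "12345" count as the same paper.
    seen = {str(entry["pmid"]) for entry in existing}
    merged = list(existing)
    skipped = 0
    for paper in candidates:
        pmid = str(paper["pmid"])
        if pmid in seen:
            skipped += 1
            continue
        seen.add(pmid)
        merged.append(paper)
    return merged, skipped
```

Running the same merge twice with the same candidates leaves the array unchanged on the second pass, which is the idempotence property the anchor section asks for.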

Approach

Pipeline Overview

The pipeline (pubmed_update_pipeline.py) is already implemented and was last run 2026-04-10. This recurring task executes it daily to keep hypothesis evidence fresh.

  • Select stale hypotheses: query hypotheses ordered by last_evidence_update ASC NULLS FIRST, limit 20
  • Build PubMed queries: from title terms + target gene + mechanism terms
  • Search PubMed: esearch.fcgi with date filter (mindate from last update, maxdate today)
  • Fetch metadata: esummary.fcgi for title/authors/journal/year, efetch.fcgi for abstracts
  • Classify direction: 'for' or 'against' based on query type + abstract heuristics
  • Write to DB: insert into papers, hypothesis_papers, update evidence_for/evidence_against JSON
  • Log: record run in pubmed_update_log
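
The selection and search steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in pubmed_update_pipeline.py: the helper names and the fallback_days lookback for hypotheses that have never been updated are hypothetical. The query parameters (db, term, retmode, retmax, datetype, mindate, maxdate) are standard ESearch parameters.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def stalest_first(rows, limit=20):
    """In-memory mirror of ORDER BY last_evidence_update ASC NULLS FIRST LIMIT 20."""
    def key(row):
        stamp = row["last_evidence_update"]
        # Rows with no timestamp sort ahead of every dated row.
        return (stamp is not None, stamp or date.min)
    return sorted(rows, key=key)[:limit]


def build_esearch_url(term, last_update, today, retmax=20, fallback_days=365):
    """Build an ESearch URL restricted to papers published since the last update.

    fallback_days is a hypothetical lookback used when last_update is None.
    """
    start = last_update or (today - timedelta(days=fallback_days))
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": retmax,
        "datetype": "pdat",  # filter on publication date
        "mindate": start.strftime("%Y/%m/%d"),
        "maxdate": today.strftime("%Y/%m/%d"),
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

ESearch expects mindate and maxdate to be supplied together, which is why the sketch always sets both.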

Rate Limiting

  • NCBI allows 3 req/s without an API key; the pipeline uses a 0.4 s delay between requests (~2.5 req/s)
  • 1 s pause between hypotheses
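
One way to keep requests under the NCBI limit is a minimum-interval throttle, sketched below. This is a variant of the fixed 0.4 s delay described above, not the pipeline's own implementation: it sleeps only for the time still owed since the previous request. The Throttle class and its injectable clock and sleep arguments are hypothetical and exist to make the behaviour testable.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=0.4, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() before each E-utilities request gives at most 1 / 0.4 = 2.5 requests per second; a second instance with min_interval=1.0 could cover the pause between hypotheses.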

    Dependencies

    • pubmed_update_pipeline.py — the pipeline script (already exists)
    • papers table, hypothesis_papers table, hypotheses table
    • NCBI PubMed E-utilities (free, no API key required)

    Dependents

    • Atlas knowledge graph quality — hypotheses with fresh evidence
    • Exchange market pricing — evidence feeds into composite scores

    Work Log

    2026-04-12 00:01 PT — Slot minimax:59

    • Task claimed and started
    • Reviewed existing pipeline code (pubmed_update_pipeline.py, pubmed_evidence_pipeline.py)
    • Checked DB state: 335 hypotheses, 15,943 papers, 6,553 hypothesis_papers entries
    • Last pipeline run was 2026-04-10 (yesterday)
    • Created spec file and executed pipeline run

    2026-04-12 00:15 PT — Run complete

    • Pipeline executed: processed 20 hypotheses (stalest first by last_evidence_update)
    • +19 new supporting papers added to evidence arrays
    • 11 duplicate PMIDs skipped
    • 23 new paper rows inserted into papers table (created_at 2026-04-11)
    • pubmed_update_log updated with run timestamps
    • Committed spec file and pushed to main
    • orchestra complete failed due to Orchestra DB issue (unrelated to task)

    2026-04-22 07:40 PT — Slot minimax:70

    • Pipeline was broken: SQLite-only constructs (PRAGMA statements, INSERT OR IGNORE, tuple-style row unpacking) are incompatible with PostgreSQL
    • Fixed in local worktree: removed the SQLite PRAGMA statements, converted INSERT OR IGNORE/REPLACE to PostgreSQL INSERT ... ON CONFLICT DO NOTHING/DO UPDATE, fixed row unpacking for dict-like rows, and fixed evidence_for/evidence_against type handling (the values already arrive as lists in PostgreSQL)
    • Tested: pipeline ran successfully against live PostgreSQL DB, processed 13 hypotheses, added 40+ new papers
    • Push blocked: remote branch orchestra/task/61065e83-pubmed-evidence-update-pipeline has divergent history from another agent's commit (b42a72eee, "Restore PubMed evidence update pipeline on PostgreSQL") — both branches have different PostgreSQL fixes
    • Note: remote branch already has a partial PostgreSQL fix. Local commit c1b084cc7 has additional compatibility fixes. Merge or rebase required to reconcile.
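
The kinds of compatibility fix described in the 2026-04-22 entry can be sketched as below. Neither commit's actual code appears in this spec, so everything here is an assumption: the helper name as_evidence_list, the hypothesis_papers column names, and the %s placeholder style (as used by psycopg) are hypothetical.

```python
import json


def as_evidence_list(value):
    """Normalise an evidence column value to a Python list.

    Handles both a JSON string (as a SQLite text column would return)
    and an already-decoded list (as reported for PostgreSQL above).
    """
    if value is None:
        return []
    if isinstance(value, str):
        return json.loads(value) if value.strip() else []
    return list(value)


# PostgreSQL replacement for SQLite's INSERT OR IGNORE.
# Column names are hypothetical; ON CONFLICT DO NOTHING relies on a
# unique constraint existing on the table.
LINK_SQL = """
INSERT INTO hypothesis_papers (hypothesis_id, pmid, evidence_direction)
VALUES (%s, %s, %s)
ON CONFLICT DO NOTHING
"""


def row_value(row, name, index):
    """Read a field from either a dict-like row or a plain tuple row."""
    try:
        return row[name]
    except (TypeError, KeyError, IndexError):
        return row[index]
```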

    File: 61065e83_698_spec.md
    Modified: 2026-04-25 23:40
    Size: 5.3 KB