Spec: [Forge] Build Automated PubMed Update Pipeline for Hypothesis Evidence

Task ID: c5bbaa6b-75a8-40a6-a9d3-3efa6ebb396c
Layer: Forge
Priority: P88

Problem

Hypotheses need fresh literature evidence. The existing pubmed_enrichment.py does one-shot backfills but doesn't track which papers have already been seen, can't find new papers since the last run, and doesn't append to existing evidence.

Solution

Build pubmed_update_pipeline.py — a recurring pipeline that:
  • Selects top N hypotheses by composite_score
  • For each, searches PubMed with a date filter (only papers since last check)
  • Deduplicates against existing PMIDs already in evidence_for/evidence_against
  • Appends new citations to evidence_for/evidence_against
  • Tracks last-checked timestamps in a pubmed_update_log table
  • Can be run via CLI or cron
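
The search and deduplication steps above can be sketched as follows. This is a minimal illustration, not the code in pubmed_update_pipeline.py. The function names build_esearch_params and select_new_pmids are hypothetical, the PMIDs are placeholders, and evidence entries are assumed to be dicts with a "pmid" key. The mindate, maxdate, and datetype parameters belong to the NCBI E-utilities ESearch endpoint; the choice of "edat" (date added to PubMed) over "pdat" (publication date) is an assumption.

```python
from datetime import date

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_params(query, last_checked, today, retmax=20):
    """Build ESearch parameters restricted to records dated since last_checked."""
    return {
        "db": "pubmed",
        "term": query,
        "retmode": "json",
        "retmax": str(retmax),
        "datetype": "edat",  # assumption: filter on the date the record entered PubMed
        "mindate": last_checked.strftime("%Y/%m/%d"),
        "maxdate": today.strftime("%Y/%m/%d"),
    }


def select_new_pmids(found_pmids, evidence_for, evidence_against):
    """Return found PMIDs that are not already cited, preserving search order."""
    # Assumption: each evidence entry is a dict carrying a "pmid" key.
    seen = {str(e["pmid"]) for e in evidence_for + evidence_against if e.get("pmid")}
    new = []
    for pmid in found_pmids:
        pmid = str(pmid)
        if pmid not in seen:
            seen.add(pmid)  # also drops duplicates within the search results
            new.append(pmid)
    return new


# Placeholder query and PMIDs, for illustration only.
params = build_esearch_params("TREM2 microglia", date(2026, 4, 2), date(2026, 4, 25))
new_pmids = select_new_pmids(
    ["111", "222", "222", "333"],
    evidence_for=[{"pmid": "111"}],
    evidence_against=[{"pmid": 333}],
)
print(params["mindate"], params["maxdate"], new_pmids)
```

Normalizing PMIDs to strings before comparison guards against the same paper being stored once as an integer and once as text.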

DB Migration

Add pubmed_update_log table:
  • hypothesis_id TEXT (FK)
  • last_checked_at TEXT
  • papers_found INTEGER
  • papers_added INTEGER
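
One way the table and its per-run update might look is sketched below, using an in-memory SQLite database so the example runs on its own. Several details are assumptions: the NOT NULL and DEFAULT constraints, the record_check helper, the "h-demo" identifier, and the choice to overwrite the counts with those of the latest run. The foreign key to hypotheses is omitted because that table is not part of the sketch. The ON CONFLICT upsert syntax shown here is accepted by both PostgreSQL and SQLite 3.24 or later, although PostgreSQL drivers typically use a different parameter placeholder than `?`.

```python
import sqlite3
from datetime import datetime, timezone

DDL = """
CREATE TABLE IF NOT EXISTS pubmed_update_log (
    hypothesis_id   TEXT PRIMARY KEY,            -- one row per hypothesis
    last_checked_at TEXT NOT NULL,               -- ISO 8601 timestamp, UTC
    papers_found    INTEGER NOT NULL DEFAULT 0,
    papers_added    INTEGER NOT NULL DEFAULT 0
)
"""

UPSERT = """
INSERT INTO pubmed_update_log
    (hypothesis_id, last_checked_at, papers_found, papers_added)
VALUES (?, ?, ?, ?)
ON CONFLICT (hypothesis_id) DO UPDATE SET
    last_checked_at = excluded.last_checked_at,
    papers_found    = excluded.papers_found,
    papers_added    = excluded.papers_added
"""


def record_check(conn, hypothesis_id, papers_found, papers_added, now=None):
    """Insert or overwrite the log row for one hypothesis (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    conn.execute(UPSERT, (hypothesis_id, now.isoformat(), papers_found, papers_added))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
record_check(conn, "h-demo", 5, 2, datetime(2026, 4, 2, tzinfo=timezone.utc))
record_check(conn, "h-demo", 3, 0, datetime(2026, 4, 25, tzinfo=timezone.utc))
rows = conn.execute(
    "SELECT hypothesis_id, last_checked_at, papers_found, papers_added "
    "FROM pubmed_update_log"
).fetchall()
print(rows)
```

Keying the table on hypothesis_id keeps exactly one row per hypothesis, so the next run can read last_checked_at directly as the lower bound of its date filter.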

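Testing

  • Run with --dry-run to verify search queries without writing to the database
  • Verify against the database (SQLite in the initial implementation, PostgreSQL on main per the work log) with "SELECT id, title, json_array_length(evidence_for) FROM hypotheses ORDER BY composite_score DESC LIMIT 5". On PostgreSQL, use jsonb_array_length instead if evidence_for is a jsonb column.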
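
A dry-run switch of the kind described above could be wired up roughly as follows. Only the --dry-run flag comes from this spec. The --limit flag, its default of 20, the apply_updates helper, and the placeholder PMID are hypothetical and stand in for whatever the real CLI provides.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Incrementally update hypothesis evidence from PubMed."
    )
    # Hypothetical flag: how many top-scoring hypotheses to process.
    parser.add_argument("--limit", type=int, default=20,
                        help="number of hypotheses to process")
    parser.add_argument("--dry-run", action="store_true",
                        help="run searches and report results without writing")
    return parser


def apply_updates(new_citations, dry_run, write):
    """Call write(citation) for each citation unless dry_run is set.

    Returns the number of citations actually written.
    """
    written = 0
    for citation in new_citations:
        if dry_run:
            print(f"[dry-run] would add PMID {citation['pmid']}")
            continue
        write(citation)
        written += 1
    return written


args = build_parser().parse_args(["--dry-run", "--limit", "5"])
stored = []  # stands in for the database write path
count = apply_updates([{"pmid": "222"}], args.dry_run, stored.append)
print(args.dry_run, args.limit, count, stored)
```

Passing the write function in as an argument keeps the dry-run decision in one place, so the search and deduplication code runs identically in both modes.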

Work Log

  • 2026-04-02: Started implementation; created pubmed_update_pipeline.py (SQLite)
  • 2026-04-25: Verified already resolved on main. scidex/agora/pubmed_update_pipeline.py (651 lines, PostgreSQL via get_db() shim) exists on origin/main. Backward-compat shim at pubmed_update_pipeline.py present. pubmed_update_log table confirmed in PostgreSQL DB (PRIMARY KEY on hypothesis_id). Ran live test: 3 hypotheses processed, 2 new papers added to TREM2 hypothesis evidence_for. Pipeline handles date filtering, deduplication, and incremental updates as designed.

Already Resolved: 2026-04-25 23:52:00Z

Evidence: scidex/agora/pubmed_update_pipeline.py is present on origin/main with 651 lines implementing the full PostgreSQL-backed incremental PubMed pipeline. The pubmed_update_log table exists with a primary key on hypothesis_id. Live run verified: 3 hypotheses processed, 2 papers added to TREM2 evidence.

Commit that landed the fix: no single commit; multiple squash-merges landed this work, and the files are present on origin/main HEAD.

Summary: The automated PubMed update pipeline is implemented, tested, and operational.

File: c5bbaa6b_pubmed_update_pipeline_spec.md
Modified: 2026-04-25 23:53
Size: 2.3 KB