[Atlas] Create papers table and literature corpus tracking (done)

The Atlas world model vision requires tracking every PubMed paper cited across SciDEX. Create a 'papers' table with schema: id, pmid, title, authors, journal, year, abstract, cited_in_analysis_ids, first_cited_at. Backfill from existing analyses' PubMed citations. Add /api/papers endpoint.

Completion Notes

Auto-release: non-recurring task produced no commits this iteration; requeuing for next cycle

Git Commits (2)

[Verify] papers table and literature corpus — already resolved [task:a592b68f-1688-4977-b1b7-ba8434ed96cf] (2026-04-25)
Merge remote-tracking branch 'origin/orchestra/task/a3b4b25b-ea46-4e82-b4c6-898ffc33e37a' (2026-04-01)

Spec File

[Atlas] Create papers table and literature corpus tracking

Goal

The Atlas world model vision requires tracking every PubMed paper cited across SciDEX. This creates a comprehensive literature corpus that can be queried, analyzed, and linked to the knowledge graph. Build the infrastructure to track papers, backfill from existing analyses, and expose via API.

Acceptance Criteria

papers table created in PostgreSQL with proper schema
☑ Backfill script extracts PMIDs from existing analyses and populates papers table
/api/papers endpoint returns papers with citation counts
☑ Papers are linked to analyses that cite them
☑ All existing pages continue to load (200 status)
☑ No broken links introduced
☑ Code follows existing patterns

Approach

  • Read existing code - Understand how analyses currently store PubMed citations
    - Check analyses table schema
    - Check hypotheses table for PMIDs
    - Look at how post_process.py extracts citations

  • Create papers table - Schema design:

    CREATE TABLE IF NOT EXISTS papers (
        -- AUTOINCREMENT is SQLite syntax; PostgreSQL would use an identity column instead
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        pmid TEXT UNIQUE NOT NULL,
        title TEXT,
        authors TEXT,
        journal TEXT,
        year INTEGER,
        abstract TEXT,
        cited_in_analysis_ids TEXT,  -- JSON array of analysis ids
        first_cited_at TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
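A minimal sketch of how a citation could be recorded against this schema, using Python's built-in sqlite3 module because the DDL above is written in SQLite dialect. The helper name `record_citation` and the analysis ids are hypothetical, and a PostgreSQL deployment would need a different driver and placeholder style.

```python
import json
import sqlite3

# Schema copied from the spec above (SQLite dialect).
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pmid TEXT UNIQUE NOT NULL,
    title TEXT,
    authors TEXT,
    journal TEXT,
    year INTEGER,
    abstract TEXT,
    cited_in_analysis_ids TEXT,
    first_cited_at TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def record_citation(conn, pmid, analysis_id, cited_at):
    """Insert the paper if it is new, otherwise append analysis_id to its JSON array.

    Hypothetical helper: the name and signature are assumptions, not part of the spec.
    """
    row = conn.execute(
        "SELECT cited_in_analysis_ids FROM papers WHERE pmid = ?", (pmid,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO papers (pmid, cited_in_analysis_ids, first_cited_at) "
            "VALUES (?, ?, ?)",
            (pmid, json.dumps([analysis_id]), cited_at),
        )
        return
    ids = json.loads(row[0] or "[]")
    if analysis_id not in ids:
        ids.append(analysis_id)
        conn.execute(
            "UPDATE papers SET cited_in_analysis_ids = ? WHERE pmid = ?",
            (json.dumps(ids), pmid),
        )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Analysis ids below are made up for illustration.
record_citation(conn, "36692217", "analysis-1", "2026-04-01T21:19:00")
record_citation(conn, "36692217", "analysis-2", "2026-04-02T09:00:00")
record_citation(conn, "36692217", "analysis-1", "2026-04-03T09:00:00")  # repeat, no change
```

Because pmid is UNIQUE, repeated citations update the existing row rather than adding a new one, and first_cited_at keeps the timestamp of the first insert.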

  • Build backfill script - Extract PMIDs from:
    - hypotheses.evidence_for (contains PMIDs)
    - hypotheses.evidence_against (contains PMIDs)
    - Analysis debate transcripts
    - Parse and deduplicate
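The parse-and-deduplicate step might look roughly like the sketch below. It assumes PMIDs appear in the evidence text either as "PMID:12345678" or under a "pmid" key, and that each hypothesis row carries an `analysis_id`; the real field layout in backfill_papers.py is not shown here, so the function names and input shape are hypothetical.

```python
import re
from collections import defaultdict

# Matches "PMID:123", "PMID 123", "PMID: 123" and JSON like "pmid": "123".
# Assumption: PMIDs are at most 8 digits long.
PMID_RE = re.compile(r"PMID\W{0,4}(\d{1,8})\b", re.IGNORECASE)


def extract_pmids(evidence_text):
    """Return the PMIDs found in one evidence field, in first-seen order."""
    if not evidence_text:
        return []
    seen = []
    for pmid in PMID_RE.findall(evidence_text):
        if pmid not in seen:
            seen.append(pmid)
    return seen


def collect_citations(hypotheses):
    """Map each PMID to the sorted analysis ids whose hypotheses cite it.

    `hypotheses` is assumed to be an iterable of dicts with the keys
    analysis_id, evidence_for and evidence_against.
    """
    cited = defaultdict(set)
    for hyp in hypotheses:
        for field in ("evidence_for", "evidence_against"):
            for pmid in extract_pmids(hyp.get(field)):
                cited[pmid].add(hyp["analysis_id"])
    return {pmid: sorted(ids) for pmid, ids in cited.items()}
```

Scanning the raw text with a regular expression avoids depending on the exact JSON structure of the evidence fields, at the cost of possibly matching numbers that merely follow the word "PMID".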

  • Add /api/papers endpoint - Provide:
    - List of all papers
    - Citation counts
    - Filtering by analysis, year, journal
    - Pagination support
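One way the query behind /api/papers could be structured is sketched below, independent of any web framework. The function name `list_papers`, its parameters, and the response shape are assumptions; the citation count is derived here from the length of the cited_in_analysis_ids JSON array.

```python
import json
import sqlite3


def list_papers(conn: sqlite3.Connection, analysis_id=None, year=None,
                journal=None, page=1, per_page=50):
    """Return one page of papers with citation counts (hypothetical helper)."""
    clauses, params = [], []
    if year is not None:
        clauses.append("year = ?")
        params.append(year)
    if journal is not None:
        clauses.append("journal = ?")
        params.append(journal)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    rows = conn.execute(
        "SELECT pmid, title, journal, year, cited_in_analysis_ids FROM papers"
        + where + " ORDER BY id",
        params,
    ).fetchall()

    papers = []
    for pmid, title, jrnl, yr, ids_json in rows:
        ids = json.loads(ids_json or "[]")
        # The analysis filter runs in Python because the ids live in a JSON text column.
        if analysis_id is not None and analysis_id not in ids:
            continue
        papers.append({
            "pmid": pmid,
            "title": title,
            "journal": jrnl,
            "year": yr,
            "cited_in_analysis_ids": ids,
            "citation_count": len(ids),
        })

    start = (max(page, 1) - 1) * per_page
    return {
        "total": len(papers),
        "page": page,
        "per_page": per_page,
        "papers": papers[start:start + per_page],
    }
```

Filtering by analysis in Python keeps the sketch portable across databases; for a corpus much larger than a few hundred papers, a join table between papers and analyses would likely serve better than a JSON column.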

  • Test thoroughly:
    - Run backfill script
    - Verify papers table populated
    - Test /api/papers endpoint
    - Check all main pages still load

Work Log

2026-04-01 21:19 PT — Slot 5

  • Task assigned: Create papers table and literature corpus tracking
  • Created spec file
  • Examined database schema - found PMIDs stored in hypotheses.evidence_for/evidence_against as JSON
  • Created papers table with schema: id, pmid (unique), title, authors, journal, year, abstract, cited_in_analysis_ids, citation_count, first_cited_at
  • Built backfill_papers.py script to extract PMIDs from all 118 hypotheses
  • Ran backfill: extracted 437 unique papers with citation tracking
  • Added /api/papers endpoint with pagination, filtering, and sorting support
  • Tested: All main pages return 200, API endpoints work correctly
  • Papers table stats:
    - 437 papers total
    - 1.04 avg citations per paper
    - Max: 3 citations (PMID:36692217 - stress granule homeostasis)
  • Ready to commit and push
