SciDEX — Task: [Atlas] Create papers table and literature corpus

The Atlas world model vision requires tracking every PubMed paper cited across SciDEX. Create a 'papers' table with schema: id, pmid, title, authors, journal, year, abstract, cited_in_analysis_ids, first_cited_at. Backfill from existing analyses' PubMed citations. Add /api/papers endpoint.

Completion Notes

Auto-release: non-recurring task produced no commits this iteration; requeuing for next cycle

Git Commits (2)

[Verify] papers table and literature corpus — already resolved [task:a592b68f-1688-4977-b1b7-ba8434ed96cf] 2026-04-25

Merge remote-tracking branch 'origin/orchestra/task/a3b4b25b-ea46-4e82-b4c6-898ffc33e37a' 2026-04-01

Spec File

[Atlas] Create papers table and literature corpus tracking

Goal

The Atlas world model vision requires tracking every PubMed paper cited across SciDEX. This creates a comprehensive literature corpus that can be queried, analyzed, and linked to the knowledge graph. Build the infrastructure to track papers, backfill from existing analyses, and expose via API.

Acceptance Criteria

☑ papers table created in PostgreSQL with proper schema

☑ Backfill script extracts PMIDs from existing analyses and populates papers table

☑ /api/papers endpoint returns papers with citation counts

☑ Papers are linked to analyses that cite them

☑ All existing pages continue to load (200 status)

☑ No broken links introduced

☑ Code follows existing patterns

Approach

Read existing code - Understand how analyses currently store PubMed citations

- Check analyses table schema
- Check hypotheses table for PMIDs
- Look at how post_process.py extracts citations

Create papers table - Schema design:

CREATE TABLE IF NOT EXISTS papers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pmid TEXT UNIQUE NOT NULL,
    title TEXT,
    authors TEXT,
    journal TEXT,
    year INTEGER,
    abstract TEXT,
    cited_in_analysis_ids TEXT,  -- JSON array
    first_cited_at TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
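
A minimal sketch of how a row's cited_in_analysis_ids JSON array might be maintained against this schema. It uses Python's built-in sqlite3 module because AUTOINCREMENT in the DDL above is SQLite syntax; a PostgreSQL deployment would need a different identity column and driver. The helper name record_citation and the analysis id values are hypothetical, not taken from the SciDEX code.

```python
import json
import sqlite3

# Same DDL as the schema design above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pmid TEXT UNIQUE NOT NULL,
    title TEXT,
    authors TEXT,
    journal TEXT,
    year INTEGER,
    abstract TEXT,
    cited_in_analysis_ids TEXT,  -- JSON array
    first_cited_at TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def record_citation(conn, pmid, analysis_id, cited_at):
    """Insert the paper if it is new, otherwise append analysis_id to its JSON array.

    Hypothetical helper: first_cited_at is only set when the row is created.
    """
    row = conn.execute(
        "SELECT cited_in_analysis_ids FROM papers WHERE pmid = ?", (pmid,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO papers (pmid, cited_in_analysis_ids, first_cited_at) "
            "VALUES (?, ?, ?)",
            (pmid, json.dumps([analysis_id]), cited_at),
        )
        return
    ids = json.loads(row[0] or "[]")
    if analysis_id not in ids:  # keep the array free of duplicates
        ids.append(analysis_id)
        conn.execute(
            "UPDATE papers SET cited_in_analysis_ids = ? WHERE pmid = ?",
            (json.dumps(ids), pmid),
        )
```

Calling record_citation repeatedly for the same PMID leaves a single row whose array grows only when a new analysis cites the paper, which is what the UNIQUE constraint on pmid requires.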

Build backfill script - Extract PMIDs from:

- hypotheses.evidence_for (contains PMIDs)
- hypotheses.evidence_against (contains PMIDs)
- Analysis debate transcripts
- Parse and deduplicate
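
The parse-and-deduplicate step could be sketched as follows. This is not the actual backfill_papers.py: it assumes each hypothesis row is a dict with an analysis_id key, and that PMIDs appear in the evidence JSON either under a "pmid" key or inline as text such as "PMID:36692217". The function names and the regular expression are illustrative.

```python
import json
import re
from collections import defaultdict

# Assumption: inline PMIDs look like "PMID:12345" or "PMID 12345".
PMID_PATTERN = re.compile(r"PMID[:\s]*(\d+)", re.IGNORECASE)


def extract_pmids(evidence_json):
    """Return the set of PMIDs found anywhere in a JSON evidence blob."""
    if not evidence_json:
        return set()
    try:
        data = json.loads(evidence_json)
    except (TypeError, ValueError):
        data = evidence_json  # not valid JSON: scan it as raw text instead
    found = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key.lower() == "pmid" and str(value).strip().isdigit():
                    found.add(str(value).strip())
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            found.update(PMID_PATTERN.findall(node))

    walk(data)
    return found


def collect_citations(hypotheses):
    """Map each PMID to the sorted, deduplicated analysis ids that cite it."""
    cited_in = defaultdict(set)
    for row in hypotheses:
        for field in ("evidence_for", "evidence_against"):
            for pmid in extract_pmids(row.get(field)):
                cited_in[pmid].add(row["analysis_id"])
    return {pmid: sorted(ids) for pmid, ids in cited_in.items()}
```

Using sets at both levels means a PMID repeated within one hypothesis, or cited for and against in the same analysis, is counted once per analysis.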

Add /api/papers endpoint - Return:

- List of all papers
- Citation counts
- Filter by analysis, year, journal
- Pagination support
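
One way the query behind /api/papers might combine these features is sketched below. The web framework wiring is omitted because the source does not name it; list_papers, its parameters, and the response shape are assumptions. Year and journal are filtered in SQL, while the analysis filter and citation count are derived in Python from the cited_in_analysis_ids JSON array.

```python
import json
import sqlite3


def list_papers(conn, analysis_id=None, year=None, journal=None, limit=50, offset=0):
    """Return one page of papers plus the total number of matches."""
    clauses, params = [], []
    if year is not None:
        clauses.append("year = ?")
        params.append(year)
    if journal is not None:
        clauses.append("journal = ?")
        params.append(journal)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    rows = conn.execute(
        "SELECT pmid, title, journal, year, cited_in_analysis_ids FROM papers"
        + where
        + " ORDER BY pmid",
        params,
    ).fetchall()

    papers = []
    for pmid, title, journal_name, paper_year, ids_json in rows:
        ids = json.loads(ids_json or "[]")
        if analysis_id is not None and analysis_id not in ids:
            continue  # paper is not cited by the requested analysis
        papers.append(
            {
                "pmid": pmid,
                "title": title,
                "journal": journal_name,
                "year": paper_year,
                "cited_in_analysis_ids": ids,
                "citation_count": len(ids),
            }
        )
    # Paginate after filtering so "total" reflects every match, not one page.
    return {
        "total": len(papers),
        "limit": limit,
        "offset": offset,
        "papers": papers[offset : offset + limit],
    }
```

Filtering by analysis in Python keeps the sketch independent of database JSON functions; for a corpus of a few hundred papers that is cheap, but a larger table would likely want the filter pushed into SQL or a separate link table.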

Test thoroughly:

- Run backfill script
- Verify papers table populated
- Test /api/papers endpoint
- Check all main pages still load
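
The page-load check in the last step can be scripted. The sketch below uses only the Python standard library; the base URL and the list of paths are placeholders to be supplied by whoever runs it, and find_broken_pages is a hypothetical helper rather than an existing SciDEX script.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions


def find_broken_pages(base_url, paths, fetch=fetch_status):
    """Return (path, status) pairs for every page that did not answer 200."""
    broken = []
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            broken.append((path, status))
    return broken
```

An empty result means every listed page returned 200. Connection failures (urllib.error.URLError) are deliberately not caught, so an unreachable server stops the check instead of being reported as a broken page.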

Work Log

2026-04-01 21:19 PT — Slot 5

Task assigned: Create papers table and literature corpus tracking
Created spec file
Examined database schema - found PMIDs stored in hypotheses.evidence_for/evidence_against as JSON
Created papers table with schema: id, pmid (unique), title, authors, journal, year, abstract, cited_in_analysis_ids, citation_count, first_cited_at
Built backfill_papers.py script to extract PMIDs from all 118 hypotheses
Ran backfill: extracted 437 unique papers with citation tracking
Added /api/papers endpoint with pagination, filtering, and sorting support
Tested: All main pages return 200, API endpoints work correctly
Papers table stats:

- 437 papers total
- 1.04 avg citations per paper
- Max: 3 citations (PMID:36692217 - stress granule homeostasis)