[Atlas] Create papers table and literature corpus tracking
Goal
The Atlas world model vision requires tracking every PubMed paper cited across SciDEX. Doing so creates a comprehensive literature corpus that can be queried, analyzed, and linked to the knowledge graph. Build the infrastructure to track papers, backfill them from existing analyses, and expose them through an API.
Acceptance Criteria
☑ papers table created in PostgreSQL with proper schema
☑ Backfill script extracts PMIDs from existing analyses and populates papers table
☑ /api/papers endpoint returns papers with citation counts
☑ Papers are linked to analyses that cite them
☑ All existing pages continue to load (200 status)
☑ No broken links introduced
☑ Code follows existing patterns
Approach
Read existing code - Understand how analyses currently store PubMed citations
- Check analyses table schema
- Check hypotheses table for PMIDs
- Look at how post_process.py extracts citations
Create papers table - Schema design:
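CREATE TABLE IF NOT EXISTS papers (
    id SERIAL PRIMARY KEY,           -- PostgreSQL auto-increment (AUTOINCREMENT is SQLite-only)
    pmid TEXT UNIQUE NOT NULL,       -- PubMed identifier, one row per paper
    title TEXT,
    authors TEXT,
    journal TEXT,
    year INTEGER,
    abstract TEXT,
    cited_in_analysis_ids TEXT,      -- JSON array
    citation_count INTEGER DEFAULT 0, -- column listed in the Work Log schema
    first_cited_at TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);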
Build backfill script - Extract PMIDs from:
- hypotheses.evidence_for (contains PMIDs)
- hypotheses.evidence_against (contains PMIDs)
- Analysis debate transcripts
- Parse and deduplicate
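The extraction and deduplication steps above could be sketched as follows. This is a minimal illustration, not the actual backfill_papers.py: it assumes the evidence fields hold JSON whose entries carry PMIDs either in a "pmid" field or as text such as "PMID:36692217", and that each hypothesis row can be paired with an analysis ID. The function names extract_pmids and collect_citations are hypothetical.

```python
import json
import re
from collections import defaultdict

# Assumed citation format inside free text: "PMID:12345678" or "PMID 12345678".
PMID_RE = re.compile(r"PMID[:\s]*(\d+)", re.IGNORECASE)


def extract_pmids(evidence_json):
    """Return the set of PMIDs found in one evidence_for/evidence_against value."""
    if not evidence_json:
        return set()
    try:
        data = json.loads(evidence_json)
    except (TypeError, ValueError):
        data = evidence_json  # not valid JSON: fall back to scanning the raw text
    found = set()
    stack = [data]
    while stack:
        item = stack.pop()
        if isinstance(item, dict):
            for key, value in item.items():
                # Assumption: some entries carry a dedicated "pmid" field.
                if str(key).lower() == "pmid" and str(value).strip().isdigit():
                    found.add(str(value).strip())
                else:
                    stack.append(value)
        elif isinstance(item, (list, tuple)):
            stack.extend(item)
        elif isinstance(item, str):
            found.update(PMID_RE.findall(item))
    return found


def collect_citations(rows):
    """Map each PMID to the sorted list of analysis IDs that cite it.

    rows: iterable of (analysis_id, evidence_for, evidence_against) tuples.
    """
    cited_in = defaultdict(set)
    for analysis_id, evidence_for, evidence_against in rows:
        for pmid in extract_pmids(evidence_for) | extract_pmids(evidence_against):
            cited_in[pmid].add(analysis_id)  # the set deduplicates repeat citations
    return {pmid: sorted(ids) for pmid, ids in cited_in.items()}
```

Each key of the returned mapping would become one papers row, with the list serialized into cited_in_analysis_ids and its length stored as the citation count.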
Add /api/papers endpoint - Return:
- List of all papers
- Citation counts
- Filter by analysis, year, journal
- Pagination support
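One way the filtering and pagination could be assembled is a small query builder, sketched below. It is an assumption-laden illustration rather than the endpoint's real code: build_papers_query is a hypothetical name, the %s placeholders assume a psycopg-style PostgreSQL driver, the per-page cap of 200 is arbitrary, and citation_count is taken from the column list in the Work Log.

```python
def build_papers_query(analysis_id=None, year=None, journal=None,
                       page=1, per_page=50, max_per_page=200):
    """Return (sql, params) for a filtered, paginated papers listing."""
    clauses, params = [], []
    if analysis_id is not None:
        # cited_in_analysis_ids is a JSON array stored as TEXT, so this matches
        # the quoted ID as a substring. Assumes IDs contain no LIKE wildcards.
        clauses.append("cited_in_analysis_ids LIKE %s")
        params.append(f'%"{analysis_id}"%')
    if year is not None:
        clauses.append("year = %s")
        params.append(int(year))
    if journal is not None:
        clauses.append("journal = %s")
        params.append(journal)

    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""

    # Clamp pagination inputs so a bad request cannot ask for unbounded rows.
    page = max(1, int(page))
    per_page = min(max(1, int(per_page)), max_per_page)

    sql = (
        "SELECT pmid, title, authors, journal, year, citation_count "
        f"FROM papers{where} "
        "ORDER BY citation_count DESC, pmid "
        "LIMIT %s OFFSET %s"
    )
    params.extend([per_page, (page - 1) * per_page])
    return sql, params
```

All user-supplied values travel as bound parameters, never as interpolated SQL, so the filters stay safe against injection regardless of which web framework calls the builder.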
Test thoroughly:
- Run backfill script
- Verify papers table populated
- Test /api/papers endpoint
- Check all main pages still load
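The page-load check could be automated with a short smoke test along these lines. This is only a sketch: the base URL and the list of paths are placeholders (only /api/papers comes from this spec), and fetch_status and failing_pages are hypothetical helper names.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return None  # connection refused, DNS failure, timeout, ...


def failing_pages(base_url, paths, fetch=fetch_status):
    """Return {path: status} for every path that did not answer 200."""
    failures = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures[path] = status
    return failures


if __name__ == "__main__":
    # Placeholder host and paths: substitute the site's real main pages.
    print(failing_pages("http://localhost:8000", ["/", "/api/papers"]))
```

An empty result means every listed page returned 200; the fetch parameter lets the check run against a stub when no server is available.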
Work Log
2026-04-01 21:19 PT — Slot 5
- Task assigned: Create papers table and literature corpus tracking
- Created spec file
- Examined database schema - found PMIDs stored in hypotheses.evidence_for/evidence_against as JSON
- Created papers table with schema: id, pmid (unique), title, authors, journal, year, abstract, cited_in_analysis_ids, citation_count, first_cited_at
- Built backfill_papers.py script to extract PMIDs from all 118 hypotheses
- Ran backfill: extracted 437 unique papers with citation tracking
- Added /api/papers endpoint with pagination, filtering, and sorting support
- Tested: All main pages return 200, API endpoints work correctly
- Papers table stats:
  - 437 papers total
  - 1.04 average citations per paper
  - Max: 3 citations (PMID:36692217 - stress granule homeostasis)