SciDEX — Task: [Atlas] Paper processing pipeline: fetch, extract

Run paper_processing_pipeline.py process, which fetches queued papers from PubMed, extracts figures from PMC, reads papers with an LLM to extract claims, and enriches wiki pages with rich citations. See the wiki quality quest spec.

Git Commits (15)

Squash merge: orchestra/task/29dc1bc3-paper-processing-pipeline-fetch-extract (2 commits) (#739) 2026-04-27

[Atlas] Update wiki quality quest spec work log [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-27

[Atlas] Paper processing pipeline: add figures extraction module + fix JSONB deserialization bug 2026-04-27

[Atlas] Make paper processing retries robust [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-21

[Atlas] Paper processing pipeline: PostgreSQL migration + figures_fallback fix [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Atlas] Update wiki quality quest work log [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Atlas] Migrate paper_processing_pipeline.py from SQLite to PostgreSQL; fix transaction handling [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Atlas] Remove figures_fallback_middleware - return truthful 404s for missing figure files [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Atlas] Paper processing pipeline: PostgreSQL migration + figures_fallback fix [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Atlas] Update wiki quality quest work log [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Atlas] Migrate paper_processing_pipeline.py from SQLite to PostgreSQL; fix transaction handling [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Atlas] Remove figures_fallback_middleware - return truthful 404s for missing figure files [task:29dc1bc3-fa9a-4010-bf53-c780a5cde6ab] 2026-04-20

[Senate] Spec backfill batch 1: economics, squad, atlas drivers 2026-04-16

Spec File

Goal

Systematic wiki quality improvement per Morgan's feedback: more inline citations, richer references, more prose, figure hover previews.

Related quests: external_refs_quest_spec.md — WS3 (figure hover citations) and the broader "richer references" goal are now partially delivered by the unified external_refs table and its hover/click viewer framework. Non-paper references (Reactome, UniProt, Wikipedia, ClinicalTrials, etc.) live in that new table; figure hovers reuse its source_kind='pubmed_figure' preview handler.

Workstreams

WS1: Citation enrichment

Scan pages with <5 inline citations, add [@key] markers + enrich refs_json.

WS2: Prose improvement

Flag pages with <35% prose ratio, rewrite bullets to narrative.

WS3: Figure hover citations

For refs with figure_ref, show the paper figure on hover. Proof of principle on FOX* pages.

WS4: Quality flagging pipeline

Daily scanner scoring all pages, creating improvement tasks for flagged pages.
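
The WS1 and WS2 thresholds can be sketched as a small flagging function. This is only an illustration, not the scanner's actual code: the [@key] regex, the prose_ratio definition (share of non-heading characters outside bullet lines), and the function names are all assumptions. Only the <5 citation and <35% prose thresholds and the queue names come from the spec and work log.

```python
import re

# Assumed marker shape: [@key], where key contains no whitespace or brackets.
CITATION_RE = re.compile(r"\[@[^\]\s]+\]")
# Assumed bullet shape: "-", "*", "+" or "1." followed by whitespace.
BULLET_RE = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+")

MIN_CITATIONS = 5       # WS1 threshold from the spec
MIN_PROSE_RATIO = 0.35  # WS2 threshold from the spec


def count_inline_citations(text: str) -> int:
    """Count [@key] markers in the page body."""
    return len(CITATION_RE.findall(text))


def prose_ratio(text: str) -> float:
    """Share of characters on non-bullet lines (hypothetical definition).

    Blank lines and lines starting with '#' are ignored.
    """
    prose = total = 0
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        total += len(stripped)
        if not BULLET_RE.match(line):
            prose += len(stripped)
    return prose / total if total else 0.0


def flag_page(text: str) -> list[str]:
    """Return the improvement queues this page would be added to."""
    flags = []
    if count_inline_citations(text) < MIN_CITATIONS:
        flags.append("citation_enrichment")
    if prose_ratio(text) < MIN_PROSE_RATIO:
        flags.append("prose_improvement")
    return flags
```

A page that is mostly bullets with a single [@key] marker would land in both queues under these assumptions.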

Guidelines

> Prose improvement (WS2) regenerates entire sections with an LLM prompt enforcing: (1) each claim inline-cited; (2) ≥1 contradictory or nuanced statement per section (not all promotional); (3) ≤2 bullet points per section (prefer prose). Accept only if prose_ratio improves by ≥5% AND citation count increases by ≥50%. Batch regens use parallel agents.
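
The acceptance rule above can be written as a short predicate. This is a sketch under two stated assumptions: "improves by ≥5%" is read as five percentage points of prose_ratio (the guideline does not say whether it is absolute or relative), and a page that starts with zero citations passes the citation test as soon as it gains one. The function and parameter names are hypothetical.

```python
def accept_regen(ratio_before: float, ratio_after: float,
                 cites_before: int, cites_after: int,
                 min_ratio_gain: float = 0.05,
                 min_cite_growth: float = 0.50) -> bool:
    """Return True only if BOTH guideline conditions hold."""
    # Assumption: the 5% gain is absolute (percentage points), with a tiny
    # tolerance so float rounding does not reject an exact 0.05 gain.
    ratio_ok = (ratio_after - ratio_before) >= min_ratio_gain - 1e-9

    if cites_before == 0:
        # Assumption: relative growth is undefined from zero, so any
        # new citation counts as passing.
        cites_ok = cites_after > 0
    else:
        growth = (cites_after - cites_before) / cites_before
        cites_ok = growth >= min_cite_growth

    return ratio_ok and cites_ok
```

For example, going from a prose ratio of 0.30 to 0.40 with citations rising from 4 to 6 is accepted, while the same prose gain with citations rising from 4 to 5 is rejected.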

Work Log

2026-04-28 — Slot 55

Approach: verify the recurring wiki quality pipeline is still active, fix any current blocker found in the score/process path, then run one scheduled slice: score 200 random wiki pages and process up to 5 items from each improvement queue.
Initial status showed active backlog: citation_enrichment=2551 pending, prose_improvement=1363 pending, figure_enrichment=121 pending, kg_enrichment=1765 pending.
Found scoring issue before the run: PostgreSQL jsonb refs_json values arrive as Python dicts, but score_page() only attempted json.loads(), causing rich-reference counts to fall back to zero for jsonb-backed pages.
Fixed score_page() to handle both dict and string refs_json values, matching the existing enrichment workers.
Score run completed: scored=200, queued=377.
Process run completed: citation_enrichment=4 completed / 1 failed, prose_improvement=4 completed / 1 failed, figure_enrichment=5 completed, kg_enrichment=5 completed. The two failed LLM-backed items exhausted all configured providers during that call; queue rows were marked failed, not left running.
Citation enrichment also exposed PGShimConnection.autocommit incompatibility in the shared rate limiter during paper search; added an autocommit property on the PostgreSQL shim and verified rate_limiter.acquire('NCBI') returns {"status": "acquired"}.
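
The dict-or-string handling described for refs_json can be sketched as one tolerant loader. The name load_jsonish and the choice to return a default on malformed input are assumptions for illustration; the actual fix in score_page() may differ.

```python
import json
from typing import Any


def load_jsonish(value: Any, default: Any = None) -> Any:
    """Return parsed JSON whether the driver hands back text or objects.

    PostgreSQL jsonb columns typically arrive already deserialized
    (dict or list), while text columns arrive as str.
    """
    if value is None:
        return default
    if isinstance(value, (dict, list)):
        return value  # already deserialized by the driver
    if isinstance(value, (bytes, bytearray)):
        value = value.decode("utf-8")
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            # Assumption: malformed text is treated as "no data".
            return default
    return default
```

A caller would then count references with something like len(load_jsonish(row_refs, {})) regardless of column type.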

2026-04-27 — Slot 76

Paper processing pipeline run for task 29dc1bc3-fa9a-4010-bf53-c780a5cde6ab.
Pipeline cleared backlog: fetch/figures queues (66 each), read (67), enrich (64+4 retry).
Searched and queued 80 new TREM2/Alzheimer/microglia papers (20 per stage).
Bug fix: enrich_wiki_from_paper called json.loads() on refs_json and metadata columns that psycopg returns as already-deserialized Python dicts, causing TypeError. Fixed to handle both str and dict inputs.
Missing module: scripts/paper_figures_extraction.py did not exist, causing figures stage to return import_failed. Created the module with extract_figures_from_pmc(pmc_id) based on existing extract_figure_metadata.py patterns.
4 previously-failed enrich items retried and completed successfully after fix.
Commit 54253c1a3 pushed to branch.

2026-04-21 — Slot 50

Started paper processing recurring run for task 29dc1bc3-fa9a-4010-bf53-c780a5cde6ab.
Found 5 pending read queue rows; initial process attempt consumed them but returned read_failed because /data/papers existed yet was not writable in the worker sandbox.
Approach: make paper cache directory selection probe actual write/delete capability, fall back to configured production cache or /tmp without creating repo-local artifacts, then requeue and rerun the failed reads.
Rerun exposed retry completion conflict on the (pmid, stage, status) unique key; patched queue completion to update existing completed/failed rows and remove transient retry rows.
Verified rerun: read processed 5 papers (39 extracted claims total), enrich processed 4 retry rows plus the earlier 19731550 enrich, and queue status is clean (pending=0, failed=0 for all paper stages).
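
The cache directory probe described in this entry can be sketched as follows. The function names and the candidate order are hypothetical. The one idea taken from the log is to test real write/delete capability rather than trust that the directory exists, and not to create any directories while probing.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def is_writable_dir(path: Path) -> bool:
    """Probe by creating and deleting a temp file inside the directory."""
    if not path.is_dir():
        return False  # never create missing directories while probing
    try:
        fd, name = tempfile.mkstemp(prefix=".probe-", dir=path)
        os.close(fd)
        os.unlink(name)
        return True
    except OSError:
        return False


def pick_cache_dir(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that passes the write/delete probe."""
    for candidate in candidates:
        if is_writable_dir(candidate):
            return candidate
    return None
```

A caller might pass [Path("/data/papers"), Path(tempfile.gettempdir())], so a read-only /data/papers falls through to the temp directory instead of failing the read stage.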

2026-04-20 (continued)

Pipeline run results: scored=200, queued=358
Process phase results: citation_enrichment +5, prose_improvement +5, figure_enrichment +5, kg_enrichment +5 (with connection resilience retry)
Connection resilience fix: PostgreSQL idle timeout (AdminShutdown) kills connections during LLM calls; added retry with fresh connection and _mark_failed helper
Push blocked: GitHub credentials missing (no token/SSH keys configured in environment); gh auth status shows not logged in; infrastructure fix required
Commit e75e8a78d adds psycopg retry logic for process_queues

2026-04-20 (initial)

Guidelines at docs/planning/wiki-quality-guidelines.md
genes-foxp1: 29 inline citations (verified, not lost)
proteins-foxp1-protein: enrichment in progress
Quest created

2026-04-23 — Slot 41

Score run: Scored 200 random pages; 335 queued across 4 queues.
Process run: Processed 5 items from each of 4 queues (20 total) — all completed successfully.
Bug fix: score_page() used SQLite ? placeholder instead of PostgreSQL %s; fixed (though the qmark→%s translator in database.py would have caught this anyway).
Connection fix: LLM calls (60-120s) caused PostgreSQL idle-in-transaction timeouts. Fixed by calling db.commit() in enrich_citations and improve_prose before the LLM call to release the read transaction.
Error handling fix: Added IdleInTransactionSessionTimeout and OperationalError to the connection-error retry list in process_queues; guarded db.rollback() calls in _mark_failed against already-dead connections.
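
The two connection fixes in this entry (commit before the slow call, then retry on a fresh connection) can be sketched generically. Everything here is hypothetical scaffolding: connect is a caller-supplied factory, retry_on is whatever tuple of driver exceptions the caller considers connection errors, and process_item stands in for a worker such as enrich_citations. It is not the code in process_queues.

```python
import time
from typing import Any, Callable, Tuple, Type


def process_item(conn: Any, slow_call: Callable[[], Any]) -> Any:
    """Release the read transaction before a long-running call."""
    conn.commit()        # end the open transaction so it cannot idle out
    return slow_call()   # e.g. an LLM request lasting 60-120 s


def run_with_fresh_connection(
    connect: Callable[[], Any],
    work: Callable[[Any], Any],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 2,
    delay: float = 0.0,
) -> Any:
    """Run work(conn), reconnecting if a connection-level error occurs."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: BaseException = RuntimeError("no attempt made")
    for attempt in range(attempts):
        conn = connect()  # a new connection for every attempt
        try:
            return work(conn)
        except retry_on as exc:
            last_error = exc
            try:
                conn.rollback()
            except Exception:
                pass  # the connection is already dead; nothing to roll back
            if attempt + 1 < attempts:
                time.sleep(delay)
        finally:
            try:
                conn.close()
            except Exception:
                pass
    raise last_error
```

Guarding rollback() and close() mirrors the note above about already-dead connections: cleanup on a broken connection should not mask the original error or block the retry.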

Payload JSON

{
  "requirements": {
    "analysis": 7,
    "coding": 5,
    "safety": 9
  },
  "_stall_skip_providers": [
    "pro_allen",
    "max_gmail"
  ]
}

Sibling Tasks in Quest (Atlas)

○ [Atlas] Squad findings bubble-up driver (driver #20) P94

○ [Atlas] Install Dolt server + migrate first dataset (driver #26) P92

○ [Atlas] Dataset PR review & merge driver (driver #27) P92

○ [Atlas] Wiki mermaid LLM regen — 50 pages/run, parallel agents P92

○ [Atlas] Gap closure pipeline — match 500 open gaps to accumulated evidence and resolve P92

○ [Atlas] CI: Drive artifact folder migration backfill P92

○ [Atlas] Versioned tabular datasets — overall coordination quest P90

○ [Atlas] KG ↔ dataset cross-link driver (driver #30) P90

○ [Atlas] CI: Generate semantic metadata for unsummarized artifacts P90

○ [Atlas] PubMed evidence update pipeline P87

[Atlas] Paper processing pipeline: fetch, extract figures, read, enrich wiki (done) analysis:7 coding:5 safety:9