SciDEX — Task: [Atlas] Wiki quality pipeline: score pages and pro

Run wiki_quality_pipeline.py in two phases: (1) score 200 random pages and queue the low-quality ones; (2) process up to 5 items from each of the 4 queues (citation_enrichment, prose_improvement, figure_enrichment, kg_enrichment). Citation and prose work use an LLM; figure and KG linking are DB-only. See docs/planning/wiki-quality-guidelines.md for standards.

Git Commits (11)

Squash merge: orchestra/task/d5e4edc1-triage-25-pending-governance-decisions (32 commits) (#1091) - 2026-04-27

[Atlas] Run wiki quality pipeline slice and fix scoring [task:b399cd3e-8b24-4c13-9c7b-a58254ba5bdb] (#1065) - 2026-04-27

[Atlas] Update wiki quality spec work log [task:b399cd3e-8b24-4c13-9c7b-a58254ba5bdb] - 2026-04-23

[Atlas] Wiki quality pipeline: fix idle-in-transaction timeout, score 200 pages, process 20 queue items [task:b399cd3e-8b24-4c13-9c7b-a58254ba5bdb] - 2026-04-23

[Atlas] Update wiki quality quest work log [task:b399cd3e-8b24-4c13-9c7b-a58254ba5bdb] - 2026-04-20

[Atlas] wiki_quality_pipeline: SQLite→PostgreSQL migration + process_queues fix - 2026-04-20

[Atlas] Update wiki quality quest work log [task:b399cd3e-8b24-4c13-9c7b-a58254ba5bdb] - 2026-04-20

[Atlas] wiki_quality_pipeline: SQLite→PostgreSQL migration + process_queues fix - 2026-04-20

Squash merge: wiki-quality-fix-b399cd3e (1 commits) - 2026-04-12

[Atlas] Fix refs_json list crash in citation/figure enrichment workers [task:b399cd3e-8b24-4c13-9c7b-a58254ba5bdb] - 2026-04-12

[Atlas] Fix refs_json list handling + score 500 pages + populate queues - 2026-04-12

Spec File

Goal

Systematic wiki quality improvement per Morgan's feedback: more inline citations, richer references, more prose, figure hover previews.

Related quests: external_refs_quest_spec.md — WS3 (figure hover citations) and the broader "richer references" goal are now partially delivered by the unified external_refs table and its hover/click viewer framework. Non-paper references (Reactome, UniProt, Wikipedia, ClinicalTrials, etc.) live in that new table; figure hovers reuse its source_kind='pubmed_figure' preview handler.

Workstreams

WS1: Citation enrichment

Scan pages with <5 inline citations, add [@key] markers + enrich refs_json.
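
A minimal sketch of the WS1 scan condition, assuming inline citations appear as `[@key]` markers in the page markdown. The key character set and the function names are illustrative assumptions, not taken from wiki_quality_pipeline.py.

```python
import re

# Assumed marker syntax: [@key], where key uses letters, digits, "_", ":", "." or "-".
CITATION_RE = re.compile(r"\[@([A-Za-z0-9_:.\-]+)\]")


def count_inline_citations(markdown: str) -> int:
    """Count [@key] markers; repeated keys are counted once per occurrence."""
    return len(CITATION_RE.findall(markdown))


def needs_citation_enrichment(markdown: str, threshold: int = 5) -> bool:
    """Flag a page when it has fewer than `threshold` inline citations."""
    return count_inline_citations(markdown) < threshold
```

The default threshold of 5 mirrors the "<5 inline citations" rule above; whether the real scorer counts unique keys or occurrences is not stated here.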

WS2: Prose improvement

Flag pages with <35% prose ratio, rewrite bullets to narrative.
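
The prose ratio is not defined in this spec. As a hedged sketch, one plausible definition is the share of words that sit on lines which are not bullets, numbered items, headings, or table rows. Both the definition and the names below are assumptions made for illustration.

```python
import re

# Lines treated as non-prose: bullets, numbered items, headings, table rows.
NON_PROSE_RE = re.compile(r"^\s*(?:[-*+]\s+|\d+[.)]\s+|#{1,6}\s|\|)")


def prose_ratio(markdown: str) -> float:
    """Fraction of words that appear on prose lines (assumed definition)."""
    prose_words = 0
    total_words = 0
    for line in markdown.splitlines():
        words = len(line.split())
        if words == 0:
            continue  # blank lines do not count either way
        total_words += words
        if not NON_PROSE_RE.match(line):
            prose_words += words
    return prose_words / total_words if total_words else 0.0


def needs_prose_improvement(markdown: str, threshold: float = 0.35) -> bool:
    """Flag a page whose prose ratio is below the WS2 threshold."""
    return prose_ratio(markdown) < threshold
```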

WS3: Figure hover citations

For refs with figure_ref, show the paper figure on hover. Proof of principle on FOX* pages.

WS4: Quality flagging pipeline

Daily scanner scoring all pages, creating improvement tasks for flagged pages.

Guidelines

> Prose improvement (WS2) regenerates entire sections with an LLM prompt enforcing: (1) each claim inline-cited; (2) ≥1 contradictory or nuanced statement per section (not all promotional); (3) ≤2 bullet points per section (prefer prose). Accept only if prose_ratio improves by ≥5% AND citation count increases by ≥50%. Batch regens use parallel agents.
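
The acceptance rule in the guideline above can be sketched as a small predicate. This assumes the 5% prose gain means percentage points (0.30 to 0.35 or better) and the 50% citation gain is relative to the old count; the guideline does not say which reading is intended, and the zero-citation handling is also an assumption.

```python
def accept_regeneration(old_ratio: float, new_ratio: float,
                        old_citations: int, new_citations: int,
                        min_ratio_gain: float = 0.05,
                        min_citation_gain: float = 0.50) -> bool:
    """Return True only if both the prose and citation criteria are met."""
    # Assumption: the prose gain is measured in absolute percentage points.
    ratio_ok = (new_ratio - old_ratio) >= min_ratio_gain
    if old_citations == 0:
        # Assumption: any citation at all counts as an improvement from zero.
        citations_ok = new_citations > 0
    else:
        gain = (new_citations - old_citations) / old_citations
        citations_ok = gain >= min_citation_gain
    return ratio_ok and citations_ok
```

A production check would probably compare rounded or integer percentages, since floating-point subtraction can land just under the threshold at the exact boundary.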

Work Log

2026-04-28 — Slot 55

Approach: verify the recurring wiki quality pipeline is still active, fix any current blocker found in the score/process path, then run one scheduled slice: score 200 random wiki pages and process up to 5 items from each improvement queue.
Initial status showed active backlog: citation_enrichment=2551 pending, prose_improvement=1363 pending, figure_enrichment=121 pending, kg_enrichment=1765 pending.
Found scoring issue before the run: PostgreSQL jsonb refs_json values arrive as Python dicts, but score_page() only attempted json.loads(), causing rich-reference counts to fall back to zero for jsonb-backed pages.
Fixed score_page() to handle both dict and string refs_json values, matching the existing enrichment workers.
Score run completed: scored=200, queued=377.
Process run completed: citation_enrichment=4 completed / 1 failed, prose_improvement=4 completed / 1 failed, figure_enrichment=5 completed, kg_enrichment=5 completed. The two failed LLM-backed items exhausted all configured providers during that call; queue rows were marked failed, not left running.
Citation enrichment also exposed PGShimConnection.autocommit incompatibility in the shared rate limiter during paper search; added an autocommit property on the PostgreSQL shim and verified rate_limiter.acquire('NCBI') returns {"status": "acquired"}.
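
The refs_json fix described in this entry comes down to accepting both already-deserialized values (as returned for jsonb columns) and JSON strings. A minimal sketch follows; `load_refs` is a hypothetical helper name and not the function used in score_page().

```python
import json


def load_refs(refs_json):
    """Normalize a refs_json value to a Python object.

    Accepts dicts or lists (already deserialized), JSON text, or None.
    Falls back to an empty dict when the value cannot be parsed.
    """
    if refs_json is None:
        return {}
    if isinstance(refs_json, (dict, list)):
        return refs_json
    if isinstance(refs_json, (str, bytes, bytearray)):
        try:
            return json.loads(refs_json)
        except ValueError:  # covers json.JSONDecodeError and bad encodings
            return {}
    return {}
```

Returning an empty dict on failure keeps scoring running, at the cost of hiding malformed rows; logging the failure would be a reasonable addition.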

2026-04-27 — Slot 76

Paper processing pipeline run for task 29dc1bc3-fa9a-4010-bf53-c780a5cde6ab.
Pipeline cleared backlog: fetch/figures queues (66 each), read (67), enrich (64+4 retry).
Searched and queued 80 new TREM2/Alzheimer/microglia papers (20 per stage).
Bug fix: enrich_wiki_from_paper called json.loads() on refs_json and metadata columns that psycopg returns as already-deserialized Python dicts, causing TypeError. Fixed to handle both str and dict inputs.
Missing module: scripts/paper_figures_extraction.py did not exist, causing figures stage to return import_failed. Created the module with extract_figures_from_pmc(pmc_id) based on existing extract_figure_metadata.py patterns.
4 previously-failed enrich items retried and completed successfully after fix.
Commit 54253c1a3 pushed to branch.

2026-04-21 — Slot 50

Started paper processing recurring run for task 29dc1bc3-fa9a-4010-bf53-c780a5cde6ab.
Found 5 pending read queue rows; initial process attempt consumed them but returned read_failed because /data/papers existed yet was not writable in the worker sandbox.
Approach: make paper cache directory selection probe actual write/delete capability, fall back to configured production cache or /tmp without creating repo-local artifacts, then requeue and rerun the failed reads.
Rerun exposed retry completion conflict on the (pmid, stage, status) unique key; patched queue completion to update existing completed/failed rows and remove transient retry rows.
Verified rerun: read processed 5 papers (39 extracted claims total), enrich processed 4 retry rows plus the earlier 19731550 enrich, and queue status is clean (pending=0, failed=0 for all paper stages).
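
The cache-directory probe described in this entry can be sketched as follows: a directory counts as usable only if a file can actually be created and deleted in it. The function names and the "paper-cache" fallback directory name are hypothetical; only the idea of probing write/delete capability and falling back to a temp location comes from the log.

```python
import os
import tempfile
from pathlib import Path


def is_writable_dir(path) -> bool:
    """Return True if a probe file can be created and removed in `path`."""
    p = Path(path)
    if not p.is_dir():
        return False
    try:
        fd, name = tempfile.mkstemp(prefix=".probe-", dir=p)
        os.close(fd)
        os.unlink(name)
        return True
    except OSError:
        return False


def pick_cache_dir(candidates) -> Path:
    """Return the first usable candidate, else a directory under the system temp dir."""
    for candidate in candidates:
        if is_writable_dir(candidate):
            return Path(candidate)
    fallback = Path(tempfile.gettempdir()) / "paper-cache"  # hypothetical name
    fallback.mkdir(parents=True, exist_ok=True)
    return fallback
```

Probing by writing is more reliable than checking permission bits, because sandboxed or read-only mounts can report a directory as existing while still rejecting writes.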

2026-04-20 (continued)

Pipeline run results: scored=200, queued=358
Process phase results: citation_enrichment +5, prose_improvement +5, figure_enrichment +5, kg_enrichment +5 (with connection resilience retry)
Connection resilience fix: PostgreSQL idle timeout (AdminShutdown) kills connections during LLM calls; added retry with fresh connection and _mark_failed helper
Push blocked: GitHub credentials missing (no token/SSH keys configured in environment); gh auth status shows not logged in; infrastructure fix required
Commit e75e8a78d adds psycopg retry logic for process_queues
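
As an illustration of the retry-with-fresh-connection idea in this entry, the sketch below retries a unit of work on a new connection whenever a connection-level error is raised. It is deliberately driver-agnostic: `connect`, `work`, and `retry_on` are hypothetical parameters, and the actual process_queues code is not shown in this log.

```python
import time


def run_with_fresh_connection(connect, work, retry_on, attempts=3, delay=0.0):
    """Run work(conn) on a new connection, retrying on connection errors.

    connect:  zero-argument callable returning a connection-like object
              with commit() and close().
    work:     callable taking the connection and returning a result.
    retry_on: tuple of exception types treated as "connection died".
    """
    last_error = None
    for _ in range(attempts):
        conn = connect()
        try:
            result = work(conn)
            conn.commit()
            return result
        except retry_on as exc:
            last_error = exc
            if delay:
                time.sleep(delay)
        finally:
            try:
                conn.close()
            except Exception:
                pass  # the connection may already be dead
    raise last_error
```

With a PostgreSQL driver, `retry_on` would hold the driver's connection-level exception classes; the work itself should be safe to repeat, since a failed attempt may have partially run.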

2026-04-20 (initial)

Guidelines at docs/planning/wiki-quality-guidelines.md
genes-foxp1: 29 inline citations (verified, not lost)
proteins-foxp1-protein: enrichment in progress
Quest created

2026-04-23 — Slot 41

Score run: Scored 200 random pages; 335 queued across 4 queues.
Process run: Processed 5 items from each of 4 queues (20 total) — all completed successfully.
Bug fix: score_page() used SQLite ? placeholder instead of PostgreSQL %s; fixed (though the qmark→%s translator in database.py would have caught this anyway).
Connection fix: LLM calls (60-120s) caused PostgreSQL idle-in-transaction timeouts. Fixed by calling db.commit() in enrich_citations and improve_prose before the LLM call to release the read transaction.
Error handling fix: Added IdleInTransactionSessionTimeout and OperationalError to the connection-error retry list in process_queues; guarded db.rollback() calls in _mark_failed against already-dead connections.
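
For context on the placeholder fix in this entry, a qmark-to-format translation can be sketched as below. This is an illustrative guess at what such a translator does, not the code in database.py; it handles single-quoted string literals only, and the table and column names in the usage note are hypothetical.

```python
def qmark_to_format(sql: str) -> str:
    """Rewrite SQLite-style ? placeholders as %s for a format-style driver.

    Question marks inside single-quoted literals are left alone, and literal
    percent signs are doubled so the driver does not read them as placeholders.
    """
    out = []
    in_quote = False
    for ch in sql:
        if ch == "'":
            in_quote = not in_quote  # a doubled '' toggles twice, net unchanged
            out.append(ch)
        elif ch == "?" and not in_quote:
            out.append("%s")
        elif ch == "%":
            out.append("%%")
        else:
            out.append(ch)
    return "".join(out)
```

For example, `qmark_to_format("SELECT * FROM wiki_pages WHERE slug = ?")` yields the same query with `%s` in place of `?`. Dollar-quoted strings, comments, and operators that contain `?` are out of scope for this sketch.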

Payload JSON

{
  "requirements": {
    "analysis": 5,
    "coding": 5
  },
  "completion_shas": [
    "c4f4f5413ad77a0c7f800c9773692e875f9d5dc9",
    "ca7dbbb05d279c04aea6eb37a8799c7a950bc644",
    "6d1b435513a7bbc4fdb713dc5f03088571bd884b"
  ],
  "completion_shas_checked_at": "2026-04-13T06:12:33.158305+00:00",
  "completion_shas_missing": [
    "c76d419ddde97d9b3cbf4c21e9d2d3521a43e487"
  ]
}