[Atlas] Wiki quality improvement quest

Goal

Systematic wiki quality improvement per Morgan's feedback: more inline citations, richer references, more prose, figure hover previews.

Related quests: external_refs_quest_spec.md — WS3 (figure hover citations) and the broader "richer references" goal are now partially delivered by the unified external_refs table and its hover/click viewer framework. Non-paper references (Reactome, UniProt, Wikipedia, ClinicalTrials, etc.) live in that new table; figure hovers reuse its source_kind='pubmed_figure' preview handler.

Workstreams

WS1: Citation enrichment

Scan pages with <5 inline citations, add [@key] markers + enrich refs_json.

WS2: Prose improvement

Flag pages with <35% prose ratio, rewrite bullets to narrative.

WS3: Figure hover citations

For refs with figure_ref, show the paper figure on hover. Proof of principle on FOX* pages.

WS4: Quality flagging pipeline

Daily scanner scoring all pages, creating improvement tasks for flagged pages.

Guidelines

> Prose improvement (WS2) regenerates entire sections with an LLM prompt enforcing: (1) each claim inline-cited; (2) ≥1 contradictory or nuanced statement per section (not all promotional); (3) ≤2 bullet points per section (prefer prose). Accept only if prose_ratio improves by ≥5% AND citation count increases by ≥50%. Batch regens use parallel agents.

Work Log

2026-04-21 — Slot 50

Started paper processing recurring run for task 29dc1bc3-fa9a-4010-bf53-c780a5cde6ab.
Found 5 pending read queue rows; initial process attempt consumed them but returned read_failed because /data/papers existed yet was not writable in the worker sandbox.
Approach: make paper cache directory selection probe actual write/delete capability, fall back to configured production cache or /tmp without creating repo-local artifacts, then requeue and rerun the failed reads.
Rerun exposed retry completion conflict on the (pmid, stage, status) unique key; patched queue completion to update existing completed/failed rows and remove transient retry rows.
Verified rerun: read processed 5 papers (39 extracted claims total), enrich processed 4 retry rows plus the earlier 19731550 enrich, and queue status is clean (pending=0, failed=0 for all paper stages).

2026-04-20 (continued)

Pipeline run results: scored=200, queued=358
Process phase results: citation_enrichment +5, prose_improvement +5, figure_enrichment +5, kg_enrichment +5 (with connection resilience retry)
Connection resilience fix: PostgreSQL idle timeout (AdminShutdown) kills connections during LLM calls; added retry with fresh connection and _mark_failed helper
Push blocked: GitHub credentials missing (no token/SSH keys configured in environment); gh auth status shows not logged in; infrastructure fix required
Commit e75e8a78d adds psycopg retry logic for process_queues

2026-04-20 (initial)

Guidelines at docs/planning/wiki-quality-guidelines.md
genes-foxp1: 29 inline citations (verified, not lost)
proteins-foxp1-protein: enrichment in progress
Quest created

2026-04-23 — Slot 41

Score run: Scored 200 random pages; 335 queued across 4 queues.
Process run: Processed 5 items from each of 4 queues (20 total) — all completed successfully.
Bug fix: score_page() used SQLite ? placeholder instead of PostgreSQL %s; fixed (though the qmark→%s translator in database.py would have caught this anyway).
Connection fix: LLM calls (60-120s) caused PostgreSQL idle-in-transaction timeouts. Fixed by calling db.commit() in enrich_citations and improve_prose before the LLM call to release the read transaction.
Error handling fix: Added IdleInTransactionSessionTimeout and OperationalError to the connection-error retry list in process_queues; guarded db.rollback() calls in _mark_failed against already-dead connections.

Tasks using this spec (2)

[Atlas] Wiki quality pipeline: score pages and process impro

Atlas running P77

[Atlas] Paper processing pipeline: fetch, extract figures, r

Atlas open P77

File: quest_wiki_quality_improvement_spec.md

Modified: 2026-04-24 07:15

Size: 4.0 KB