[Atlas] Wiki quality improvement quest


Goal

Systematic wiki quality improvement per Morgan's feedback: more inline citations, richer references, more prose, figure hover previews.

Related quests: external_refs_quest_spec.md — WS3 (figure hover citations) and the broader "richer references" goal are now partially delivered by the unified external_refs table and its hover/click viewer framework. Non-paper references (Reactome, UniProt, Wikipedia, ClinicalTrials, etc.) live in that new table; figure hovers reuse its source_kind='pubmed_figure' preview handler.

Workstreams

WS1: Citation enrichment

Scan for pages with fewer than 5 inline citations, add [@key] markers, and enrich refs_json.
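As a rough illustration of the WS1 scan, the sketch below counts [@key] markers in a page body and flags pages under the threshold. The key pattern and the function names are assumptions made for this example, not the pipeline's actual code.

```python
import re

# Assumed key syntax: letters, digits, and a few separators, e.g. [@smith2020].
CITATION_RE = re.compile(r"\[@([A-Za-z0-9_:.\-]+)\]")

MIN_CITATIONS = 5  # WS1 threshold


def count_inline_citations(markdown: str) -> int:
    """Count [@key] markers in the page text (repeated keys count each time)."""
    return len(CITATION_RE.findall(markdown))


def needs_citation_enrichment(markdown: str) -> bool:
    """Flag a page whose inline citation count is below the WS1 threshold."""
    return count_inline_citations(markdown) < MIN_CITATIONS
```

Counting markers rather than distinct keys is a design choice here; a scanner that cares about reference diversity might count unique keys instead.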

WS2: Prose improvement

Flag pages whose prose ratio is below 35% and rewrite their bullet lists as narrative prose.
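The spec does not define how the prose ratio is computed. One plausible reading, sketched below, is the share of non-blank characters that sit on prose lines, where headings, list items, and table rows count as non-prose. Both that definition and the names used are assumptions.

```python
import re

# Lines treated as non-prose (assumption): bullets, numbered items, table rows, headings.
NON_PROSE_RE = re.compile(r"^\s*(?:[-*+•]\s+|\d+[.)]\s+|\||#)")

MIN_PROSE_RATIO = 0.35  # WS2 threshold


def prose_ratio(markdown: str) -> float:
    """Fraction of non-blank characters that sit on prose lines."""
    prose = total = 0
    for line in markdown.splitlines():
        text = line.strip()
        if not text:
            continue  # blank lines do not count either way
        total += len(text)
        if not NON_PROSE_RE.match(line):
            prose += len(text)
    return prose / total if total else 0.0


def needs_prose_improvement(markdown: str) -> bool:
    """Flag a page whose prose ratio is below the WS2 threshold."""
    return prose_ratio(markdown) < MIN_PROSE_RATIO
```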

WS3: Figure hover citations

For refs with figure_ref, show the paper figure on hover. Proof of principle on FOX* pages.

WS4: Quality flagging pipeline

Daily scanner scoring all pages, creating improvement tasks for flagged pages.

Guidelines

> Prose improvement (WS2) regenerates entire sections with an LLM prompt enforcing: (1) each claim inline-cited; (2) ≥1 contradictory or nuanced statement per section (not all promotional); (3) ≤2 bullet points per section (prefer prose). Accept only if prose_ratio improves by ≥5% AND citation count increases by ≥50%. Batch regens use parallel agents.
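The acceptance rule in the guideline can be expressed as a small predicate. The sketch below assumes that "≥5%" means five percentage points of prose ratio and that "≥50%" is a relative increase in citation count. The spec does not say which reading is intended, and the function name is hypothetical.

```python
def accept_regeneration(
    old_ratio: float,
    new_ratio: float,
    old_citations: int,
    new_citations: int,
) -> bool:
    """Return True if a regenerated section passes both acceptance checks."""
    # Assumption: the prose ratio must rise by at least 0.05 in absolute terms.
    ratio_ok = (new_ratio - old_ratio) >= 0.05
    # Assumption: citations must grow by at least 50% relative to the old count.
    if old_citations == 0:
        citations_ok = new_citations > 0  # any citation beats none
    else:
        citations_ok = new_citations >= 1.5 * old_citations
    return ratio_ok and citations_ok
```

For example, a section going from a 0.30 ratio with 4 citations to 0.40 with 6 passes, while the same section with only 5 citations afterwards does not.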

Work Log

2026-04-21 — Slot 50

  • Started paper processing recurring run for task 29dc1bc3-fa9a-4010-bf53-c780a5cde6ab.
  • Found 5 pending read queue rows; initial process attempt consumed them but returned read_failed because /data/papers existed yet was not writable in the worker sandbox.
  • Approach: make paper cache directory selection probe actual write/delete capability, fall back to configured production cache or /tmp without creating repo-local artifacts, then requeue and rerun the failed reads.
  • Rerun exposed retry completion conflict on the (pmid, stage, status) unique key; patched queue completion to update existing completed/failed rows and remove transient retry rows.
  • Verified rerun: read processed 5 papers (39 extracted claims total), enrich processed 4 retry rows plus the earlier 19731550 enrich, and queue status is clean (pending=0, failed=0 for all paper stages).
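The cache directory probe described in this entry might look roughly like the sketch below: a directory only counts as usable if a file can actually be created and deleted inside it, because an existence check alone missed the read-only /data/papers case. The function names and the candidate ordering are assumptions, not the patched code.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def is_writable_dir(path: Path) -> bool:
    """Return True only if a file can be created and removed inside path."""
    if not path.is_dir():
        return False
    try:
        # The temporary file is created on entry and deleted on close.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".write_probe_"):
            pass
        return True
    except OSError:  # includes PermissionError and read-only filesystems
        return False


def pick_cache_dir(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that passes the write probe, else None."""
    for candidate in candidates:
        if is_writable_dir(candidate):
            return candidate
    return None
```

A caller could pass something like `[Path("/data/papers"), Path(tempfile.gettempdir())]`, with the configured production cache in between, so that nothing is ever written inside the repository checkout.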

2026-04-20 (continued)

  • Pipeline run results: scored=200, queued=358
  • Process phase results: citation_enrichment +5, prose_improvement +5, figure_enrichment +5, kg_enrichment +5 (with connection resilience retry)
  • Connection resilience fix: PostgreSQL idle timeout (AdminShutdown) kills connections during LLM calls; added retry with fresh connection and _mark_failed helper
  • Push blocked: GitHub credentials missing (no token/SSH keys configured in environment); gh auth status shows not logged in; infrastructure fix required
  • Commit e75e8a78d adds psycopg retry logic for process_queues
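The "retry with fresh connection" idea from this entry can be sketched generically as below. This is not the code in commit e75e8a78d: the helper name, its parameters, and the choice to inject the connect function and the retryable exception classes are all assumptions, made so that the example runs without a database driver.

```python
import time
from typing import Any, Callable, Tuple, Type


def _close_quietly(conn: Any) -> None:
    """Close a connection, ignoring errors from connections that are already dead."""
    try:
        conn.close()
    except Exception:
        pass


def run_with_fresh_connection(
    connect: Callable[[], Any],
    work: Callable[[Any], Any],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay: float = 0.0,
) -> Any:
    """Run work(conn), opening a new connection for every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: BaseException = RuntimeError("no attempt was made")
    for attempt in range(1, attempts + 1):
        conn = connect()  # never reuse a connection that may have been killed
        try:
            return work(conn)
        except retry_on as exc:
            last_error = exc
        finally:
            _close_quietly(conn)
        if attempt < attempts:
            time.sleep(delay)
    raise last_error
```

In the pipeline, `connect` would wrap the real database connection call and `retry_on` would hold the driver's connection-level exception classes. The `work` callable is expected to commit its own changes.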

2026-04-20 (initial)

  • Guidelines at docs/planning/wiki-quality-guidelines.md
  • genes-foxp1: 29 inline citations (verified, not lost)
  • proteins-foxp1-protein: enrichment in progress
  • Quest created

2026-04-23 — Slot 41

  • Score run: Scored 200 random pages; 335 queued across 4 queues.
  • Process run: Processed 5 items from each of 4 queues (20 total) — all completed successfully.
  • Bug fix: score_page() used SQLite ? placeholder instead of PostgreSQL %s; fixed (though the qmark→%s translator in database.py would have caught this anyway).
  • Connection fix: LLM calls (60-120s) caused PostgreSQL idle-in-transaction timeouts. Fixed by calling db.commit() in enrich_citations and improve_prose before the LLM call to release the read transaction.
  • Error handling fix: Added IdleInTransactionSessionTimeout and OperationalError to the connection-error retry list in process_queues; guarded db.rollback() calls in _mark_failed against already-dead connections.
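The two connection fixes in this entry follow a common pattern: end the read transaction before a slow external call, and never let a rollback on a dead connection mask the original error. A minimal sketch follows, assuming a DB-API style object with commit() and rollback(). The names `run_llm_step` and `safe_rollback` and the load/call_llm/save callables are hypothetical stand-ins, not the functions in the pipeline.

```python
from typing import Any, Callable


def safe_rollback(db: Any) -> bool:
    """Try to roll back; return False if the connection is already unusable."""
    try:
        db.rollback()
        return True
    except Exception:
        return False


def run_llm_step(
    db: Any,
    load: Callable[[Any], Any],
    call_llm: Callable[[Any], Any],
    save: Callable[[Any, Any], None],
) -> Any:
    """Read input, release the transaction, call the LLM, then store the result."""
    payload = load(db)
    db.commit()  # close the read transaction so it is not idle during the slow call
    try:
        result = call_llm(payload)  # may take a minute or more
        save(db, result)
        db.commit()
        return result
    except Exception:
        safe_rollback(db)  # guarded: the connection may already be dead
        raise
```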

Tasks using this spec (2)
[Atlas] Wiki quality pipeline: score pages and process impro
Atlas running P77
[Atlas] Paper processing pipeline: fetch, extract figures, r
Atlas open P77
File: quest_wiki_quality_improvement_spec.md
Modified: 2026-04-24 07:15
Size: 4.0 KB