[Wiki] Reduce stub backlog on high-priority wiki pages

Goal

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
> atlas — every principle is load-bearing. In particular:
> - LLMs for semantic judgment; rules for syntactic validation.
> - Gap-predicate driven, not calendar-driven.
> - Idempotent + version-stamped + observable.
> - No hardcoded entity lists, keyword lists, or canonical-name tables.
> - Three surfaces: FastAPI + orchestra + MCP.
> - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
> AG1 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
> Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
> docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
> BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

Continuously reduce the backlog of short wiki pages by expanding the highest-value stubs first, using page importance, quality scores, and world-model connectivity to prioritize work.

Background

SciDEX has 17,400+ wiki pages (NeuroWiki source). Currently:

16 pages under 200 words (severe stubs)
30 pages 200-500 words (moderate stubs)

The stub backlog (pages under 500 words in high-value entity types) needs prioritized expansion.

Acceptance Criteria

☑ Script identifies all stub pages (word_count < 500) from NeuroWiki source

☑ Stubs are ranked by composite priority: connectivity_score × quality_gap × entity_type_weight

☑ Top N stubs (configurable, default 3) are selected per run

☑ LLM generates expanded content (500+ words) for each selected stub

☑ Expanded content is saved to wiki_pages table via db_writes helper

☑ Dry-run mode available (--dry-run flag)

☑ Idempotent: re-running on already-expanded pages skips them

Approach

Priority Scoring

priority_score = connectivity_score × 0.4 + quality_gap_score × 0.3 + entity_type_weight × 10 × 0.3

connectivity_score (0-10): Based on KG edge count. More connections = higher priority. quality_gap_score (0-10): 10 - composite_score from wiki_quality_scores. Lower quality = higher priority. entity_type_weight: disease=1.0, mechanism=0.9, therapeutic=0.85, protein=0.7, gene=0.7, cell=0.6, pathway=0.6, clinical=0.6, dataset=0.5

Implementation

Script: scripts/reduce_wiki_stub_backlog.py

Key functions:

StubCandidate dataclass
get_stub_candidates() - queries DB for stubs
score_connectivity() - KG edge count scoring
score_quality_gap() - quality score gap
build_expanded_content() - LLM call
expand_stub() - main expansion logic
main() - CLI entry point

Work Log

2026-04-21 14:30 PT — Run: backlog clear, no stubs eligible

Rebased onto current origin/main (4cc29ba25)
Verified script: python3 scripts/reduce_wiki_stub_backlog.py --dry-run --limit 1 → 0 candidates
All NeuroWiki pages with word_count < 500 are ai_tool type (excluded by 0.0 entity weight)
All high-priority stubs (disease, mechanism, therapeutic, protein, gene, cell, pathway, clinical, dataset, phenotype) are already 500+ words
Backlog is clear; script is idempotent and ready for next high-priority stub

2026-04-19 — DB corruption workaround + 5 stubs expanded

Rewrote script at scripts/reduce_wiki_stub_backlog.py (previously deprecated/moved to scripts/deprecated/)
DB corruption (integrity_check reports 200+ errors) prevents normal SQL queries on some page ranges
Implemented rowid-based batch scanning with retry logic to work around corruption
Connectivity scoring also uses batch scan fallback since direct queries fail on corrupted ranges
Found 5 priority stubs: entities-parkinson (16w), proteins-akt (25w), pathways-autophagy (25w), phenotypes-inflammation (25w), phenotypes-neurodegeneration (25w)
All are placeholder pages ("A entity referenced in NNN knowledge graph relationships") with no real content
Expanded all 5 to 500+ words via LLM (sonnet model, 1200 token limit)

- entities-parkinson: 16 -> 714 words
- proteins-akt: 25 -> 572 words
- pathways-autophagy: 25 -> 561 words
- phenotypes-inflammation: 25 -> 683 words
- phenotypes-neurodegeneration: 25 -> 849 words

Remaining stubs (ai_tool, kdense_category, index types) intentionally skipped — not quality targets

2026-04-12 — Initial implementation

Created spec file with acceptance criteria
Implemented scripts/reduce_wiki_stub_backlog.py
Script tested in dry-run mode, found 3 stub candidates (therapeutics, clinical, dataset)
Priority scoring: connectivity × 0.4 + quality_gap × 0.3 + entity_weight × 10 × 0.3

2026-04-17 — Run: expanded senescence stub (180→705 words)

Queried current stub backlog: only 1 eligible high-priority stub remains (senescence, mechanism, 180 words)
Broader stub landscape: 4 pages <200 words, 11 <300, 31 <500 (most are ai_tool or index types, excluded)
senescence had highest priority score (8.20) due to 20 KG edges and mechanism entity type
Generated 705-word comprehensive overview covering senescence biology, role in AD/PD/ALS, key molecular players (SASP, SIRT1/6, TREM2, TFEB), clinical significance (senolytics), and cross-links to 6 detailed sub-pages
Saved via db_writes.save_wiki_page, verified DB write and HTTP 200 render
Backlog status: 0 high-priority stubs remaining (all eligible entity types under 500 words have been expanded)

Tasks using this spec (1)

[Wiki] Reduce stub backlog on high-priority wiki pages

Wiki blocked P78

File: d223c98a_2e11_wiki_reduce_stub_backlog_spec.md

Modified: 2026-04-25 23:40

Size: 5.9 KB