[Senate] World-model improvement detector (driver)
Task
- ID: 428c719e-a95a-40ca-8d8c-cba13e2f60cf
- Type: recurring
- Frequency: every-6h
- Layer: Senate
- Priority: P96
Goal
Detect concrete improvements to the platform's world model (new hypotheses that raise Elo above a threshold, wiki pages that gain substantive citations, datasets that meaningfully reduce uncertainty, analyses that refute or confirm prior claims) and emit world_model_improvement events, which the Economics v2 credit backprop pipeline uses as the root reward signal (see project_economics_v2_credit_backprop_2026-04-10).
What it does
- Compute deltas since last_wmi_scan_ts across four signal families:
  - Hypothesis Elo gains (rank change above noise floor, verified by multiple judges).
  - Wiki page quality jumps (citation count delta, reader score delta, structure improvements).
  - Dataset uncertainty reduction (variance of downstream predictions before/after merge).
  - Analysis confirmations/refutations (debate consensus shifts, benchmark wins/losses).
- For each delta that crosses its threshold, write a world_model_improvements row with source_ref, delta_magnitude, signal_family, and contributor_graph (author + reviewers + upstream citations).
- Run calibration: compare predicted improvements (from prediction markets) against realized ones, and feed the residual into the calibration_slashing driver.
- Emit the total daily world-model-improvement score to logs/wmi-latest.json and the /senate/world-model dashboard.
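The threshold step above can be sketched roughly as follows. This is a minimal illustration, not the driver's actual code: the `Delta` dataclass, the `detect_improvements` function, the family keys, and every numeric threshold are hypothetical, and "crosses its threshold" is assumed to mean greater than or equal to the floor.

```python
from dataclasses import dataclass, field

# Hypothetical per-family floors; the real values live in world_model_thresholds.
THRESHOLDS = {
    "hypothesis_elo": 25.0,
    "wiki_quality": 3.0,
    "dataset_uncertainty": 0.05,
    "analysis_outcome": 0.2,
}


@dataclass
class Delta:
    """One measured change since the last scan (illustrative shape only)."""
    source_ref: str
    signal_family: str
    delta_magnitude: float
    contributor_graph: dict = field(default_factory=dict)


def detect_improvements(deltas, thresholds=THRESHOLDS):
    """Return (rows, why_quiet); why_quiet is None when at least one row is produced."""
    rows = []
    for d in deltas:
        floor = thresholds.get(d.signal_family)
        # Skip unknown families and deltas below the floor (assumed >= semantics).
        if floor is None or d.delta_magnitude < floor:
            continue
        # Emitted events must carry a non-empty contributor_graph.
        if not d.contributor_graph:
            continue
        rows.append({
            "source_ref": d.source_ref,
            "signal_family": d.signal_family,
            "delta_magnitude": d.delta_magnitude,
            "contributor_graph": d.contributor_graph,
        })
    why_quiet = None if rows else "no deltas above threshold in window"
    return rows, why_quiet
```

Returning the `why_quiet` string alongside the rows keeps a quiet scan distinguishable from a scan that never ran.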
Success criteria
- Every scan produces either >0 improvement events OR a logged why_quiet explanation (e.g. "no new hypotheses above threshold in window").
- 100% of emitted events carry a non-empty contributor_graph (input to PageRank credit backprop).
- No event is emitted twice (UNIQUE(source_ref, signal_family)).
- Threshold drift is tracked: if >50% of events cluster at the minimum threshold, Senate is alerted to retune.
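The no-duplicates and threshold-drift criteria could be enforced along these lines. This is a sketch under stated assumptions: it uses an in-memory SQLite database as a stand-in for whatever store the driver really uses, the table layout beyond the UNIQUE(source_ref, signal_family) constraint is invented, and `emit_once`, `threshold_drift`, and the `tolerance` parameter are hypothetical names.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create a stand-in world_model_improvements table with the UNIQUE constraint."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS world_model_improvements (
               id INTEGER PRIMARY KEY,
               source_ref TEXT NOT NULL,
               signal_family TEXT NOT NULL,
               delta_magnitude REAL NOT NULL,
               UNIQUE (source_ref, signal_family)
           )"""
    )
    return conn


def emit_once(conn, source_ref, signal_family, delta_magnitude):
    """Insert a row; return True if it was new, False if the pair already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO world_model_improvements "
        "(source_ref, signal_family, delta_magnitude) VALUES (?, ?, ?)",
        (source_ref, signal_family, delta_magnitude),
    )
    conn.commit()
    return cur.rowcount == 1


def threshold_drift(magnitudes, floor, tolerance=0.0):
    """True when more than half of the event magnitudes sit at the minimum threshold.

    How close counts as "at the threshold" is an assumption, exposed as tolerance.
    """
    if not magnitudes:
        return False
    at_floor = sum(1 for m in magnitudes if m - floor <= tolerance)
    return at_floor / len(magnitudes) > 0.5
```

Leaving deduplication to the database constraint means a re-run after a crash cannot emit the same event twice, regardless of in-process state.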
Quality requirements
- No stubs: output must be substantive; link to the meta-quest quest_quality_standards_spec.md. Events with magnitude equal to the floor and no citation evidence are auto-rejected.
- When operating on >=10 items, use 3-5 parallel agents each handling a disjoint slice (shard by signal_family).
- Log total items processed + items that required retry so we can detect busywork.
- Thresholds are stored in world_model_thresholds and versioned; changes require a Senate proposal.
- Output feeds discovery-dividend math (44651656, 5531507e); schema breaks must bump a version number.
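The sharding and retry-logging requirements might look like the sketch below. It assumes threads as a stand-in for the parallel agents, items shaped as dicts with a `signal_family` key, and a caller-supplied `handler`; `shard_by_family`, `process_shard`, `run_sharded`, and the retry limit are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def shard_by_family(items):
    """Group items into disjoint slices keyed by signal_family."""
    shards = defaultdict(list)
    for item in items:
        shards[item["signal_family"]].append(item)
    return list(shards.values())


def process_shard(shard, handler, max_retries=2):
    """Run handler on each item, counting items that needed at least one retry."""
    stats = {"processed": 0, "retried": 0, "failed": 0}
    for item in shard:
        attempts = 0
        while True:
            attempts += 1
            try:
                handler(item)
                stats["processed"] += 1
                break
            except Exception:
                if attempts > max_retries:
                    stats["failed"] += 1
                    break
        if attempts > 1:
            stats["retried"] += 1
    return stats


def run_sharded(items, handler, min_parallel=10, max_workers=5):
    """Process sequentially below min_parallel items, else one worker per shard (capped)."""
    shards = shard_by_family(items)
    if len(items) < min_parallel:
        results = [process_shard(s, handler) for s in shards]
    else:
        workers = min(max_workers, len(shards))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda s: process_shard(s, handler), shards))
    totals = {"processed": 0, "retried": 0, "failed": 0}
    for r in results:
        for key in totals:
            totals[key] += r[key]
    return totals
```

With the four signal families listed earlier, sharding by family yields four slices, which falls inside the 3-5 worker range; the returned totals are the numbers the log line would need for spotting busywork.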
Work Log
2026-04-23 03:30 PDT — Slot 51
- Ran Driver #13 dry-run and then live: emitted 10 pending citation_threshold_medium world-model improvement rows.
- Verified idempotency after the live run: follow-up dry-run reported no new citation/gap/promoted events.
- Consistency audit found 4 high-confidence hypotheses (confidence_score >= 0.7) missing hypothesis_matured events while the driver reported no-op. Cause: the confidence_growth_last_created_at cursor skipped older rows whose confidence crossed the threshold after the cursor had already advanced past them.
- Plan: remove mutable-signal created_at watermarks and rely on world_model_improvements DB-level existence checks for idempotency, then backfill the 4 missed rows.
- Implemented the cursor removal for hypothesis_matured and hypothesis_promoted; both now scan for eligible artifacts missing improvement rows.
- Ran Driver #13 live after the fix: emitted 4 pending hypothesis_matured rows for the missed high-confidence hypotheses.
- Verified final state: follow-up dry-run no-op; 0 high-confidence hypotheses missing hypothesis_matured; 0 promoted hypotheses missing hypothesis_promoted.
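The cursor failure mode and the existence-check replacement described in this entry can be reproduced in miniature. This is an illustrative sketch only: the schema, the `event_type` column, the function names, and the dates used in the usage note are assumptions, not the driver's real tables or data.

```python
import sqlite3


def make_db():
    """Stand-in schema: a hypotheses table plus an improvements table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE hypotheses (
            id TEXT PRIMARY KEY,
            created_at TEXT NOT NULL,
            confidence_score REAL NOT NULL
        );
        CREATE TABLE world_model_improvements (
            source_ref TEXT NOT NULL,
            event_type TEXT NOT NULL,  -- hypothetical column name
            UNIQUE (source_ref, event_type)
        );
        """
    )
    return conn


def scan_with_cursor(conn, cursor_ts, threshold=0.7):
    """Old approach: only rows created after the watermark are considered."""
    sql = (
        "SELECT id FROM hypotheses "
        "WHERE created_at > ? AND confidence_score >= ? ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, (cursor_ts, threshold))]


def scan_missing(conn, threshold=0.7):
    """New approach: any eligible row that has no improvement row yet."""
    sql = (
        "SELECT h.id FROM hypotheses h "
        "LEFT JOIN world_model_improvements w "
        "  ON w.source_ref = h.id AND w.event_type = 'hypothesis_matured' "
        "WHERE h.confidence_score >= ? AND w.source_ref IS NULL "
        "ORDER BY h.id"
    )
    return [row[0] for row in conn.execute(sql, (threshold,))]
```

For example, take a hypothesis created on a made-up date of 2026-04-01 with confidence 0.5, a watermark that later advances to 2026-04-10, and a confidence update to 0.8 after that. `scan_with_cursor` never returns the row, because `created_at` does not change when the score does, while `scan_missing` returns it until the matching improvement row is written.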
2026-04-23 04:05 PDT — Codex
- Reviewed in-progress Driver #13 patch and confirmed the mutable-signal cursor issue applies to resolved gaps, high-confidence hypotheses, and promoted hypotheses.
- Added focused regression tests for older hypotheses whose confidence_score or status changes after cursor advancement.
- Verified: pytest -q tests/test_detect_improvements.py -> 2 passed; python3 -m economics_drivers.detect_improvements --limit 20 --dry-run -> no-op.
- DB consistency audit: 0 high-confidence hypotheses missing hypothesis_matured, 0 promoted hypotheses missing hypothesis_promoted, 0 resolved gaps missing gap_resolved.