[Senate] World-model improvement detector (driver)

Task

  • ID: 428c719e-a95a-40ca-8d8c-cba13e2f60cf
  • Type: recurring
  • Frequency: every-6h
  • Layer: Senate
  • Priority: P96

Goal

Detect concrete improvements to the platform's world model — new hypotheses that raise Elo above a threshold, wiki pages that gain substantive citations, datasets that meaningfully reduce uncertainty, analyses that refute/confirm prior claims — and emit world_model_improvement events that the Economics v2 credit backprop pipeline uses as the root reward signal (see project_economics_v2_credit_backprop_2026-04-10).

What it does

  • Compute deltas since last_wmi_scan_ts across four signal families:
      - Hypothesis Elo gains (rank change above noise floor, verified by multiple judges).
      - Wiki page quality jumps (citation count delta, reader score delta, structure improvements).
      - Dataset uncertainty reduction (variance of downstream predictions before/after merge).
      - Analysis confirmations/refutations (debate consensus shifts, benchmark wins/losses).
  • For each delta that crosses its threshold, write a world_model_improvements row with source_ref, delta_magnitude, signal_family, contributor_graph (author + reviewers + upstream citations).
  • Run calibration: compare predicted improvements (from prediction markets) against realized improvements, and feed the residual into the calibration_slashing driver.
  • Emit the total daily world-model-improvement score to logs/wmi-latest.json and the /senate/world-model dashboard.
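
The threshold step above can be sketched as follows. This is a minimal illustration, not the driver's actual code: the Delta class, the detect_improvements function, the family keys, and the numeric thresholds are all hypothetical, and real thresholds would come from world_model_thresholds. Skipping deltas with an empty contributor_graph is one assumed way to satisfy the non-empty requirement.

```python
from dataclasses import dataclass, field

# Hypothetical per-family floors; the spec keeps real values in the
# world_model_thresholds table, which this sketch does not read.
THRESHOLDS = {
    "hypothesis_elo": 25.0,
    "wiki_quality": 3.0,
    "dataset_uncertainty": 0.05,
    "analysis_outcome": 0.2,
}


@dataclass
class Delta:
    """One measured change since last_wmi_scan_ts (illustrative shape)."""
    source_ref: str
    signal_family: str
    delta_magnitude: float
    contributor_graph: dict = field(default_factory=dict)


def detect_improvements(deltas, thresholds=THRESHOLDS):
    """Return (rows, why_quiet) for deltas that reach their family floor."""
    rows = []
    for d in deltas:
        floor = thresholds.get(d.signal_family)
        # Unknown family or below-threshold delta: nothing to emit.
        if floor is None or d.delta_magnitude < floor:
            continue
        # Assumption: events without a contributor graph are skipped,
        # since credit backprop needs a non-empty graph.
        if not d.contributor_graph:
            continue
        rows.append({
            "source_ref": d.source_ref,
            "signal_family": d.signal_family,
            "delta_magnitude": d.delta_magnitude,
            "contributor_graph": d.contributor_graph,
        })
    # A quiet scan still carries an explanation.
    why_quiet = None if rows else "no deltas above threshold in window"
    return rows, why_quiet
```

Returning the why_quiet string alongside the rows keeps the "events or explanation" rule in one place, so a caller cannot emit an empty scan without a reason.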

Success criteria

  • Every scan produces either >0 improvement events OR a logged why_quiet explanation (e.g. "no new hypotheses above threshold in window").
  • 100% of emitted events carry a non-empty contributor_graph (input to PageRank credit backprop).
  • No event is emitted twice (UNIQUE(source_ref, signal_family)).
  • Threshold drift is tracked: if >50% of events cluster at the minimum threshold, Senate is alerted to retune.
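
Two of the criteria above, no duplicate events and threshold-drift alerts, can be illustrated with the standard-library sqlite3 module. This is a sketch under assumptions: the three-column table is a cut-down stand-in for the real world_model_improvements schema, the helper names make_db, emit, and needs_retune are hypothetical, and "cluster at the minimum threshold" is read here as "equal to the floor within a small tolerance".

```python
import sqlite3


def make_db():
    """Create an in-memory table with the UNIQUE constraint from the spec."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE world_model_improvements (
            source_ref      TEXT NOT NULL,
            signal_family   TEXT NOT NULL,
            delta_magnitude REAL NOT NULL,
            UNIQUE (source_ref, signal_family)
        )
        """
    )
    return conn


def emit(conn, source_ref, signal_family, delta_magnitude):
    """Insert one event; return True only if a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO world_model_improvements VALUES (?, ?, ?)",
        (source_ref, signal_family, delta_magnitude),
    )
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cur.rowcount == 1


def needs_retune(magnitudes, floor, tol=1e-9):
    """True if more than 50% of event magnitudes sit at the floor."""
    if not magnitudes:
        return False
    at_floor = sum(1 for m in magnitudes if abs(m - floor) <= tol)
    return at_floor / len(magnitudes) > 0.5
```

Letting the database enforce uniqueness means a repeated scan is harmless even if the in-process bookkeeping is wrong.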

Quality requirements

  • No stubs: output must be substantive (see the meta-quest quest_quality_standards_spec.md). Events whose delta_magnitude equals the threshold floor and that carry no citation evidence are auto-rejected.
  • When operating on >=10 items, use 3-5 parallel agents each handling a disjoint slice (shard by signal_family).
  • Log total items processed + items that required retry so we can detect busywork.
  • Thresholds are stored in world_model_thresholds and versioned; changes require a Senate proposal.
  • Output feeds discovery-dividend math (44651656, 5531507e) — schema breaks must bump a version number.
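
The sharding and auto-reject rules above might look like the sketch below. The dictionary keys ("signal_family", "delta_magnitude", "citations") and both function names are assumptions chosen for illustration; the real event shape may differ.

```python
import math
from collections import defaultdict


def shard_by_family(items):
    """Group items into disjoint slices keyed by signal_family."""
    shards = defaultdict(list)
    for item in items:
        shards[item["signal_family"]].append(item)
    return dict(shards)


def auto_reject(event, floor):
    """Reject events that sit exactly at the floor with no citation evidence."""
    at_floor = math.isclose(event["delta_magnitude"], floor)
    has_citations = bool(event.get("citations"))
    return at_floor and not has_citations
```

Because each item has exactly one signal_family, the slices are disjoint by construction and can be handed to separate agents without coordination.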

Work Log

2026-04-23 03:30 PDT — Slot 51

  • Ran Driver #13 dry-run and then live: emitted 10 pending citation_threshold_medium world-model improvement rows.
  • Verified idempotency after the live run: follow-up dry-run reported no new citation/gap/promoted events.
  • Consistency audit found 4 high-confidence hypotheses (confidence_score >= 0.7) missing hypothesis_matured events while the driver reported no-op, caused by the confidence_growth_last_created_at cursor skipping older rows whose confidence crossed the threshold later.
  • Plan: remove mutable-signal created_at watermarks and rely on world_model_improvements DB-level existence checks for idempotency, then backfill the 4 missed rows.
  • Implemented the cursor removal for hypothesis_matured and hypothesis_promoted; both now scan for eligible artifacts missing improvement rows.
  • Ran Driver #13 live after the fix: emitted 4 pending hypothesis_matured rows for the missed high-confidence hypotheses.
  • Verified final state: follow-up dry-run no-op; 0 high-confidence hypotheses missing hypothesis_matured; 0 promoted hypotheses missing hypothesis_promoted.
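
The cursor-free scan described in this entry can be sketched with an existence check. This is an illustration only: the hypotheses table name, its id column, and the event_type column on world_model_improvements are hypothetical, while confidence_score, the 0.7 cutoff, and the hypothesis_matured event name come from the entries above.

```python
import sqlite3

# Select eligible hypotheses that have no matching improvement row yet.
# There is no created_at watermark, so a row whose confidence crossed the
# threshold long after creation is still found.
MISSING_MATURED_SQL = """
SELECT h.id
FROM hypotheses AS h
WHERE h.confidence_score >= ?
  AND NOT EXISTS (
      SELECT 1
      FROM world_model_improvements AS w
      WHERE w.source_ref = h.id
        AND w.event_type = 'hypothesis_matured'
  )
ORDER BY h.id
"""


def find_missing_matured(conn, min_confidence=0.7):
    """Return ids of high-confidence hypotheses lacking a matured event."""
    return [row[0] for row in conn.execute(MISSING_MATURED_SQL, (min_confidence,))]
```

Running the same query after a live run doubles as the consistency audit: an empty result means no high-confidence hypothesis is missing its event.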

2026-04-23 04:05 PDT — Codex

  • Reviewed in-progress Driver #13 patch and confirmed the mutable-signal cursor issue applies to resolved gaps, high-confidence hypotheses, and promoted hypotheses.
  • Added focused regression tests for older hypotheses whose confidence_score or status changes after cursor advancement.
  • Verified: pytest -q tests/test_detect_improvements.py -> 2 passed; python3 -m economics_drivers.detect_improvements --limit 20 --dry-run -> no-op.
  • DB consistency audit: 0 high-confidence hypotheses missing hypothesis_matured, 0 promoted hypotheses missing hypothesis_promoted, 0 resolved gaps missing gap_resolved.

File: 428c719e-a95a-40ca-8d8c-cba13e2f60cf_spec.md
Modified: 2026-04-25 23:40
Size: 4.3 KB