[Senate] Evolving prompt suggestions - learn from miscalibrated outputs
Status: done

Detect chronically miscalibrated personas, propose prompt deltas, evaluate on prereg holdout, promote/rollback via Senate proposals.

Completion Notes

Auto-release: work already on origin/main

Git Commits (3)

Squash merge: orchestra/task/e0f5a890-evolving-prompt-suggestions-learn-from-m (2 commits) (#764) 2026-04-27
[Senate] Remove tests for deleted persona_pages.py routes [task:e0f5a890-3398-4100-9288-c7581cf55a20] 2026-04-27
[Senate] Evolving prompt suggestions — learn from miscalibrated outputs [task:e0f5a890-3398-4100-9288-c7581cf55a20] 2026-04-27
Spec File

Goal

The persona prompts (personas/skeptic/SKILL.md, personas/theorist/SKILL.md, etc.) are static. When the calibration
tracker (scidex/senate/calibration.py) shows that the Theorist is
chronically over-confident on neuroinflammation claims (predicted
0.85, observed 0.55), the system has no way to adjust the prompt
unless a human rewrites it. Build an evolving-prompt mechanism that
mines miscalibrated outputs, proposes prompt deltas (e.g. "add
explicit confidence-rubric: only state 'highly likely' when ≥ 3
independent supporting PMIDs"), tests the proposed prompt against a
held-out evaluation set, and auto-promotes the winner.
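
To make the over-confidence example concrete, the following minimal sketch computes a calibration gap and a Brier score for a batch of claims. The data are hypothetical (20 claims stated at 0.85, of which 55% held up), and the helper names `brier_score` and `calibration_gap` are illustrative; they are not taken from scidex/senate/calibration.py.

```python
from statistics import mean


def brier_score(predicted, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(predicted) != len(outcomes) or not predicted:
        raise ValueError("need equal-length, non-empty inputs")
    return mean((p - o) ** 2 for p, o in zip(predicted, outcomes))


def calibration_gap(predicted, outcomes):
    """Mean predicted probability minus observed hit rate.

    A positive value means the persona is over-confident.
    """
    return mean(predicted) - mean(outcomes)


# Hypothetical batch: every claim stated at 0.85, but only 11 of 20 were true.
predicted = [0.85] * 20
outcomes = [1] * 11 + [0] * 9

gap = calibration_gap(predicted, outcomes)   # about 0.30 (0.85 predicted vs 0.55 observed)
score = brier_score(predicted, outcomes)     # about 0.3375
```

A lower Brier score is better; a persona that always answered 0.55 on this batch would score about 0.2475, which is the kind of improvement a prompt delta is meant to produce.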

Effort: thorough

Acceptance Criteria

☐ scidex/senate/prompt_evolution.py::propose_prompt_delta(persona_id, miscalibration_pattern) -> PromptDelta where PromptDelta = {persona_id, version_from, version_to, change_summary, full_diff, motivating_calibration_data, generated_at}.
☐ Detection driver: weekly job runs scidex/senate/calibration.py::compute_brier(persona_id, window_days=30) for each persona; if the persona's Brier exceeds the system median + 1 σ, queue a prompt-evolution candidate.
☐ Migration migrations/20260428_prompt_versions.sql: persona_prompt_versions(persona_id, version, content TEXT, parent_version, motivation TEXT, brier_at_creation REAL, status ENUM('candidate','testing','active','retired'), created_at). The "active" version is what the Senate routes calls to.
☐ Evaluation harness: scidex/senate/prompt_eval.py::evaluate(persona_id, candidate_version, eval_set='preregistration_outcomes', n=50) -> EvalResult runs the candidate prompt over 50 historical (claim, predicted_probability, observed_outcome) tuples from preregistration_outcomes and returns Brier, ECE, hallucination rate from q-qual-hallucination-detector.
☐ Promotion rule: candidate prompt promoted to active only if brier_candidate < brier_active * 0.95 AND hallucination_rate_candidate ≤ hallucination_rate_active. Otherwise, candidate is retired with reason logged.
☐ Senate proposal: every promotion fires a prompt_evolution_proposal with the diff + motivating calibration data + eval result; needs ≥ 3 reviewers (or 2 reviewers + 7 days quiet) to ratify before going live.
☐ Rollback: if the promoted version's 14-day Brier worsens by > 5 %, auto-rollback to the previous active.
☐ HTML view /senate/prompt-evolution/{persona_id} shows version-history timeline + each version's Brier/hallucination over time.
☐ Tests tests/test_prompt_evolution.py: miscalibrated synthetic data → candidate generated; candidate that fails evaluation → retired; candidate that passes → promoted with proposal record; rollback path works.
☐ Constraint: never auto-edit the persona's SKILL.md directly — write to persona_prompt_versions.content and have the runtime read from there with file fallback.
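
The detection, promotion, and rollback criteria above reduce to three threshold checks. The sketch below expresses them as pure functions. The function names are hypothetical, the choice of population standard deviation for "1 σ" is an assumption (the spec does not say which estimator is used), and the persona Brier values in the example are made up.

```python
from statistics import median, pstdev


def needs_evolution(persona_brier, all_briers):
    """Detection rule: persona's Brier exceeds system median + 1 sigma.

    Assumption: sigma is the population standard deviation over all personas.
    """
    threshold = median(all_briers) + pstdev(all_briers)
    return persona_brier > threshold


def should_promote(brier_candidate, brier_active, halluc_candidate, halluc_active):
    """Promotion rule: at least 5% better Brier and no worse hallucination rate."""
    return brier_candidate < brier_active * 0.95 and halluc_candidate <= halluc_active


def should_rollback(brier_14d, brier_baseline):
    """Rollback rule: 14-day Brier is more than 5% worse than the baseline."""
    return brier_14d > brier_baseline * 1.05


# Hypothetical 30-day Brier scores; the last two persona names are made up.
briers = {"skeptic": 0.18, "theorist": 0.34, "methodologist": 0.20, "synthesizer": 0.19}
flagged = [p for p, b in briers.items() if needs_evolution(b, list(briers.values()))]
```

Keeping the rules free of database access makes each gate easy to test in isolation, which matches the four paths the test file is asked to cover.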

Approach

  • Move the persona prompt loading from "read from disk" to "read from persona_prompt_versions table, fallback to disk on miss" in scidex/agents/manifest.py.
  • Implement the candidate generator: prompt the model with the persona's current prompt + miscalibration data, ask for a minimal diff that addresses the miscalibration. Strict JSON output.
  • Implement the eval harness. preregistration_outcomes is the natural held-out set, since each row pairs a predicted probability with a ground-truth outcome.
  • Implement the proposal + ratification path via scidex/senate/governance.py.
  • Implement the rollback watchdog as a daily check in the calibration driver.
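
The first step above (read the prompt from the table, fall back to disk on a miss) can be sketched as follows. This is an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the real store, creates only a subset of the persona_prompt_versions columns, assumes integer version numbers, and passes the connection and the file body explicitly, which is not necessarily how scidex/agents/manifest.py wires it.

```python
import sqlite3


def get_active_prompt(conn, persona_id):
    """Return the active prompt text for a persona, or None if no row exists."""
    row = conn.execute(
        "SELECT content FROM persona_prompt_versions "
        "WHERE persona_id = ? AND status = 'active' "
        "ORDER BY version DESC LIMIT 1",
        (persona_id,),
    ).fetchone()
    return row[0] if row else None


def system_prompt(conn, persona_id, file_body):
    """DB first; fall back to the SKILL.md body already loaded from disk."""
    db_prompt = get_active_prompt(conn, persona_id)
    return db_prompt if db_prompt is not None else file_body


# SQLite stand-in with a reduced schema (assumption: versions are integers).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persona_prompt_versions "
    "(persona_id TEXT, version INTEGER, content TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO persona_prompt_versions VALUES (?, ?, ?, ?)",
    [("theorist", 1, "v1 prompt", "retired"), ("theorist", 2, "v2 prompt", "active")],
)

theorist_prompt = system_prompt(conn, "theorist", "theorist body from disk")
skeptic_prompt = system_prompt(conn, "skeptic", "skeptic body from disk")
```

Because the fallback is read-only, the constraint that SKILL.md is never edited by the system holds by construction: the only write path is the table.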

Dependencies

    • scidex/senate/calibration.py (shipped) — miscalibration source.
    • q-qual-hallucination-detector — hallucination rate signal in evaluation.
    • personas/*/SKILL.md — current prompts; runtime path moves to DB.

    Dependents

    • q-persona-drift-detector (shipped) — flags when active prompt drifts from intent.

    Work Log

    2026-04-28 — Implementation [task:e0f5a890-3398-4100-9288-c7581cf55a20]

    Files created / modified:

    • migrations/20260428_prompt_versions.sql: persona_prompt_versions + prompt_evolution_proposals tables with indexes, constraints, and comments.
    • scidex/senate/prompt_evolution.py — Full implementation:
    - PromptDelta dataclass matching spec schema
    - propose_prompt_delta(persona_id, miscalibration_pattern) — LLM-driven diff generation, stores as 'candidate'
    - get_active_prompt(persona_id) — DB-first, returns None on miss (callers fall back to file)
    - run_weekly_evolution_sweep(dry_run) — detection driver at Brier > median + 1σ, deduplicates via 7-day window
    - run_eval_and_promote(persona_id, candidate_version) — evaluates via prompt_eval.evaluate, applies 0.95× Brier + hallucination gates, fires senate proposal or retires
    - check_rollbacks() — compares current 14-day Brier vs brier_at_creation * 1.05
    - ratify_evolution_proposal(proposal_id) — hook for Senate vote path to flip status to 'active'
    • scidex/senate/prompt_eval.py — Evaluation harness:
    - evaluate(persona_id, candidate_version, eval_set, n=50) -> EvalResult
    - Loads candidate prompt from DB, samples preregistration_outcomes, runs LLM predictions, computes Brier/ECE, uses agent_hallucination_rate as hallucination proxy
    • scidex/agents/manifest.py: system_prompt property now tries get_active_prompt(self.slug) from the DB before falling back to self.bundle.body. Never edits SKILL.md.
    • api.py — Added routes:
    - GET /senate/prompt-evolution — index page listing all personas with version counts
    - GET /senate/prompt-evolution/{persona_id} — timeline + Chart.js calibration trend + proposal table
    - GET /api/senate/prompt-evolution/{persona_id}/versions — JSON API
    • tests/test_prompt_evolution.py — 7 tests, all passing:
    - test_propose_prompt_delta_creates_candidate
    - test_eval_and_promote_retires_poor_candidate
    - test_eval_and_promote_passes_good_candidate
    - test_check_rollbacks_reverts_worsened_active
    - test_get_active_prompt_returns_db_content
    - test_get_active_prompt_returns_none_on_miss
    - test_weekly_sweep_dry_run
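
The evaluation harness listed above reports ECE alongside Brier. As a reference for what that metric measures, here is a minimal sketch of expected calibration error with equal-width bins. The function name, the default of 10 bins, and the sample data are assumptions; the binning actually used in scidex/senate/prompt_eval.py is not described in this log.

```python
def expected_calibration_error(predicted, outcomes, n_bins=10):
    """Weighted mean of |average confidence - hit rate| over equal-width bins."""
    if len(predicted) != len(outcomes) or not predicted:
        raise ValueError("need equal-length, non-empty inputs")

    bins = [[] for _ in range(n_bins)]
    for p, o in zip(predicted, outcomes):
        # Clamp so that p == 1.0 lands in the last bin instead of overflowing.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    total = len(predicted)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        hit_rate = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(avg_conf - hit_rate)
    return ece


# Hypothetical batch matching the Goal's example: stated 0.85, observed 0.55.
ece_overconfident = expected_calibration_error([0.85] * 20, [1] * 11 + [0] * 9)
ece_perfect = expected_calibration_error([1.0, 0.0], [1, 0])
```

With a single occupied bin, ECE equals the raw calibration gap (about 0.30 here); with predictions spread across bins it weights each bin's gap by its share of the sample.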

    Acceptance criteria status:

    ☑ propose_prompt_delta returning PromptDelta with all required fields
    ☑ Detection driver with 1σ threshold, weekly sweep, dedup
    ☑ Migration with correct schema
    ☑ Evaluation harness evaluate(...) returning Brier/ECE/hallucination
    ☑ Promotion gate: brier_candidate < brier_active * 0.95 AND hallucination ≤ active
    ☑ Senate proposals on every promotion attempt (ratification needs ≥ 3 reviewers, or 2 reviewers + 7 days quiet)
    ☑ Rollback watchdog: 14-day Brier worsens > 5% → revert
    ☑ HTML view at /senate/prompt-evolution/{persona_id} with timeline + calibration chart
    ☑ Tests: all 4 key paths covered (generate, retire, promote, rollback)
    ☑ Constraint: SKILL.md never written by this system; only persona_prompt_versions.content
