The persona prompts (personas/skeptic/SKILL.md,
personas/theorist/SKILL.md, etc.) are static. When the calibration
tracker (scidex/senate/calibration.py) shows that the Theorist is
chronically over-confident on neuroinflammation claims (predicted
0.85, observed 0.55), the system has no way to adjust the prompt
unless a human rewrites it. Build an evolving-prompt mechanism that
mines miscalibrated outputs, proposes prompt deltas (e.g. "add
explicit confidence-rubric: only state 'highly likely' when ≥ 3
independent supporting PMIDs"), tests the proposed prompt against a
held-out evaluation set, and auto-promotes the winner.
Effort: thorough
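
To make the miscalibration example above concrete, the following minimal sketch computes a Brier score for a persona that states 0.85 on claims that hold up 55% of the time, and compares it with the score it would get by stating 0.55. The helper name brier_score and the 20-claim sample are illustrative assumptions; this is not the implementation in scidex/senate/calibration.py.

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], observed: Sequence[int]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("predicted and observed must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical sample: 20 claims, of which 11 (55%) turned out true.
outcomes = [1] * 11 + [0] * 9

# Persona states 0.85 on every claim (the over-confident case from the text).
overconfident = brier_score([0.85] * 20, outcomes)

# Same outcomes, but the stated probability matches the observed rate.
calibrated = brier_score([0.55] * 20, outcomes)

print(round(overconfident, 4))  # 0.3375
print(round(calibrated, 4))     # 0.2475
```

Lower is better: on this sample, closing the 0.85 vs 0.55 gap would cut the Brier score by roughly a quarter, which is the kind of improvement a prompt delta is meant to produce.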
- scidex/senate/prompt_evolution.py::propose_prompt_delta(persona_id, miscalibration_pattern) -> PromptDelta where PromptDelta = {persona_id, version_from, version_to, change_summary, full_diff, motivating_calibration_data, generated_at}.
- Run scidex/senate/calibration.py::compute_brier(persona_id, window_days=30) for each persona; if the persona's Brier exceeds the system median + 1 σ, queue a prompt-evolution candidate.
- migrations/20260428_prompt_versions.sql: persona_prompt_versions(persona_id, version, content TEXT, parent_version, motivation TEXT, brier_at_creation REAL, status ENUM('candidate','testing','active','retired'), created_at). The "active" version is what the Senate routes calls to.
- scidex/senate/prompt_eval.py::evaluate(persona_id, candidate_version, eval_set='preregistration_outcomes', n=50) -> EvalResult runs the candidate prompt over 50 historical (claim, predicted_probability, observed_outcome) tuples from preregistration_outcomes and returns Brier, ECE, and the hallucination rate from q-qual-hallucination-detector.
- Promote a candidate to active only if brier_candidate < brier_active * 0.95 AND hallucination_rate_candidate ≤ hallucination_rate_active. Otherwise, the candidate is retired with the reason logged.
- Promotion opens a Senate prompt_evolution_proposal with the diff + motivating calibration data + eval result; it needs ≥ 3 reviewers (or 2 reviewers + 7 days quiet) to ratify before going live.
- /senate/prompt-evolution/{persona_id} shows a version-history timeline + each version's Brier/hallucination over time.
- tests/test_prompt_evolution.py: miscalibrated synthetic data → candidate generated; candidate that fails evaluation → retired; candidate that passes → promoted with proposal record; rollback path works.

- Never edit SKILL.md directly: write to persona_prompt_versions.content and have the runtime read from there with file fallback.
- In scidex/agents/manifest.py, read from the persona_prompt_versions table, falling back to disk on miss.
- preregistration_outcomes is the natural held-out set since each row has a predicted probability + ground-truth outcome.
- scidex/senate/governance.py.

- scidex/senate/calibration.py (shipped): miscalibration source.
- q-qual-hallucination-detector: hallucination rate signal in evaluation.
- personas/*/SKILL.md: current prompts; runtime path moves to DB.
- q-persona-drift-detector (shipped): flags when the active prompt drifts from intent.

Files created / modified:
- migrations/20260428_prompt_versions.sql: persona_prompt_versions + prompt_evolution_proposals tables with indexes, constraints, and comments.
- scidex/senate/prompt_evolution.py: Full implementation:
  - PromptDelta dataclass matching spec schema
  - propose_prompt_delta(persona_id, miscalibration_pattern): LLM-driven diff generation, stores as 'candidate'
  - get_active_prompt(persona_id): DB-first, returns None on miss (callers fall back to file)
  - run_weekly_evolution_sweep(dry_run): detection driver at Brier > median + 1σ, deduplicates via 7-day window
  - run_eval_and_promote(persona_id, candidate_version): evaluates via prompt_eval.evaluate, applies 0.95× Brier + hallucination gates, fires senate proposal or retires
  - check_rollbacks(): compares current 14-day Brier vs brier_at_creation * 1.05
  - ratify_evolution_proposal(proposal_id): hook for Senate vote path to flip status to 'active'
- scidex/senate/prompt_eval.py: Evaluation harness:
  - evaluate(persona_id, candidate_version, eval_set, n=50) -> EvalResult
  - Draws from preregistration_outcomes, runs LLM predictions, computes Brier/ECE, uses agent_hallucination_rate as hallucination proxy
- scidex/agents/manifest.py: system_prompt property updated: tries get_active_prompt(self.slug) from DB before falling back to self.bundle.body. Never edits SKILL.md.
- api.py: Added routes:
  - GET /senate/prompt-evolution: index page listing all personas with version counts
  - GET /senate/prompt-evolution/{persona_id}: timeline + Chart.js calibration trend + proposal table
  - GET /api/senate/prompt-evolution/{persona_id}/versions: JSON API
- tests/test_prompt_evolution.py: 7 tests, all passing:
  - test_propose_prompt_delta_creates_candidate ✓
  - test_eval_and_promote_retires_poor_candidate ✓
  - test_eval_and_promote_passes_good_candidate ✓
  - test_check_rollbacks_reverts_worsened_active ✓
  - test_get_active_prompt_returns_db_content ✓
  - test_get_active_prompt_returns_none_on_miss ✓
  - test_weekly_sweep_dry_run ✓

Acceptance criteria status:
- propose_prompt_delta returning PromptDelta with all required fields
- evaluate(...) returning Brier/ECE/hallucination
- /senate/prompt-evolution/{persona_id} with timeline + calibration chart
- persona_prompt_versions.content
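
The three numeric thresholds described in this document (detection at median + 1 σ, promotion at 0.95× Brier with a hallucination gate, rollback at 1.05× brier_at_creation) can be sketched as pure functions. All function names, the sample scores, and the persona names other than theorist and skeptic are hypothetical; using the population standard deviation for σ and treating a 14-day Brier above the 1.05× mark as the rollback trigger are assumptions, since the text does not pin either down.

```python
from statistics import median, pstdev
from typing import List, Mapping


def personas_to_queue(brier_by_persona: Mapping[str, float]) -> List[str]:
    """Personas whose Brier exceeds the system median + 1 sigma.

    Assumption: sigma is the population standard deviation across personas.
    """
    scores = list(brier_by_persona.values())
    if len(scores) < 2:
        return []  # no meaningful spread with fewer than two personas
    threshold = median(scores) + pstdev(scores)
    return sorted(p for p, b in brier_by_persona.items() if b > threshold)


def passes_promotion_gate(
    brier_candidate: float,
    brier_active: float,
    hallucination_rate_candidate: float,
    hallucination_rate_active: float,
) -> bool:
    """Candidate must beat the active Brier by more than 5% and must not
    hallucinate more often than the active prompt."""
    return (
        brier_candidate < brier_active * 0.95
        and hallucination_rate_candidate <= hallucination_rate_active
    )


def needs_rollback(current_brier_14d: float, brier_at_creation: float) -> bool:
    """Assumption: roll back when the 14-day Brier is more than 5% worse
    than the Brier recorded when the version was created."""
    return current_brier_14d > brier_at_creation * 1.05


# Hypothetical 30-day Brier scores per persona.
scores = {"theorist": 0.34, "skeptic": 0.22, "synthesizer": 0.21, "domain_expert": 0.23}
print(personas_to_queue(scores))                       # ['theorist']
print(passes_promotion_gate(0.25, 0.34, 0.04, 0.05))   # True
print(passes_promotion_gate(0.33, 0.34, 0.04, 0.05))   # False
print(needs_rollback(0.30, 0.25))                      # True
```

Keeping the thresholds in side-effect-free functions like these would let tests exercise the gate logic without a database or an LLM call; the real run_weekly_evolution_sweep, run_eval_and_promote, and check_rollbacks may structure this differently.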