SciDEX — Task: [Senate] Evolving prompt suggestions

Detect chronically miscalibrated personas, propose prompt deltas, evaluate them on the preregistration holdout, and promote or roll back via Senate proposals.

Completion Notes

Auto-release: work already on origin/main

Git Commits (3)

Squash merge: orchestra/task/e0f5a890-evolving-prompt-suggestions-learn-from-m (2 commits) (#764) 2026-04-27

[Senate] Remove tests for deleted persona_pages.py routes [task:e0f5a890-3398-4100-9288-c7581cf55a20] 2026-04-27

[Senate] Evolving prompt suggestions — learn from miscalibrated outputs [task:e0f5a890-3398-4100-9288-c7581cf55a20] 2026-04-27

Spec File

Goal

The persona prompts (personas/skeptic/SKILL.md, personas/theorist/SKILL.md, etc.) are static. When the calibration tracker (scidex/senate/calibration.py) shows that the Theorist is chronically over-confident on neuroinflammation claims (predicted 0.85, observed 0.55), the system has no way to adjust the prompt unless a human rewrites it. Build an evolving-prompt mechanism that mines miscalibrated outputs, proposes prompt deltas (e.g. "add explicit confidence-rubric: only state 'highly likely' when ≥ 3 independent supporting PMIDs"), tests the proposed prompt against a held-out evaluation set, and auto-promotes the winner.

Effort: thorough

Acceptance Criteria

☐ scidex/senate/prompt_evolution.py::propose_prompt_delta(persona_id, miscalibration_pattern) -> PromptDelta where

PromptDelta = {persona_id, version_from, version_to, change_summary, full_diff, motivating_calibration_data, generated_at}

☐ Detection driver: weekly job runs scidex/senate/calibration.py::compute_brier(persona_id, window_days=30) for each persona; if the persona's Brier exceeds the system median + 1 σ, queue a prompt-evolution candidate.

☐ Migration migrations/20260428_prompt_versions.sql:

persona_prompt_versions(persona_id, version, content TEXT, parent_version, motivation TEXT, brier_at_creation REAL, status ENUM('candidate','testing','active','retired'), created_at)

The "active" version is the one the Senate routes calls to.

☐ Evaluation harness:

scidex/senate/prompt_eval.py::evaluate(persona_id, candidate_version, eval_set='preregistration_outcomes', n=50) -> EvalResult

runs the candidate prompt over 50 historical (claim, predicted_probability, observed_outcome) tuples from preregistration_outcomes and returns the Brier score, the ECE, and the hallucination rate reported by q-qual-hallucination-detector.

☐ Promotion rule: a candidate prompt is promoted to active only if brier_candidate < brier_active * 0.95 (more than a 5 % relative Brier improvement) AND hallucination_rate_candidate ≤ hallucination_rate_active. Otherwise, the candidate is retired and the reason is logged.

☐ Senate proposal: every promotion fires a prompt_evolution_proposal with the diff + motivating calibration data + eval result; needs ≥ 3 reviewers (or 2 reviewers + 7 days quiet) to ratify before going live.

☐ Rollback: if the promoted version's 14-day Brier worsens by > 5 %, automatically roll back to the previous active version.

☐ HTML view /senate/prompt-evolution/{persona_id} shows version-history timeline + each version's Brier/hallucination over time.

☐ Tests tests/test_prompt_evolution.py: miscalibrated synthetic data → candidate generated; candidate that fails evaluation → retired; candidate that passes → promoted with proposal record; rollback path works.

☐ Constraint: never auto-edit the persona's SKILL.md directly — write to persona_prompt_versions.content and have the runtime read from there with file fallback.
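
The detection criterion in the list above (a persona's Brier exceeding the system median + 1 σ) can be sketched as follows. This is a minimal illustration, not the shipped driver: flag_miscalibrated and its input dictionary are hypothetical names, and the use of the population standard deviation is an assumption, since the spec does not say which estimator applies.

```python
import statistics


def flag_miscalibrated(brier_by_persona: dict[str, float]) -> list[str]:
    """Return persona ids whose Brier score exceeds median + 1 sigma.

    Hypothetical helper: the input maps persona_id -> 30-day Brier score,
    as would be collected from compute_brier(persona_id, window_days=30).
    """
    scores = list(brier_by_persona.values())
    if len(scores) < 2:
        # A spread cannot be estimated from fewer than two personas.
        return []
    # Assumption: population standard deviation over all personas.
    threshold = statistics.median(scores) + statistics.pstdev(scores)
    return sorted(pid for pid, score in brier_by_persona.items() if score > threshold)
```

Because the threshold is relative to the current population, a persona is only queued when it is noticeably worse than its peers, not merely when its absolute Brier is high.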

Approach

Move the persona prompt loading from "read from disk" to "read from persona_prompt_versions table, fallback to disk on miss" in scidex/agents/manifest.py.
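
A minimal sketch of the DB-first lookup with disk fallback described in this step. It assumes a sqlite3 connection purely for illustration (the project's actual database layer is not shown in the spec), and load_system_prompt and the explicit conn argument are hypothetical; the shipped get_active_prompt takes only persona_id.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def get_active_prompt(conn: sqlite3.Connection, persona_id: str) -> Optional[str]:
    """Return the active prompt content for a persona, or None on a miss."""
    row = conn.execute(
        "SELECT content FROM persona_prompt_versions "
        "WHERE persona_id = ? AND status = 'active' "
        "ORDER BY version DESC LIMIT 1",
        (persona_id,),
    ).fetchone()
    return row[0] if row else None


def load_system_prompt(conn: sqlite3.Connection, persona_id: str, skill_path: Path) -> str:
    """Prefer the DB-stored active version; fall back to SKILL.md on disk."""
    prompt = get_active_prompt(conn, persona_id)
    if prompt is not None:
        return prompt
    # The file is only ever read here, never written.
    return skill_path.read_text(encoding="utf-8")
```

Keeping the fallback in the caller means a persona with no rows in persona_prompt_versions behaves exactly as it did before the migration.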

Implement the candidate generator: prompt the model with the persona's current prompt + miscalibration data, ask for a minimal diff that addresses the miscalibration. Strict JSON output.
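
One way to enforce the strict JSON output is to reject any model reply that is not exactly the expected object. The sketch below is illustrative only: parse_delta_response is a hypothetical helper, and the assumption that the model returns just change_summary and full_diff (with the remaining PromptDelta fields filled in by the caller) is not stated in the spec.

```python
import json

# Assumption: the model is asked for exactly these two PromptDelta fields.
REQUIRED_KEYS = {"change_summary", "full_diff"}


def parse_delta_response(raw: str) -> dict:
    """Parse a model reply, raising ValueError unless it is the exact JSON object expected."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("model reply must be a JSON object")
    missing = REQUIRED_KEYS - set(payload)
    extra = set(payload) - REQUIRED_KEYS
    if missing or extra:
        raise ValueError(f"unexpected keys: missing={sorted(missing)}, extra={sorted(extra)}")
    for key in sorted(REQUIRED_KEYS):
        value = payload[key]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{key} must be a non-empty string")
    return payload
```

Failing loudly on extra keys or surrounding prose keeps malformed candidates out of persona_prompt_versions rather than storing a partially parsed diff.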

Implement the eval harness, using preregistration_outcomes as the held-out set: each row carries a predicted probability and a ground-truth outcome.
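
The two calibration metrics the harness reports can be computed from (predicted probability, observed outcome) pairs as sketched below. This uses the standard definitions of the Brier score and of expected calibration error with equal-width bins; the function names and the choice of 10 bins are assumptions, not details taken from prompt_eval.py.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average confidence and observed frequency per bin."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(o for _, o in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_conf - observed)
    return ece
```

For the over-confidence pattern from the Goal section (20 claims all predicted at 0.85, of which 11 come true, so 0.55 observed), this gives a Brier score of about 0.3375 and an ECE of about 0.30.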

Implement the proposal + ratification path via scidex/senate/governance.py.
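
The ratification threshold from the acceptance criteria (≥ 3 reviewers, or 2 reviewers + 7 days quiet) could be checked with a predicate like the one below. It is a sketch under assumptions: is_ratified is a hypothetical name, not part of governance.py as described, and "quiet" is interpreted here as no activity on the proposal for 7 days, which the spec does not define.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "quiet" means no proposal activity for this long.
QUIET_PERIOD = timedelta(days=7)


def is_ratified(approvals: int, last_activity: datetime, now: datetime) -> bool:
    """True once a proposal has 3 approvals, or 2 approvals plus 7 quiet days."""
    if approvals >= 3:
        return True
    return approvals >= 2 and (now - last_activity) >= QUIET_PERIOD


if __name__ == "__main__":
    now = datetime(2026, 5, 10, tzinfo=timezone.utc)
    # Two approvals, last activity eight days ago: ratified under the quiet rule.
    print(is_ratified(2, now - timedelta(days=8), now))
```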

Implement the rollback watchdog as a daily check in the calibration driver.
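
The watchdog's core comparison, as the work log describes it, sets the current 14-day Brier against brier_at_creation * 1.05. A minimal sketch follows; needs_rollback is a hypothetical helper, and returning False when either score is missing is an assumption about how sparse data should be handled.

```python
from typing import Optional


def needs_rollback(
    brier_at_creation: Optional[float],
    current_14d_brier: Optional[float],
    tolerance: float = 0.05,
) -> bool:
    """True when the active version's 14-day Brier is more than 5 % worse than at creation."""
    if brier_at_creation is None or current_14d_brier is None:
        # Assumption: without both scores there is no evidence to act on.
        return False
    # Higher Brier is worse, so "worsens by > 5 %" is a strict upper bound check.
    return current_14d_brier > brier_at_creation * (1.0 + tolerance)
```

For example, a version created at Brier 0.20 would be rolled back at a 14-day Brier of 0.22 but kept at 0.205.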

Dependencies

scidex/senate/calibration.py (shipped) — miscalibration source.
q-qual-hallucination-detector — hallucination rate signal in evaluation.
personas/*/SKILL.md — current prompts; runtime path moves to DB.

Dependents

q-persona-drift-detector (shipped) — flags when active prompt drifts from intent.

Work Log

2026-04-28 — Implementation [task:e0f5a890-3398-4100-9288-c7581cf55a20]

Files created / modified:

migrations/20260428_prompt_versions.sql — persona_prompt_versions + prompt_evolution_proposals tables with indexes, constraints, and comments.
scidex/senate/prompt_evolution.py — Full implementation:

- PromptDelta dataclass matching spec schema
- propose_prompt_delta(persona_id, miscalibration_pattern) — LLM-driven diff generation, stores as 'candidate'
- get_active_prompt(persona_id) — DB-first, returns None on miss (callers fall back to file)
- run_weekly_evolution_sweep(dry_run) — detection driver at Brier > median + 1σ, deduplicates via 7-day window
- run_eval_and_promote(persona_id, candidate_version) — evaluates via prompt_eval.evaluate, applies 0.95× Brier + hallucination gates, fires senate proposal or retires
- check_rollbacks() — compares current 14-day Brier vs brier_at_creation * 1.05
- ratify_evolution_proposal(proposal_id) — hook for Senate vote path to flip status to 'active'

scidex/senate/prompt_eval.py — Evaluation harness:

- evaluate(persona_id, candidate_version, eval_set, n=50) -> EvalResult
- Loads candidate prompt from DB, samples preregistration_outcomes, runs LLM predictions, computes Brier/ECE, uses agent_hallucination_rate as hallucination proxy

scidex/agents/manifest.py — system_prompt property updated: tries get_active_prompt(self.slug) from DB before falling back to self.bundle.body. Never edits SKILL.md.
api.py — Added routes:

- GET /senate/prompt-evolution — index page listing all personas with version counts
- GET /senate/prompt-evolution/{persona_id} — timeline + Chart.js calibration trend + proposal table
- GET /api/senate/prompt-evolution/{persona_id}/versions — JSON API

tests/test_prompt_evolution.py — 7 tests, all passing:

- test_propose_prompt_delta_creates_candidate ✓
- test_eval_and_promote_retires_poor_candidate ✓
- test_eval_and_promote_passes_good_candidate ✓
- test_check_rollbacks_reverts_worsened_active ✓
- test_get_active_prompt_returns_db_content ✓
- test_get_active_prompt_returns_none_on_miss ✓
- test_weekly_sweep_dry_run ✓
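
The retire and promote paths exercised by these tests both turn on the promotion gate from the acceptance criteria. A minimal sketch of that gate is shown here; EvalSummary and should_promote are hypothetical names, since the real EvalResult fields and the internals of run_eval_and_promote are not shown in this log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Hypothetical stand-in for the parts of EvalResult the gate needs."""
    brier: float
    hallucination_rate: float


def should_promote(candidate: EvalSummary, active: EvalSummary) -> tuple[bool, str]:
    """Apply both gates and return (decision, reason) so a retirement can be logged."""
    # Gate 1: candidate Brier must be strictly below 95 % of the active Brier.
    if not candidate.brier < active.brier * 0.95:
        return False, "brier_not_improved_by_5_percent"
    # Gate 2: hallucination rate must not regress.
    if candidate.hallucination_rate > active.hallucination_rate:
        return False, "hallucination_rate_regressed"
    return True, "passed"
```

Returning the reason alongside the decision gives the retire path something concrete to store when a candidate fails.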

Acceptance criteria status:

☑ propose_prompt_delta returning PromptDelta with all required fields

☑ Detection driver with 1σ threshold, weekly sweep, dedup

☑ Migration with correct schema

☑ Evaluation harness evaluate(...) returning Brier/ECE/hallucination

☑ Promotion gate: brier_candidate < brier_active * 0.95 AND hallucination ≤ active

☑ Senate proposals on every promotion attempt (ratified by ≥ 3 reviewers, or 2 reviewers + 7 days quiet)

☑ Rollback watchdog: 14-day Brier worsens > 5% → revert

☑ HTML view at /senate/prompt-evolution/{persona_id} with timeline + calibration chart

☑ Tests: all 4 key paths covered (generate, retire, promote, rollback)

☑ Constraint: SKILL.md never written by this system; only persona_prompt_versions.content