[Senate] Real-time judge interruption - halt rounds drifting off-topic
Status: done

A cheap haiku-tier judge scores each round live and returns continue, warn, halt_replace, or abort; it also checks for fabricated PMIDs.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

Squash merge: orchestra/task/978b5bec-real-time-judge-interruption-halt-rounds (3 commits) (#729), 2026-04-27

Spec File

Effort: thorough

Goal

Today the LLM judge (scidex/senate/judge_arena.py) only weighs in after
all rounds complete. When a persona drifts off-topic, hallucinates
citations, or repeats a prior argument, the orchestrator wastes the
remaining round budget. Build a streaming judge interrupt that scores
each round as it lands and can issue one of: continue, warn_persona,
halt_and_replace_persona, abort_debate (shortened to continue, warn,
halt_replace, abort in the criteria below). Verdicts are persisted in
debate_round_judgments. The judge here is intentionally cheap (haiku-tier
model): the goal is fast guardrails, not deep adjudication.

Acceptance Criteria

☐ New module scidex/senate/round_judge.py:
judge_round(session_id, round_idx, content, prior_rounds) -> JudgeVerdict.
Verdict has decision ∈ {continue, warn, halt_replace, abort},
reasons: list[str], score: float, judge_model: str.
☐ Judge prompt at scidex/senate/prompts/round_judge_v1.md. Three
explicit failure modes, each with a worked example:
(1) off-topic — content does not address the debate question;
(2) fabricated citation — PMID does not match a real paper
(cross-checked via scidex/forge/pubmed_search.py);
(3) redundancy — content is a paraphrase of a prior round.
☐ Decision rules:
- continue: score ≥ 0.6, no failure mode triggered.
- warn: 0.4 ≤ score < 0.6 OR redundancy detected — the next
persona's prompt receives a JUDGE WARNING: preamble naming the
prior persona's issue.
- halt_replace: off-topic detected — current round content is
marked superseded=TRUE, and the orchestrator re-runs the round
with a different persona pulled from scidex.agents.select.
- abort: 2 consecutive halt_replace calls OR fabricated PMID
— debate session marked status='aborted'; downstream
consumers (markets, KG writes) gated on a non-aborted status.
☐ Persistence: debate_round_judgments(id BIGSERIAL PK,
session_id TEXT, round_idx INT, decision TEXT, reasons JSONB,
score REAL, judge_model TEXT, decided_at TIMESTAMPTZ). Migration:
migrations/20260428_debate_round_judgments.sql.
☐ Schema-level: debate_rounds.superseded BOOLEAN DEFAULT FALSE
(added in same migration); superseded rounds are excluded from synthesis.
☐ Integration into scidex/agora/scidex_orchestrator.py round loop —
runs after each round write, before incrementing the round
counter. Co-exists with q-debate-dynamic-round-count's
RoundController: judge fires first, then controller decides
stop/continue.
☐ PMID verification cache: hits paper_corpus first
(scidex/forge/paper_corpus.py), falls back to live PubMed only
when cache miss; cache TTL 7 days.
☐ API: GET /api/agora/debate/{id}/round_judgments returns the
per-round verdict trace; surfaced on /debate/{id} as a small
"Judge ran on round X — verdict: warn (off-topic)" banner.
☐ Tests tests/test_round_judge.py:
(a) on-topic + real PMID → continue.
(b) off-topic content → halt_replace.
(c) fabricated PMID → abort.
(d) paraphrase of prior round → warn (next round prompt carries
warning preamble).
(e) two consecutive halt_replace → abort.
☐ Shadow rollout: SCIDEX_ROUND_JUDGE=shadow runs the judge but
ignores its decision; flip to enforce after a week.
☐ Smoke: run on 5 recent debates in shadow; assert at least one
warn and one continue row land in debate_round_judgments.
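
The decision rules above can be sketched as a pure function. This is a minimal illustration, not the contents of round_judge.py: the decide() and warning_preamble() helpers, the boolean flags, and the prior_halts counter are hypothetical names. The precedence (abort over halt_replace over warn over continue) is an assumption, and so is treating a score below 0.4 with no failure mode as warn, a case the rules leave unspecified.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeVerdict:
    decision: str  # one of: continue, warn, halt_replace, abort
    reasons: list[str] = field(default_factory=list)
    score: float = 0.0
    judge_model: str = ""


def decide(score, *, off_topic=False, fabricated_pmid=False, redundant=False,
           prior_halts=0, judge_model="haiku"):
    """Map a score plus failure-mode flags to a verdict (hypothetical helper).

    prior_halts is the number of consecutive halt_replace verdicts that
    immediately precede this round.
    """
    reasons = []
    if fabricated_pmid:
        reasons.append("fabricated_citation")
    if off_topic:
        reasons.append("off_topic")
    if redundant:
        reasons.append("redundancy")

    if fabricated_pmid:
        decision = "abort"
    elif off_topic:
        # A second consecutive halt_replace escalates to abort.
        if prior_halts >= 1:
            decision = "abort"
            reasons.append("consecutive_halt_replace")
        else:
            decision = "halt_replace"
    elif redundant:
        decision = "warn"
    elif score >= 0.6:
        decision = "continue"
    else:
        # 0.4 <= score < 0.6 is warn per the rules; score < 0.4 with no
        # failure mode is not covered by them and is ASSUMED to warn too.
        decision = "warn"
        reasons.append("low_score")
    return JudgeVerdict(decision, reasons, score, judge_model)


def warning_preamble(persona, verdict):
    """Build the preamble prepended to the next persona's prompt (assumed wording)."""
    return f"JUDGE WARNING: {persona} was flagged for: " + ", ".join(verdict.reasons)
```

For example, decide(0.8, redundant=True) yields a warn verdict, and passing it to warning_preamble("skeptic", ...) produces a line starting with JUDGE WARNING: that names the redundancy. The persona name here is only a placeholder.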

Approach

  • Reuse the haiku-tier path in judge_arena.py:_BEDROCK_MODELS["haiku"]
    for cost: this judge runs many times per debate.
  • PMID verification: regex-extract all PMIDs from the round content and
    look each one up in the cache; the LLM judge sees verified: bool per
    PMID in its prompt alongside the round body.
  • Integration ordering matters: see section 5 of
    scidex/agora/scidex_orchestrator.py:1843 for where the new hook
    slots in.
  • Shadow mode is mandatory; aborting debates incorrectly is expensive.
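
The PMID verification step can be sketched as follows. This is a rough illustration under stated assumptions: the regex only covers the "PMID 123" and "PMID: 123" spellings, an in-memory dict stands in for the paper_corpus cache, and live_lookup is a hypothetical injected callable standing in for the PubMed fallback. None of these names come from scidex/forge.

```python
import re
import time

# Assumed citation spellings: "PMID 123", "PMID: 123", "PMID:123".
PMID_RE = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)
CACHE_TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL from the acceptance criteria


def extract_pmids(text):
    """Return the distinct PMIDs in text, in order of first appearance."""
    return list(dict.fromkeys(PMID_RE.findall(text)))


class PmidVerifier:
    """Cache-first PMID check; the dict is a stand-in for paper_corpus."""

    def __init__(self, live_lookup, clock=time.time):
        self._live_lookup = live_lookup  # hypothetical: pmid -> bool
        self._clock = clock
        self._cache = {}  # pmid -> (verified, checked_at)

    def verify(self, pmid):
        now = self._clock()
        hit = self._cache.get(pmid)
        if hit is not None and now - hit[1] < CACHE_TTL_SECONDS:
            return hit[0]
        # Cache miss or expired entry: fall back to the live lookup.
        verified = bool(self._live_lookup(pmid))
        self._cache[pmid] = (verified, now)
        return verified

    def annotate(self, content):
        """Map each PMID in the round content to its verified flag."""
        return {pmid: self.verify(pmid) for pmid in extract_pmids(content)}
```

The dict returned by annotate() is the kind of per-PMID verified flag that could be rendered into the judge prompt next to the round body. Injecting the clock keeps the TTL behaviour testable without waiting a week.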

Dependencies

  • scidex/senate/judge_arena.py: judge LLM transport.
  • scidex/forge/paper_corpus.py: PMID cache.
  • q-debate-dynamic-round-count: co-exists; this hook fires first.

Dependents

  • q-debate-replay-cross-topic: replay engine uses the judgment
    trace as its first-pass quality filter.

Work Log

2026-04-27 13:45 PT - Slot 0

  • Staleness review: no round_judge.py, no debate_round_judgments migration,
    and no round_judge_v1.md prompt, so the task is NOT stale. Main HEAD: df5b33140.
  • Approach: build module-first, then integrate. Use an in-memory halt_count
    (not DB) for consecutive halt tracking; this avoids an extra table dependency.
  • Files created:
    - scidex/senate/round_judge.py (509 lines): JudgeVerdict + judge_round()
    - scidex/senate/prompts/round_judge_v1.md (123 lines): 3 failure modes + worked examples
    - migrations/20260428_debate_round_judgments.sql (62 lines): table + superseded col + halt_cache
    - tests/test_round_judge.py (206 lines): 5 case tests + helpers
  • Integration: scidex_orchestrator.py calls judge_round() after each of
    the 4 core persona rounds + synthesizer. abort → RuntimeError halts the debate;
    halt_replace → UPDATE superseded=TRUE. RoundController is called after the judge.
  • API: round_judgments added to the /api/analyses/{id} response.
  • Tests: 10/10 pass (all 5 acceptance cases + dataclass/preamble/PMID unit tests).
  • Commits: 714a6f79c + 7e8c8707d on branch orchestra/task/978b5bec-real-time-judge-interruption-halt-rounds
  • Shadow mode: default (SCIDEX_ROUND_JUDGE=shadow); set to enforce after the validation week.
  • Status: COMPLETE, committed and pushed
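
The integration and shadow-mode behaviour described in this log can be approximated by the sketch below. It is an assumption-laden illustration, not the orchestrator code: LoopState, judge_mode(), apply_verdict(), and the returned action strings are hypothetical, and falling back to shadow for unrecognised values of SCIDEX_ROUND_JUDGE is a guess. Only the env var name, the in-memory halt counter, and the abort-raises-RuntimeError behaviour come from the notes above.

```python
import os
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """In-memory state for one debate (hypothetical container)."""
    halt_count: int = 0  # consecutive halt_replace verdicts
    superseded_rounds: list[int] = field(default_factory=list)
    trace: list[tuple[int, str]] = field(default_factory=list)


def judge_mode(env=None):
    """Read SCIDEX_ROUND_JUDGE; unknown values are ASSUMED to mean shadow."""
    env = os.environ if env is None else env
    mode = env.get("SCIDEX_ROUND_JUDGE", "shadow").lower()
    return mode if mode in {"shadow", "enforce"} else "shadow"


def apply_verdict(state, round_idx, decision, mode):
    """Record the verdict, then act on it only when mode is enforce."""
    state.trace.append((round_idx, decision))
    if decision == "halt_replace":
        state.halt_count += 1
    else:
        state.halt_count = 0

    if mode != "enforce":
        return "proceed"  # shadow: verdict is logged but ignored

    if decision == "abort" or state.halt_count >= 2:
        raise RuntimeError(f"debate aborted by round judge at round {round_idx}")
    if decision == "halt_replace":
        # Stands in for UPDATE debate_rounds SET superseded=TRUE.
        state.superseded_rounds.append(round_idx)
        return "rerun_with_new_persona"
    return "proceed"
```

In this sketch the verdict trace is written in both modes, so a shadow week still produces rows to inspect before flipping to enforce.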
