SciDEX — Task: [Senate] Real-time judge interruption

Cheap haiku judge scores each round live; continue/warn/halt-replace/abort; PMID-fabrication check.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

Squash merge: orchestra/task/978b5bec-real-time-judge-interruption-halt-rounds (3 commits) (#729) 2026-04-27

Spec File

Effort: thorough

Goal

Today the LLM judge (scidex/senate/judge_arena.py) only weighs in after
all rounds complete. When a persona drifts off-topic, hallucinates
citations, or repeats a prior argument, the orchestrator wastes the
remaining round budget. Build a streaming judge interrupt that scores
each round as it lands and can issue one of: continue, warn_persona,
halt_and_replace_persona, abort_debate. Each verdict is persisted to
debate_round_judgments. The judge here is intentionally cheap (haiku-tier
model): the goal is fast guardrails, not deep adjudication.

Acceptance Criteria

☐ New module scidex/senate/round_judge.py:
judge_round(session_id, round_idx, content, prior_rounds) -> JudgeVerdict.
Verdict has decision ∈ {continue, warn, halt_replace, abort},
reasons: list[str], score: float, judge_model: str.

☐ Judge prompt at scidex/senate/prompts/round_judge_v1.md. Three
explicit failure modes, each with a worked example:
(1) off-topic: content does not address the debate question;
(2) fabricated citation: PMID does not match a real paper
(cross-checked via scidex/forge/pubmed_search.py);
(3) redundancy: content is a paraphrase of a prior round.

☐ Decision rules:

- continue: score ≥ 0.6, no failure mode triggered.
- warn: 0.4 ≤ score < 0.6 OR redundancy detected — the next
persona's prompt receives a JUDGE WARNING: preamble naming the
prior persona's issue.
- halt_replace: off-topic detected — current round content is
marked superseded=TRUE, and the orchestrator re-runs the round
with a different persona pulled from scidex.agents.select.
- abort: 2 consecutive halt_replace calls OR fabricated PMID
— debate session marked status='aborted'; downstream
consumers (markets, KG writes) gated on a non-aborted status.

☐ Persistence: debate_round_judgments(id BIGSERIAL PK,
session_id TEXT, round_idx INT, decision TEXT, reasons JSONB,
score REAL, judge_model TEXT, decided_at TIMESTAMPTZ). Migration
migrations/20260428_debate_round_judgments.sql.

☐ Schema-level: debate_rounds.superseded BOOLEAN DEFAULT FALSE
(added in same migration); superseded rounds are excluded from synthesis.

☐ Integration into scidex/agora/scidex_orchestrator.py round loop:
runs after each round write, before incrementing the round
counter. Co-exists with q-debate-dynamic-round-count's
RoundController: judge fires first, then controller decides
stop/continue.

☐ PMID verification cache: hits paper_corpus first
(scidex/forge/paper_corpus.py), falls back to live PubMed only
on a cache miss; cache TTL 7 days.

☐ API: GET /api/agora/debate/{id}/round_judgments returns the
per-round verdict trace; surfaced on /debate/{id} as a small
"Judge ran on round X — verdict: warn (off-topic)" banner.

☐ Tests tests/test_round_judge.py:

(a) on-topic + real PMID → continue.
(b) off-topic content → halt_replace.
(c) fabricated PMID → abort.
(d) paraphrase of prior round → warn (next round prompt carries
warning preamble).
(e) two consecutive halt_replace → abort.

☐ Shadow rollout: SCIDEX_ROUND_JUDGE=shadow runs the judge but
ignores its decision; flip to enforce after a week.

☐ Smoke: run on 5 recent debates in shadow; assert at least one
warn and one continue row land in debate_round_judgments.
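
The verdict shape and decision rules in the criteria above can be sketched roughly as follows. This is an illustrative sketch, not the shipped round_judge.py: the decide() helper, its flag parameters, and the reason strings are hypothetical names. The criteria do not say what happens when score < 0.4 and no failure mode fires, so the sketch assumes a warn in that case.

```python
from dataclasses import dataclass, field

# Thresholds taken from the decision rules in the acceptance criteria.
CONTINUE_MIN = 0.6
WARN_MIN = 0.4


@dataclass
class JudgeVerdict:
    decision: str  # one of: continue, warn, halt_replace, abort
    reasons: list[str] = field(default_factory=list)
    score: float = 0.0
    judge_model: str = ""


def decide(score, off_topic=False, fabricated_pmid=False, redundant=False,
           prior_consecutive_halts=0):
    """Hypothetical helper: map a score and failure flags to (decision, reasons)."""
    reasons = []
    if fabricated_pmid:
        reasons.append("fabricated_citation")
    if off_topic:
        reasons.append("off_topic")
    if redundant:
        reasons.append("redundancy")

    # Most severe outcome wins: abort > halt_replace > warn > continue.
    if fabricated_pmid:
        return "abort", reasons
    if off_topic:
        # A second consecutive halt_replace escalates to abort.
        if prior_consecutive_halts >= 1:
            return "abort", reasons + ["two_consecutive_halt_replace"]
        return "halt_replace", reasons
    if redundant:
        return "warn", reasons
    if score >= CONTINUE_MIN:
        return "continue", reasons
    # WARN_MIN <= score < CONTINUE_MIN is a warn per the rules.
    # ASSUMPTION: score < WARN_MIN with no failure mode is also a warn.
    return "warn", reasons + ["low_score"]


decision, reasons = decide(0.72)
verdict = JudgeVerdict(decision=decision, reasons=reasons, score=0.72,
                       judge_model="haiku")
```

Checking the failure flags before the score keeps a high-scoring but fabricated or off-topic round from slipping through as continue.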

Approach

Reuse the haiku-tier path in judge_arena.py:_BEDROCK_MODELS["haiku"]
for cost, since this judge runs many times per debate.

PMID verification: regex-pull all PMIDs out of the round content and
look each one up in the cache; the LLM judge then sees a verified: bool
per PMID in its prompt alongside the round body.
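
A minimal sketch of that extract-then-verify step, under stated assumptions. The regex, the dict-based cache, and the live_lookup callable are hypothetical stand-ins for paper_corpus.py and the live PubMed path, whose real interfaces are not shown in this spec. The 7-day TTL comes from the acceptance criteria.

```python
import re
import time

# ASSUMPTION: citations appear as "PMID: 12345" or "PMID 12345"; real content may vary.
PMID_RE = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)
CACHE_TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL from the acceptance criteria


def extract_pmids(content):
    """Return unique PMIDs in order of first appearance."""
    seen = []
    for pmid in PMID_RE.findall(content):
        if pmid not in seen:
            seen.append(pmid)
    return seen


def verify_pmids(content, cache, live_lookup, now=None):
    """Return {pmid: verified}, using the cache first and a live lookup on miss or expiry.

    cache: dict mapping pmid -> (verified: bool, checked_at: float); a stand-in
           for the paper_corpus cache.
    live_lookup: callable pmid -> bool; a stand-in for a live PubMed query.
    """
    now = time.time() if now is None else now
    result = {}
    for pmid in extract_pmids(content):
        entry = cache.get(pmid)
        if entry is not None and now - entry[1] < CACHE_TTL_SECONDS:
            result[pmid] = entry[0]          # fresh cache hit, no network call
            continue
        verified = bool(live_lookup(pmid))   # miss or stale entry: go live
        cache[pmid] = (verified, now)
        result[pmid] = verified
    return result
```

The resulting {pmid: verified} mapping is what would be rendered into the judge prompt next to the round body.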

Integration ordering matters: see section 5 of
scidex/agora/scidex_orchestrator.py:1843 for where the new hook
slots in.
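
The ordering (round write, then judge, then controller, then counter increment) might look roughly like the loop below. This is a simplified sketch, not the orchestrator's code: run_rounds and its callable parameters are hypothetical, and the judge is reduced to returning a decision string.

```python
def run_rounds(produce_round, judge, should_stop, max_rounds):
    """Hypothetical round loop: the judge fires first, then the controller decides."""
    rounds = []      # every written round, including superseded ones
    round_idx = 0
    while round_idx < max_rounds:
        content = produce_round(round_idx)
        record = {"round_idx": round_idx, "content": content, "superseded": False}
        rounds.append(record)                       # 1. round write

        decision = judge(round_idx, content, rounds[:-1])  # 2. judge fires first
        if decision == "abort":
            return "aborted", rounds
        if decision == "halt_replace":
            # Mark the round superseded and re-run the same index; the counter
            # is not advanced. The judge is expected to escalate a second
            # consecutive halt_replace to abort, which bounds this retry.
            record["superseded"] = True
            continue

        if should_stop(rounds):                     # 3. controller: stop/continue
            break
        round_idx += 1                              # 4. only now advance the counter
    return "completed", rounds
```

Running the judge before the controller means a superseded round never counts toward the controller's stop decision.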

Shadow mode is mandatory; aborting debates incorrectly is expensive.
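
One way to gate enforcement on the SCIDEX_ROUND_JUDGE variable is sketched below. The effective_decision helper is hypothetical, and treating any value other than enforce as shadow is an assumption rather than documented behaviour.

```python
import os


def effective_decision(verdict_decision, env=None):
    """Return the decision the orchestrator should act on.

    In shadow mode the verdict is still computed and persisted elsewhere,
    but the round loop behaves as if the judge said "continue".
    """
    env = os.environ if env is None else env
    # ASSUMPTION: an unset or unrecognised value falls back to shadow.
    mode = env.get("SCIDEX_ROUND_JUDGE", "shadow")
    if mode == "enforce":
        return verdict_decision
    return "continue"
```

Defaulting to shadow means a misconfigured deployment records verdicts without ever aborting a debate.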

Dependencies

scidex/senate/judge_arena.py — judge LLM transport.
scidex/forge/paper_corpus.py — PMID cache.
q-debate-dynamic-round-count — co-exists; this hook fires first.

Dependents

q-debate-replay-cross-topic — replay engine uses the judgment
trace as its first-pass quality filter.

Work Log

2026-04-27 13:45 PT — Slot 0

Staleness review: No round_judge.py, no debate_round_judgments migration,
no round_judge_v1.md prompt, so the task is NOT stale. Main HEAD: df5b33140.

Approach: Build module-first, then integrate. Use in-memory halt_count
(not DB) for consecutive halt tracking, which avoids an extra table dependency.
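
An in-memory consecutive-halt counter of the kind described could be as small as the sketch below. The HaltTracker class and its method names are hypothetical and are not taken from round_judge.py.

```python
from collections import defaultdict


class HaltTracker:
    """Hypothetical per-session streak counter for halt_replace decisions."""

    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, session_id, decision):
        """Update the streak and return the current consecutive halt count."""
        if decision == "halt_replace":
            self._counts[session_id] += 1
        else:
            # Any other decision breaks the streak.
            self._counts[session_id] = 0
        return self._counts[session_id]
```

The trade-off of keeping the count in memory is that a streak is lost if the orchestrator process restarts mid-debate.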

Files created:

- scidex/senate/round_judge.py (509 lines): JudgeVerdict + judge_round()
- scidex/senate/prompts/round_judge_v1.md (123 lines): 3 failure modes + worked examples
- migrations/20260428_debate_round_judgments.sql (62 lines): table + superseded col + halt_cache
- tests/test_round_judge.py (206 lines): 5 case tests + helpers

Integration: scidex_orchestrator.py calls judge_round() after each of
the 4 core persona rounds + synthesizer. abort → RuntimeError, which halts
the debate; halt_replace → UPDATE superseded=TRUE. RoundController is
called after the judge.

API: round_judgments added to /api/analyses/{id} response.
Tests: 10/10 pass (all 5 acceptance cases + dataclass/preamble/PMID unit tests).
Commits: 714a6f79c + 7e8c8707d on branch orchestra/task/978b5bec-real-time-judge-interruption-halt-rounds
Shadow mode: default (SCIDEX_ROUND_JUDGE=shadow), set to enforce after validation week.
Status: COMPLETE — committed and pushed