Effort: thorough
Today the LLM judge (scidex/senate/judge_arena.py) only weighs in after
all rounds complete. When a persona drifts off-topic, hallucinates
citations, or repeats a prior argument, the orchestrator wastes the
remaining round budget. Build a streaming judge interrupt that scores
each round as it lands and can issue one of: continue, warn_persona,
halt_and_replace_persona, abort_debate. Verdicts are persisted in
debate_round_judgments. The judge here is intentionally cheap (haiku-tier
model): the goal is fast guardrails, not deep adjudication.
scidex/senate/round_judge.py:judge_round(session_id, round_idx, content, prior_rounds) -> JudgeVerdict.
The verdict has decision ∈ {continue, warn, halt_replace, abort}, reasons: list[str], score: float, judge_model: str.
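The verdict shape described above could be modelled roughly as follows. This is a minimal sketch, not the contents of round_judge.py: the `Decision` enum, the frozen dataclass, and the assumption that `score` lies in [0, 1] (suggested by the 0.4 and 0.6 thresholds) are illustrative choices.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Decision(str, Enum):
    """The four decision values named in the spec."""
    CONTINUE = "continue"
    WARN = "warn"
    HALT_REPLACE = "halt_replace"
    ABORT = "abort"


@dataclass(frozen=True)
class JudgeVerdict:
    decision: Decision
    reasons: List[str]
    score: float
    judge_model: str

    def __post_init__(self) -> None:
        # Assumption: scores are normalised to [0, 1]; the spec only
        # gives the 0.4 and 0.6 thresholds, not the full range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")


# Hypothetical example values, for illustration only.
verdict = JudgeVerdict(
    decision=Decision.WARN,
    reasons=["repeats an argument from a prior round"],
    score=0.52,
    judge_model="haiku",
)
```

Making the enum a `str` subclass keeps the value directly comparable with the plain strings stored in debate_round_judgments, should the column be text.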
scidex/senate/prompts/round_judge_v1.md. Three [...] scidex/forge/pubmed_search.py);
continue: score ≥ 0.6, no failure mode triggered.
warn: 0.4 ≤ score < 0.6 OR redundancy detected; the next [...] JUDGE WARNING: preamble naming the [...]
halt_replace: off-topic detected; current round content is [...] superseded=TRUE, and the orchestrator re-runs the round [...] scidex.agents.select.
abort: 2 consecutive halt_replace calls OR fabricated PMID [...] status='aborted'; downstream [...]
debate_round_judgments(id BIGSERIAL PK, [...] migrations/20260428_debate_round_judgments.sql.
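The decision rules listed above can be sketched as a small pure function. The function name, the flag parameters, and the rule precedence (abort, then halt_replace, then warn, then continue) are assumptions. The spec also does not say what happens when the score falls below 0.4 with no failure mode flagged; this sketch treats that case as warn, which is a guess.

```python
def decide(
    score: float,
    *,
    off_topic: bool = False,
    redundant: bool = False,
    fabricated_pmid: bool = False,
    prior_consecutive_halts: int = 0,
) -> str:
    """Map a round score and failure-mode flags to a decision string."""
    # abort: a fabricated PMID is terminal on its own.
    if fabricated_pmid:
        return "abort"
    # halt_replace: off-topic content. Assumption: if the previous round
    # was already halted, this second consecutive halt escalates to abort.
    if off_topic:
        return "abort" if prior_consecutive_halts >= 1 else "halt_replace"
    # warn: mid-band score or redundancy.
    if redundant or 0.4 <= score < 0.6:
        return "warn"
    # continue: score >= 0.6 and no failure mode triggered.
    if score >= 0.6:
        return "continue"
    # score < 0.4 with nothing flagged is unspecified in the source;
    # returning "warn" here is an assumption, not the spec's rule.
    return "warn"
```

For instance, `decide(0.9, off_topic=True)` yields `halt_replace`, while the same call with `prior_consecutive_halts=1` yields `abort`.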
debate_rounds.superseded BOOLEAN DEFAULT FALSE
scidex/agora/scidex_orchestrator.py round loop [...] q-debate-dynamic-round-count's RoundController: judge fires first, then controller decides [...]
[...] paper_corpus first [...] scidex/forge/paper_corpus.py), falls back to live PubMed only [...]
GET /api/agora/debate/{id}/round_judgments returns the [...] /debate/{id} as a small [...]
tests/test_round_judge.py: [...]
SCIDEX_ROUND_JUDGE=shadow runs the judge but [...] enforce after a week.
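The cache-first PMID check mentioned above (paper_corpus first, live PubMed as fallback) might look like the sketch below. The two lookup callables are hypothetical stand-ins injected by the caller; they are not the real interfaces of scidex/forge/paper_corpus.py or scidex/forge/pubmed_search.py, which this spec does not show.

```python
from typing import Callable, Dict, Iterable


def verify_pmids(
    pmids: Iterable[str],
    cache_has: Callable[[str], bool],
    live_has: Callable[[str], bool],
) -> Dict[str, bool]:
    """Return a verified flag per PMID, consulting the cache first."""
    verified: Dict[str, bool] = {}
    for pmid in pmids:
        # `or` short-circuits, so the live lookup only runs on a cache miss.
        verified[pmid] = cache_has(pmid) or live_has(pmid)
    return verified


# Usage with in-memory stand-ins (all PMIDs here are made up).
cached = {"111"}
live = {"222"}
live_calls = []


def fake_cache(pmid: str) -> bool:
    return pmid in cached


def fake_live(pmid: str) -> bool:
    live_calls.append(pmid)  # record which PMIDs reached the live path
    return pmid in live


result = verify_pmids(["111", "222", "333"], fake_cache, fake_live)
```

Injecting the lookups keeps the check testable without network access; a PMID that neither source recognises comes back as unverified, which is the signal the fabricated-PMID rule would act on.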
[...] warn and one continue row land in debate_round_judgments.
judge_arena.py:_BEDROCK_MODELS["haiku"] [...]
[...] verified: bool per PMID in its prompt [...]
scidex/agora/scidex_orchestrator.py:1843 for where the new hook [...]
scidex/senate/judge_arena.py: judge LLM transport.
scidex/forge/paper_corpus.py: PMID cache.
q-debate-dynamic-round-count: co-exists; this hook fires first.
q-debate-replay-cross-topic: replay engine uses the judgment [...]
[...] round_judge.py, no debate_round_judgments migration, [...] round_judge_v1.md prompt; task is NOT stale. Main HEAD: df5b33140.
scidex/senate/round_judge.py (509 lines): JudgeVerdict + judge_round()
scidex/senate/prompts/round_judge_v1.md (123 lines): 3 failure modes + worked examples
migrations/20260428_debate_round_judgments.sql (62 lines): table + superseded col + halt_cache
tests/test_round_judge.py (206 lines): 5 case tests + helpers
enforce after validation week.
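The shadow-to-enforce rollout could be gated by a helper like the one below. The source names the SCIDEX_ROUND_JUDGE variable and the values shadow and enforce, but the text describing shadow behaviour is truncated. This sketch assumes shadow mode means the verdict is computed (and can still be persisted) while the orchestrator proceeds as if it were continue. The function name and the default when the variable is unset are also assumptions.

```python
import os
from typing import Optional


def effective_decision(judged: str, mode: Optional[str] = None) -> str:
    """Return the decision the orchestrator should act on."""
    if mode is None:
        # Assumption: an unset variable behaves like shadow mode.
        mode = os.environ.get("SCIDEX_ROUND_JUDGE", "shadow")
    if mode == "enforce":
        # Enforce mode: the judge's decision takes effect as-is.
        return judged
    # Shadow mode (assumed semantics): never interrupt the debate.
    return "continue"
```

Keeping the gate separate from judge_round() means the same verdict rows are written in both modes, so a validation week in shadow mode produces data directly comparable with enforce mode.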