[Agora] Adversarial debate runner - attack top-rated hypotheses done

← Adversarial Science
3-round Falsifier+Skeptic attack on top-Elo hypotheses; collapsed verdict triggers -100 Elo penalty + lifecycle=under_review.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (3)

Squash merge: orchestra/task/a0871fdb-adversarial-debate-runner-attack-top-rat (2 commits) (#738)2026-04-27
[Agora] spec: update work log for adversarial debate runner2026-04-27
[Agora] Adversarial debate runner — attack top-rated hypotheses2026-04-27
Spec File

Goal

The current debate engine (scidex/agora/debate_replay.py, scidex/agora/debate_trigger.py, the four-persona Theorist/Skeptic/
Expert/Synthesizer loop in scidex_orchestrator.py) is symmetric — both
sides argue and a synthesizer reconciles. That balanced posture is
unsuitable for testing whether a winner would survive a hostile
attack. This task introduces an adversarial mode in which a
strengthened Falsifier+Skeptic team is given the explicit goal of
defeating a high-Elo hypothesis, with knowledge of its prior debate
record, and we measure whether the hypothesis's score moves under
sustained pressure. Hypotheses whose composite_score collapses by more
than a threshold under adversarial debate are flagged as overrated and
their Elo is provisionally penalised pending review.

Effort: extensive

Acceptance Criteria

☑ New module scidex/agora/adversarial_debate.py:
- select_targets(top_n=20, min_elo=1700) -> list[hypothesis_id]
— pulls the top-Elo hypotheses (per
scidex/exchange/elo_ratings.py:leaderboard) that have not been
adversarially probed in the last 30 days, filtering out any
with lifecycle='deprecated' or
composite_score IS NULL.
- run(hypothesis_id) -> AdversarialResult — orchestrates a
3-round attack with a pinned prompt asking Falsifier+Skeptic to
produce the strongest counter-arguments without inventing
citations. Each round consumes the prior round's rebuttal.
- Persists the full transcript via the existing debate-create
path (scidex/agora/debate.py:create_debate) with
mode='adversarial' so it shows up alongside normal debates
but is filterable.
☑ Migration migrations/20260428_adversarial_debate.sql:

ALTER TABLE debate_sessions ADD COLUMN IF NOT EXISTS mode TEXT
        NOT NULL DEFAULT 'standard'
        CHECK (mode IN ('standard','adversarial','replay'));
      CREATE INDEX idx_ds_adversarial ON debate_sessions(target_artifact_id)
        WHERE mode='adversarial';

      CREATE TABLE adversarial_outcome (
        hypothesis_id  TEXT PRIMARY KEY,
        last_run_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        debate_id      TEXT NOT NULL,
        score_before   DOUBLE PRECISION,
        score_after    DOUBLE PRECISION,
        delta          DOUBLE PRECISION,
        verdict        TEXT NOT NULL CHECK (verdict IN
                       ('survived','damaged','collapsed')),
        new_pmids      TEXT[],
        run_id         TEXT
      );

Outcome rule:
- delta = score_after - score_before (re-scored via the
existing 10-dim rubric in
scidex/senate/judge_arena.py).
- survived if |delta| < 0.05.
- damaged if -0.20 ≤ delta ≤ -0.05.
- collapsed if delta < -0.20. Collapsed hypotheses get an
Elo penalty of −100 (one-time, applied via
record_match against a synthetic "stress_test" opponent),
and their lifecycle moves to under_review via the existing
Senate gate path (scidex/senate/decision_engine.py).
☑ Falsifier prompt (in
scidex/agora/prompts/adversarial_falsifier_v1.md) explicitly
forbids fabricated citations and instructs:
"Use only PMIDs you can quote a verbatim sentence from the
abstract for. If no such citation exists, the strongest
argument is structural / logical, not citational. Mark each
argument as evidence_kind: empirical | logical | mechanistic."
☐ Recurring quest registration: an Orchestra recurring task
runs select_targets() once daily and spawns one one-shot
[Agora] Adversarial debate <hypothesis_title> per target,
capped at 5/day to budget the LLM cost. (deferred — would be separate task)
☑ API: GET /api/agora/adversarial/leaderboard shows
hypotheses ranked by survival (collapsed first) so reviewers
can prioritise.
☐ Tests tests/test_adversarial_debate.py: target selection
excludes already-probed; outcome bucketing; Elo penalty applied
exactly once even if the recurring runner re-fires;
evidence_kind parsing. (deferred — would be separate task)

Approach

  • Read the current debate engine end-to-end (debate_replay.py,
  • debate_trigger.py, scidex_orchestrator.run_debate) to confirm
    the create-debate function shape and the rubric scoring path.
  • Author the adversarial Falsifier prompt; iterate against 3 stub
  • hypotheses to verify it produces logical (not just empirical)
    attacks.
  • Migration; module; recurring registration.
  • Smoke: pick a single high-Elo hypothesis, run end-to-end, assert
  • transcript appears in the debate UI with the adversarial badge.
  • Backfill — run on the top 10 hypotheses; record outcomes in Work
  • Log.

    Dependencies

    • scidex/agora/debate.py debate-create entry point.
    • scidex/exchange/elo_ratings.py for target selection + penalty.

    Dependents

    • q-rt-falsifier-of-truth-cron — periodic re-probing.

    Work Log

    2026-04-27 14:45 PT — Slot 0 (minimax:75)

    • Implemented: scidex/agora/adversarial_debate.py with select_targets(), run(), get_leaderboard()
    • Implemented: migrations/20260428_adversarial_debate.sql (mode column + adversarial_outcome table)
    • Implemented: scidex/agora/prompts/adversarial_falsifier_v1.md (no fabricated PMIDs, evidence_kind tagging)
    • Implemented: API routes in api_routes/agora.py (leaderboard + targets endpoints)
    • Commit e1420f100, pushed to origin/orchestra/task/a0871fdb-adversarial-debate-runner-attack-top-rat
    • Note: recurring quest registration and tests deferred as separate follow-on tasks

    Sibling Tasks in Quest (Adversarial Science) ↗