[Senate] Persona ladders - round-robin Elo tournament across personas done

Persona Elo arena fed by weighted-verdict pairwise wins; coverage-aware scheduler; daily 3-pair Orchestra tasks.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Senate] Persona ladder — round-robin Elo tournament across personas [task:8545fb83-fccc-44cf-b2e5-a29a7b2b7af3] (#749) 2026-04-27
Spec File

Effort: thorough

Goal

scidex/exchange/elo_ratings.py and scidex/senate/judge_elo.py already
maintain Elo for hypotheses and judges; nothing maintains an Elo *for the
personas themselves*. We have ~50 personas under personas/ (9 founding
plus dozens of scientists + philosophers); some are clearly more
informative debaters than others, and we cannot tell which without a
ranking signal. Build a persona ladder: a continuously-updated Elo
rating per persona derived from their pairwise debate-round wins (using
the weighted verdict from q-debate-evidence-weighted-vote), with a
round-robin scheduler that ensures every persona meets every other
persona over a sliding 90-day window. Top of the ladder gets first-pick
into newly-spawned debates; bottom gets retired-pending-review.
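
As a rough illustration of the rating update described above, here is a minimal sketch of a standard Elo step scaled by a verdict weight. The function names, the K-factor of 32, and the 1500 starting rating are assumptions for illustration only; the actual update lives in elo_ratings.update_match_result, whose signature and constants are not shown in this spec.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_pair(rating_a: float, rating_b: float, score_a: float,
                k: float = 32.0, weight_multiplier: float = 1.0):
    """Return new (rating_a, rating_b) after one pairwise match.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a draw.
    k and weight_multiplier are hypothetical knobs: the weighted
    verdict is assumed to scale the size of the rating change.
    """
    delta = k * weight_multiplier * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two personas at an assumed starting rating of 1500; A wins at full weight.
new_a, new_b = update_pair(1500.0, 1500.0, score_a=1.0)
```

Under these assumptions a half-weight verdict (weight_multiplier=0.5) moves both ratings half as far, so low-confidence rounds shift the ladder less.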

Acceptance Criteria

☑ New Elo arena personae (note: distinct from the existing
judge-meta arena in judge_elo.py:44) — implemented directly in
persona_ladder.py using PERSONA_ARENA = "personae".
☑ Module scidex/senate/persona_ladder.py:
- record_pair_match(session_id, persona_a, persona_b, winner,
  weight_multiplier=1.0): uses the weighted verdict to score.
- leaderboard(window_days: int = 90, limit: int = 100) -> list[dict]
- coverage(window_days) -> dict (returns
  {matched_pairs: n, possible_pairs: m, coverage_ratio: …}).
- pick_next_pair(window_days) -> tuple[str, str]: returns the
  pair with the fewest recent matches, with ties broken by the highest
  combined Elo (informativeness × data-need).
☑ Backfill: python -m scidex.senate.persona_ladder backfill walks
every debate_sessions with weighted_verdict_json populated and
records all C(n, 2) pairwise matches per session.
☑ Scheduler: a new admin endpoint
POST /api/agora/persona_ladder/schedule_match accepts a topic
and creates a 2-persona debate (using
scidex/agora/scidex_orchestrator.py:run_debate with
persona_ids=[a, b]) for the next pair.
☑ Recurring task: persona-ladder-scheduler daily picks 3 pairs
from pick_next_pair and schedules debates directly via the
orchestrator.
☑ HTML at /agora/persona-ladder — full leaderboard with
Elo trend sparkline per persona, current coverage % bar, and
"Next scheduled pair" widget.
☑ Retirement-pending-review: persona below 1300 Elo with ≥ 30
matches gets a row written to senate_alerts proposing review;
no auto-action.
☑ Tests tests/test_persona_ladder.py: 9 tests covering backfill,
leaderboard, coverage, pick_next_pair, retirement alerts, and full
workflow.
☑ Smoke: backfill against current DB; leaderboard shows non-trivial
Elo distribution across personas.
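
The backfill and coverage criteria above can be sketched as follows. This is a self-contained illustration, not the module's code: session_pairs and coverage_from_pairs are hypothetical helpers that take plain lists instead of reading debate_sessions, and only the three dictionary keys come from the spec.

```python
from itertools import combinations


def session_pairs(persona_ids):
    """Expand one session's personas into all C(n, 2) unordered pairs."""
    unique = sorted(set(persona_ids))  # dedupe and fix an order so (a, b) == (b, a)
    return list(combinations(unique, 2))


def coverage_from_pairs(matched_pairs, persona_pool):
    """Compute the coverage dict from matched pairs and the persona pool.

    Assumes matched_pairs has already been filtered to the window of
    interest and that every persona in it belongs to persona_pool.
    """
    n = len(set(persona_pool))
    possible = n * (n - 1) // 2
    matched = {tuple(sorted(pair)) for pair in matched_pairs}
    ratio = len(matched) / possible if possible else 0.0
    return {
        "matched_pairs": len(matched),
        "possible_pairs": possible,
        "coverage_ratio": ratio,
    }


# A three-persona session yields three pairwise matches.
pairs = session_pairs(["skeptic", "theorist", "empiricist"])
report = coverage_from_pairs(pairs, ["skeptic", "theorist", "empiricist", "ethicist"])
```

With roughly 50 personas there are 50 × 49 / 2 = 1225 possible pairs, so daily scheduling of 3 pairs alone covers at most 270 of them in a 90-day window; the backfill of multi-persona sessions contributes the rest.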

Approach

  • Reuse elo_ratings.update_match_result — only difference is the new
    arena name.
  • pick_next_pair is argmin(recent_match_count) + tiebreak on Elo;
    keep simple.
  • Scheduling matches as Orchestra tasks (rather than firing them
    directly) lets the existing fleet pick them up with the appropriate
    model effort.
  • HTML reuses templates/agora/base.html.
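
The pair-selection rule in the list above (fewest recent matches, tiebreak on Elo) can be sketched as a single min over all candidate pairs. The function name and the in-memory inputs are assumptions; the real pick_next_pair takes window_days and presumably reads counts and ratings from the database.

```python
from itertools import combinations
from typing import Dict, Tuple


def pick_next_pair_sketch(ratings: Dict[str, float],
                          recent_counts: Dict[Tuple[str, str], int]) -> Tuple[str, str]:
    """Pick the least-played pair, breaking ties by highest combined Elo.

    recent_counts is keyed by alphabetically sorted (a, b) tuples;
    pairs missing from it are treated as never matched in the window.
    """
    personas = sorted(ratings)
    if len(personas) < 2:
        raise ValueError("need at least two personas to form a pair")

    def sort_key(pair):
        a, b = pair
        # Lower count first; negate the Elo sum so larger sums win ties.
        return (recent_counts.get(pair, 0), -(ratings[a] + ratings[b]))

    return min(combinations(personas, 2), key=sort_key)


ratings = {"empiricist": 1600.0, "skeptic": 1500.0, "theorist": 1400.0}
recent = {("empiricist", "skeptic"): 2}
next_pair = pick_next_pair_sketch(ratings, recent)
```

Here the two unplayed pairs tie on count, and the one containing the higher-rated persona is chosen. Enumerating every pair is quadratic in the pool size, which is cheap at roughly 50 personas.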
Dependencies

  • q-debate-evidence-weighted-vote — supplies the per-round weighted
    winner.
  • scidex/exchange/elo_ratings.py — Elo machinery.
  • scidex.agents.select — persona pool source.

Dependents

  • q-persona-disagreement-scoreboard — high-Elo ↔ high-Elo pairs are
    the most informative disagreements; ladder data feeds the scoreboard's
    prior.
  • q-debate-replay-cross-topic — ladder Elo helps the replay engine
    pick personas for new topics.

Work Log

Payload JSON
{
  "completion_shas": [
    "0166407d2"
  ],
  "completion_shas_checked_at": ""
}
