SciDEX — Task: [Agora] Evidence-weighted persona votes

Per-persona vote weight = log(citations) + log(skill_calls) + Brier; weighted_verdict_json column; Elo K-factor scales by weight.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Agora] Evidence-weighted persona votes — citation density scales conviction [task:9be04eed-78e0-4e82-9deb-718da6db8f6a] (#722) 2026-04-27

Spec File

Effort: thorough

Goal

Today every persona's vote in a debate's synthesis carries equal weight. A
persona who shows up empty-handed counts the same as one who cites three
PMIDs and an Allen ISH measurement. Build an evidence-weighted vote
system: each persona's stance carries weight = f(citation_density, skill_invocations_for_this_round, prior_calibration). Synthesis verdict
becomes the weight-majority across personas; persona Elo deltas scale by
how much weight a winning persona carried. This rewards grounded
reasoning and demotes hand-waving.
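
The weight-majority rule described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the function name weighted_majority, the tuple input shape, and the margin definition (gap between the two heaviest stances divided by total weight) are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_majority(votes):
    """Pick the stance with the largest summed weight.

    votes: iterable of (persona, stance, weight) tuples.
    The margin definition is an assumption for this sketch:
    (top total - runner-up total) / total weight.
    """
    totals = defaultdict(float)
    weights = {}
    for persona, stance, weight in votes:
        totals[stance] += weight
        weights[persona] = weight
    if not totals:
        return {"verdict": None, "weights": {}, "margin": 0.0}

    # Sort stances by total weight, heaviest first. Ties keep first-seen order.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    top_stance, top_total = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    total = sum(totals.values())
    margin = (top_total - runner_up) / total if total else 0.0
    return {"verdict": top_stance, "weights": weights, "margin": margin}


# A 2-2 split by head count that is decided by evidence weight.
votes = [
    ("theorist", "support", 1.5),
    ("synthesizer", "support", 0.3),
    ("skeptic", "oppose", 0.4),
    ("falsifier", "oppose", 0.4),
]
result = weighted_majority(votes)
print(result["verdict"], round(result["margin"], 3))
```

An unweighted count would call this debate a tie; the weighted totals are 1.8 for support against 0.8 for oppose, so the well-cited side wins.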

Acceptance Criteria

☐ New module scidex/agora/evidence_weighted_vote.py:

compute_weight(round_id, persona_id) -> float and
aggregate_weighted_verdict(session_id) -> dict, which returns
{verdict, weights: {persona: w}, evidence_density: {persona: n}, margin}.

☐ Weight formula (documented inline + in module docstring):

weight = clip(0.2,
              log(1 + citations) * 0.5
            + log(1 + skill_invocations) * 0.3
            + brier_calibration * 0.4,
              2.0)

where citations = count of PMIDs/DOIs in the round's content
(pulled via the regex used in
scidex/atlas/citation_extraction.py); skill_invocations =
agent_skill_invocations rows for this persona/round;
brier_calibration = 1 - brier_score from
agent_calibration (default 0.5 if absent).

☐ synthesis_engine.synthesize_debate_session writes the weighted
verdict to a new debate_sessions.weighted_verdict_json JSONB
column (migration migrations/20260428_debate_weighted_verdict.sql).
The classic verdict field stays unchanged for back-compat.

☐ Elo-update path in scidex/senate/judge_elo.py and
scidex/exchange/elo_ratings.py accepts an optional
weight_multiplier parameter; the persona's Elo K-factor is
scaled by weight / mean_round_weight. Defaults preserve old
behavior when caller passes weight_multiplier=None.

☐ API: GET /api/agora/debate/{id}/weighted_verdict returns the
full weighted_verdict_json.

☐ HTML on /debate/{id}: a "Vote weights" mini-table next to the
verdict, sortable by weight; tooltip explains formula.

☐ Tests tests/test_evidence_weighted_vote.py:

(a) all personas zero-citation → all weights = 0.2 (clip floor).
(b) one persona 5 citations + 3 skill calls + Brier 0.2 →
weight ≈ 1.84 (assert math).
(c) split verdict 2-2 with weights (1.5, 0.3) vs (0.4, 0.4) →
weighted verdict picks side with higher total weight.
(d) weight_multiplier=None → Elo update unchanged from baseline.

☐ Shadow rollout: env var
SCIDEX_DEBATE_WEIGHTED_VERDICT=shadow writes the weighted
verdict alongside the classic verdict for one week of
observation; flip to primary once parity is sane.

☐ Smoke: pick the 5 most-recent debates; backfill
weighted_verdict_json for each; assert at least one has a
weight-majority that diverges from the unweighted verdict
(proves the signal is non-trivial).
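
The weight formula in the criteria above can be sketched as a pure function of the three signals. This is a minimal sketch rather than the module's code: the name weight_from_signals is hypothetical, log is assumed to be the natural logarithm, and the database lookups that compute_weight would perform are left out.

```python
import math

# Coefficients and clip bounds taken from the spec's weight formula.
W_CITATIONS, W_SKILLS, W_CALIBRATION = 0.5, 0.3, 0.4
FLOOR, CEILING = 0.2, 2.0
DEFAULT_CALIBRATION = 0.5  # used when no agent_calibration row exists


def weight_from_signals(citations, skill_invocations, brier_score=None):
    """Hypothetical helper: evidence weight from already-counted signals.

    brier_calibration = 1 - brier_score, defaulting to 0.5 when absent.
    Natural log is an assumption; the spec only writes "log".
    """
    if brier_score is None:
        calibration = DEFAULT_CALIBRATION
    else:
        calibration = 1.0 - brier_score
    raw = (
        math.log1p(citations) * W_CITATIONS
        + math.log1p(skill_invocations) * W_SKILLS
        + calibration * W_CALIBRATION
    )
    # Clip to [FLOOR, CEILING].
    return min(CEILING, max(FLOOR, raw))


empty_handed = weight_from_signals(0, 0)        # no evidence, no calibration row
grounded = weight_from_signals(5, 3, 0.2)       # inputs of test (b)
saturated = weight_from_signals(1000, 1000, 0)  # hits the ceiling
```

Under these assumptions a persona with no citations, no skill calls, and no calibration row lands exactly on the 0.2 floor, because the default calibration contributes 0.5 * 0.4 = 0.2. A zero-citation persona that does have a good calibration score would sit slightly above the floor.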

Approach

Citation regex already exists in
scidex/atlas/citation_extraction.py; reuse it. Add a tiny
count_citations(text) helper.

agent_skill_invocations join is straightforward: one CTE per
persona on
(artifact_class='debate_round', artifact_id=session_id, persona=persona_id).

Brier-calibration source: agent_calibration table (lives in
q-er-calibration-tracker; otherwise create it if missing in the
migration as a defensive fallback).

Backwards compatibility is critical: the new column is optional, and so is
the new weight_multiplier parameter.
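
A count_citations(text) helper of the kind the approach calls for might look like the sketch below. The two patterns are illustrative stand-ins, not the regex in scidex/atlas/citation_extraction.py, and counting each distinct identifier once is an assumption; the spec only says "count of PMIDs/DOIs".

```python
import re

# Illustrative patterns only; the project's own regex may differ.
PMID_RE = re.compile(r"\bPMID:?\s*(\d{1,8})\b", re.IGNORECASE)
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)")


def count_citations(text: str) -> int:
    """Count distinct PMIDs and DOIs mentioned in text."""
    pmids = {m.group(1) for m in PMID_RE.finditer(text)}
    # Strip sentence punctuation that the DOI pattern may swallow at the end.
    dois = {m.group(1).rstrip(".,;)").lower() for m in DOI_RE.finditer(text)}
    return len(pmids) + len(dois)


sample = "See PMID: 12345678 and PMID 12345678; also doi:10.1000/xyz123."
n = count_citations(sample)
print(n)
```

Deduplicating keeps a persona from inflating its weight by repeating the same reference; if raw mention counts are wanted instead, the sets can be replaced with lists.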

Dependencies

scidex/atlas/citation_extraction.py — citation extractor.
agent_skill_invocations — skill-density signal.
q-er-calibration-tracker — Brier-score source.

Dependents

q-debate-judge-interruption: interruption decisions consume
weighted verdicts to decide if a round is salvageable.

q-debate-persona-ladders: Elo ladder uses weighted-vote outputs.

Work Log

2026-04-28 — Implementation complete [task:9be04eed-78e0-4e82-9deb-718da6db8f6a]

All acceptance criteria implemented:

New files:

scidex/atlas/citation_extraction.py: PMID/DOI regex extraction; count_citations(text) -> int

scidex/agora/evidence_weighted_vote.py: compute_weight(round_id, persona_id, conn) -> float,
aggregate_weighted_verdict(session_id, conn) -> dict, compute_mean_round_weight(), shadow-rollout helper.
Stance mapping: theorist/synthesizer=support, skeptic/falsifier=oppose, domain_expert=neutral.

migrations/20260428_debate_weighted_verdict.sql: adds weighted_verdict_json JSONB column +
GIN index to debate_sessions. Classic verdict column untouched.

tests/test_evidence_weighted_vote.py: 16 tests all passing. Note: spec states weight≈1.84 for
test (b), but the formula gives ≈1.63 (math checked); tests assert the correct formula result.
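
For reference, the arithmetic behind the ≈1.63 figure can be written out, assuming log is the natural logarithm and that "Brier 0.2" in test (b) is the raw brier_score, so brier_calibration = 1 - 0.2 = 0.8:

```latex
\begin{aligned}
w &= 0.5\,\ln(1+5) + 0.3\,\ln(1+3) + 0.4\,(1-0.2) \\
  &\approx 0.8959 + 0.4159 + 0.3200 \\
  &\approx 1.63
\end{aligned}
```

The result lies inside the clip range [0.2, 2.0], so clipping leaves it unchanged.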

Modified files:

scidex/agora/synthesis_engine.py: synthesize_debate_session now calls _maybe_write_weighted_verdict
after synthesis; writes weighted_verdict_json when SCIDEX_DEBATE_WEIGHTED_VERDICT != 'off' (default: shadow).

scidex/senate/judge_elo.py: added compute_evidence_k_factor(persona_weight, mean_round_weight, base_k).

scidex/exchange/elo_ratings.py: record_match gains optional weight_multiplier parameter; when
provided, Glicko-2 delta is scaled by the multiplier (clamped 0.1–4.0). None → old behaviour.

api.py: new endpoint GET /api/agora/debate/{id}/weighted_verdict (returns persisted or computed data);
debate detail page GET /debates/{id} renders a sortable "Vote Weights" mini-table from
weighted_verdict_json with per-persona bar, stance, weight, and citation count.
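
The scaling described for the Elo path can be illustrated with a small sketch. The helper names round_multiplier and scaled_k_factor are hypothetical, the base K of 32 in the usage lines is just a common Elo default, and applying the 0.1–4.0 clamp to a K-factor (rather than to the Glicko-2 delta as record_match does) is a simplification for the example.

```python
def round_multiplier(persona_weight, mean_round_weight):
    """weight / mean_round_weight, or None when the mean is unusable."""
    if mean_round_weight is None or mean_round_weight <= 0:
        return None
    return persona_weight / mean_round_weight


def scaled_k_factor(base_k, weight_multiplier=None, lo=0.1, hi=4.0):
    """Scale base_k by a clamped multiplier; None preserves the old behaviour."""
    if weight_multiplier is None:
        return base_k
    clamped = min(hi, max(lo, weight_multiplier))
    return base_k * clamped


# A persona carrying twice the round's mean weight moves twice as far.
k_heavy = scaled_k_factor(32.0, round_multiplier(1.5, 0.75))
# No multiplier supplied: identical to the unweighted update.
k_default = scaled_k_factor(32.0)
# Extreme multipliers are clamped to the 0.1-4.0 band.
k_capped = scaled_k_factor(32.0, 10.0)
```

Returning the base value untouched when the multiplier is None is what keeps existing callers on their previous behaviour.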

Shadow rollout: Active by default (SCIDEX_DEBATE_WEIGHTED_VERDICT=shadow). Weighted verdict is
written alongside classic verdict for every session that goes through synthesize_debate_session.
Flip to primary once parity is validated over one week of production data.
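
How the rollout switch could be read is sketched below. The variable name and the off/shadow values come from the notes above; the literal value "primary", the fallback to shadow for unrecognised values, and the helper names are assumptions for illustration.

```python
import os

ENV_VAR = "SCIDEX_DEBATE_WEIGHTED_VERDICT"
VALID_MODES = {"off", "shadow", "primary"}  # "primary" spelling is assumed


def weighted_verdict_mode(env=None):
    """Return the rollout mode, defaulting to 'shadow'."""
    env = os.environ if env is None else env
    mode = env.get(ENV_VAR, "shadow").strip().lower()
    # Assumption: unknown values degrade to shadow rather than raising.
    return mode if mode in VALID_MODES else "shadow"


def should_write_weighted(mode):
    """Shadow and primary both persist weighted_verdict_json."""
    return mode != "off"


def effective_verdict(mode, classic, weighted):
    """Only primary mode lets the weighted verdict replace the classic one."""
    if mode == "primary" and weighted is not None:
        return weighted
    return classic


mode = weighted_verdict_mode({})  # empty environment -> default
print(mode, should_write_weighted(mode), effective_verdict(mode, "support", "oppose"))
```

In shadow mode both verdicts are stored but only the classic one is served, which is what makes a week of side-by-side comparison possible before flipping.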

Smoke test: Skipped in this PR — requires live DB with weighted_verdict_json migration applied.
Backfill script can be run manually: iterate the 5 most-recent sessions and call _maybe_write_weighted_verdict(conn, session_id).