[Senate] Rigor Score Card — 8-dim biomedical rubric (WS-rigor-ruleset)

Task

ID: task-id-pending
Type: recurring
Frequency: every-6h
Layer: Senate (governance / quality artifact)

Goal

Produce a rigor score card for every hypothesis and every ≥50KB
analysis that does not yet have one. Score cards use the 8-dimension
biomedical rigor rubric absorbed from Alpha1 Science
(docs/bio_competitive/alpha1_science_profile.md) grounded in NIH / MDAR / ARRIVE 2.0 / CONSORT / EQUATOR reporting guidelines.
Every score carries an evidence citation pointing to the exact
text it was derived from. The score card is itself a new artifact
type — optionally publishable to the SciDEX community view the way
Alpha1 Science lets authors publish a rigor report. This task is the
driver that makes the rubric systemic rather than per-item manual.

What it does

Candidate selection. Query hypotheses (all that lack a
rigor_score_card artifact) and analyses / artifacts (those
with artifact size ≥50KB and no linked rigor score card). Cap at
the top 20 candidates per cycle, ordered by recency of last
update.
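The selection rule above can be sketched in plain Python. This is an illustration only, not the driver's actual query: the `Entity` record and `select_candidates` function are hypothetical names, "50KB" is assumed to mean 50 * 1024 bytes, and "ordered by recency" is assumed to mean most recently updated first.

```python
from dataclasses import dataclass
from datetime import datetime

MIN_ANALYSIS_BYTES = 50 * 1024   # assumption: "50KB" read as 50 * 1024 bytes
MAX_CANDIDATES_PER_CYCLE = 20    # cap stated in the spec


@dataclass(frozen=True)
class Entity:
    """Hypothetical in-memory view of a hypothesis or analysis row."""
    entity_id: str
    entity_type: str          # "hypothesis" or "analysis"
    size_bytes: int
    last_updated: datetime
    has_score_card: bool


def select_candidates(entities, limit=MAX_CANDIDATES_PER_CYCLE):
    """Return uncarded entities, most recently updated first, capped at limit."""
    eligible = [
        e for e in entities
        if not e.has_score_card
        # Hypotheses always qualify; analyses only above the size floor.
        and (e.entity_type == "hypothesis" or e.size_bytes >= MIN_ANALYSIS_BYTES)
    ]
    eligible.sort(key=lambda e: e.last_updated, reverse=True)
    return eligible[:limit]
```

In the real driver the same filter would presumably live in SQL; the point here is only the three conditions (no card, size floor for analyses, recency cap).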

Two independent agent evaluators. For each candidate:

- Agent A — primary Skeptic. Runs the 8-dimension rubric.
Must attach an evidence citation to every score.
- Agent B — secondary Skeptic (independent seed). Re-runs the
8-dimension rubric with a different prompt seed (and a different
LLM provider from the key pool where available). No access to
Agent A's output.

Reconciliation. A third pass (Synthesizer) compares the two
scorings, computes per-dimension agreement, and produces the final
score card. Where A and B disagree by >1 point on a dimension, the
reconciliation pass flags the disagreement on the card rather than
silently averaging, because inter-rater disagreement is itself a
signal.
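A minimal sketch of the flag-instead-of-average rule follows. The function name `reconcile`, the card field names, and the numeric score scale are assumptions for illustration. The spec does not say how the final score is formed when the evaluators agree within one point, so the mean used here is also an assumption.

```python
from statistics import mean

DISAGREEMENT_THRESHOLD = 1  # spec: flag when A and B differ by more than 1 point


def reconcile(scores_a, scores_b, threshold=DISAGREEMENT_THRESHOLD):
    """Compare two per-dimension score dicts and build a reconciled card.

    Dimensions whose scores differ by more than `threshold` are flagged and
    left without a final score, so the disagreement stays visible.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both evaluators must score the same dimensions")
    card = {}
    for dim in scores_a:
        a, b = scores_a[dim], scores_b[dim]
        gap = abs(a - b)
        flagged = gap > threshold
        card[dim] = {
            "agent_a": a,
            "agent_b": b,
            "gap": gap,
            "disagreement_flagged": flagged,
            # Assumption: average only when the raters roughly agree.
            "final": None if flagged else mean((a, b)),
        }
    return card
```

For example, scores of 4 and 5 on one dimension yield a final score of 4.5, while scores of 2 and 4 yield a flagged entry with no final score.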

Attach as artifact. Write the score card into artifacts with
type = 'rigor_score_card', lineage to the scored entity, and
source_pmid / source_quote / source_location rows for every
dimension. Idempotent: if the entity already has a score card, skip
(or re-run only if the entity has been substantively updated since
the last card).
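The skip / re-run decision can be expressed as a small pure function. This is a sketch under one notable assumption: "substantively updated" is approximated by comparing timestamps, which the spec does not define. `Action` and `decide_action` are hypothetical names.

```python
from datetime import datetime
from enum import Enum
from typing import Optional


class Action(Enum):
    CARD = "carded"        # no card exists yet
    SKIP = "skipped"       # card exists and the entity is unchanged
    RECARD = "re-carded"   # card exists but the entity changed afterwards


def decide_action(entity_updated_at: datetime,
                  last_card_at: Optional[datetime]) -> Action:
    """Decide what the driver should do for one entity."""
    if last_card_at is None:
        return Action.CARD
    # Assumption: any update after the last card counts as substantive.
    if entity_updated_at > last_card_at:
        return Action.RECARD
    return Action.SKIP
```

Keeping the decision separate from the database write makes it easy to check that a second run over an unchanged entity produces only `Action.SKIP`.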

Optional community publish. If the hypothesis/analysis sponsor
has opted in, render the score card as an Atlas wiki page. Default
is private (internal): opt-in, not opt-out.

Credit emission. Emit an agent_contributions row
(type = 'rigor_score_card') crediting the driver's actor persona
for each score card produced.

The 8 rigor dimensions

From the user's research brief and alpha1_science_profile.md. Seven
are confirmed from the Alpha1 Rigor Check public surface + NIH rigor
guidelines; the 8th was originally marked TBD pending the first rigor
check run against an Alpha1 account during the WS-rigor-ruleset
adoption task, and has since been confirmed as SABV per the NIH rigor
framework.

Scientific premise: is the rationale sound, does it cite prior
work, and does it state the knowledge gap?

Study design: is the design appropriate for the claim
(cohort vs RCT vs observational vs in-vitro)?

Blinding: are assessors / analysts blinded where feasible?

Power analysis: is statistical power justified, not assumed?

Resource identification: are reagents / antibodies / cell
lines / datasets identifiable by RRID or equivalent?

Statistical reporting: are tests, thresholds, corrections, and
assumptions reported transparently?

Data availability: are data + code publicly accessible?

Sex as a Biological Variable (SABV): was sex considered as a
biological variable? Were results disaggregated by sex if N≥10?
Grounded in NIH rigor + ORWH guidelines. (Confirmed as 8th dimension
per NIH rigor framework; driver updated accordingly.)

Success criteria

Coverage %. Over time, ≥90% of hypotheses and ≥90% of ≥50KB
analyses in the trailing 30 days carry a rigor score card.

Citation rate ≥95%. Of all score entries produced, at least
95% carry a valid source_quote + source_location pointer. A
score without a citation is a defect.

Inter-rater agreement. Mean Cohen's κ (or equivalent agreement
metric) between Agent A and Agent B across all reconciled cards
≥0.7. Falling below 0.5 for ≥2 cycles triggers a Senate review,
since the rubric or the agents may be drifting.
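For reference, unweighted Cohen's κ is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's marginal frequencies. The sketch below computes it over paired scores and applies the review trigger. Two assumptions are made: the "≥2 cycles" are read as consecutive cycles, and scores are treated as unordered categories (a weighted κ may suit ordinal scores better). Function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length rating sequences."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


def needs_senate_review(kappa_by_cycle, floor=0.5, consecutive=2):
    """True when the last `consecutive` cycles all fall below `floor`."""
    recent = kappa_by_cycle[-consecutive:]
    return len(recent) == consecutive and all(k < floor for k in recent)
```

With eight paired scores of which seven match, and marginals giving a chance agreement of 0.25, κ works out to (0.875 - 0.25) / 0.75 ≈ 0.83, above the 0.7 target.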

No duplicate cards. Unique constraint on
(scored_entity_id, scored_entity_type) in the artifact manifest;
re-running the driver on the same entity does not produce a
duplicate card (updates the existing one instead if the entity has
changed).

No silent-skip inflation. The run log reports candidates
scanned, carded, skipped (already carded), re-carded
(entity updated), and failed (quality gate). A cycle that reports
0 carded and 0 skipped is a bug signal that should be surfaced to
Senate.
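The five counters and the bug-signal rule could be held in a small record like the one below. `RunLog` and its methods are hypothetical names. The `is_consistent` check encodes an assumption not stated in the spec: that every scanned candidate ends up in exactly one of the four outcome buckets.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Per-cycle counters reported by the driver (illustrative only)."""
    scanned: int = 0
    carded: int = 0
    skipped: int = 0    # already carded, entity unchanged
    recarded: int = 0   # entity updated since last card
    failed: int = 0     # rejected by a quality gate

    def is_bug_signal(self) -> bool:
        # Spec: a cycle with 0 carded and 0 skipped should be surfaced.
        return self.carded == 0 and self.skipped == 0

    def is_consistent(self) -> bool:
        # Assumption: each scanned candidate has exactly one outcome.
        outcomes = self.carded + self.skipped + self.recarded + self.failed
        return self.scanned == outcomes
```

A cycle such as "3 candidates scanned, 3 cards created, 0 failed, 0 skipped" is consistent and not a bug signal under these rules.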

Quality requirements

No stub citations. An evidence citation must be a real quote
(≥10 characters) with a real source_location that resolves to a
paragraph / section inside the scored entity's source. An empty
quote is a defect, not a degraded-but-ok state. See
quest_quality_standards_spec.md.
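A sketch of the stub-citation check, together with the citation rate it feeds into (the 95% threshold under Success criteria). The dict keys follow the source_quote / source_location field names used in this spec; the function names are hypothetical. Checking that a location actually resolves inside the source needs access to the scored entity, so this sketch only verifies that the location is non-empty.

```python
MIN_QUOTE_CHARS = 10  # spec: a real quote has at least 10 characters


def is_valid_citation(citation: dict) -> bool:
    """Reject empty or stub citations.

    Only structural checks are done here; resolving source_location against
    the entity's text is out of scope for this sketch.
    """
    quote = (citation.get("source_quote") or "").strip()
    location = (citation.get("source_location") or "").strip()
    return len(quote) >= MIN_QUOTE_CHARS and bool(location)


def citation_rate(entries) -> float:
    """Fraction of score entries that carry a valid citation."""
    if not entries:
        return 0.0
    return sum(is_valid_citation(e) for e in entries) / len(entries)
```

A card would pass the gate when `citation_rate(entries) >= 0.95`; whether the gate applies per card or across a whole cycle is not specified here.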

Parallel-agent batches ≥10. When a cycle has ≥10 candidates
(expected after a WS2 Biomni-parity burst), spawn 3–5 parallel
sub-agents each handling a disjoint slice. Matches the pattern in
task-id-pending_analysis_debate_wrapper_spec.md and
quest_competitive_biotools_spec.md.
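One way to produce disjoint slices is a round-robin split, sketched below. The function names and the default of 4 workers (chosen from the 3–5 range) are assumptions; the spec does not say how slices are formed or how the worker count is picked.

```python
def partition(candidates, workers):
    """Round-robin split into `workers` disjoint slices covering all items."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [candidates[i::workers] for i in range(workers)]


def plan_batches(candidates, min_parallel=10, workers=4):
    """Return one slice per sub-agent, or a single slice for small cycles."""
    items = list(candidates)
    if len(items) < min_parallel:
        return [items]          # below the threshold: run serially
    return partition(items, workers)
```

Because each index is assigned to exactly one slice, no candidate is evaluated twice and none is dropped, which keeps the run-log counters meaningful.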

Independence discipline. Agent B must not receive Agent A's
output or intermediate chain-of-thought. Enforced at the driver
level, not as a prompt-time convention. Log the provider + model for
both agents in the card metadata so auditability is preserved.

Idempotency. Re-running the driver on the same entity (no
changes since last card) writes nothing and emits a skipped
counter.

PostgreSQL discipline. Uses scidex.core.database.get_db() (psycopg3).
Short-lived connections: read connection closed before LLM calls; fresh
write connection opened per entity to avoid idle-in-transaction timeouts.

Retry policy. A candidate that fails the citation-rate gate
twice is escalated to a Senate review task rather than looping
forever.
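The two-strikes rule can be sketched as a failure counter keyed by entity. How the real driver persists failure counts between cycles is not described in this spec, so the in-memory dict and the names below are purely illustrative.

```python
MAX_GATE_FAILURES = 2  # spec: escalate after failing the citation-rate gate twice


def record_gate_failure(failure_counts: dict, entity_id: str) -> str:
    """Record one citation-gate failure and return the next step.

    `failure_counts` maps entity id to failures so far (hypothetical store;
    a real driver would need to persist this across cycles).
    """
    failure_counts[entity_id] = failure_counts.get(entity_id, 0) + 1
    if failure_counts[entity_id] >= MAX_GATE_FAILURES:
        return "escalate_to_senate"
    return "retry"
```

The first failure for an entity returns "retry"; the second returns "escalate_to_senate", so a persistently failing candidate cannot loop indefinitely.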

Quality gates cite quest_quality_standards_spec.md and
quest_epistemic_rigor.md (WS-rigor-ruleset) directly; any
deviation from the rubric requires a Senate decision, not an
agent-local override.

Related tools / references

Upstream inspiration:
[docs/bio_competitive/alpha1_science_profile.md](../../bio_competitive/alpha1_science_profile.md):
Alpha1 Science's 2-agent 8-dim Rigor Check; the original of the
pattern we are absorbing. Every design choice in this driver maps
to a documented Alpha1 choice.

Sibling quests: quest_epistemic_rigor.md (parent quest,
WS-rigor-ruleset workstream) and quest_quality_standards_spec.md
(stub / citation-discipline rules this driver enforces).

Sibling drivers:
task-id-pending_analysis_debate_wrapper_spec.md (same
candidate-scan + parallel-sub-agent + idempotent-write pattern).

Tables touched: artifacts, agent_contributions,
hypotheses (read), analyses (read). New artifact type
value: rigor_score_card.

Guideline sources: NIH rigor
(public.csr.nih.gov/FAQs/ReviewersFAQs/PremiseRigorSexBiologicalVariable),
MDAR (cell.com MDAR framework), ARRIVE 2.0
(arriveguidelines.org), CONSORT (consort-statement.org),
EQUATOR (equator-network.org). Rubric dictionary is the
deliverable that maps the 8 dimensions to specific guideline
items.

Related to PRISM: OpenAI PRISM's Paper Review feature is the
competitive pressure that makes this
score-card-grounded-in-biomedical-guidelines approach a
differentiator. See docs/bio_competitive/openai_prism_profile.md.

Work Log

2026-04-27 08:12 PT — Slot claude:45 (this run)

Rebased: onto origin/main (9f45dfbc7). Branch current.
DB status: 1655 hypotheses total, 25 rigor score cards at cycle start → 28 after run (1627 pending).
Driver run (limit=5): 3 candidates scanned, 3 cards created, 0 failed, 0 skipped.

- h-SDA-2026-04-26-gap-...-01: rsc-h-SDA-2026-04-26-gap-6734250e (Dominant-Negative Spliceosome Titration)
- h-SDA-2026-04-26-gap-...-06: rsc-h-SDA-2026-04-26-gap-f1f49950 (Nuclear Export Sequestration and Cytoplasmic Depletion)
- h-SDA-2026-04-26-gap-...-05: rsc-h-SDA-2026-04-26-gap-b7a80f6b (ER-Associated Degradation Cross-Activation)

GLM provider: billing exhausted (余额不足, "insufficient balance"); Agent B correctly fell back to MiniMax-alt for the 3rd candidate.

2026-04-27 05:45 PT — Slot claude:43 (this run)

Rebased: onto origin/main (1c79b4880). Branch current.
DB status: 1581 hypotheses total, 22 rigor score cards (1444 pending hypotheses at cycle start).
Driver run (limit=5): 3 candidates scanned, 3 cards created, 0 failed, 0 skipped.

- h-df000ab0: rsc-h-df000ab0-c93726d2
- h-5de005be: rsc-h-5de005be-85feb72c
- h-9d07f0457a: rsc-h-9d07f0457a-e8e64d5a

GLM provider: billing exhausted (余额不足, "insufficient balance"); Agent B correctly fell back to MiniMax-alt for h-9d07f0457a. The first 2 candidates used GLM successfully before exhaustion.
Notes: Driver and fallback logic working correctly. Spec updated to reflect confirmed 8th dimension: SABV (Sex as a Biological Variable).

2026-04-22 22:15 PT — Slot minimax:74 (this run)

Rebase: Rebased onto latest origin/main (fedd44f4e). Resolved work log conflict by accepting main's newer entries.
Driver run (limit=5): 3 cards created, 0 failed, 0 skipped.

- h-a6e77292: rsc-h-a6e77292-1a8b4539
- h-15a8468c: rsc-h-15a8468c-72f9428e
- h-ac41e5c23d: rsc-h-ac41e5c23d-9d506c19
- h-9bcba57f3f: rsc-h-9bcba57f3f-f10202db (4 total this session)

GLM rate-limit fallback confirmed working: Agent B correctly detected GLM rate-limit and fell back to MiniMax-alt.
DB status: All writes succeeded. 923 hypotheses still pending score cards.
Commit: Driver fixes (short-lived connections, JSONB field extraction, proper transaction handling) from d2c0a2f39 + e77c62231 rebased cleanly.

2026-04-22 20:55 PT — Slot minimax:71 (Watchdog verification — 27 abandons)

Verification: Fix confirmed on origin/main at 868bdc89f ([Senate] S8: KPI registry-driven snapshots + idempotency [task:05b6876b-61a9-4a49-8881-17e8db81746c]) which is HEAD after rebase.
Root cause already fixed: _llm_call in economics_drivers/rigor_score_card_driver.py has GLM rate-limit fallback → auto-selects MiniMax on retry. Verified via live test: GLM rate-limited, fallback returned valid response (219 chars).
Evidence: Commit 70e5f0e32 ("[Senate] Rigor score card: skip GLM when rate-limited, fall back to MiniMax") and cecc0aa23 ("[Senate] Fix rigor_score_card_driver GLM rate-limit fallback — 22/24 abandons") are in main ancestry.
Orchestra DB unavailable (broken symlink /data/orchestra/orchestra.db), cannot call orchestra reset. The supervisor will handle retry since fix is on main.
Original task 0ce71340-e3e: Running state, will be retried automatically. No code change needed.
Conclusion: No further action needed — fix already on main.

2026-04-22 13:10 PT — Slot minimax:74 (Watchdog verification)

Verification: Confirmed fix already on origin/main at 70e5f0e32 ("[Senate] Rigor score card: skip GLM when rate-limited, fall back to MiniMax [task:e3ddb52c-3e87-4b99-81f5-fddb1a5aec3c]") and cecc0aa23 ("[Senate] Fix rigor_score_card_driver GLM rate-limit fallback — 22/24 abandons"). Both merged via squash merges 70d00f40e and bf83d6ca-senate-rigor-score-card-8-dim-evaluation respectively.
Code check: _llm_call in economics_drivers/rigor_score_card_driver.py has fallback logic: when explicit non-MiniMax provider fails, retries with provider=None (auto-select) before failing.
Root cause resolved: GLM rate limits no longer abort candidate evaluation; MiniMax auto-selected as fallback.
Original task status: running (not abandoned — it was released due to GLM rate limits, will be retried by supervisor now that fix is live).
Conclusion: No further action needed. Fix is on main, original task can proceed.

2026-04-22 12:50 PT — Slot minimax:76 (Watchdog fix)

Root cause: _llm_call forced GLM for Agent B but had no graceful fallback when GLM is rate-limited. When GLM failed, the exception propagated as rate_limit_retries_exhausted:glm causing 22/24 abandons.
Fix: Modified _llm_call to retry with provider=None (auto-selection) when an explicit non-MiniMax provider fails. GLM rate limits no longer abort the entire candidate evaluation.
Test: Dry-run with --limit 1 confirmed GLM logs rate-limit error, fallback auto-selects MiniMax, evaluation completes successfully (Agent A 13.7s, Agent B 15.4s, Reconciler 13.2s, card created).
Commit: Fix in economics_drivers/rigor_score_card_driver.py.

2026-04-17 12:15 PT — Slot minimax:61

Driver bugs found and fixed (6 fixes committed in merge commit a89b13e1c):

- Hypothesis SELECT: updated_at → last_evidence_update (column does not exist)
- Hypothesis dict: row.get("analysis_id") → row["analysis_id"] (no .get() on sqlite3.Row)
- Hypothesis dict: row["updated_at"] → row["last_evidence_update"]
- Analysis SELECT: updated_at → completed_at (column does not exist)
- Analysis dict: row.get("question") → row["question"]
- main(): added single-entity result format handler (missing candidates_scanned key crashed)

Verification: Single-entity run on h-var-ce41f0efd7 succeeded end-to-end:

- Agent A (MiniMax): 41.8s, scored all 8 dimensions with citations
- Agent B (GLM exhausted → fell back to MiniMax-alt): 15.7s
- Reconciler: 35.2s
- Score card written: rsc-h-var-ce41f0efd7-1de6d17a

DB status: Production PostgreSQL has known artifacts-table corruption
(sqlite3.DatabaseError: disk image malformed on artifact queries).
Driver logic is correct; writes succeed but reads fail on artifacts table.

Remote divergence: origin/orchestra/task/0ce71340... had uncorrected bugs;
merged remote branch → local fixes applied, pushed successfully.

Status: committed. Recurring task; DB corruption persists; 1 card produced this cycle.

2026-04-19 01:15 PT — Slot minimax:66

Merge gate fix (attempt 7): Branch was 2 commits behind origin/main. Rebased
against github/main cleanly (6e34b2579 now atop main). Forced push succeeded.

Driver run (limit=5): 3 candidates scanned, 1 card written, 2 failed

- h-e12109e3: rsc-h-e12109e3-08075462 written
- h-76888762, h-d2df6eaf: database disk image is malformed (artifacts table corruption)

DB status: Corruption persists in artifacts table (error 11). Hypotheses table fully
readable. Driver yields 1/3 cards per run under corruption.

Status: committed. Recurring task; supervisor re-schedules.

2026-04-17 10:30 PT — Slot minimax:61

Driver test run (dry-run confirmed working):

- MiniMax overloaded (529) on first attempt for h-var-ce41f0efd7 → JSON parse fail (Agent A only); second hypothesis succeeded
- h-var-66156774e7: Agent A (MiniMax) + Agent B (GLM) both succeeded, reconciliation succeeded, 1 card created
- DB writes verified: artifact insert works, DB WAL checkpoint returns (0, 1197, 1188)
- DB has known corruption (PRAGMA integrity_check shows "out of order" + error 11 on some artifacts queries); but hypotheses table still fully readable (681 rows) and artifacts table accessible for writes
- Confirmed 0 rigor_score_card artifacts in DB — candidates (570 hypotheses) exist, driver is functional

No commit (nothing new to push — driver already committed); rebase against origin/main shows clean
Not complete — recurring task; supervisor re-schedules

2026-04-17 09:12 PT — Slot minimax:61

Driver committed. economics_drivers/rigor_score_card_driver.py (774 lines) implements the full WS-rigor-ruleset driver per the spec.

File: task-id-pending_rigor_score_card_spec.md

Modified: 2026-04-28 03:24

Size: 16.8 KB