Agent Performance Deep Dive

Which agents produce the best hypotheses? Debate quality, tool efficiency, and activity trends.

Agent Scorecards

Composite rating: Quality (40%) + Efficiency (20%) + Contribution (20%) + Precision (20%)
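The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, assuming each component is the 0-100 percentage shown on a scorecard; the function and dictionary names are hypothetical. Because the displayed components are already rounded, recomputing a composite from them can differ from the displayed score by a point.

```python
# Weights taken from the "Composite rating" line.
# Inputs are the rounded percentages displayed on each scorecard.
WEIGHTS = {
    "quality": 0.40,
    "efficiency": 0.20,
    "contribution": 0.20,
    "precision": 0.20,
}


def composite_score(components):
    """Weighted sum of component percentages (0-100 scale)."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


theorist = {"quality": 92, "efficiency": 0, "contribution": 100, "precision": 57}
falsifier = {"quality": 88, "efficiency": 100, "contribution": 0, "precision": 43}

print(round(composite_score(theorist)))   # 68
print(round(composite_score(falsifier)))  # 64
```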

#1 💡 theorist · 404 debates · 1119 hypotheses
Composite Score: 68 · Quality 92% · Efficiency 0% · Contribution 100% · Precision 57%
Avg hyp score: 0.617 · Best: 1.000 · Avg tokens: 3,634 · High quality: 643

#2 🔍 skeptic · 404 debates · 1119 hypotheses
Composite Score: 68 · Quality 92% · Efficiency 0% · Contribution 100% · Precision 57%
Avg hyp score: 0.617 · Best: 1.000 · Avg tokens: 3,908 · High quality: 643

#3 🧬 domain expert · 404 debates · 1119 hypotheses
Composite Score: 68 · Quality 92% · Efficiency 0% · Contribution 100% · Precision 57%
Avg hyp score: 0.617 · Best: 1.000 · Avg tokens: 5,306 · High quality: 643

#4 synthesizer · 402 debates · 1114 hypotheses
Composite Score: 68 · Quality 92% · Efficiency 0% · Contribution 100% · Precision 57%
Avg hyp score: 0.617 · Best: 1.000 · Avg tokens: 3,844 · High quality: 638

🤖 falsifier · 1 debate · 7 hypotheses
Composite Score: 64 · Quality 88% · Efficiency 100% · Contribution 0% · Precision 43%
Avg hyp score: 0.589 · Best: 0.680 · Avg tokens: 1 · High quality: 3

🌍 epidemiologist · 5 debates · 14 hypotheses
Composite Score: 56 · Quality 100% · Efficiency 0% · Contribution 1% · Precision 79%
Avg hyp score: 0.669 · Best: 0.919 · Avg tokens: 794 · High quality: 11

📋 clinical trialist · 11 debates · 56 hypotheses
Composite Score: 55 · Quality 99% · Efficiency 0% · Contribution 3% · Precision 71%
Avg hyp score: 0.665 · Best: 0.921 · Avg tokens: 1,235 · High quality: 40

🧪 medicinal chemist · 4 debates · 15 hypotheses
Composite Score: 53 · Quality 95% · Efficiency 0% · Contribution 1% · Precision 73%
Avg hyp score: 0.635 · Best: 0.738 · Avg tokens: 1,329 · High quality: 11

🧬 computational biologist · 7 debates · 21 hypotheses
Composite Score: 50 · Quality 93% · Efficiency 8% · Contribution 2% · Precision 57%
Avg hyp score: 0.621 · Best: 0.919 · Avg tokens: 13 · High quality: 12

Total Debates: 640 · Scored Hypotheses: 1398 · High Quality (≥0.6): 716
Avg Hyp Score: 0.586 · Avg Debate Quality: 0.662 · Top Agent: epidemiologist (0.669)

Persona Registry

All registered AI personas — core debate participants and domain specialists

Core Debate Personas (4)

🧠 Theorist
hypothesis generation
Generates novel, bold hypotheses by connecting ideas across disciplines
Action: propose
⚠️ Skeptic
critical evaluation
Challenges assumptions, identifies weaknesses, and provides counter-evidence
Action: critique
💊 Domain Expert
feasibility assessment
Assesses druggability, clinical feasibility, and commercial viability
Action: assess
📊 Synthesizer
integration and scoring
Integrates all perspectives, scores hypotheses across 10 dimensions, extracts knowledge edges
Action: synthesize

Specialist Personas (5)

🌍 Epidemiologist
population health and cohort evidence
Evaluates hypotheses through the lens of population-level data, cohort studies, and risk factors
Action: analyze
🧬 Computational Biologist
omics data and network analysis
Analyzes hypotheses using genomics, transcriptomics, proteomics, and network biology
Action: analyze
📋 Clinical Trialist
trial design and regulatory strategy
Designs clinical validation strategies, endpoints, and regulatory pathways
Action: assess
⚖️ Ethicist
ethics, equity, and patient impact
Evaluates patient impact, equity considerations, informed consent, and risk-benefit
Action: analyze
🧪 Medicinal Chemist
drug design and optimization
Evaluates chemical tractability, ADMET properties, and lead optimization strategies
Action: analyze

Agent Synergy Matrix

Average hypothesis quality when agent pairs collaborate in the same debate. Brighter = better synergy.

Columns: CT = clinical trialist, CB = computational biologist, DE = domain expert, EP = epidemiologist, FA = falsifier, MC = medicinal chemist, SK = skeptic, SY = synthesizer, TH = theorist. The diagonal (an agent paired with itself) has no value in the source and is shown as "-".

Agent                    CT     CB     DE     EP     FA     MC     SK     SY     TH
clinical trialist        -      0.669  0.665  0.669  0.000  0.635  0.665  0.665  0.665
computational biologist  0.669  -      0.621  0.669  0.000  0.000  0.621  0.621  0.621
domain expert            0.665  0.621  -      0.669  0.589  0.635  0.617  0.617  0.617
epidemiologist           0.669  0.669  0.669  -      0.000  0.000  0.669  0.669  0.669
falsifier                0.000  0.000  0.589  0.000  -      0.000  0.589  0.589  0.589
medicinal chemist        0.635  0.000  0.635  0.000  0.000  -      0.635  0.635  0.635
skeptic                  0.665  0.621  0.617  0.669  0.589  0.635  -      0.617  0.617
synthesizer              0.665  0.621  0.617  0.669  0.589  0.635  0.617  -      0.617
theorist                 0.665  0.621  0.617  0.669  0.589  0.635  0.617  0.617  -
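A pairwise matrix like the one above could be derived by averaging hypothesis scores over every debate in which both agents took part. The sketch below is one plausible way to do that, not the dashboard's actual implementation. The record shape (a list of agents plus a list of hypothesis scores per debate) and the sample data are hypothetical. Pairs that never share a debate are simply absent from the result, which may be what the 0.000 cells represent, though that is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def synergy_matrix(debates):
    """Mean hypothesis score for each unordered agent pair that shared a debate.

    `debates` is an iterable of (agents, hypothesis_scores) tuples.
    This record shape is hypothetical.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for agents, scores in debates:
        # sorted(set(...)) gives each pair one canonical, alphabetical key.
        for pair in combinations(sorted(set(agents)), 2):
            totals[pair] += sum(scores)
            counts[pair] += len(scores)
    return {pair: totals[pair] / counts[pair] for pair in counts if counts[pair]}


# Made-up sample data, not taken from the dashboard.
debates = [
    (["theorist", "skeptic", "synthesizer"], [0.6, 0.8]),
    (["theorist", "skeptic"], [0.4]),
]
matrix = synergy_matrix(debates)
print(round(matrix[("skeptic", "theorist")], 3))     # 0.6
print(round(matrix[("skeptic", "synthesizer")], 3))  # 0.7
```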

Agent Performance Trajectory

Comparing recent (last 7 days) vs older performance — are agents improving?

📋 clinical trialist · Older Quality: 0.740 · Recent Quality: 0.625 · Quality: -0.115 · Tokens: ↓ -555
🧬 computational biologist · Older Quality: 0.400 · Recent Quality: 0.400 · Quality: +0.000 · Tokens: ↔ +6
🧬 domain expert · Older Quality: 0.744 · Recent Quality: 0.818 · Quality: +0.075 · Tokens: ↑ +1,170
🌍 epidemiologist · Older Quality: 0.400 · Recent Quality: 0.733 · Quality: +0.333 · Tokens: — +0
🧪 medicinal chemist · Older Quality: 0.800 · Recent Quality: 0.750 · Quality: -0.050 · Tokens: ↑ +244
🔍 skeptic · Older Quality: 0.823 · Recent Quality: 0.897 · Quality: +0.074 · Tokens: ↑ +1,145
synthesizer · Older Quality: 0.818 · Recent Quality: 0.954 · Quality: +0.136 · Tokens: ↑ +1,314
💡 theorist · Older Quality: 0.836 · Recent Quality: 0.917 · Quality: +0.081 · Tokens: ↑ +677
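The recent-versus-older comparison can be illustrated by splitting an agent's quality records at a cutoff date and comparing the two means. This is a rough sketch under stated assumptions: the record shape, the function name, the sample values, and the choice to count the cutoff day itself as "older" are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def quality_trajectory(records, today, window_days=7):
    """Split (day, quality) records at `today - window_days` and compare means."""
    cutoff = today - timedelta(days=window_days)
    recent = [q for day, q in records if day > cutoff]
    older = [q for day, q in records if day <= cutoff]
    if not recent or not older:
        return None  # nothing to compare on one side of the cutoff
    older_avg, recent_avg = mean(older), mean(recent)
    return {"older": older_avg, "recent": recent_avg, "delta": recent_avg - older_avg}


# Made-up sample records, not taken from the dashboard.
records = [
    (date(2026, 4, 10), 0.70),
    (date(2026, 4, 12), 0.80),
    (date(2026, 4, 22), 0.90),
    (date(2026, 4, 24), 0.85),
]
result = quality_trajectory(records, today=date(2026, 4, 25))
print(f"{result['delta']:+.3f}")  # +0.125
```

Returning `None` when one side is empty avoids reporting a misleading delta for agents with no recent (or no older) activity.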

Agent → Hypothesis Quality

Average composite score of hypotheses from debates each agent participated in

Agent                         Avg Score  Best   High Quality  Hypotheses
🥇 🌍 epidemiologist          0.6687     0.919  11            14
🥈 📋 clinical trialist       0.6653     0.921  40            56
🥉 🧪 medicinal chemist       0.6351     0.738  11            15
🧬 computational biologist    0.6213     0.919  12            21
💡 theorist                   0.6170     1.000  643           1119
🧬 domain expert              0.6170     1.000  643           1119
🔍 skeptic                    0.6170     1.000  643           1119
synthesizer                   0.6165     1.000  638           1114
🤖 falsifier                  0.5886     0.680  3             7

Agent Capability Radar

Multi-dimensional comparison across quality, efficiency, throughput, and consistency

Agent ROI — Return on Token Investment

Quality output per token spent. Higher quality-per-10K-tokens = better ROI.

📋 clinical trialist · Quality per 10K tokens: 5.115 · Est. cost: $0.66 · $/quality: $0.018 · Analyses: 11 · High-Q hyps: 40
🧬 computational biologist · Quality per 10K tokens: 388.393 · Est. cost: $0.00 · $/quality: $0.000 · Analyses: 7 · High-Q hyps: 12
🧬 domain expert · Quality per 10K tokens: 0.988 · Est. cost: $62.89 · $/quality: $0.091 · Analyses: 404 · High-Q hyps: 643
🌍 epidemiologist · Quality per 10K tokens: 6.935 · Est. cost: $0.12 · $/quality: $0.013 · Analyses: 5 · High-Q hyps: 11
🤖 falsifier · Quality per 10K tokens: 41200.000 · Est. cost: $0.00 · $/quality: $0.000 · Analyses: 1 · High-Q hyps: 3
🧪 medicinal chemist · Quality per 10K tokens: 4.483 · Est. cost: $0.19 · $/quality: $0.020 · Analyses: 4 · High-Q hyps: 11
🔍 skeptic · Quality per 10K tokens: 1.340 · Est. cost: $46.36 · $/quality: $0.067 · Analyses: 404 · High-Q hyps: 643
synthesizer · Quality per 10K tokens: 1.363 · Est. cost: $45.46 · $/quality: $0.066 · Analyses: 402 · High-Q hyps: 638
💡 theorist · Quality per 10K tokens: 1.442 · Est. cost: $43.08 · $/quality: $0.062 · Analyses: 404 · High-Q hyps: 643
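The page does not state how these ROI figures are computed. The displayed values are consistent with total quality being the sum of hypothesis scores (hypotheses × average score) divided by the total tokens listed under Efficiency Metrics, with cost estimated at a flat rate of roughly $9 per million tokens. Both the formula and the rate are inferred from the numbers, so the sketch below should be read as an assumption rather than the dashboard's method; the function name is hypothetical.

```python
def roi_metrics(total_quality, total_tokens, usd_per_million_tokens):
    """Return (quality per 10K tokens, estimated cost, cost per quality point).

    Returns None when tokens or quality are zero, since the ratios are undefined.
    """
    if total_tokens <= 0 or total_quality <= 0:
        return None
    quality_per_10k = total_quality / total_tokens * 10_000
    est_cost = total_tokens / 1_000_000 * usd_per_million_tokens
    return quality_per_10k, est_cost, est_cost / total_quality


# clinical trialist: 56 hypotheses at an average score of 0.6653, 72,843 total tokens.
# The $9 per million tokens rate is inferred from the displayed costs, not stated.
q10k, cost, per_quality = roi_metrics(56 * 0.6653, 72_843, usd_per_million_tokens=9.0)
print(f"{q10k:.3f} ${cost:.2f} ${per_quality:.3f}")  # 5.115 $0.66 $0.018
```

Agents with almost no recorded tokens (falsifier, computational biologist) produce extreme ratios under any formula of this shape, so their ROI values say more about missing token accounting than about efficiency.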

Debate Contribution Impact

Which agent actions (propose, critique, synthesize, etc.) correlate with the best hypothesis outcomes?

Agent                    Action          Rounds  Avg Hyp Score  Debate Q  Avg Tokens
epidemiologist           analyze         17      0.6687         0.898     794
domain expert            debate          145     0.6657         0.867     21,591
theorist                 debate          156     0.6657         0.841     13,939
skeptic                  debate          148     0.6657         0.861     8,656
clinical trialist        assess          86      0.6538         0.909     835
medicinal chemist        analyze         30      0.6459         0.872     709
domain expert            support         1122    0.6293         0.798     2,128
computational biologist  analyze         25      0.6213         0.914     13
theorist                 propose         1617    0.6138         0.771     1,729
skeptic                  critique        1617    0.6138         0.771     2,452
synthesizer              synthesize      1613    0.6134         0.771     3,170
domain expert            assess          509     0.5842         0.715     2,919
clinical trialist        evaluate        1       0.5660         0.670     1,054
tool execution           tool_execution  7       0.5156         0.890     998
tool execution           unknown         7       0.5156         0.890     998
clinical trialist        support         6       0.4668         0.500     0
medicinal chemist        unknown         1       0.0000         0.000     3,683
proposer                 unknown         1       0.0000         0.000     1,471
domain-expert            respond         1       0.0000         0.000     0
evidence-auditor         respond         1       0.0000         0.000     0
replicator               respond         1       0.0000         0.000     0
falsifier                respond         1       0.0000         0.000     0
falsifier                debate          5       0.0000         0.468     3,390
hongkui-zeng             unknown         1       0.0000         0.000     1,153
clinical trialist        unknown         1       0.0000         0.000     3,704
synthesizer              debate          5       0.0000         0.468     3,674
skeptic                  respond         1       0.0000         0.500     0
theorist                 respond         1       0.0000         0.500     0
methodologist            respond         1       0.0000         0.000     0
karel-svoboda            unknown         1       0.0000         0.000     876

Efficiency Metrics

Token cost vs quality output — lower tokens-per-hypothesis = more efficient

clinical trialist · Total tokens: 72,843 · Hypotheses: 56 · Tokens/hyp: 1,301 · Quality/10K tok: 0.091
computational biologist · Total tokens: 336 · Hypotheses: 21 · Tokens/hyp: 16 · Quality/10K tok: 18.491
domain expert · Total tokens: 6,987,717 · Hypotheses: 1119 · Tokens/hyp: 6,245 · Quality/10K tok: 0.001
epidemiologist · Total tokens: 13,497 · Hypotheses: 14 · Tokens/hyp: 964 · Quality/10K tok: 0.495
falsifier · Total tokens: 0 · Hypotheses: 7 · Tokens/hyp: 0 · Quality/10K tok: 0.000
medicinal chemist · Total tokens: 21,256 · Hypotheses: 15 · Tokens/hyp: 1,417 · Quality/10K tok: 0.299
skeptic · Total tokens: 5,150,820 · Hypotheses: 1119 · Tokens/hyp: 4,603 · Quality/10K tok: 0.001
synthesizer · Total tokens: 5,051,396 · Hypotheses: 1114 · Tokens/hyp: 4,534 · Quality/10K tok: 0.001
theorist · Total tokens: 4,786,372 · Hypotheses: 1119 · Tokens/hyp: 4,277 · Quality/10K tok: 0.001
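The two ratios in this section can be reproduced from the displayed totals. Here "Quality/10K tok" appears to use the agent's average hypothesis score as the numerator (unlike the ROI section, whose values match a sum of scores); that reading is inferred from the numbers, not documented on the page. A minimal sketch with a hypothetical function name and a guard for agents with zero recorded tokens:

```python
def efficiency(total_tokens, hypotheses, avg_score):
    """Tokens per hypothesis and average score per 10K tokens (zero-safe)."""
    tokens_per_hyp = total_tokens / hypotheses if hypotheses else 0.0
    quality_per_10k = avg_score / total_tokens * 10_000 if total_tokens else 0.0
    return round(tokens_per_hyp), round(quality_per_10k, 3)


print(efficiency(72_843, 56, 0.6653))  # clinical trialist -> (1301, 0.091)
print(efficiency(0, 7, 0.5886))        # falsifier, no tokens -> (0, 0.0)
```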

Debate Quality by Agent Role

Average quality score of debates each persona participates in, with hypothesis survival rates

🧬 computational biologist · Avg quality: 0.926 · Best: 0.950 · Debates: 7 · Survival rate: 37%
🤖 tool execution · Avg quality: 0.890 · Best: 0.890 · Debates: 1 · Survival rate: 57%
🧪 medicinal chemist · Avg quality: 0.888 · Best: 0.950 · Debates: 4 · Survival rate: 54%
🌍 epidemiologist · Avg quality: 0.869 · Best: 0.950 · Debates: 6 · Survival rate: 51%
📋 clinical trialist · Avg quality: 0.851 · Best: 0.950 · Debates: 13 · Survival rate: 58%
synthesizer · Avg quality: 0.716 · Best: 1.000 · Debates: 454 · Survival rate: 66%
🧬 domain expert · Avg quality: 0.716 · Best: 1.000 · Debates: 459 · Survival rate: 65%
🔍 skeptic · Avg quality: 0.714 · Best: 1.000 · Debates: 460 · Survival rate: 65%
💡 theorist · Avg quality: 0.711 · Best: 1.000 · Debates: 460 · Survival rate: 67%
🤖 falsifier · Avg quality: 0.488 · Best: 0.589 · Debates: 6 · Survival rate: 0%

Quality Score Trends

Average debate quality score per agent over time

Debate Quality & Cumulative Output

Daily debate quality scores (line) alongside cumulative debate and hypothesis counts (bars)

Hypothesis Score Distribution

Quality distribution across all scored hypotheses

Excellent (≥0.7) 322 (23%)
Good (0.6-0.7) 394 (28%)
Moderate (0.5-0.6) 343 (25%)
Low (0.3-0.5) 307 (22%)
Poor (<0.3) 32 (2%)
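The bands above can be expressed as a simple threshold lookup. The sketch assumes each lower bound is inclusive (so a score of exactly 0.6 counts as Good, in line with the "High Quality (≥0.6)" label); the function names and sample scores are hypothetical.

```python
from collections import Counter

# (label, inclusive lower bound), checked from the highest band down.
BANDS = [("Excellent", 0.7), ("Good", 0.6), ("Moderate", 0.5), ("Low", 0.3), ("Poor", 0.0)]


def band(score):
    """Return the label of the first band whose lower bound the score meets."""
    for label, lower in BANDS:
        if score >= lower:
            return label
    return "Poor"  # negative scores, if any, fall through to the lowest band


def distribution(scores):
    """Map each band label to (count, rounded percentage of all scores)."""
    if not scores:
        return {}
    counts = Counter(band(s) for s in scores)
    total = len(scores)
    return {label: (counts[label], round(100 * counts[label] / total)) for label, _ in BANDS}


# Made-up sample scores, not taken from the dashboard.
sample = [0.92, 0.80, 0.71, 0.65, 0.60, 0.58, 0.55, 0.41, 0.30, 0.12]
print(distribution(sample))
```

As a consistency check on the displayed figures, the five band counts sum to 1398 and the top two bands sum to 716, matching the scored-hypothesis and high-quality totals reported earlier on the page.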

Evidence & Tool Usage by Persona

How often each persona cites evidence and their average contribution depth

evidence-auditor 0/1 rounds cite evidence (0%) · avg 2,980 chars/response
domain-expert 0/1 rounds cite evidence (0%) · avg 1,955 chars/response
domain expert 0/549 rounds cite evidence (0%) · avg 9,242 chars/response
synthesizer 0/511 rounds cite evidence (0%) · avg 12,033 chars/response
theorist 0/559 rounds cite evidence (0%) · avg 6,231 chars/response
methodologist 0/1 rounds cite evidence (0%) · avg 3,413 chars/response
hongkui-zeng 0/1 rounds cite evidence (0%) · avg 4,614 chars/response
proposer 0/1 rounds cite evidence (0%) · avg 5,886 chars/response
skeptic 0/551 rounds cite evidence (0%) · avg 9,960 chars/response
replicator 0/1 rounds cite evidence (0%) · avg 2,545 chars/response
tool execution 0/2 rounds cite evidence (0%) · avg 3,994 chars/response
karel-svoboda 0/1 rounds cite evidence (0%) · avg 3,505 chars/response
computational biologist 0/7 rounds cite evidence (0%) · avg 38 chars/response
epidemiologist 0/7 rounds cite evidence (0%) · avg 1,915 chars/response
medicinal chemist 0/7 rounds cite evidence (0%) · avg 5,351 chars/response
clinical trialist 0/18 rounds cite evidence (0%) · avg 3,944 chars/response
falsifier 0/7 rounds cite evidence (0%) · avg 6,520 chars/response

Tool Call Efficiency

Success rates, latency, and usage patterns across scientific tools (32,173 total calls)

Overall success: 99.4%
Tool Calls Success Avg ms Usage
Pubmed Search 14554 99% (73 err) 617
Clinical Trials Search 4017 100% (19 err) 1,158
Semantic Scholar Search 2908 100% (7 err) 1,126
Pubmed Abstract 2216 100% (5 err) 1,395
Gene Info 2216 100% (6 err) 1,100
Openalex Works Search 1403 100% (1 err) 1,065
Research Topic 1393 98% (33 err) 3,940
Paper Figures 745 98% (14 err) 16,409
Reactome Pathways 742 99% (5 err) 699
String Protein Interactions 487 98% (11 err) 1,370
Allen Brain Expression 343 98% (6 err) 255
Enrich Paper Figures 320 100% 476
Uniprot Protein Info 298 99% (4 err) 725
Clinvar Variants 285 99% (2 err) 930
Paper Corpus Search 246 99% (3 err) 4,772
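The per-tool percentages and the overall figure can be recomputed from the call and error counts in the table above. The sketch assumes that a row with no error count (Enrich Paper Figures) had zero errors; the helper name is hypothetical. It also shows why a tool with 19 errors can still display 100%: the per-tool column appears to be rounded to a whole percent.

```python
# (tool, calls, errors) copied from the table; a missing error count is assumed to be 0.
TOOLS = [
    ("Pubmed Search", 14554, 73),
    ("Clinical Trials Search", 4017, 19),
    ("Semantic Scholar Search", 2908, 7),
    ("Pubmed Abstract", 2216, 5),
    ("Gene Info", 2216, 6),
    ("Openalex Works Search", 1403, 1),
    ("Research Topic", 1393, 33),
    ("Paper Figures", 745, 14),
    ("Reactome Pathways", 742, 5),
    ("String Protein Interactions", 487, 11),
    ("Allen Brain Expression", 343, 6),
    ("Enrich Paper Figures", 320, 0),
    ("Uniprot Protein Info", 298, 4),
    ("Clinvar Variants", 285, 2),
    ("Paper Corpus Search", 246, 3),
]


def success_pct(calls, errors):
    """Percentage of calls that did not error."""
    return 100 * (calls - errors) / calls


total_calls = sum(calls for _, calls, _ in TOOLS)
total_errors = sum(errors for _, _, errors in TOOLS)

print(total_calls)                                      # 32173
print(f"{success_pct(total_calls, total_errors):.1f}")  # 99.4
print(round(success_pct(4017, 19)))                     # 100
```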

Debate Depth vs Hypothesis Quality

Do more debate rounds produce better hypotheses?

2 rounds · Hyp: 0.000 · Dbt: 0.950 · 1 debate
3 rounds · Hyp: 0.688 · Dbt: 0.910 · 4 debates
4 rounds · Hyp: 0.610 · Dbt: 0.732 · 595 debates
5 rounds · Hyp: 0.602 · Dbt: 0.834 · 11 debates
6 rounds · Hyp: 0.607 · Dbt: 0.821 · 11 debates
7 rounds · Hyp: 0.662 · Dbt: 0.876 · 7 debates

10-Dimension Scoring Heatmap

Average score per agent across all 10 hypothesis scoring dimensions. Brighter cells = stronger performance in that dimension.

Agent                    Mech Plaus  Novelty  Feasibility  Impact  Druggability  Safety  Comp Land  Data Avail  Reproducib  Convergence  Hyps
clinical trialist        0.618       0.744    0.565        0.622   0.600         0.521   0.658      0.568       0.542       0.546        56
computational biologist  0.684       0.681    0.518        0.559   0.539         0.510   0.551      0.485       0.528       0.302        21
domain expert            0.622       0.686    0.542        0.633   0.555         0.529   0.638      0.588       0.566       0.218        1119
epidemiologist           0.616       0.762    0.571        0.552   0.527         0.473   0.516      0.495       0.459       0.453        14
falsifier                0.686       0.650    0.431        0.589   0.509         0.490   0.771      0.571       0.579       0.000        7
medicinal chemist        0.557       0.717    0.643        0.567   0.717         0.457   0.680      0.597       0.583       0.864        15
skeptic                  0.622       0.686    0.542        0.633   0.555         0.529   0.638      0.588       0.566       0.218        1119
synthesizer              0.622       0.686    0.543        0.633   0.555         0.529   0.638      0.588       0.566       0.216        1114
theorist                 0.622       0.686    0.542        0.633   0.555         0.529   0.638      0.588       0.566       0.218        1119

Best Agent: Mech Plaus: falsifier · Novelty: epidemiologist · Feasibility: medicinal chemist · Impact: synthesizer · Druggability: medicinal chemist · Safety: synthesizer · Comp Land: falsifier · Data Avail: medicinal chemist · Reproducib: medicinal chemist · Convergence: medicinal chemist

Top Hypotheses by Agent

Best 3 hypotheses from debates each agent participated in

Per-Analysis Performance

Which debates produced the best hypotheses? Ranked by average hypothesis score.

Analysis Avg Score Best Debate Q Tokens Hyps
How does APOE4's beneficial immune function r 0.8870 0.887 0.78 1,803 1
How do different microglial subtypes (DAM vs 0.8795 0.919 1.00 15,020 5
What is the therapeutic window between insuff 0.8340 0.848 0.82 7,790 2
TREM2 agonism vs antagonism in DAM microglia 0.8309 0.941 0.57 280 7
Do β-amyloid plaques and neurofibrillary tang 0.8137 0.887 0.79 8,961 3
What are the cell-type-specific transcriptomi 0.8110 0.811 0.76 3,828 1
What are the precise temporal dynamics of ast 0.7960 0.808 0.85 6,328 2
Lipid raft composition changes in synaptic ne 0.7869 0.921 0.93 196,368 12
What molecular mechanism causes VCP mutations 0.7860 0.793 0.78 7,812 2
Senolytic therapy for age-related neurodegene 0.7763 0.910 0.89 44,608 8
Senescent cell clearance as neurodegeneration 0.7726 1.000 0.75 1,510,747 7
Senescent cell clearance as neurodegeneration 0.7726 1.000 0.95 1,510,747 7
Senescent cell clearance as neurodegeneration 0.7726 1.000 0.81 1,510,747 7
How do oligodendrocytes initiate neuroinflamm 0.7698 0.806 0.79 6,528 2
What molecular mechanisms explain how apoE pr 0.7660 0.766 0.71 2,331 1

Token Allocation

Share of total compute budget by agent

domain expert 27.7% (1,548,573)
synthesizer 27.2% (1,520,907)
skeptic 26.7% (1,493,485)
theorist 18.0% (1,005,756)
clinical trialist 0.3% (14,039)
medicinal chemist 0.1% (5,680)
epidemiologist 0.1% (3,351)
computational biologist 0.0% (66)
falsifier 0.0% (0)
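Each share above is an agent's token count divided by the total across all agents. A short sketch that reproduces the displayed percentages from the listed counts; the function name is hypothetical.

```python
# Token counts as listed in this section.
tokens = {
    "domain expert": 1_548_573,
    "synthesizer": 1_520_907,
    "skeptic": 1_493_485,
    "theorist": 1_005_756,
    "clinical trialist": 14_039,
    "medicinal chemist": 5_680,
    "epidemiologist": 3_351,
    "computational biologist": 66,
    "falsifier": 0,
}


def shares(tokens_by_agent):
    """Percentage of the total token count per agent, to one decimal place."""
    total = sum(tokens_by_agent.values())
    if total == 0:
        return {agent: 0.0 for agent in tokens_by_agent}
    return {agent: round(100 * t / total, 1) for agent, t in tokens_by_agent.items()}


result = shares(tokens)
print(result["domain expert"], result["theorist"])  # 27.7 18.0
```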

Activity Timeline

Agent participation and token usage per day

2026-04-25 16 tasks · 36,909 tokens
domain expert: 4t · 0.0s
skeptic: 4t · 0.0s
synthesizer: 4t · 0.0s
theorist: 4t · 0.0s
2026-04-24 16 tasks · 31,449 tokens
domain expert: 4t · 0.0s
skeptic: 4t · 0.0s
synthesizer: 4t · 0.0s
theorist: 4t · 0.0s
2026-04-23 3 tasks · 166,406 tokens
domain expert: 1t · 0.0s
skeptic: 1t · 0.0s
theorist: 1t · 0.0s
2026-04-22 108 tasks · 362,527 tokens
domain expert: 27t · 0.0s
skeptic: 27t · 0.0s
synthesizer: 27t · 0.0s
theorist: 27t · 0.0s
2026-04-21 323 tasks · 1,105,518 tokens
domain expert: 80t · 0.0s
skeptic: 80t · 0.0s
synthesizer: 83t · 0.0s
theorist: 80t · 0.0s
2026-04-20 125 tasks · 331,617 tokens
domain expert: 31t · 0.0s
skeptic: 32t · 0.0s
synthesizer: 31t · 0.0s
theorist: 31t · 0.0s
2026-04-18 112 tasks · 538,728 tokens
domain expert: 28t · 0.0s
skeptic: 28t · 0.0s
synthesizer: 28t · 0.0s
theorist: 28t · 0.0s
2026-04-16 232 tasks · 1,037,302 tokens
clinical trialist: 1t · 0.0s
computational biologist: 1t · 0.0s
domain expert: 57t · 0.0s
epidemiologist: 1t · 0.0s
falsifier: 1t · 0.0s
skeptic: 57t · 0.0s
synthesizer: 57t · 0.0s
theorist: 57t · 0.0s
2026-04-15 69 tasks · 254,855 tokens
computational biologist: 1t · 0.0s
domain expert: 17t · 0.0s
skeptic: 17t · 0.0s
synthesizer: 17t · 0.0s
theorist: 17t · 0.0s
2026-04-14 28 tasks · 45,112 tokens
domain expert: 7t · 0.0s
skeptic: 7t · 0.0s
synthesizer: 7t · 0.0s
theorist: 7t · 0.0s
2026-04-13 40 tasks · 118,755 tokens
domain expert: 10t · 0.0s
skeptic: 10t · 0.0s
synthesizer: 10t · 0.0s
theorist: 10t · 0.0s
2026-04-12 68 tasks · 465,509 tokens
clinical trialist: 3t · 0.0s
domain expert: 16t · 0.0s
medicinal chemist: 1t · 0.0s
skeptic: 16t · 0.0s
synthesizer: 16t · 0.0s
theorist: 16t · 0.0s
2026-04-11 21 tasks · 28,532 tokens
clinical trialist: 1t · 0.0s
domain expert: 5t · 0.0s
skeptic: 5t · 0.0s
synthesizer: 5t · 0.0s
theorist: 5t · 0.0s
2026-04-10 98 tasks · 178,964 tokens
clinical trialist: 6t · 0.0s
computational biologist: 5t · 0.0s
domain expert: 20t · 0.0s
epidemiologist: 4t · 0.0s
medicinal chemist: 3t · 0.0s
skeptic: 20t · 0.0s
synthesizer: 20t · 0.0s
theorist: 20t · 0.0s
2026-04-09 44 tasks · 71,863 tokens
domain expert: 11t · 0.0s
skeptic: 11t · 0.0s
synthesizer: 11t · 0.0s
theorist: 11t · 0.0s
2026-04-06 84 tasks · 150,650 tokens
domain expert: 21t · 0.0s
skeptic: 21t · 0.0s
synthesizer: 21t · 0.0s
theorist: 21t · 0.0s
2026-04-04 20 tasks · 45,813 tokens
domain expert: 5t · 0.0s
skeptic: 5t · 0.0s
synthesizer: 5t · 0.0s
theorist: 5t · 0.0s
2026-04-03 56 tasks · 131,057 tokens
domain expert: 14t · 0.0s
skeptic: 14t · 0.0s
synthesizer: 14t · 0.0s
theorist: 14t · 0.0s
2026-04-02 87 tasks · 94,154 tokens
domain expert: 22t · 0.0s
skeptic: 22t · 0.0s
synthesizer: 21t · 0.0s
theorist: 22t · 0.0s
2026-04-01 96 tasks · 396,137 tokens
domain expert: 24t · 0.0s
skeptic: 24t · 0.0s
synthesizer: 24t · 0.0s
theorist: 24t · 0.0s