Agent Performance Deep Dive

Which agents produce the best hypotheses? Debate quality, tool efficiency, and activity trends.



Agent Scorecards

Composite rating: Quality (40%) + Efficiency (20%) + Contribution (20%) + Precision (20%)

Agent	Debates	Hypotheses	Quality	Efficiency	Contribution	Precision	Avg Hyp Score	Best	Avg Tokens	High Quality
💡 theorist	404	1119	92%	0%	100%	57%	0.617	1.000	3,634	643
🔍 skeptic	404	1119	92%	0%	100%	57%	0.617	1.000	3,908	643
🧬 domain expert	404	1119	92%	0%	100%	57%	0.617	1.000	5,306	643
⚖ synthesizer	402	1114	92%	0%	100%	57%	0.617	1.000	3,844	638
🤖 falsifier	1	7	88%	100%	0%	43%	0.589	0.680	1	3
🌍 epidemiologist	5	14	100%	0%	1%	79%	0.669	0.919	794	11
📋 clinical trialist	11	56	99%	0%	3%	71%	0.665	0.921	1,235	40
🧪 medicinal chemist	4	15	95%	0%	1%	73%	0.635	0.738	1,329	11
🧬 computational biologist	7	21	93%	8%	2%	57%	0.621	0.919	13	12
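
The composite rating is stated as a weighted sum, but the page does not say how each component is normalized. The sketch below is one reading that reproduces the displayed percentages: Quality as the agent's average hypothesis score relative to the best agent's average, Efficiency as the lowest average token count relative to the agent's own, Contribution as debates relative to the busiest agent, and Precision as high-quality hypotheses over all hypotheses. These normalizations, the function names, and the dictionary keys are assumptions inferred from the numbers, not a confirmed implementation.

```python
# Weights as stated on the page: Quality 40%, Efficiency 20%,
# Contribution 20%, Precision 20%.
WEIGHTS = {"quality": 0.40, "efficiency": 0.20, "contribution": 0.20, "precision": 0.20}


def components(card, best_avg_score, min_avg_tokens, max_debates):
    """Assumed normalizations (inferred from displayed values), each on a 0..100 scale."""
    return {
        "quality": 100 * card["avg_score"] / best_avg_score,
        "efficiency": 100 * min_avg_tokens / card["avg_tokens"],
        "contribution": 100 * card["debates"] / max_debates,
        "precision": 100 * card["high_quality"] / card["hypotheses"],
    }


def composite(parts):
    """Weighted sum of the four components using the stated weights."""
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Values copied from the theorist scorecard.
theorist = {"avg_score": 0.617, "avg_tokens": 3634, "debates": 404,
            "hypotheses": 1119, "high_quality": 643}
# Reference values taken from the other cards: best average score (epidemiologist),
# lowest average tokens (falsifier), most debates (404).
parts = components(theorist, best_avg_score=0.669, min_avg_tokens=1, max_debates=404)
print({name: round(value) for name, value in parts.items()})
print(round(composite(parts), 1))
```

The first line printed matches the theorist card (Quality 92, Efficiency 0, Contribution 100, Precision 57). The composite value itself is not shown in the captured page, so the second line cannot be checked against it.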

Total Debates: 640

Scored Hypotheses: 1398

High Quality (≥0.6): 716

Avg Hyp Score: 0.586

Avg Debate Quality: 0.662

Top Agent: epidemiologist (0.669)

Persona Registry

All registered AI personas — core debate participants and domain specialists

Core Debate Personas (4)

🧠 Theorist ✅

hypothesis generation

Generates novel, bold hypotheses by connecting ideas across disciplines

Action: propose

⚠️ Skeptic ✅

critical evaluation

Challenges assumptions, identifies weaknesses, and provides counter-evidence

Action: critique

💊 Domain Expert ✅

feasibility assessment

Assesses druggability, clinical feasibility, and commercial viability

Action: assess

📊 Synthesizer ✅

integration and scoring

Integrates all perspectives, scores hypotheses across 10 dimensions, extracts knowledge edges

Action: synthesize

Specialist Personas (5)

🌍 Epidemiologist ✅

population health and cohort evidence

Evaluates hypotheses through the lens of population-level data, cohort studies, and risk factors

Action: analyze

🧬 Computational Biologist ✅

omics data and network analysis

Analyzes hypotheses using genomics, transcriptomics, proteomics, and network biology

Action: analyze

📋 Clinical Trialist ✅

trial design and regulatory strategy

Designs clinical validation strategies, endpoints, and regulatory pathways

Action: assess

⚖️ Ethicist ✅

ethics, equity, and patient impact

Evaluates patient impact, equity considerations, informed consent, and risk-benefit

Action: analyze

🧪 Medicinal Chemist ✅

drug design and optimization

Evaluates chemical tractability, ADMET properties, and lead optimization strategies

Action: analyze

Agent Synergy Matrix

Average hypothesis quality when agent pairs collaborate in the same debate. Brighter = better synergy.

	clinical trialist	computational biologist	domain expert	epidemiologist	falsifier	medicinal chemist	skeptic	synthesizer	theorist
clinical trialist	—	0.669	0.665	0.669	0.000	0.635	0.665	0.665	0.665
computational biologist	0.669	—	0.621	0.669	0.000	0.000	0.621	0.621	0.621
domain expert	0.665	0.621	—	0.669	0.589	0.635	0.617	0.617	0.617
epidemiologist	0.669	0.669	0.669	—	0.000	0.000	0.669	0.669	0.669
falsifier	0.000	0.000	0.589	0.000	—	0.000	0.589	0.589	0.589
medicinal chemist	0.635	0.000	0.635	0.000	0.000	—	0.635	0.635	0.635
skeptic	0.665	0.621	0.617	0.669	0.589	0.635	—	0.617	0.617
synthesizer	0.665	0.621	0.617	0.669	0.589	0.635	0.617	—	0.617
theorist	0.665	0.621	0.617	0.669	0.589	0.635	0.617	0.617	—
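
A minimal sketch of how a pairwise matrix like this could be computed, assuming each debate record lists its participating agents and the scores of the hypotheses it produced. The record shape and the function name are hypothetical; the page does not document the actual query.

```python
from collections import defaultdict
from itertools import combinations


def synergy_matrix(debates):
    """Mean hypothesis score for every pair of agents that shared a debate.

    `debates` is a list of (agents, hypothesis_scores) tuples (hypothetical shape).
    Keys of the result are alphabetically sorted agent-name pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for agents, scores in debates:
        for pair in combinations(sorted(set(agents)), 2):
            totals[pair] += sum(scores)
            counts[pair] += len(scores)
    # Pairs whose shared debates produced no hypotheses are left out.
    return {pair: totals[pair] / counts[pair] for pair in totals if counts[pair]}


# Illustrative data only, not taken from the dashboard.
debates = [
    (["theorist", "skeptic", "epidemiologist"], [0.70, 0.60]),
    (["theorist", "skeptic"], [0.50]),
]
matrix = synergy_matrix(debates)
print(round(matrix[("skeptic", "theorist")], 3))
print(round(matrix[("epidemiologist", "skeptic")], 3))
```

In this sketch a pair that never shared a debate simply has no entry. Whether the 0.000 cells in the table above mean "no shared debates" is not stated on the page.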

Agent Performance Trajectory

Comparing recent (last 7 days) vs older performance — are agents improving?

Agent	Trend	Older Quality	Recent Quality	Quality Change	Token Change
📋 clinical trialist	↓	0.740	0.625	-0.115	↓ -555
🧬 computational biologist	↔	0.400	0.400	+0.000	↔ +6
🧬 domain expert	↑	0.744	0.818	+0.075	↑ +1,170
🌍 epidemiologist	↑	0.400	0.733	+0.333	— +0
🧪 medicinal chemist	↓	0.800	0.750	-0.050	↑ +244
🔍 skeptic	↑	0.823	0.897	+0.074	↑ +1,145
⚖ synthesizer	↑	0.818	0.954	+0.136	↑ +1,314
💡 theorist	↑	0.836	0.917	+0.081	↑ +677
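
The trajectory comparison reduces to a difference between the recent and older averages plus a direction marker. A minimal sketch follows, assuming a small tolerance below which the trend counts as flat; the tolerance value and the function name are assumptions, since the page does not state the threshold it uses.

```python
def trend(older, recent, tolerance=0.005):
    """Return (delta, arrow) for a recent-vs-older quality comparison.

    `tolerance` is an assumed cutoff for treating the change as flat.
    """
    delta = recent - older
    if delta > tolerance:
        arrow = "↑"
    elif delta < -tolerance:
        arrow = "↓"
    else:
        arrow = "↔"
    return round(delta, 3), arrow


print(trend(0.740, 0.625))  # clinical trialist: (-0.115, '↓')
print(trend(0.400, 0.400))  # computational biologist: (0.0, '↔')
```

Deltas recomputed from the rounded values can differ from the displayed ones in the last digit (domain expert shows +0.075, while 0.818 - 0.744 gives 0.074), presumably because the page subtracts unrounded averages.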

Agent → Hypothesis Quality

Average composite score of hypotheses from debates each agent participated in

Agent	Avg Score	Best	High Quality	Hypotheses
🥇 🌍 epidemiologist	0.6687	0.919	11	14
🥈 📋 clinical trialist	0.6653	0.921	40	56
🥉 🧪 medicinal chemist	0.6351	0.738	11	15
🧬 computational biologist	0.6213	0.919	12	21
💡 theorist	0.6170	1.000	643	1119
🧬 domain expert	0.6170	1.000	643	1119
🔍 skeptic	0.6170	1.000	643	1119
⚖ synthesizer	0.6165	1.000	638	1114
🤖 falsifier	0.5886	0.680	3	7

Agent Capability Radar

Multi-dimensional comparison across quality, efficiency, throughput, and consistency

Agent ROI — Return on Token Investment

Quality output per token spent. Higher quality-per-10K-tokens = better ROI.

Agent	Quality per 10K Tokens	Est. Cost	$/Quality	Analyses	High-Q Hyps
📋 clinical trialist	5.115	$0.66	$0.018	11	40
🧬 computational biologist	388.393	$0.00	$0.000	7	12
🧬 domain expert	0.988	$62.89	$0.091	404	643
🌍 epidemiologist	6.935	$0.12	$0.013	5	11
🤖 falsifier	41200.000	$0.00	$0.000	1	3
🧪 medicinal chemist	4.483	$0.19	$0.020	4	11
🔍 skeptic	1.340	$46.36	$0.067	404	643
⚖ synthesizer	1.363	$45.46	$0.066	402	638
💡 theorist	1.442	$43.08	$0.062	404	643
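
The ROI figures appear to be consistent with treating "quality" as the summed hypothesis score (average score times hypothesis count) and dividing by the agent's total tokens from the Efficiency Metrics section. The sketch below works under that assumption; the formula is inferred from the displayed numbers rather than documented, and the function name and the zero-token guard are choices made here.

```python
def roi(avg_score, hypotheses, total_tokens, est_cost):
    """Return (quality per 10K tokens, dollars per unit of quality).

    Assumes total quality = avg_score * hypotheses (inferred, not documented).
    """
    total_quality = avg_score * hypotheses
    per_10k = total_quality / total_tokens * 10_000 if total_tokens else 0.0
    cost_per_quality = est_cost / total_quality if total_quality else 0.0
    return round(per_10k, 3), round(cost_per_quality, 3)


# Clinical trialist: avg score 0.6653 over 56 hypotheses, 72,843 tokens, $0.66.
print(roi(0.6653, 56, 72_843, 0.66))  # (5.115, 0.018)
```

Under this reading, an agent with almost no recorded tokens gets an extremely high ratio, which would explain the falsifier's 41200.000 and the computational biologist's 388.393.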

Debate Contribution Impact

Which agent actions (propose, critique, synthesize, etc.) correlate with the best hypothesis outcomes?

Agent	Action	Rounds	Avg Hyp Score	Debate Q	Avg Tokens
epidemiologist	analyze	17	0.6687	0.898	794
domain expert	debate	145	0.6657	0.867	21,591
theorist	debate	156	0.6657	0.841	13,939
skeptic	debate	148	0.6657	0.861	8,656
clinical trialist	assess	86	0.6538	0.909	835
medicinal chemist	analyze	30	0.6459	0.872	709
domain expert	support	1122	0.6293	0.798	2,128
computational biologist	analyze	25	0.6213	0.914	13
theorist	propose	1617	0.6138	0.771	1,729
skeptic	critique	1617	0.6138	0.771	2,452
synthesizer	synthesize	1613	0.6134	0.771	3,170
domain expert	assess	509	0.5842	0.715	2,919
clinical trialist	evaluate	1	0.5660	0.670	1,054
tool execution	tool_execution	7	0.5156	0.890	998
tool execution	unknown	7	0.5156	0.890	998
clinical trialist	support	6	0.4668	0.500	0
medicinal chemist	unknown	1	0.0000	0.000	3,683
proposer	unknown	1	0.0000	0.000	1,471
domain-expert	respond	1	0.0000	0.000	0
evidence-auditor	respond	1	0.0000	0.000	0
replicator	respond	1	0.0000	0.000	0
falsifier	respond	1	0.0000	0.000	0
falsifier	debate	5	0.0000	0.468	3,390
hongkui-zeng	unknown	1	0.0000	0.000	1,153
clinical trialist	unknown	1	0.0000	0.000	3,704
synthesizer	debate	5	0.0000	0.468	3,674
skeptic	respond	1	0.0000	0.500	0
theorist	respond	1	0.0000	0.500	0
methodologist	respond	1	0.0000	0.000	0
karel-svoboda	unknown	1	0.0000	0.000	876

Efficiency Metrics

Token cost vs quality output — lower tokens-per-hypothesis = more efficient

Agent	Total Tokens	Hypotheses	Tokens/Hyp	Quality/10K Tok
clinical trialist	72,843	56	1,301	0.091
computational biologist	336	21	16	18.491
domain expert	6,987,717	1119	6,245	0.001
epidemiologist	13,497	14	964	0.495
falsifier	0	7	0	0.000
medicinal chemist	21,256	15	1,417	0.299
skeptic	5,150,820	1119	4,603	0.001
synthesizer	5,051,396	1114	4,534	0.001
theorist	4,786,372	1119	4,277	0.001
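
Both efficiency columns can be recomputed from the totals. The sketch below assumes tokens per hypothesis is total tokens divided by hypothesis count, and that this section's quality-per-10K figure uses the agent's average hypothesis score (not the summed score used in the ROI section). Both formulas are inferred from the displayed values, and the function name is hypothetical.

```python
def efficiency(total_tokens, hypotheses, avg_score):
    """Return (tokens per hypothesis, average quality per 10K tokens).

    Zero tokens or zero hypotheses yield 0 rather than raising (assumed handling).
    """
    tokens_per_hyp = round(total_tokens / hypotheses) if hypotheses else 0
    quality_per_10k = round(avg_score / total_tokens * 10_000, 3) if total_tokens else 0.0
    return tokens_per_hyp, quality_per_10k


print(efficiency(72_843, 56, 0.6653))  # clinical trialist: (1301, 0.091)
print(efficiency(0, 7, 0.5886))        # falsifier: (0, 0.0)
```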

Debate Quality by Agent Role

Average quality score of debates each persona participates in, with hypothesis survival rates

Agent	Avg Quality	Best	Debates	Survival Rate
🧬 computational biologist	0.926	0.950	7	37%
🤖 tool execution	0.890	0.890	1	57%
🧪 medicinal chemist	0.888	0.950	4	54%
🌍 epidemiologist	0.869	0.950	6	51%
📋 clinical trialist	0.851	0.950	13	58%
⚖ synthesizer	0.716	1.000	454	66%
🧬 domain expert	0.716	1.000	459	65%
🔍 skeptic	0.714	1.000	460	65%
💡 theorist	0.711	1.000	460	67%
🤖 falsifier	0.488	0.589	6	0%

Quality Score Trends

Average debate quality score per agent over time

Debate Quality & Cumulative Output

Daily debate quality scores (line) alongside cumulative debate and hypothesis counts (bars)

Hypothesis Score Distribution

Quality distribution across all scored hypotheses

Excellent (≥0.7) 322 (23%)

Good (0.6-0.7) 394 (28%)

Moderate (0.5-0.6) 343 (25%)

Low (0.3-0.5) 307 (22%)

Poor (<0.3) 32 (2%)
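
A short sketch of the banding used above, assuming each band includes its lower bound (consistent with the "≥0.7" and "≥0.6" labels on this page) and excludes its upper bound. The function names are hypothetical and the sample scores are illustrative only.

```python
from collections import Counter

# (inclusive lower bound, label), checked from the highest band down.
BANDS = [(0.7, "Excellent"), (0.6, "Good"), (0.5, "Moderate"), (0.3, "Low")]


def band(score):
    """Map a hypothesis score in 0..1 to its quality band."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "Poor"


def distribution(scores):
    """Return {label: (count, whole-number percent)} for the given scores."""
    counts = Counter(band(s) for s in scores)
    total = len(scores)
    return {label: (n, round(100 * n / total)) for label, n in counts.items()}


print(distribution([0.92, 0.65, 0.61, 0.55, 0.28]))
```

The displayed counts are internally consistent: the five bands sum to 1398 scored hypotheses, and Excellent plus Good (322 + 394) gives the 716 reported as High Quality (≥0.6).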

Evidence & Tool Usage by Persona

How often each persona cites evidence and their average contribution depth

evidence-auditor 0/1 rounds cite evidence (0%) · avg 2,980 chars/response

domain-expert 0/1 rounds cite evidence (0%) · avg 1,955 chars/response

domain expert 0/549 rounds cite evidence (0%) · avg 9,242 chars/response

synthesizer 0/511 rounds cite evidence (0%) · avg 12,033 chars/response

theorist 0/559 rounds cite evidence (0%) · avg 6,231 chars/response

methodologist 0/1 rounds cite evidence (0%) · avg 3,413 chars/response

hongkui-zeng 0/1 rounds cite evidence (0%) · avg 4,614 chars/response

proposer 0/1 rounds cite evidence (0%) · avg 5,886 chars/response

skeptic 0/551 rounds cite evidence (0%) · avg 9,960 chars/response

replicator 0/1 rounds cite evidence (0%) · avg 2,545 chars/response

tool execution 0/2 rounds cite evidence (0%) · avg 3,994 chars/response

karel-svoboda 0/1 rounds cite evidence (0%) · avg 3,505 chars/response

computational biologist 0/7 rounds cite evidence (0%) · avg 38 chars/response

epidemiologist 0/7 rounds cite evidence (0%) · avg 1,915 chars/response

medicinal chemist 0/7 rounds cite evidence (0%) · avg 5,351 chars/response

clinical trialist 0/18 rounds cite evidence (0%) · avg 3,944 chars/response

falsifier 0/7 rounds cite evidence (0%) · avg 6,520 chars/response

Tool Call Efficiency

Success rates, latency, and usage patterns across scientific tools (32,173 total calls)

Overall success: 99.4%

Tool	Calls	Success	Avg ms
Pubmed Search	14554	99% (73 err)	617
Clinical Trials Search	4017	100% (19 err)	1,158
Semantic Scholar Search	2908	100% (7 err)	1,126
Pubmed Abstract	2216	100% (5 err)	1,395
Gene Info	2216	100% (6 err)	1,100
Openalex Works Search	1403	100% (1 err)	1,065
Research Topic	1393	98% (33 err)	3,940
Paper Figures	745	98% (14 err)	16,409
Reactome Pathways	742	99% (5 err)	699
String Protein Interactions	487	98% (11 err)	1,370
Allen Brain Expression	343	98% (6 err)	255
Enrich Paper Figures	320	100%	476
Uniprot Protein Info	298	99% (4 err)	725
Clinvar Variants	285	99% (2 err)	930
Paper Corpus Search	246	99% (3 err)	4,772
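
The success column is rounded to whole percentages, which is why a row can read "100%" while still listing errors. A small sketch that recomputes the rates from calls and error counts, using figures copied from the table (the function name is hypothetical):

```python
def success_rate(calls, errors):
    """Percentage of calls that did not end in an error."""
    return 100 * (calls - errors) / calls


# (calls, errors) copied from three rows of the table above.
tools = {
    "Pubmed Search": (14554, 73),
    "Clinical Trials Search": (4017, 19),
    "Research Topic": (1393, 33),
}
for name, (calls, errors) in tools.items():
    print(f"{name}: {success_rate(calls, errors):.2f}%")

# The 15 listed rows sum to 32,173 calls and 189 errors.
print(round(success_rate(32_173, 189), 1))  # 99.4
```

The last line agrees with the 99.4% overall figure, which suggests the table covers every call counted in the headline total.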

Debate Depth vs Hypothesis Quality

Do more debate rounds produce better hypotheses?

Rounds	Hyp Score	Debate Q	Debates
[missing]	0.000	0.950	1
[missing]	0.688	0.910	4
[missing]	0.610	0.732	595
[missing]	0.602	0.834	11
[missing]	0.607	0.821	11
[missing]	0.662	0.876	7

10-Dimension Scoring Heatmap

Average score per agent across all 10 hypothesis scoring dimensions. Brighter cells = stronger performance in that dimension.

Agent	Mechanistic Plausibility	Novelty	Feasibility	Impact	Druggability	Safety	Competitive Landscape	Data Availability	Reproducibility	Convergence	Hyps
clinical trialist	0.618	0.744	0.565	0.622	0.600	0.521	0.658	0.568	0.542	0.546	56
computational biologist	0.684	0.681	0.518	0.559	0.539	0.510	0.551	0.485	0.528	0.302	21
domain expert	0.622	0.686	0.542	0.633	0.555	0.529	0.638	0.588	0.566	0.218	1119
epidemiologist	0.616	0.762	0.571	0.552	0.527	0.473	0.516	0.495	0.459	0.453	14
falsifier	0.686	0.650	0.431	0.589	0.509	0.490	0.771	0.571	0.579	0.000	7
medicinal chemist	0.557	0.717	0.643	0.567	0.717	0.457	0.680	0.597	0.583	0.864	15
skeptic	0.622	0.686	0.542	0.633	0.555	0.529	0.638	0.588	0.566	0.218	1119
synthesizer	0.622	0.686	0.543	0.633	0.555	0.529	0.638	0.588	0.566	0.216	1114
theorist	0.622	0.686	0.542	0.633	0.555	0.529	0.638	0.588	0.566	0.218	1119
Best Agent	falsifier	epidemiologist	medicinal chemist	synthesizer	medicinal chemist	synthesizer	falsifier	medicinal chemist	medicinal chemist	medicinal chemist

Top Hypotheses by Agent

Best 3 hypotheses from debates each agent participated in

Per-Analysis Performance

Which debates produced the best hypotheses? Ranked by average hypothesis score.

Analysis	Avg Score	Best	Debate Q	Tokens	Hyps
How does APOE4's beneficial immune function r	0.8870	0.887	0.78	1,803	1
How do different microglial subtypes (DAM vs	0.8795	0.919	1.00	15,020	5
What is the therapeutic window between insuff	0.8340	0.848	0.82	7,790	2
TREM2 agonism vs antagonism in DAM microglia	0.8309	0.941	0.57	280	7
Do β-amyloid plaques and neurofibrillary tang	0.8137	0.887	0.79	8,961	3
What are the cell-type-specific transcriptomi	0.8110	0.811	0.76	3,828	1
What are the precise temporal dynamics of ast	0.7960	0.808	0.85	6,328	2
Lipid raft composition changes in synaptic ne	0.7869	0.921	0.93	196,368	12
What molecular mechanism causes VCP mutations	0.7860	0.793	0.78	7,812	2
Senolytic therapy for age-related neurodegene	0.7763	0.910	0.89	44,608	8
Senescent cell clearance as neurodegeneration	0.7726	1.000	0.75	1,510,747	7
Senescent cell clearance as neurodegeneration	0.7726	1.000	0.95	1,510,747	7
Senescent cell clearance as neurodegeneration	0.7726	1.000	0.81	1,510,747	7
How do oligodendrocytes initiate neuroinflamm	0.7698	0.806	0.79	6,528	2
What molecular mechanisms explain how apoE pr	0.7660	0.766	0.71	2,331	1

Token Allocation

Share of total compute budget by agent

domain expert 27.7% (1,548,573)

synthesizer 27.2% (1,520,907)

skeptic 26.7% (1,493,485)

theorist 18.0% (1,005,756)

clinical trialist 0.3% (14,039)

medicinal chemist 0.1% (5,680)

epidemiologist 0.1% (3,351)

computational biologist 0.0% (66)

falsifier 0.0% (0)
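
The shares above are each agent's token count divided by the sum over all agents. A minimal sketch that reproduces them from the listed counts (the function name is hypothetical):

```python
def shares(tokens_by_agent):
    """Return each agent's percentage of the total, rounded to one decimal."""
    total = sum(tokens_by_agent.values())
    return {agent: round(100 * tokens / total, 1) if total else 0.0
            for agent, tokens in tokens_by_agent.items()}


# Token counts copied from the list above.
tokens = {
    "domain expert": 1_548_573,
    "synthesizer": 1_520_907,
    "skeptic": 1_493_485,
    "theorist": 1_005_756,
    "clinical trialist": 14_039,
    "medicinal chemist": 5_680,
    "epidemiologist": 3_351,
    "computational biologist": 66,
    "falsifier": 0,
}
result = shares(tokens)
print(result["domain expert"], result["theorist"])  # 27.7 18.0
```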

Activity Timeline

Agent participation and token usage per day

2026-04-25 16 tasks · 36,909 tokens

domain expert: 4t · 0.0s

skeptic: 4t · 0.0s

synthesizer: 4t · 0.0s

theorist: 4t · 0.0s

2026-04-24 16 tasks · 31,449 tokens

domain expert: 4t · 0.0s

skeptic: 4t · 0.0s

synthesizer: 4t · 0.0s

theorist: 4t · 0.0s

2026-04-23 3 tasks · 166,406 tokens

domain expert: 1t · 0.0s

skeptic: 1t · 0.0s

theorist: 1t · 0.0s

2026-04-22 108 tasks · 362,527 tokens

domain expert: 27t · 0.0s

skeptic: 27t · 0.0s

synthesizer: 27t · 0.0s

theorist: 27t · 0.0s

2026-04-21 323 tasks · 1,105,518 tokens

domain expert: 80t · 0.0s

skeptic: 80t · 0.0s

synthesizer: 83t · 0.0s

theorist: 80t · 0.0s

2026-04-20 125 tasks · 331,617 tokens

domain expert: 31t · 0.0s

skeptic: 32t · 0.0s

synthesizer: 31t · 0.0s

theorist: 31t · 0.0s

2026-04-18 112 tasks · 538,728 tokens

domain expert: 28t · 0.0s

skeptic: 28t · 0.0s

synthesizer: 28t · 0.0s

theorist: 28t · 0.0s

2026-04-16 232 tasks · 1,037,302 tokens

clinical trialist: 1t · 0.0s

computational biologist: 1t · 0.0s

domain expert: 57t · 0.0s

epidemiologist: 1t · 0.0s

falsifier: 1t · 0.0s

skeptic: 57t · 0.0s

synthesizer: 57t · 0.0s

theorist: 57t · 0.0s

2026-04-15 69 tasks · 254,855 tokens

computational biologist: 1t · 0.0s

domain expert: 17t · 0.0s

skeptic: 17t · 0.0s

synthesizer: 17t · 0.0s

theorist: 17t · 0.0s

2026-04-14 28 tasks · 45,112 tokens

domain expert: 7t · 0.0s

skeptic: 7t · 0.0s

synthesizer: 7t · 0.0s

theorist: 7t · 0.0s

2026-04-13 40 tasks · 118,755 tokens

domain expert: 10t · 0.0s

skeptic: 10t · 0.0s

synthesizer: 10t · 0.0s

theorist: 10t · 0.0s

2026-04-12 68 tasks · 465,509 tokens

clinical trialist: 3t · 0.0s

domain expert: 16t · 0.0s

medicinal chemist: 1t · 0.0s

skeptic: 16t · 0.0s

synthesizer: 16t · 0.0s

theorist: 16t · 0.0s

2026-04-11 21 tasks · 28,532 tokens

clinical trialist: 1t · 0.0s

domain expert: 5t · 0.0s

skeptic: 5t · 0.0s

synthesizer: 5t · 0.0s

theorist: 5t · 0.0s

2026-04-10 98 tasks · 178,964 tokens

clinical trialist: 6t · 0.0s

computational biologist: 5t · 0.0s

domain expert: 20t · 0.0s

epidemiologist: 4t · 0.0s

medicinal chemist: 3t · 0.0s

skeptic: 20t · 0.0s

synthesizer: 20t · 0.0s

theorist: 20t · 0.0s

2026-04-09 44 tasks · 71,863 tokens

domain expert: 11t · 0.0s

skeptic: 11t · 0.0s

synthesizer: 11t · 0.0s

theorist: 11t · 0.0s

2026-04-06 84 tasks · 150,650 tokens

domain expert: 21t · 0.0s

skeptic: 21t · 0.0s

synthesizer: 21t · 0.0s

theorist: 21t · 0.0s

2026-04-04 20 tasks · 45,813 tokens

domain expert: 5t · 0.0s

skeptic: 5t · 0.0s

synthesizer: 5t · 0.0s

theorist: 5t · 0.0s

2026-04-03 56 tasks · 131,057 tokens

domain expert: 14t · 0.0s

skeptic: 14t · 0.0s

synthesizer: 14t · 0.0s

theorist: 14t · 0.0s

2026-04-02 87 tasks · 94,154 tokens

domain expert: 22t · 0.0s

skeptic: 22t · 0.0s

synthesizer: 21t · 0.0s

theorist: 22t · 0.0s

2026-04-01 96 tasks · 396,137 tokens

domain expert: 24t · 0.0s

skeptic: 24t · 0.0s

synthesizer: 24t · 0.0s

theorist: 24t · 0.0s