[Forge] Benchmark evaluation harness — submission pipeline, automated scoring, leaderboard (done)

Build an automated benchmark evaluation harness for SciDEX's 6 seeded neurodegeneration benchmarks. Currently 6 benchmarks exist with 1 submission and no scoring pipeline — benchmarks are decorative without evaluation. Read spec: docs/planning/specs/forge_benchmark_evaluation_harness_spec.md. Steps: (1) inspect benchmark_submissions schema; (2) build scripts/evaluate_benchmark_submission.py that loads ground truth, scores predictions against it using the benchmark's stated metric (AUROC, NDCG, Spearman, etc.), and writes score+scored_at to submission row; (3) run on the 1 existing submission; (4) create 1-2 additional submissions from top hypotheses; (5) update leaderboard_top_score in benchmarks; (6) document submission format in docs/design/benchmark_submission_format.md. Do NOT modify ground truth data. Accept any 2+ metric types.
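
For orientation, the write-back in step (2) could assemble a row roughly like the sketch below. Only the column names (score, scored_at, plus the metric/method/notes fields described in the spec) come from the task; the helper name, hypothesis title, and score value are illustrative examples, and the benchmark id is the one mentioned later in the work log.

```python
# Sketch of assembling one hypothesis-evaluation record; only the column names
# score/scored_at/metric/method/notes come from the spec, the rest is an example.
from datetime import datetime, timezone


def build_submission_row(benchmark_id: str, hypothesis_title: str,
                         metric: str, score: float, rationale: str) -> dict:
    """Assemble the values written into one benchmark_submissions row."""
    return {
        "benchmark_id": benchmark_id,
        "method": "hypothesis_eval",       # fixed method tag from the spec
        "metric": metric,                  # e.g. "auroc", "ndcg@50"
        "score": score,
        "scored_at": datetime.now(timezone.utc),
        "notes": f"{hypothesis_title} | {rationale}",
    }


# Example values only; real rows come from scored hypotheses or prediction files.
row = build_submission_row(
    benchmark_id="bench_ad_protein_aggregation_propensity_v1",
    hypothesis_title="TREM2 modulates AD protein aggregation",
    metric="auroc",
    score=0.58,
    rationale="ranked gene list extracted from hypothesis description",
)
```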

Git Commits (4)

[Forge] Add benchmark evaluation harness: scoring pipeline, leaderboard API, docs [task:daf64586-a66f-4e8c-9b7f-a07c78eaa17b] (#1273) 2026-04-28
[Forge] Add benchmark evaluation harness: scoring pipeline, leaderboard API, docs [task:daf64586-a66f-4e8c-9b7f-a07c78eaa17b] (#1273) 2026-04-28
Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (34 commits) (#1264) 2026-04-28
[Senate] Cycle 4: create market-resolution/benchmark/validation tasks; log world-model [task:80ffb77b-8391-493c-8644-37086c8e2e3c] 2026-04-28
Spec File

Goal

Build and run a benchmark evaluation harness that scores SciDEX's top hypotheses against
the 6 registered ML benchmarks. Currently there are 6 benchmarks with only 1 submission
(the seeding task 1186a9ab created the registry; no evaluation harness was built).

This converts hypotheses from "interesting debate subjects" into **testable predictions with
quantified baseline performance** — a qualitative leap in scientific credibility.

Current state (as of 2026-04-28)

| Benchmark                                   | Task type       | Baseline      | Submissions |
|---------------------------------------------|-----------------|---------------|-------------|
| AD Protein Aggregation Propensity           | classification  | AUROC 0.72    | 0           |
| PD Dopaminergic Neuron Expression Signature | classification  | AUPRC 0.61    | 0           |
| ALS Motor Neuron Survival Gene Prediction   | target_ranking  | NDCG@50 0.43  | 0           |
| Neurodegeneration Drug Target Druggability  | regression      | Spearman 0.52 | 0           |
| Glymphatic Clearance Biomarker Panel        | feature_ranking | NDCG@20 0.49  | 0           |
| OT-AD Target Ranking                        | target_ranking  | (none)        | 1           |

What the agent should do

  • Read docs/planning/specs/1186a9ab-seed-neurodegeneration-ml-benchmark-regi.md for the benchmark
    schema and how submissions work. Inspect the tables with \d benchmarks and \d benchmark_submissions.
  • Select the top 50 hypotheses by composite_score that have a target_gene or mention a
    mechanistically relevant gene/pathway matchable to a benchmark domain.
  • Match hypotheses to benchmarks: for each hypothesis, determine which benchmark(s) it
    makes predictions about (e.g., a hypothesis about TREM2 and AD aggregation → AD Protein Aggregation benchmark).
  • Score each hypothesis against matched benchmarks: extract the predicted gene/protein
    ranked list or binary prediction from the hypothesis description, then run a deterministic
    scoring function against the benchmark's answer key (stored in artifacts); a sketch of such
    a scorer follows this list.
  • Store each evaluation as a benchmark_submissions row with score, metric, method='hypothesis_eval',
    and notes containing the hypothesis title and extraction rationale.
  • Write a summary artifact: a markdown table of the top 25 hypothesis-benchmark scores, stored
    in data/scidex-artifacts/ via commit_artifact().
  • Expose a /api/benchmarks/<id>/leaderboard endpoint returning the top submissions ranked by score.
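
A minimal sketch of the deterministic scoring function referenced above, dispatching on the benchmark's stated metric. It assumes scikit-learn and SciPy are available; the function names and dict-based inputs are illustrative rather than the actual evaluator API.

```python
# Illustrative metric dispatch; function and argument names are assumptions,
# not the real evaluator API. Assumes scikit-learn and SciPy are installed.
from math import log2

from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score, roc_auc_score


def ndcg_at_k(relevance: dict[str, float], ranking: list[str], k: int) -> float:
    """NDCG@k of a ranked item list against graded relevance labels."""
    gains = [relevance.get(item, 0.0) for item in ranking[:k]]
    dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def score_predictions(metric: str, truth: dict[str, float],
                      preds: dict[str, float], k: int = 50) -> float:
    """Score predictions (item -> value) against an answer key with one metric."""
    shared = sorted(set(truth) & set(preds))
    y_true = [truth[g] for g in shared]
    y_pred = [preds[g] for g in shared]
    if metric == "auroc":
        return float(roc_auc_score(y_true, y_pred))
    if metric == "auprc":
        return float(average_precision_score(y_true, y_pred))
    if metric == "spearman":
        return float(spearmanr(y_true, y_pred)[0])
    if metric.startswith("ndcg"):
        ranking = sorted(preds, key=preds.get, reverse=True)
        return ndcg_at_k(truth, ranking, k)
    raise ValueError(f"unsupported metric: {metric}")
```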
    Why this is not a recurring CI duplicate

    No recurring CI covers benchmark evaluation. The closest is [Forge] CI: Test all scientific tools,
    which checks availability, not evaluation logic. This task builds a new capability (matching
    hypotheses to benchmarks and scoring them) that does not exist today.

    Success metrics

    • ≥ 30 benchmark_submissions rows with non-null scores
    • At least 3 benchmarks have ≥ 5 submissions
    • /api/benchmarks/<id>/leaderboard returns ranked list
    • Summary artifact committed to data/scidex-artifacts/

    Do NOT

    • Modify the benchmark answer keys (read-only)
    • Use LLM scoring as a substitute for the defined metrics (use the benchmark's stated metric)
    • Create new benchmarks (out of scope for this task)

    Work Log

    2026-04-28 — Created by ambitious-quest-task-generator Cycle 3

    6 benchmarks exist with only 1 submission. Evaluation harness is absent — seeding task built
    the registry but not the evaluator. This task builds and runs the evaluator.

    2026-04-29 — Iteration 1 implementation plan

    Live staleness review found the task is still relevant but partially advanced: the database
    already contains scored hypothesis-evaluation submissions for all 6 benchmarks, with at least
    21 scored rows per benchmark and denormalized leaderboard_top_score populated. The remaining
    gap is durable repo code: there is no committed scripts/evaluate_benchmark_submission.py,
    no documented file format, and no JSON /api/benchmarks/{id}/leaderboard endpoint.

    This iteration will add a reusable deterministic evaluator for submitted prediction files,
    support AUROC/AUPRC/NDCG@K/Spearman scoring against the read-only registered answer keys,
    upsert benchmark_submissions rows and the denormalized benchmark leaderboard fields, expose
    the JSON leaderboard endpoint from the mounted Forge router, and document the JSON/CSV
    submission formats.
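
    As a sketch only, since the authoritative layout is whatever docs/design/benchmark_submission_format.md
    specifies: a loader for JSON/CSV prediction files might look roughly like this, assuming each
    record carries an id and a score field.

```python
# Assumed prediction-file layout: a JSON list of records or a CSV, each record
# carrying "id" and "score" fields. The real format is defined in
# docs/design/benchmark_submission_format.md, not here.
import csv
import json
from pathlib import Path


def load_predictions(path: str) -> dict[str, float]:
    """Load a JSON or CSV prediction file into an id -> score mapping."""
    p = Path(path)
    if p.suffix.lower() == ".json":
        records = json.loads(p.read_text())
    else:
        with p.open(newline="") as fh:
            records = list(csv.DictReader(fh))
    return {str(r["id"]): float(r["score"]) for r in records}
```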

    2026-04-29 — Iteration 1 implementation notes

    Implemented the durable submission path: scidex/forge/benchmark_evaluation.py now computes
    AUROC, AUPRC, NDCG@K, and Spearman without LLM scoring; scripts/evaluate_benchmark_submission.py
    loads JSON/CSV predictions, scores them, upserts benchmark_submissions, and refreshes
    leaderboard denormalizations. Added GET /api/benchmarks/{benchmark_id}/leaderboard in the
    mounted Forge router and documented submission formats in docs/design/benchmark_submission_format.md.
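
    For orientation, the leaderboard route has roughly the shape sketched below; the real handler
    lives in api_routes/forge.py, and the query helper and response fields here are stand-ins
    rather than the actual implementation.

```python
# Rough shape of the leaderboard route. The canned query helper and the
# response fields are stand-ins; see api_routes/forge.py for the real handler.
from fastapi import APIRouter

router = APIRouter()


def fetch_scored_submissions(benchmark_id: str) -> list[dict]:
    """Stand-in for the real database query; returns canned rows for the sketch."""
    return [
        {"id": "sub_1", "method": "hypothesis_eval", "metric": "auroc", "score": 0.57},
        {"id": "sub_2", "method": "hypothesis_eval", "metric": "auroc", "score": 0.61},
    ]


@router.get("/api/benchmarks/{benchmark_id}/leaderboard")
def benchmark_leaderboard(benchmark_id: str, limit: int = 25) -> list[dict]:
    """Return the top submissions for one benchmark, ranked by score."""
    rows = sorted(fetch_scored_submissions(benchmark_id),
                  key=lambda r: r["score"], reverse=True)
    return [
        {"rank": i + 1, "submission_id": r["id"], "method": r["method"],
         "metric": r["metric"], "score": r["score"]}
        for i, r in enumerate(rows[:limit])
    ]
```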

    Verification: focused unit tests passed (3 passed); syntax checks passed for the new module,
    CLI, and Forge router; a dry-run AD aggregation submission scored AUROC 0.571429; a real smoke
    submission eval_smoke:daf64586:ad_aggregation was written with AUROC 0.571429; and a local
    FastAPI TestClient returned HTTP 200 with two ranked leaderboard rows for
    bench_ad_protein_aggregation_propensity_v1.

    2026-04-29 — Iteration 1 continuation

    Added the missing summary artifact path to the evaluator: the CLI now supports
    --write-summary-artifact with --summary-limit 25, which renders the top accepted scored
    submissions to benchmarks/leaderboard_summary.md under SCIDEX_ARTIFACTS_ROOT
    and attempts to commit it through scidex.atlas.artifact_commit.commit_artifact().
    This keeps the report generation in the same deterministic scoring module as the
    submission evaluator instead of making a one-off artifact file.
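
    A minimal sketch of the summary-table rendering that path performs, assuming the scored rows
    are already loaded; the row field names are assumptions, and the real report is produced by
    scidex/forge/benchmark_evaluation.py.

```python
# Minimal sketch of rendering the top scored submissions as a markdown table,
# as the --write-summary-artifact path does. Row field names are assumptions.
def render_summary(rows: list[dict], limit: int = 25) -> str:
    """Render scored submissions (highest score first) as a markdown table."""
    top = sorted(rows, key=lambda r: r["score"], reverse=True)[:limit]
    lines = [
        "| Benchmark | Method | Metric | Score |",
        "| --- | --- | --- | --- |",
    ]
    for r in top:
        lines.append(
            f"| {r['benchmark']} | {r['method']} | {r['metric']} | {r['score']:.4f} |"
        )
    return "\n".join(lines) + "\n"
```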

    Verification: pytest tests/test_benchmark_evaluation.py -q passed (4 passed); python3 -m py_compile scidex/forge/benchmark_evaluation.py scripts/evaluate_benchmark_submission.py api_routes/forge.py
    passed; running python3 scripts/evaluate_benchmark_submission.py --write-summary-artifact --summary-limit 25
    wrote a 25-row markdown summary from live PostgreSQL data. The artifact commit attempt
    returned commit_ok=false in this sandbox because the task worktree's submodule gitdir
    is read-only/uninitialised (git submodule update --init --recursive data/scidex-artifacts
    could not create .git/worktrees/.../modules/data/scidex-artifacts), but the CLI path
    and report contents were exercised.
