Build and run a benchmark evaluation harness that scores SciDEX's top hypotheses against
the 6 registered ML benchmarks. Currently there are 6 benchmarks with only 1 submission
(the seeding task 1186a9ab created the registry; no evaluation harness was built).
This converts hypotheses from "interesting debate subjects" into **testable predictions with
quantified baseline performance** — a qualitative leap in scientific credibility.
See docs/planning/specs/1186a9ab-seed-neurodegeneration-ml-benchmark-regi.md for the benchmark registry details (6 benchmarks, 1 benchmark_submissions row). Deliverables:

- Hypothesis-to-benchmark matching (e.g. hypotheses that name a target_gene or mention artifacts).
- One benchmark_submissions row per evaluated hypothesis with score, metric, method='hypothesis_eval', and notes containing the hypothesis title and extraction rationale.
- Evaluation artifacts committed to data/scidex-artifacts/ via commit_artifact().
- A /api/benchmarks/<id>/leaderboard endpoint returning the top submissions ranked by score (see the sketch after the CI note below).

No recurring CI covers benchmark evaluation. The closest is [Forge] CI: Test all scientific tools, which checks availability, not evaluation logic. This task builds a new capability (matching hypotheses to benchmarks and scoring them) that does not exist today.
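A minimal sketch of the submission upsert and leaderboard query, assuming a psycopg2-style PostgreSQL connection and a unique constraint on (benchmark_id, submission_id); the submission_id column and any schema details beyond the fields named above are assumptions, not the project's actual DDL.

```python
# Sketch only: column names beyond score/metric/method/notes are illustrative.
UPSERT_SQL = """
INSERT INTO benchmark_submissions (benchmark_id, submission_id, score, metric, method, notes)
VALUES (%(benchmark_id)s, %(submission_id)s, %(score)s, %(metric)s, 'hypothesis_eval', %(notes)s)
ON CONFLICT (benchmark_id, submission_id)
DO UPDATE SET score = EXCLUDED.score, metric = EXCLUDED.metric, notes = EXCLUDED.notes;
"""

LEADERBOARD_SQL = """
SELECT submission_id, method, metric, score
FROM benchmark_submissions
WHERE benchmark_id = %(benchmark_id)s AND score IS NOT NULL
ORDER BY score DESC
LIMIT %(limit)s;
"""


def record_submission(conn, benchmark_id: str, submission_id: str,
                      score: float, metric: str, notes: str) -> None:
    """Upsert one scored hypothesis as a benchmark_submissions row."""
    with conn, conn.cursor() as cur:  # commit on success, roll back on error
        cur.execute(UPSERT_SQL, {
            "benchmark_id": benchmark_id,
            "submission_id": submission_id,
            "score": score,
            "metric": metric,
            "notes": notes,
        })


def leaderboard(conn, benchmark_id: str, limit: int = 10) -> list[tuple]:
    """Return the top submissions for one benchmark, ranked by score."""
    with conn.cursor() as cur:
        cur.execute(LEADERBOARD_SQL, {"benchmark_id": benchmark_id, "limit": limit})
        return cur.fetchall()
```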
Acceptance criteria:

- benchmark_submissions rows exist with non-null scores.
- /api/benchmarks/<id>/leaderboard returns a ranked list.
- An artifact is committed under data/scidex-artifacts/ and leaderboard_top_score is populated.

The remaining pieces are scripts/evaluate_benchmark_submission.py and the /api/benchmarks/{id}/leaderboard endpoint. This iteration will add a reusable deterministic evaluator for submitted prediction files, support AUROC/AUPRC/NDCG@K/Spearman scoring against read-only registered answer keys, upsert benchmark_submissions rows plus benchmark leaderboard fields, expose the JSON leaderboard endpoint from the mounted Forge router, and document JSON/CSV submission formats. A sketch of the metric scoring follows.
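A minimal sketch of the deterministic scoring step, assuming scikit-learn and SciPy are available; the function name, metric identifiers, and input layout are illustrative and do not reflect the actual scidex/forge/benchmark_evaluation.py implementation.

```python
# Illustrative scoring helper: given aligned prediction and answer-key vectors,
# compute the benchmark's metric deterministically.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score, ndcg_score, roc_auc_score


def score_predictions(metric: str, y_true: list[float], y_pred: list[float], k: int = 10) -> float:
    """Score predictions against the registered (read-only) answer key."""
    truth = np.asarray(y_true, dtype=float)
    preds = np.asarray(y_pred, dtype=float)
    if metric == "auroc":
        return float(roc_auc_score(truth, preds))
    if metric == "auprc":
        return float(average_precision_score(truth, preds))
    if metric == "ndcg_at_k":
        # ndcg_score expects 2-D (n_queries, n_items) arrays.
        return float(ndcg_score(truth.reshape(1, -1), preds.reshape(1, -1), k=k))
    if metric == "spearman":
        rho, _ = spearmanr(truth, preds)
        return float(rho)
    raise ValueError(f"unsupported metric: {metric}")
```

Determinism here mostly means aligning predictions to the answer key by a stable item identifier before building these vectors, so re-running the evaluator on the same file reproduces the stored score.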
scidex/forge/benchmark_evaluation.py now computes the metrics; scripts/evaluate_benchmark_submission.py scores prediction files, upserts benchmark_submissions rows, and refreshes the benchmark leaderboard fields; GET /api/benchmarks/{benchmark_id}/leaderboard is exposed from the mounted Forge router; and submission formats are documented in docs/design/benchmark_submission_format.md. Verification: focused unit tests passed (3 passed); syntax checks passed for the new module, CLI, and Forge router; a dry-run AD aggregation submission scored AUROC 0.571429; a real smoke submission eval_smoke:daf64586:ad_aggregation was written with AUROC 0.571429; and a local FastAPI TestClient returned HTTP 200 with two ranked leaderboard rows for bench_ad_protein_aggregation_propensity_v1.
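The TestClient check described above can be reproduced with a few lines; the app import path and the response field names below are assumptions, not the project's actual test code.

```python
# Illustrative leaderboard smoke check; adjust the app import to wherever the
# FastAPI app that mounts api_routes/forge.py actually lives (assumed here).
from fastapi.testclient import TestClient

from api_routes.app import app  # assumed module; the Forge router must be mounted on this app


def test_leaderboard_returns_ranked_rows() -> None:
    client = TestClient(app)
    resp = client.get("/api/benchmarks/bench_ad_protein_aggregation_propensity_v1/leaderboard")
    assert resp.status_code == 200
    rows = resp.json()
    scores = [row["score"] for row in rows]  # field name assumed
    assert scores == sorted(scores, reverse=True)  # rows come back ranked best-first
```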
The CLI now supports --write-summary-artifact --summary-limit 25, which renders the top accepted scored submissions to benchmarks/leaderboard_summary.md under SCIDEX_ARTIFACTS_ROOT and commits it via scidex.atlas.artifact_commit.commit_artifact(). Verification: pytest tests/test_benchmark_evaluation.py -q passed (4 passed);
python3 -m py_compile scidex/forge/benchmark_evaluation.py scripts/evaluate_benchmark_submission.py api_routes/forge.py
passed; running python3 scripts/evaluate_benchmark_submission.py --write-summary-artifact --summary-limit 25
wrote a 25-row markdown summary from live PostgreSQL data. The artifact commit attempt
returned commit_ok=false in this sandbox because the task worktree's submodule gitdir
is read-only/uninitialised (git submodule update --init --recursive data/scidex-artifacts
could not create .git/worktrees/.../modules/data/scidex-artifacts), but the CLI path
and report contents were exercised.
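For reference, a minimal sketch of the summary-artifact path described above, writing the markdown leaderboard under SCIDEX_ARTIFACTS_ROOT; the row schema and the commit_artifact() call shown in the trailing comment are assumptions about the real helper, not its documented signature.

```python
# Illustrative --write-summary-artifact flow: render the top-N rows as markdown
# under SCIDEX_ARTIFACTS_ROOT; committing is delegated to scidex.atlas.artifact_commit.
import os
from pathlib import Path


def write_leaderboard_summary(rows: list[dict], limit: int = 25) -> Path:
    """Render the top `limit` accepted scored submissions to a markdown table."""
    root = Path(os.environ["SCIDEX_ARTIFACTS_ROOT"])
    out_path = root / "benchmarks" / "leaderboard_summary.md"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    lines = ["| benchmark | submission | metric | score |", "| --- | --- | --- | --- |"]
    for row in rows[:limit]:
        lines.append(
            f"| {row['benchmark_id']} | {row['submission_id']} | {row['metric']} | {row['score']:.6f} |"
        )
    out_path.write_text("\n".join(lines) + "\n")
    return out_path


# Assumed usage; the real commit_artifact() signature may differ:
# from scidex.atlas.artifact_commit import commit_artifact
# commit_artifact(write_leaderboard_summary(rows, limit=25))
```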