[Forge] Benchmark evaluation harness — submission pipeline, automated scoring, leaderboard (done)

Build an automated benchmark evaluation harness for SciDEX's 6 seeded neurodegeneration benchmarks. Currently 6 benchmarks exist with 1 submission and no scoring pipeline — benchmarks are decorative without evaluation. Read spec: docs/planning/specs/forge_benchmark_evaluation_harness_spec.md. Steps: (1) inspect benchmark_submissions schema; (2) build scripts/evaluate_benchmark_submission.py that loads ground truth, scores predictions against it using the benchmark's stated metric (AUROC, NDCG, Spearman, etc.), and writes score+scored_at to submission row; (3) run on the 1 existing submission; (4) create 1-2 additional submissions from top hypotheses; (5) update leaderboard_top_score in benchmarks; (6) document submission format in docs/design/benchmark_submission_format.md. Do NOT modify ground truth data. Accept any 2+ metric types.
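
For orientation, the write-back in step (2) could assemble a row roughly like the sketch below. Only the column names (score, scored_at, plus the metric/method/notes fields described in the spec) come from the task; the helper name, hypothesis title, and score value are illustrative examples, and the benchmark id is the one mentioned later in the work log.

```python
# Sketch of assembling one hypothesis-evaluation record; only the column names
# score/scored_at/metric/method/notes come from the spec, the rest is an example.
from datetime import datetime, timezone


def build_submission_row(benchmark_id: str, hypothesis_title: str,
                         metric: str, score: float, rationale: str) -> dict:
    """Assemble the values written into one benchmark_submissions row."""
    return {
        "benchmark_id": benchmark_id,
        "method": "hypothesis_eval",       # fixed method tag from the spec
        "metric": metric,                  # e.g. "auroc", "ndcg@50"
        "score": score,
        "scored_at": datetime.now(timezone.utc),
        "notes": f"{hypothesis_title} | {rationale}",
    }


# Example values only; real rows come from scored hypotheses or prediction files.
row = build_submission_row(
    benchmark_id="bench_ad_protein_aggregation_propensity_v1",
    hypothesis_title="TREM2 modulates AD protein aggregation",
    metric="auroc",
    score=0.58,
    rationale="ranked gene list extracted from hypothesis description",
)
```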

Git Commits (4)

[Forge] Add benchmark evaluation harness: scoring pipeline, leaderboard API, docs [task:daf64586-a66f-4e8c-9b7f-a07c78eaa17b] (#1273) 2026-04-28
[Forge] Add benchmark evaluation harness: scoring pipeline, leaderboard API, docs [task:daf64586-a66f-4e8c-9b7f-a07c78eaa17b] (#1273) 2026-04-28
Squash merge: orchestra/task/80ffb77b-ambitious-quest-task-generator-xhigh-eff (34 commits) (#1264) 2026-04-28
[Senate] Cycle 4: create market-resolution/benchmark/validation tasks; log world-model [task:80ffb77b-8391-493c-8644-37086c8e2e3c] 2026-04-28
Spec File

Goal

Build and run a benchmark evaluation harness that scores SciDEX's top hypotheses against
the 6 registered ML benchmarks. Currently there are 6 benchmarks with only 1 submission
(the seeding task 1186a9ab created the registry; no evaluation harness was built).

This converts hypotheses from "interesting debate subjects" into **testable predictions with
quantified baseline performance** — a qualitative leap in scientific credibility.

Current state (as of 2026-04-28)

| Benchmark                                   | Task type       | Baseline      | Submissions |
|---------------------------------------------|-----------------|---------------|-------------|
| AD Protein Aggregation Propensity           | classification  | AUROC 0.72    | 0           |
| PD Dopaminergic Neuron Expression Signature | classification  | AUPRC 0.61    | 0           |
| ALS Motor Neuron Survival Gene Prediction   | target_ranking  | NDCG@50 0.43  | 0           |
| Neurodegeneration Drug Target Druggability  | regression      | Spearman 0.52 | 0           |
| Glymphatic Clearance Biomarker Panel        | feature_ranking | NDCG@20 0.49  | 0           |
| OT-AD Target Ranking                        | target_ranking  | (none)        | 1           |

What the agent should do

  • Read docs/planning/specs/1186a9ab-seed-neurodegeneration-ml-benchmark-regi.md for the benchmark
    schema and how submissions work. Inspect the tables with \d benchmarks and \d benchmark_submissions.
  • Select the top 50 hypotheses by composite_score that have a target_gene or mention a
    mechanistically relevant gene/pathway matchable to a benchmark domain.
  • Match hypotheses to benchmarks: for each hypothesis, determine which benchmark(s) it
    makes predictions about (e.g., a hypothesis about TREM2 and AD aggregation → AD Protein Aggregation benchmark).
  • Score each hypothesis against matched benchmarks: extract the predicted gene/protein
    ranked list or binary prediction from the hypothesis description, then run a deterministic
    scoring function against the benchmark's answer key (stored in artifacts); a sketch of such
    a scorer follows this list.
  • Store each evaluation as a benchmark_submissions row with score, metric, method='hypothesis_eval',
    and notes containing the hypothesis title and extraction rationale.
  • Write a summary artifact: a markdown table of the top 25 hypothesis-benchmark scores, stored
    in data/scidex-artifacts/ via commit_artifact().
  • Expose a /api/benchmarks/<id>/leaderboard endpoint returning the top submissions ranked by score.
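
A minimal sketch of the deterministic scoring function referenced above, dispatching on the benchmark's stated metric. It assumes scikit-learn and SciPy are available; the function names and dict-based inputs are illustrative rather than the actual evaluator API.

```python
# Illustrative metric dispatch; function and argument names are assumptions,
# not the real evaluator API. Assumes scikit-learn and SciPy are installed.
from math import log2

from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score, roc_auc_score


def ndcg_at_k(relevance: dict[str, float], ranking: list[str], k: int) -> float:
    """NDCG@k of a ranked item list against graded relevance labels."""
    gains = [relevance.get(item, 0.0) for item in ranking[:k]]
    dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def score_predictions(metric: str, truth: dict[str, float],
                      preds: dict[str, float], k: int = 50) -> float:
    """Score predictions (item -> value) against an answer key with one metric."""
    shared = sorted(set(truth) & set(preds))
    y_true = [truth[g] for g in shared]
    y_pred = [preds[g] for g in shared]
    if metric == "auroc":
        return float(roc_auc_score(y_true, y_pred))
    if metric == "auprc":
        return float(average_precision_score(y_true, y_pred))
    if metric == "spearman":
        return float(spearmanr(y_true, y_pred)[0])
    if metric.startswith("ndcg"):
        ranking = sorted(preds, key=preds.get, reverse=True)
        return ndcg_at_k(truth, ranking, k)
    raise ValueError(f"unsupported metric: {metric}")
```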
    Why this is not a recurring CI duplicate

    No recurring CI covers benchmark evaluation. The closest is [Forge] CI: Test all scientific tools,
    which checks availability, not evaluation logic. This task builds a new capability (matching
    hypotheses to benchmarks and scoring them) that does not exist today.

    Success metrics

    • ≥ 30 benchmark_submissions rows with non-null scores
    • At least 3 benchmarks have ≥ 5 submissions
    • /api/benchmarks/<id>/leaderboard returns ranked list
    • Summary artifact committed to data/scidex-artifacts/

    Do NOT

    • Modify the benchmark answer keys (read-only)
    • Use LLM scoring as a substitute for the defined metrics (use the benchmark's stated metric)
    • Create new benchmarks (out of scope for this task)

    Work Log

    2026-04-28 — Created by ambitious-quest-task-generator Cycle 3

    6 benchmarks exist with only 1 submission. Evaluation harness is absent — seeding task built
    the registry but not the evaluator. This task builds and runs the evaluator.

    2026-04-29 — Iteration 1 implementation plan

    Live staleness review found the task is still relevant but partially advanced: the database
    already contains scored hypothesis-evaluation submissions for all 6 benchmarks, with at least
    21 scored rows per benchmark and denormalized leaderboard_top_score populated. The remaining
    gap is durable repo code: there is no committed scripts/evaluate_benchmark_submission.py,
    no documented file format, and no JSON /api/benchmarks/{id}/leaderboard endpoint.

    This iteration will add a reusable deterministic evaluator for submitted prediction files,
    support AUROC/AUPRC/NDCG@K/Spearman scoring against the read-only registered answer keys,
    upsert benchmark_submissions rows and the denormalized benchmark leaderboard fields, expose
    the JSON leaderboard endpoint from the mounted Forge router, and document the JSON/CSV
    submission formats.
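
    As a sketch only, since the authoritative layout is whatever docs/design/benchmark_submission_format.md
    specifies: a loader for JSON/CSV prediction files might look roughly like this, assuming each
    record carries an id and a score field.

```python
# Assumed prediction-file layout: a JSON list of records or a CSV, each record
# carrying "id" and "score" fields. The real format is defined in
# docs/design/benchmark_submission_format.md, not here.
import csv
import json
from pathlib import Path


def load_predictions(path: str) -> dict[str, float]:
    """Load a JSON or CSV prediction file into an id -> score mapping."""
    p = Path(path)
    if p.suffix.lower() == ".json":
        records = json.loads(p.read_text())
    else:
        with p.open(newline="") as fh:
            records = list(csv.DictReader(fh))
    return {str(r["id"]): float(r["score"]) for r in records}
```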

    2026-04-29 — Iteration 1 implementation notes

    Implemented the durable submission path: scidex/forge/benchmark_evaluation.py now computes
    AUROC, AUPRC, NDCG@K, and Spearman without LLM scoring; scripts/evaluate_benchmark_submission.py
    loads JSON/CSV predictions, scores them, upserts benchmark_submissions, and refreshes
    leaderboard denormalizations. Added GET /api/benchmarks/{benchmark_id}/leaderboard in the
    mounted Forge router and documented submission formats in docs/design/benchmark_submission_format.md.
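
    For orientation, the leaderboard route has roughly the shape sketched below; the real handler
    lives in api_routes/forge.py, and the query helper and response fields here are stand-ins
    rather than the actual implementation.

```python
# Rough shape of the leaderboard route. The canned query helper and the
# response fields are stand-ins; see api_routes/forge.py for the real handler.
from fastapi import APIRouter

router = APIRouter()


def fetch_scored_submissions(benchmark_id: str) -> list[dict]:
    """Stand-in for the real database query; returns canned rows for the sketch."""
    return [
        {"id": "sub_1", "method": "hypothesis_eval", "metric": "auroc", "score": 0.57},
        {"id": "sub_2", "method": "hypothesis_eval", "metric": "auroc", "score": 0.61},
    ]


@router.get("/api/benchmarks/{benchmark_id}/leaderboard")
def benchmark_leaderboard(benchmark_id: str, limit: int = 25) -> list[dict]:
    """Return the top submissions for one benchmark, ranked by score."""
    rows = sorted(fetch_scored_submissions(benchmark_id),
                  key=lambda r: r["score"], reverse=True)
    return [
        {"rank": i + 1, "submission_id": r["id"], "method": r["method"],
         "metric": r["metric"], "score": r["score"]}
        for i, r in enumerate(rows[:limit])
    ]
```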

    Verification: focused unit tests passed (3 passed); syntax checks passed for the new module,
    CLI, and Forge router; a dry-run AD aggregation submission scored AUROC 0.571429; a real smoke
    submission eval_smoke:daf64586:ad_aggregation was written with AUROC 0.571429; and a local
    FastAPI TestClient returned HTTP 200 with two ranked leaderboard rows for
    bench_ad_protein_aggregation_propensity_v1.

    2026-04-29 — Iteration 1 continuation

    Added the missing summary artifact path to the evaluator: the CLI now supports
    --write-summary-artifact with --summary-limit 25, which renders the top accepted scored
    submissions to benchmarks/leaderboard_summary.md under SCIDEX_ARTIFACTS_ROOT
    and attempts to commit it through scidex.atlas.artifact_commit.commit_artifact().
    This keeps the report generation in the same deterministic scoring module as the
    submission evaluator instead of making a one-off artifact file.
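
    A minimal sketch of the summary-table rendering that path performs, assuming the scored rows
    are already loaded; the row field names are assumptions, and the real report is produced by
    scidex/forge/benchmark_evaluation.py.

```python
# Minimal sketch of rendering the top scored submissions as a markdown table,
# as the --write-summary-artifact path does. Row field names are assumptions.
def render_summary(rows: list[dict], limit: int = 25) -> str:
    """Render scored submissions (highest score first) as a markdown table."""
    top = sorted(rows, key=lambda r: r["score"], reverse=True)[:limit]
    lines = [
        "| Benchmark | Method | Metric | Score |",
        "| --- | --- | --- | --- |",
    ]
    for r in top:
        lines.append(
            f"| {r['benchmark']} | {r['method']} | {r['metric']} | {r['score']:.4f} |"
        )
    return "\n".join(lines) + "\n"
```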

    Verification: pytest tests/test_benchmark_evaluation.py -q passed (4 passed); python3 -m py_compile scidex/forge/benchmark_evaluation.py scripts/evaluate_benchmark_submission.py api_routes/forge.py
    passed; running python3 scripts/evaluate_benchmark_submission.py --write-summary-artifact --summary-limit 25
    wrote a 25-row markdown summary from live PostgreSQL data. The artifact commit attempt
    returned commit_ok=false in this sandbox because the task worktree's submodule gitdir
    is read-only/uninitialised (git submodule update --init --recursive data/scidex-artifacts
    could not create .git/worktrees/.../modules/data/scidex-artifacts), but the CLI path
    and report contents were exercised.
