Build and run a benchmark evaluation harness that scores SciDEX's top hypotheses against the six registered ML benchmarks. The seeding task (1186a9ab) created the registry but no evaluation harness, so the six benchmarks currently hold only one submission in total.
**Why this matters:** This converts hypotheses from debate subjects into testable predictions with quantified baseline performance — a qualitative leap in scientific credibility.
**Current state:**
- AD Protein Aggregation Propensity: 0 submissions
- PD Dopaminergic Neuron Expression Signature: 0 submissions
- ALS Motor Neuron Survival Gene Prediction: 0 submissions
- Neurodegeneration Drug Target Druggability: 0 submissions
- Glymphatic Clearance Biomarker Panel: 0 submissions
- OT-AD Target Ranking: 1 submission
**What to do:**
1. Read docs/planning/specs/1186a9ab-seed-neurodegeneration-ml-benchmark-regi.md for benchmark schema
2. Inspect the schema: \d benchmarks, \d benchmark_submissions (psql)
3. Select the top 50 hypotheses by composite_score whose target_gene is matchable to a benchmark domain
4. Match hypotheses to benchmarks by disease/gene overlap (selection and matching are sketched after this list)
5. Score each via the benchmark's stated metric against the answer key in its artifacts (metric dispatch sketched below)
6. Store results as benchmark_submissions rows with score, metric, and method='hypothesis_eval'
7. Write a summary artifact as a markdown table via commit_artifact() (persistence and summary sketched below)
8. Add a per-benchmark leaderboard endpoint, e.g. /api/benchmarks/{benchmark_id}/leaderboard (route sketched below)
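A minimal sketch of steps 3 and 4, assuming Postgres via psycopg2. Only composite_score, target_gene, and the benchmarks/benchmark_submissions table names come from this task; the hypotheses table name, the disease and domain_genes columns, and the connection string are assumptions to verify against the seeded schema:

```python
import psycopg2

# Steps 3-4: pull the top hypotheses and pair each with a benchmark.
conn = psycopg2.connect("dbname=scidex")  # illustrative connection string

with conn.cursor() as cur:
    cur.execute("""
        SELECT id, target_gene, disease, composite_score
        FROM hypotheses
        WHERE target_gene IS NOT NULL
        ORDER BY composite_score DESC
        LIMIT 50
    """)
    hypotheses = cur.fetchall()

    cur.execute("SELECT id, name, disease, domain_genes FROM benchmarks")
    benchmarks = cur.fetchall()

# Match on disease first, then require the gene to sit in the benchmark's domain.
matches = []
for h_id, gene, disease, score in hypotheses:
    for b_id, name, b_disease, domain_genes in benchmarks:
        if disease == b_disease and gene in (domain_genes or []):
            matches.append((h_id, b_id, gene, score))
```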
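For step 5, a hedged sketch of metric dispatch. The real metric names and answer-key format live in each benchmark's spec and artifacts; the ones below (auroc, auprc, spearman, a gene-to-label dict) are placeholders:

```python
from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score, roc_auc_score

def score_submission(metric: str, answer_key: dict, predictions: dict) -> float:
    # Score only genes present in both the answer key and the predictions.
    genes = sorted(set(answer_key) & set(predictions))
    y_true = [answer_key[g] for g in genes]
    y_pred = [predictions[g] for g in genes]
    if metric == "auroc":
        return roc_auc_score(y_true, y_pred)
    if metric == "auprc":
        return average_precision_score(y_true, y_pred)
    if metric == "spearman":
        rho, _ = spearmanr(y_true, y_pred)
        return rho
    # Per the constraints: fail loudly rather than fall back to LLM scoring.
    raise ValueError(f"unknown metric: {metric}")
```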
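For steps 6 and 7, a sketch of persistence plus the summary table. The score, metric, and method='hypothesis_eval' columns are from this task; benchmark_id and hypothesis_id are assumed column names, and commit_artifact() is named in step 7 but its signature here is a guess:

```python
def persist_and_summarize(conn, scored):
    """scored: iterable of (benchmark_id, hypothesis_id, metric, score) tuples."""
    lines = ["| benchmark | hypothesis | metric | score |", "|---|---|---|---|"]
    with conn.cursor() as cur:
        for b_id, h_id, metric, score in scored:
            cur.execute(
                """INSERT INTO benchmark_submissions
                       (benchmark_id, hypothesis_id, metric, score, method)
                   VALUES (%s, %s, %s, %s, 'hypothesis_eval')""",
                (b_id, h_id, metric, score),
            )
            lines.append(f"| {b_id} | {h_id} | {metric} | {score:.4f} |")
    conn.commit()
    # commit_artifact is the harness helper named in step 7; signature assumed.
    commit_artifact("benchmark_eval_summary.md", "\n".join(lines))
```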
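For step 8, a sketch of the leaderboard route. The framework (FastAPI), the {benchmark_id} path parameter, and the column names are assumptions about the existing API layer, and descending sort assumes higher-is-better metrics:

```python
import psycopg2
from fastapi import FastAPI

app = FastAPI()  # assumes the service exposes, or can mount into, a FastAPI app

@app.get("/api/benchmarks/{benchmark_id}/leaderboard")
def leaderboard(benchmark_id: int, limit: int = 20):
    conn = psycopg2.connect("dbname=scidex")  # illustrative connection string
    try:
        with conn.cursor() as cur:
            cur.execute(
                """SELECT hypothesis_id, metric, score, method
                   FROM benchmark_submissions
                   WHERE benchmark_id = %s
                   ORDER BY score DESC
                   LIMIT %s""",
                (benchmark_id, limit),
            )
            cols = [d[0] for d in cur.description]
            return [dict(zip(cols, row)) for row in cur.fetchall()]
    finally:
        conn.close()
```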
**Spec:** docs/planning/specs/forge_benchmark_evaluation_harness_spec.md
**Success:** >= 30 benchmark_submissions rows; >= 3 benchmarks with >= 5 submissions each; the leaderboard endpoint returns ranked submissions per benchmark
**Do NOT:** Use LLM scoring as substitute for defined metrics; modify answer keys; create new benchmarks
**Completion Notes:**
Auto-release: this non-recurring task produced no commits this iteration; requeuing for the next cycle.