[Forge] Benchmark evaluation harness — run top 50 hypotheses through 6 registered benchmarks, store predictive scores

Build and run a benchmark evaluation harness that scores SciDEX's top hypotheses against the 6 registered ML benchmarks. Currently the 6 benchmarks have only 1 submission between them — the seeding task (1186a9ab) created the registry, but no evaluation harness was built.

**Why this matters:** This converts hypotheses from debate subjects into testable predictions with quantified baseline performance — a qualitative leap in scientific credibility.

**Current state:**
- AD Protein Aggregation Propensity: 0 submissions
- PD Dopaminergic Neuron Expression Signature: 0 submissions
- ALS Motor Neuron Survival Gene Prediction: 0 submissions
- Neurodegeneration Drug Target Druggability: 0 submissions
- Glymphatic Clearance Biomarker Panel: 0 submissions
- OT-AD Target Ranking: 1 submission

**What to do:**
1. Read docs/planning/specs/1186a9ab-seed-neurodegeneration-ml-benchmark-regi.md for the benchmark schema
2. Read the table schemas: `\d benchmarks`, `\d benchmark_submissions`
3. Select the top 50 hypotheses by composite_score whose target_gene is matchable to a benchmark domain
4. Match hypotheses to benchmarks by disease/gene overlap
5. Score each via the benchmark's stated metric against the answer key in artifacts
6. Store results as benchmark_submissions rows with score, metric, method='hypothesis_eval'
7. Write a summary artifact as a markdown table via commit_artifact()
8. Add a /api/benchmarks//leaderboard endpoint

**Spec:** docs/planning/specs/forge_benchmark_evaluation_harness_spec.md

**Success:** >= 30 benchmark_submissions rows; >= 3 benchmarks with >= 5 submissions; leaderboard endpoint works

**Do NOT:** Use LLM scoring as a substitute for the defined metrics; modify answer keys; create new benchmarks
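The match-score-store loop (steps 3–6) can be sketched as below. This is a minimal illustration, not the harness itself: the column names (`target_gene`, `composite_score`), the toy answer key, and the use of precision@k as the benchmark's stated metric are all assumptions — the real harness must read hypotheses and answer keys from the database/artifacts and apply whichever metric each benchmark declares.

```python
# Hedged sketch of steps 3-6. Hypothetical column names and a toy
# answer key; precision@k stands in for "the benchmark's stated metric".

def precision_at_k(ranked_genes, answer_key, k=10):
    """Fraction of the top-k predicted genes present in the answer key."""
    top_k = ranked_genes[:k]
    return sum(1 for g in top_k if g in answer_key) / k

# Toy stand-ins for rows selected by composite_score (step 3).
hypotheses = [
    {"target_gene": "MAPT", "composite_score": 0.91},
    {"target_gene": "APP",  "composite_score": 0.87},
    {"target_gene": "SNCA", "composite_score": 0.84},
    {"target_gene": "SOD1", "composite_score": 0.80},
]
answer_key = {"MAPT", "APP", "GRN"}  # toy answer key (step 5)

# Rank by composite_score, score against the key, build the row (step 6).
ranked = [h["target_gene"]
          for h in sorted(hypotheses, key=lambda h: -h["composite_score"])]
submission = {
    "score": precision_at_k(ranked, answer_key, k=4),
    "metric": "precision_at_4",
    "method": "hypothesis_eval",
}
print(submission)
```

With the toy data above, two of the top four genes (MAPT, APP) appear in the key, giving a score of 0.5; the resulting dict maps directly onto a `benchmark_submissions` insert.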
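The leaderboard endpoint (step 8) reduces to one query: top submissions for a benchmark, ordered by score. A sketch of that query follows, using an in-memory SQLite stand-in for the real Postgres tables; the `benchmark_id` parameter, the sample rows, and the column set are hypothetical — confirm them against `\d benchmark_submissions` before wiring this into the API.

```python
# Hedged sketch of the query behind the leaderboard endpoint.
# SQLite stands in for Postgres; schema and sample rows are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE benchmark_submissions (
    benchmark_id TEXT, method TEXT, score REAL)""")
conn.executemany(
    "INSERT INTO benchmark_submissions VALUES (?, ?, ?)",
    [
        ("ot-ad-target-ranking", "hypothesis_eval", 0.62),
        ("ot-ad-target-ranking", "baseline",        0.41),
        ("ot-ad-target-ranking", "hypothesis_eval", 0.55),
    ],
)

def leaderboard(benchmark_id, limit=10):
    """Top submissions for one benchmark, highest score first."""
    cur = conn.execute(
        "SELECT method, score FROM benchmark_submissions "
        "WHERE benchmark_id = ? ORDER BY score DESC LIMIT ?",
        (benchmark_id, limit),
    )
    return [{"method": m, "score": s} for m, s in cur]

print(leaderboard("ot-ad-target-ranking"))
```

The HTTP layer is then a thin wrapper that serializes this list as JSON for the route; with parameterized queries the benchmark identifier from the path never reaches the SQL string directly.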

Completion Notes

Auto-release: this non-recurring task produced no commits this iteration; requeuing for the next cycle.
