Build and run a benchmark evaluation harness that scores SciDEX's top hypotheses against the six registered ML benchmarks. The seeding task (1186a9ab) created the registry but no evaluation harness, so the six benchmarks currently hold only one submission in total.
**Why this matters:** This converts hypotheses from debate subjects into testable predictions with quantified baseline performance — a qualitative leap in scientific credibility.
**Current state:**
- AD Protein Aggregation Propensity: 0 submissions
- PD Dopaminergic Neuron Expression Signature: 0 submissions
- ALS Motor Neuron Survival Gene Prediction: 0 submissions
- Neurodegeneration Drug Target Druggability: 0 submissions
- Glymphatic Clearance Biomarker Panel: 0 submissions
- OT-AD Target Ranking: 1 submission
**What to do:**
1. Read docs/planning/specs/1186a9ab-seed-neurodegeneration-ml-benchmark-regi.md for benchmark schema
2. Inspect the schema: \d benchmarks, \d benchmark_submissions (psql)
3. Select the top 50 hypotheses by composite_score whose target_gene is matchable to a benchmark domain
4. Match hypotheses to benchmarks by disease/gene overlap (selection and matching are sketched after this list)
5. Score each via the benchmark's stated metric against the answer key in its artifacts (metric dispatch sketched below)
6. Store results as benchmark_submissions rows with score, metric, and method='hypothesis_eval'
7. Write a summary artifact as a markdown table via commit_artifact() (persistence and summary sketched below)
8. Add a per-benchmark leaderboard endpoint, e.g. /api/benchmarks/{benchmark_id}/leaderboard (route sketched below)
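A minimal sketch of steps 3 and 4, assuming Postgres via psycopg2. Only composite_score, target_gene, and the benchmarks/benchmark_submissions table names come from this task; the hypotheses table name, the disease and domain_genes columns, and the connection string are assumptions to verify against the seeded schema:

```python
import psycopg2

# Steps 3-4: pull the top hypotheses and pair each with a benchmark.
conn = psycopg2.connect("dbname=scidex")  # illustrative connection string

with conn.cursor() as cur:
    cur.execute("""
        SELECT id, target_gene, disease, composite_score
        FROM hypotheses
        WHERE target_gene IS NOT NULL
        ORDER BY composite_score DESC
        LIMIT 50
    """)
    hypotheses = cur.fetchall()

    cur.execute("SELECT id, name, disease, domain_genes FROM benchmarks")
    benchmarks = cur.fetchall()

# Match on disease first, then require the gene to sit in the benchmark's domain.
matches = []
for h_id, gene, disease, score in hypotheses:
    for b_id, name, b_disease, domain_genes in benchmarks:
        if disease == b_disease and gene in (domain_genes or []):
            matches.append((h_id, b_id, gene, score))
```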
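For step 5, a hedged sketch of metric dispatch. The real metric names and answer-key format live in each benchmark's spec and artifacts; the ones below (auroc, auprc, spearman, a gene-to-label dict) are placeholders:

```python
from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score, roc_auc_score

def score_submission(metric: str, answer_key: dict, predictions: dict) -> float:
    # Score only genes present in both the answer key and the predictions.
    genes = sorted(set(answer_key) & set(predictions))
    y_true = [answer_key[g] for g in genes]
    y_pred = [predictions[g] for g in genes]
    if metric == "auroc":
        return roc_auc_score(y_true, y_pred)
    if metric == "auprc":
        return average_precision_score(y_true, y_pred)
    if metric == "spearman":
        rho, _ = spearmanr(y_true, y_pred)
        return rho
    # Per the constraints: fail loudly rather than fall back to LLM scoring.
    raise ValueError(f"unknown metric: {metric}")
```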
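For steps 6 and 7, a sketch of persistence plus the summary table. The score, metric, and method='hypothesis_eval' columns are from this task; benchmark_id and hypothesis_id are assumed column names, and commit_artifact() is named in step 7 but its signature here is a guess:

```python
def persist_and_summarize(conn, scored):
    """scored: iterable of (benchmark_id, hypothesis_id, metric, score) tuples."""
    lines = ["| benchmark | hypothesis | metric | score |", "|---|---|---|---|"]
    with conn.cursor() as cur:
        for b_id, h_id, metric, score in scored:
            cur.execute(
                """INSERT INTO benchmark_submissions
                       (benchmark_id, hypothesis_id, metric, score, method)
                   VALUES (%s, %s, %s, %s, 'hypothesis_eval')""",
                (b_id, h_id, metric, score),
            )
            lines.append(f"| {b_id} | {h_id} | {metric} | {score:.4f} |")
    conn.commit()
    # commit_artifact is the harness helper named in step 7; signature assumed.
    commit_artifact("benchmark_eval_summary.md", "\n".join(lines))
```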
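For step 8, a sketch of the leaderboard route. The framework (FastAPI), the {benchmark_id} path parameter, and the column names are assumptions about the existing API layer, and descending sort assumes higher-is-better metrics:

```python
import psycopg2
from fastapi import FastAPI

app = FastAPI()  # assumes the service exposes, or can mount into, a FastAPI app

@app.get("/api/benchmarks/{benchmark_id}/leaderboard")
def leaderboard(benchmark_id: int, limit: int = 20):
    conn = psycopg2.connect("dbname=scidex")  # illustrative connection string
    try:
        with conn.cursor() as cur:
            cur.execute(
                """SELECT hypothesis_id, metric, score, method
                   FROM benchmark_submissions
                   WHERE benchmark_id = %s
                   ORDER BY score DESC
                   LIMIT %s""",
                (benchmark_id, limit),
            )
            cols = [d[0] for d in cur.description]
            return [dict(zip(cols, row)) for row in cur.fetchall()]
    finally:
        conn.close()
```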
**Spec:** docs/planning/specs/forge_benchmark_evaluation_harness_spec.md
**Success:** >= 30 benchmark_submissions rows; >= 3 benchmarks with >= 5 submissions each; the leaderboard endpoint returns ranked submissions per benchmark
**Do NOT:** Use LLM scoring as substitute for defined metrics; modify answer keys; create new benchmarks
**Completion Notes:**
Auto-release: this non-recurring task produced no commits this iteration; requeuing for the next cycle.