Six benchmarks have 227 submissions, all with primary_score populated (avg 0.6635). The benchmark evaluation harness (daf64586, merged 2026-04-28) added a per-benchmark leaderboard API. Still missing: (1) an overall cross-benchmark leaderboard, (2) hypothesis linkage from submissions via model_artifact_id, (3) benchmark rank on hypothesis records.
NOTE: The per-benchmark leaderboard at GET /api/benchmarks/{benchmark_id}/leaderboard already exists. Build what's still missing.
WHAT TO DO:
1. Fix status tracking: UPDATE benchmark_submissions SET status='scored' WHERE primary_score IS NOT NULL AND status='accepted'
2. Link submissions to hypotheses: SELECT bs.id, bs.model_artifact_id, bs.primary_score, a.hypothesis_id FROM benchmark_submissions bs LEFT JOIN artifacts a ON a.id = bs.model_artifact_id WHERE a.hypothesis_id IS NOT NULL
3. For each hypothesis with benchmark submissions, compute the best primary_score across all benchmarks and write it to a hypothesis field (benchmark_top_score or similar — check whether the column exists and add a migration if not; see the first sketch after this list)
4. Add GET /api/benchmarks/leaderboard (overall, not per-benchmark) to api.py or api_routes/forge.py, returning the top 20 submissions across all benchmarks with benchmark_title, submitter_id, primary_score, and hypothesis_id (see the endpoint sketch after this list)
5. Add a /forge/benchmarks HTML page rendering the overall leaderboard (see the page sketch after this list)
6. Test that GET /api/benchmarks/leaderboard returns 200 with valid JSON (see the test sketch after this list)
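
For step 3, a minimal sketch of the score rollup, assuming a SQLite backing database with a hypotheses table keyed by id; the column name benchmark_top_score, the database path, and the function name are assumptions, not confirmed by the task.

```python
# Sketch only: backfill a best-benchmark-score field on hypotheses.
# Assumed: SQLite DB at db_path, hypotheses(id, ...) table, column name benchmark_top_score.
import sqlite3

def backfill_benchmark_top_scores(db_path: str = "forge.db") -> None:
    conn = sqlite3.connect(db_path)
    try:
        # Lightweight migration guard: add the column only if it is missing.
        cols = {row[1] for row in conn.execute("PRAGMA table_info(hypotheses)")}
        if "benchmark_top_score" not in cols:
            conn.execute("ALTER TABLE hypotheses ADD COLUMN benchmark_top_score REAL")

        # Best primary_score per hypothesis across all benchmarks,
        # joining submissions to hypotheses through the model artifact.
        conn.execute(
            """
            UPDATE hypotheses
            SET benchmark_top_score = (
                SELECT MAX(bs.primary_score)
                FROM benchmark_submissions bs
                JOIN artifacts a ON a.id = bs.model_artifact_id
                WHERE a.hypothesis_id = hypotheses.id
                  AND bs.primary_score IS NOT NULL
            )
            WHERE id IN (
                SELECT a.hypothesis_id
                FROM benchmark_submissions bs
                JOIN artifacts a ON a.id = bs.model_artifact_id
                WHERE a.hypothesis_id IS NOT NULL
            )
            """
        )
        conn.commit()
    finally:
        conn.close()
```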
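For step 4, a sketch of the overall leaderboard endpoint, assuming the app is Flask and that a benchmarks table with id and title columns exists and that benchmark_submissions has a benchmark_id column; those names, the blueprint, and the DB path are assumptions beyond the task text.

```python
# Sketch only: overall (cross-benchmark) leaderboard endpoint, assuming Flask.
import sqlite3
from flask import Blueprint, jsonify

forge_bp = Blueprint("forge_benchmarks", __name__)

@forge_bp.get("/api/benchmarks/leaderboard")
def overall_leaderboard():
    conn = sqlite3.connect("forge.db")  # assumed DB path
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        """
        SELECT b.title AS benchmark_title,
               bs.submitter_id,
               bs.primary_score,
               a.hypothesis_id
        FROM benchmark_submissions bs
        JOIN benchmarks b ON b.id = bs.benchmark_id
        LEFT JOIN artifacts a ON a.id = bs.model_artifact_id
        WHERE bs.primary_score IS NOT NULL
        ORDER BY bs.primary_score DESC
        LIMIT 20
        """
    ).fetchall()
    conn.close()
    return jsonify({"leaderboard": [dict(r) for r in rows]})
```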
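For step 5, a minimal page route, again assuming Flask; the template name and the fetch_overall_leaderboard_rows helper are hypothetical placeholders for whatever query helper the endpoint above ends up using.

```python
# Sketch only: HTML page for the overall leaderboard, assuming Flask templates.
from flask import render_template

@forge_bp.get("/forge/benchmarks")
def benchmarks_page():
    # Reuse the overall-leaderboard query and hand the rows to a template.
    rows = fetch_overall_leaderboard_rows()  # hypothetical helper wrapping the SQL above
    return render_template("forge/benchmarks_leaderboard.html", leaderboard=rows)
```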
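For step 6, a test sketch assuming pytest with a Flask test client fixture named client; the fixture name and the "leaderboard" response key are assumptions.

```python
# Sketch only: smoke test for the new endpoint, assuming a Flask test client fixture.
def test_overall_leaderboard_returns_valid_json(client):
    resp = client.get("/api/benchmarks/leaderboard")
    assert resp.status_code == 200
    payload = resp.get_json()
    assert payload is not None
    assert "leaderboard" in payload  # assumed response shape
```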
DO NOT: delete primary_score values; change benchmark task definitions; touch scidex.db; duplicate the per-benchmark endpoint that already exists.
Spec: docs/planning/specs/forge_benchmark_leaderboard_quality_signal_spec.md (merged after task creation)