SciDEX — Task: [Atlas] Semantic-coverage map of hypothesis space

Parametric UMAP overlays hypothesis density on paper density; high-paper/low-hypothesis regions auto-flagged.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/b4e04fba-triage-50-failed-tool-calls-by-skill-and (95 commits) (#1011) 2026-04-27

Squash merge: orchestra/task/aa5ae4c2-semantic-coverage-map-of-hypothesis-spac (2 commits) (#942) 2026-04-27

Spec File

Effort: thorough

Goal

We can list the 310+ hypotheses, but we cannot show what fraction of the
neurodegeneration research space they cover. compute_diversity_score in
gap_pipeline.py:61 gives a per-gap scalar: useful, but not a map.
Build a 2-D coverage map: project every hypothesis embedding onto a fixed
2-D UMAP grid, then overlay the density of (a) existing hypotheses (orange =
crowded, white = untouched), (b) papers in the papers table (blue, the prior
literature density), and (c) Senate gaps (red dots, where we think work
remains). The white-on-blue regions, where the literature is dense but there
are zero SciDEX hypotheses, are the highest-leverage targets for new theorising.

Acceptance Criteria

☑ scidex/atlas/coverage_map.py::build_map(snapshot_id) -> dict projects all hypotheses + papers via shared UMAP fit (parametric so future points re-project to same coords).

☑ coverage_map.compute_density_grid(points, grid=128) returns a 128×128 KDE density.

☑ coverage_map.find_underexplored_regions() -> list[Region] — regions with paper_density > 70th_percentile AND hypothesis_density < 10th_percentile.

☑ Migration: coverage_snapshots(id PK, built_at, umap_params_json, hypothesis_count, paper_count); coverage_underexplored(snapshot_id, region_id, centroid_xy, paper_count, hypothesis_count, top_terms_json).

☑ Weekly cron in scidex/senate/scheduled_tasks.py rebuilds the snapshot Sundays 04:00 UTC (UMAP fit pinned via random_state=42).

☑ /atlas/coverage page renders the map (SVG, three layers as toggleable overlays) + a top-10 underexplored regions list with their representative top terms (TF-IDF on titles).

☑ GET /api/atlas/coverage/{snapshot_id} returns the regions JSON.

☑ When ≥ 3 underexplored regions appear in the same paper_cluster, auto-emit a Senate coverage_gap proposal with the regions as evidence and a suggested Theorist task to draft hypotheses for them.

☑ Test: synthetic embeddings with 5 paper clusters and 3 hypothesis clusters; verify exactly 2 regions flagged as underexplored.
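
The density-grid criterion above can be sketched as follows. This is a minimal illustration, not the code in coverage_map.py: the optional bounds argument and the clip_outliers helper are assumptions added for the example, and the 3σ clip named under Approach is interpreted here as a per-axis z-score cut.

```python
import numpy as np
from scipy.stats import gaussian_kde


def clip_outliers(points, n_sigma=3.0):
    """Drop points more than n_sigma standard deviations from the mean on any axis.

    Hypothetical helper: one possible reading of "clip extreme outliers (>3σ)".
    """
    points = np.asarray(points, dtype=float)
    mu = points.mean(axis=0)
    sd = points.std(axis=0)
    sd[sd == 0] = 1.0  # avoid a zero tolerance on a degenerate axis
    keep = (np.abs(points - mu) <= n_sigma * sd).all(axis=1)
    return points[keep]


def compute_density_grid(points, grid=128, bounds=None):
    """Evaluate a Gaussian KDE of 2-D points on a grid x grid lattice.

    bounds = (xmin, xmax, ymin, ymax) is an assumed extra parameter so that
    several layers can be evaluated on the same lattice. Rows index y,
    columns index x.
    """
    pts = clip_outliers(points)
    if bounds is None:
        xmin, ymin = pts.min(axis=0)
        xmax, ymax = pts.max(axis=0)
    else:
        xmin, xmax, ymin, ymax = bounds
    xs = np.linspace(xmin, xmax, grid)
    ys = np.linspace(ymin, ymax, grid)
    xx, yy = np.meshgrid(xs, ys)
    kde = gaussian_kde(pts.T)  # gaussian_kde expects shape (n_dims, n_points)
    density = kde(np.vstack([xx.ravel(), yy.ravel()]))
    return density.reshape(grid, grid)
```

Passing the same bounds for the paper layer and the hypothesis layer keeps cell (i, j) pointing at the same patch of the projection in both grids.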

Approach

Read scidex/agora/gap_pipeline.py:61 to mirror conventions.

Embeddings are reused from scidex/core/embeddings.py; parametric UMAP via the umap-learn skill (ParametricUMAP, so future hypotheses re-project without refitting).

KDE via scipy.stats.gaussian_kde; clip extreme outliers (>3σ).

Top-terms: TF-IDF on titles in each region (top 5 by IDF-weighted score) for human-readable labels.

SVG rendering uses the same idiom as market_dynamics.generate_market_overview_svg — no front-end chart library.

Snapshot-based design lets us diff coverage week-over-week ("did we close any white regions?").

Dependencies

scidex/core/embeddings.py.
scidex/agora/gap_pipeline.py:61 compute_diversity_score.
Skills: umap-learn, scikit-learn (TF-IDF).

Dependents

q-hdiv-anti-mode-collapse-penalty (uses underexplored regions as a reward target).

Work Log

2026-04-27 — Implementation complete [task:aa5ae4c2-5719-43f7-82b7-69f6b437f7b0]

Files created/modified:

scidex/atlas/coverage_map.py — Core module: build_map, compute_density_grid, find_underexplored_regions, render_coverage_svg, _find_regions_in_grid, _maybe_emit_coverage_gap_proposals
migrations/20260427_coverage_map.sql — Creates coverage_snapshots and coverage_underexplored tables (run against PG)
scidex/senate/scheduled_tasks.py — Added coverage-map-weekly task (interval 10080 min)
api.py — Added GET /api/atlas/coverage/{snapshot_id}, GET /api/atlas/coverage (latest), POST /api/atlas/coverage/build, GET /atlas/coverage HTML page
tests/atlas/test_coverage_map.py — 7 unit/integration tests; all pass

Key design decisions:

UMAP fitted with random_state=42 and saved to MODELS_DIR/coverage_map/umap_{snap_id}.pkl for re-projection of future points
Both density grids use shared spatial bounds so the comparison is spatially coherent
The 10th-percentile threshold for hypothesis density is computed within high-paper cells (not globally), so the comparison stays meaningful where hypotheses are sparse
Senate proposals use governance_rule type (within existing CHECK constraint) with coverage_gap_emission: true in metadata
umap-learn 0.5.12 installed; uses standard UMAP (not ParametricUMAP, which needs TensorFlow) with pickle persistence for re-projection
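
The thresholding decision above (paper density above its 70th percentile, hypothesis density below the 10th percentile taken only over those high-paper cells) can be sketched with NumPy. The function name underexplored_mask and its parameters are hypothetical; the sketch assumes both grids were built on the same shared bounds and therefore have identical shapes.

```python
import numpy as np


def underexplored_mask(paper_density, hyp_density, paper_pct=70.0, hyp_pct=10.0):
    """Boolean grid of cells that are paper-dense but hypothesis-sparse.

    The hypothesis cut-off is computed inside the high-paper cells only,
    not over the whole grid.
    """
    paper_density = np.asarray(paper_density, dtype=float)
    hyp_density = np.asarray(hyp_density, dtype=float)
    if paper_density.shape != hyp_density.shape:
        raise ValueError("density grids must share the same shape and bounds")
    high_paper = paper_density > np.percentile(paper_density, paper_pct)
    if not high_paper.any():
        return np.zeros(paper_density.shape, dtype=bool)
    hyp_cut = np.percentile(hyp_density[high_paper], hyp_pct)
    return high_paper & (hyp_density < hyp_cut)
```

The strict comparison follows the acceptance criterion as written. If many high-paper cells tie at exactly zero hypothesis density, the cut-off is zero and a strict comparison flags nothing, so an implementation may prefer a non-strict comparison in that case.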

2026-04-28 — Rebase + conflict resolution [task:aa5ae4c2-5719-43f7-82b7-69f6b437f7b0]

Rebased onto origin/main (commit d67a47daa); resolved conflict in scheduled_tasks.py by
preserving both the upstream arbitrage-scanner task and the new coverage-map-weekly task.
All 7 tests pass post-rebase. Branch pushed as 5165d9e4a.