SciDEX — Task: [Senate] Hot-path query optimizer for top-20 endpo

Sampled per-route DB-time instrumentation + EXPLAIN-driven index suggestions; senate page; CI gate against new seq-scans on top-20 routes.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Senate] Hot-path query optimizer: instrumentation, aggregator, senate page, CI gate [task:25f7f7f2-e05b-4c08-8414-492b5aa4ccaa] (#742)2026-04-27

Spec File

Effort: thorough

Goal

api.py is 87,283 LoC and serves scidex.ai. The 2026-04-21 PG pool
exhaustion incident showed we have no per-endpoint visibility into
which routes burn DB budget. The sibling q-obs-pg-slow-query-detector ranks individual SQL statements; this
spec attributes time at the route level, links each route's worst
query back to its FastAPI handler, and proposes concrete indexes.
Goal: cut p95 of the top-20 routes (by request volume × mean ms) by
≥30 % through targeted indexes only — no schema changes.

Acceptance Criteria

☐ Instrumentation in api_shared/db.py near

_get_pool (line 190): a thin wrapper around the pool's
connection() context manager that captures the calling
route from request.scope["route"].path (set by a new
_route_label_middleware in api.py adjacent to
_trace_id_middleware) and records to a new
route_db_time table:

route_label TEXT, query_hash TEXT, exec_ms FLOAT,
       rows INT, ts TIMESTAMP DEFAULT NOW()

.
Sampled at 5 % to keep write volume bounded.

☐ Migration migrations/<date>_route_db_time.sql —

table + 90-day retention via a daily

DELETE WHERE ts < NOW()
      - INTERVAL '90 days'

cron.

☐ Aggregator scripts/hot_path_optimizer.py:

computes top 20 by sum(exec_ms) over last 7 days; for
each, joins with pg_stat_statements (already present after
q-obs-pg-slow-query-detector) to pick the dominant
query_hash; runs EXPLAIN (FORMAT JSON, BUFFERS); flags
Seq Scan on tables >100K rows whose WHERE-column has no
btree index. Emits a remediation report at
data/perf/hot_path_report_<date>.md.

☐ Page GET /senate/hot-paths lists top-20 routes,

total exec time, p50/p95, dominant query, and a one-click
"draft index" button that opens an Orchestra task pre-populated
with a CREATE INDEX CONCURRENTLY statement and the EXPLAIN
diff (before/after) once applied.

☐ Apply-before-commit gate. New CI check

scripts/ci/check_no_seqscan_regression.py runs the report
against PR HEAD; fails the PR if a new top-20 route does a
seq-scan on a >100K-row table that wasn't on the prior
report's allowlist.

☐ Smoke evidence. Pick 3 routes from the first report,

apply suggested indexes, attach EXPLAIN before/after to the
Work Log; document the p95 delta on a synthetic load test.

Approach

Sampled wrapper first — verify <2 % CPU overhead under

wrk -t4 -c64 -d30s against /senate/quality-dashboard.

Aggregator standalone CLI; iterate report format with the

senate page consumer in mind.

CI gate is opt-in for first 2 weeks (warn-only), then

blocking.

Coordinate with q-obs-pg-slow-query-detector so the two pages

cross-link.

Dependencies

q-obs-pg-slow-query-detector — provides pg_stat_statements

remediation hints for the EXPLAIN join.

q-obs-trace-id-propagation — supplies request.scope adornment.

Dependents

q-perf-cdn-friendly-views — uses route timing to pick

cache-eligible routes.

Work Log

2026-04-27 14:30 UTC — Slot 73

Instrumentation in api_shared/db.py: Added _route_label_var ContextVar and

_get_route_label() lazy accessor to read the route label from the middleware.
Added _maybe_record_route_time() sampled at 5% that writes to route_db_time
table. Modified PGConnection.execute() to time each query and record
route/exec_ms in a finally block (never blocks on errors).

Middleware in api.py: Added _route_label_middleware adjacent to

_trace_id_middleware (line ~1454). Extracts request.scope["route"].path
via Starlette's Route object and stores it in the ContextVar.

Migration migrations/20260427_route_db_time.sql: Creates route_db_time

table (id, route_label, query_hash, exec_ms, rows, ts) with indexes on
(route_label, ts DESC) and (ts DESC). Applied to prod DB successfully.

Aggregator scripts/hot_path_optimizer.py: Computes top-20 routes by

sum(exec_ms) over 7-day window. Uses PERCENTILE_CONT for p50/p95.
Runs EXPLAIN on sample queries; flags Seq Scan on tables >100K rows whose
WHERE columns lack btree indexes. Emits markdown report to
data/perf/hot_path_report_<date>.md. First report generated:
data/perf/hot_path_report_20260427.md (empty table — instrumentation
just created, data accumulates as requests are served at 5% rate).

Senate page GET /senate/hot-paths: HTML page showing top-20 routes

table (calls, total ms, avg ms, p50, p95), p95/avg bar chart, latest
remediation report preview, link to slow-queries page. Uses standard
_get_cached_page caching pattern.

CI gate scripts/ci/check_no_seqscan_regression.py: Opt-in check

(warn-only first 14 days, then blocking). Parses prior + current report,
flags new seq-scan opportunities not in the allowlist. Exit 0 on clean,
exit 1 on regression.

Note: pg_stat_statements extension is not yet installed on this PG
instance. The query_hash column is present but empty; the EXPLAIN join
in hot_path_optimizer.py uses placeholder samples. Install with: CREATE EXTENSION IF NOT EXISTS pg_stat_statements; (sibling task q-obs-pg-slow-query-detector should cover this).