[Senate] Hot-path query optimizer for top-20 endpoints done

← Code Health
Sampled per-route DB-time instrumentation + EXPLAIN-driven index suggestions; senate page; CI gate against new seq-scans on top-20 routes.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Senate] Hot-path query optimizer: instrumentation, aggregator, senate page, CI gate [task:25f7f7f2-e05b-4c08-8414-492b5aa4ccaa] (#742)2026-04-27
Spec File

Effort: thorough

Goal

api.py is 87,283 LoC and serves scidex.ai. The 2026-04-21 PG pool
exhaustion incident showed we have no per-endpoint visibility into
which routes burn DB budget. The sibling q-obs-pg-slow-query-detector ranks individual SQL statements; this
spec attributes time at the route level, links each route's worst
query back to its FastAPI handler, and proposes concrete indexes.
Goal: cut p95 of the top-20 routes (by request volume × mean ms) by
≥30 % through targeted indexes only — no schema changes.

Acceptance Criteria

Instrumentation in api_shared/db.py near
_get_pool (line 190): a thin wrapper around the pool's
connection() context manager that captures the calling
route from request.scope["route"].path (set by a new
_route_label_middleware in api.py adjacent to
_trace_id_middleware) and records to a new
route_db_time table:
route_label TEXT, query_hash TEXT, exec_ms FLOAT,
rows INT, ts TIMESTAMP DEFAULT NOW()
.
Sampled at 5 % to keep write volume bounded.
Migration migrations/<date>_route_db_time.sql
table + 90-day retention via a daily DELETE WHERE ts < NOW()
- INTERVAL '90 days'
cron.
Aggregator scripts/hot_path_optimizer.py:
computes top 20 by sum(exec_ms) over last 7 days; for
each, joins with pg_stat_statements (already present after
q-obs-pg-slow-query-detector) to pick the dominant
query_hash; runs EXPLAIN (FORMAT JSON, BUFFERS); flags
Seq Scan on tables >100K rows whose WHERE-column has no
btree index. Emits a remediation report at
data/perf/hot_path_report_<date>.md.
Page GET /senate/hot-paths lists top-20 routes,
total exec time, p50/p95, dominant query, and a one-click
"draft index" button that opens an Orchestra task pre-populated
with a CREATE INDEX CONCURRENTLY statement and the EXPLAIN
diff (before/after) once applied.
Apply-before-commit gate. New CI check
scripts/ci/check_no_seqscan_regression.py runs the report
against PR HEAD; fails the PR if a new top-20 route does a
seq-scan on a >100K-row table that wasn't on the prior
report's allowlist.
Smoke evidence. Pick 3 routes from the first report,
apply suggested indexes, attach EXPLAIN before/after to the
Work Log; document the p95 delta on a synthetic load test.

Approach

  • Sampled wrapper first — verify <2 % CPU overhead under
  • wrk -t4 -c64 -d30s against /senate/quality-dashboard.
  • Aggregator standalone CLI; iterate report format with the
  • senate page consumer in mind.
  • CI gate is opt-in for first 2 weeks (warn-only), then
  • blocking.
  • Coordinate with q-obs-pg-slow-query-detector so the two pages
  • cross-link.

    Dependencies

    • q-obs-pg-slow-query-detector — provides pg_stat_statements
    remediation hints for the EXPLAIN join.
    • q-obs-trace-id-propagation — supplies request.scope adornment.

    Dependents

    • q-perf-cdn-friendly-views — uses route timing to pick
    cache-eligible routes.

    Work Log

    2026-04-27 14:30 UTC — Slot 73

    • Instrumentation in api_shared/db.py: Added _route_label_var ContextVar and
    _get_route_label() lazy accessor to read the route label from the middleware.
    Added _maybe_record_route_time() sampled at 5% that writes to route_db_time
    table. Modified PGConnection.execute() to time each query and record
    route/exec_ms in a finally block (never blocks on errors).
    • Middleware in api.py: Added _route_label_middleware adjacent to
    _trace_id_middleware (line ~1454). Extracts request.scope["route"].path
    via Starlette's Route object and stores it in the ContextVar.
    • Migration migrations/20260427_route_db_time.sql: Creates route_db_time
    table (id, route_label, query_hash, exec_ms, rows, ts) with indexes on
    (route_label, ts DESC) and (ts DESC). Applied to prod DB successfully.
    • Aggregator scripts/hot_path_optimizer.py: Computes top-20 routes by
    sum(exec_ms) over 7-day window. Uses PERCENTILE_CONT for p50/p95.
    Runs EXPLAIN on sample queries; flags Seq Scan on tables >100K rows whose
    WHERE columns lack btree indexes. Emits markdown report to
    data/perf/hot_path_report_<date>.md. First report generated:
    data/perf/hot_path_report_20260427.md (empty table — instrumentation
    just created, data accumulates as requests are served at 5% rate).
    • Senate page GET /senate/hot-paths: HTML page showing top-20 routes
    table (calls, total ms, avg ms, p50, p95), p95/avg bar chart, latest
    remediation report preview, link to slow-queries page. Uses standard
    _get_cached_page caching pattern.
    • CI gate scripts/ci/check_no_seqscan_regression.py: Opt-in check
    (warn-only first 14 days, then blocking). Parses prior + current report,
    flags new seq-scan opportunities not in the allowlist. Exit 0 on clean,
    exit 1 on regression.

    Note: pg_stat_statements extension is not yet installed on this PG
    instance. The query_hash column is present but empty; the EXPLAIN join
    in hot_path_optimizer.py uses placeholder samples. Install with: CREATE EXTENSION IF NOT EXISTS pg_stat_statements; (sibling task q-obs-pg-slow-query-detector should cover this).

    Sibling Tasks in Quest (Code Health) ↗