Fix the PostgreSQL connection pool leak that caused scidex.ai to 503 intermittently ("PoolTimeout: couldn't get a connection after 30.00 sec" — recurring). Add a local Prometheus-compatible monitoring stack (VictoriaMetrics + Grafana + node/postgres exporters) so the next pool saturation event is visible before users see errors.
PGConnection.close() in api_shared/db.py (previously db_pg.py) was a no-op (pass), with only an aspirational comment. The close_thread_db_connection middleware called _conn.close() at the end of every request expecting the connection to return to the pool — it never did. The underlying psycopg.Connection was orphaned, the pool's internal accounting still considered the slot checked out, and the pool exhausted in ~30 requests under sustained traffic. Compounding issues:
- autocommit=False + the pool's check=ConnectionPool.check_connection hook ran SELECT 1 on every checkout, implicitly opening a transaction that was never committed, yielding "idle in transaction" connections that accumulated forever.
- get_db()'s liveness probe also ran SELECT 1, adding more implicit transactions.
- No max_idle / max_lifetime was set, so stuck connections never recycled.
- The cleanup middleware was not wrapped in try/finally, so handler exceptions leaked slots.
- The _market_consumer_loop persistent background thread called get_db() per cycle but never released, holding a pool slot for the process lifetime.
- The _write_pageview fire-and-forget thread didn't use try/finally, so exceptions leaked.
- max_size=30 was below Starlette's default threadpool size (40), so even a correctly-returning pool could contend.

- PGConnection.close() rolls back any open txn and calls pool.putconn(). On putconn failure, it closes the socket so PG backends aren't leaked.
- max_size=50 (env overridable), timeout=10s, max_idle=300s, max_lifetime=1800s. check= removed.
- pool_stats() helper exposes psycopg get_stats() for observability.
- close_thread_db_connection middleware wrapped in try/finally.
- get_db() liveness check uses transaction_status instead of SELECT 1.
- _market_consumer_loop finally block releases its connection each cycle.
- _write_pageview fire-and-forget wrapped in try/finally.
- /health?pool=1 returns JSON pool stats and 503s when requests_waiting > 0.
- /metrics endpoint via prometheus-fastapi-instrumentator exposes request/latency + scidex_pg_pool_{size,available,requests_waiting} gauges.
- monitoring/config/prometheus.yml committed (binaries + systemd units live outside the repo under /home/ubuntu/monitoring/).
- /health?pool=1 shows requests_waiting=0.

- api_shared/db.py: rewrite close(), raise max_size, add timeouts/max_idle/max_lifetime, drop check=, add pool_stats(), replace SELECT 1 liveness probe with transaction_status check.
- api.py: wrap cleanup middleware in try/finally; add ?pool=1 to /health; wire prometheus-fastapi-instrumentator; release connections in market loop finally; wrap pageview fire-and-forget in try/finally.
- Commit monitoring/config/prometheus.yml for reproducibility of the local scrape config.
- Restart scidex-api after merge to main; the running process must reload to pick up the new code.

- … /home/ubuntu/monitoring/).
- … AsyncConnectionPool): larger refactor.
- prometheus-fastapi-instrumentator: added to venv (not tracked in requirements.txt here; added out-of-band during outage response).

- /health?pool=1 acceptance criterion unmet: a later agent replaced health_check(pool: int) with health_dashboard(), losing the pool query parameter.
- Added pool: int = Query(0) to health_dashboard(); it returns JSON pool stats (with 503 on requests_waiting > 0) when pool=1.
- … /metrics endpoint + prometheus-fastapi-instrumentator in requirements.txt + tests/test_metrics_fallback.py.
- Read api_shared/db.py, api.py, api_routes/admin.py, and monitoring/config/prometheus.yml before editing.
- /health?pool=1 support already landed earlier under this task, but later pool hardening intentionally changed the exact tuning (pool_max=80, checkout probe restored), so the spec is partially stale.
- Ran TestClient(api.app) checks and found the remaining live gap: /metrics was disabled in this environment because prometheus-fastapi-instrumentator was never added to requirements.txt, yielding "Prometheus instrumentation disabled: No module named 'prometheus_fastapi_instrumentator'" and GET /metrics -> 404.
- … /metrics fallback path.
- … /wiki/*, etc).
- … journalctl -u scidex-api + pg_stat_activity.
- close() was pass: identified by reading api_shared/db_pg.py and cross-referencing the middleware that called it.
- … /home/ubuntu/monitoring/.

{
"requirements": {
"coding": 7,
"reasoning": 6
},
"completion_shas": [
"12d234c61396ea22bbc9eb1476698c853687a12f",
"b5d3ac2e519e8d48890f1ce46e7c2657fbadced6"
],
"completion_shas_checked_at": "2026-04-18T12:51:45.392735+00:00"
}
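
The core of the fix described above (a close() that rolls back and returns the connection to the pool, called from a try/finally) can be illustrated with a minimal sketch. This is not the code in api_shared/db.py: FakeConnection, FakePool, PooledConnection, and handle_request are hypothetical stand-ins invented for illustration so the example runs without a database. It only assumes a pool that exposes getconn() and putconn(), as the spec implies.

```python
"""Sketch: a wrapper whose close() returns the connection to its pool."""


class FakeConnection:
    """Hypothetical stand-in for a database connection."""

    def __init__(self):
        self.in_transaction = False
        self.closed = False

    def rollback(self):
        self.in_transaction = False

    def close(self):
        self.closed = True


class FakePool:
    """Hypothetical stand-in for a pool exposing getconn()/putconn()."""

    def __init__(self, max_size):
        self._idle = [FakeConnection() for _ in range(max_size)]
        self.checked_out = 0

    def getconn(self):
        if not self._idle:
            raise RuntimeError("pool exhausted")
        self.checked_out += 1
        return self._idle.pop()

    def putconn(self, conn):
        self.checked_out -= 1
        self._idle.append(conn)


class PooledConnection:
    """Hypothetical wrapper in the spirit of PGConnection."""

    def __init__(self, pool):
        self._pool = pool
        self._conn = pool.getconn()

    def execute(self):
        # Stand-in for running a query: opens an implicit transaction.
        self._conn.in_transaction = True

    def close(self):
        conn, self._conn = self._conn, None
        if conn is None:
            return  # already released; close() is idempotent
        try:
            conn.rollback()           # never hand back an open transaction
            self._pool.putconn(conn)  # release the pool slot
        except Exception:
            conn.close()              # last resort: drop the socket


def handle_request(pool, fail=False):
    """Mimics the request path: release the slot even if the handler raises."""
    db = PooledConnection(pool)
    try:
        db.execute()
        if fail:
            raise ValueError("handler error")
    finally:
        db.close()
```

With a no-op close(), a pool of size 2 in this sketch would raise "pool exhausted" on the third request; with the close() shown, checked_out returns to 0 after every request, including ones where the handler raises.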