All 14 routes were returning HTTP 502 because the FastAPI backend server was not responding.
Affected routes:
/, /exchange, /gaps, /wiki, /graph, /senate, /quests, /resources, /forge, /pitch, /entity/, /hypothesis/, /analysis/, /api/

Root cause: transient server restart / process crash. The backend recovered on its own, and subsequent agents confirmed all routes healthy. The scidex-api systemd service handles auto-restart.
api.py — /api/health endpoint now returns correct HTTP status codes:
Previously the health endpoint always returned HTTP 200 even when status was "degraded".
Fixed to return HTTP 503 when the server is degraded (DB errors), enabling monitoring tools,
load balancers, and automated restart scripts to detect and respond to unhealthy states.
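The status-to-code mapping can be sketched in plain Python (a minimal sketch; `health_status_code` is a hypothetical helper for illustration, not the actual api.py code):

```python
# Hypothetical helper mirroring the fix described above: any non-"healthy"
# state (e.g. "degraded" after DB errors) maps to HTTP 503, so load balancers
# and restart scripts treat the instance as unhealthy.
def health_status_code(status: str) -> int:
    return 200 if status == "healthy" else 503

print(health_status_code("healthy"))   # 200
print(health_status_code("degraded"))  # 503
```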
# Before: always HTTP 200
return { "status": status, ... }
# After: HTTP 503 when degraded
http_status = 503 if status != "healthy" else 200
return JSONResponse(content=payload, status_code=http_status)

All 14 affected endpoints verified on current main (HEAD b6689c9f6):
scidex-api systemd service: active.

ea1bd2cf-f329-4784-9071-672801f5accc: Re-validated current main before changing code. Confirmed ci_route_health.py already contains the systemctl show ... MainPID recovery path and scidex-route-health.timer is set to 15 minutes on disk, but the deployed timer had not been reloaded (systemctl status still showed OnUnitActiveSec=8h). Also reproduced a false-negative failure in python3 ci_route_health.py: the API stayed healthy (/api/health 200, core routes 200/302) while /papers and /senate transiently timed out once, causing the whole watchdog service to exit 1. Plan: harden the watchdog so it retries transient route timeouts and only treats repeated failures on core liveness routes as backend-health failures, then reload the timer/service.

ea1bd2cf-f329-4784-9071-672801f5accc: Implemented the watchdog hardening in ci_route_health.py: added retries for transient request timeouts, split core liveness probes from auxiliary heavyweight pages, and only fail the backend-health watchdog for core-route failures or genuine HTTP errors. Verification: python3 -m py_compile ci_route_health.py passed; python3 ci_route_health.py now returns 0 with 21 OK, 0 timeout/unreachable, 0 HTTP error; direct curls to /, /api/health, /api/status, /forge, /exchange, /gaps, /graph, /quests, /papers, and /senate all returned 200/302. Remaining operational follow-up: systemctl status still shows the timer loaded at 8h and requires privileged daemon-reload + timer restart to activate the on-disk 15min unit; this sandbox cannot perform that step (Interactive authentication required).

/api/health now returns HTTP 503 for degraded states, enabling proper monitoring integration.
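The retry-and-split logic described above can be sketched as follows (a hypothetical sketch: the route set, function names, and return conventions are assumptions, and the probe function is injected so the logic is testable without a live server):

```python
# Transient timeouts are retried; only repeated failures on core liveness
# routes, or genuine HTTP errors anywhere, fail the watchdog.
CORE_ROUTES = {"/", "/api/health", "/api/status"}  # assumed core liveness set

def probe_route(route, fetch, retries=2):
    """Return 'ok', 'timeout', or an int HTTP error status."""
    for _attempt in range(retries + 1):
        try:
            status = fetch(route)       # e.g. returns 200, 302, 500 ...
        except TimeoutError:
            continue                    # transient timeout: retry
        return "ok" if status in (200, 302) else status
    return "timeout"                    # timed out on every attempt

def watchdog_exit_code(results):
    """Exit 1 only for core-route timeouts or real HTTP errors."""
    for route, outcome in results.items():
        if outcome == "timeout" and route in CORE_ROUTES:
            return 1
        if isinstance(outcome, int):    # genuine HTTP error on any route
            return 1
    return 0
```

Under this scheme an auxiliary heavyweight page such as /papers can time out once without failing the service, which matches the false-negative fix above.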
Re-verified on current main (HEAD ff5a4ac03):
api.py:3093: return JSONResponse(content=payload, status_code=503 if status != "healthy" else 200) — confirmed on main
scidex-api systemd service active
clean-entity-502-push merge path

Current root cause shifted from the original transient 502: the API can now enter a hung state where port 8000 still accepts TCP connections but all HTTP routes time out.
Evidence gathered on 2026-04-23:
scidex status showed scidex-api active while direct requests to http://127.0.0.1:8000/, /health, /api/health, and /api/status timed out.
systemctl status scidex-route-health.* showed the watchdog timer active, but the recovery service failing repeatedly.
ss -tlnp 'sport = :8000' on this host exposed no pid= metadata, so ci_route_health.py could not identify or kill the stuck uvicorn process.

Fixes:
ci_route_health.py: recover a hung API by reading MainPID from systemctl show scidex-api.service and sending SIGKILL directly instead of scraping ss -p output.
scidex-route-health.timer: reduce cadence from every 8 hours to every 15 minutes so a hung backend is auto-recovered promptly instead of lingering for most of a workday.

Watchdog repair task ea1bd2cf re-verified on main (HEAD be903cfed):
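A minimal sketch of the MainPID-based recovery path adopted in ci_route_health.py (hypothetical helper names; the injectable `run` parameter exists only so the logic can be exercised without systemd):

```python
import os
import signal
import subprocess

def get_main_pid(service="scidex-api.service", run=subprocess.run):
    # `systemctl show <svc> --property=MainPID --value` prints just the PID,
    # avoiding the ss -p scraping that exposed no pid= metadata on this host.
    out = run(
        ["systemctl", "show", service, "--property=MainPID", "--value"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return int(out or "0")

def kill_hung_api(service="scidex-api.service", run=subprocess.run):
    pid = get_main_pid(service, run)
    if pid > 0:
        os.kill(pid, signal.SIGKILL)  # systemd's Restart= policy respawns it
    return pid
```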
7138bc29-21f status: done (completed 2026-04-02, summary: "False alarm — 502 errors were transient")
ci_route_health.py compiled and executed successfully: 21 OK, 0 timeout/unreachable, 0 HTTP error (exit 0)
/api/health returns 200 with full health payload: hypotheses=1171, analyses=398, entities=49251, edges=714201, debates=607
Commits be903cfed, d75f90c0b, 43ab72196 are merged to main
orchestra/task/ea1bd2cf-complete-backend-api-server-failure-17-a is up to date with origin/main (empty diff)

Watchdog verification task e0848a71 re-verified on current main (HEAD 250e1cae5):
scidex status reports scidex-api active against PostgreSQL with hypotheses=1171, analyses=398, and KG edges=714201
python3 ci_route_health.py passed on this host with 21 OK, 0 timeout/unreachable, 0 HTTP error and exit code 0
/, /exchange, /gaps, /wiki, /graph, /senate, /quests, /resources, /forge, /pitch, /entity/gene-APOE, /api/health, and /api/search?q=test&limit=1 all verified healthy
7138bc29-21fb-499d-88a3-d9c5abab8cd0 was already done; orchestra update --id 7138bc29-21fb-499d-88a3-d9c5abab8cd0 --status done returned {"success": true, "updated": 1}
Commits be903cfed, d75f90c0b, and 43ab72196 are merged to main

ec905552-0564-4dee-b740-0ddf21a1eb88: Re-validated current main before changing code. scidex status showed scidex-api healthy and /api/health returned HTTP 200 with current DB counts, but python3 ci_route_health.py exited 1 because most route probes returned HTTP 429 from the shared localhost IP while the backend itself remained healthy. Plan: keep the existing hung-process recovery logic, but harden ci_route_health.py so localhost rate-limit responses are treated as probe contention rather than backend failure, while preserving failures for repeated timeouts and real non-429 HTTP errors.

ec905552-0564-4dee-b740-0ddf21a1eb88: Implemented and verified the probe hardening in ci_route_health.py. Changes: added a dedicated User-Agent, classified HTTP 429 as a soft signal instead of a backend failure, and left timeouts plus non-429 HTTP errors as hard failures. Verification: python3 -m py_compile ci_route_health.py passed; python3 ci_route_health.py returned exit 0 on a clean run; after deliberately generating 94 localhost 429s with 130 rapid GETs to /, python3 ci_route_health.py still returned exit 0 with 9 OK, 0 timeout/unreachable, 12 rate-limited, 0 HTTP error, proving the watchdog no longer misclassifies local probe contention as API failure.
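The 429 classification described above can be sketched as follows (a hypothetical sketch: names and the User-Agent string are illustrative, not the actual ci_route_health.py symbols):

```python
# HTTP 429 from the shared localhost IP is recorded as a soft "rate-limited"
# signal, while timeouts and non-429 HTTP errors remain hard failures.
HEADERS = {"User-Agent": "scidex-route-health/1.0"}  # dedicated UA per the fix

def classify(outcome):
    """outcome: an int HTTP status, or the string 'timeout'."""
    if outcome == "timeout":
        return "hard-fail"
    if outcome in (200, 302):
        return "ok"
    if outcome == 429:
        return "rate-limited"   # probe contention, not backend failure
    return "hard-fail"          # genuine HTTP error

def exit_code(outcomes):
    # Exit 1 only if any probe produced a hard failure.
    return 1 if any(classify(o) == "hard-fail" for o in outcomes) else 0
```

On a run like the one logged above (a mix of OK and rate-limited probes, no timeouts or real HTTP errors), this scheme exits 0 instead of misreporting an API failure.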