Complete Backend API Server Failure

Problem

All 14 routes were returning HTTP 502 because the FastAPI backend server was not responding.

Affected routes:

/, /exchange, /gaps, /wiki, /graph, /senate, /quests, /resources, /forge, /pitch
/entity/, /hypothesis/, /analysis/, /api/

Root Cause

The cause was a transient server restart or process crash. The backend recovered on its own, and
subsequent agents confirmed that all routes were healthy. The scidex-api systemd service handles auto-restart.

Fix Applied

api.py — /api/health endpoint now returns correct HTTP status codes:

Previously the health endpoint always returned HTTP 200 even when status was "degraded".
Fixed to return HTTP 503 when the server is degraded (DB errors), enabling monitoring tools,
load balancers, and automated restart scripts to detect and respond to unhealthy states.

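# Before: always HTTP 200, even when status was "degraded"
return {"status": status, ...}

# After: HTTP 503 whenever status is not "healthy"
# (JSONResponse is imported from fastapi.responses)
http_status = 503 if status != "healthy" else 200
return JSONResponse(content=payload, status_code=http_status)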
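
The status mapping can also be exercised without the web framework, which may help when reasoning about how a monitor or restart script would react. The sketch below is illustrative only: `build_health_response` and `needs_attention` are hypothetical names, and the `error` payload key is an assumption rather than something taken from api.py.

```python
from typing import Optional, Tuple


def build_health_response(db_error: Optional[str]) -> Tuple[int, dict]:
    """Return (http_status, payload) for a health check.

    Assumption: a DB error is the only thing that makes the status "degraded".
    """
    status = "healthy" if db_error is None else "degraded"
    payload = {"status": status}
    if db_error is not None:
        payload["error"] = db_error  # hypothetical field name
    # Same mapping as the fix: anything other than "healthy" becomes 503.
    http_status = 503 if status != "healthy" else 200
    return http_status, payload


def needs_attention(http_status: int) -> bool:
    """How a monitor might interpret the code: 5xx means unhealthy."""
    return http_status >= 500
```

With this mapping, a client that only inspects the status code (as most load balancers do) can tell a degraded backend from a healthy one without parsing the JSON body.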

Verification (2026-04-13)

All 14 affected endpoints verified on current main (HEAD b6689c9f6):

Route	Status
`/`	302 → OK
`/exchange`	200 OK
`/gaps`	200 OK
`/wiki`	200 OK
`/graph`	200 OK
`/senate`	200 OK
`/quests`	200 OK
`/resources`	200 OK
`/forge`	200 OK
`/pitch`	200 OK
`/entity/gene-APOE`	200 OK
`/api/health`	200 OK (healthy)
`/api/search?q=test`	200 OK

Server uptime: 20,794 seconds. scidex-api systemd service: active.

Work Log

2026-04-23 10:52 PT — watchdog repair task ea1bd2cf-f329-4784-9071-672801f5accc: Re-validated current main before changing code. Confirmed ci_route_health.py already contains the systemctl show ... MainPID recovery path and scidex-route-health.timer is set to 15 minutes on disk, but the deployed timer had not been reloaded (systemctl status still showed OnUnitActiveSec=8h). Also reproduced a false-negative failure in python3 ci_route_health.py: the API stayed healthy (/api/health 200, core routes 200/302) while /papers and /senate transiently timed out once, causing the whole watchdog service to exit 1. Plan: harden the watchdog so it retries transient route timeouts and only treats repeated failures on core liveness routes as backend-health failures, then reload the timer/service.
2026-04-23 10:52 PT — watchdog repair task ea1bd2cf-f329-4784-9071-672801f5accc: Implemented the watchdog hardening in ci_route_health.py: added retries for transient request timeouts, split core liveness probes from auxiliary heavyweight pages, and only fail the backend-health watchdog for core-route failures or genuine HTTP errors. Verification: python3 -m py_compile ci_route_health.py passed; python3 ci_route_health.py now returns 0 with 21 OK, 0 timeout/unreachable, 0 HTTP error; direct curls to /, /api/health, /api/status, /forge, /exchange, /gaps, /graph, /quests, /papers, and /senate all returned 200/302. Remaining operational follow-up: systemctl status still shows the timer loaded at 8h and requires privileged daemon-reload + timer restart to activate the on-disk 15min unit; this sandbox cannot perform that step (Interactive authentication required).
2026-04-13: Re-evaluated task. Server healthy, all 14 endpoints 200 OK. Prior resolution commits (53d3d6007, e71fca765, 657e29b39) did not land on main (orphan branches). Applied concrete fix: /api/health now returns HTTP 503 for degraded states, enabling proper monitoring integration.
2026-04-13 (retry): Addressing merge gate: added a spec work log entry noting that the api.py health endpoint change (HTTP 503 on degraded state) is the substantive fix in this branch. Entity 502s (task ae65cf18) were a transient ngrok infrastructure issue; no api.py changes needed for those.
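
The watchdog hardening described in the 2026-04-23 10:52 PT entries (retry transient timeouts, fail only on core-route failures or genuine HTTP errors) can be sketched roughly as follows. This is not the code in ci_route_health.py: the route lists, function names, and the injected `fetch` callable are all assumptions made for illustration. `fetch` is expected to return an HTTP status code, or `None` when the request timed out.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical split; the real lists live in ci_route_health.py.
CORE_ROUTES = ("/", "/api/health", "/api/status")
AUX_ROUTES = ("/papers", "/senate")


def probe_with_retries(
    fetch: Callable[[str], Optional[int]], route: str, attempts: int = 3
) -> Optional[int]:
    """Return the first HTTP status obtained, or None if every attempt timed out."""
    for _ in range(attempts):
        status = fetch(route)
        if status is not None:
            return status
    return None


def watchdog_exit_code(
    results: Dict[str, Optional[int]], core_routes: Iterable[str] = CORE_ROUTES
) -> int:
    """Exit 1 for a core-route timeout or any genuine HTTP error, else 0."""
    core = set(core_routes)
    for route, status in results.items():
        if status is None:
            if route in core:
                return 1  # repeated timeout on a core liveness route
        elif status >= 400:
            return 1  # genuine HTTP error on any route
    return 0
```

Under these assumptions, a single transient timeout on an auxiliary page such as /papers no longer fails the whole run, while a core route that never answers still does.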

Reopened Task Re-evaluation (2026-04-13)

Re-verified on current main (HEAD ff5a4ac03):

api.py:3093: return JSONResponse(content=payload, status_code=503 if status != "healthy" else 200) — confirmed on main
All 14 affected endpoints: HTTP 200 OK (verified via curl against running server on port 8000)
Server healthy: uptime 3374s, scidex-api systemd service active
Prior fix commits (15b485e0d, db9b08420) are present on main via the clean-entity-502-push merge path

Conclusion: Task is complete. No additional code changes required.

Reopened Task Re-evaluation (2026-04-23)

The root cause has shifted from the original transient 502: the API can now enter a hung state where port 8000 still accepts TCP connections but all HTTP routes time out.

Evidence gathered on 2026-04-23:

scidex status showed scidex-api active while direct requests to http://127.0.0.1:8000/, /health, /api/health, and /api/status timed out.
systemctl status scidex-route-health.* showed the watchdog timer active, but the recovery service failing repeatedly.
ss -tlnp 'sport = :8000' on this host exposed no pid= metadata, so ci_route_health.py could not identify or kill the stuck uvicorn process.
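
The distinction the evidence above draws (port open, HTTP silent) can be checked programmatically. The following is a minimal sketch using only the standard library; the function names and default timeouts are hypothetical and not taken from ci_route_health.py.

```python
import http.client
import socket
import urllib.error
import urllib.request


def classify_backend(tcp_open: bool, http_responded: bool) -> str:
    """Combine the two probe results into one of: down, hung, responding."""
    if not tcp_open:
        return "down"
    return "responding" if http_responded else "hung"


def tcp_accepts(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_responds(url: str, timeout: float = 5.0) -> bool:
    """True if the server sends any HTTP response before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (OSError, http.client.HTTPException):
        return False  # timeout, refused or reset connection, bad response
```

A call such as `classify_backend(tcp_accepts("127.0.0.1", 8000), http_responds("http://127.0.0.1:8000/api/health"))` would report `"hung"` for the state described above, where systemd still considers the service active.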

Fix applied on this task:

ci_route_health.py: recover hung API by reading MainPID from systemctl show scidex-api.service and sending SIGKILL directly instead of scraping ss -p output.
scidex-route-health.timer: shorten the interval from every 8 hours to every 15 minutes so a hung backend is auto-recovered promptly instead of lingering for most of a workday.
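
A rough sketch of the MainPID-based recovery path follows. It assumes `systemctl show <unit> --property=MainPID` prints a line of the form `MainPID=<n>`, with `0` meaning no main process; `parse_main_pid` and `kill_hung_service` are hypothetical names, and the actual implementation in ci_route_health.py may differ.

```python
import os
import signal
import subprocess
from typing import Optional


def parse_main_pid(show_output: str) -> Optional[int]:
    """Extract MainPID from `systemctl show` output; None if absent or 0."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "MainPID" and value.strip().isdigit():
            pid = int(value.strip())
            return pid if pid > 0 else None
    return None


def kill_hung_service(unit: str = "scidex-api.service") -> Optional[int]:
    """SIGKILL the unit's main process; return the PID that was signalled."""
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=MainPID"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    pid = parse_main_pid(out)
    if pid is not None:
        os.kill(pid, signal.SIGKILL)  # needs permission to signal the process
    return pid
```

Asking systemd for the PID avoids depending on `ss -p`, which exposed no `pid=` metadata on this host. After the kill, the service's own auto-restart is expected to bring the API back.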

Already Resolved — 2026-04-23 18:02 UTC

Watchdog repair task ea1bd2cf re-verified on main (HEAD be903cfed):

Original task 7138bc29-21f status: done (completed 2026-04-02, summary: "False alarm — 502 errors were transient")
ci_route_health.py compiled and executed successfully: 21 OK, 0 timeout/unreachable, 0 HTTP error (exit 0)
All 14 originally affected endpoints return healthy status codes (200/301/302)
/api/health returns 200 with full health payload: hypotheses=1171, analyses=398, entities=49251, edges=714201, debates=607
Prior watchdog hardening commits (be903cfed, d75f90c0b, 43ab72196) are merged to main
Branch orchestra/task/ea1bd2cf-complete-backend-api-server-failure-17-a is up to date with origin/main (empty diff)

Already Resolved — 2026-04-23 19:00:02Z

Watchdog verification task e0848a71 re-verified on current main (HEAD 250e1cae5):

scidex status reports scidex-api active against PostgreSQL with hypotheses=1171, analyses=398, and KG edges=714201
python3 ci_route_health.py passed on this host with 21 OK, 0 timeout/unreachable, 0 HTTP error and exit code 0
Direct HTTP probes returned 200 for /, /exchange, /gaps, /wiki, /graph, /senate, /quests, /resources, /forge, /pitch, /entity/gene-APOE, /api/health, and /api/search?q=test&limit=1
Original task 7138bc29-21fb-499d-88a3-d9c5abab8cd0 was already done; orchestra update --id 7138bc29-21fb-499d-88a3-d9c5abab8cd0 --status done returned {"success": true, "updated": 1}
Prior fixes remain present on main via commits be903cfed, d75f90c0b, and 43ab72196

Summary: the backend API server failure is already resolved on main; this watchdog task only records fresh verification evidence so it stops requeueing.

2026-04-23 12:12 PT — watchdog repair task ec905552-0564-4dee-b740-0ddf21a1eb88: Re-validated current main before changing code. scidex status showed scidex-api healthy and /api/health returned HTTP 200 with current DB counts, but python3 ci_route_health.py exited 1 because most route probes returned HTTP 429 from the shared localhost IP while the backend itself remained healthy. Plan: keep the existing hung-process recovery logic, but harden ci_route_health.py so localhost rate-limit responses are treated as probe contention rather than backend failure, while preserving failures for repeated timeouts and real non-429 HTTP errors.
2026-04-23 12:13 PT — watchdog repair task ec905552-0564-4dee-b740-0ddf21a1eb88: Implemented and verified the probe hardening in ci_route_health.py. Changes: added a dedicated User-Agent, classified HTTP 429 as a soft signal instead of a backend failure, and left timeouts plus non-429 HTTP errors as hard failures. Verification: python3 -m py_compile ci_route_health.py passed; python3 ci_route_health.py returned exit 0 on a clean run; after deliberately generating 94 localhost 429s with 130 rapid GETs to /, python3 ci_route_health.py still returned exit 0 with 9 OK, 0 timeout/unreachable, 12 rate-limited, 0 HTTP error, proving the watchdog no longer misclassifies local probe contention as API failure.
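
The probe classification from the 12:13 PT entry can be illustrated with a small sketch. The category names, `classify_probe`, and `summarize` are hypothetical; for brevity the sketch ignores the core/auxiliary route split and treats every timeout as a hard failure, as that entry describes.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_probe(status: Optional[int]) -> str:
    """Map a probe result (HTTP status, or None for a timeout) to a category."""
    if status is None:
        return "timeout"
    if status == 429:
        return "rate_limited"  # soft signal: probe contention, not a failure
    if status >= 400:
        return "http_error"
    return "ok"


def summarize(statuses: Iterable[Optional[int]]) -> Tuple[Counter, int]:
    """Count categories and derive the exit code (1 only for hard failures)."""
    counts = Counter(classify_probe(s) for s in statuses)
    exit_code = 1 if counts["timeout"] or counts["http_error"] else 0
    return counts, exit_code
```

Feeding it 9 successful probes and 12 rate-limited ones, as in the deliberate 429 test above, yields exit code 0, whereas a single 500 or timeout still yields 1.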

File: 7138bc29_complete_backend_api_server_failure_spec.md

Modified: 2026-04-25 23:40
