[Senate] Add process watchdog that kills runaway analyses and orphaned processes
Quest: Resource Governance
Priority: P4
Status: done
Goal
A background watchdog process monitors all analysis-related processes. It kills any process that exceeds the wall-clock timeout (30min default) or the memory limit (2GB default), or that appears orphaned (parent died). This prevents zombie processes from accumulating.
Acceptance Criteria
☑ Watchdog runs as background thread or separate process
☑ Kills analyses exceeding 30min wall-clock time
☑ Kills processes exceeding memory limit (2GB default)
☑ Detects and kills orphaned analysis subprocesses
☑ Kill events logged with full context (PID, memory, duration, analysis_id)
☑ Watchdog itself is monitored (auto-restart if it dies)
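The three kill conditions in the criteria above can be sketched as a single decision function. This is a minimal illustration, not the code in analysis_watchdog.py: `ProcessSample`, `kill_reason`, and the reason strings are hypothetical names, and "2GB" is read here as 2 GiB. In a psutil-based watchdog the sample fields would presumably be filled from calls such as `psutil.Process(pid).memory_info().rss` and `psutil.pid_exists(ppid)`.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_WALLCLOCK_LIMIT_S = 30 * 60          # 30min default from the criteria
DEFAULT_MEMORY_LIMIT_BYTES = 2 * 1024 ** 3   # 2GB default, assumed to mean 2 GiB


@dataclass
class ProcessSample:
    """One observation of a monitored process (hypothetical structure)."""
    pid: int
    started_at: float    # epoch seconds when the analysis was registered
    rss_bytes: int       # resident set size at sample time
    parent_alive: bool   # False once the registering parent has exited


def kill_reason(
    sample: ProcessSample,
    now: float,
    wallclock_limit_s: float = DEFAULT_WALLCLOCK_LIMIT_S,
    memory_limit_bytes: int = DEFAULT_MEMORY_LIMIT_BYTES,
) -> Optional[str]:
    """Return why the process should be killed, or None if it is within limits."""
    if not sample.parent_alive:
        return "orphaned"
    if now - sample.started_at > wallclock_limit_s:
        return "wallclock_timeout"
    if sample.rss_bytes > memory_limit_bytes:
        return "memory_limit"
    return None
```

Keeping the decision pure (no signals, no DB access) makes each limit easy to unit-test with synthetic samples; the polling loop would only need to collect samples and act on the returned reason.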
Approach
Write analysis_watchdog.py using psutil
Track analysis PIDs in a shared state file or DB table
Poll every 30s: check wall-clock, RSS memory, parent PID
Use SIGTERM first, SIGKILL after 10s grace period
Integrate with systemd for watchdog auto-restart
Add watchdog status to /api/quests/status
Dependencies
- psutil (already in use elsewhere in codebase)
- scidex.senate.resource_governance (for kill event recording integration)
- scidex.core.database (PostgreSQL)
Dependents
- Resource Governance quest: real-time resource monitoring dashboard
- Analysis Sandbox quest: process kill integration
Work Log
2026-04-20T22:30:00Z — Implementation Complete
Implemented full process watchdog:
migrations/add_watchdog_tables.py: Created two PostgreSQL tables:
- watchdog_processes: tracks PIDs currently monitored with wall-clock/memory limits
- watchdog_events: logs all kill events with full context (PID, memory, duration, analysis_id)
- Migration applied successfully; DB schema verified
scidex/senate/analysis_watchdog.py: Full watchdog implementation:
- Watchdog class with register()/unregister() for tracking analysis PIDs
- Background monitoring via daemon thread (poll every 30s)
- SIGTERM → 10s grace period → SIGKILL for graceful termination
- Wall-clock timeout detection (30min default)
- Memory limit detection via psutil RSS
- Orphan detection (parent died → SIGTERM immediately)
- DB persistence of monitored processes and kill events
- get_watchdog() singleton pattern, CLI with --once and --daemon modes
- Tested: singleton, register/unregister, poll_once, DB integration
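The SIGTERM, grace period, SIGKILL escalation listed above might look roughly like the following. It is a sketch under stated assumptions: `terminate_with_grace` is a hypothetical helper, it assumes a POSIX system, and it operates on a `subprocess.Popen` child for simplicity, whereas the actual watchdog tracks registered PIDs that are not necessarily its own children.

```python
import signal
import subprocess


def terminate_with_grace(proc: subprocess.Popen, grace_s: float = 10.0) -> str:
    """Send SIGTERM, wait up to grace_s seconds, then SIGKILL if still alive.

    Returns "exited" if the process was already gone, "sigterm" if it
    stopped within the grace period, and "sigkill" otherwise.
    """
    if proc.poll() is not None:
        return "exited"
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
        return "sigterm"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX
        proc.wait()   # reap the child so no zombie entry is left behind
        return "sigkill"
```

Returning the escalation level lets the caller record in the kill event whether the process shut down cooperatively or had to be forced.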
scidex-watchdog.service: systemd unit for watchdog auto-restart:
- Type=simple, Restart=on-failure, RestartSec=10
- MemoryMax=256M, CPUQuota=10%
- Security: ProtectSystem=strict, ProtectHome=read-only, ReadWritePaths=/tmp/scidex-analysis
- Installs to /etc/systemd/system/ (deployed separately)
api.py (/api/quests/status): Added watchdog status section:
- Shows monitored_count, grace_count, kills_last_24h, recent_events
- Returns empty status when no processes monitored (graceful degradation)
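As an illustration of the status section described above, the payload could be assembled along these lines. Only the four top-level keys come from this log; `build_watchdog_status`, the `in_grace` and `killed_at` row fields, and the `recent_limit` parameter are assumptions made for the sketch.

```python
from datetime import datetime, timedelta
from typing import Any


def build_watchdog_status(
    monitored: list[dict[str, Any]],
    events: list[dict[str, Any]],
    now: datetime,
    recent_limit: int = 5,
) -> dict[str, Any]:
    """Summarise watchdog state; empty inputs yield zero counts and no events."""
    cutoff = now - timedelta(hours=24)
    # Newest kill events first, truncated to the most recent few.
    recent = sorted(events, key=lambda e: e["killed_at"], reverse=True)[:recent_limit]
    return {
        "monitored_count": len(monitored),
        "grace_count": sum(1 for p in monitored if p.get("in_grace")),
        "kills_last_24h": sum(1 for e in events if e["killed_at"] >= cutoff),
        "recent_events": recent,
    }
```

With no monitored processes and no events, the function returns zeros and an empty list rather than raising, which matches the graceful-degradation behaviour noted above.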
Acceptance criteria fully met. All tests pass.
2026-04-20T23:45:00Z — FK Constraint Fix
Review feedback identified an inconsistency: watchdog_events.analysis_id was declared NOT NULL, but its FK constraint specifies ON DELETE SET NULL. PostgreSQL accepts the table definition, but deleting an analysis that has associated watchdog_events rows would fail with a NOT NULL constraint violation.
Fix: Changed analysis_id TEXT NOT NULL to analysis_id TEXT in watchdog_events table definition, consistent with ON DELETE SET NULL semantics.
Commit: c753578da
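The conflict between NOT NULL and ON DELETE SET NULL can be reproduced in a few lines. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs without a server; the `analyses` parent table, its `id` column, and `event_id` are hypothetical, and the real migration's column list is not shown in this log.

```python
import sqlite3


def delete_parent(analysis_id_decl: str):
    """Create parent/child tables, delete the parent row, and return the
    child's analysis_id afterwards.

    analysis_id_decl is the column declaration under test,
    e.g. "TEXT" or "TEXT NOT NULL".
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE analyses (id TEXT PRIMARY KEY)")
    conn.execute(
        f"""CREATE TABLE watchdog_events (
                event_id INTEGER PRIMARY KEY,
                analysis_id {analysis_id_decl}
                    REFERENCES analyses(id) ON DELETE SET NULL
            )"""
    )
    conn.execute("INSERT INTO analyses VALUES ('a1')")
    conn.execute("INSERT INTO watchdog_events (analysis_id) VALUES ('a1')")
    try:
        # With NOT NULL, the SET NULL action cannot be applied and this raises.
        conn.execute("DELETE FROM analyses WHERE id = 'a1'")
        return conn.execute("SELECT analysis_id FROM watchdog_events").fetchone()[0]
    finally:
        conn.close()
```

With the nullable declaration the event row survives the delete with analysis_id set to NULL; with NOT NULL the delete raises `sqlite3.IntegrityError`, which mirrors the failure described in the review feedback.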