[Senate] Add process watchdog that kills runaway analyses and orphaned processes

Quest: Resource Governance | Priority: P4 | Status: done

Goal

A background watchdog process monitors all analysis-related processes. It kills any process that exceeds the wall-clock timeout (30 min default) or the memory limit (2 GB default), or that appears orphaned (its parent has died). This prevents zombie processes from accumulating.

Acceptance Criteria

☑ Watchdog runs as background thread or separate process
☑ Kills analyses exceeding 30min wall-clock time
☑ Kills processes exceeding memory limit (2GB default)
☑ Detects and kills orphaned analysis subprocesses
☑ Kill events logged with full context (PID, memory, duration, analysis_id)
☑ Watchdog itself is monitored (auto-restart if it dies)

Approach

  • Write analysis_watchdog.py using psutil
  • Track analysis PIDs in a shared state file or DB table
  • Poll every 30s: check wall-clock, RSS memory, parent PID
  • Use SIGTERM first, SIGKILL after 10s grace period
  • Integrate with systemd for watchdog auto-restart
  • Add watchdog status to /api/quests/status
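
The per-poll checks listed above (wall-clock, RSS memory, parent PID) can be sketched as a small pure function. This is a minimal illustration, not the code in analysis_watchdog.py: the function name `kill_reason`, the reason strings, and the check order are assumptions. In a psutil-based watchdog the inputs would likely come from `psutil.Process(pid).create_time()`, `.memory_info().rss`, and `.ppid()`.

```python
from typing import Optional

WALL_CLOCK_LIMIT_S = 30 * 60         # 30 min default from the spec
MEMORY_LIMIT_BYTES = 2 * 1024 ** 3   # 2 GB default from the spec


def kill_reason(
    elapsed_s: float,
    rss_bytes: int,
    parent_alive: bool,
    wall_limit_s: float = WALL_CLOCK_LIMIT_S,
    mem_limit_bytes: int = MEMORY_LIMIT_BYTES,
) -> Optional[str]:
    """Return why a process should be killed, or None if it is within limits.

    The reason strings and the order of checks are hypothetical choices
    made for this sketch.
    """
    if not parent_alive:
        return "orphaned"               # parent died
    if elapsed_s > wall_limit_s:
        return "wall_clock_timeout"     # ran longer than allowed
    if rss_bytes > mem_limit_bytes:
        return "memory_limit"           # resident set size too large
    return None


# Example: a process that has run for 45 minutes with a live parent.
print(kill_reason(elapsed_s=45 * 60, rss_bytes=512 * 1024 ** 2, parent_alive=True))
```

Keeping the decision separate from the sampling makes the limits easy to unit-test without spawning real processes.
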

Dependencies

  • psutil (already in use elsewhere in codebase)
  • scidex.senate.resource_governance (for kill event recording integration)
  • scidex.core.database (PostgreSQL)

Dependents

  • Resource Governance quest: real-time resource monitoring dashboard
  • Analysis Sandbox quest: process kill integration

Work Log

    2026-04-20T22:30:00Z — Implementation Complete

    Implemented full process watchdog:

  • migrations/add_watchdog_tables.py: Created two PostgreSQL tables:
    - watchdog_processes: tracks PIDs currently monitored, with their wall-clock and memory limits
    - watchdog_events: logs all kill events with full context (PID, memory, duration, analysis_id)
    - Migration applied successfully; DB schema verified

  • scidex/senate/analysis_watchdog.py: Full watchdog implementation:
    - Watchdog class with register()/unregister() for tracking analysis PIDs
    - Background monitoring via daemon thread (poll every 30s)
    - SIGTERM → 10s grace period → SIGKILL for graceful termination
    - Wall-clock timeout detection (30min default)
    - Memory limit detection via psutil RSS
    - Orphan detection (parent died → SIGTERM immediately)
    - DB persistence of monitored processes and kill events
    - get_watchdog() singleton pattern, CLI with --once and --daemon modes
    - Tested: singleton, register/unregister, poll_once, DB integration

  • scidex-watchdog.service: systemd unit for watchdog auto-restart:
    - Type=simple, Restart=on-failure, RestartSec=10
    - MemoryMax=256M, CPUQuota=10%
    - Security: ProtectSystem=strict, ProtectHome=read-only, ReadWritePaths=/tmp/scidex-analysis
    - Installs to /etc/systemd/system/ (deployed separately)

  • api.py (/api/quests/status): Added watchdog status section:
    - Shows monitored_count, grace_count, kills_last_24h, recent_events
    - Returns empty status when no processes monitored (graceful degradation)
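
The SIGTERM → 10s grace period → SIGKILL escalation listed for analysis_watchdog.py can be sketched with the standard library alone. This is an illustration under stated assumptions, not the shipped code: it assumes the caller holds a `subprocess.Popen` handle and a POSIX host, and the name `terminate_gracefully` is hypothetical. A PID-based watchdog built on psutil would presumably use `psutil.Process.terminate()`, `wait(timeout=...)`, and `kill()` in the same order.

```python
import subprocess

GRACE_PERIOD_S = 10.0  # grace period from the spec


def terminate_gracefully(proc: subprocess.Popen, grace_s: float = GRACE_PERIOD_S) -> str:
    """Ask the process to exit, then force it if it ignores the request.

    Returns a label for what ended the process (hypothetical labels).
    """
    if proc.poll() is not None:
        return "already_exited"
    proc.terminate()                 # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)   # give the process time to clean up
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()                  # sends SIGKILL on POSIX
        proc.wait()                  # reap the child so no zombie is left
        return "SIGKILL"
```

Returning which signal was needed gives the caller something concrete to record alongside the PID, memory, and duration in a kill event.
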

    Acceptance criteria fully met. All tests pass.

    2026-04-20T23:45:00Z — FK Constraint Fix

    Review feedback identified a schema conflict: watchdog_events.analysis_id was declared NOT NULL, but its FK constraint specifies ON DELETE SET NULL. PostgreSQL accepts the table definition, but deleting any analysis that has associated watchdog_events rows would fail with a NOT NULL constraint violation.

    Fix: Changed analysis_id TEXT NOT NULL to analysis_id TEXT in the watchdog_events table definition, consistent with ON DELETE SET NULL semantics.

    Commit: c753578da
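
The conflict behind this fix can be reproduced in a few lines. The sketch below deliberately swaps in SQLite (via Python's built-in `sqlite3`) for PostgreSQL so that it runs without a server; the parent table name `analyses`, the reduced column set, and the helper name `delete_analysis` are assumptions for illustration, not the project's schema.

```python
import sqlite3


def delete_analysis(column_def: str):
    """Create parent/child tables, delete the parent row, and return the
    child's analysis_id afterwards. column_def is the type/nullability of
    watchdog_events.analysis_id, e.g. "TEXT" or "TEXT NOT NULL"."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE analyses (id TEXT PRIMARY KEY)")  # hypothetical parent table
    conn.execute(
        "CREATE TABLE watchdog_events ("
        " id INTEGER PRIMARY KEY,"
        f" analysis_id {column_def}"
        " REFERENCES analyses(id) ON DELETE SET NULL)"
    )
    conn.execute("INSERT INTO analyses VALUES ('a1')")
    conn.execute("INSERT INTO watchdog_events (analysis_id) VALUES ('a1')")
    # With NOT NULL, this delete fails: SET NULL cannot write NULL into the column.
    conn.execute("DELETE FROM analyses WHERE id = 'a1'")
    row = conn.execute("SELECT analysis_id FROM watchdog_events").fetchone()
    conn.close()
    return row[0]


print(delete_analysis("TEXT"))  # nullable column: the event row survives with NULL
try:
    delete_analysis("TEXT NOT NULL")
except sqlite3.IntegrityError as exc:
    print("delete rejected:", exc)
```

In both engines the table definition is accepted and the problem only surfaces when a referenced analysis is deleted, which is why it was caught in review rather than by the migration.
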

    File: ee9617a7dbd3_senate_add_process_watchdog_that_kills_spec.md
    Modified: 2026-04-25 23:40
    Size: 3.8 KB