[Senate] Emergency-pause switch for individual agents and quests running

Scoped pause (agent|skill|quest|actor) enforced at acquire time; in-flight loops respect; auto-pause on 3+ critical alerts in 5m.
Spec File

Goal

When something goes wrong with one persona / skill / quest (a runaway
loop, a botched prompt change, a model regression), the operator's only
current recourse is to stop the entire fleet: systemctl stop scidex-agent
and the orchestra supervisor. There is no scoped pause.
This task adds three concentric pause scopes (agent_id, skill,
quest_id), plus the actor scope used by auto-pause, surfaced through
one CLI verb and one API route. The guarantee is that a paused entity
will not start new work; in-flight work runs to normal completion,
except that long-running loops check the pause between iterations and
stop cleanly. It is the operational analog of "feature flags for
safety". Crucially, the pause is enforced at worker acquire time, not
pre-launch. This prevents the reboot-resurrect pattern, where a paused
entity restarts within 30 seconds because the fleet supervisor doesn't
know it's paused.

Effort: deep

Acceptance Criteria

☐ Migration migrations/20260428_emergency_pause.sql:

CREATE TABLE senate_pause (
  scope_kind   TEXT NOT NULL CHECK (scope_kind IN ('agent','skill','quest','actor')),
  scope_value  TEXT NOT NULL,
  paused_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  paused_by    TEXT NOT NULL,
  reason       TEXT NOT NULL,
  ttl_seconds  INT,                          -- NULL = indefinite
  cleared_at   TIMESTAMPTZ,
  cleared_by   TEXT,
  PRIMARY KEY (scope_kind, scope_value, paused_at)
);
CREATE INDEX idx_sp_active ON senate_pause (scope_kind, scope_value)
  WHERE cleared_at IS NULL;

CREATE TABLE senate_alerts (
  id          BIGSERIAL PRIMARY KEY,
  kind        TEXT NOT NULL,
  ref_id      TEXT,
  severity    TEXT NOT NULL DEFAULT 'medium',
  details     JSONB,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  ack_at      TIMESTAMPTZ,
  ack_by      TEXT
);

(The senate_alerts table is shared with circuit-breaker /
pattern-detector siblings; this is its canonical migration.)
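
The schema leaves TTL expiry implicit: idx_sp_active filters only on
cleared_at IS NULL, so a reader presumably has to compare
paused_at + ttl_seconds against the current time. The following is a
minimal Python sketch of that check, assuming an expired TTL counts as
"no longer paused". PauseRow, remaining_ttl_seconds, and is_active are
hypothetical names used for illustration, not part of the spec.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class PauseRow:
    """Hypothetical in-memory mirror of one senate_pause row."""
    scope_kind: str
    scope_value: str
    paused_at: datetime
    reason: str
    ttl_seconds: Optional[int] = None       # None mirrors SQL NULL = indefinite
    cleared_at: Optional[datetime] = None   # None mirrors SQL NULL = not cleared


def remaining_ttl_seconds(row: PauseRow, now: datetime) -> Optional[float]:
    """Seconds until the pause lapses, floored at 0; None if indefinite."""
    if row.ttl_seconds is None:
        return None
    expires_at = row.paused_at + timedelta(seconds=row.ttl_seconds)
    return max(0.0, (expires_at - now).total_seconds())


def is_active(row: PauseRow, now: datetime) -> bool:
    """Active means: not cleared, and the TTL (if any) has not run out."""
    if row.cleared_at is not None:
        return False
    remaining = remaining_ttl_seconds(row, now)
    return remaining is None or remaining > 0
```

Computing expiry at read time means no background job has to set
cleared_at when a TTL lapses; whether the real module does this or
clears rows eagerly is left open by the spec.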

☐ New module scidex/senate/emergency_pause.py:
is_paused(*, agent_id=None, skill=None, quest_id=None,
actor_id=None) -> tuple[bool, str | None]
where the second element is the pause reason, with a 5 s in-process
LRU cache (so high-frequency pollers don't hammer the DB).
☐ Acquire-time guard: patch the agent acquire path
(scidex/agents/runner.py:claim_next_task or the equivalent;
grep claim_task to be sure) so that before returning a task it
checks is_paused against the candidate's agent_id, skill,
and quest_id. If any scope is paused, the task is requeued with
next_eligible_at = now() + max(60, remaining_ttl_seconds) and
a task_events row is written.
☐ In-flight respect: long-running jobs must check
is_paused between iterations. Add the helper to the canonical
loop helpers in scidex/senate/integrity_sweeper.py:run_sweeps,
scidex/senate/comment_classifier.run, and the agora debate
loop. On detecting a pause they abort cleanly: commit the current
chunk, then stop.
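
The is_paused contract and the requeue delay above can be sketched in
Python as follows. This is an illustration under stated assumptions,
not the module itself: make_is_paused, the lookup callable, and
requeue_delay_seconds are hypothetical names; the scope check order
(agent, skill, quest, actor) is a guess, since the spec does not fix
precedence; and an indefinite pause is assumed to requeue after the
60 s floor. Because functools.lru_cache has no expiry of its own, the
sketch adds a time-bucket argument so an entry is reused for at most
5 s.

```python
import time
from functools import lru_cache
from typing import Callable, Optional, Tuple

CACHE_SECONDS = 5.0
MIN_REQUEUE_SECONDS = 60.0

# Hypothetical lookup: (scope_kind, scope_value) -> pause reason, or None
# when no active pause exists. A real one would query senate_pause.
Lookup = Callable[[str, str], Optional[str]]


def make_is_paused(lookup: Lookup, clock: Callable[[], float] = time.monotonic):
    """Build an is_paused function backed by a short-lived LRU cache."""

    @lru_cache(maxsize=1024)
    def cached_reason(scope_kind: str, scope_value: str, bucket: int) -> Optional[str]:
        # `bucket` changes every CACHE_SECONDS, which forces a fresh lookup.
        return lookup(scope_kind, scope_value)

    def is_paused(*, agent_id=None, skill=None, quest_id=None,
                  actor_id=None) -> Tuple[bool, Optional[str]]:
        bucket = int(clock() // CACHE_SECONDS)
        scopes = (("agent", agent_id), ("skill", skill),
                  ("quest", quest_id), ("actor", actor_id))
        for kind, value in scopes:
            if value is None:
                continue  # scope not supplied by the caller
            reason = cached_reason(kind, value, bucket)
            if reason is not None:
                return True, reason  # first paused scope wins
        return False, None

    return is_paused


def requeue_delay_seconds(remaining_ttl_seconds: Optional[float]) -> float:
    """Delay for next_eligible_at: max(60, remaining TTL); 60 if indefinite."""
    if remaining_ttl_seconds is None:
        return MIN_REQUEUE_SECONDS
    return max(MIN_REQUEUE_SECONDS, remaining_ttl_seconds)
```

Injecting the clock and the lookup keeps the cache behaviour testable
without a database, which may help with the TTL and acquire-time cases
in tests/test_emergency_pause.py.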
☐ API:
- POST /api/senate/pause {scope_kind, scope_value, reason, ttl_seconds?}
  → 200 with {paused_at, paused_by}. Auth required; record
  paused_by = auth_user_id.
- POST /api/senate/unpause {scope_kind, scope_value} → 200.
- GET /api/senate/pauses returns active pauses.
☐ CLI: orchestra senate pause <scope> <value> --reason "..." [--ttl 3600]
and orchestra senate unpause <scope> <value>.
orchestra senate pauses lists active pauses.
☐ Senate dashboard banner — when any active pauses exist,
render a top-of-page banner listing scope+reason+age, so
operators don't forget about indefinite pauses.
☐ Self-pause: if senate_alerts accumulates ≥3 critical
alerts for the same actor_id within 5 minutes, the alert
handler auto-creates a pause for that actor with reason
"auto-paused: 3+ critical alerts in 5m" and TTL 1800. It records
the auto-pause with paused_by='senate.auto'.
☐ Tests tests/test_emergency_pause.py: pause scope precedence,
TTL expiration, acquire-time gate, in-flight gate, unpause path,
auto-pause cascade.
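
The self-pause rule in the criteria above amounts to a sliding-window
count per actor. Here is a minimal Python sketch, assuming the alert
handler is given an actor_id for each alert (the spec does not say
which senate_alerts column carries it) and that alerts arrive in time
order. AutoPauser and on_alert are hypothetical names; the returned
dict only mirrors the senate_pause columns and is not a real API
payload.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Optional

WINDOW = timedelta(minutes=5)
THRESHOLD = 3
AUTO_TTL_SECONDS = 1800


class AutoPauser:
    """Tracks recent critical alerts per actor and decides when to auto-pause."""

    def __init__(self) -> None:
        # actor_id -> timestamps of critical alerts still inside the window
        self._recent: Dict[str, Deque[datetime]] = defaultdict(deque)

    def on_alert(self, actor_id: str, severity: str,
                 created_at: datetime) -> Optional[dict]:
        """Return a pause record to insert, or None if below the threshold."""
        if severity != "critical":
            return None
        window = self._recent[actor_id]
        window.append(created_at)
        # Drop alerts older than 5 minutes relative to the newest one.
        while created_at - window[0] > WINDOW:
            window.popleft()
        if len(window) < THRESHOLD:
            return None
        window.clear()  # avoid re-firing on every later alert in the same burst
        return {
            "scope_kind": "actor",
            "scope_value": actor_id,
            "reason": "auto-paused: 3+ critical alerts in 5m",
            "ttl_seconds": AUTO_TTL_SECONDS,
            "paused_by": "senate.auto",
        }
```

Clearing the window after firing is a design choice of this sketch, so
that one burst produces one pause row; the spec does not state how
repeated bursts should be handled.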

Approach

  • Migration first; verify against a dev PG instance.
  • Implement emergency_pause.py against the table; LRU-cache layer.
  • Patch the agent acquire path; reuse task_events for the requeue
    trail so prior tooling (orchestra task events <id>) shows it.
  • Patch the three in-flight loops; pattern: if is_paused(...): break.
  • API + CLI; auto-pause cascade.
  • Banner + smoke (pause agent=skeptic and verify the next acquire
    skips it; unpause; verify acquire resumes).
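
The in-flight pattern named in the steps above (if is_paused(...):
break) might look like the following Python sketch. run_in_chunks,
process, and commit are hypothetical stand-ins for whatever the sweeper,
classifier, and debate loops actually call; the is_paused argument is
assumed to be a zero-argument callable already bound to the job's
scopes and returning the (paused, reason) tuple from the spec.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def run_in_chunks(
    chunks: Iterable[T],
    process: Callable[[T], None],
    commit: Callable[[T], None],
    is_paused: Callable[[], Tuple[bool, Optional[str]]],
) -> Tuple[int, Optional[str]]:
    """Process chunks until done or paused.

    Returns (chunks_completed, pause_reason); pause_reason is None when
    the loop ran to completion.
    """
    done = 0
    for chunk in chunks:
        # Check between iterations: the previous chunk is already committed,
        # so breaking here never leaves half-written work behind.
        paused, reason = is_paused()
        if paused:
            return done, reason
        process(chunk)
        commit(chunk)
        done += 1
    return done, None
```

Checking at the top of each iteration keeps the abort point at a chunk
boundary, which is one way to read "commit current chunk, stop".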

Dependencies

  • q-safety-runaway-circuit-breaker: shared senate_alerts table.

Dependents

  • q-safety-suspicious-pattern-detector: emits the critical
    senate_alerts rows that drive the auto-pause cascade.

Work Log
