When something goes wrong with one persona / skill / quest (a runaway
loop, a botched prompt change, a model regression), the operator's only
current recourse is "stop the entire fleet": systemctl stop scidex-agent
and the orchestra supervisor. There is no scoped pause.
This task adds three concentric pause scopes — agent_id, skill,
quest_id — surfaced through one CLI verb and one API route, with the
guarantee that a paused entity will not start new work but in-flight
work continues until normal completion. It is the operational analog
of "feature flags for safety". Crucially, the pause is enforced at
worker acquire time, not pre-launch — preventing the reboot-resurrect
pattern where a paused entity restarts within 30 seconds because the
fleet supervisor doesn't know it's paused.
Effort: deep
migrations/20260428_emergency_pause.sql:
CREATE TABLE senate_pause (
    scope_kind   TEXT NOT NULL CHECK (scope_kind IN ('agent','skill','quest','actor')),
    scope_value  TEXT NOT NULL,
    paused_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    paused_by    TEXT NOT NULL,
    reason       TEXT NOT NULL,
    ttl_seconds  INT,          -- NULL = indefinite
    cleared_at   TIMESTAMPTZ,
    cleared_by   TEXT,
    PRIMARY KEY (scope_kind, scope_value, paused_at)
);
CREATE INDEX idx_sp_active ON senate_pause (scope_kind, scope_value)
    WHERE cleared_at IS NULL;
CREATE TABLE senate_alerts (
    id          BIGSERIAL PRIMARY KEY,
    kind        TEXT NOT NULL,
    ref_id      TEXT,
    severity    TEXT NOT NULL DEFAULT 'medium',
    details     JSONB,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    ack_at      TIMESTAMPTZ,
    ack_by      TEXT
);
(The senate_alerts table is shared with circuit-breaker /
pattern-detector siblings; this is its canonical migration.)
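The schema leaves open how ttl_seconds and cleared_at combine to decide whether a pause is still in force. A minimal sketch follows, assuming TTL expiry is evaluated at read time rather than by a background job that sets cleared_at (the spec does not say which). The function name pause_is_active is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def pause_is_active(paused_at: datetime,
                    ttl_seconds: Optional[int],
                    cleared_at: Optional[datetime],
                    now: datetime) -> bool:
    """Return True if a senate_pause row should still block new work.

    Assumption: a row counts as expired once now >= paused_at + ttl_seconds,
    even if nobody has written cleared_at yet.
    """
    if cleared_at is not None:   # explicitly unpaused
        return False
    if ttl_seconds is None:      # NULL TTL means an indefinite pause
        return True
    return now < paused_at + timedelta(seconds=ttl_seconds)


if __name__ == "__main__":
    t0 = datetime(2026, 4, 28, tzinfo=timezone.utc)
    print(pause_is_active(t0, 1800, None, t0 + timedelta(minutes=10)))
```

If expiry is instead handled by a job that stamps cleared_at, the partial index on cleared_at IS NULL is enough by itself and the TTL branch becomes unnecessary.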
scidex/senate/emergency_pause.py: is_paused(*, agent_id=None, skill=None,
quest_id=None, actor_id=None) -> tuple[bool, reason | None], with a 5 s
in-process cache.
scidex/agents/runner.py: change claim_next_task or the equivalent (check
claim_task to be sure) so that before returning a task it calls
is_paused against the candidate's agent_id, skill, and quest_id. If any
scope is paused, the task is requeued with
next_eligible_at = now() + max(60, remaining_ttl_seconds) and a
task_events row is written.
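The acquire-time check and requeue delay described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: PauseChecker, load_active, and requeue_delay_seconds are hypothetical names, the cache is a simple time-bounded snapshot rather than an LRU, the scope check order (agent, skill, quest, actor) is a guess, and an indefinite pause is assumed to fall back to the 60 s floor.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_S = 5.0  # the spec's 5 s in-process cache window

# Hypothetical snapshot shape: (scope_kind, scope_value) -> (reason, remaining TTL or None)
ActivePauses = Dict[Tuple[str, str], Tuple[str, Optional[float]]]


class PauseChecker:
    """Caches the set of active pauses for up to CACHE_TTL_S seconds."""

    def __init__(self, load_active: Callable[[], ActivePauses],
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_active          # e.g. a query against senate_pause
        self._clock = clock
        self._snapshot: ActivePauses = {}
        self._loaded_at: Optional[float] = None

    def _active(self) -> ActivePauses:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= CACHE_TTL_S:
            self._snapshot = self._load()  # refresh the stale snapshot
            self._loaded_at = now
        return self._snapshot

    def is_paused(self, *, agent_id=None, skill=None, quest_id=None,
                  actor_id=None) -> Tuple[bool, Optional[str]]:
        active = self._active()
        # Assumed precedence: first matching scope in this order wins.
        for kind, value in (("agent", agent_id), ("skill", skill),
                            ("quest", quest_id), ("actor", actor_id)):
            if value is not None and (kind, value) in active:
                return True, active[(kind, value)][0]
        return False, None


def requeue_delay_seconds(remaining_ttl_seconds: Optional[float]) -> float:
    """max(60, remaining_ttl_seconds); indefinite pauses use 60 s (assumption)."""
    if remaining_ttl_seconds is None:
        return 60.0
    return max(60.0, remaining_ttl_seconds)
```

With a 5 s cache, a new pause can take up to 5 s to reach each worker process, which seems acceptable given that in-flight work is allowed to finish anyway.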
Long-running loops call is_paused between iterations. Add the helper to
the canonical loops: scidex/senate/integrity_sweeper.py:run_sweeps,
scidex/senate/comment_classifier.run, and the agora debate loop.
POST /api/senate/pause {scope_kind, scope_value, reason, ttl_seconds?}
→ 200 with {paused_at, paused_by}. Authenticated;
paused_by = auth_user_id.
POST /api/senate/unpause {scope_kind, scope_value} → 200.
GET /api/senate/pauses returns active pauses.
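The request body for POST /api/senate/pause can be checked before any row is written. The sketch below is framework-agnostic and purely illustrative: validate_pause_request is a hypothetical helper, and the rules beyond the scope_kind set (non-empty strings, positive integer TTL) are assumptions inferred from the NOT NULL columns in the migration.

```python
from typing import Any, Dict

# Mirrors the CHECK constraint on senate_pause.scope_kind.
ALLOWED_SCOPE_KINDS = {"agent", "skill", "quest", "actor"}


def validate_pause_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Normalise a pause request body or raise ValueError."""
    kind = body.get("scope_kind")
    if not isinstance(kind, str) or kind not in ALLOWED_SCOPE_KINDS:
        raise ValueError(f"scope_kind must be one of {sorted(ALLOWED_SCOPE_KINDS)}")

    value = body.get("scope_value")
    if not isinstance(value, str) or not value:
        raise ValueError("scope_value must be a non-empty string")

    reason = body.get("reason")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")

    ttl = body.get("ttl_seconds")  # optional; None means indefinite
    if ttl is not None and (isinstance(ttl, bool)
                            or not isinstance(ttl, int) or ttl <= 0):
        raise ValueError("ttl_seconds must be a positive integer when present")

    return {"scope_kind": kind, "scope_value": value,
            "reason": reason.strip(), "ttl_seconds": ttl}
```

A route handler would call this first, then insert the row with paused_by taken from the authenticated user rather than from the request body.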
orchestra senate pause <scope> <value> --reason "..."
orchestra senate unpause <scope> <value>
orchestra senate pauses lists active pauses.
Auto-pause: when senate_alerts accumulates ≥3 critical alerts for one
actor (actor_id) within 5 minutes, the alerted actor is paused with
reason "auto-paused: 3+ critical alerts in 5m" and TTL 1800. Records
paused_by='senate.auto'.
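The auto-pause threshold amounts to a sliding-window count per actor. The sketch below keeps the window in memory for clarity; a real implementation would more likely count rows in senate_alerts. AutoPauseDetector is a hypothetical name, timestamps are assumed to arrive in non-decreasing order, and clearing the window after firing (to avoid re-pausing on every later alert) is an assumption.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

WINDOW_S = 300      # 5 minutes
THRESHOLD = 3       # critical alerts needed to trigger
AUTO_TTL_S = 1800   # TTL recorded on the automatic pause


class AutoPauseDetector:
    """Tracks recent critical alerts per actor_id."""

    def __init__(self) -> None:
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, actor_id: str, severity: str, ts: float) -> Optional[dict]:
        """Record one alert; return a pause record if the threshold is hit."""
        if severity != "critical":
            return None
        window = self._recent[actor_id]
        window.append(ts)
        # Drop alerts that are more than WINDOW_S seconds older than this one.
        while window and ts - window[0] > WINDOW_S:
            window.popleft()
        if len(window) < THRESHOLD:
            return None
        window.clear()  # assumption: start counting afresh after firing
        return {
            "scope_kind": "actor",
            "scope_value": actor_id,
            "reason": "auto-paused: 3+ critical alerts in 5m",
            "ttl_seconds": AUTO_TTL_S,
            "paused_by": "senate.auto",
        }
```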
tests/test_emergency_pause.py: pause scope precedence,
emergency_pause.py against the table, and the LRU-cache layer.
A task_events row is written for the requeue, and the task's event log
(orchestra task events <id>) shows it.
Long-running loops exit with if is_paused(...): break.
Pause agent=skeptic and verify the next acquire requeues the task.
q-safety-runaway-circuit-breaker: shared senate_alerts table.
q-safety-suspicious-pattern-detector: emits the critical senate_alerts
rows that drive the auto-pause cascade.
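The between-iterations check for long-running loops could look like the sketch below. run_until_paused is a hypothetical wrapper; the is_paused argument is assumed to be a zero-argument callable already bound to the loop's own scope, returning the (paused, reason) tuple from the spec.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def run_until_paused(items: Iterable[T],
                     handle: Callable[[T], None],
                     is_paused: Callable[[], Tuple[bool, Optional[str]]]) -> int:
    """Process items one by one, stopping before the next item once paused.

    The item being handled when the pause lands is allowed to finish,
    matching the guarantee that in-flight work continues to completion.
    """
    processed = 0
    for item in items:
        paused, _reason = is_paused()
        if paused:
            break
        handle(item)
        processed += 1
    return processed
```

Checking before each item rather than after means a loop that starts while already paused does no work at all.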