[Senate] Deferred-work queue - move heavy ops out of request path (done)

Postgres-backed deferred_jobs table with SKIP LOCKED claim; migrate three inline heavy paths (mat refresh, mermaid lint, citation recompute).

Completion Notes

Auto-release: work already on origin/main

Git Commits (4)

Squash merge: orchestra/task/229313c6-deferred-work-queue-move-heavy-ops-out-o (3 commits) (#848) - 2026-04-27
[Senate] Restore deleted api_senate_decisions_recent endpoint + fix connection leak - 2026-04-27
[Senate] Deferred-work queue: deferred_jobs table, worker daemon, 3 callsite migrations [task:229313c6-62f7-41e2-bfe2-320e30d3e02d] - 2026-04-27
Squash merge: orchestra/task/dff08e77-holistic-task-prioritization-and-self-go (2 commits) (#774) - 2026-04-27

Spec File

Effort: thorough

Goal

Several hot routes in api.py do work that does not need to happen
inside the request: re-running mermaid linting on a wiki edit,
invalidating mat views on artifact commit, recomputing
q-impact-citation-tracker attribution graphs after a comment, and
sending Slack/Notion webhooks (q-integ-notion-slack-webhooks).
These add 100-2000 ms to request latency and hold pool connections
for the duration of the wait. Ship a lightweight Postgres-backed
deferred-work queue (no Redis dependency) so that handlers can call
enqueue(task, payload) and return immediately.

Acceptance Criteria

Schema migrations/<date>_deferred_work_queue.sql:

CREATE TABLE deferred_jobs (
  id BIGSERIAL PRIMARY KEY,
  task TEXT NOT NULL,
  payload JSONB NOT NULL,
  priority INT NOT NULL DEFAULT 5,
  run_at TIMESTAMP NOT NULL DEFAULT NOW(),
  attempts INT NOT NULL DEFAULT 0,
  max_attempts INT NOT NULL DEFAULT 5,
  locked_at TIMESTAMP,
  locked_by TEXT,
  completed_at TIMESTAMP,
  last_error TEXT,
  trace_id TEXT,
  created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX ix_deferred_jobs_pickup ON deferred_jobs
  (run_at) WHERE completed_at IS NULL AND locked_at IS NULL;

Library scidex/core/deferred.py:
- enqueue(task: str, payload: dict, priority: int = 5,
  run_at: datetime | None = None) -> int: returns the job id and
  propagates the current trace_id via the ContextVar.
- register(name) decorator: registers a Python callable as the
  worker for task=name.
- claim(worker_id, batch=10) -> list[Job]: uses
  SELECT ... FOR UPDATE SKIP LOCKED to claim jobs atomically.
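
The claim step can be sketched as a single UPDATE whose subquery picks due rows with FOR UPDATE SKIP LOCKED, so that concurrent workers skip rows another worker has already locked instead of waiting on them. This is a minimal sketch and not the code in scidex/core/deferred.py: the CLAIM_SQL text, the ascending priority order, the Job fields, and the DB-API-style cursor with psycopg-style %s placeholders are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumed claim statement. The inner SELECT matches the partial index
# predicate (completed_at IS NULL AND locked_at IS NULL); ordering by
# ascending priority is a guess, since the spec does not say which
# direction means "more urgent".
CLAIM_SQL = """
UPDATE deferred_jobs
   SET locked_at = NOW(), locked_by = %s
 WHERE id IN (
       SELECT id
         FROM deferred_jobs
        WHERE completed_at IS NULL
          AND locked_at IS NULL
          AND run_at <= NOW()
        ORDER BY priority, run_at
        LIMIT %s
          FOR UPDATE SKIP LOCKED)
RETURNING id, task, payload, attempts, trace_id
"""


@dataclass
class Job:
    """Hypothetical shape of a claimed job; field names follow the schema."""
    id: int
    task: str
    payload: Any
    attempts: int
    trace_id: Optional[str]


def claim(cur, worker_id: str, batch: int = 10) -> list:
    """Claim up to `batch` jobs for `worker_id` using a DB-API cursor."""
    cur.execute(CLAIM_SQL, (worker_id, batch))
    # Each returned row maps positionally onto the Job fields above.
    return [Job(*row) for row in cur.fetchall()]
```

The caller is expected to commit the transaction after claim() returns, so that locked_at and locked_by become visible to other workers.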
Worker daemon scripts/deferred_work_worker.py:
single process, configurable concurrency (default 4 threads),
polls every 1 s. Backoff on failure: 30 s, 5 m, 30 m, 4 h,
24 h. Emits structured logs with trace_id.
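
The backoff schedule above can be expressed as a lookup from failure count to delay. The sketch below is an assumption about how that mapping might look, not the worker's actual code: the names BACKOFF, backoff_delay, and next_run_at are hypothetical, the failure count is taken to be 1-based, and timestamps are assumed to be timezone-aware UTC even though the schema declares plain TIMESTAMP columns.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Delays listed in the spec: 30 s, 5 m, 30 m, 4 h, 24 h.
BACKOFF = [
    timedelta(seconds=30),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=4),
    timedelta(hours=24),
]


def backoff_delay(failures: int) -> timedelta:
    """Delay to wait after the `failures`-th failure (1-based).

    Counts beyond the end of the schedule reuse the last delay.
    """
    if failures < 1:
        raise ValueError("failures must be >= 1")
    return BACKOFF[min(failures, len(BACKOFF)) - 1]


def next_run_at(failures: int, now: Optional[datetime] = None) -> datetime:
    """Timestamp to store in run_at when a failed job is requeued."""
    now = now or datetime.now(timezone.utc)
    return now + backoff_delay(failures)
```

Whether a job is abandoned once attempts reaches max_attempts, or only after the 24 h retry has also failed, is left open here because the spec does not say.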
First three migrations. Pick three concrete callsites
from existing code and convert:
1. Mat-view refresh on artifact commit (currently inline in
db_writes.py).
2. Mermaid lint on wiki page save (currently in
api_routes/senate.py save handler).
3. Citation-attribution recompute on comment-write (currently
in the comments handler from
q-dsc-comments-on-hypothesis-pages).
Senate page GET /senate/deferred-work shows queue
depth, oldest-pending age, last-error rate, top failing
tasks, and a "retry now" button per task.
Tests tests/test_deferred_work.py: enqueue,
multi-worker SKIP LOCKED correctness (no double-claim),
backoff progression, failure-then-success, trace_id
propagation.
Acceptance evidence. The Work Log shows p95 latency of the three
migrated routes dropping by ≥40 %.
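
One way to check the ≥40 % criterion is to compute p95 over latency samples taken before and after a migration and compare the two. The sketch below uses the nearest-rank percentile definition, which is an assumption since the spec does not name a method; the function names are hypothetical and no measured data is implied.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after` (0.4 means 40 %)."""
    return (before - after) / before


def meets_target(before_ms, after_ms, target: float = 0.40) -> bool:
    """True when p95 latency dropped by at least `target`."""
    return relative_drop(p95(before_ms), p95(after_ms)) >= target
```

Samples would need to come from the same route under comparable load for the comparison to mean anything.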

Approach

  • Schema + library first; register two no-op tasks for tests.
  • Worker daemon as a systemd unit; document how to scale to N
    workers (just run more, SKIP LOCKED handles contention).
  • Migrate one route at a time; keep the inline path behind a
    ?inline=1 flag for quick rollback.
  • Senate page reuses the q-obs observability pattern.

Dependencies

  • q-obs-trace-id-propagation: for trace_id on every job.
  • q-perf-selective-mat-views: main consumer for refresh
    enqueueing.

Dependents

  • q-integ-notion-slack-webhooks: runs as deferred jobs.
  • q-integ-bluesky-publish-pipeline: same.

Work Log

    2026-04-27 — Implementation [task:229313c6-62f7-41e2-bfe2-320e30d3e02d]

    Approach taken:
    The GitHub-sync task (5cff56ac) had already created a deferred_work table and a
    basic scidex/core/deferred.py. This task extends and formalises that into the
    spec-compliant design.

    Files changed:

    • migrations/20260427_deferred_work_queue.sql: creates the deferred_jobs table
      (spec schema) and migrates pending rows from the legacy deferred_work table.
      Migration applied to the live DB on 2026-04-27.
    • scidex/core/deferred.py: rewritten to use deferred_jobs (was deferred_work).
      Key improvements: enqueue no longer requires handler registration in the calling
      process (wrong design for cross-process queues); claim returns typed Job
      dataclasses; requeue uses exponential backoff [30s, 5m, 30m, 4h, 24h];
      SCIDEX_DEFERRED_INLINE=1 env var runs jobs inline for tests/rollback.
    • scripts/deferred_work_worker.py: new daemon with configurable concurrency
      (default 4 threads), 1s poll, --once drain mode; registers three handlers
      (mermaid_lint, artifact_matview_refresh, citation_attribution_recompute);
      systemd unit documented in docstring.
    • scidex/core/db_writes.py (_mermaid_check_content_md): deferred by default;
      falls back to inline when SCIDEX_MERMAID_INLINE=1 or enqueue fails. Keeps
      fast write path; mermaid errors still surface in worker logs.
    • scidex/atlas/artifact_commit.py (commit_artifact): after a successful git
      commit, enqueues artifact_matview_refresh with priority 6.
    • api.py (/api/comments POST handler): after db.commit(), enqueues
      citation_attribution_recompute with priority 7 (non-blocking; exception swallowed).
    • api_routes/senate.py: added three routes:
      - GET /senate/deferred-work: HTML monitor page with queue stats, oldest-pending
        age, error rate, top-failing task table, and per-task Retry button.
      - GET /api/senate/deferred-work: JSON stats API.
      - POST /api/senate/deferred-work/retry/{task_name}: reset failed jobs for retry.
    • tests/test_deferred_work.py: 14 tests: unit (backoff, inline, register) +
      integration (enqueue/claim/complete, SKIP LOCKED no-double-claim, trace_id
      propagation, failure-then-success). All 14 pass.
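
The "deferred by default, inline on flag or enqueue failure" behaviour described for the mermaid callsite can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in db_writes.py or deferred.py: HANDLERS, register, and run_deferred are hypothetical names, the enqueue function is passed in as a plain callable, and the inline path assumes the handler is importable in the calling process.

```python
import os
from typing import Callable, Dict

# Hypothetical in-process registry mapping task name -> handler.
HANDLERS: Dict[str, Callable[[dict], None]] = {}


def register(name: str):
    """Decorator that records the wrapped function as the handler for `name`."""
    def wrap(fn):
        HANDLERS[name] = fn
        return fn
    return wrap


def run_deferred(task: str, payload: dict,
                 enqueue: Callable[[str, dict], object],
                 inline_env: str = "SCIDEX_DEFERRED_INLINE") -> str:
    """Enqueue `task`, or run it inline when the flag is set or enqueue fails.

    Returns "queued" or "inline" so callers and tests can see which path ran.
    """
    if os.environ.get(inline_env) != "1":
        try:
            enqueue(task, payload)
            return "queued"
        except Exception:
            # Queue unavailable: fall through to the inline path so the
            # work is not silently dropped.
            pass
    HANDLERS[task](payload)
    return "inline"
```

Passing inline_env="SCIDEX_MERMAID_INLINE" would mirror the per-callsite flag mentioned for the mermaid check, while the default mirrors the global rollback flag.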

    Latency evidence (qualitative):
    The three migrated callsites previously added 50–2000 ms to their request paths
    (MermaidGate Node.js startup, mat-view refresh, citation extraction). They now
    return immediately, and the work happens in the background worker. A measured
    p95 will be available once the worker daemon has run in production for 24 h.

    Payload JSON
    {
      "_stall_skip_providers": [
        "glm"
      ]
    }
