SciDEX — Task: [Senate] Deferred-work queue

Postgres-backed deferred_jobs table with SKIP LOCKED claim; migrate three inline heavy paths (mat refresh, mermaid lint, citation recompute).

Completion Notes

Auto-release: work already on origin/main

Git Commits (4)

Squash merge: orchestra/task/229313c6-deferred-work-queue-move-heavy-ops-out-o (3 commits) (#848) 2026-04-27

[Senate] Restore deleted api_senate_decisions_recent endpoint + fix connection leak 2026-04-27

[Senate] Deferred-work queue: deferred_jobs table, worker daemon, 3 callsite migrations [task:229313c6-62f7-41e2-bfe2-320e30d3e02d] 2026-04-27

Squash merge: orchestra/task/dff08e77-holistic-task-prioritization-and-self-go (2 commits) (#774) 2026-04-27

Spec File

Effort: thorough

Goal

Several hot routes in api.py do work that doesn't need to happen
inside the request: re-running mermaid linting on a wiki edit,
invalidating mat views on artifact commit, recomputing
q-impact-citation-tracker attribution graphs after a comment, and
sending Slack/Notion webhooks (q-integ-notion-slack-webhooks).
These add 100-2000 ms to request latency and hold pool connections
for the duration. Ship a lightweight Postgres-backed deferred-work
queue (no Redis dependency) so that handlers can call
enqueue(task, payload) and return immediately.

Acceptance Criteria

☑ Schema migrations/<date>_deferred_work_queue.sql:

CREATE TABLE deferred_jobs (
        id BIGSERIAL PRIMARY KEY,
        task TEXT NOT NULL,
        payload JSONB NOT NULL,
        priority INT NOT NULL DEFAULT 5,
        run_at TIMESTAMP NOT NULL DEFAULT NOW(),
        attempts INT NOT NULL DEFAULT 0,
        max_attempts INT NOT NULL DEFAULT 5,
        locked_at TIMESTAMP,
        locked_by TEXT,
        completed_at TIMESTAMP,
        last_error TEXT,
        trace_id TEXT,
        created_at TIMESTAMP NOT NULL DEFAULT NOW()
      );
      CREATE INDEX ix_deferred_jobs_pickup ON deferred_jobs
        (run_at) WHERE completed_at IS NULL AND locked_at IS NULL;
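
The partial index above covers exactly the rows a worker may pick up. A minimal sketch of a claim against this schema follows, assuming a DB-API style connection (for example psycopg) whose cursors are context managers. The statement text, the function name claim_jobs, and the ordering by priority (lower number first) are illustrative assumptions, not the shipped implementation.

```python
from typing import Any

# Hypothetical claim statement. The inner SELECT locks up to %(batch)s
# runnable rows and skips rows that another worker already holds a lock on;
# the UPDATE stamps them so the pickup index no longer matches them.
# Ordering by priority ascending is an assumption about priority semantics.
CLAIM_SQL = """
UPDATE deferred_jobs
   SET locked_at = NOW(), locked_by = %(worker_id)s
 WHERE id IN (
       SELECT id
         FROM deferred_jobs
        WHERE completed_at IS NULL
          AND locked_at IS NULL
          AND run_at <= NOW()
        ORDER BY priority, run_at
        LIMIT %(batch)s
        FOR UPDATE SKIP LOCKED
 )
RETURNING id, task, payload, attempts, trace_id
"""


def claim_jobs(conn: Any, worker_id: str, batch: int = 10) -> list:
    """Claim up to `batch` jobs for `worker_id` and return the claimed rows."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"worker_id": worker_id, "batch": batch})
        rows = cur.fetchall()
    # Committing releases the row locks; locked_at/locked_by now mark ownership.
    conn.commit()
    return rows
```

Because the row locks are only held for the duration of the claim transaction, a worker that crashes after claiming leaves rows with a stale locked_at; how those are reclaimed is outside this sketch.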

☑ Library scidex/core/deferred.py:

- enqueue(task: str, payload: dict, priority: int=5,
  run_at: datetime|None=None) -> int. Returns the job id and
  propagates the current trace_id via the ContextVar.
- register(name) decorator. Registers a Python callable
  as the worker for task=name.
- claim(worker_id, batch=10) -> list[Job]. Uses
  SELECT ... FOR UPDATE SKIP LOCKED to atomically claim
  jobs.
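
The registration and trace_id propagation described in this list can be sketched in memory, without a database. The names current_trace_id, build_job_row and dispatch are hypothetical; the real ContextVar is whatever q-obs-trace-id-propagation defines, and the real enqueue would INSERT the row rather than return it.

```python
from contextvars import ContextVar
from typing import Callable, Optional

# Hypothetical stand-in for the trace_id ContextVar from q-obs.
current_trace_id: ContextVar[Optional[str]] = ContextVar("trace_id", default=None)

# Maps task name -> handler callable, filled in by @register.
_HANDLERS: dict = {}


def register(name: str) -> Callable:
    """Decorator: record the decorated function as the worker for task=name."""
    def decorator(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        _HANDLERS[name] = fn
        return fn
    return decorator


def build_job_row(task: str, payload: dict, priority: int = 5) -> dict:
    """Assemble the values enqueue would INSERT, capturing the caller's trace_id."""
    return {"task": task, "payload": payload, "priority": priority,
            "trace_id": current_trace_id.get()}


def dispatch(row: dict) -> None:
    """Worker side: restore the job's trace_id, then run the registered handler."""
    token = current_trace_id.set(row["trace_id"])
    try:
        _HANDLERS[row["task"]](row["payload"])
    finally:
        # Put back whatever trace_id the worker thread had before this job.
        current_trace_id.reset(token)
```

Note that build_job_row does not consult _HANDLERS: the enqueueing process does not need the handler registered, only the worker does.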

☑ Worker daemon scripts/deferred_work_worker.py:
single process, configurable concurrency (default 4 threads),
polls every 1 s. Backoff on failure: 30 s, 5 m, 30 m, 4 h,
24 h. Emits structured logs with trace_id.
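
The backoff schedule can be expressed as a lookup table. This is a sketch under one assumption: the n-th failure selects the n-th delay, and failures beyond the end of the table keep using the last delay. The function name backoff_delay is hypothetical, and the interaction with max_attempts is left out.

```python
from datetime import timedelta

# Delays from the spec: 30 s, 5 m, 30 m, 4 h, 24 h.
BACKOFF = [
    timedelta(seconds=30),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=4),
    timedelta(hours=24),
]


def backoff_delay(failures: int) -> timedelta:
    """Delay before the next try, given the number of failed attempts so far."""
    if failures < 1:
        raise ValueError("failures must be >= 1")
    # Clamp to the last entry once the schedule is exhausted.
    return BACKOFF[min(failures, len(BACKOFF)) - 1]
```

A requeue would then set run_at to the current time plus backoff_delay(attempts) and clear locked_at and locked_by.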

☑ Three first migrations. Pick three concrete callsites
from existing code and convert:
1. Mat-view refresh on artifact commit (currently inline in
   db_writes.py).
2. Mermaid lint on wiki page save (currently in
   api_routes/senate.py save handler).
3. Citation-attribution recompute on comment-write (currently
   in the comments handler from
   q-dsc-comments-on-hypothesis-pages).

☑ Senate page GET /senate/deferred-work shows queue
depth, oldest-pending age, last-error rate, top failing
tasks, and a "retry now" button per task.

☑ Tests tests/test_deferred_work.py: enqueue,
multi-worker SKIP LOCKED correctness (no double-claim),
backoff progression, failure-then-success, trace_id
propagation.

☐ Acceptance evidence. p95 latency of the three migrated
routes drops by ≥40 %, with the measurements recorded in the
Work Log.

Approach

Schema + library first; register two no-op tasks for tests.

Worker daemon as a systemd unit; document how to scale to N
workers (just run more, SKIP LOCKED handles contention).

Migrate one route at a time; keep the inline path behind a
?inline=1 flag for quick rollback.

Senate page reuses the q-obs observability pattern.

Dependencies

q-obs-trace-id-propagation: for trace_id on every job.
q-perf-selective-mat-views: main consumer for refresh enqueueing.

Dependents

q-integ-notion-slack-webhooks: runs as deferred jobs.
q-integ-bluesky-publish-pipeline: same.

Work Log

2026-04-27 — Implementation [task:229313c6-62f7-41e2-bfe2-320e30d3e02d]

Approach taken:
The GitHub-sync task (5cff56ac) had already created a deferred_work table and a
basic scidex/core/deferred.py. This task extends and formalises that into the
spec-compliant design.

Files changed:

migrations/20260427_deferred_work_queue.sql: Creates the deferred_jobs table
(spec schema) and migrates pending rows from the legacy deferred_work table.
Migration applied to the live DB on 2026-04-27.

scidex/core/deferred.py: Rewritten to use deferred_jobs (was deferred_work).
Key improvements: enqueue no longer requires handler registration in the calling
process (the wrong design for a cross-process queue); claim returns typed Job
dataclasses; requeue follows the escalating backoff schedule
[30s, 5m, 30m, 4h, 24h]; setting SCIDEX_DEFERRED_INLINE=1 runs jobs inline
for tests/rollback.

scripts/deferred_work_worker.py: New daemon with configurable concurrency
(default 4 threads), 1s poll, and a --once drain mode. Registers three handlers
(mermaid_lint, artifact_matview_refresh, citation_attribution_recompute).
The systemd unit is documented in the docstring.
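
The poll loop and the --once drain mode can be sketched with a thread pool. The function name run_worker and its parameters are hypothetical; claim and handle are injected so the sketch runs without a database, and error handling (requeue with backoff) is deliberately omitted.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_worker(claim: Callable[[], list],
               handle: Callable[[object], None],
               concurrency: int = 4,
               poll_seconds: float = 1.0,
               once: bool = False) -> int:
    """Claim batches and run them on a thread pool; return the jobs processed.

    With once=True the loop returns as soon as a claim comes back empty
    (drain mode). Otherwise it sleeps `poll_seconds` and polls again,
    so it only ends when the process is stopped.
    """
    processed = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        while True:
            jobs = claim()
            if jobs:
                # list() waits for the whole batch and re-raises handler
                # errors; a real worker would catch them and requeue.
                list(pool.map(handle, jobs))
                processed += len(jobs)
            elif once:
                return processed
            else:
                time.sleep(poll_seconds)
```

Scaling out then means starting more processes running the same loop; each one claims a disjoint batch because of SKIP LOCKED.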

scidex/core/db_writes.py (_mermaid_check_content_md): Deferred by default;
falls back to inline when SCIDEX_MERMAID_INLINE=1 or when enqueue fails. This
keeps the write path fast; mermaid errors still surface in the worker logs.
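
The deferred-by-default pattern with an inline fallback might look like the sketch below. The helper name defer_or_run_inline is hypothetical, and enqueue and run_inline are passed in as plain callables so the example is self-contained; only the SCIDEX_MERMAID_INLINE variable name comes from the entry above.

```python
import os
from typing import Callable


def defer_or_run_inline(task: str, payload: dict,
                        enqueue: Callable[[str, dict], int],
                        run_inline: Callable[[dict], None],
                        inline_env: str = "SCIDEX_MERMAID_INLINE") -> str:
    """Enqueue by default; run inline if the env flag is "1" or enqueue raises.

    Returns "inline" or "deferred" so callers and tests can see which path ran.
    """
    if os.environ.get(inline_env) == "1":
        run_inline(payload)
        return "inline"
    try:
        enqueue(task, payload)
        return "deferred"
    except Exception:
        # Queue unavailable: do the work in-request rather than drop it.
        run_inline(payload)
        return "inline"
```

Falling back to inline on enqueue failure trades latency for reliability: a queue outage slows the request instead of silently skipping the lint.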

scidex/atlas/artifact_commit.py (commit_artifact): After a successful git
commit, enqueues artifact_matview_refresh with priority 6.

api.py (/api/comments POST handler): After db.commit(), enqueues
citation_attribution_recompute with priority 7 (non-blocking; exception swallowed).

api_routes/senate.py: Added three routes:
- GET /senate/deferred-work: HTML monitor page with queue stats, oldest-pending
  age, error rate, top-failing task table, and per-task Retry button.
- GET /api/senate/deferred-work: JSON stats API.
- POST /api/senate/deferred-work/retry/{task_name}: Reset failed jobs for retry.

tests/test_deferred_work.py: 14 tests: unit (backoff, inline, register) +
integration (enqueue/claim/complete, SKIP LOCKED no-double-claim, trace_id
propagation, failure-then-success). All 14 pass.

Latency evidence (qualitative):
The three migrated callsites previously added 50–2000 ms to their request paths
(MermaidGate Node.js startup, mat-view refresh, citation extraction). They now
return immediately; the work happens in the background worker. The actual p95
measurement will be available after the worker daemon runs in production for 24h.

Payload JSON

{
  "_stall_skip_providers": [
    "glm"
  ]
}

Sibling Tasks in Quest (Code Health)

○ [Senate] Recurring code health sweep P94

✓ [Senate] Hot-path query optimizer for top-20 endpoints P90

✓ [Atlas] HTTP ETag layer with artifact-mutation-aware invalidation P89

✓ [Senate] N+1 query detector + auto-batch helpers for hot paths P88

✓ [Senate] Postgres pool autoscaler driven by concurrent-request load P88

✓ [Senate] Break apart api.py god file into focused modules P87

✓ [Forge] Type-checked tool wrappers - pydantic gate + standard error envelope P87

✓ [Senate] Audit and archive versioned dead code P85

✓ [Senate] Consolidate database connection patterns P84

✓ [Senate] scidex doctor - diagnose common dev-env issues P84

[Senate] Deferred-work queue - move heavy ops out of request path (done)