[Senate] Triage 20 autonomous engine failures and create fix tasks


Goal

Triage the highest-signal recurring-engine failures in the SciDEX autonomous task fleet and turn real failures into concrete follow-up work. The outcome should distinguish transient rate limits from actionable code, DB, and task-shape problems so the recurring engine layer wastes fewer cycles.

Acceptance Criteria

☑ 20 recurring task failures triaged with task/run evidence
☑ Fix tasks created for code/db failures
☑ Frequency reduction or batching recommendations documented for non-fixable recurring failures
☑ Triage summary committed to the repo

Approach

  • Query the live Orchestra task state for SciDEX recurring tasks with non-empty last_error.
  • If the live tasks.last_error set is smaller than 20, extend the census with recent task_runs.last_error history and mark which failures are live-state vs run-history.
  • Classify each failure as rate_limit, code_error, db_error, or spec_mismatch.
  • Create one-shot fix tasks with specs for actionable code/db problems and document cadence changes for rate-limit/spec-shape failures.
  • Record the triage matrix, created tasks, and recommendations in this spec.
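
The classification step above can be sketched as a small rule table. This is a minimal illustration, not the triage tooling itself: the regular expressions and the `classify_failure` name are assumptions derived from the failure signals recorded in this spec, and the `unclassified` fallback is a hypothetical addition for signals that match no rule.

```python
import re

# Hypothetical rule table (assumption): patterns are inferred from the
# failure signals listed in this spec. Order matters: first match wins.
RULES = [
    ("rate_limit", re.compile(r"rate_limit", re.I)),
    ("db_error", re.compile(r"\bPRAGMA\b|PostgreSQL", re.I)),
    ("code_error", re.compile(r"acquire_fail|phantom running|no live (agent|worker)", re.I)),
    ("spec_mismatch", re.compile(r"timed out|lease expired|requeued by|blocked merge", re.I)),
]


def classify_failure(last_error: str) -> str:
    """Map a raw last_error string to one of the four triage classes."""
    for label, pattern in RULES:
        if pattern.search(last_error):
            return label
    # Not one of the spec's classes: flags signals that need manual review.
    return "unclassified"
```

A rule table like this is only a first pass. Signals such as `Requeued by cli-requeue` still need a human decision about whether they reflect engine health or manual recovery.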

Dependencies

    • docs/planning/specs/quest-engine-ci.md - quest-engine recurring task behavior and target frequency
    • docs/planning/specs/senate_quest_engine_noop_fix_spec.md - existing sibling investigation for the quest-engine recurring driver

    Dependents

    • Recurring Senate and cross-layer autonomous engines that currently burn slot time on retries, requeues, or blocked merges

    Work Log

    2026-04-26 06:46:33Z — Slot codex:54

    • Started task 9e45545a-0eeb-4698-9d79-ffda3f456b45 from the assigned worktree on current main commit 0f46acefa.
    • Read /home/ubuntu/Orchestra/AGENTS.md, repo AGENTS.md, docs/planning/specs/quest-engine-ci.md, docs/planning/specs/senate_quest_engine_noop_fix_spec.md, docs/planning/alignment-feedback-loops.md, docs/planning/artifact-governance.md, and docs/planning/landscape-gap-framework.md.
    • Verified task is still relevant: task created 2026-04-23T04:44:38Z, no attached spec existed, and live SciDEX recurring tasks still carry non-empty last_error values.
    • Initial DB census result: the exact tasks.last_error query described in the task now yields 10 live recurring tasks, so this triage extends the analysis with recent task_runs.last_error history to reach 20 concrete failure events while preserving the distinction between live-state failures and run-history failures.
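
The census top-up described in this entry can be sketched as follows. Only the `tasks` and `task_runs` table names and their `last_error` columns come from this spec. Every other column (`id`, `project`, `task_type`, `title`, `task_id`, `started_at`) and the `build_census` helper are assumptions for illustration and may not match the real Orchestra schema.

```python
import sqlite3


def build_census(conn: sqlite3.Connection, project: str, target: int = 20) -> list:
    """Collect live failures first, then top up from run history (newest first)."""
    # Live state: recurring tasks that currently carry a non-empty last_error.
    live = conn.execute(
        "SELECT title, last_error FROM tasks "
        "WHERE project = ? AND task_type = 'recurring' "
        "AND COALESCE(last_error, '') <> '' ORDER BY title",
        (project,),
    ).fetchall()
    census = [
        {"source": "live-state", "task": title, "error": err, "when": None}
        for title, err in live
    ]

    # Run history: only consulted when the live set is smaller than the target.
    missing = target - len(census)
    if missing > 0:
        history = conn.execute(
            "SELECT t.title, r.last_error, r.started_at FROM task_runs r "
            "JOIN tasks t ON t.id = r.task_id "
            "WHERE t.project = ? AND t.task_type = 'recurring' "
            "AND COALESCE(r.last_error, '') <> '' "
            "ORDER BY r.started_at DESC LIMIT ?",
            (project, missing),
        ).fetchall()
        census += [
            {"source": "run-history", "task": title, "error": err, "when": when}
            for title, err, when in history
        ]
    return census
```

Tagging each row with its `source` keeps the live-state versus run-history distinction visible in the final matrix.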

    Triage Summary

    Live recurring-task failures (tasks.last_error on 2026-04-26)

    | # | Recurring task | Failure signal | Class | Action |
    |---|---|---|---|---|
    | 1 | [Artifacts] Audit all 67 stub notebooks (<10KB) and regenerate with real content | Review gate rejected PostgreSQL-incompatible PRAGMA busy_timeout in market repricing helpers | db_error | Created fix task 5caa79fb-399c-4c4a-bbe6-c1eb032617ec |
    | 2 | [Artifacts] Review notebook links from analysis pages — fix any that lead to stubs | rate_limit_retries_exhausted:max_gmail | rate_limit | Reduce cadence until provider pool is stable; prefer non-Gmail account routing for recurring audits |
    | 3 | [Atlas] CI: Verify and repair KG edge consistency | Worker lease expired at 30m and task requeued | spec_mismatch | Split into bounded batches or raise lease/runtime budget before retrying |
    | 4 | [Atlas] Wiki citation coverage report — daily metrics snapshot | 10 blocked merge attempts, escalated to higher-safety routing | spec_mismatch | Keep escalated; add smaller scoped verifier or split risky write path from report generation |
    | 5 | [Senate] Agent activity heartbeat (driver #2) | phantom running task with no live worker | code_error | Created fix task 77521747-85b5-4463-a04e-ccbe05424b77 |
    | 6 | [Senate] Link validation sweep | Command timed out after 900s | spec_mismatch | Lower cadence and shard link checks into smaller batches |
    | 7 | [Senate] Orchestra operator watchdog and self-repair loop | acquire_fail:worktree_creation_failed | code_error | Created fix task 24c69375-1e0a-4b9d-a926-355a7fbdbff2 |
    | 8 | [Senate] Orphan coverage check | phantom running task attributed to scheduler | code_error | Created fix task 77521747-85b5-4463-a04e-ccbe05424b77 |
    | 9 | [Senate] Visual regression testing | Command timed out after 300s | spec_mismatch | Reduce cadence and/or restrict each cycle to a fixed page subset |
    | 10 | [Senate] World-model improvement detector (driver #13) | phantom running task with no live worker | code_error | Created fix task 77521747-85b5-4463-a04e-ccbe05424b77 |

    Run-history supplement (task_runs.last_error, newest-first, to reach 20 triaged failures)

    | # | Run timestamp (UTC) | Recurring task | Failure signal | Class | Action |
    |---|---|---|---|---|---|
    | 11 | 2026-04-26 00:30 | Quest engine CI | Requeued by cli-requeue | spec_mismatch | Treat as evidence of operator/manual recovery churn; fix underlying acquire/no-op loop before keeping 30-minute cadence |
    | 12 | 2026-04-25 22:05 | Quest engine CI | Requeued by cli-requeue | spec_mismatch | Same as above |
    | 13 | 2026-04-25 20:40 | Quest engine CI | watchdog: worker lease expired; requeued | spec_mismatch | Increase boundedness or lease only after fixing queue-depth/no-op behavior |
    | 14 | 2026-04-25 10:45 | Quest engine CI | supervisor restart: no live agent | code_error | Covered by phantom-run cleanup task 77521747-85b5-4463-a04e-ccbe05424b77 |
    | 15 | 2026-04-25 01:59 | Quest engine CI | acquire_fail:worktree_creation_failed | code_error | Covered by worktree-acquisition task 24c69375-1e0a-4b9d-a926-355a7fbdbff2 |
    | 16 | 2026-04-25 01:23 | Quest engine CI | acquire_fail:worktree_creation_failed | code_error | Same as above |
    | 17 | 2026-04-25 01:01 | Quest engine CI | acquire_fail:worktree_creation_failed | code_error | Same as above |
    | 18 | 2026-04-25 00:47 | Quest engine CI | acquire_fail:worktree_creation_failed | code_error | Same as above |
    | 19 | 2026-04-24 09:10 | [Agora] Debate engine cycle | Requeued by cli-requeue | spec_mismatch | Requeue noise only; keep under observation and avoid treating manual retries as engine health |
    | 20 | 2026-04-24 06:50 | [Senate] DB health check | Requeued by cli-requeue | spec_mismatch | Same as above |
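
As a quick cross-check, the class labels from the two tables above can be tallied. The lists below are transcribed from rows 1 to 20 of this spec; the variable names are illustrative only.

```python
from collections import Counter

# Class column of the live-state table, rows 1-10, in row order.
LIVE_CLASSES = [
    "db_error", "rate_limit", "spec_mismatch", "spec_mismatch", "code_error",
    "spec_mismatch", "code_error", "code_error", "spec_mismatch", "code_error",
]
# Class column of the run-history table, rows 11-20, in row order.
HISTORY_CLASSES = ["spec_mismatch"] * 3 + ["code_error"] * 5 + ["spec_mismatch"] * 2

totals = Counter(LIVE_CLASSES) + Counter(HISTORY_CLASSES)
for label, count in sorted(totals.items()):
    print(f"{label}: {count}")
# Prints:
# code_error: 9
# db_error: 1
# rate_limit: 1
# spec_mismatch: 9
```

The tally shows that the 20 triaged failures split almost entirely between `code_error` and `spec_mismatch`, with a single `db_error` and a single `rate_limit` case.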

    Follow-up tasks created

    • 24c69375-1e0a-4b9d-a926-355a7fbdbff2: [Senate] Fix recurring engine worktree acquisition failures
    • 77521747-85b5-4463-a04e-ccbe05424b77: [Senate] Fix phantom-running recurring engine cleanup
    • 5caa79fb-399c-4c4a-bbe6-c1eb032617ec: [Exchange] Remove PostgreSQL-incompatible PRAGMA calls from recurring market engines

    Frequency and batching recommendations

    • Reduce or temporarily pause recurring engines that are blocked by provider-specific rate limits instead of spending every cycle on retries. The clearest live case is notebook-link review hitting max_gmail.
    • Move timeout-prone recurring sweeps (Link validation sweep, Visual regression testing, and likely KG edge consistency) to batched work with a smaller per-cycle surface area before considering a longer lease.
    • Do not treat raw cli-requeue history as a health signal for recurring engines. For Quest engine CI, Debate engine cycle, and DB health check, the meaningful fixes are the upstream worktree/phantom-run issues and the too-frequent no-op cadence already documented in sibling quest-engine specs.
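
The batching recommendation for timeout-prone sweeps can be sketched as a cursor over the full work list, so each cycle handles a bounded slice and the next cycle resumes where the previous one stopped. The `next_batch` helper and the idea of persisting the cursor between cycles are assumptions for illustration, not existing engine behavior.

```python
def next_batch(items: list, cursor: int, batch_size: int) -> tuple:
    """Return (batch, new_cursor) for one cycle; the cursor wraps to 0 at the end."""
    if not items or batch_size <= 0:
        return [], 0
    start = cursor % len(items)
    end = start + batch_size
    batch = items[start:end]
    # Wrap around once the final slice has been handed out.
    new_cursor = 0 if end >= len(items) else end
    return batch, new_cursor
```

For example, a sweep over five links with `batch_size=2` would cover them in three cycles (two, two, then one) instead of attempting all five inside a single lease.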

    2026-04-26 07:05:00Z — Slot codex:54

    • Created a new task spec for this triage run because the assigned task had no spec_path in Orchestra.
    • Queried live SciDEX recurring task failures from /home/ubuntu/Orchestra/orchestra.db and supplemented the census with recent task_runs.last_error history because only 10 live recurring tasks currently carry last_error.
    • Triaged 20 failure cases into four classes: rate_limit, code_error, db_error, and spec_mismatch.
    • Created 3 one-shot follow-up tasks with worktree-pinned spec paths:
    - 24c69375-1e0a-4b9d-a926-355a7fbdbff2 for recurring worktree acquisition failures
    - 77521747-85b5-4463-a04e-ccbe05424b77 for phantom-running recurring tasks
    - 5caa79fb-399c-4c4a-bbe6-c1eb032617ec for PostgreSQL-incompatible PRAGMA usage in recurring market engines

    2026-04-26 07:24:00Z — Slot codex:54

    • Re-validated task relevance against live Orchestra state: task 9e45545a-0eeb-4698-9d79-ffda3f456b45 was created 2026-04-23T04:44:38Z, still has no spec_path, and the recurring-failure query on /home/ubuntu/Orchestra/orchestra.db still returns 10 non-empty live last_error rows for SciDEX.
    • Confirmed the run-history supplement still matches current task_runs ordering for recurring failures, with Quest engine CI dominating the recent error stream and multiple acquire_fail:worktree_creation_failed entries still visible.
    • Verified the three follow-up task IDs exist in Orchestra: 5caa79fb-399c-4c4a-bbe6-c1eb032617ec and 77521747-85b5-4463-a04e-ccbe05424b77 remain open for follow-up work, while 24c69375-1e0a-4b9d-a926-355a7fbdbff2 has already been completed by a later worker.

    File: 9e45545a_triage_20_autonomous_engine_failures_spec.md
    Modified: 2026-04-26 00:19
    Size: 9.2 KB