[Senate] scidex doctor - diagnose common dev-env issues done

15 health probes (PG, submodules, secrets, services, claude CLI, migrations) with PASS/FAIL/FIX commands; --json.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Senate] scidex doctor - 15 health probes for dev-env diagnostics [task:eb06d3df-8328-49ce-a959-06e16002550d] (#904) 2026-04-27
Spec File

Effort: standard

Goal

Onboarding a new contributor (or a fresh agent VM) to SciDEX hits a
predictable wall of opaque failures: PG socket missing, submodule
data dir uninitialised, secrets file not symlinked, claude-CLI
auth not migrated, pull_main.sh not running, nginx not pointing
at port 8000, the data/scidex-artifacts/ worktree in a detached
state, the personas/ skills not loaded because the loader can't
see ~/.claude. Ship scidex doctor — a single command that runs
~15 health probes and prints PASS/FAIL/FIX-cmd for each, exit-coded.

Acceptance Criteria

CLI scidex doctor [--fix] [--verbose] [--json] added to
cli.py. Default mode prints a coloured table; --json
emits machine-parseable output for the fleet health
watchdog.
Probes (each a Check dataclass with name, description,
run() -> CheckResult(status, evidence, fix_command)):
1. PG socket: select 1 from scidex DB; FAIL prints
sudo systemctl restart postgresql.
2. Submodules initialised: data/scidex-artifacts/.git
and data/scidex-papers/.git exist; FAIL prints
git submodule update --init --recursive.
3. Submodule worktree: current artifacts checkout is on a
branch (not detached HEAD); FAIL prints
cd data/scidex-artifacts && git checkout main.
4. Secrets: .env symlinked to /home/ubuntu/secrets/
and parses (no obvious <token> placeholders).
5. Pull loop: pgrep -f pull_main.sh returns one PID;
FAIL prints the systemd unit start command.
6. API health: curl -fs localhost:8000/api/health
returns 200; FAIL prints sudo systemctl restart scidex-api.
7. Nginx route: curl -fs localhost/api/health returns
200; FAIL points at /etc/nginx/sites-available/scidex.
8. Claude CLI: which claude resolves and claude --version
returns ≥2.1.119; FAIL prints the install command.
9. Skills loader:
python3 -c "import scidex.skills.loader as L; print(len(L.list_skills()))"
returns ≥9; FAIL points at the persona dir permissions check.
10. Bridge: curl -fs localhost:8899/health returns 200;
FAIL points at scidex-bridge.
11. Disk space: /data/ has ≥10 GB free; warn at <10,
fail at <2.
12. Postgres connections: open count <80% of max_connections;
warn at 80, fail at 95 (cf. existing pool-gauge metric).
13. Worktree drift: count of .orchestra-worktrees/*/
dirs <30; warn above (most likely abandoned).
14. Stale lock files: /tmp/scidex-*.lock younger than
1h are fine; older than 24h is FAIL.
15. Migration parity: count of migrations/*.py matches
rows in _applied_migrations table; FAIL prints
python scripts/run_migration.py --apply-pending.
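
As an illustration of the probe shape described above, here is a minimal sketch of the Check / CheckResult pair together with the disk-space probe (item 11). The field names come from the spec; the helper names classify_free_space and disk_space_check, the status strings, and the choice of 1 GB = 1024**3 bytes are assumptions made for this example, not the shipped implementation.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CheckResult:
    status: str                        # assumed values: "PASS", "WARN", "FAIL"
    evidence: str                      # human-readable observation
    fix_command: Optional[str] = None  # None when the spec names no fix


@dataclass(frozen=True)
class Check:
    name: str
    description: str
    run: Callable[[], CheckResult]


def classify_free_space(free_gb: float) -> str:
    """Thresholds from item 11: fail below 2 GB, warn below 10 GB."""
    if free_gb < 2:
        return "FAIL"
    if free_gb < 10:
        return "WARN"
    return "PASS"


def disk_space_check(path: str = "/data/") -> Check:
    """Build the disk-space probe for `path` (hypothetical helper)."""

    def run() -> CheckResult:
        # Assumption: "GB" is read as 1024**3 bytes.
        free_gb = shutil.disk_usage(path).free / 1024**3
        return CheckResult(
            status=classify_free_space(free_gb),
            evidence=f"{free_gb:.1f} GB free on {path}",
        )

    return Check("disk_space", f"{path} has at least 10 GB free", run)
```

Keeping the threshold logic in a pure function separate from the filesystem call means the boundaries can be tested without touching a real disk.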
--fix mode runs the fix command for any FAIL the user
explicitly opts into (interactive y/n unless --yes).
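
The opt-in behaviour of --fix could look roughly like the sketch below. The function name apply_fixes, its parameters, and the prompt wording are hypothetical. The confirm and runner arguments exist only so the prompt and the subprocess call can be replaced in tests.

```python
import subprocess
from typing import Callable, Iterable, List, Optional, Tuple


def apply_fixes(
    failures: Iterable[Tuple[str, Optional[str]]],
    assume_yes: bool = False,
    confirm: Callable[[str], str] = input,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run fix commands for FAILed probes the user agrees to.

    `failures` holds (probe_name, fix_command) pairs; probes without a
    fix command are skipped. Returns the names whose fix was executed.
    """
    applied: List[str] = []
    for name, fix_command in failures:
        if not fix_command:
            continue
        if not assume_yes:  # --yes skips the interactive prompt
            answer = confirm(f"{name}: run `{fix_command}`? [y/N] ")
            if answer.strip().lower() not in ("y", "yes"):
                continue
        # shell=True because some fix commands chain steps with `&&`;
        # the commands originate from the tool itself, not from user input.
        runner(fix_command, shell=True, check=False)
        applied.append(name)
    return applied
```

Defaulting to "no" on anything other than an explicit y/yes keeps an accidental Enter from restarting a service.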
Tests in tests/test_doctor.py: each probe is importable;
mocked subprocess returns produce the expected
CheckResult; the aggregate run_all() returns a
deterministically ordered list.
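
A mocked-subprocess test of the kind described might be sketched as follows, using the pull-loop probe (item 5) as the subject. The probe body, the dict return shape, and the helper fake_pgrep are assumptions for illustration; the real tests in tests/test_doctor.py may be organised differently.

```python
import subprocess
from unittest import mock


def pull_loop_probe() -> dict:
    """Sketch of probe 5: PASS when pgrep reports exactly one PID."""
    proc = subprocess.run(
        ["pgrep", "-f", "pull_main.sh"],
        capture_output=True,
        text=True,
        check=False,
    )
    pids = proc.stdout.split()
    status = "PASS" if len(pids) == 1 else "FAIL"
    return {"status": status, "evidence": f"{len(pids)} matching PID(s)"}


def fake_pgrep(stdout: str, returncode: int = 0) -> subprocess.CompletedProcess:
    """Build the object subprocess.run would have returned (test helper)."""
    return subprocess.CompletedProcess(
        args=["pgrep"], returncode=returncode, stdout=stdout, stderr=""
    )


def test_pull_loop_pass():
    with mock.patch("subprocess.run", return_value=fake_pgrep("4242\n")):
        assert pull_loop_probe()["status"] == "PASS"


def test_pull_loop_fail_when_missing():
    with mock.patch("subprocess.run", return_value=fake_pgrep("", returncode=1)):
        assert pull_loop_probe()["status"] == "FAIL"
```

Patching subprocess.run means the test never depends on whether pull_main.sh is actually running on the machine executing the suite.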
Hook into watchdog. The fleet health watchdog
(cf. reference_fleet_health_watchdog.md) gains a periodic
scidex doctor --json poll and emits a notification on
regression (FAIL count delta > 0 since last run).
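
The regression rule (FAIL count delta > 0 since last run) can be sketched as a comparison of two consecutive --json outputs. The JSON shape used here, a top-level "checks" list whose entries carry a "status" field, is an assumption; the spec does not spell out the contract.

```python
import json


def fail_count(doctor_json: str) -> int:
    """Count FAIL entries in one `scidex doctor --json` report.

    Assumes the report looks like {"checks": [{"status": ...}, ...]}.
    """
    report = json.loads(doctor_json)
    return sum(1 for check in report["checks"] if check["status"] == "FAIL")


def is_regression(previous_json: str, current_json: str) -> bool:
    """True when the current run has more FAILs than the previous run."""
    return fail_count(current_json) - fail_count(previous_json) > 0
```

Note that a pure count delta does not notice one probe recovering while another starts failing in the same interval; comparing the set of failing probe names would catch that case, at the cost of a slightly larger state file.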

Approach

  • Sketch the Check dataclass plus a tiny register_check()
    decorator so each probe is one function.
  • Implement probe by probe, lifting evidence for each from the
    SciDEX incident memory (PG corruption, claude-cli auth
    migration, pull-loop missing).
  • Add the JSON contract first (the watchdog will consume it
    immediately), pretty-print second.
  • Wire into cli.py's argparse next to the existing status
    command.
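
The first and third steps above can be sketched together: a registry decorator that turns each probe into one function, and a JSON serialiser built on top of it. Everything here beyond the names register_check, run_all, and CheckResult is an assumption, including the "checks" envelope, the two placeholder probes, and the rule that any FAIL yields exit code 1.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    status: str
    evidence: str
    fix_command: Optional[str] = None


# Registration order doubles as the deterministic run order.
_REGISTRY: List[Tuple[str, str, Callable[[], CheckResult]]] = []


def register_check(name: str, description: str):
    """Decorator that records a probe function under `name`."""

    def decorator(func: Callable[[], CheckResult]):
        _REGISTRY.append((name, description, func))
        return func

    return decorator


@register_check("always_pass", "Placeholder probe that succeeds")
def _always_pass() -> CheckResult:
    return CheckResult("PASS", "nothing to check")


@register_check("always_fail", "Placeholder probe that fails")
def _always_fail() -> CheckResult:
    return CheckResult("FAIL", "simulated failure", "echo fix-me")


def run_all() -> List[dict]:
    """Run every registered probe; a crashing probe is reported as FAIL."""
    results = []
    for name, description, func in _REGISTRY:
        try:
            result = func()
        except Exception as exc:
            result = CheckResult("FAIL", f"probe raised {exc!r}")
        results.append({"name": name, "description": description, **asdict(result)})
    return results


def to_json(results: List[dict]) -> str:
    return json.dumps({"checks": results}, indent=2)


def exit_code(results: List[dict]) -> int:
    # Assumption: any FAIL makes the command exit non-zero.
    return 1 if any(r["status"] == "FAIL" for r in results) else 0
```

Catching exceptions inside run_all() keeps one broken probe from hiding the results of the others, which matters when the output feeds an automated consumer.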

Dependencies

  • None (pure read-side instrumentation).

Dependents

  • reference_fleet_health_watchdog.md consumer.
  • q-obs-agent-latency-budget may reuse the probe pattern.

Work Log

2026-04-27 23:45 PT, Slot 75

  • Completed implementation: scidex core/scidex_doctor.py with 15 health probes
  • Added scidex doctor [--fix] [--verbose] [--json] [--yes] to cli.py
  • Created tests/test_doctor.py (27 tests, all passing)
  • Rebased against origin/main (was based on stale 31f1e41e5, now at a4d9b890f)
  • Verified: scidex doctor --json returns correct JSON with 15 checks
  • Migration parity probe uses migration_history table (not _applied_migrations)
  • Skills loader probe gracefully falls back to counting persona dirs when loader import fails
  • Committed and pushed to branch orchestra/task/eb06d3df-scidex-doctor-diagnose-common-dev-env-is
  • Result: 11 PASS, 3 FAIL (expected: submodules not initialised in worktree, bridge not running), 1 WARN (skills loader)
