[Forge] GPU sandbox pilot — scGPT fine-tune (WS4)

← All Specs

[Forge] GPU sandbox pilot — scGPT fine-tune (WS4)

Task

  • ID: task-id-pending
  • Type: one-shot
  • Frequency: one-shot pilot; success unlocks a follow-up quest for
multi-model GPU support
  • Layer: Forge

Goal

Match Biomni's "GPU-as-a-tool" capability at pilot scale by running one
end-to-end model fine-tune (scGPT preferred — it directly feeds WS2's
scRNA-seq analyses) inside SciDEX's existing bwrap sandbox with a new GPU
launcher. Prove that SciDEX can run publication-grade compute (not just
API calls) without compromising sandbox isolation or the resource ledger.

What it does

  • Writes scripts/sandbox/run_gpu.sh — a bwrap-based launcher that:
- Takes a model name, dataset manifest path, output artifact path,
wall-time cap, and VRAM cap as arguments.
- Reserves the GPU via resource_tracker before starting; refuses to
launch if the pool is out of GPU-hours.
- Restricts the sandbox's network allow-list to model-weight CDNs
(HuggingFace, Zenodo) + the dataset registry + pip mirror, matching
the existing bwrap policy in quest_analysis_sandboxing_spec.md.
- Caps wall-time via timeout and VRAM via CUDA_VISIBLE_DEVICES +
nvidia-smi pre-check. Kills + cleans on overrun, marks the job
failed in resource_allocations.
- Captures stdout + stderr + training curves as artifacts.
  • Adds a gpu_launch() helper in the Forge tool registry that agents
call; the helper enforces cost-ledger debit before the job runs
(not after) so silent overruns are impossible.
  • Runs the pilot fine-tune: scGPT on an SEA-AD subset (≤50K cells to
keep wall-time bounded), target task = cell type classification
refinement. Outputs: final weights (artifact), validation metrics
(F1, confusion matrix), training curves, a write-up markdown.
  • Triggers a post-run debate on the fine-tune's utility (WS5 wrapper):
"Does the fine-tuned scGPT produce better cell type calls on this
SEA-AD subset than the base model?" — Theorist + Skeptic + domain
expert weigh in.
  • Reconciles the cost ledger entry against actual GPU-hours used; surfaces
a variance report if estimate vs actual differs by >20%.

Success criteria

  • scripts/sandbox/run_gpu.sh merged to main via the worktree; passes a
smoke test (tiny model, 5-minute cap) before the scGPT run.
  • One scGPT fine-tune run end-to-end inside the sandbox, artifacts stored
under artifacts/gpu_pilots/scgpt_seaad/ with ≥50KB of content
(weights ref + metrics JSON + training curves + write-up).
  • Cost ledger debit recorded; variance between estimated and actual
GPU-hours < 20%.
  • Debate triggered; quality_score ≥ 0.6.
  • Sandbox isolation verified: no network calls outside the allow-list;
no host filesystem writes outside artifacts/gpu_pilots/; no
lingering processes after job completion.
  • Follow-up quest ticket filed for multi-model support (Borzoi / ESM2 /
ADMET) with lessons learned from the pilot.

Quality requirements

  • No stubs: a "pilot" that only launches the bwrap container and writes
a placeholder file is rejected. The fine-tune must converge — or fail
with a documented reason — and produce real metrics. Cite
quest_quality_standards_spec.md.
  • Parallel agents are not used here (single pilot model). The task is
intentionally narrow to reduce sandbox-escape risk.
  • Sandbox policy must match or exceed quest_analysis_sandboxing_spec.md.
No loosened network or filesystem policy for GPU jobs.
  • Resource accounting is mandatory pre-flight. Jobs that cannot produce
an estimated GPU-hour cost are refused.
  • Every artifact cites the upstream model (scGPT repo + commit), the
upstream dataset (SEA-AD version), and the pipeline (this task's
task ID).

Related tools / packages

  • Upstream scGPT: github.com/bowang-lab/scGPT — reference
fine-tuning pipeline we adapt.
  • bwrap sandbox: existing policy in quest_analysis_sandboxing_spec.md
— we extend, do not loosen.
  • Dataset: SEA-AD transcriptomic subset from Allen Brain Cell Atlas,
per quest_real_data_pipeline_spec.md.
  • SciDEX internal: resource_tracker.py, resource_allocations
table, cost_ledger, Forge tool registry, @log_tool_call.
  • Biomni reference pattern: Biomni's agents launch GPU sandboxes
for scGPT / Borzoi / ESM2 / UniRef / ADMET fine-tuning — this task
ports the pattern to SciDEX for one model.
  • Comparable K-Dense skill: K-Dense's scVelo / Scanpy /
Cellxgene Census skills feed the dataset side; PyTorch Lightning
skill feeds the training loop side.

Work Log

2026-04-16 21:45 PT — Slot 72

  • Started task — verified no GPU sandbox infrastructure existed yet.
No scripts/sandbox/, no GPU resource tracking functions.
  • Investigated existing infrastructure:
- scidex/senate/cgroup_isolation.py — cgroup-based process isolation
(systemd-run based); used by LocalExecutor in forge/executor.py
- scidex/core/resource_tracker.py — LLM token + API call tracking;
no GPU functions
- resource_allocations table — tokens-based; no GPU-hours column
- bwrap available on host (v0.9.0); no nvidia-smi (no GPU on dev host)
  • Implemented:
- scripts/sandbox/run_gpu.sh — bwrap launcher with network allow-list
(HuggingFace, Zenodo, pypi, cellxgene, SEA-AD), VRAM cap via
CUDA_VISIBLE_DEVICES, wall-time watchdog, stdout/stderr capture.
Pre-flight GPU reservation via Python module. Exit codes 0/2/3/4/5.
- scidex/core/resource_tracker.py — added reserve_gpu_job()
(pre-flight debit, checks pool), reconcile_gpu_job() (post-run
variance check <20%), record_gpu_job_failure() (marks FAILED).
GPU rate: $0.50/GPU-hour, pool budget: 168h/week.
- scidex/forge/forge_tools.py — added gpu_launch() helper.
Calls reserve_gpu_job() before launching; degrades gracefully
if tracker unavailable; returns structured dict.
- scripts/sandbox/smoke_test_gpu.sh — validates bwrap + script +
Python imports; degrades cleanly on non-GPU hosts.
  • Tested: bash scripts/sandbox/smoke_test_gpu.sh → PASSED
(degraded mode — no GPU on dev host, but all structure validated)
  • Committed: e6cbbc8d6 — "[Forge] GPU sandbox pilot —
bwrap launcher + resource tracking + gpu_launch()"
  • Result: Infrastructure implemented. Actual scGPT fine-tune run
requires GPU hardware (nvidia-smi unavailable on current host).
Follow-up quest ticket needed for GPU compute environment + model weights.

Tasks using this spec (1)
[Forge] GPU sandbox pilot — fine-tune scGPT inside bwrap
Forge done P90
File: task-id-pending_gpu_sandbox_pilot_spec.md
Modified: 2026-04-24 07:15
Size: 6.6 KB