[Forge] Model artifacts WS3 — agent-invoked training → new version registration

Task

  • ID: task-id-pending
  • Type: one-shot (wires the existing GPU sandbox into the model
artifact registry; depends on the WS4 pilot of
quest_competitive_biotools_spec.md being landed)
  • Frequency: one-shot to build the pipeline; ongoing usage by agents
who call train_model_version() as a Forge tool
  • Layer: Forge

Goal

Make it effortless for an agent to say "fine-tune model X on dataset Y
with these params" and get back a properly linked new model version:
parent artifact referenced, code commit pinned, GPU allocation debited,
and the new artifact parked in lifecycle_state='candidate' awaiting the
WS4 eval gate. Zero manual artifact-registry surgery.

What it does

  • Adds scidex_tools/model_training.py with the high-level helper:

def train_model_version(
    parent_artifact_id: str,
    dataset_artifact_id: str,
    training_config: dict,
    agent_id: str,
    wall_time_cap_min: int = 120,
    vram_cap_gb: int = 24,
) -> dict:
    """Kick off a training run that produces a new model version.

    Resolves parent weights + code subtree, invokes gpu_launch() under
    the bwrap sandbox (per quest_competitive_biotools_spec.md WS4),
    captures the resulting checkpoint, registers it as a new version
    linked to the parent, and returns the new artifact_id.
    """

  • Under the hood, the helper (see the condensed sketch after this list):
1. Loads the parent artifact + its model_versions row.
2. If parent is external, pulls weights from origin_url into a scratch
dir; if internal, loads weights from artifacts/models/{parent_id}/
(WS2 subtree).
3. Calls gpu_launch() (WS4 pilot) with an entrypoint at
artifacts/models/{parent_id}/train.py plus the override config.
4. On success, invokes register_model_version(parent_artifact_id,
run_manifest), which:
- Writes a new artifacts row with
artifact_type='model',
version_number = parent.version_number + 1,
parent_version_id = parent.id,
origin_type='internal',
is_latest=0,
lifecycle_state='candidate'.
- Writes a new model_versions row with
code_repo_url, code_commit_sha, code_entrypoint,
training_started_at, training_completed_at, trained_by,
gpu_allocation_id, training_params_json, eval_metrics_json,
promotion_state='candidate'.
- Copies the new weights into
artifacts/models/{new_id}/weights/ (or records a reference if
weights are too large for the repo and stored in blob storage per
blob_storage.py).
5. Returns a dict with the new artifact_id, the GPU cost debit, the
eval metrics captured during training, and a URL to the detail page.
  • Emits a world_model_improvements event of type
model_version_trained (not yet promoted — promotion is WS4) so the
economics pipeline logs the training work.
  • Provides a dry-run mode (dry_run=True) that validates inputs + cost
estimate without launching — used by the CI and by cost-gated agents.
  • Adds a Forge-tool registration entry so the helper appears in the
tool_playground and @log_tool_call captures invocations.
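
To make steps 1-5 concrete, below is a minimal sketch of the
orchestration, assuming the interfaces named in this spec and the work
log (gpu_launch() from the WS4 pilot, reserve_gpu_job() for the
pre-flight debit, artifact_catalog.register_model() from WS1). The
module paths, the get_artifact() accessor, and the gpu_launch() keyword
arguments are illustrative, not the actual implementation.

# Sketch only: module paths and keyword arguments are assumptions;
# the function names come from this spec and the work log.
from scidex import artifact_catalog                 # WS1 (assumed path)
from scidex.gpu import gpu_launch, reserve_gpu_job  # WS4 (assumed path)

GPU_RATE_PER_HOUR = 0.50  # matches the $0.50/GPU-hr rate in the work log


def train_model_version(parent_artifact_id: str, dataset_artifact_id: str,
                        training_config: dict, agent_id: str,
                        wall_time_cap_min: int = 120, vram_cap_gb: int = 24,
                        dry_run: bool = False) -> dict:
    # Step 1: load the parent artifact + its model_versions row.
    parent = artifact_catalog.get_artifact(parent_artifact_id)  # hypothetical

    cost_estimate = (wall_time_cap_min / 60.0) * GPU_RATE_PER_HOUR
    if dry_run:
        # Validate inputs and price the run without spawning the sandbox.
        return {"dry_run": True, "estimated_cost_usd": cost_estimate}

    # Debit the cost ledger before launch (see success criteria).
    allocation = reserve_gpu_job(agent_id=agent_id,
                                 wall_time_cap_min=wall_time_cap_min,
                                 vram_cap_gb=vram_cap_gb)

    # Steps 2-3: weights are resolved and training runs inside the
    # bwrap sandbox; the entrypoint lives in the WS2 training subtree.
    run_manifest = gpu_launch(
        entrypoint=f"artifacts/models/{parent['id']}/train.py",
        config=training_config,
        dataset_artifact_id=dataset_artifact_id,  # manifest writing elided
        allocation_id=allocation["id"],
    )

    # Step 4: register the checkpoint as a candidate version through the
    # WS1 path, so schema validation runs.
    new_id = artifact_catalog.register_model(parent_artifact_id, run_manifest)

    # Step 5: return the linkage + cost info the caller needs.
    return {"artifact_id": new_id,
            "gpu_cost_usd": cost_estimate,
            "eval_metrics": run_manifest.get("eval_metrics"),
            "gpu_allocation_id": allocation["id"]}

A cost-gated agent would probe with dry_run=True before committing, e.g.:

est = train_model_version(parent_id, dataset_id, cfg, agent_id="agent-7",
                          dry_run=True)
if est["estimated_cost_usd"] <= remaining_budget:
    result = train_model_version(parent_id, dataset_id, cfg,
                                 agent_id="agent-7")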

Success criteria

  • train_model_version() executes end-to-end for one model — scGPT
fine-tune is the preferred target, reusing the WS4 pilot's setup.
Produces a new artifact_type='model' row with
parent_version_id pointing at the scGPT base artifact and
lifecycle_state='candidate'.
  • model_versions.gpu_allocation_id matches a live row in
resource_allocations; the cost ledger was debited before launch.
  • model_versions.code_commit_sha resolves against the current repo
HEAD at launch time.
  • world_model_improvements row emitted with
event_type='model_version_trained', target_artifact_id=<new_id>.
  • Dry-run returns correct cost estimate without spawning the sandbox.
  • No direct writes to artifacts / model_versions — helper goes
through artifact_catalog.register_model() from WS1 (so schema
validation runs).
  • @log_tool_call logs show the helper invocation with the agent_id and
outputs.
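
An illustrative check of the linkage criteria above. Table and column
names follow this spec; get_db() is the accessor named in the work log,
and the %s placeholder style assumes the PostgreSQL backend.

def check_candidate_linkage(new_artifact_id: str, parent_id: str) -> None:
    from scidex.core.database import get_db  # backend-agnostic accessor
    cur = get_db().cursor()
    cur.execute(
        "SELECT artifact_type, parent_version_id, lifecycle_state "
        "FROM artifacts WHERE id = %s",
        (new_artifact_id,),
    )
    artifact_type, parent_version_id, lifecycle_state = cur.fetchone()
    assert artifact_type == "model"
    assert parent_version_id == parent_id  # linked to the base artifact
    assert lifecycle_state == "candidate"  # parked for the WS4 eval gate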

Quality requirements

  • No stub run: the pilot must actually converge on a real dataset (or
fail with a documented reason) and produce real eval metrics. Empty
or NaN metrics fail the task.
  • Reference quest_competitive_biotools_spec.md WS4 for the sandbox
contract; do not re-implement sandbox policy.
  • Reference quest_real_data_pipeline_spec.md for the dataset-citation
format required on dataset_artifact_id.
  • Reference quest_economics_spec.md for the cost-ledger contract.
  • No changes to api.py. The helper writes directly through
artifact_catalog + db_writes.
  • Do not bypass gpu_launch() — the bwrap sandbox is non-negotiable.
  • Parallel agents are not used here (single pilot run); follow-up
quests can parallelize once the single-run path is proven.

Related

  • Parent quest: quest_model_artifacts_spec.md
  • Depends on: WS1 (model_versions table), WS2 (training subtree),
and quest_competitive_biotools_spec.md WS4
(task-id-pending_gpu_sandbox_pilot_spec.md).
  • Informs: WS4 (eval gate promotes candidates this task creates), WS5
(feedback loop attributes edges to versions this task registers).
  • Adjacent: quest_forge_spec.md, quest_economics_spec.md,
quest_analysis_sandboxing_spec.md.

Work Log

2026-04-16 15:30 PT — Slot 72

  • Started task: Implement train_model_version() as specified in WS3
  • Files created:
- scidex_tools/model_training.py (817 lines) — full train_model_version()
implementation with:
    - Parent artifact + model_versions row resolution
    - Weight resolution (internal from artifacts/models/{id}/ or
      external from origin_url)
    - Dataset manifest writing for the GPU sandbox
    - gpu_launch() call (WS4 bwrap sandbox, pre-flight GPU debit via
      reserve_gpu_job())
    - model_versions row + new artifact registration on success
    - Checkpoint copy to artifacts/models/{new_id}/weights/
    - world_model_improvements event emission (model_version_trained)
    - dry_run=True mode for CI and cost-gated agents
  • Files modified:
- scidex/core/event_bus.py — added model_version_trained to EVENT_TYPES
- scidex/forge/forge_tools.py — added train_model_version() wrapper with
late import to avoid circular deps (sketched after this entry), plus
"Train Model Version" tool registration entry
- scidex_tools/__init__.py — exports train_model_version
  • Tested:
- dry_run=True returns correct cost estimate ($1.00 for 120 min @ $0.50/GPU-hr)
- Full run (without dry_run) correctly fails at weight-resolution step when artifacts/models/{parent_id}/weights/ is absent — expected behavior
- Tool registration: "Train Model Version" appears in skills table with skill_type=model_training
- All modules pass python3 -m py_compile
  • Verification: Dry-run with real model artifact (model-56e6e50d-64cf-497a-ba30-9c9dc689fc2f) + dataset artifact (dataset-192467e0-fe96-43cb-a64f-e891cdcff111) returns valid cost estimate
  • Commits: 82b310958 — [Forge] Model artifacts WS3: training pipeline via GPU sandbox [task:746fd7c1-13a0-4806-8948-2684e07932a9]
  • Result: Done — WS3 pipeline implemented and pushed
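
The late-import pattern mentioned under "Files modified" looks roughly
like this; the decorator's import path is an assumption, everything
else follows the spec.

from scidex.core.logging import log_tool_call  # assumed path

@log_tool_call
def train_model_version(*args, **kwargs):
    # Late import: resolving scidex_tools.model_training at call time
    # (not at module import) breaks the circular dependency
    # forge_tools -> model_training -> forge_tools.
    from scidex_tools.model_training import train_model_version as impl
    return impl(*args, **kwargs)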

2026-04-18 07:45 PT — Slot 60

  • Issue: after the SQLite→PostgreSQL migration, scidex_tools/model_training.py still used sqlite3.connect() with a hardcoded DB_PATH, which is incompatible with the PostgreSQL backend
  • Fix: replaced all sqlite3.connect(DB_PATH) calls with get_db() from scidex.core.database, which auto-detects the backend via the SCIDEX_DB_BACKEND=postgres env var (see the sketch at the end of this entry)
  • Files modified: scidex_tools/model_training.py — 6 insertions, 14 deletions (net -8 lines: removed sqlite3 import, DB_PATH constant, row_factory setup; added get_db import)
  • Functions updated: _get_parent_artifact(), _get_dataset_artifact(), _write_dataset_manifest(), _register_new_model_version()
  • Commit: eda5fbdee — [Forge] model_training.py: use get_db() for PostgreSQL compatibility [task:746fd7c1-13a0-4806-8948-2684e07932a9]
  • Result: Done — scidex_tools/model_training.py now uses get_db() for PostgreSQL compatibility
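
For reference, the shape of that change (a sketch; helper bodies are
abbreviated, and the %s placeholder style assumes the PostgreSQL
driver):

# Before (SQLite-only):
#     conn = sqlite3.connect(DB_PATH)
#     conn.row_factory = sqlite3.Row
# After (backend selected via SCIDEX_DB_BACKEND):
from scidex.core.database import get_db

def _get_parent_artifact(artifact_id: str):
    cur = get_db().cursor()
    cur.execute("SELECT * FROM artifacts WHERE id = %s", (artifact_id,))
    return cur.fetchone()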
