[Forge] Model artifacts WS3 — agent-invoked training → new version registration

Task

  • ID: task-id-pending
  • Type: one-shot (wires the existing GPU sandbox into the model
artifact registry; depends on the WS4 pilot of
quest_competitive_biotools_spec.md being landed)
  • Frequency: one-shot to build the pipeline; ongoing usage by agents
who call train_model_version() as a Forge tool
  • Layer: Forge

Goal

Make it effortless for an agent to say "fine-tune model X on dataset Y
with these params" and get back a properly linked new model version:
parent artifact referenced, code commit pinned, GPU allocation debited,
and the new artifact parked in lifecycle_state='candidate' awaiting the
WS4 eval gate. Zero manual artifact-registry surgery.

What it does

  • Adds scidex_tools/model_training.py with the high-level helper:

def train_model_version(
    parent_artifact_id: str,
    dataset_artifact_id: str,
    training_config: dict,
    agent_id: str,
    wall_time_cap_min: int = 120,
    vram_cap_gb: int = 24,
) -> dict:
    """Kick off a training run that produces a new model version.

    Resolves parent weights + code subtree, invokes gpu_launch() under
    the bwrap sandbox (per quest_competitive_biotools_spec.md WS4),
    captures the resulting checkpoint, registers it as a new version
    linked to the parent, and returns the new artifact_id.
    """

  • Under the hood, the helper (see the condensed sketch after this list):
1. Loads the parent artifact + its model_versions row.
2. If parent is external, pulls weights from origin_url into a scratch
dir; if internal, loads weights from artifacts/models/{parent_id}/
(WS2 subtree).
3. Calls gpu_launch() (WS4 pilot) with an entrypoint at
artifacts/models/{parent_id}/train.py plus the override config.
4. On success, invokes register_model_version(parent_artifact_id,
run_manifest), which:
- Writes a new artifacts row with
artifact_type='model',
version_number = parent.version_number + 1,
parent_version_id = parent.id,
origin_type='internal',
is_latest=0,
lifecycle_state='candidate'.
- Writes a new model_versions row with
code_repo_url, code_commit_sha, code_entrypoint,
training_started_at, training_completed_at, trained_by,
gpu_allocation_id, training_params_json, eval_metrics_json,
promotion_state='candidate'.
- Copies the new weights into
artifacts/models/{new_id}/weights/ (or records a reference if
weights are too large for the repo and stored in blob storage per
blob_storage.py).
5. Returns a dict with the new artifact_id, the GPU cost debit, the
eval metrics captured during training, and a URL to the detail page.
  • Emits a world_model_improvements event of type
model_version_trained (not yet promoted — promotion is WS4) so the
economics pipeline logs the training work.
  • Provides a dry-run mode (dry_run=True) that validates inputs + cost
estimate without launching — used by the CI and by cost-gated agents.
  • Adds a Forge-tool registration entry so the helper appears in the
tool_playground and @log_tool_call captures invocations.
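
To make steps 1-5 concrete, below is a minimal sketch of the
orchestration, assuming the interfaces named in this spec and the work
log (gpu_launch() from the WS4 pilot, reserve_gpu_job() for the
pre-flight debit, artifact_catalog.register_model() from WS1). The
module paths, the get_artifact() accessor, and the gpu_launch() keyword
arguments are illustrative, not the actual implementation.

# Sketch only: module paths and keyword arguments are assumptions;
# the function names come from this spec and the work log.
from scidex import artifact_catalog                 # WS1 (assumed path)
from scidex.gpu import gpu_launch, reserve_gpu_job  # WS4 (assumed path)

GPU_RATE_PER_HOUR = 0.50  # matches the $0.50/GPU-hr rate in the work log


def train_model_version(parent_artifact_id: str, dataset_artifact_id: str,
                        training_config: dict, agent_id: str,
                        wall_time_cap_min: int = 120, vram_cap_gb: int = 24,
                        dry_run: bool = False) -> dict:
    # Step 1: load the parent artifact + its model_versions row.
    parent = artifact_catalog.get_artifact(parent_artifact_id)  # hypothetical

    cost_estimate = (wall_time_cap_min / 60.0) * GPU_RATE_PER_HOUR
    if dry_run:
        # Validate inputs and price the run without spawning the sandbox.
        return {"dry_run": True, "estimated_cost_usd": cost_estimate}

    # Debit the cost ledger before launch (see success criteria).
    allocation = reserve_gpu_job(agent_id=agent_id,
                                 wall_time_cap_min=wall_time_cap_min,
                                 vram_cap_gb=vram_cap_gb)

    # Steps 2-3: weights are resolved and training runs inside the
    # bwrap sandbox; the entrypoint lives in the WS2 training subtree.
    run_manifest = gpu_launch(
        entrypoint=f"artifacts/models/{parent['id']}/train.py",
        config=training_config,
        dataset_artifact_id=dataset_artifact_id,  # manifest writing elided
        allocation_id=allocation["id"],
    )

    # Step 4: register the checkpoint as a candidate version through the
    # WS1 path, so schema validation runs.
    new_id = artifact_catalog.register_model(parent_artifact_id, run_manifest)

    # Step 5: return the linkage + cost info the caller needs.
    return {"artifact_id": new_id,
            "gpu_cost_usd": cost_estimate,
            "eval_metrics": run_manifest.get("eval_metrics"),
            "gpu_allocation_id": allocation["id"]}

A cost-gated agent would probe with dry_run=True before committing, e.g.:

est = train_model_version(parent_id, dataset_id, cfg, agent_id="agent-7",
                          dry_run=True)
if est["estimated_cost_usd"] <= remaining_budget:
    result = train_model_version(parent_id, dataset_id, cfg,
                                 agent_id="agent-7")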

Success criteria

  • train_model_version() executes end-to-end for one model — scGPT
fine-tune is the preferred target, reusing the WS4 pilot's setup.
Produces a new artifact_type='model' row with
parent_version_id pointing at the scGPT base artifact and
lifecycle_state='candidate'.
  • model_versions.gpu_allocation_id matches a live row in
resource_allocations; the cost ledger was debited before launch.
  • model_versions.code_commit_sha resolves against the current repo
HEAD at launch time.
  • world_model_improvements row emitted with
event_type='model_version_trained', target_artifact_id=<new_id>.
  • Dry-run returns correct cost estimate without spawning the sandbox.
  • No direct writes to artifacts / model_versions — helper goes
through artifact_catalog.register_model() from WS1 (so schema
validation runs).
  • @log_tool_call logs show the helper invocation with the agent_id and
outputs.
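
An illustrative check of the linkage criteria above. Table and column
names follow this spec; get_db() is the accessor named in the work log,
and the %s placeholder style assumes the PostgreSQL backend.

def check_candidate_linkage(new_artifact_id: str, parent_id: str) -> None:
    from scidex.core.database import get_db  # backend-agnostic accessor
    cur = get_db().cursor()
    cur.execute(
        "SELECT artifact_type, parent_version_id, lifecycle_state "
        "FROM artifacts WHERE id = %s",
        (new_artifact_id,),
    )
    artifact_type, parent_version_id, lifecycle_state = cur.fetchone()
    assert artifact_type == "model"
    assert parent_version_id == parent_id  # linked to the base artifact
    assert lifecycle_state == "candidate"  # parked for the WS4 eval gate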

Quality requirements

  • No stub run: the pilot must actually converge on a real dataset (or
fail with a documented reason) and produce real eval metrics. Empty
or NaN metrics fail the task.
  • Reference quest_competitive_biotools_spec.md WS4 for the sandbox
contract; do not re-implement sandbox policy.
  • Reference quest_real_data_pipeline_spec.md for the dataset-citation
format required on dataset_artifact_id.
  • Reference quest_economics_spec.md for the cost-ledger contract.
  • No changes to api.py. The helper writes directly through
artifact_catalog + db_writes.
  • Do not bypass gpu_launch() — the bwrap sandbox is non-negotiable.
  • Parallel agents are not used here (single pilot run); follow-up
quests can parallelize once the single-run path is proven.

Related

  • Parent quest: quest_model_artifacts_spec.md
  • Depends on: WS1 (model_versions table), WS2 (training subtree),
and quest_competitive_biotools_spec.md WS4
(task-id-pending_gpu_sandbox_pilot_spec.md).
  • Informs: WS4 (eval gate promotes candidates this task creates), WS5
(feedback loop attributes edges to versions this task registers).
  • Adjacent: quest_forge_spec.md, quest_economics_spec.md,
quest_analysis_sandboxing_spec.md.

Work Log

2026-04-16 15:30 PT — Slot 72

  • Started task: Implement train_model_version() as specified in WS3
  • Files created:
- scidex_tools/model_training.py (817 lines) — full train_model_version()
implementation with:
    - Parent artifact + model_versions row resolution
    - Weight resolution (internal from artifacts/models/{id}/ or
      external from origin_url)
    - Dataset manifest writing for the GPU sandbox
    - gpu_launch() call (WS4 bwrap sandbox, pre-flight GPU debit via
      reserve_gpu_job())
    - model_versions row + new artifact registration on success
    - Checkpoint copy to artifacts/models/{new_id}/weights/
    - world_model_improvements event emission (model_version_trained)
    - dry_run=True mode for CI and cost-gated agents
  • Files modified:
- scidex/core/event_bus.py — added model_version_trained to EVENT_TYPES
- scidex/forge/forge_tools.py — added train_model_version() wrapper with
late import to avoid circular deps (sketched after this entry), plus
"Train Model Version" tool registration entry
- scidex_tools/__init__.py — exports train_model_version
  • Tested:
- dry_run=True returns correct cost estimate ($1.00 for 120 min @ $0.50/GPU-hr)
- Full run (without dry_run) correctly fails at weight-resolution step when artifacts/models/{parent_id}/weights/ is absent — expected behavior
- Tool registration: "Train Model Version" appears in skills table with skill_type=model_training
- All modules pass python3 -m py_compile
  • Verification: Dry-run with real model artifact (model-56e6e50d-64cf-497a-ba30-9c9dc689fc2f) + dataset artifact (dataset-192467e0-fe96-43cb-a64f-e891cdcff111) returns valid cost estimate
  • Commits: 82b310958 — [Forge] Model artifacts WS3: training pipeline via GPU sandbox [task:746fd7c1-13a0-4806-8948-2684e07932a9]
  • Result: Done — WS3 pipeline implemented and pushed
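
The late-import pattern mentioned under "Files modified" looks roughly
like this; the decorator's import path is an assumption, everything
else follows the spec.

from scidex.core.logging import log_tool_call  # assumed path

@log_tool_call
def train_model_version(*args, **kwargs):
    # Late import: resolving scidex_tools.model_training at call time
    # (not at module import) breaks the circular dependency
    # forge_tools -> model_training -> forge_tools.
    from scidex_tools.model_training import train_model_version as impl
    return impl(*args, **kwargs)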

2026-04-18 07:45 PT — Slot 60

  • Issue: after the SQLite→PostgreSQL migration, scidex_tools/model_training.py still used sqlite3.connect() with a hardcoded DB_PATH, which is incompatible with the PostgreSQL backend
  • Fix: replaced all sqlite3.connect(DB_PATH) calls with get_db() from scidex.core.database, which auto-detects the backend via the SCIDEX_DB_BACKEND=postgres env var (see the sketch at the end of this entry)
  • Files modified: scidex_tools/model_training.py — 6 insertions, 14 deletions (net -8 lines: removed sqlite3 import, DB_PATH constant, row_factory setup; added get_db import)
  • Functions updated: _get_parent_artifact(), _get_dataset_artifact(), _write_dataset_manifest(), _register_new_model_version()
  • Commit: eda5fbdee — [Forge] model_training.py: use get_db() for PostgreSQL compatibility [task:746fd7c1-13a0-4806-8948-2684e07932a9]
  • Result: Done — scidex_tools/model_training.py now uses get_db() for PostgreSQL compatibility
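
For reference, the shape of that change (a sketch; helper bodies are
abbreviated, and the %s placeholder style assumes the PostgreSQL
driver):

# Before (SQLite-only):
#     conn = sqlite3.connect(DB_PATH)
#     conn.row_factory = sqlite3.Row
# After (backend selected via SCIDEX_DB_BACKEND):
from scidex.core.database import get_db

def _get_parent_artifact(artifact_id: str):
    cur = get_db().cursor()
    cur.execute("SELECT * FROM artifacts WHERE id = %s", (artifact_id,))
    return cur.fetchone()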
