Quest: Model Artifacts — Provenance, Versioning, Training, Feedback

Layer: Cross-cutting · Priority: P93 · Status: active

Vision

Every artifact_type='model' row on SciDEX must answer three questions on sight: what is it, where did it come from, and how do we reproduce it. Today the
answer is partial. The artifacts table already carries origin_type, origin_url, version_number, parent_version_id, version_tag, changelog,
and is_latest columns, and our 8 registered model artifacts already set origin_type='internal'. But the detail page
(https://scidex.ai/artifact/model-29ce54ef-040c-4831-97b6-4850faa31598,
"Neurodegeneration Risk Predictor") does not surface that provenance, does not
link to the code commit that produced the weights, does not expose sibling
versions, and does not show how the model's outputs flowed back into the world
model. A model artifact that cannot be re-run or re-trained from the artifact
page is a stub, regardless of how impressive its metrics look.

This quest formalizes the model artifact system as a first-class object with
five guarantees: (1) every model declares itself external or internal and
carries type-appropriate metadata — external models point at an upstream HF /
GitHub / paper checkpoint with a pin, internal models point at the exact
commit + training run that produced them; (2) every internal model carries a
code_repo_url + code_commit_sha pair that round-trips through our CI sandbox; (3) each model
has an append-only version lineage where each child cites the parent and the
eval delta; (4) agents can kick off training runs via the GPU sandbox pilot
(WS4 of quest_competitive_biotools_spec.md) and have the resulting checkpoint
auto-register as a new version; (5) model outputs (cell type calls, risk
scores, fine-mapped variants) flow back into the KG as edges with attribution
and confidence that the world_model_improvements driver already understands.

The broader bet: models are not just artifacts we display — they are the engines of world-model improvement. A fine-tuned scGPT that annotates cell
types in a new snRNA-seq dataset produces hundreds of KG edges that inherit
the model's provenance. If the model's provenance is rotten, every downstream
edge is rotten. This quest makes provenance non-optional.

Key concepts

  • External vs internal distinction. An external model wraps an upstream
weight set we did not train (e.g. scGPT base, ESM2-650M, Borzoi, Evo2, a
HuggingFace checkpoint). Its origin_type='external', its origin_url
points at the canonical source with a pin (commit SHA on GitHub, revision
hash on HuggingFace, version tag on Zenodo). An internal model is one we
trained or fit ourselves; its origin_type='internal', its origin_url may
be null but its code_repo_url + code_commit_sha (new) must point at the
training driver. "Fine-tunes of external bases" are internal — they are
our checkpoint even if the base is external — but they must cite the base
model artifact ID in parent_version_id or in metadata.base_model_id.
  • Version lineage. Each model artifact has version_number,
parent_version_id, version_tag, changelog. Versions form a DAG (not a
tree) because a model can be a refinement of two parents (e.g. a merge of
two fine-tunes). The UI renders the version history as a dropdown and a
"diff vs parent" panel on the detail page (see the lineage-walk sketch
after this list).
  • Code provenance. Every internal model must pin its training code. For
models produced by in-repo scripts (e.g. forge/training/train_celltype.py),
the pin is code_repo_url=https://github.com/SciDEX-AI/SciDEX +
code_commit_sha=<40-char SHA> + code_entrypoint=forge/training/…. For
models produced by external pipelines, the pin is the upstream repo + SHA.
Missing provenance downgrades the artifact to quality_status='provenance_missing'
and flags it for backfill.
  • Training → evaluation → promotion pipeline. New versions are born
candidate, not latest. A new checkpoint lands with is_latest=0; an eval
gate (WS4) runs a benchmark suite against the parent; only if the gate
passes (metric delta positive OR negative-but-documented-tradeoff) does
is_latest flip to 1 for the new version and to 0 for the old. The old
version is kept, never deleted; it moves to lifecycle_state='superseded'
with superseded_by pointing at the new version.
  • Feedback into the world model. Model outputs that produce KG edges
(cell type annotations, variant prioritization, binder suggestions) are
tagged with source_artifact_id=<model_id> and
source_version=<version_number>. If the model version is superseded, the
edges are not deleted — they keep their provenance and their confidence can
be recomputed. The world_model_improvements table already tracks
model-driven improvements by type; this quest wires the model-version axis
through it.
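To make the lineage concept concrete, here is a minimal sketch of walking the parent links, assuming the artifacts table lives in SQLite with the columns named above (the storage engine and query surface are assumptions of this sketch, not part of the spec):

```python
import sqlite3

def version_lineage(db: sqlite3.Connection, artifact_id: str) -> list[tuple]:
    """Walk parent_version_id links from one model artifact up to its root.

    Only the primary parent chain is followed; a merge parent recorded in
    metadata.base_model_id (see above) would need a second walk.
    """
    return db.execute(
        """
        WITH RECURSIVE lineage(id, version_number, version_tag,
                               parent_version_id, depth) AS (
            SELECT id, version_number, version_tag, parent_version_id, 0
              FROM artifacts WHERE id = ?
            UNION ALL
            SELECT a.id, a.version_number, a.version_tag,
                   a.parent_version_id, l.depth + 1
              FROM artifacts a JOIN lineage l ON a.id = l.parent_version_id
        )
        SELECT id, version_number, version_tag, depth
          FROM lineage ORDER BY depth
        """,
        (artifact_id,),
    ).fetchall()
```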

Workstreams

WS1: Schema extension — model_versions table + artifacts reuse

Extend the data model to make per-version, per-model metadata queryable
without scraping metadata JSON. The artifacts table already provides the
lineage skeleton (version_number, parent_version_id, version_tag, changelog, is_latest, lifecycle_state, superseded_by, origin_type, origin_url). Do not duplicate those. Add a companion model_versions
table keyed by artifact_id that carries fields the generic table cannot:
training invocation, eval metrics snapshot, code pin, training agent, GPU
allocation used, benchmark-suite reference.
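A sketch of what that companion table could look like, under the same SQLite assumption; column names beyond those listed in the paragraph above (trained_by, gpu_allocation_id, …) are illustrative guesses, not the shipped migration:

```python
import sqlite3

# Illustrative DDL only. Columns mirror the fields listed above; exact names
# and constraints are assumptions to be settled by the WS1 migration.
MODEL_VERSIONS_DDL = """
CREATE TABLE IF NOT EXISTS model_versions (
    artifact_id         TEXT PRIMARY KEY REFERENCES artifacts(id),
    training_invocation TEXT,  -- full gpu_launch() call, JSON-encoded
    eval_metrics        TEXT,  -- metrics snapshot at registration, JSON
    code_repo_url       TEXT,
    code_commit_sha     TEXT CHECK (code_commit_sha IS NULL
                                    OR length(code_commit_sha) = 40),
    code_entrypoint     TEXT,
    trained_by          TEXT,  -- agent ID that launched the run
    gpu_allocation_id   TEXT,  -- ties back to the WS4 pilot cost ledger
    benchmark_id        TEXT   -- benchmark suite the eval gate will run
);
"""

def migrate(db: sqlite3.Connection) -> None:
    db.executescript(MODEL_VERSIONS_DDL)
```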

Also define a metadata JSON schema (schemas/model_artifact_metadata.json)
that artifact_catalog.register_model() validates against at write time:
{ model_family, framework, parameter_count, base_model_id?, training_config?,
evaluation_metrics?, evaluation_dataset, training_data, benchmark_id?,
is_external, external_source_url?, external_source_version? }. Writes that
fail validation are rejected with a clear error; a backfill task migrates
the 8 existing model artifacts into compliance.
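For illustration, a hand-written stand-in for that schema plus the write-time check, using the jsonschema package; the real schemas/model_artifact_metadata.json is authored in WS1, and this sketch only mirrors the required/optional split shown above:

```python
import jsonschema

# Stand-in schema. Per the risk table, model_family stays a free-form string
# and unknown keys are allowed; only missing required fields reject a write.
MODEL_ARTIFACT_METADATA_SCHEMA = {
    "type": "object",
    "required": [
        "model_family", "framework", "parameter_count",
        "evaluation_dataset", "training_data", "is_external",
    ],
    "properties": {
        "model_family": {"type": "string"},
        "framework": {"type": "string"},
        "parameter_count": {"type": "integer"},
        "base_model_id": {"type": "string"},
        "training_config": {"type": "object"},
        "evaluation_metrics": {"type": "object"},
        "evaluation_dataset": {"type": "string"},
        "training_data": {"type": "string"},
        "benchmark_id": {"type": "string"},
        "is_external": {"type": "boolean"},
        "external_source_url": {"type": "string"},
        "external_source_version": {"type": "string"},
    },
}

def validate_model_metadata(metadata: dict) -> None:
    """Raise jsonschema.ValidationError with a clear message on bad writes."""
    jsonschema.validate(metadata, MODEL_ARTIFACT_METADATA_SCHEMA)
```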

Delivers: task-id-pending_model_artifacts_ws1_schema_spec.md (one-shot,
schema + migration + backfill).

WS2: Code linkage — commit-level provenance, subtree layout, CI round-trip

Decide and enforce where training code lives. Options considered:
(a) per-artifact subtree in the main repo under artifacts/models/{model_id}/
— simple, no submodule surgery, CI-tested, but couples model code to the
main repo release cycle;
(b) a sibling repo SciDEX-models with one directory per model family —
cleaner separation, but doubles the release surface and breaks cross-repo
atomic commits;
(c) artifact-registry-native (store code as artifact_type='code' children
of the model artifact) — most discoverable, but requires code execution
infra we do not yet have.

The quest picks (a) for internal models — artifacts/models/{model_id}/
with train.py, eval.py, params.json, README.md — because it round-trips
through our existing CI and bwrap sandbox without new infra. External models
get no subtree; their origin_url is sufficient. Add a CI check that every
internal model artifact either has a populated subtree or carries a code_commit_sha pointing at a resolvable commit elsewhere; models failing
the check are flagged quality_status='provenance_missing'.
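A sketch of what that CI check could do for a single model row, assuming GitHub-hosted repos and the row fields named above; the API endpoint choice, token handling, and flagging mechanics are illustrative, not the final implementation:

```python
import os
import re
from pathlib import Path

import requests

SHA_RE = re.compile(r"^[0-9a-f]{40}$")

def provenance_error(model: dict, repo_root: Path) -> str | None:
    """Return an error string if an internal model fails the check, else None.

    `model` is assumed to be an artifacts row as a dict carrying id,
    origin_type, code_repo_url, code_commit_sha.
    """
    if model["origin_type"] != "internal":
        return None  # external models are pinned via origin_url instead
    subtree = repo_root / "artifacts" / "models" / model["id"]
    if subtree.is_dir() and any(subtree.iterdir()):
        return None  # a populated subtree satisfies the check
    sha = model.get("code_commit_sha") or ""
    repo_url = model.get("code_repo_url") or ""
    if not SHA_RE.match(sha):
        return f"{model['id']}: no subtree and no 40-char code_commit_sha"
    # Resolve the SHA against the declared repo via the GitHub commits API.
    headers = {"Accept": "application/vnd.github+json"}
    if token := os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"Bearer {token}"
    owner_repo = repo_url.removeprefix("https://github.com/").rstrip("/")
    resp = requests.get(
        f"https://api.github.com/repos/{owner_repo}/commits/{sha}",
        headers=headers, timeout=30,
    )
    if resp.status_code != 200:
        return f"{model['id']}: {sha[:8]} does not resolve at {repo_url}"
    return None  # commit resolvable: provenance intact
```

During the grace period described in the risk table, a non-None result would flag quality_status='provenance_missing' and open a fix-up task rather than hard-fail the build.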

Delivers: task-id-pending_model_artifacts_ws2_code_linkage_spec.md
(one-shot, lay out subtree for 8 existing models + enforce CI check).

WS3: Training pipeline — agent-invoked runs that register a new version

Wire the GPU sandbox (pilot delivered by quest_competitive_biotools_spec.md WS4 / task-id-pending_gpu_sandbox_pilot_spec.md)
into the model artifact system so a training run produces a properly-linked
new version with zero manual registration. The pipeline:

  • Agent calls gpu_launch(model_family, parent_artifact_id, training_config,
dataset_artifact_id).
  • Launcher resolves parent_artifact_id → loads parent weights + config
(either from the subtree for internal or from origin_url for external).
  • Runs training inside the bwrap sandbox with VRAM + wall-time caps and a
cost-ledger debit (already enforced by the WS4 pilot).
  • On completion, register_model_version(parent_artifact_id, run_manifest)
writes a new artifacts row with version_number = parent.version_number + 1,
parent_version_id = parent.id, origin_type='internal',
code_commit_sha = <current HEAD>, and a new model_versions row with
the eval metrics + GPU allocation reference (see the sketch after this list).
  • The new version lands with is_latest=0, lifecycle_state='candidate',
pending the eval gate (WS4).
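A minimal sketch of the registration step, assuming the WS1 tables above and a run_manifest dict carrying eval_metrics / trained_by / gpu_allocation_id keys (the key names and helper internals are illustrative; only the row shape is fixed by the pipeline text):

```python
import json
import sqlite3
import subprocess
import uuid

def register_model_version(db: sqlite3.Connection,
                           parent_artifact_id: str,
                           run_manifest: dict) -> str:
    """Write the child artifacts row + model_versions row for a finished run."""
    (parent_version,) = db.execute(
        "SELECT version_number FROM artifacts WHERE id = ?",
        (parent_artifact_id,),
    ).fetchone()
    # Pin the training code to the commit checked out for the run.
    head_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    child_id = f"model-{uuid.uuid4()}"
    db.execute(
        """INSERT INTO artifacts
               (id, artifact_type, origin_type, version_number,
                parent_version_id, is_latest, lifecycle_state)
           VALUES (?, 'model', 'internal', ?, ?, 0, 'candidate')""",
        (child_id, parent_version + 1, parent_artifact_id),
    )
    db.execute(
        """INSERT INTO model_versions
               (artifact_id, training_invocation, eval_metrics,
                code_commit_sha, trained_by, gpu_allocation_id)
           VALUES (?, ?, ?, ?, ?, ?)""",
        (child_id,
         json.dumps(run_manifest),
         json.dumps(run_manifest.get("eval_metrics", {})),
         head_sha,
         run_manifest.get("trained_by"),
         run_manifest.get("gpu_allocation_id")),
    )
    db.commit()
    return child_id
```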

Delivers: task-id-pending_model_artifacts_ws3_training_pipeline_spec.md
(one-shot; depends on the WS4 pilot having landed).

WS4: Iterative improvement — eval gate + promotion policy

Enforce that a candidate version does not become "latest" until it clears an
eval gate. Gate: run the benchmark suite declared in metadata.benchmark_id
(or the parent's benchmark if the child inherits) on a held-out test split;
compute the primary metric delta vs parent; require either (a) delta ≥ 0
with statistical significance (bootstrap 1000× over the test set, 95% CI
excludes 0), or (b) delta < 0 with an explicit tradeoff_justification in
the changelog (e.g. "parameter count halved, accuracy -0.8%, latency -4×").
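The significance half of the gate is a standard paired bootstrap. A sketch, assuming per-example metric arrays for parent and child scored on the same held-out split (the array inputs are an assumption of this sketch):

```python
import numpy as np

def eval_gate_delta(parent_scores: np.ndarray,
                    child_scores: np.ndarray,
                    n_boot: int = 1000,
                    seed: int = 0) -> tuple[float, float, float]:
    """Paired bootstrap over the test set, 1000 resamples as specified above.

    Returns (point delta, 2.5th pct, 97.5th pct); the seed is deterministic
    per (model_id, version) in the real policy.
    """
    rng = np.random.default_rng(seed)
    n = len(parent_scores)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        deltas[b] = child_scores[idx].mean() - parent_scores[idx].mean()
    point = child_scores.mean() - parent_scores.mean()
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return point, lo, hi
```

Under this sketch, path (a) passes when the point delta is ≥ 0 and the lower CI bound is above 0; anything else needs the documented tradeoff path (b).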

If the gate fails without a justification, the candidate stays in
lifecycle_state='candidate'; agents cannot retry-promote without re-running
eval on a new seed and documenting why. Promotion triggers a
world_model_improvements event of type model_version_promoted so the
economics pipeline pays out.

Delivers: task-id-pending_model_artifacts_ws4_eval_gate_spec.md
(one-shot, promotion policy + event emitter).

WS5: Feedback into the world model — attributed edges + confidence propagation

Model outputs that enter the KG must inherit the model's provenance. When a
model produces edges (e.g. scGPT annotates 2340 cells as "microglia" → 2340
(cell, is_a, microglia) edges), each edge carries
source_artifact_id=<model_id>, source_version=<version_number>, and a
confidence derived from the model's softmax or calibration curve. If the
model version is later superseded, edges are not deleted; a recomputation
job re-scores them against the new version and records deltas in
world_model_improvements with event_type='model_rescore'. An agent or
human can always answer "which model version said this?" from a KG edge.
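A sketch of the edge shape this implies, with attribution fields taken from the paragraph above; the calls format and the use of raw softmax as confidence are assumptions (the real pipeline may substitute a calibrated score):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributedEdge:
    subject: str             # e.g. a cell ID
    predicate: str           # e.g. "is_a"
    obj: str                 # e.g. "microglia"
    confidence: float        # softmax prob or calibrated score
    source_artifact_id: str  # the model artifact that produced the edge
    source_version: int      # version_number at emission time

def annotation_edges(model_id: str, version: int,
                     calls: list[tuple[str, str, float]]) -> list[AttributedEdge]:
    """Turn (cell_id, label, prob) calls into attributed is_a edges."""
    return [
        AttributedEdge(cell, "is_a", label, prob, model_id, version)
        for (cell, label, prob) in calls
    ]
```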

This workstream also defines how model attribution reaches the economics
layer: the agent that trained the model, the agent that registered the
dataset, the agent that authored the benchmark, and the agent that ran the
eval each get a slice of the edge-derived payouts through PageRank
backprop (per project_economics_v2_credit_backprop_2026-04-10).

Delivers: task-id-pending_model_artifacts_ws5_feedback_loop_spec.md
(recurring every-24h, plus a one-shot backfill for existing model-derived
edges that lack attribution).

Success criteria

☐ WS1: model_versions table created, schemas/model_artifact_metadata.json
authored, artifact_catalog.register_model() validates against it,
8 existing model artifacts migrated (100% pass schema).
☐ WS2: All 8 existing internal model artifacts either have a populated
artifacts/models/{model_id}/ subtree or a resolvable code_commit_sha.
CI check lands and blocks PRs that register a model without provenance.
Zero models flagged quality_status='provenance_missing'.
☐ WS3: One new model version registered end-to-end via the training
pipeline (reusing the WS4 GPU sandbox pilot); model_versions.trained_by
populated with the agent ID; parent→child linkage visible on the
detail page.
☐ WS4: Eval gate runs on ≥1 candidate version; promotion or rejection
recorded with a world_model_improvements row; detail page shows
metric delta vs parent.
☐ WS5: ≥10 KG edges with source_artifact_id pointing at a model
version; one recomputation job runs after a promotion and records a
model_rescore event with the delta distribution. Credit backprop
hits ≥3 contributing agent wallets.
☐ Detail page UI: external-vs-internal badge, code link (commit SHA,
clickable to GitHub), version dropdown showing all siblings, metric
delta table vs parent, training-run link (GPU allocation ID), KG
edge count attributed to this version.
☐ Registration of ≥10 new model artifacts in the 60 days after landing,
each passing the schema validator without manual fixup. (Measures
ergonomic fit of the new flow.)

Quality requirements

  • Reference quest_quality_standards_spec.md. No stub models. A model
artifact must carry a populated evaluation_metrics block, an
evaluation_dataset that resolves to a real dataset artifact, and a
training_data citation (dataset artifact ID or upstream DOI). Models
lacking any of these land as quality_status='incomplete' and do not
appear on discovery pages.
  • Parallel-agent execution is mandatory for the ≥10 model-registration
success-criteria item: 3–5 concurrent sub-agents each handling a disjoint
slice of model families.
  • No duplicate provenance. A commit SHA appearing in code_commit_sha must
resolve against the declared code_repo_url; the CI check asserts this.
  • Every model version registration fires @log_tool_call via the
register_model_version helper, so the economics layer sees it.
  • Reference quest_competitive_biotools_spec.md WS4 for the GPU-sandbox
contract. Do not re-implement sandbox policy.
  • Reference quest_real_data_pipeline_spec.md for dataset citation format.

UI requirements

The artifact detail page for artifact_type='model' renders, in order:

  • Header row — title, origin_type badge (colored: blue=external,
green=internal, gray=fine-tune-of-external), version pill
(v{version_number} — {version_tag or 'untagged'}),
lifecycle_state badge (candidate / active / superseded).
  • Provenance block — if internal: "Trained from commit
[abc123de](github-link) on YYYY-MM-DD by agent agent-xyz using GPU
allocation ga-…"; if external: "External checkpoint from
[huggingface.co/...](hf-link), revision rev123, registered YYYY-MM-DD";
if fine-tune-of-external: both lines stacked.
  • Version dropdown — all siblings (same root parent), latest-first;
each entry shows v{n}, version_tag, primary metric value, promotion
state. Clicking swaps the page context.
  • Metric delta vs parent — a table of benchmark metrics, side-by-side
parent vs this version, delta highlighted green / red / gray per the
promotion policy. "Tradeoff" rows show the justification text.
  • Training-run link — if internal, a link to the GPU allocation row
(VRAM used, wall-time, cost debited) and a collapsed log panel.
  • Downstream impact — count of KG edges attributed to this version,
count of analyses citing it, count of debates it was invoked in. Link
to a filtered KG view.
  • Code — embedded file tree of artifacts/models/{model_id}/ with
train.py / eval.py / params.json / README.md viewable inline.
  • Raw metadata JSON (collapsed) for auditing.
  • A "register new version" call-to-action appears if the viewer has
the model.train capability and the model carries a training subtree.

Risks + mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Backfill misclassifies fine-tunes-of-external as purely internal or purely external, losing the base model link | Medium | Medium | WS1 migration inspects metadata._origin and metadata.base_model_id; any ambiguity is held for human review rather than guessed. |
| CI provenance check blocks legitimate registrations from agents that did not populate the subtree | Medium | Medium | Check emits a warning + auto-opens a fix-up task for the first N days; only becomes a hard block after the backfill clears. |
| Eval gate (WS4) rejects candidate versions for statistical-significance reasons the agent cannot resolve | Medium | Low | Bootstrap seeds are deterministic per (model_id, version); agents can request a re-eval task that re-runs with a different dataset split; rejections are not permanent. |
| Schema validator rejects legitimate model families we have not anticipated | Low | Medium | model_family is a free-form string with a soft-recommendation list, not an enum; the validator only rejects on missing required fields, not unknown values. |
| Credit backprop over-pays agents on short-lived model versions that get immediately superseded | Medium | Low | Dividend events carry event_type='model_version_promoted'; the economics layer already dampens payouts for same-agent rapid re-registration (demurrage). |
| External-model registrations with broken origin_url (HF link moves, repo deleted) | Medium | Low | Weekly link-checker task flags broken origin_urls; registrations without a resolvable pin at write time are rejected. |
| Feedback loop (WS5) rescoring creates KG churn that floods Atlas | Low | Medium | Rescore job is rate-limited to one model per 24h and writes deltas, not edge replacements. Agents see the diff, not a rewritten KG. |

Related quests

  • quest_competitive_biotools_spec.md — WS4 delivers the GPU sandbox this
quest's WS3 depends on. This quest does not re-implement the sandbox.
  • quest_artifacts_spec.md — base artifact infrastructure (lifecycle,
versioning skeleton) this quest extends for the model subtype.
  • quest_artifact_viewers_spec.md — viewer framework the UI requirements
extend with a model-specific panel set.
  • artifact_enrichment_quest_spec.md — enrichment pipeline that will be
invoked for new model versions to populate downstream-impact counts.
  • quest_real_data_pipeline_spec.md — dataset-citation standard every
model's training_data and evaluation_dataset must conform to.
  • quest_quality_standards_spec.md — anti-stub bar, parallel-agent rule,
no-busywork clause.
  • quest_schema_governance_spec.md — the schemas/model_artifact_metadata.json
file lives under that quest's review gate.
  • project_economics_v2_credit_backprop_2026-04-10 — WS5's credit
propagation composes with this quest's dividend events.

Work Log

_No entries yet._
