[Artifacts] Model artifact type: biophysical, deep learning, statistical models (done)

Define model artifact metadata schema supporting multiple model families: biophysical (equations, parameters, species, reactions), deep_learning (architecture, framework, layer_config, training_metrics, checkpoint_path), statistical (model_type, features, coefficients, fit_metrics). Models link to datasets (trained_on), hypotheses (tests/supports), analyses (produced_by), and other models (fine_tuned_from). Register via register_model(model_family, title, metadata, trained_on_dataset_id). Models are versioned -- each training run or parameter update creates a new version. Depends on: a17-18-VERS0001, a17-19-TYPE0001.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Artifacts] Add model artifact metadata validation and auto-linking support [task:a17-23-MODL0001] 2026-04-25
Spec File

[Artifacts] Model artifact type: biophysical, deep learning, statistical models

Goal

Models are central to the scientific process — they encode our understanding of mechanisms, make predictions, and can be tested against data. SciDEX needs a model artifact type that captures the diversity of scientific models while maintaining enough structure for comparison, versioning, and provenance tracking.

Model Families & Metadata Schemas

Biophysical Models

Mechanistic models based on physical/chemical principles.

{
  "model_family": "biophysical",
  "equations": ["dA/dt = k_prod - k_clear * A - k_phago * M * A"],
  "species": ["amyloid_beta", "microglia", "cytokines"],
  "parameters": {
    "k_prod": {"value": 0.1, "units": "uM/hr", "source": "PMID:12345"},
    "k_clear": {"value": 0.05, "units": "1/hr", "source": "fitted"},
    "k_phago": {"value": 0.02, "units": "1/(cells*hr)", "source": "PMID:67890"}
  },
  "reactions": ["production", "clearance", "phagocytosis"],
  "steady_states": {"amyloid_beta": 2.0, "microglia": 100},
  "solver": "scipy.integrate.solve_ivp",
  "solver_config": {"method": "RK45", "t_span": [0, 100]}
}
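As a rough illustration of how the `equations`, `parameters`, and `solver_config` fields could drive a simulation, the sketch below integrates the single equation above. It is not the registry's solver: a fixed-step RK4 loop stands in for `scipy.integrate.solve_ivp` to stay dependency-free, the microglia count is assumed constant because no equation for it is given, and the initial value `A0` is hypothetical.

```python
# Illustrative only: a dependency-free RK4 integrator standing in for
# scipy.integrate.solve_ivp(method="RK45") named in the metadata.
metadata = {
    "parameters": {
        "k_prod": {"value": 0.1},
        "k_clear": {"value": 0.05},
        "k_phago": {"value": 0.02},
    },
    "solver_config": {"t_span": [0, 100]},
}

MICROGLIA = 100.0  # assumption: M held constant (no dM/dt is given)
A0 = 0.0           # assumption: hypothetical initial amyloid_beta level


def dA_dt(a, params, m=MICROGLIA):
    """Right-hand side of dA/dt = k_prod - k_clear*A - k_phago*M*A."""
    k = {name: spec["value"] for name, spec in params.items()}
    return k["k_prod"] - k["k_clear"] * a - k["k_phago"] * m * a


def integrate_rk4(a0, params, t_span, n_steps=2000):
    """Classic fixed-step fourth-order Runge-Kutta over t_span."""
    t0, t1 = t_span
    h = (t1 - t0) / n_steps
    a = a0
    for _ in range(n_steps):
        k1 = dA_dt(a, params)
        k2 = dA_dt(a + 0.5 * h * k1, params)
        k3 = dA_dt(a + 0.5 * h * k2, params)
        k4 = dA_dt(a + h * k3, params)
        a += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return a


params = metadata["parameters"]
a_final = integrate_rk4(A0, params, metadata["solver_config"]["t_span"])
# Analytic fixed point when M is constant: k_prod / (k_clear + k_phago * M)
a_fixed = 0.1 / (0.05 + 0.02 * MICROGLIA)
print(f"A(t=100) = {a_final:.4f}, fixed point = {a_fixed:.4f}")
```

With microglia fixed at 100, the fixed point works out to about 0.049 uM, so the `steady_states` entries in the example above are best read as placeholder values rather than outputs of these parameters.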

Deep Learning Models

Neural network models for prediction or classification.

{
  "model_family": "deep_learning",
  "framework": "pytorch",
  "architecture": "transformer",
  "layer_config": {"n_layers": 6, "hidden_dim": 512, "n_heads": 8},
  "parameter_count": 25000000,
  "training_config": {
    "optimizer": "adamw",
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 100,
    "early_stopping_patience": 10
  },
  "training_metrics": {
    "final_loss": 0.023,
    "best_val_loss": 0.031,
    "training_time_hours": 2.5
  },
  "checkpoint_path": "/models/checkpoint_epoch_87.pt",
  "input_schema": "gene expression matrix (genes x samples)",
  "output_schema": "cell type probabilities"
}
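One possible sanity check on this metadata is to compare `parameter_count` against a rough estimate derived from `layer_config`. The sketch below uses the common approximation of 12·d² weights per transformer block, which assumes a feed-forward width of 4× `hidden_dim` (not stated in the schema) and ignores biases, layer norms, embeddings, and output heads. The function name is hypothetical.

```python
layer_config = {"n_layers": 6, "hidden_dim": 512, "n_heads": 8}
declared_parameter_count = 25_000_000


def estimate_block_params(n_layers, hidden_dim, ffn_multiplier=4):
    """Approximate weight count of the transformer blocks only.

    Per block: 4*d^2 for the Q, K, V and output projections plus
    2*ffn_multiplier*d^2 for the two feed-forward matrices. Biases,
    layer norms, embeddings and output heads are ignored.
    """
    d = hidden_dim
    attention = 4 * d * d
    feed_forward = 2 * ffn_multiplier * d * d
    return n_layers * (attention + feed_forward)


block_params = estimate_block_params(
    layer_config["n_layers"], layer_config["hidden_dim"]
)
# Whatever is left over would have to come from parts the schema
# does not describe (embeddings, heads, biases, norms).
remainder = declared_parameter_count - block_params
print(block_params, remainder)
```

Under these assumptions the blocks account for roughly 18.9M of the declared 25M parameters. `n_heads` does not enter the estimate because splitting the projections into heads does not change their total size.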

Statistical Models

Traditional statistical/ML models.

{
  "model_family": "statistical",
  "model_type": "logistic_regression",
  "framework": "sklearn",
  "features": ["gene_expression", "age", "sex", "apoe_genotype"],
  "target": "disease_status",
  "coefficients": {"gene_expression": 0.45, "age": 0.02},
  "fit_metrics": {
    "auc_roc": 0.87,
    "accuracy": 0.82,
    "f1_score": 0.79,
    "n_samples": 500,
    "cross_validation": "5-fold"
  }
}
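To show how stored `coefficients` might be consumed downstream, here is a minimal scoring sketch. The example metadata lists coefficients for only two of the four features and no intercept, so the sketch assumes an intercept of 0 and lets features without a coefficient contribute nothing. The sample row and the function name are made up for illustration.

```python
import math

metadata = {
    "features": ["gene_expression", "age", "sex", "apoe_genotype"],
    "coefficients": {"gene_expression": 0.45, "age": 0.02},
}


def predict_probability(row, coefficients, intercept=0.0):
    """Logistic-regression score: sigmoid(intercept + sum(coef * value)).

    Features without a stored coefficient contribute nothing.
    """
    z = intercept + sum(coef * row[name] for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical sample; the values are invented for illustration.
sample = {"gene_expression": 1.2, "age": 70, "sex": 1, "apoe_genotype": 2}
p = predict_probability(sample, metadata["coefficients"])

# Features declared in the schema but lacking a coefficient.
missing = [f for f in metadata["features"] if f not in metadata["coefficients"]]
print(f"p(disease) = {p:.3f}; features without coefficients: {missing}")
```

Listing the features that lack coefficients is one way a validator could flag partially specified statistical models.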

Implementation

register_model(model_family, title, metadata, trained_on_dataset_id=None)

def register_model(model_family, title, metadata,
                   trained_on_dataset_id=None,
                   tests_hypothesis_id=None,
                   produced_by_analysis_id=None):
    """Register a model artifact and auto-link it to related artifacts.

    register_artifact() and create_link() are the existing artifact
    registry helpers. Returns the new artifact's ID.
    """
    # Copy so the caller's dict is not mutated; tolerate metadata=None.
    metadata = {**(metadata or {}), "model_family": model_family}
    artifact_id = register_artifact(
        artifact_type="model",
        title=title,
        metadata=metadata,
        quality_score=0.6  # default, updated after evaluation
    )
    # Auto-create links
    if trained_on_dataset_id:
        create_link(artifact_id, trained_on_dataset_id, "derives_from",
                    evidence=f"Model trained on dataset {trained_on_dataset_id}")
    if tests_hypothesis_id:
        create_link(artifact_id, tests_hypothesis_id, "supports",
                    evidence=f"Model tests hypothesis {tests_hypothesis_id}")
    if produced_by_analysis_id:
        create_link(artifact_id, produced_by_analysis_id, "derives_from",
                    evidence=f"Model produced by analysis {produced_by_analysis_id}")
    return artifact_id
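The auto-linking behaviour can be exercised without the real registry. The sketch below replaces `register_artifact()` and `create_link()` with in-memory stand-ins whose signatures are inferred from the calls above, so they are assumptions rather than the actual helpers. The dataset and hypothesis IDs are hypothetical.

```python
import itertools

# In-memory stand-ins (assumptions) for the registry's real helpers.
ARTIFACTS, LINKS = {}, []
_ids = itertools.count(1)


def register_artifact(artifact_type, title, metadata, quality_score):
    artifact_id = f"{artifact_type}-{next(_ids)}"
    ARTIFACTS[artifact_id] = {
        "title": title,
        "metadata": metadata,
        "quality_score": quality_score,
    }
    return artifact_id


def create_link(source_id, target_id, link_type, evidence=""):
    LINKS.append((source_id, target_id, link_type, evidence))


def register_model(model_family, title, metadata,
                   trained_on_dataset_id=None,
                   tests_hypothesis_id=None,
                   produced_by_analysis_id=None):
    # Copy the metadata so the caller's dict is left untouched.
    metadata = {**(metadata or {}), "model_family": model_family}
    artifact_id = register_artifact(
        artifact_type="model", title=title,
        metadata=metadata, quality_score=0.6,
    )
    if trained_on_dataset_id:
        create_link(artifact_id, trained_on_dataset_id, "derives_from",
                    evidence=f"Model trained on dataset {trained_on_dataset_id}")
    if tests_hypothesis_id:
        create_link(artifact_id, tests_hypothesis_id, "supports",
                    evidence=f"Model tests hypothesis {tests_hypothesis_id}")
    if produced_by_analysis_id:
        create_link(artifact_id, produced_by_analysis_id, "derives_from",
                    evidence=f"Model produced by analysis {produced_by_analysis_id}")
    return artifact_id


model_id = register_model(
    "statistical",
    "Logistic regression for disease status",
    {"model_type": "logistic_regression"},
    trained_on_dataset_id="dataset-42",      # hypothetical ID
    tests_hypothesis_id="hypothesis-7",      # hypothetical ID
)
for source, target, link_type, _ in LINKS:
    print(f"{source} --{link_type}--> {target}")
```

Only the IDs that are actually passed produce links, so a model registered with no related artifacts creates an artifact and nothing else.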

Model Versioning

Each training run or parameter update creates a new version via create_version():
  • Retrained model → new version with updated training_metrics
  • Fine-tuned model → new version with parent_version_id pointing to base model
  • Parameter sweep → multiple versions branching from same parent
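The three cases above can be sketched with a toy `create_version()`. The real function's signature is not shown in this spec, so the version-record shape, the ID scheme, and the shallow-merge rule here are all assumptions. The updated loss and the learning rates are invented values.

```python
import copy
import itertools

_version_ids = itertools.count(1)


def create_version(parent, metadata_updates):
    """Hypothetical sketch: child metadata = parent metadata + updates."""
    merged = copy.deepcopy(parent["metadata"])
    merged.update(metadata_updates)  # shallow merge: nested dicts are replaced whole
    return {
        "id": f"v{next(_version_ids)}",
        "parent_version_id": parent["id"],
        "metadata": merged,
    }


base = {
    "id": "v0",
    "parent_version_id": None,
    "metadata": {
        "model_family": "deep_learning",
        "training_metrics": {"final_loss": 0.023},
    },
}

# Retrained model: same parent, updated training_metrics.
retrained = create_version(base, {"training_metrics": {"final_loss": 0.019}})

# Parameter sweep: several versions branching from the same parent.
sweep = [
    create_version(base, {"training_config": {"learning_rate": lr}})
    for lr in (0.01, 0.001, 0.0001)
]

print(retrained["metadata"]["model_family"])
print([v["parent_version_id"] for v in sweep])
```

Because the child starts from a copy of the parent's metadata, `model_family` carries over unless an update explicitly overrides it, and the parent record is left unchanged.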

Artifact Links for Models

| Link Type | Meaning |
|---|---|
| derives_from → dataset | "trained on this data" |
| derives_from → model | "fine-tuned from this base model" |
| supports → hypothesis | "model predictions support this hypothesis" |
| contradicts → hypothesis | "model predictions contradict this hypothesis" |
| derives_from → analysis | "produced by this analysis run" |

Acceptance Criteria

☐ register_model() creates model artifact with correct metadata
☐ All three model families (biophysical, DL, statistical) supported
☐ Auto-linking to datasets, hypotheses, and analyses
☐ Model metadata validates against family-specific schema
☐ Versioning works: create_version() of a model preserves model_family
☐ GET /api/models endpoint returns models with family/dataset filtering (if time permits)
☐ Work log updated with timestamped entry

Dependencies

  • a17-18-VERS0001 (versioning schema)
  • a17-19-TYPE0001 (model type registration)

Dependents

  • frg-mb-01-ANLX (Forge model-building framework)
  • d16-23-BMOD0001 (demo: biophysical model)

Work Log

2026-04-26 06:20 UTC — Slot minimax:77

Task: Model artifact type: biophysical, deep learning, statistical models

Analysis:

  • Verified task is still relevant (dependencies a17-18-VERS0001 and a17-19-TYPE0001 already completed on main)
  • register_model() exists in artifact_registry.py but was missing:
    - Auto-linking to datasets/hypotheses/analyses
    - Family-specific metadata validation
    - Support for trained_on_dataset_id, tests_hypothesis_id, produced_by_analysis_id
  • Model family rendering already exists in api.py (line 26203) showing family badges + equations/parameters/architecture/metrics
  • create_version() already merges metadata from parent, preserving model_family

Changes made to scidex/atlas/artifact_registry.py:

  • Added MODEL_FAMILIES = {'biophysical', 'deep_learning', 'statistical'} constant (line ~113)
  • Added _validate_model_family(model_family) validation function
  • Added validate_model_metadata(model_family, metadata) for family-specific schema validation:
    - biophysical: requires 'equations'
    - deep_learning: requires 'framework', 'architecture'
    - statistical: requires 'model_type'
  • Enhanced register_model() with:
    - trained_on_dataset_id: creates 'derives_from' link to dataset artifact
    - tests_hypothesis_id: creates 'supports' link to hypothesis artifact
    - produced_by_analysis_id: creates 'derives_from' link to analysis artifact
    - metadata parameter for additional family-specific fields
    - Family-specific validation (non-blocking warnings)
    - Changed default quality_score from 0.7 to 0.6 per spec
    - All link creation errors logged but don't fail registration

Verification:

  • Syntax check passed (python3 -m py_compile)
  • Unit tests for _validate_model_family and validate_model_metadata passed
  • Function signature includes all required parameters
  • create_version() metadata merge preserves model_family from parent

Acceptance criteria status:

☑ register_model() creates model artifact with correct metadata
☑ All three model families (biophysical, DL, statistical) supported
☑ Auto-linking to datasets, hypotheses, and analyses
☑ Model metadata validates against family-specific schema
☑ Versioning works: create_version() of a model preserves model_family (verified via code review)
☐ GET /api/models endpoint (not implemented; not time-critical)
☑ Work log updated
