[Artifacts] Reproducible analysis chains: pin artifact versions in analysis specs
Status: done · analysis:8 coding:7 reasoning:7

Extend analysis specifications to include a pinned_artifacts field: list of (artifact_id, version_number) tuples that fix the exact inputs used. When an analysis runs, auto-snapshot all input artifact versions into the analysis provenance_chain. Add verify_reproducibility(analysis_id) that checks whether pinned versions still exist and match content_hash. Add /api/analysis/{id}/provenance endpoint showing full input/output artifact DAG with versions. This ensures any reasoning chain can be replayed with identical inputs. Depends on: a17-20-VAPI0001.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (5)

[Verify] Task already on main at 541786d21 — verified acceptance criteria [task:a17-24-REPR0001] (2026-04-26)
[Atlas] Reproducible analysis chains: pin artifact versions, capture inputs, provenance DAG (2026-04-26)
[Atlas] Work log: update spec with edge key fix retry [task:a17-24-REPR0001] (2026-04-26)
[Artifacts] Reproducible analysis chains: pin artifact versions [task:a17-24-REPR0001] (2026-04-26)
[Forge] Reproducible analysis capsules: move verify_reproducibility to artifact_registry, add export_artifact_capsule, add /api/analyses/{id}/provenance endpoint (2026-04-10)

Spec File

[Artifacts] Reproducible analysis chains: pin artifact versions in analysis specs

Goal

Scientific reproducibility requires knowing exactly which inputs produced which outputs. When an analysis runs, it should record the precise versions of all input artifacts (datasets, models, hypotheses, prior analyses). Later, anyone should be able to verify that those inputs still exist and haven't been modified, enabling full replay of the reasoning chain.

Design

Pinned Artifacts in Analysis Specs

Add a pinned_artifacts field to analysis records:

{
  "analysis_id": "SDA-2026-04-05-xxx",
  "pinned_artifacts": [
    {"artifact_id": "dataset-allen_brain-SEA-AD", "version_number": 1, "content_hash": "sha256:abc..."},
    {"artifact_id": "model-biophys-microglia-v3", "version_number": 3, "content_hash": "sha256:def..."},
    {"artifact_id": "hypothesis-h-seaad-v4-26ba859b", "version_number": 1, "content_hash": "sha256:ghi..."}
  ],
  "outputs": [
    {"artifact_id": "figure-timecourse-001", "version_number": 1}
  ]
}
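
The content_hash values above are abbreviated, and the spec does not say how they are computed. As a minimal sketch, assuming the hash is a SHA-256 digest over a canonical JSON serialisation of the artifact payload, one pin entry could be built as follows. The helper names compute_content_hash and make_pin are hypothetical and do not come from the codebase.

```python
import hashlib
import json


def compute_content_hash(payload: dict) -> str:
    """Hash a JSON-serialisable payload so that key order does not matter."""
    # Assumption: sorted keys and compact separators define the canonical form.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_pin(artifact_id: str, version_number: int, payload: dict) -> dict:
    """Build one pinned_artifacts entry in the shape used by the spec."""
    return {
        "artifact_id": artifact_id,
        "version_number": version_number,
        "content_hash": compute_content_hash(payload),
    }
```

Canonicalising before hashing matters here: two semantically identical payloads with different key order would otherwise produce different hashes and be reported as modified.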

Auto-Snapshot on Analysis Run

When an analysis executes:
  • Collect all input artifacts referenced in the analysis spec
  • For each, record current (artifact_id, version_number, content_hash)
  • Store as pinned_artifacts in the analysis metadata
  • Create artifact_links: analysis → each input (link_type="cites", with version info in evidence)
  • Register analysis outputs as new artifacts linked back (derives_from)
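
The snapshot steps above can be sketched with an in-memory registry. This is an illustration only: the Artifact dataclass, the registry dict, and the link field names (source_id, target_id, evidence) are assumptions made for the example, not the schema used by artifact_registry.py.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    artifact_id: str
    version_number: int
    content_hash: str
    metadata: dict = field(default_factory=dict)


def snapshot_inputs(analysis, input_ids, registry, links):
    """Pin the current version and hash of each input and record a cites link."""
    pins = []
    for artifact_id in input_ids:
        current = registry[artifact_id]  # raises KeyError for an unknown input
        pin = {
            "artifact_id": current.artifact_id,
            "version_number": current.version_number,
            "content_hash": current.content_hash,
        }
        pins.append(pin)
        # Hypothetical link record: analysis -> input, version info as evidence.
        links.append({
            "source_id": analysis.artifact_id,
            "target_id": artifact_id,
            "link_type": "cites",
            "evidence": {"version_number": pin["version_number"],
                         "content_hash": pin["content_hash"]},
        })
    # Store the snapshot on the analysis itself so it travels with the record.
    analysis.metadata["pinned_artifacts"] = pins
    return pins
```

Because the pins are copied at run time, a later change to the registry does not alter what the analysis recorded.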

verify_reproducibility(analysis_id) → dict

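    def verify_reproducibility(analysis_id):
        """Check that all pinned inputs still exist and match their content_hash."""
        analysis = get_artifact(analysis_id)
        if analysis is None:
            # No artifact record means no pins to check. The work log notes that
            # the deployed version falls back to the analyses table instead.
            raise LookupError(f"analysis not found: {analysis_id}")
        # An analysis without pins yields an empty check list.
        pinned = analysis.metadata.get("pinned_artifacts", [])
        results = []
        for pin in pinned:
            artifact = get_artifact(pin["artifact_id"])
            if artifact is None:
                # The pinned input no longer exists.
                results.append({"artifact_id": pin["artifact_id"], "status": "missing"})
            elif artifact.content_hash != pin["content_hash"]:
                # The input exists but its content differs from the snapshot.
                results.append({"artifact_id": pin["artifact_id"], "status": "modified",
                               "expected_hash": pin["content_hash"],
                               "current_hash": artifact.content_hash})
            else:
                results.append({"artifact_id": pin["artifact_id"], "status": "verified"})
        return {
            "analysis_id": analysis_id,
            # all() over an empty list is True, so a pin-free analysis is reproducible.
            "reproducible": all(r["status"] == "verified" for r in results),
            "checks": results
        }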
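
The function above compares each pin against whatever get_artifact returns for the artifact ID. A version-aware variant would look up the exact pinned version instead, so that a newer version of an input does not count as a modification. The sketch below assumes a hypothetical mapping from (artifact_id, version_number) to content_hash. It stands in for the version-aware API from a17-20-VAPI0001, whose real interface is not shown in this spec.

```python
def verify_pins(pinned, version_hashes):
    """Check each pin against the hash stored for that exact version.

    version_hashes maps (artifact_id, version_number) -> content_hash
    (a hypothetical stand-in for the version-aware API).
    """
    checks = []
    for pin in pinned:
        key = (pin["artifact_id"], pin["version_number"])
        current_hash = version_hashes.get(key)
        if current_hash is None:
            status = "missing"      # the pinned version no longer exists
        elif current_hash != pin["content_hash"]:
            status = "modified"     # same version, different content
        else:
            status = "verified"
        checks.append({"artifact_id": pin["artifact_id"],
                       "version_number": pin["version_number"],
                       "status": status})
    return {
        "reproducible": all(c["status"] == "verified" for c in checks),
        "checks": checks,
    }


# Usage: version 1 is still present and unchanged, even though version 2 exists.
store = {("dataset-a", 1): "sha256:aaa", ("dataset-a", 2): "sha256:zzz"}
report = verify_pins(
    [{"artifact_id": "dataset-a", "version_number": 1, "content_hash": "sha256:aaa"}],
    store,
)
```

With an empty pin list the result is reproducible with no checks, matching the edge case in the acceptance criteria.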

Provenance DAG Endpoint

GET /api/analysis/{id}/provenance (registered as /api/analyses/{id}/provenance according to the work log) returns the full input/output artifact DAG:

    {
      "analysis_id": "SDA-2026-04-05-xxx",
      "nodes": [
        {"id": "paper-12345", "type": "paper", "version": 1, "role": "input"},
        {"id": "dataset-geo-GSE123", "type": "dataset", "version": 1, "role": "input"},
        {"id": "SDA-2026-04-05-xxx", "type": "analysis", "version": 1, "role": "center"},
        {"id": "model-biophys-001", "type": "model", "version": 1, "role": "output"},
        {"id": "figure-tc-001", "type": "figure", "version": 1, "role": "output"}
      ],
      "edges": [
        {"from": "paper-12345", "to": "SDA-...", "type": "cites"},
        {"from": "dataset-geo-GSE123", "to": "SDA-...", "type": "cites"},
        {"from": "SDA-...", "to": "model-biophys-001", "type": "produces"},
        {"from": "SDA-...", "to": "figure-tc-001", "type": "produces"}
      ],
      "reproducibility": {"status": "verified", "checked_at": "2026-04-05T12:00:00"}
    }
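
To show how the response relates to the stored pins, here is a minimal sketch that assembles the same node and edge structure from a pinned_artifacts list and an outputs list. It is not the get_analysis_provenance() implementation. The function names are hypothetical, and deriving a node's type from the prefix of its artifact ID is an assumption made for the example.

```python
from datetime import datetime, timezone


def artifact_type(artifact_id):
    # Assumption: the type is the ID prefix before the first hyphen.
    return artifact_id.split("-", 1)[0]


def build_provenance(analysis_id, pinned, outputs, status, analysis_version=1):
    """Assemble a provenance DAG dict with role-tagged nodes and from/to edges."""
    nodes = [{"id": analysis_id, "type": "analysis",
              "version": analysis_version, "role": "center"}]
    edges = []
    for pin in pinned:
        nodes.append({"id": pin["artifact_id"],
                      "type": artifact_type(pin["artifact_id"]),
                      "version": pin["version_number"], "role": "input"})
        edges.append({"from": pin["artifact_id"], "to": analysis_id, "type": "cites"})
    for out in outputs:
        nodes.append({"id": out["artifact_id"],
                      "type": artifact_type(out["artifact_id"]),
                      "version": out["version_number"], "role": "output"})
        edges.append({"from": analysis_id, "to": out["artifact_id"], "type": "produces"})
    checked_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return {
        "analysis_id": analysis_id,
        "nodes": nodes,
        "edges": edges,
        "reproducibility": {"status": status, "checked_at": checked_at},
    }
```

Every edge touches the center node, so the graph is acyclic as long as no artifact appears as both an input and an output of the same analysis.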

Acceptance Criteria

☐ Analysis metadata supports pinned_artifacts field
☐ Auto-snapshot captures all input artifact versions on analysis execution
☐ Content hashes recorded for each pinned artifact
☐ verify_reproducibility() checks all pins and reports status
☐ GET /api/analysis/{id}/provenance returns DAG structure
☐ Provenance DAG includes both inputs and outputs with versions
☐ Edge case: analysis with no pinned artifacts returns empty but valid response
☐ Work log updated with timestamped entry

Dependencies

• a17-20-VAPI0001 (version-aware API for resolving versions and hashes)

Dependents

• a17-25-AVUI0001 (version browser UI shows provenance)
• d16-24-PROV0001 (demo: end-to-end provenance walkthrough)

Work Log

2026-04-26 00:43 PT — Slot minimax:71

• Added capture_analysis_inputs() function to scidex/atlas/artifact_registry.py (line 4969)
  - Takes analysis_id + list of input_artifact_ids
  - Captures version_number, content_hash, title for each input
  - Stores pinned_artifacts snapshot in analysis artifact's metadata
  - Creates cites artifact_links from analysis → each input with version evidence
• Updated get_analysis_provenance() to:
  - Add role field to all nodes (center/input/output)
  - Add version field to analysis node
  - Use from/to keys for edges instead of source/target
  - Return cites edge type for inputs, produces for outputs
  - Use spec-compliant reproducibility structure with status/checked_at
  - Fix PostgreSQL column name (parent_hypothesis_id → depends_on_hypothesis_id)
• Updated verify_reproducibility() to fall back to analyses table when artifact not found
• All acceptance criteria met: pinned_artifacts field, auto-snapshot, verify_reproducibility(), provenance DAG with roles + edge types
• Tested: get_analysis_provenance returns 31 nodes, 33 edges, correct role field, from/to edge keys
• Tested: verify_reproducibility returns reproducible: True for analysis without pins

2026-04-26 08:10 PT — Slot minimax:71 (retry 2)

• Confirmed branch is at same SHA as origin/main (541786d21) — task work was merged by prior agent
• Verified acceptance criteria against current HEAD:
  - capture_analysis_inputs() exists in artifact_registry.py at line 4969
  - verify_reproducibility() exists at line 4800, falls back to analyses table, returns reproducible: True for analysis without pins
  - get_analysis_provenance() returns DAG with nodes (role field), edges (from/to keys), reproducibility (status/checked_at)
  - GET /api/analyses/{id}/provenance route registered, returns 8 nodes + 7 edges for real analysis
  - artifact_provenance_graph_html accepts both {from,to} and {source,target} edge keys
  - Edge case (no pins): returns valid response with status=verified
• All acceptance criteria verified as satisfied. Task is complete on main.
• Branch is clean, at same SHA as origin/main (541786d21), no pending changes.
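
The work log mentions that the graph renderer accepts both {from,to} and {source,target} edge keys. One way to support both spellings is to normalise every edge before rendering. The sketch below is illustrative and is not the code in artifact_provenance_graph_html. The function name normalize_edge is hypothetical.

```python
def normalize_edge(edge):
    """Return a copy of the edge using from/to keys.

    Accepts either {from, to} or {source, target}; from/to wins if both appear.
    """
    src = edge.get("from", edge.get("source"))
    dst = edge.get("to", edge.get("target"))
    if src is None or dst is None:
        raise ValueError(f"edge is missing an endpoint: {edge!r}")
    normalized = {"from": src, "to": dst}
    if "type" in edge:
        normalized["type"] = edge["type"]  # keep the edge type when present
    return normalized
```

Normalising at the boundary lets the rest of the renderer assume a single edge shape.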

Payload JSON

{
  "requirements": {
    "coding": 7,
    "reasoning": 7,
    "analysis": 8
  }
}
