[Artifacts] Reproducible analysis chains: pin artifact versions in analysis specs
Goal
Scientific reproducibility requires knowing exactly which inputs produced which outputs. When an analysis runs, it should record the precise versions of all input artifacts (datasets, models, hypotheses, prior analyses). Later, anyone should be able to verify that those inputs still exist and haven't been modified, enabling full replay of the reasoning chain.
Design
Pinned Artifacts in Analysis Specs
Add a pinned_artifacts field to analysis records:
{
  "analysis_id": "SDA-2026-04-05-xxx",
  "pinned_artifacts": [
    {"artifact_id": "dataset-allen_brain-SEA-AD", "version_number": 1, "content_hash": "sha256:abc..."},
    {"artifact_id": "model-biophys-microglia-v3", "version_number": 3, "content_hash": "sha256:def..."},
    {"artifact_id": "hypothesis-h-seaad-v4-26ba859b", "version_number": 1, "content_hash": "sha256:ghi..."}
  ],
  "outputs": [
    {"artifact_id": "figure-timecourse-001", "version_number": 1}
  ]
}
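As a minimal sketch of how one pinned_artifacts entry might be produced, the snippet below hashes an artifact's raw bytes and formats the digest with the sha256: prefix used in the record above. The helper names content_hash() and make_pin() are hypothetical, and hashing the raw bytes is an assumption; the spec does not say which serialization the real content_hash covers.

```python
import hashlib
import json


def content_hash(payload: bytes) -> str:
    """Return a 'sha256:<hex>' digest string for raw artifact bytes.

    Assumption: the hash covers the artifact's raw bytes.
    """
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def make_pin(artifact_id: str, version_number: int, payload: bytes) -> dict:
    """Build one pinned_artifacts entry (hypothetical helper)."""
    return {
        "artifact_id": artifact_id,
        "version_number": version_number,
        "content_hash": content_hash(payload),
    }


# Illustrative payload only; a real artifact would supply its stored bytes.
pin = make_pin("dataset-allen_brain-SEA-AD", 1, b"example bytes")
print(json.dumps(pin, indent=2))
```

Keeping the algorithm name in the stored string leaves room to change hash functions later without making old pins ambiguous.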
Auto-Snapshot on Analysis Run
When an analysis executes:
1. Collect all input artifacts referenced in the analysis spec
2. For each, record current (artifact_id, version_number, content_hash)
3. Store as pinned_artifacts in the analysis metadata
4. Create artifact_links: analysis → each input (link_type="cites", with version info in evidence)
5. Register analysis outputs as new artifacts linked back (derives_from)
verify_reproducibility(analysis_id) → dict
def verify_reproducibility(analysis_id):
    """Check that all pinned inputs still exist and match their content_hash."""
    analysis = get_artifact(analysis_id)
    pinned = analysis.metadata.get("pinned_artifacts", [])
    results = []
    for pin in pinned:
        artifact = get_artifact(pin["artifact_id"])
        if artifact is None:
            # The pinned input no longer exists in the registry.
            results.append({"artifact_id": pin["artifact_id"], "status": "missing"})
        elif artifact.content_hash != pin["content_hash"]:
            # The input exists but its content changed after the analysis ran.
            results.append({"artifact_id": pin["artifact_id"], "status": "modified",
                            "expected_hash": pin["content_hash"],
                            "current_hash": artifact.content_hash})
        else:
            results.append({"artifact_id": pin["artifact_id"], "status": "verified"})
    return {
        "analysis_id": analysis_id,
        # all() over an empty list is True, so an analysis with no pins
        # is reported as reproducible.
        "reproducible": all(r["status"] == "verified" for r in results),
        "checks": results
    }
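To illustrate how the auto-snapshot steps and the check above fit together, here is a self-contained sketch that runs against an in-memory registry. The Artifact dataclass, the REGISTRY dict, sha256_tag() and snapshot_inputs() are hypothetical stand-ins, not the real artifact store or its API; only verify_reproducibility() follows the function given in this spec.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Artifact:
    artifact_id: str
    version_number: int
    content_hash: str
    metadata: dict = field(default_factory=dict)


# Hypothetical in-memory registry standing in for the real artifact store.
REGISTRY = {}


def sha256_tag(payload: bytes) -> str:
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def get_artifact(artifact_id):
    return REGISTRY.get(artifact_id)


def snapshot_inputs(analysis_id, input_ids):
    """Pin (artifact_id, version_number, content_hash) for each input."""
    pins = []
    for artifact_id in input_ids:
        artifact = REGISTRY[artifact_id]
        pins.append({
            "artifact_id": artifact.artifact_id,
            "version_number": artifact.version_number,
            "content_hash": artifact.content_hash,
        })
    REGISTRY[analysis_id].metadata["pinned_artifacts"] = pins
    return pins


def verify_reproducibility(analysis_id):
    """Same logic as the spec's function, using the stand-in registry."""
    analysis = get_artifact(analysis_id)
    results = []
    for pin in analysis.metadata.get("pinned_artifacts", []):
        artifact = get_artifact(pin["artifact_id"])
        if artifact is None:
            status = {"status": "missing"}
        elif artifact.content_hash != pin["content_hash"]:
            status = {"status": "modified",
                      "expected_hash": pin["content_hash"],
                      "current_hash": artifact.content_hash}
        else:
            status = {"status": "verified"}
        results.append({"artifact_id": pin["artifact_id"], **status})
    return {"analysis_id": analysis_id,
            "reproducible": all(r["status"] == "verified" for r in results),
            "checks": results}


REGISTRY["ds-1"] = Artifact("ds-1", 1, sha256_tag(b"raw data v1"))
REGISTRY["an-1"] = Artifact("an-1", 1, sha256_tag(b"analysis spec"))
snapshot_inputs("an-1", ["ds-1"])

before = verify_reproducibility("an-1")
# Simulate someone changing the dataset in place after the analysis ran.
REGISTRY["ds-1"].content_hash = sha256_tag(b"raw data v2")
after = verify_reproducibility("an-1")

print(before["reproducible"], after["checks"][0]["status"])  # True modified
```

The sketch omits the artifact_links and output registration steps, since those depend on the real registry schema.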
Provenance DAG Endpoint
GET /api/analysis/{id}/provenance returns the full input/output artifact DAG:
{
  "analysis_id": "SDA-2026-04-05-xxx",
  "nodes": [
    {"id": "paper-12345", "type": "paper", "version": 1, "role": "input"},
    {"id": "dataset-geo-GSE123", "type": "dataset", "version": 1, "role": "input"},
    {"id": "SDA-2026-04-05-xxx", "type": "analysis", "version": 1, "role": "center"},
    {"id": "model-biophys-001", "type": "model", "version": 1, "role": "output"},
    {"id": "figure-tc-001", "type": "figure", "version": 1, "role": "output"}
  ],
  "edges": [
    {"from": "paper-12345", "to": "SDA-...", "type": "cites"},
    {"from": "dataset-geo-GSE123", "to": "SDA-...", "type": "cites"},
    {"from": "SDA-...", "to": "model-biophys-001", "type": "produces"},
    {"from": "SDA-...", "to": "figure-tc-001", "type": "produces"}
  ],
  "reproducibility": {"status": "verified", "checked_at": "2026-04-05T12:00:00"}
}
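One way the nodes and edges of such a response could be assembled from an analysis record is sketched below. build_provenance() is a hypothetical helper, not the actual get_analysis_provenance() implementation. It assumes the node type can be read from the artifact_id prefix, that the analysis node's version defaults to 1, and it leaves out the reproducibility block.

```python
import json


def build_provenance(analysis: dict) -> dict:
    """Assemble a provenance DAG from pinned inputs and declared outputs."""
    center = analysis["analysis_id"]
    # Assumption: the analysis version lives in "version_number", default 1.
    nodes = [{"id": center, "type": "analysis",
              "version": analysis.get("version_number", 1), "role": "center"}]
    edges = []

    def node_type(artifact_id: str) -> str:
        # Assumption: ids look like "<type>-<rest>", e.g. "dataset-geo-GSE123".
        return artifact_id.split("-", 1)[0]

    for pin in analysis.get("pinned_artifacts", []):
        nodes.append({"id": pin["artifact_id"],
                      "type": node_type(pin["artifact_id"]),
                      "version": pin["version_number"], "role": "input"})
        edges.append({"from": pin["artifact_id"], "to": center, "type": "cites"})

    for out in analysis.get("outputs", []):
        nodes.append({"id": out["artifact_id"],
                      "type": node_type(out["artifact_id"]),
                      "version": out["version_number"], "role": "output"})
        edges.append({"from": center, "to": out["artifact_id"], "type": "produces"})

    return {"analysis_id": center, "nodes": nodes, "edges": edges}


example = {
    "analysis_id": "SDA-example",
    "pinned_artifacts": [
        {"artifact_id": "dataset-geo-GSE123", "version_number": 1,
         "content_hash": "sha256:..."},
    ],
    "outputs": [{"artifact_id": "figure-tc-001", "version_number": 1}],
}
prov = build_provenance(example)
print(json.dumps(prov, indent=2))
```

An analysis with no pins and no outputs yields a single center node and an empty edge list, which matches the "empty but valid response" edge case in the acceptance criteria.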
Acceptance Criteria
☐ Analysis metadata supports pinned_artifacts field
☐ Auto-snapshot captures all input artifact versions on analysis execution
☐ Content hashes recorded for each pinned artifact
☐ verify_reproducibility() checks all pins and reports status
☐ GET /api/analysis/{id}/provenance returns DAG structure
☐ Provenance DAG includes both inputs and outputs with versions
☐ Edge case: analysis with no pinned artifacts returns empty but valid response
☐ Work log updated with timestamped entry
Dependencies
- a17-20-VAPI0001 (version-aware API for resolving versions and hashes)
Dependents
- a17-25-AVUI0001 (version browser UI shows provenance)
- d16-24-PROV0001 (demo: end-to-end provenance walkthrough)
Work Log
2026-04-26 00:43 PT — Slot minimax:71
- Added capture_analysis_inputs() function to scidex/atlas/artifact_registry.py (line 4969)
  - Takes analysis_id + list of input_artifact_ids
  - Captures version_number, content_hash, title for each input
  - Stores pinned_artifacts snapshot in analysis artifact's metadata
  - Creates cites artifact_links from analysis → each input with version evidence
- Updated get_analysis_provenance() to:
  - Add role field to all nodes (center/input/output)
  - Add version field to analysis node
  - Use from/to keys for edges instead of source/target
  - Return cites edge type for inputs, produces for outputs
  - Use spec-compliant reproducibility structure with status/checked_at
  - Fixed PostgreSQL column name (parent_hypothesis_id → depends_on_hypothesis_id)
- Updated verify_reproducibility() to fall back to analyses table when artifact not found
- All acceptance criteria met: pinned_artifacts field, auto-snapshot, verify_reproducibility(), provenance DAG with roles + edge types
- Tested: get_analysis_provenance returns 31 nodes, 33 edges, correct role field, from/to edge keys
- Tested: verify_reproducibility returns reproducible: True for analysis without pins
2026-04-26 08:10 PT — Slot minimax:71 (retry 2)
- Confirmed branch is at same SHA as origin/main (541786d21) — task work was merged by prior agent
- Verified acceptance criteria against current HEAD:
  - capture_analysis_inputs() exists in artifact_registry.py at line 4969
  - verify_reproducibility() exists at line 4800, falls back to analyses table, returns reproducible: True for analysis without pins
  - get_analysis_provenance() returns DAG with nodes (role field), edges (from/to keys), reproducibility (status/checked_at)
  - GET /api/analyses/{id}/provenance route registered, returns 8 nodes + 7 edges for real analysis
  - artifact_provenance_graph_html accepts both {from,to} and {source,target} edge keys
  - Edge case (no pins): returns valid response with status=verified
- All acceptance criteria verified as satisfied. Task is complete on main.
- Branch is clean, at same SHA as origin/main (541786d21), no pending changes.