Move every SciDEX artifact into a folder-per-artifact storage layout
where the folder is named by the artifact's existing id. Friendly
filenames live inside the folder; multi-file artifacts (notebook +
HTML + figures + manifest) co-locate naturally. DB-only artifacts
(hypotheses, claims, KG entities) get folders too — their content is
the canonical manifest.json so the filesystem becomes a uniform
provenance surface in git.

Today the storage layout mixes UUID-named files (notebooks),
slug-named files (figures), session-id folders (analyses), and flat
CSVs (datasets). That works for read paths but blocks: (a) artifacts
that span multiple files; (b) provenance graphs (parent → derived);
(c) DB-only artifacts being version-controlled at all; (d) idempotent
migration tooling. Folder-per-artifact solves all four.
> ## Continuous-process anchor
>
> This is a bounded migration quest, not a continuous process — it
> has a clear "done" state. Apply the design principles in
> docs/design/retired_scripts_patterns.md for the backfill driver
> (gap-predicate, idempotent, version-stamped, observable), but the
> migration itself is a phased rollout, not a steady-state job.
>
> The eventual steady-state — every newly-committed artifact gets a
> folder automatically — is a side effect of Phase 1 changes, not a
> separate continuous process.

Final design after iterating with the user:

- artifacts.id (existing column): preserved verbatim. Mixed format (figure-abc123, MITOCHONDRIAL_DYSFUNCTION, sess_SDA-2026-04-16-..., etc.). Never changes. The legacy handle.
- artifacts.artifact_id (NEW column): UUID type, UNIQUE,
- artifacts.artifact_type (existing column): kept. Carries type
  artifact_id itself can be a bare UUID without a type prefix.

Conversation thread that led here:
> R1: "we should use uuid rather than guid (minor difference, but if
> needed for naming); just id is probably also fine within the db,
> docs, externally, etc."
>
> R2: "the id itself should be a uuid/guid — i was just commenting on
> the column name"
>
> R3: "uuid is more of a standard than guid"
>
> R4: "if there is already an id column that is NOT a uuid, it should
> be preserved. if there are conflicts we can create a new column
> called artifact_id. if there is not already an id column then we
> can use it..."
>
> R5: "we should support backwards compat with old paths. yes having
> an artifact type makes sense"

The artifacts table already has id (mixed format), which is the
"conflict" R4 references, so the new column is named artifact_id
(UUID) per R4's guidance.

- artifact_id (UUIDv4). Every artifact, new or
- data/scidex-artifacts/<uuid>/. Clean,
- /artifact/<id> for backwards compatibility
- /artifact/<artifact_id> (UUID) also resolves to
- id and artifact_id so consumers

Tables created by this and sibling specs (experiment_claims,
replication_attempts, artifact_migration_log, etc.) FK to
artifacts.id (the legacy handle) for migration safety — every
artifact has an id from day one, so there's no chicken-and-egg
problem. After Phase 2 backfill is complete and artifact_id is
populated for every row, future tables may FK to artifacts.artifact_id
instead.

### register_artifact() (Phase 1)

Today scidex.atlas.artifact_registry.register_artifact() mints ids
in {type}-{uuid} format on the id column. Phase 1 changes:

- id continues to receive {type}-{uuid} format for backwards
- artifact_id receives a fresh uuid4() for every new artifact.
- artifact_id for every existing row.

quest_artifact_metadata_semantic_spec.md and quest_artifact_reuse_provenance_qc_spec.md want to attach new

    data/scidex-artifacts/
      <artifact_id>/                      # one directory per artifact
        manifest.json                     # canonical metadata snapshot (mirrors DB row)
        <friendly_name>.<ext>             # primary file (notebook.ipynb, figure.png, dataset.csv, ...)
        accessories/
          <friendly_name>.html            # rendered notebook
          <friendly_name>.schema.json     # dataset schema
          preview.png                     # thumbnail
          summary.md                      # rendered summary
        inputs/                           # symlinks to upstream artifacts' folders
          <input-artifact_id> -> ../../<input-artifact_id>
        outputs/                          # symlinks to derivative artifacts
          <output-artifact_id> -> ../../<output-artifact_id>

Rationale:

- ls inputs/ and ls outputs/
- accessories/ is opt-in; small artifacts (single figure) skip it.
- manifest.json is a denormalized snapshot for offline analysis,
- vasodilator_response_AD.ipynb not the raw id).

### artifacts

    ALTER TABLE artifacts ADD COLUMN artifact_id UUID;              -- canonical UUID handle (new code path)
    ALTER TABLE artifacts ADD COLUMN artifact_dir TEXT;             -- 'data/scidex-artifacts/<artifact_id>'
    ALTER TABLE artifacts ADD COLUMN primary_filename TEXT;         -- 'vasodilator_response_AD.ipynb'
    ALTER TABLE artifacts ADD COLUMN accessory_filenames TEXT[];    -- ['vasodilator_response_AD.html']
    ALTER TABLE artifacts ADD COLUMN folder_layout_version INT DEFAULT 1;
    ALTER TABLE artifacts ADD COLUMN migrated_to_folder_at TIMESTAMPTZ;

    -- UNIQUE on artifact_id is enforced via partial index (allows NULL during
    -- backfill, prevents duplicates among populated rows).
    CREATE UNIQUE INDEX idx_artifacts_artifact_id_unique
      ON artifacts(artifact_id) WHERE artifact_id IS NOT NULL;
    CREATE INDEX idx_artifacts_migrated ON artifacts(migrated_to_folder_at)
      WHERE migrated_to_folder_at IS NOT NULL;

After Phase 2 backfill reaches 99%+ coverage: optionally
ALTER TABLE artifacts ALTER COLUMN artifact_id SET NOT NULL. Until
then, artifact_id is nullable so the schema applies cleanly to a
populated table.
artifact_dir is always derivable as
'data/scidex-artifacts/' || artifact_id::text but stored explicitly
to support future relocation (S3, etc.).
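
A minimal sketch of how the path helpers this spec names (artifact_dir, artifact_primary_path, artifact_accessory_path) could derive locations from artifact_id. The ARTIFACT_ROOT constant and the UUID normalisation step are assumptions made for illustration, not the actual scidex.core.paths implementation.

```python
from pathlib import Path
from uuid import UUID

# Assumption: paths are relative to the repository checkout.
ARTIFACT_ROOT = Path("data/scidex-artifacts")


def artifact_dir(artifact_id) -> Path:
    """Folder for one artifact; mirrors 'data/scidex-artifacts/' || artifact_id::text."""
    # Round-trip through UUID so the folder name is the lowercase hyphenated
    # form and non-UUID input raises ValueError instead of creating a stray folder.
    return ARTIFACT_ROOT / str(UUID(str(artifact_id)))


def artifact_primary_path(artifact_id, filename: str) -> Path:
    """Primary file sits directly inside the artifact folder."""
    return artifact_dir(artifact_id) / filename


def artifact_accessory_path(artifact_id, filename: str) -> Path:
    """Accessory files sit under accessories/."""
    return artifact_dir(artifact_id) / "accessories" / filename
```

Because the stored artifact_dir column exists to allow relocation, a real helper would probably prefer the stored value when one is present and fall back to this derivation otherwise.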
- _<artifact_type>_<short-id-hash>_n suffix
- primary_filename; once set, never changes

/artifact/<id> stays canonical (already deployed). Type-prefixed
routes can be added as redirects without breaking anything:

- /figure/<id> → 301 /artifact/<id> (read pretty URL → canonical)
- /notebook/<id> → 301 /artifact/<id>
- /dataset/<id> → 301 /artifact/<id>
- /model/<id> → 301 /artifact/<id>

/artifact/id makes great sense."

Each phase has a hard exit gate. Don't proceed to Phase N+1 until
Phase N acceptance is fully met and verified with a 24h soak.
Migration: migrations/20260428_artifact_folder_columns.sql

- artifact_dir, primary_filename, accessory_filenames, folder_layout_version, migrated_to_folder_at (all nullable)
- migrated_to_folder_at
- artifact_migration_log for backfill audit
- ALTER TABLE ... DROP COLUMN works
- \d artifacts shows new columns

Status: scaffolded in companion PR. Phase 0 also lands path
helpers (artifact_dir(artifact_id), etc.) and a manifest.json
writer module (scidex.atlas.artifact_manifest).
Code changes:

- scidex/atlas/artifact_commit.py:
  - artifact_id parameter (writers MUST supply; no auto-generation
    register_artifact())
  - friendly_name parameter (default None → derive from first path)
  - data/scidex-artifacts/<id>/ with accessories/ for non-primary
  - manifest.json from DB row before commit
  - migrated_to_folder_at = now() on the row
- scidex/core/paths.py:
  - artifact_dir(artifact_id) → returns Path
  - artifact_primary_path(artifact_id, filename) → returns Path
  - artifact_accessory_path(artifact_id, filename) → returns Path
  - FIGURE_DIR, NOTEBOOK_DIR, etc.: emit DeprecationWarning
- scidex/atlas/artifact_registry.py:
  - register_artifact() populates artifact_dir, primary_filename
  - resolve_artifact() accepts id; returns row including folder fields
  - get_capsule_manifest() reads from <id>/manifest.json if present,
- api.py artifact write paths:
  - commit_artifact(artifact_id=..., paths=...)
  - # TODO(folder-migration phase 5): delete comments

Tests:

- <id>/ → manifest.json present → URL works

Acceptance:

- migrated_to_folder_at IS NOT NULL
- artifact_commit_failed events; no rate change

Driver: scripts/artifact_folder_backfill.py (new, recurring)

Gap predicate:

    SELECT id, artifact_type, content_hash, metadata
    FROM artifacts
    WHERE migrated_to_folder_at IS NULL
    ORDER BY created_at DESC
    LIMIT 100;

Per-artifact algorithm (idempotent, atomic):

- metadata.figure_path, metadata.notebook_path, metadata.dataset_path, etc.
- provenance_chain if path is in there
- data/scidex-artifacts/{type}s/ for files matching
- artifact_dir
- manifest.json to the folder
- mkdir -p data/scidex-artifacts/<id>/
- git mv files into <id>/ folder (preserves history, atomic in Git)
- <id>/<friendly_name>.<ext>
- <id>/accessories/<accessory_friendly>.<ext>
- content_hash if set
- <id>/manifest.json from the DB row
- inputs/ and outputs/ based on artifact_links
- artifact_dir, primary_filename, accessory_filenames, migrated_to_folder_at
- artifact_migration_log: (artifact_id, status, files_moved, errors, took_ms)

Atomicity: each artifact is one Git commit
(git mv + manifest write) so it's reverted as a unit if any step fails.
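
A rough sketch of the idempotent move step for one artifact, under stated assumptions: it uses shutil.move as a stand-in for git mv, the function name migrate_one and the "error" status value are hypothetical, and it covers neither the manifest write nor the per-artifact commit. The returned dict mirrors the artifact_migration_log fields (artifact_id, status, files_moved, errors, took_ms).

```python
import shutil
import time
from pathlib import Path


def migrate_one(root: Path, artifact_id: str, primary: Path, accessories=()) -> dict:
    """Move one artifact's files into <root>/<artifact_id>/ and return a log record."""
    started = time.monotonic()
    folder = root / artifact_id
    moved, errors = [], []

    # (source, destination) pairs: primary at the top level, the rest under accessories/.
    plan = [(primary, folder / primary.name)]
    plan += [(src, folder / "accessories" / src.name) for src in accessories]

    for src, dst in plan:
        if dst.exists():
            continue  # already moved by an earlier run: idempotent skip
        if not src.exists():
            errors.append(f"missing source: {src}")
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))  # stand-in for `git mv`
        moved.append(dst.relative_to(root).as_posix())

    return {
        "artifact_id": artifact_id,
        "status": "success" if not errors else "error",  # "error" is an assumed value
        "files_moved": moved,
        "errors": errors,
        "took_ms": int((time.monotonic() - started) * 1000),
    }
```

Running it a second time on the same artifact moves nothing and still reports success, which is the property the recurring driver relies on after a crash.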
Bounded batch: 100 artifacts/cycle, every-30-min recurring. At
11,667 artifacts and 100/cycle/30min → ~60h total wall time, ~3 days.
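
The wall-time estimate can be checked with a few lines of arithmetic, using only the figures stated above (the constant names are illustrative):

```python
import math

TOTAL_ARTIFACTS = 11_667  # artifact count from this spec
BATCH_SIZE = 100          # artifacts per cycle
CYCLE_MINUTES = 30        # one cycle every 30 minutes

cycles = math.ceil(TOTAL_ARTIFACTS / BATCH_SIZE)
hours = cycles * CYCLE_MINUTES / 60
days = hours / 24

print(cycles, hours, round(days, 2))  # prints: 117 58.5 2.44
```

So the "~60h, ~3 days" figure is a slight round-up of 117 cycles (58.5 hours), and it assumes every cycle processes a full batch without retries.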
Cold-start QC backfill (per user direction): now is the time to
also run QC on existing artifacts. Coordinate with
quest_artifact_reuse_provenance_qc_spec.md so QC + folder migration
happen together in this drained-fleet window.
Failure modes:

Acceptance:

- migrated_to_folder_at IS NOT NULL within 7 days
- artifact_migration_log: ≥99% status='success' for artifacts with files
- find data/scidex-artifacts/ -type f | wc -l
- content_hash values verified post-move (or flagged)

Rollback: each cycle is one commit; git revert to undo.
Order (topological by dependency):

- scidex/atlas/artifact_registry.resolve_artifact(): read folder fields
- api.py /api/artifacts/{id} routes: return folder fields
- <id>/<filename>.html first, fall back to legacy
- {{artifact:ID}} embed: resolve via id; folder for content

Each switch is its own small PR with a feature flag
(ARTIFACT_FOLDER_READERS_ENABLED=true). Roll forward gradually.
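
One way a reader could apply the "folder first, fall back to legacy" rule behind the flag is sketched below. The function names, the env parameter, and the exact candidate location (<id>/<filename>.html) are assumptions for illustration; if rendered HTML ends up under accessories/, the candidate path would change accordingly.

```python
import os
from pathlib import Path


def folder_readers_enabled(env=os.environ) -> bool:
    """True only when ARTIFACT_FOLDER_READERS_ENABLED is set to 'true'."""
    return env.get("ARTIFACT_FOLDER_READERS_ENABLED", "").lower() == "true"


def resolve_rendered_html(artifact_dir: Path, filename: str, legacy_path: Path,
                          env=os.environ) -> Path:
    """Prefer the folder copy when the flag is on and the file exists; else legacy."""
    candidate = artifact_dir / f"{filename}.html"
    if folder_readers_enabled(env) and candidate.is_file():
        return candidate
    return legacy_path
```

Checking for the file as well as the flag means a reader can be switched on before the backfill finishes without breaking artifacts that have not been migrated yet.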
Add type-prefixed redirects:

- /figure/<id> → 301 /artifact/<id>
- /notebook/<id> → 301
- /dataset/<id> → 301
- /model/<id> → 301
- /artifact/<id> stays canonical.
- <link rel="canonical"> header on every artifact page.

Acceptance:

- /artifact/<id>

This is the cleanup phase; it runs only after a 30+ day soak post-Phase 3/4.

- figures/, notebooks/, etc.) emptied
- paths.py legacy constants (FIGURE_DIR, etc.) deleted
- commit_artifact legacy code paths deleted
- FIGURE_DIR etc.
- data/scidex-artifacts/ listing shows only <id>/ directories

data/scidex-artifacts/ and data/scidex-papers/ are git submodules.
Migration commits land in the submodule first, then the outer repo
updates the submodule pointer. Use git submodule update --remote in
the backfill driver before each batch. Push the submodule before the outer repo.
Some artifacts (paper figures with extracted SVG + PNG + LaTeX) might
have 5+ files. Layout supports this: primary file + N accessories.
Folder structure stays flat (no nested subdirs except accessories/,
inputs/, outputs/).
DB-only artifacts (hypotheses, claims, KG entities) have no on-disk files today. They still get a folder containing
only manifest.json. Per user direction, this brings them into git
version control — the manifest is the canonical exported representation.
Folder size is ~1KB; trivial.
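
A minimal sketch of writing a manifest-only folder for a DB-only artifact. The write_manifest name and the choice of sorted keys with a fixed indent are assumptions (the serialization format is not settled by this spec); the point shown is that a deterministic serialization keeps git diffs limited to real changes.

```python
import json
from pathlib import Path


def write_manifest(folder: Path, row: dict) -> Path:
    """Create the artifact folder if needed and write manifest.json from a DB row."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "manifest.json"
    # sort_keys plus a fixed indent make the output independent of dict ordering;
    # default=str covers values such as timestamps or UUIDs that json cannot encode.
    text = json.dumps(row, sort_keys=True, indent=2, ensure_ascii=False, default=str)
    path.write_text(text + "\n", encoding="utf-8")
    return path
```

Two rows with the same content but different key order produce byte-identical files, so re-running the writer does not create spurious commits.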
Phase 2 driver inserts folder fields for existing rows. Phase 1 ensures
new rows get folder fields. Race: a new artifact created at the same
moment the driver picks up an old row → no conflict, different rows.
The driver acquires a per-artifact advisory lock before mutating:

    SELECT pg_try_advisory_xact_lock(hashtext('artifact_folder_migration:' || id))

If the driver crashes, the next cycle picks up the same row (idempotent: it checks whether the
folder is already populated and skips if so).
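
From Python, the lock attempt might look like the sketch below. It assumes a DB-API cursor with the %s parameter style (as in psycopg) and an already-open transaction; try_lock_artifact is a hypothetical helper name. The lock key string is passed as a parameter, not concatenated in SQL.

```python
LOCK_SQL = "SELECT pg_try_advisory_xact_lock(hashtext(%s))"


def try_lock_artifact(cursor, artifact_id: str) -> bool:
    """Return True if this transaction now holds the migration lock for the row."""
    cursor.execute(LOCK_SQL, (f"artifact_folder_migration:{artifact_id}",))
    (acquired,) = cursor.fetchone()
    return bool(acquired)
```

Because the lock is transaction-scoped, it is released on commit or rollback, so a crashed driver cannot leave a row locked. A False result means another worker holds the row and the driver should skip it for this cycle.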
Phase 2 writes manifests + creates folders → ~50KB overhead per artifact
on average → 11,667 × 50KB = 580MB additional. Within current submodule
size budget (1.1GB → 1.7GB).
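
The storage estimate works out as follows, assuming decimal units (1 MB = 1000 KB) and the 50KB average stated above; the constant names are illustrative:

```python
ARTIFACTS = 11_667
OVERHEAD_KB = 50   # average per-artifact overhead from this spec
CURRENT_GB = 1.1   # current submodule size

added_mb = ARTIFACTS * OVERHEAD_KB / 1000
projected_gb = CURRENT_GB + added_mb / 1000

print(f"{added_mb:.0f} MB added, {projected_gb:.2f} GB projected")
# prints: 583 MB added, 1.68 GB projected
```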

    -- Phase 0
    ALTER TABLE artifacts ADD COLUMN artifact_dir TEXT;
    ALTER TABLE artifacts ADD COLUMN primary_filename TEXT;
    ALTER TABLE artifacts ADD COLUMN accessory_filenames TEXT[];
    ALTER TABLE artifacts ADD COLUMN folder_layout_version INT DEFAULT 1;
    ALTER TABLE artifacts ADD COLUMN migrated_to_folder_at TIMESTAMPTZ;

    CREATE TABLE artifact_migration_log (
      id BIGSERIAL PRIMARY KEY,
      artifact_id TEXT NOT NULL,
      status TEXT NOT NULL,
      files_moved JSONB,
      errors JSONB,
      took_ms INT,
      created_at TIMESTAMPTZ DEFAULT NOW()
    );

- GET /api/artifacts/<id>/folder: list files in artifact folder
- GET /api/artifacts/<id>/manifest: raw manifest.json
- GET /api/artifacts/<id>/inputs: list of input artifact ids
- GET /api/artifacts/<id>/outputs: list of artifacts derived from this
- POST /api/artifacts/<id>/accessory: add accessory file (auth required)

- scidex.core.database (PG access)
- scidex.atlas.artifact_commit (extension point in Phase 1)
- scidex.atlas.artifact_registry (extension point in Phase 1, 3)
- data/scidex-artifacts/ git submodule (write target in Phase 2)

- quest_artifact_metadata_semantic_spec.md: keys summary embedding on id
- quest_artifact_reuse_provenance_qc_spec.md: keys parent/derived on id
- quest_experiment_execution_participant_spec.md: output artifacts use folder layout
- quest_paper_replication_starter_spec.md: replication artifacts use folder layout

Initial design proposed a separate guid UUID column. User pushed
back: "we should use uuid rather than guid (minor difference, but if
needed for naming); just id is probably also fine." Going with the
simplest path — use existing artifacts.id as the canonical handle,
no new column. Folder name = literal id value.
Phase 0 PR (#1222) reflects the simplified design: 5 nullable columns
plus a migration log table; no guid column. Spec content updated in this PR.
User-confirmed design choices:

- /artifact/<id> stays canonical

Open items (track here as work begins):

- scidex.ai/artifact/<id>)? Audit
- manifest.json be canonicalized JSON (sorted keys, no

{
"requirements": {
"reasoning": 7,
"coding": 8,
"safety": 8
}
}