[Atlas] CI: Drive artifact folder migration backfill open coding:8 reasoning:7 safety:8

Phase 2 of artifact folder migration. Each cycle: 100 artifacts WHERE artifact_id IS NULL OR migrated_to_folder_at IS NULL. Generate uuid4() for artifact_id, mkdir /, git mv files, write manifest.json, set provenance symlinks. Idempotent via DB advisory lock. Append to artifact_migration_log. Stop when COUNT(WHERE artifact_id IS NULL)=0. ~3 days for 11,667 artifacts at 100/cycle every-30-min.

Git Commits (4)

[Atlas] UUID migration Phase 1b: commit_artifact_to_folder helper (#1227) 2026-04-28
[Atlas] UUID migration Phase 1b: commit_artifact_to_folder helper 2026-04-28
[Atlas] UUID migration Phase 1: register_artifact() populates artifact_id (#1226) 2026-04-28
[Atlas] UUID migration Phase 1: register_artifact() populates artifact_id 2026-04-28
Spec File

Goal

Move every SciDEX artifact into a folder-per-artifact storage layout
where the folder is named by the artifact's UUID (artifact_id). Friendly
filenames live inside the folder; multi-file artifacts (notebook +
HTML + figures + manifest) co-locate naturally. DB-only artifacts
(hypotheses, claims, KG entities) get folders too — their content is
the canonical manifest.json so the filesystem becomes a uniform
provenance surface in git.

Today the storage layout mixes UUID-named files (notebooks),
slug-named files (figures), session-id folders (analyses), and flat
CSVs (datasets). That works for read paths but blocks: (a) artifacts
that span multiple files; (b) provenance graphs (parent → derived);
(c) DB-only artifacts being version-controlled at all; (d) idempotent
migration tooling. Folder-per-artifact solves all four.

> ## Continuous-process anchor
>
> This is a bounded migration quest, not a continuous process — it
> has a clear "done" state. Apply the design principles in
> docs/design/retired_scripts_patterns.md for the backfill driver
> (gap-predicate, idempotent, version-stamped, observable), but the
> migration itself is a phased rollout, not a steady-state job.
>
> The eventual steady-state — every newly-committed artifact gets a
> folder automatically — is a side effect of Phase 1 changes, not a
> separate continuous process.

Naming decision

Final design after iterating with the user:

  • artifacts.id (existing column): preserved verbatim. Mixed
format (figure-abc123, MITOCHONDRIAL_DYSFUNCTION,
sess_SDA-2026-04-16-..., etc.). Never changes. The legacy handle.
  • artifacts.artifact_id (NEW column): UUID type, UNIQUE,
populated for every artifact (new ones at write time, existing ones
in Phase 2 backfill). The canonical clean handle for new code.
  • artifacts.artifact_type (existing column): kept. Carries type
info so artifact_id itself can be a bare UUID without a type prefix.

Conversation thread that led here:

> R1: "we should use uuid rather than guid (minor difference, but if
> needed for naming); just id is probably also fine within the db,
> docs, externally, etc."
>
> R2: "the id itself should be a uuid/guid — i was just commenting on
> the column name"
>
> R3: "uuid is more of a standard than guid"
>
> R4: "if there is already an id column that is NOT a uuid, it should
> be preserved. if there are conflicts we can create a new column
> called artifact_id. if there is not already an id column then we
> can use it..."
>
> R5: "we should support backwards compat with old paths. yes having
> an artifact type makes sense"

The artifacts table already has id (mixed format) — that's the
"conflict" R4 references — so the new column is named artifact_id
(UUID) per R4's guidance.

Folder + URL semantics

  • Folder name = artifact_id (UUIDv4). Every artifact, new or
backfilled, lives at data/scidex-artifacts/<uuid>/. Clean,
uniform. No mixed-format folder names.
  • Canonical URL stays /artifact/<id> for backwards compatibility
(R5). Existing public links keep working forever.
  • New URL alias /artifact/<artifact_id> (UUID) also resolves to
the same record. Both lookups go to the same renderer.
  • API responses include both id and artifact_id so consumers
can pick.

FK choices in new tables

Tables created by this and sibling specs (experiment_claims, replication_attempts, artifact_migration_log, etc.) FK to artifacts.id (the legacy handle) for migration safety — every
artifact has an id from day one, so there's no chicken-and-egg
problem. After Phase 2 backfill is complete and artifact_id is
populated for every row, future tables may FK to artifacts.artifact_id
instead.

Implications for register_artifact() (Phase 1)

Today scidex.atlas.artifact_registry.register_artifact() mints ids
in {type}-{uuid} format on the id column. Phase 1 changes:

  • id continues to receive {type}-{uuid} format for backwards
compatibility with existing URL patterns (no need to change every
consumer at once).
  • artifact_id receives a fresh uuid4() for every new artifact.

Phase 2 backfill mints artifact_id for every existing row.
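
A minimal sketch of the Phase 1 id minting described above. The function name mint_ids is hypothetical, and the sketch assumes the two UUIDs are generated independently (the spec only says artifact_id receives a "fresh" uuid4()); the real register_artifact() may differ.

```python
import uuid


def mint_ids(artifact_type: str) -> tuple[str, uuid.UUID]:
    """Return (legacy id, artifact_id) for a newly registered artifact.

    Hypothetical helper: the legacy id keeps the {type}-{uuid} format,
    while artifact_id is a bare UUIDv4 with no type prefix.
    """
    legacy_id = f"{artifact_type}-{uuid.uuid4()}"  # goes into artifacts.id
    artifact_id = uuid.uuid4()                     # goes into artifacts.artifact_id
    return legacy_id, artifact_id
```

Because artifact_type is stored in its own column, nothing has to be parsed back out of artifact_id.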

Why now

  • 11,667 files across 4+ naming conventions; impossible to reason about
"what files belong to artifact X" without DB joins
  • quest_artifact_metadata_semantic_spec.md and
quest_artifact_reuse_provenance_qc_spec.md want to attach new
schema (summary, embedding, parent_artifact_id, qc_status); doing
the migration first avoids two schema churns
  • Replication and experiment-execution quests will produce
multi-file artifacts (notebook + data dumps + figures + manifest)
that benefit from folder-per-artifact natively
  • DB-only artifacts (hypotheses, claims) gain version-controlled
representation when they get a folder containing manifest.json —
user-confirmed: "we should consider moving some artifact more onto
git versioned files than relying on just the db"

Folder layout

data/scidex-artifacts/
  <artifact_id>/                   # one directory per artifact
    manifest.json                  # canonical metadata snapshot (mirrors DB row)
    <friendly_name>.<ext>          # primary file (notebook.ipynb, figure.png, dataset.csv, ...)
    accessories/
      <friendly_name>.html         # rendered notebook
      <friendly_name>.schema.json  # dataset schema
      preview.png                  # thumbnail
      summary.md                   # rendered summary
    inputs/                        # symlinks to upstream artifacts' folders
      <input-artifact_id> -> ../../<input-artifact_id>
    outputs/                       # symlinks to derivative artifacts
      <output-artifact_id> -> ../../<output-artifact_id>

Rationale:

  • Symlink-based provenance lets ls inputs/ and ls outputs/
answer "what did this derive from / what derives from this" without
DB joins. Symlinks survive in the submodule because Git tracks them.
  • accessories/ is opt-in; small artifacts (single figure) skip it.
  • manifest.json is a denormalized snapshot for offline analysis,
rebuildable from the DB at any time. Idempotent generation.
  • Friendly names preserved so a downloaded artifact is still
human-meaningful (vasodilator_response_AD.ipynb not the raw id).
  • DB-only artifacts get folders too — the manifest.json is the
only file; that's how they become git-versioned.
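
The relative symlink convention in the layout above can be sketched as follows. The helper name link_input and the root argument are assumptions for illustration; only the inputs/ directory and the ../../<input-artifact_id> target come from the layout.

```python
import os
from pathlib import Path


def link_input(root: Path, artifact_id: str, input_id: str) -> Path:
    """Create <root>/<artifact_id>/inputs/<input_id> -> ../../<input_id>.

    Hypothetical helper. The target is relative, so the link stays valid
    if the whole artifact root is relocated. Safe to call repeatedly.
    """
    inputs_dir = root / artifact_id / "inputs"
    inputs_dir.mkdir(parents=True, exist_ok=True)
    link = inputs_dir / input_id
    if not link.is_symlink():
        # The target is resolved relative to the directory holding the link.
        os.symlink(Path("..") / ".." / input_id, link, target_is_directory=True)
    return link
```

The outputs/ direction would use the same pattern with the roles of the two artifacts swapped.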

New columns on artifacts

ALTER TABLE artifacts ADD COLUMN artifact_id UUID;           -- canonical UUID handle (new code path)
ALTER TABLE artifacts ADD COLUMN artifact_dir TEXT;          -- 'data/scidex-artifacts/<artifact_id>'
ALTER TABLE artifacts ADD COLUMN primary_filename TEXT;      -- 'vasodilator_response_AD.ipynb'
ALTER TABLE artifacts ADD COLUMN accessory_filenames TEXT[]; -- ['vasodilator_response_AD.html']
ALTER TABLE artifacts ADD COLUMN folder_layout_version INT DEFAULT 1;
ALTER TABLE artifacts ADD COLUMN migrated_to_folder_at TIMESTAMPTZ;

-- UNIQUE on artifact_id is enforced via partial index (allows NULL during
-- backfill, prevents duplicates among populated rows).
CREATE UNIQUE INDEX idx_artifacts_artifact_id_unique
  ON artifacts(artifact_id) WHERE artifact_id IS NOT NULL;

CREATE INDEX idx_artifacts_migrated ON artifacts(migrated_to_folder_at)
  WHERE migrated_to_folder_at IS NOT NULL;

Once Phase 2 backfill has populated artifact_id for every row (SET NOT NULL fails while any NULL remains): optionally ALTER TABLE artifacts ALTER COLUMN artifact_id SET NOT NULL. Until
then, artifact_id is nullable so the schema applies cleanly to a
populated table.

artifact_dir is always derivable as 'data/scidex-artifacts/' || artifact_id::text but stored explicitly
to support future relocation (S3, etc.).

Friendly-name generation

  • Read existing filename → strip extension → slugify
  • Truncate to 60 chars, lowercase, ASCII-only, replace whitespace/punct with _
  • If empty after slugify, fall back to <artifact_type>_<short-id-hash>
  • Conflicts inside a folder resolved with _n suffix
  • Stored in primary_filename; once set, never changes
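
One way the rules above might be implemented, as a sketch only. The function signature, the 8-character SHA-256 prefix used for the short id hash, and the choice to start conflict suffixes at _2 are assumptions; the spec does not pin these down.

```python
import hashlib
import re
import unicodedata
from pathlib import PurePath


def friendly_name(filename: str, artifact_type: str, artifact_id: str,
                  taken: frozenset = frozenset()) -> str:
    """Derive a friendly stem (no extension) from an existing filename."""
    stem = PurePath(filename).stem
    # Drop non-ASCII characters after decomposing accents.
    ascii_stem = (unicodedata.normalize("NFKD", stem)
                  .encode("ascii", "ignore").decode("ascii"))
    # Lowercase, then collapse whitespace/punctuation runs into "_".
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_stem.lower()).strip("_")
    slug = slug[:60].rstrip("_")
    if not slug:
        # Assumed fallback format: <artifact_type>_<short-id-hash>.
        short = hashlib.sha256(artifact_id.encode("utf-8")).hexdigest()[:8]
        slug = f"{artifact_type}_{short}"
    candidate, n = slug, 1
    while candidate in taken:  # resolve conflicts inside the folder
        n += 1
        candidate = f"{slug}_{n}"
    return candidate
```

For example, friendly_name("Vasodilator Response (AD).ipynb", "notebook", "some-id") yields "vasodilator_response_ad"; the caller re-attaches the original extension.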

URL paths

/artifact/<id> stays canonical (already deployed). Type-prefixed
routes can be added as redirects without breaking anything:

  • /figure/<id> → 301 /artifact/<id> (read pretty URL → canonical)
  • /notebook/<id> → 301 /artifact/<id>
  • /dataset/<id> → 301 /artifact/<id>
  • /model/<id> → 301 /artifact/<id>

User-confirmed: "there could be multiple paths to artifact (e.g., by
their type) but /artifact/id makes great sense."
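
The redirect rule can be expressed independently of any web framework. The sketch below is illustrative: resolve_redirect and TYPE_PREFIXES are hypothetical names, and how the result is wired into api.py is not specified here.

```python
from typing import Optional, Tuple

# The four type prefixes listed above.
TYPE_PREFIXES = ("figure", "notebook", "dataset", "model")


def resolve_redirect(path: str) -> Optional[Tuple[int, str]]:
    """Map /<type>/<id> to (301, /artifact/<id>); None for other paths."""
    parts = path.strip("/").split("/", 1)
    if len(parts) == 2 and parts[0] in TYPE_PREFIXES and parts[1]:
        return 301, f"/artifact/{parts[1]}"
    return None
```

Paths already under /artifact/ return None, so the canonical route is never redirected.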

Phased rollout

Each phase has a hard exit gate. Don't proceed to Phase N+1 until
Phase N acceptance is fully met and verified with a 24h soak.

Phase 0 — Schema (1 PR, ~2 hours)

Migration: migrations/20260428_artifact_folder_columns.sql

  • Add artifact_dir, primary_filename, accessory_filenames,
folder_layout_version, migrated_to_folder_at (all nullable)
  • Index on migrated_to_folder_at
  • New table artifact_migration_log for backfill audit
  • No data writes; pure schema change
  • Reversible — ALTER TABLE ... DROP COLUMN works

Acceptance:

☐ Migration applies cleanly to scidex PG
☐ \d artifacts shows new columns
☐ No regression in any existing query (read tests pass)

Status: scaffolded in companion PR. Phase 0 also lands path
helpers (artifact_dir(artifact_id), etc.) and a manifest.json
writer module (scidex.atlas.artifact_manifest).

Phase 1 — New artifacts use folders (1 PR, ~1 day)

Code changes:

  • scidex/atlas/artifact_commit.py:
    - Add artifact_id parameter (writers MUST supply; no auto-generation
      here; the caller already has the id from register_artifact())
    - Add friendly_name parameter (default None → derive from first path)
    - When called with multi-file paths, write all into
      data/scidex-artifacts/<id>/ with accessories/ for non-primary
    - Generate manifest.json from DB row before commit
    - Set migrated_to_folder_at = now() on the row

  • scidex/core/paths.py:
    - artifact_dir(artifact_id) → returns Path
    - artifact_primary_path(artifact_id, filename) → returns Path
    - artifact_accessory_path(artifact_id, filename) → returns Path
    - Keep FIGURE_DIR, NOTEBOOK_DIR, etc. — emit DeprecationWarning

  • scidex/atlas/artifact_registry.py:
    - register_artifact() populates artifact_dir, primary_filename
    - resolve_artifact() accepts id; returns row including folder fields
    - get_capsule_manifest() reads from <id>/manifest.json if present,
      else falls back to legacy path

  • api.py artifact write paths:
    - Switch all writers to the new commit_artifact(artifact_id=..., paths=...)
      signature
    - Old code paths (direct write into FIGURE_DIR, etc.) flagged with
      # TODO(folder-migration phase 5): delete comments
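
The three scidex/core/paths.py helpers listed above could look roughly like this. ARTIFACT_ROOT is a hypothetical constant name, and the sketch assumes paths are relative to the repository root; the deprecation handling for FIGURE_DIR and friends is not shown.

```python
from pathlib import Path

# Assumed constant; the spec gives the location but not a name for it.
ARTIFACT_ROOT = Path("data/scidex-artifacts")


def artifact_dir(artifact_id) -> Path:
    """Folder that holds every file of one artifact."""
    return ARTIFACT_ROOT / str(artifact_id)


def artifact_primary_path(artifact_id, filename: str) -> Path:
    """Primary file, stored directly inside the artifact folder."""
    return artifact_dir(artifact_id) / filename


def artifact_accessory_path(artifact_id, filename: str) -> Path:
    """Accessory file (rendered HTML, schema, preview) under accessories/."""
    return artifact_dir(artifact_id) / "accessories" / filename
```

Accepting any object for artifact_id and calling str() on it lets callers pass either a uuid.UUID or its string form.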

    Tests:

    • New artifact end-to-end: API call → artifact row exists → folder at
    <id>/ → manifest.json present → URL works
    • Mixed write: artifact row created without folder fields (legacy code
    path) → Phase 2 backfill assigns folder
    • Concurrent writes: two artifacts created simultaneously land in
    different folders (no collision; ids are unique)

    Acceptance:

    ☐ Every artifact created after deploy has migrated_to_folder_at IS NOT NULL
    ☐ No regression: old artifact reads still work via legacy paths
    ☐ Deprecation warnings logged but don't error
    ☐ 24h soak: monitor artifact_commit_failed events; no rate change

    Phase 2 — Backfill existing artifacts (1 long-running migration task)

    Driver: scripts/artifact_folder_backfill.py (new, recurring)

    Gap predicate:

    SELECT id, artifact_type, content_hash, metadata
    FROM artifacts
    WHERE migrated_to_folder_at IS NULL
    ORDER BY created_at DESC
    LIMIT 100;

    Per-artifact algorithm (idempotent, atomic):

  • Resolve current file location(s):
    - From metadata.figure_path, metadata.notebook_path,
      metadata.dataset_path, etc.
    - From provenance_chain if path is in there
    - Fallback: scan data/scidex-artifacts/{type}s/ for files matching
      the artifact's slug or content_hash
    - If no files found (DB-only artifact) → mark artifact_dir
      anyway, write only manifest.json to the folder
  • mkdir -p data/scidex-artifacts/<id>/
  • Generate friendly name from first file's basename
  • git mv files into <id>/ folder (preserves history, atomic in Git)
    - Primary file at <id>/<friendly_name>.<ext>
    - Accessories at <id>/accessories/<accessory_friendly>.<ext>
  • Verify SHA256 of moved file matches content_hash if set
  • Generate <id>/manifest.json from the DB row
  • Symlink up inputs/ and outputs/ based on artifact_links
  • Update DB row: artifact_dir, primary_filename,
    accessory_filenames, migrated_to_folder_at
  • Append to artifact_migration_log:
    (artifact_id, status, files_moved, errors, took_ms)

    Atomicity: each artifact is one Git commit
    (git mv + manifest write) so it's reverted as a unit if any step fails.
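
The SHA256 verification step in the algorithm above might be sketched like this. The helper names are hypothetical, and the sketch assumes content_hash is stored as a bare hex digest (the spec does not state its format). The 'success' and 'hash_mismatch' status strings are the ones used in this spec.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large artifacts fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_status(path: Path, content_hash: Optional[str]) -> str:
    """Return the migration-log status for one file."""
    if not content_hash:
        return "success"  # nothing to verify when content_hash is unset
    matches = sha256_of(path) == content_hash.lower()
    return "success" if matches else "hash_mismatch"
```

Running this check against the file at its original location, before any git mv, would make it straightforward to leave mismatching files untouched.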

    Bounded batch: 100 artifacts/cycle, every-30-min recurring. At
    11,667 artifacts and 100/cycle/30min → ~60h total wall time, ~3 days.
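
The wall-time estimate works out as follows, assuming every cycle processes a full batch and none is skipped or retried:

```python
import math

TOTAL_ARTIFACTS = 11_667
BATCH_SIZE = 100
CYCLE_MINUTES = 30

cycles = math.ceil(TOTAL_ARTIFACTS / BATCH_SIZE)  # 117 cycles
hours = cycles * CYCLE_MINUTES / 60               # 58.5 hours
days = hours / 24                                 # about 2.4 days

print(f"{cycles} cycles, {hours:.1f} h, {days:.1f} days")
```

That is 58.5 hours, which the estimate above rounds up to ~60h and ~3 days.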

    Cold-start QC backfill (per user direction): now is the time to
    also run QC on existing artifacts. Coordinate with quest_artifact_reuse_provenance_qc_spec.md so QC + folder migration
    happen together in this drained-fleet window.

    Failure modes:

    Failure                    | Response
    Files not found on disk    | DB-only artifact; write manifest.json only, mark migrated
    Content hash mismatch      | status='hash_mismatch', do NOT move files, log diff
    Git mv fails               | status='error', retry next cycle (idempotent)
    Disk full                  | Halt all writes, alert operator
    Submodule lock contention  | Skip artifact, retry next cycle

    Acceptance:

    ☐ Every existing artifact has migrated_to_folder_at IS NOT NULL within 7 days
    ☐ artifact_migration_log: ≥99% status='success' for artifacts with files
    ☐ No file lost: pre-migration find data/scidex-artifacts/ -type f | wc -l
    ≤ post-migration count
    ☐ All content_hash values verified post-move (or flagged)
    ☐ Submodule grows by ≤ 5% (mostly directory entries + manifests)
    ☐ No regression in artifact-serving latency (p99 < 500ms maintained)

    Rollback: each cycle is one commit; git revert to undo.

    Phase 3 — Switch readers to folder paths (1 PR per consumer)

    Order (topological by dependency):

  • scidex/atlas/artifact_registry.resolve_artifact() — read folder fields
  • api.py /api/artifacts/{id} routes — return folder fields
  • Notebook viewer — read <id>/<filename>.html first, fall back to legacy
  • Figure viewer — same
  • Dataset viewer — same
  • KG → artifact link rendering — reference id in folder layout
  • Wiki {{artifact:ID}} embed — resolve via id; folder for content
  • Search/recommendation surfaces — return folder fields

    Each switch is its own small PR with a feature flag
    (ARTIFACT_FOLDER_READERS_ENABLED=true). Roll forward gradually.
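
The "folder first, fall back to legacy" pattern behind the flag could be sketched as below. Both function names are hypothetical, and the sketch assumes the flag is read from the process environment as the literal string "true".

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def readers_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True when ARTIFACT_FOLDER_READERS_ENABLED is set to 'true'."""
    return env.get("ARTIFACT_FOLDER_READERS_ENABLED", "").lower() == "true"


def resolve_read_path(folder_path: Optional[Path], legacy_path: Path,
                      enabled: bool) -> Path:
    """Prefer the folder-layout file; otherwise serve the legacy path."""
    if enabled and folder_path is not None and folder_path.exists():
        return folder_path
    return legacy_path
```

With the flag off, or for an artifact not yet backfilled, readers behave exactly as before, which keeps each consumer PR safe to roll back.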

    Phase 4 — Type-prefixed URL aliases (1 PR)

    Add type-prefixed redirects:

    • /figure/<id> → 301 /artifact/<id>
    • /notebook/<id> → 301
    • /dataset/<id> → 301
    • /model/<id> → 301

    These are net-new URLs; existing /artifact/<id> stays canonical.
    Add a <link rel="canonical"> element to the <head> of every artifact page.

    Acceptance:

    ☐ New type-prefixed URLs return 301 to /artifact/<id>
    ☐ Canonical link element present on every artifact page
    ☐ Sitemap includes both type-prefixed and canonical URLs

    Phase 5 — Remove legacy paths (1 PR)

    This is the cleanup phase, runs only after 30+ day soak post-Phase 3/4.

    • Old type-grouped folders (figures/, notebooks/, etc.) emptied
    (files already moved by Phase 2); this phase deletes the empty dirs
    • paths.py legacy constants (FIGURE_DIR, etc.) deleted
    • commit_artifact legacy code paths deleted
    • Deprecation warnings removed
    • VACUUM ANALYZE

    Acceptance:

    ☐ All legacy paths removed
    ☐ All readers use folder paths exclusively
    ☐ Test suite has no references to FIGURE_DIR etc.
    ☐ data/scidex-artifacts/ listing shows only <id>/ directories

    Edge cases

    Submodule artifact: special handling

    data/scidex-artifacts/ and data/scidex-papers/ are git submodules.
    Migration commits land in the submodule first, then the outer repo
    updates the submodule pointer. Use git submodule update --remote in
    backfill driver before each batch. Push submodule before outer repo.

    Multi-file artifacts that don't fit the layout

    Some artifacts (paper figures with extracted SVG + PNG + LaTeX) might
    have 5+ files. Layout supports this: primary file + N accessories.
    Folder structure stays flat (no nested subdirs except accessories/, inputs/, outputs/).

    DB-only artifacts (hypotheses, claims, etc.)

    These have no on-disk files today. They still get a folder containing
    only manifest.json. Per user direction, this brings them into git
    version control — the manifest is the canonical exported representation.
    Folder size is ~1KB; trivial.

    Concurrent backfill + new artifact creation

    Phase 2 driver inserts folder fields for existing rows. Phase 1 ensures
    new rows get folder fields. Race: a new artifact created at the same
    moment the driver picks up an old row → no conflict, different rows.

    Backfill driver crashes mid-batch

    Driver acquires per-artifact advisory lock before mutating:

    SELECT pg_try_advisory_xact_lock(hashtext('artifact_folder_migration:' || id))

    If the driver crashes, the next cycle picks up the same row (idempotent:
    it checks whether the folder is already populated and skips if so).
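
A sketch of taking the per-artifact lock from Python. It assumes a DB-API cursor with the %s parameter style (as in psycopg); try_lock_artifact and LOCK_SQL are hypothetical names, and the lock key string mirrors the SQL above.

```python
# Same expression as the SQL above, with the key passed as a bound parameter.
LOCK_SQL = "SELECT pg_try_advisory_xact_lock(hashtext(%s))"


def try_lock_artifact(cursor, legacy_id: str) -> bool:
    """Try to take the advisory lock for one artifact without blocking.

    Returns False when another session already holds the lock, in which
    case the caller should skip the artifact and retry next cycle.
    """
    cursor.execute(LOCK_SQL, ("artifact_folder_migration:" + legacy_id,))
    row = cursor.fetchone()
    return bool(row and row[0])
```

Because this is a transaction-level lock, it is released automatically at commit or rollback, so the per-artifact work has to happen inside the same transaction that acquired it.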

    Storage cost

    Phase 2 writes manifests + creates folders → ~50KB overhead per artifact
    on average → 11,667 × 50KB ≈ 580MB additional. Within current submodule
    size budget (1.1GB → 1.7GB).

    Schema changes summary

    -- Phase 0
    ALTER TABLE artifacts ADD COLUMN artifact_dir TEXT;
    ALTER TABLE artifacts ADD COLUMN primary_filename TEXT;
    ALTER TABLE artifacts ADD COLUMN accessory_filenames TEXT[];
    ALTER TABLE artifacts ADD COLUMN folder_layout_version INT DEFAULT 1;
    ALTER TABLE artifacts ADD COLUMN migrated_to_folder_at TIMESTAMPTZ;
    
    CREATE TABLE artifact_migration_log (
      id BIGSERIAL PRIMARY KEY,
      artifact_id TEXT NOT NULL,
      status TEXT NOT NULL,
      files_moved JSONB,
      errors JSONB,
      took_ms INT,
      created_at TIMESTAMPTZ DEFAULT NOW()
    );

    API additions

    • GET /api/artifacts/<id>/folder — list files in artifact folder
    • GET /api/artifacts/<id>/manifest — raw manifest.json
    • GET /api/artifacts/<id>/inputs — list of input artifact ids
    • GET /api/artifacts/<id>/outputs — list of artifacts derived from this
    • POST /api/artifacts/<id>/accessory — add accessory file (auth required)

    Acceptance criteria (top-level)

    ☐ Phase 0 migration applied, no regressions
    ☐ Phase 1 deployed, every new artifact has folder + manifest
    ☐ Phase 2 driver runs every-30-min, processes 100/cycle, idempotent
    ☐ Phase 2 reaches 99%+ coverage within 7 days (drained fleet
    should let this run uninterrupted)
    ☐ Phase 3 readers switched, ≥95% reads use folder paths
    ☐ Phase 4 type-prefixed URL aliases live
    ☐ Phase 5 cleanup deployed; legacy paths gone
    ☐ No artifact files lost
    ☐ Public URLs stable

    Dependencies

    • scidex.core.database (PG access)
    • scidex.atlas.artifact_commit (extension point in Phase 1)
    • scidex.atlas.artifact_registry (extension point in Phase 1, 3)
    • data/scidex-artifacts/ git submodule (write target in Phase 2)
    • Orchestra recurring task scheduling (Phase 2 driver)

    Dependents

    • quest_artifact_metadata_semantic_spec.md — keys summary embedding on id
    • quest_artifact_reuse_provenance_qc_spec.md — keys parent/derived on id
    • quest_experiment_execution_participant_spec.md — output artifacts use folder layout
    • quest_paper_replication_starter_spec.md — replication artifacts use folder layout

    Work Log

    2026-04-28 — Spec authored, then revised on user feedback

    Initial design proposed a separate guid UUID column. User pushed
    back: "we should use uuid rather than guid (minor difference, but if
    needed for naming); just id is probably also fine." Going with the
    simplest path — use existing artifacts.id as the canonical handle,
    no new column. Folder name = literal id value.

    Phase 0 PR (#1222) reflects the simplified design: 5 nullable columns
    + migration log table; no guid column. Spec content updated in this PR.

    User-confirmed design choices:

    • DB-only artifacts get folders too (move more onto git-versioned files)
    • Multiple URL paths by type are fine; /artifact/<id> stays canonical
    • Higher embedding dim is fine (deferred to metadata spec)
    • Cold-start QC backfill happens together with folder migration during
    the current drained-fleet window

    Open items (track here as work begins):

  • Are there artifacts referenced by external systems (e.g. published
    paper supplementals pointing at scidex.ai/artifact/<id>)? Audit
    external references before any URL semantics change.
  • Should manifest.json be canonicalized JSON (sorted keys, no
    trailing newline) so commit churn is minimized? Recommended yes;
    the helper does this.
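
For the canonicalization question above, a minimal sketch of deterministic manifest serialization. The function name canonical_manifest, the two-space indent, and the str() fallback for UUID and timestamp values are assumptions; the actual helper in scidex.atlas.artifact_manifest may differ.

```python
import json


def canonical_manifest(row: dict) -> str:
    """Serialize a DB row to JSON with a stable byte-for-byte form.

    Sorted keys make the output independent of dict insertion order;
    json.dumps emits no trailing newline.
    """
    return json.dumps(
        row,
        sort_keys=True,
        indent=2,
        ensure_ascii=False,
        default=str,  # assumed fallback for UUID / datetime values
    )
```

Regenerating a manifest from an unchanged row then produces identical bytes, so Git sees no diff and the backfill stays idempotent.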

    Payload JSON
    {
      "requirements": {
        "reasoning": 7,
        "coding": 8,
        "safety": 8
      }
    }
