SciDEX — Task: [Atlas] CI: Drive artifact folder migration backfill

Phase 2 of artifact folder migration. Each cycle: 100 artifacts WHERE artifact_id IS NULL OR migrated_to_folder_at IS NULL. Generate uuid4() for artifact_id, mkdir /, git mv files, write manifest.json, set provenance symlinks. Idempotent via DB advisory lock. Append to artifact_migration_log. Stop when COUNT(WHERE artifact_id IS NULL)=0. ~3 days for 11,667 artifacts at 100/cycle every-30-min.

Completion Notes

Auto-release: recurring task had no work this cycle

Git Commits (4)

[Atlas] UUID migration Phase 1b: commit_artifact_to_folder helper (#1227) 2026-04-28

[Atlas] UUID migration Phase 1b: commit_artifact_to_folder helper 2026-04-28

[Atlas] UUID migration Phase 1: register_artifact() populates artifact_id (#1226) 2026-04-28

[Atlas] UUID migration Phase 1: register_artifact() populates artifact_id 2026-04-28

Spec File

Goal

Move every SciDEX artifact into a folder-per-artifact storage layout
where the folder is named by the artifact's UUID (artifact_id, per the
Naming decision below). Friendly
filenames live inside the folder; multi-file artifacts (notebook +
HTML + figures + manifest) co-locate naturally. DB-only artifacts
(hypotheses, claims, KG entities) get folders too — their content is
the canonical manifest.json so the filesystem becomes a uniform
provenance surface in git.

Today the storage layout mixes UUID-named files (notebooks),
slug-named files (figures), session-id folders (analyses), and flat
CSVs (datasets). That works for read paths but blocks: (a) artifacts
that span multiple files; (b) provenance graphs (parent → derived);
(c) DB-only artifacts being version-controlled at all; (d) idempotent
migration tooling. Folder-per-artifact solves all four.

> ## Continuous-process anchor
>
> This is a bounded migration quest, not a continuous process — it
> has a clear "done" state. Apply the design principles in
> docs/design/retired_scripts_patterns.md for the backfill driver
> (gap-predicate, idempotent, version-stamped, observable), but the
> migration itself is a phased rollout, not a steady-state job.
>
> The eventual steady-state — every newly-committed artifact gets a
> folder automatically — is a side effect of Phase 1 changes, not a
> separate continuous process.

Naming decision

Final design after iterating with the user:

- artifacts.id (existing column): preserved verbatim. Mixed
format (figure-abc123, MITOCHONDRIAL_DYSFUNCTION,
sess_SDA-2026-04-16-..., etc.). Never changes. The legacy handle.
- artifacts.artifact_id (NEW column): UUID type, UNIQUE,
populated for every artifact (new ones at write time, existing ones
in Phase 2 backfill). The canonical clean handle for new code.
- artifacts.artifact_type (existing column): kept. Carries type
info so artifact_id itself can be a bare UUID without a type prefix.

Conversation thread that led here:

> R1: "we should use uuid rather than guid (minor difference, but if
> needed for naming); just id is probably also fine within the db,
> docs, externally, etc."
>
> R2: "the id itself should be a uuid/guid — i was just commenting on
> the column name"
>
> R3: "uuid is more of a standard than guid"
>
> R4: "if there is already an id column that is NOT a uuid, it should
> be preserved. if there are conflicts we can create a new column
> called artifact_id. if there is not already an id column then we
> can use it..."
>
> R5: "we should support backwards compat with old paths. yes having
> an artifact type makes sense"

The artifacts table already has id (mixed format) — that's the
"conflict" R4 references — so the new column is named artifact_id
(UUID) per R4's guidance.

Folder + URL semantics

- Folder name = artifact_id (UUIDv4). Every artifact, new or
backfilled, lives at data/scidex-artifacts/<uuid>/. Clean,
uniform. No mixed-format folder names.
- Canonical URL stays /artifact/<id> for backwards compatibility
(R5). Existing public links keep working forever.
- New URL alias /artifact/<artifact_id> (UUID) also resolves to
the same record. Both lookups go to the same renderer.
- API responses include both id and artifact_id so consumers
can pick.
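The dual-lookup rule above can be sketched as a small helper that decides which columns to check for a URL handle. This is only an illustration: the name `lookup_columns` is hypothetical, and it assumes that a handle which parses as a UUID should be tried against artifact_id first, with the legacy id column as a fallback.

```python
import uuid


def lookup_columns(handle: str) -> list[str]:
    """Return the artifacts columns to check for a URL handle, in priority order.

    Hypothetical helper, not part of the existing registry API.
    """
    try:
        uuid.UUID(handle)
    except ValueError:
        # Not UUID-shaped (e.g. "figure-abc123"): only the legacy id can match.
        return ["id"]
    # UUID-shaped: prefer the new column, but a legacy id may also be a bare UUID.
    return ["artifact_id", "id"]
```

Falling back to id for UUID-shaped handles is a deliberate choice in this sketch, since the spec notes that some existing files (notebooks) are already UUID-named.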

FK choices in new tables

Tables created by this and sibling specs (experiment_claims, replication_attempts, artifact_migration_log, etc.) FK to artifacts.id (the legacy handle) for migration safety — every
artifact has an id from day one, so there's no chicken-and-egg
problem. After Phase 2 backfill is complete and artifact_id is
populated for every row, future tables may FK to artifacts.artifact_id
instead.

Implications for `register_artifact()` (Phase 1)

Today scidex.atlas.artifact_registry.register_artifact() mints ids
in {type}-{uuid} format on the id column. Phase 1 changes:

- id continues to receive {type}-{uuid} format for backwards
compatibility with existing URL patterns (no need to change every
consumer at once).
- artifact_id receives a fresh uuid4() for every new artifact.
- Phase 2 backfill mints artifact_id for every existing row.
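A minimal sketch of the Phase 1 minting rule, assuming the two handles are generated independently (the spec says artifact_id receives a fresh uuid4(), so the UUID embedded in id is not reused here). The function name `mint_ids` is hypothetical and is not the real register_artifact() signature.

```python
import uuid


def mint_ids(artifact_type: str) -> dict:
    """Mint both handles for a new artifact row (illustrative only)."""
    return {
        # Legacy handle: keeps the {type}-{uuid} shape existing URLs rely on.
        "id": f"{artifact_type}-{uuid.uuid4()}",
        # New canonical handle: bare UUIDv4, no type prefix.
        "artifact_id": str(uuid.uuid4()),
        # The type lives in its own column, so artifact_id needs no prefix.
        "artifact_type": artifact_type,
    }
```

For example, `mint_ids("figure")` returns an id beginning with `figure-` and an artifact_id that parses as a version-4 UUID.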

Why now

- 11,667 files across 4+ naming conventions; impossible to reason about
"what files belong to artifact X" without DB joins
- quest_artifact_metadata_semantic_spec.md and
quest_artifact_reuse_provenance_qc_spec.md want to attach new
schema (summary, embedding, parent_artifact_id, qc_status); doing
the migration first avoids two schema churns
- Replication and experiment-execution quests will produce
multi-file artifacts (notebook + data dumps + figures + manifest)
that benefit from folder-per-artifact natively
- DB-only artifacts (hypotheses, claims) gain version-controlled
representation when they get a folder containing manifest.json —
user-confirmed: "we should consider moving some artifact more onto
git versioned files than relying on just the db"

Folder layout

data/scidex-artifacts/
  <artifact_id>/                   # one directory per artifact
    manifest.json                  # canonical metadata snapshot (mirrors DB row)
    <friendly_name>.<ext>          # primary file (notebook.ipynb, figure.png, dataset.csv, ...)
    accessories/
      <friendly_name>.html         # rendered notebook
      <friendly_name>.schema.json  # dataset schema
      preview.png                  # thumbnail
      summary.md                   # rendered summary
    inputs/                        # symlinks to upstream artifacts' folders
      <input-artifact_id> -> ../../<input-artifact_id>
    outputs/                       # symlinks to derivative artifacts
      <output-artifact_id> -> ../../<output-artifact_id>

Rationale:

- Symlink-based provenance lets ls inputs/ and ls outputs/
answer "what did this derive from / what derives from this" without
DB joins. Symlinks survive in the submodule because Git tracks them.
- accessories/ is opt-in; small artifacts (single figure) skip it.
- manifest.json is a denormalized snapshot for offline analysis,
rebuildable from the DB at any time. Idempotent generation.
- Friendly names preserved so a downloaded artifact is still
human-meaningful (vasodilator_response_AD.ipynb not the raw id).
- DB-only artifacts get folders too — the manifest.json is the
only file; that's how they become git-versioned.
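The inputs/ symlink convention from the layout above could be created roughly as follows. This is a sketch under assumptions: the helper name `link_input` is hypothetical, it targets a POSIX filesystem, and it uses a relative target (`../../<input-artifact_id>`) exactly as drawn in the layout so the links stay valid if the whole tree is relocated.

```python
import os
from pathlib import Path


def link_input(root: Path, artifact_id: str, input_id: str) -> Path:
    """Create <artifact_id>/inputs/<input_id> -> ../../<input_id> under root."""
    inputs_dir = root / artifact_id / "inputs"
    inputs_dir.mkdir(parents=True, exist_ok=True)
    link = inputs_dir / input_id
    if not link.is_symlink():  # idempotent: a second call is a no-op
        # The target is resolved relative to inputs/, so ../../ lands on root.
        os.symlink(os.path.join("..", "..", input_id), link, target_is_directory=True)
    return link
```

An outputs/ link would be the mirror image: the same relative target written into the upstream artifact's outputs/ directory.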

New columns on `artifacts`

ALTER TABLE artifacts ADD COLUMN artifact_id UUID;           -- canonical UUID handle (new code path)
ALTER TABLE artifacts ADD COLUMN artifact_dir TEXT;          -- 'data/scidex-artifacts/<artifact_id>'
ALTER TABLE artifacts ADD COLUMN primary_filename TEXT;      -- 'vasodilator_response_AD.ipynb'
ALTER TABLE artifacts ADD COLUMN accessory_filenames TEXT[]; -- ['vasodilator_response_AD.html']
ALTER TABLE artifacts ADD COLUMN folder_layout_version INT DEFAULT 1;
ALTER TABLE artifacts ADD COLUMN migrated_to_folder_at TIMESTAMPTZ;

-- UNIQUE on artifact_id is enforced via partial index (allows NULL during
-- backfill, prevents duplicates among populated rows).
CREATE UNIQUE INDEX idx_artifacts_artifact_id_unique
  ON artifacts(artifact_id) WHERE artifact_id IS NOT NULL;

CREATE INDEX idx_artifacts_migrated ON artifacts(migrated_to_folder_at)
  WHERE migrated_to_folder_at IS NOT NULL;

After Phase 2 backfill reaches 99%+ coverage: optionally ALTER TABLE artifacts ALTER COLUMN artifact_id SET NOT NULL. Until
then, artifact_id is nullable so the schema applies cleanly to a
populated table.

artifact_dir is always derivable as 'data/scidex-artifacts/' || artifact_id::text but stored explicitly
to support future relocation (S3, etc.).

Friendly-name generation

Read existing filename → strip extension → slugify
Truncate to 60 chars, lowercase, ASCII-only, replace whitespace/punct with _
If empty after slugify, fall back to <artifact_type>_<short-id-hash>
Conflicts inside a folder resolved with _n suffix
Stored in primary_filename; once set, never changes
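The rules above can be sketched as a single function. Several details are assumptions rather than part of the spec: the name `friendly_name`, the use of an 8-character SHA-256 prefix of the id as the "short-id-hash", and passing the already-used names in the folder as a `taken` set. Note that the spec's own example (vasodilator_response_AD.ipynb) keeps uppercase letters, so the lowercase rule applied here may not be final.

```python
import hashlib
import re
import unicodedata
from pathlib import PurePath


def friendly_name(filename: str, artifact_type: str, artifact_id: str,
                  taken: frozenset = frozenset()) -> str:
    """Derive a friendly stem (without extension) from an existing filename."""
    stem = PurePath(filename).stem  # strip the extension
    # ASCII-only: decompose accents, then drop anything outside ASCII.
    ascii_stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode()
    # Lowercase, collapse whitespace/punctuation runs to "_", truncate to 60 chars.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_stem.lower()).strip("_")[:60].strip("_")
    if not slug:
        # Fallback: <artifact_type>_<short-id-hash> (hash choice is an assumption).
        short = hashlib.sha256(artifact_id.encode()).hexdigest()[:8]
        slug = f"{artifact_type}_{short}"
    # Resolve conflicts inside the folder with an _n suffix.
    candidate, n = slug, 1
    while candidate in taken:
        candidate = f"{slug}_{n}"
        n += 1
    return candidate
```

For instance, `friendly_name("Vasodilator Response (AD).ipynb", "notebook", "x")` yields `vasodilator_response_ad`, and a second file with the same stem in that folder would get `vasodilator_response_ad_1`.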

URL paths

/artifact/<id> stays canonical (already deployed). Type-prefixed
routes can be added as redirects without breaking anything:

/figure/<id> → 301 /artifact/<id> (read pretty URL → canonical)
/notebook/<id> → 301 /artifact/<id>
/dataset/<id> → 301 /artifact/<id>
/model/<id> → 301 /artifact/<id>

User-confirmed: "there could be multiple paths to artifact (e.g., by
their type) but /artifact/id makes great sense."
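As a rough illustration of the redirect rule, the mapping from a type-prefixed path to its canonical target can be expressed without any web framework. The function name `redirect_for` and the tuple return shape are assumptions made for this sketch; how the real routes are registered in api.py is not specified here.

```python
from typing import Optional, Tuple

# Type prefixes listed in the spec; each redirects to the canonical route.
TYPE_PREFIXES = ("figure", "notebook", "dataset", "model")


def redirect_for(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, canonical_path) for a type-prefixed URL, else None."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] in TYPE_PREFIXES and parts[1]:
        return 301, f"/artifact/{parts[1]}"
    return None  # already canonical, or not an artifact route
```

The id segment is passed through untouched, so both legacy ids and UUID handles redirect the same way.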

Phased rollout

Each phase has a hard exit gate. Don't proceed to Phase N+1 until
Phase N acceptance is fully met and verified with a 24h soak.

Phase 0 — Schema (1 PR, ~2 hours)

Migration: migrations/20260428_artifact_folder_columns.sql

Add artifact_dir, primary_filename, accessory_filenames,
folder_layout_version, migrated_to_folder_at (all nullable)

Index on migrated_to_folder_at
New table artifact_migration_log for backfill audit
No data writes; pure schema change
Reversible — ALTER TABLE ... DROP COLUMN works

Acceptance:

☐ Migration applies cleanly to scidex PG

☐ \d artifacts shows new columns

☐ No regression in any existing query (read tests pass)

Status: scaffolded in companion PR. Phase 0 also lands path
helpers (artifact_dir(artifact_id), etc.) and a manifest.json
writer module (scidex.atlas.artifact_manifest).
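A minimal sketch of what such a manifest writer might look like, assuming the manifest is simply the DB row serialized as JSON with sorted keys and no trailing newline (the canonicalization recommended in the Work Log open items). The name `write_manifest` and the row-as-dict input are assumptions for illustration, not the actual scidex.atlas.artifact_manifest API.

```python
import json
from pathlib import Path


def write_manifest(folder: Path, row: dict) -> Path:
    """Write row as canonical JSON to <folder>/manifest.json (illustrative)."""
    folder.mkdir(parents=True, exist_ok=True)
    # sort_keys gives a stable key order; default=str covers UUIDs and timestamps.
    text = json.dumps(row, sort_keys=True, indent=2, ensure_ascii=False, default=str)
    path = folder / "manifest.json"
    # Rewrite only when the content changed, so regeneration causes no git churn.
    if not path.exists() or path.read_text(encoding="utf-8") != text:
        path.write_text(text, encoding="utf-8")
    return path
```

Because the output depends only on the row's contents, regenerating the manifest from the same DB row produces byte-identical files, which is what makes the generation idempotent.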

Phase 1 — New artifacts use folders (1 PR, ~1 day)

Code changes:

scidex/atlas/artifact_commit.py:

- Add artifact_id parameter (writers MUST supply; no auto-generation
here — the caller already has the id from register_artifact())
- Add friendly_name parameter (default None → derive from first path)
- When called with multi-file paths, write all into
data/scidex-artifacts/<id>/ with accessories/ for non-primary
- Generate manifest.json from DB row before commit
- Set migrated_to_folder_at = now() on the row

scidex/core/paths.py:

- artifact_dir(artifact_id) → returns Path
- artifact_primary_path(artifact_id, filename) → returns Path
- artifact_accessory_path(artifact_id, filename) → returns Path
- Keep FIGURE_DIR, NOTEBOOK_DIR, etc. — emit DeprecationWarning

scidex/atlas/artifact_registry.py:

- register_artifact() populates artifact_dir, primary_filename
- resolve_artifact() accepts id; returns row including folder fields
- get_capsule_manifest() reads from <id>/manifest.json if present,
else falls back to legacy path

api.py artifact write paths:

- Switch all writers to the new commit_artifact(artifact_id=..., paths=...)
signature
- Old code paths (direct write into FIGURE_DIR, etc.) flagged with
# TODO(folder-migration phase 5): delete comments
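The path helpers listed under scidex/core/paths.py might look roughly like this. The three helper names come from the list above; the `ARTIFACT_ROOT` constant and the `legacy_dir` accessor (a stand-in for how FIGURE_DIR and friends could emit a DeprecationWarning) are assumptions made for this sketch.

```python
import warnings
from pathlib import Path

# Assumed root; the spec places every artifact under data/scidex-artifacts/.
ARTIFACT_ROOT = Path("data/scidex-artifacts")


def artifact_dir(artifact_id: str) -> Path:
    """Folder for one artifact."""
    return ARTIFACT_ROOT / artifact_id


def artifact_primary_path(artifact_id: str, filename: str) -> Path:
    """Primary file sits directly inside the artifact folder."""
    return artifact_dir(artifact_id) / filename


def artifact_accessory_path(artifact_id: str, filename: str) -> Path:
    """Accessory files sit under accessories/."""
    return artifact_dir(artifact_id) / "accessories" / filename


def legacy_dir(name: str) -> Path:
    """Hypothetical accessor for legacy type-grouped dirs; warns but still works."""
    warnings.warn(f"{name} is deprecated; use artifact_dir()",
                  DeprecationWarning, stacklevel=2)
    return ARTIFACT_ROOT / name
```

Keeping the legacy accessor functional while it warns matches the Phase 1 acceptance item that deprecation warnings are logged but do not error.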

Tests:

- New artifact end-to-end: API call → artifact row exists → folder at
<id>/ → manifest.json present → URL works
- Mixed write: artifact row created without folder fields (legacy code
path) → Phase 2 backfill assigns folder
- Concurrent writes: two artifacts created simultaneously land in
different folders (no collision; ids are unique)

Acceptance:

☐ Every artifact created after deploy has migrated_to_folder_at IS NOT NULL

☐ No regression: old artifact reads still work via legacy paths

☐ Deprecation warnings logged but don't error

☐ 24h soak: monitor artifact_commit_failed events; no rate change

Phase 2 — Backfill existing artifacts (1 long-running migration task)

Driver: scripts/artifact_folder_backfill.py (new, recurring)

Gap predicate:

SELECT id, artifact_type, content_hash, metadata
FROM artifacts
WHERE migrated_to_folder_at IS NULL
ORDER BY created_at DESC
LIMIT 100;

Per-artifact algorithm (idempotent, atomic):

Resolve current file location(s):

- From metadata.figure_path, metadata.notebook_path,
metadata.dataset_path, etc.
- From provenance_chain if path is in there
- Fallback: scan data/scidex-artifacts/{type}s/ for files matching
the artifact's slug or content_hash
- If no files found (DB-only artifact) → mark artifact_dir
anyway, write only manifest.json to the folder

mkdir -p data/scidex-artifacts/<id>/

Generate friendly name from first file's basename

git mv files into <id>/ folder (preserves history, atomic in Git)

- Primary file at <id>/<friendly_name>.<ext>
- Accessories at <id>/accessories/<accessory_friendly>.<ext>

Verify SHA256 of moved file matches content_hash if set

Generate <id>/manifest.json from the DB row

Symlink up inputs/ and outputs/ based on artifact_links

Update DB row: artifact_dir, primary_filename,

accessory_filenames, migrated_to_folder_at

Append to artifact_migration_log:

(artifact_id, status, files_moved, errors, took_ms)

Atomicity: each artifact is one Git commit
(git mv + manifest write) so it's reverted as a unit if any step fails.
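The hash-verification step could be sketched as below. It assumes content_hash is stored as a bare hex SHA-256 digest (the spec does not state the encoding), and the function names are hypothetical. Running the check before git mv is one way to satisfy the failure-mode rule that files are not moved on a mismatch.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large artifacts are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_hash(path: Path, content_hash: Optional[str]) -> str:
    """Return the migration-log status for this file: 'success' or 'hash_mismatch'."""
    if not content_hash:
        return "success"  # nothing recorded to verify against
    return "success" if sha256_of(path) == content_hash.lower() else "hash_mismatch"
```

A driver would log the returned status to artifact_migration_log and skip the move when it is 'hash_mismatch'.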

Bounded batch: 100 artifacts/cycle, every-30-min recurring. At
11,667 artifacts and 100/cycle/30min → ~60h total wall time, ~3 days.
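The wall-time estimate can be checked with a few lines of arithmetic, assuming every cycle processes a full batch and nothing needs to be retried:

```python
import math

TOTAL_ARTIFACTS = 11_667
BATCH_SIZE = 100
CYCLE_MINUTES = 30

cycles = math.ceil(TOTAL_ARTIFACTS / BATCH_SIZE)  # 117 cycles
hours = cycles * CYCLE_MINUTES / 60               # 58.5 hours
days = hours / 24                                 # a little under 2.5 days
```

So the best case is about 58.5 hours; the "~3 days" figure leaves headroom for skipped or retried artifacts.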

Cold-start QC backfill (per user direction): now is the time to
also run QC on existing artifacts. Coordinate with quest_artifact_reuse_provenance_qc_spec.md so QC + folder migration
happen together in this drained-fleet window.

Failure modes:

Failure	Response
Files not found on disk	DB-only artifact; write manifest.json only, mark migrated
Content hash mismatch	`status='hash_mismatch'`, do NOT move files, log diff
Git mv fails	`status='error'`, retry next cycle (idempotent)
Disk full	Halt all writes, alert operator
Submodule lock contention	Skip artifact, retry next cycle

Acceptance:

☐ Every existing artifact has migrated_to_folder_at IS NOT NULL within 7 days

☐ artifact_migration_log: ≥99% status='success' for artifacts with files

☐ No file lost: pre-migration find data/scidex-artifacts/ -type f | wc -l ≤ post-migration count

☐ All content_hash values verified post-move (or flagged)

☐ Submodule grows by ≤ 5% (mostly directory entries + manifests)

☐ No reduction in artifact-serving latency (p99 < 500ms maintained)

Rollback: each cycle is one commit; git revert to undo.

Phase 3 — Switch readers to folder paths (1 PR per consumer)

Order (topological by dependency):

scidex/atlas/artifact_registry.resolve_artifact() — read folder fields

api.py /api/artifacts/{id} routes — return folder fields

Notebook viewer — read <id>/<filename>.html first, fall back to legacy

Figure viewer — same

Dataset viewer — same

KG → artifact link rendering — reference id in folder layout

Wiki {{artifact:ID}} embed — resolve via id; folder for content

Search/recommendation surfaces — return folder fields

Each switch is its own small PR with a feature flag
(ARTIFACT_FOLDER_READERS_ENABLED=true). Roll forward gradually.

Phase 4 — Type-prefixed URL aliases (1 PR)

Add type-prefixed redirects:

/figure/<id> → 301 /artifact/<id>
/notebook/<id> → 301 /artifact/<id>
/dataset/<id> → 301 /artifact/<id>
/model/<id> → 301 /artifact/<id>

These are net-new URLs; existing /artifact/<id> stays canonical.
Add <link rel="canonical"> header on every artifact page.

Acceptance:

☐ New type-prefixed URLs return 301 to /artifact/<id>

☐ Canonical link header on every artifact page

☐ Sitemap includes both type-prefixed and canonical URLs

Phase 5 — Remove legacy paths (1 PR)

This cleanup phase runs only after a 30+ day soak following Phases 3 and 4.

Old type-grouped folders (figures/, notebooks/, etc.) emptied
(files already moved by Phase 2); this phase deletes the empty dirs

paths.py legacy constants (FIGURE_DIR, etc.) deleted
commit_artifact legacy code paths deleted
Deprecation warnings removed
VACUUM ANALYZE

Acceptance:

☐ All legacy paths removed

☐ All readers use folder paths exclusively

☐ Test suite has no references to FIGURE_DIR etc.

☐ data/scidex-artifacts/ listing shows only <id>/ directories

Edge cases

Submodule artifact: special handling

data/scidex-artifacts/ and data/scidex-papers/ are git submodules.
Migration commits land in the submodule first, then the outer repo
updates the submodule pointer. Use git submodule update --remote in
backfill driver before each batch. Push submodule before outer repo.

Multi-file artifacts that don't fit the layout

Some artifacts (paper figures with extracted SVG + PNG + LaTeX) might
have 5+ files. Layout supports this: primary file + N accessories.
Folder structure stays flat (no nested subdirs except accessories/, inputs/, outputs/).

DB-only artifacts (hypotheses, claims, etc.)

These have no on-disk files today. They still get a folder containing
only manifest.json. Per user direction, this brings them into git
version control — the manifest is the canonical exported representation.
Folder size is ~1KB; trivial.

Concurrent backfill + new artifact creation

Phase 2 driver inserts folder fields for existing rows. Phase 1 ensures
new rows get folder fields. Race: a new artifact created at the same
moment the driver picks up an old row → no conflict, different rows.

Backfill driver crashes mid-batch

Driver acquires per-artifact advisory lock before mutating:

SELECT pg_try_advisory_xact_lock(hashtext('artifact_folder_migration:' || id))

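If the driver crashes, the next cycle picks up the same row (idempotent: it
checks whether the folder is already populated and skips if so). Because the
lock is transaction-scoped, it is released automatically when the crashed
session's transaction ends.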
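From Python, taking the per-artifact lock might look like the sketch below. It assumes a DB-API cursor that uses the `%s` parameter style and returns tuple rows (as psycopg does by default); the function name `try_lock_artifact` is hypothetical. The lock key string is the one shown in the SQL above.

```python
LOCK_SQL = "SELECT pg_try_advisory_xact_lock(hashtext(%s))"


def try_lock_artifact(cursor, artifact_row_id: str) -> bool:
    """Try to take the transaction-scoped lock; False means skip this artifact."""
    # Passing the key as a bound parameter avoids building SQL by concatenation.
    cursor.execute(LOCK_SQL, ("artifact_folder_migration:" + artifact_row_id,))
    return bool(cursor.fetchone()[0])
```

The call must run inside the same transaction that mutates the row, since the lock is released at commit or rollback. A False result means another worker holds the lock, and the artifact can simply be retried on the next cycle.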

Storage cost

Phase 2 writes manifests + creates folders → ~50KB overhead per artifact
on average → 11,667 × 50KB = 580MB additional. Within current submodule
size budget (1.1GB → 1.7GB).
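The storage estimate works out as follows, using decimal units (1 MB = 1000 KB) and the figures quoted above:

```python
ARTIFACTS = 11_667
OVERHEAD_KB = 50
CURRENT_GB = 1.1

added_mb = ARTIFACTS * OVERHEAD_KB / 1000      # 583.35 MB, i.e. the ~580MB quoted
projected_gb = CURRENT_GB + added_mb / 1000    # about 1.68 GB, rounds to 1.7 GB
growth_pct = 100 * (projected_gb - CURRENT_GB) / CURRENT_GB  # roughly 53%
```

At 50KB per artifact the submodule would grow by roughly half, which may be worth reconciling with the Phase 2 acceptance target of ≤ 5% growth and the ~1KB folder size quoted for DB-only artifacts.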

Schema changes summary

-- Phase 0
ALTER TABLE artifacts ADD COLUMN artifact_dir TEXT;
ALTER TABLE artifacts ADD COLUMN primary_filename TEXT;
ALTER TABLE artifacts ADD COLUMN accessory_filenames TEXT[];
ALTER TABLE artifacts ADD COLUMN folder_layout_version INT DEFAULT 1;
ALTER TABLE artifacts ADD COLUMN migrated_to_folder_at TIMESTAMPTZ;

CREATE TABLE artifact_migration_log (
  id BIGSERIAL PRIMARY KEY,
  artifact_id TEXT NOT NULL,
  status TEXT NOT NULL,
  files_moved JSONB,
  errors JSONB,
  took_ms INT,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

API additions

GET /api/artifacts/<id>/folder — list files in artifact folder
GET /api/artifacts/<id>/manifest — raw manifest.json
GET /api/artifacts/<id>/inputs — list of input artifact ids
GET /api/artifacts/<id>/outputs — list of artifacts derived from this
POST /api/artifacts/<id>/accessory — add accessory file (auth required)

Acceptance criteria (top-level)

☐ Phase 0 migration applied, no regressions

☐ Phase 1 deployed, every new artifact has folder + manifest

☐ Phase 2 driver runs every-30-min, processes 100/cycle, idempotent

☐ Phase 2 reaches 99%+ coverage within 7 days (drained fleet should let this run uninterrupted)

☐ Phase 3 readers switched, ≥95% reads use folder paths

☐ Phase 4 type-prefixed URL aliases live

☐ Phase 5 cleanup deployed; legacy paths gone

☐ No artifact files lost

☐ Public URLs stable

Dependencies

scidex.core.database (PG access)
scidex.atlas.artifact_commit (extension point in Phase 1)
scidex.atlas.artifact_registry (extension point in Phase 1, 3)
data/scidex-artifacts/ git submodule (write target in Phase 2)
Orchestra recurring task scheduling (Phase 2 driver)

Dependents

quest_artifact_metadata_semantic_spec.md — keys summary embedding on id
quest_artifact_reuse_provenance_qc_spec.md — keys parent/derived on id
quest_experiment_execution_participant_spec.md — output artifacts use folder layout
quest_paper_replication_starter_spec.md — replication artifacts use folder layout

Work Log

2026-04-28 — Spec authored, then revised on user feedback

Initial design proposed a separate guid UUID column. User pushed
back: "we should use uuid rather than guid (minor difference, but if
needed for naming); just id is probably also fine." Going with the
simplest path — use existing artifacts.id as the canonical handle,
no new column. Folder name = literal id value.

Phase 0 PR (#1222) reflects the simplified design: 5 nullable columns
+ migration log table; no guid column. Spec content updated in this PR.

User-confirmed design choices:

DB-only artifacts get folders too (move more onto git-versioned files)
Multiple URL paths by type are fine; /artifact/<id> stays canonical
Higher embedding dim is fine (deferred to metadata spec)
Cold-start QC backfill happens together with folder migration during
the current drained-fleet window

Open items (track here as work begins):

- Are there artifacts referenced by external systems (e.g. published
paper supplementals pointing at scidex.ai/artifact/<id>)? Audit
external references before any URL semantics change.
- Should manifest.json be canonicalized JSON (sorted keys, no
trailing newline) so commit churn is minimized? Recommended yes —
the helper does this.

Payload JSON

{
  "requirements": {
    "reasoning": 7,
    "coding": 8,
    "safety": 8
  }
}

Sibling Tasks in Quest (Atlas) ↗

○ [Atlas] Drug target therapeutic recommendations — generate actionable recs for 91 tier-1 neurodegeneration targets P96
○ [Atlas] Causal KG entity resolution — bridge 19K free-text causal edges to canonical KG entities P95
○ [Atlas] Squad findings bubble-up driver (driver #20) P94
○ [Atlas] Install Dolt server + migrate first dataset (driver #26) P92
○ [Atlas] Dataset PR review & merge driver (driver #27) P92
○ [Atlas] Wiki mermaid LLM regen — 50 pages/run, parallel agents P92
○ [Atlas] Unresolved causal edge triage — mine 12K stalled causal claims for cross-disease KG nodes P91
○ [Atlas] Versioned tabular datasets — overall coordination quest P90
○ [Atlas] KG ↔ dataset cross-link driver (driver #30) P90
○ [Atlas] CI: Generate semantic metadata for unsummarized artifacts P90

[Atlas] CI: Drive artifact folder migration backfill open coding:8 reasoning:7 safety:8
