[Atlas] CI: Generate semantic metadata for unsummarized artifacts

Recurring driver per quest_artifact_metadata_semantic_spec.md. Each cycle: 50 artifacts WHERE summary IS NULL OR summary_rubric_version < current. Run versioned LLM rubric -> summary, key_findings, methods, data_sources, applicable_domains, semantic_keywords. Embed summary into pgvector. Upsert into artifacts row.

Goal

Make every artifact semantically discoverable. Each artifact carries a
short LLM-generated summary, structured metadata (key findings,
methods, data sources, applicable domains), and a vector embedding of
the summary. A semantic-search endpoint and "find similar" surfaces
make artifacts genuinely reusable rather than orphaned per-task outputs.

The current state has metadata as a free-form JSONB column with
type-specific schemas (figures have caption + source_notebook_id,
models have model_family + framework, etc.). What's missing is a uniform, semantic, queryable layer on top: every artifact, regardless
of type, has a summary you can search against.

> ## Continuous-process anchor
>
> Steady-state: a recurring driver finds artifacts with
> summary_version < current_rubric_version, runs the rubric,
> upserts the summary + embedding, and stamps the version. Every
> rubric improvement triggers a re-summary pass. See
> docs/design/retired_scripts_patterns.md § "1. LLMs for semantic
> judgment; rules for syntactic validation" — this spec is a textbook
> case.

Why now

  • Today, "find me a heatmap of microglial gene expression in AD"
requires keyword grep through artifact titles. Reuse is rare because
discovery is hard.
  • Artifact reuse is a stated core value (the user said: "we want
artifacts that will get reused"). Reuse requires findability.
  • The replication and experiment-execution quests will produce many
similar artifacts (multiple agents may attempt the same figure);
semantic dedup needs embeddings.
  • Compounds with the artifact folder migration: artifact.id is the
stable handle that embeddings reference.

Design

New columns on artifacts

ALTER TABLE artifacts
  ADD COLUMN summary TEXT,
  ADD COLUMN summary_embedding vector(1536),     -- pgvector
  ADD COLUMN key_findings JSONB,                 -- ["finding 1", "finding 2", ...]
  ADD COLUMN methods_used TEXT[],                -- ['RNA-seq', 'CRISPR-screen']
  ADD COLUMN data_sources TEXT[],                -- ['Allen-SEA-AD', 'GTEx-v10']
  ADD COLUMN applicable_domains TEXT[],          -- ['alzheimers', 'microglia']
  ADD COLUMN semantic_keywords TEXT[],           -- ['heatmap', 'differential-expression']
  ADD COLUMN summary_generated_at TIMESTAMPTZ,
  ADD COLUMN summary_model TEXT,                 -- 'claude-opus-4-7' / 'codex-...'
  ADD COLUMN summary_rubric_version INT;

CREATE INDEX idx_artifacts_summary_embedding
  ON artifacts USING ivfflat (summary_embedding vector_cosine_ops);
CREATE INDEX idx_artifacts_methods_used ON artifacts USING gin(methods_used);
CREATE INDEX idx_artifacts_data_sources ON artifacts USING gin(data_sources);
CREATE INDEX idx_artifacts_applicable_domains ON artifacts USING gin(applicable_domains);

pgvector dependency: confirm the extension is installed
(CREATE EXTENSION IF NOT EXISTS vector). If it is not present, the
migration installs it; if installation fails, the recurring driver
becomes a no-op until vector is available.

The summary rubric

Stored as a versioned PG row in a new artifact_summary_rubric table:

CREATE TABLE artifact_summary_rubric (
  version INT PRIMARY KEY,
  prompt_template TEXT NOT NULL,
  output_schema_json JSONB NOT NULL,
  embedding_model TEXT NOT NULL,    -- e.g. 'claude-opus-4-7-embed' (or third-party)
  retired BOOLEAN DEFAULT FALSE,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  notes TEXT
);

Initial rubric (v1):

> Read the artifact's title, type, and metadata. If it's a figure or
> notebook, also read up to 200 lines of associated text/captions/cells.
> Produce:
> 1. A 1-3 sentence summary capturing what the artifact contains and
> why it might matter to a researcher.
> 2. 3-5 key findings (bullet phrases, each ≤15 words).
> 3. Methods used (controlled vocabulary; see methods_taxonomy).
> 4. Data sources (controlled vocabulary; see data_sources_taxonomy).
> 5. Applicable disease/biology domains.
> 6. 5-10 semantic keywords for retrieval (free-form, lowercase).
> 7. Confidence (0-1) in your summary's faithfulness to the artifact.
>
> Be honest about what's actually in the artifact. If it's a stub,
> say so. If you can't tell what it does, say so. Do not embellish.

The rubric self-improves: a meta-task ("audit 50 random rubric_v1
outputs against ground-truth artifacts; propose rubric_v2 changes")
runs weekly. Operators approve rubric upgrades.
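The driver has to sanity-check each rubric response before upserting it. A minimal sketch of that validation, assuming the rubric's JSON output uses field names mirroring the new artifact columns (the exact output_schema_json is not fixed by this spec):

```python
import json

# Assumed rubric-v1 output fields; the authoritative list lives in
# artifact_summary_rubric.output_schema_json.
REQUIRED_FIELDS = {
    "summary", "key_findings", "methods_used", "data_sources",
    "applicable_domains", "semantic_keywords", "confidence",
}

def validate_rubric_output(raw: str) -> dict:
    """Parse and sanity-check one LLM rubric response (a JSON string)."""
    out = json.loads(raw)  # malformed JSON raises ValueError -> triggers the retry
    missing = REQUIRED_FIELDS - out.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not 0.0 <= out["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    if not 3 <= len(out["key_findings"]) <= 5:
        raise ValueError("expected 3-5 key findings")
    return out
```

A response that fails this check twice (once after the stricter retry suffix) is the "Malformed JSON after retry" failure mode.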

Controlled vocabularies

Two new tables, populated by LLM-driven discovery (NOT hardcoded):

CREATE TABLE methods_taxonomy (
  term TEXT PRIMARY KEY,
  category TEXT,           -- 'wet-lab', 'computational', 'imaging', ...
  parent_term TEXT REFERENCES methods_taxonomy(term),
  synonyms TEXT[],
  first_seen_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE data_sources_taxonomy (
  source_id TEXT PRIMARY KEY,    -- 'Allen-SEA-AD', 'GTEx-v10'
  display_name TEXT NOT NULL,
  category TEXT,                 -- 'transcriptomics', 'imaging', 'clinical'
  url TEXT,
  license TEXT
);

Bootstrap: a one-shot task seeds each table with 30-50 obvious entries
(LLM-generated from sampled artifacts, operator-approved). Steady
state: the rubric driver proposes new terms when it can't fit an
artifact's method to existing entries; weekly meta-job consolidates.
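Before proposing a new term, the driver should try to resolve it against the canonical terms and synonyms already in methods_taxonomy. A sketch of that lookup, with assumed normalization rules (lowercase, hyphenated):

```python
def normalize(term: str) -> str:
    # Assumed canonical form: lowercase, hyphen-separated.
    return term.strip().lower().replace(" ", "-").replace("_", "-")

def resolve_method_term(proposed, taxonomy):
    """taxonomy maps canonical term -> synonym list, mirroring the
    methods_taxonomy rows. Returns the canonical term, or None if the
    proposed term should become a new taxonomy entry."""
    wanted = normalize(proposed)
    for canonical, synonyms in taxonomy.items():
        if wanted == normalize(canonical):
            return canonical
        if any(wanted == normalize(s) for s in synonyms):
            return canonical
    return None
```

Unmatched terms accumulate as proposals for the weekly consolidation job rather than being inserted directly.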

Recurring driver

scripts/artifact_summary_backfill.py — but per the continuous-process
principles, this is wired as a recurring CI task with a Codex agent
reading the spec, not a standalone script.

Gap predicate:

SELECT id, artifact_type, title, metadata
FROM artifacts
WHERE summary IS NULL
   OR summary_rubric_version < (SELECT MAX(version) FROM artifact_summary_rubric WHERE NOT retired)
ORDER BY quality_score DESC NULLS LAST,
         created_at DESC
LIMIT 50;

Priority order: high-quality artifacts first (already vetted),
newest first within tier.

Per-artifact algorithm:

  • Fetch artifact row + metadata + (if file-bearing) primary file content
  • Trim file content to a reasonable token budget (8K tokens max)
  • Render rubric prompt with artifact context
  • Call LLM (provider routed via quest_llm_routing_spec.md)
  • Parse JSON response per output_schema_json
  • If parse fails: one retry with stricter "JSON only, no prose" suffix
  • Generate embedding of the summary (separate model call)
  • Upsert into artifacts row:

    UPDATE artifacts SET
      summary = ?, key_findings = ?, methods_used = ?,
      data_sources = ?, applicable_domains = ?, semantic_keywords = ?,
      summary_embedding = ?, summary_generated_at = NOW(),
      summary_model = ?, summary_rubric_version = ?
    WHERE id = ?

  • Append run row to artifact_summary_runs (count per cycle, cost, errors)
  • Bounded batch: 50 artifacts/cycle, every-2h.
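The parse/retry/embed portion of the per-artifact algorithm can be sketched as a pure function with injected callables (render_prompt, call_llm, embed are assumptions standing in for the real routing layer):

```python
import json

STRICT_SUFFIX = "\n\nReturn JSON only, no prose."  # retry suffix from the spec

def summarize_artifact(artifact, render_prompt, call_llm, embed):
    """One pass of the per-artifact algorithm; callable signatures are
    assumed, not defined by this spec."""
    prompt = render_prompt(artifact)
    try:
        fields = json.loads(call_llm(prompt))
    except ValueError:
        # One retry with the stricter suffix, then give up on this artifact.
        try:
            fields = json.loads(call_llm(prompt + STRICT_SUFFIX))
        except ValueError:
            return None  # caller logs to artifact_summary_failures
    try:
        fields["summary_embedding"] = embed(fields["summary"])
    except Exception:
        fields["summary_embedding"] = None  # embedding retried next cycle
    return fields
```

Note the asymmetry: a failed summary skips the artifact, but a failed embedding still persists the summary, matching the failure-mode table below.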

Failure modes:

  Failure                        Response
  LLM rate-limited               Skip remaining items in cycle, log, retry next cycle
  Malformed JSON after retry     Skip artifact, log to artifact_summary_failures
  Embedding model unavailable    Generate summary anyway, embedding = NULL, retry next cycle
  Artifact files missing         Generate summary from metadata + title only, mark confidence < 0.5
  Rubric version bumped mid-run  Existing in-flight items continue at old version; new rubric applies next cycle

Search API

POST /api/artifacts/search
{
  "query": "heatmap microglial gene expression Alzheimer",
  "filters": {
    "artifact_type": ["figure", "notebook"],
    "methods_used": ["differential-expression"],
    "min_quality_score": 0.6,
    "applicable_domains": ["alzheimers"]
  },
  "limit": 25
}

Response:

{
  "results": [
    {
      "id": "figure-abc123",
      "title": "...",
      "summary": "...",
      "similarity": 0.847,
      "key_findings": [...],
      "url": "/artifact/<id>"
    },
    ...
  ]
}

Implementation: cosine similarity over summary_embedding, filtered by
the structured filters. Returns top-N ranked by similarity.

Companion endpoint: GET /api/artifacts/<id>/similar?limit=10 returns
artifacts with embedding similarity > 0.75 to this one.
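The ranking behind the search endpoint can be illustrated with an in-memory sketch; in production the ANN scan runs inside Postgres via the ivfflat index, so this is a model of the semantics only (field handling for min_quality_score is an assumption):

```python
import math

def cosine(a, b):
    # Cosine similarity; pgvector's vector_cosine_ops computes this in-database.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def passes(row, filters):
    for key, wanted in filters.items():
        if key == "min_quality_score":
            if row.get("quality_score", 0) < wanted:
                return False
        else:
            have = row.get(key)
            have = have if isinstance(have, list) else [have]
            if not set(have) & set(wanted):
                return False
    return True

def search(query_vec, rows, filters=None, limit=25):
    """rows: artifact dicts carrying summary_embedding plus facet columns."""
    hits = [r for r in rows if passes(r, filters or {})]
    hits.sort(key=lambda r: cosine(query_vec, r["summary_embedding"]), reverse=True)
    return hits[:limit]
```

The /similar endpoint is the same ranking seeded with an artifact's own embedding and a similarity floor of 0.75.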

"Find similar" UI surface

On every artifact detail page (Phase 1 of the folder migration adds this):
a "Similar artifacts" sidebar. Useful for reuse: a researcher viewing
an AD heatmap immediately sees related heatmaps they could derive from.

Compositional metadata extraction

Some metadata fields are extractable without an LLM:

• Notebook: parse .ipynb cells; extract import statements →
  methods_used candidates (e.g. import scanpy → 'single-cell-analysis')
• Dataset: parse .schema.json → column names + types as features
• Figure: read EXIF/metadata; parse caption if the PNG has an iTXt chunk
• Model: parse the model card if present

Run these before the LLM call; the LLM then receives pre-extracted
hints, which reduces hallucination risk.
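The notebook case above can be sketched as a parser over the .ipynb JSON; the import-to-method mapping here is an assumption for illustration (the real mapping would live in methods_taxonomy, not in code):

```python
import json
import re

# Hypothetical mapping from imported packages to taxonomy terms.
IMPORT_TO_METHOD = {
    "scanpy": "single-cell-analysis",
    "pydeseq2": "differential-expression",
}

def methods_from_notebook(ipynb_json: str):
    """Scan code cells for import statements and return methods_used
    candidates to feed the LLM as pre-extracted hints."""
    nb = json.loads(ipynb_json)
    found = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = "".join(cell.get("source", []))
        for m in re.finditer(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", src, re.M):
            term = IMPORT_TO_METHOD.get(m.group(1))
            if term:
                found.add(term)
    return sorted(found)
```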

Acceptance criteria

☐ All schema migrations applied
☐ artifact_summary_rubric v1 row inserted
☐ methods_taxonomy and data_sources_taxonomy seeded with ≥30 entries each
☐ Recurring driver running every-2h, processing 50/cycle
☐ 90% of artifacts with quality_score ≥ 0.6 have a summary within 14 days
☐ Search API returns relevant results: precision@10 ≥ 0.7 on a 30-query test set
☐ "Similar artifacts" sidebar live on artifact detail pages
☐ Weekly rubric audit job runs; v2 rubric proposed within 4 weeks
☐ Failure rate < 5% (parsing + LLM errors combined)

Dependencies

• quest_artifact_uuid_migration_spec.md Phase 0 (uses artifacts.id; id values are UUIDs for new artifacts)
• pgvector PostgreSQL extension
• quest_llm_routing_spec.md for provider selection

Dependents

• quest_artifact_reuse_provenance_qc_spec.md (uses summary in reuse signals)
• All consumers wanting artifact recommendation / dedup
• quest_paper_replication_starter_spec.md (semantic dedup of replication attempts)

Work Log

2026-04-28 — Spec authored

Bootstrap design. Versioned LLM rubric for summaries; pgvector for
embeddings; structured taxonomies for filterable facets. Recurring
driver pattern (50/cycle, every-2h). Search + similar APIs designed
but not implemented.

Open question: should embeddings be 1536-dim (OpenAI/Anthropic style)
or smaller? Storage at 11K artifacts is trivial either way; pick based
on a retrieval-quality benchmark.
