[Atlas] CI: Generate semantic metadata for unsummarized artifacts

Recurring driver per quest_artifact_metadata_semantic_spec.md. Each cycle: 50 artifacts WHERE summary IS NULL OR summary_rubric_version < current. Run versioned LLM rubric -> summary, key_findings, methods, data_sources, applicable_domains, semantic_keywords. Embed summary into pgvector. Upsert into artifacts row.

Goal

Make every artifact semantically discoverable. Each artifact carries a
short LLM-generated summary, structured metadata (key findings,
methods, data sources, applicable domains), and a vector embedding of
the summary. A semantic-search endpoint and "find similar" surfaces
make artifacts genuinely reusable rather than orphaned per-task outputs.

The current state has metadata as a free-form JSONB column with
type-specific schemas (figures have caption + source_notebook_id,
models have model_family + framework, etc.). What's missing is a uniform, semantic, queryable layer on top: every artifact, regardless
of type, has a summary you can search against.

> ## Continuous-process anchor
>
> Steady-state: a recurring driver finds artifacts with
> summary_version < current_rubric_version, runs the rubric,
> upserts the summary + embedding, and stamps the version. Every
> rubric improvement triggers a re-summary pass. See
> docs/design/retired_scripts_patterns.md § "1. LLMs for semantic
> judgment; rules for syntactic validation" — this spec is a textbook
> case.

Why now

  • Today, "find me a heatmap of microglial gene expression in AD"
requires keyword grep through artifact titles. Reuse is rare because
discovery is hard.
  • Artifact reuse is a stated core value (the user said: "we want
artifacts that will get reused"). Reuse requires findability.
  • The replication and experiment-execution quests will produce many
similar artifacts (multiple agents may attempt the same figure);
semantic dedup needs embeddings.
  • Compounds with the artifact folder migration: artifact.id is the
stable handle that embeddings reference.

Design

New columns on artifacts

ALTER TABLE artifacts
  ADD COLUMN summary TEXT,
  ADD COLUMN summary_embedding vector(1536),     -- pgvector
  ADD COLUMN key_findings JSONB,                 -- ["finding 1", "finding 2", ...]
  ADD COLUMN methods_used TEXT[],                -- ['RNA-seq', 'CRISPR-screen']
  ADD COLUMN data_sources TEXT[],                -- ['Allen-SEA-AD', 'GTEx-v10']
  ADD COLUMN applicable_domains TEXT[],          -- ['alzheimers', 'microglia']
  ADD COLUMN semantic_keywords TEXT[],           -- ['heatmap', 'differential-expression']
  ADD COLUMN summary_generated_at TIMESTAMPTZ,
  ADD COLUMN summary_model TEXT,                 -- 'claude-opus-4-7' / 'codex-...'
  ADD COLUMN summary_rubric_version INT;

CREATE INDEX idx_artifacts_summary_embedding
  ON artifacts USING ivfflat (summary_embedding vector_cosine_ops);
CREATE INDEX idx_artifacts_methods_used ON artifacts USING gin(methods_used);
CREATE INDEX idx_artifacts_data_sources ON artifacts USING gin(data_sources);
CREATE INDEX idx_artifacts_applicable_domains ON artifacts USING gin(applicable_domains);

pgvector dependency: confirm the extension is installed
(CREATE EXTENSION IF NOT EXISTS vector). If it is not present, the
migration installs it; if installation fails, the recurring driver
becomes a no-op until vector is available.

The summary rubric

Stored as a versioned PG row in a new artifact_summary_rubric table:

CREATE TABLE artifact_summary_rubric (
  version INT PRIMARY KEY,
  prompt_template TEXT NOT NULL,
  output_schema_json JSONB NOT NULL,
  embedding_model TEXT NOT NULL,    -- e.g. 'claude-opus-4-7-embed' (or third-party)
  retired BOOLEAN DEFAULT FALSE,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  notes TEXT
);

Initial rubric (v1):

> Read the artifact's title, type, and metadata. If it's a figure or
> notebook, also read up to 200 lines of associated text/captions/cells.
> Produce:
> 1. A 1-3 sentence summary capturing what the artifact contains and
> why it might matter to a researcher.
> 2. 3-5 key findings (bullet phrases, each ≤15 words).
> 3. Methods used (controlled vocabulary; see methods_taxonomy).
> 4. Data sources (controlled vocabulary; see data_sources_taxonomy).
> 5. Applicable disease/biology domains.
> 6. 5-10 semantic keywords for retrieval (free-form, lowercase).
> 7. Confidence (0-1) in your summary's faithfulness to the artifact.
>
> Be honest about what's actually in the artifact. If it's a stub,
> say so. If you can't tell what it does, say so. Do not embellish.

The rubric self-improves: a meta-task ("audit 50 random rubric_v1
outputs against ground-truth artifacts; propose rubric_v2 changes")
runs weekly. Operators approve rubric upgrades.
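The driver has to sanity-check each rubric response before upserting it. A minimal sketch of that validation, assuming the rubric's JSON output uses field names mirroring the new artifact columns (the exact output_schema_json is not fixed by this spec):

```python
import json

# Assumed rubric-v1 output fields; the authoritative list lives in
# artifact_summary_rubric.output_schema_json.
REQUIRED_FIELDS = {
    "summary", "key_findings", "methods_used", "data_sources",
    "applicable_domains", "semantic_keywords", "confidence",
}

def validate_rubric_output(raw: str) -> dict:
    """Parse and sanity-check one LLM rubric response (a JSON string)."""
    out = json.loads(raw)  # malformed JSON raises ValueError -> triggers the retry
    missing = REQUIRED_FIELDS - out.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not 0.0 <= out["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    if not 3 <= len(out["key_findings"]) <= 5:
        raise ValueError("expected 3-5 key findings")
    return out
```

A response that fails this check twice (once after the stricter retry suffix) is the "Malformed JSON after retry" failure mode.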

Controlled vocabularies

Two new tables, populated by LLM-driven discovery (NOT hardcoded):

CREATE TABLE methods_taxonomy (
  term TEXT PRIMARY KEY,
  category TEXT,           -- 'wet-lab', 'computational', 'imaging', ...
  parent_term TEXT REFERENCES methods_taxonomy(term),
  synonyms TEXT[],
  first_seen_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE data_sources_taxonomy (
  source_id TEXT PRIMARY KEY,    -- 'Allen-SEA-AD', 'GTEx-v10'
  display_name TEXT NOT NULL,
  category TEXT,                 -- 'transcriptomics', 'imaging', 'clinical'
  url TEXT,
  license TEXT
);

Bootstrap: a one-shot task seeds each table with 30-50 obvious entries
(LLM-generated from sampled artifacts, operator-approved). Steady
state: the rubric driver proposes new terms when it can't fit an
artifact's method to existing entries; weekly meta-job consolidates.
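Before proposing a new term, the driver should try to resolve it against the canonical terms and synonyms already in methods_taxonomy. A sketch of that lookup, with assumed normalization rules (lowercase, hyphenated):

```python
def normalize(term: str) -> str:
    # Assumed canonical form: lowercase, hyphen-separated.
    return term.strip().lower().replace(" ", "-").replace("_", "-")

def resolve_method_term(proposed, taxonomy):
    """taxonomy maps canonical term -> synonym list, mirroring the
    methods_taxonomy rows. Returns the canonical term, or None if the
    proposed term should become a new taxonomy entry."""
    wanted = normalize(proposed)
    for canonical, synonyms in taxonomy.items():
        if wanted == normalize(canonical):
            return canonical
        if any(wanted == normalize(s) for s in synonyms):
            return canonical
    return None
```

Unmatched terms accumulate as proposals for the weekly consolidation job rather than being inserted directly.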

Recurring driver

scripts/artifact_summary_backfill.py — but per the continuous-process
principles, this is wired as a recurring CI task with a Codex agent
reading the spec, not a standalone script.

Gap predicate:

SELECT id, artifact_type, title, metadata
FROM artifacts
WHERE summary IS NULL
   OR summary_rubric_version < (SELECT MAX(version) FROM artifact_summary_rubric WHERE NOT retired)
ORDER BY quality_score DESC NULLS LAST,
         created_at DESC
LIMIT 50;

Priority order: high-quality artifacts first (already vetted),
newest first within tier.

Per-artifact algorithm:

  • Fetch artifact row + metadata + (if file-bearing) primary file content
  • Trim file content to a reasonable token budget (8K tokens max)
  • Render rubric prompt with artifact context
  • Call LLM (provider routed via quest_llm_routing_spec.md)
  • Parse JSON response per output_schema_json
  • If parse fails: one retry with stricter "JSON only, no prose" suffix
  • Generate embedding of the summary (separate model call)
  • Upsert into artifacts row:

    UPDATE artifacts SET
      summary = ?, key_findings = ?, methods_used = ?,
      data_sources = ?, applicable_domains = ?, semantic_keywords = ?,
      summary_embedding = ?, summary_generated_at = NOW(),
      summary_model = ?, summary_rubric_version = ?
    WHERE id = ?

  • Append run row to artifact_summary_runs (count per cycle, cost, errors)
  • Bounded batch: 50 artifacts/cycle, every-2h.
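The parse/retry/embed portion of the per-artifact algorithm can be sketched as a pure function with injected callables (render_prompt, call_llm, embed are assumptions standing in for the real routing layer):

```python
import json

STRICT_SUFFIX = "\n\nReturn JSON only, no prose."  # retry suffix from the spec

def summarize_artifact(artifact, render_prompt, call_llm, embed):
    """One pass of the per-artifact algorithm; callable signatures are
    assumed, not defined by this spec."""
    prompt = render_prompt(artifact)
    try:
        fields = json.loads(call_llm(prompt))
    except ValueError:
        # One retry with the stricter suffix, then give up on this artifact.
        try:
            fields = json.loads(call_llm(prompt + STRICT_SUFFIX))
        except ValueError:
            return None  # caller logs to artifact_summary_failures
    try:
        fields["summary_embedding"] = embed(fields["summary"])
    except Exception:
        fields["summary_embedding"] = None  # embedding retried next cycle
    return fields
```

Note the asymmetry: a failed summary skips the artifact, but a failed embedding still persists the summary, matching the failure-mode table below.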

Failure modes:

  Failure                        Response
  LLM rate-limited               Skip remaining items in cycle, log, retry next cycle
  Malformed JSON after retry     Skip artifact, log to artifact_summary_failures
  Embedding model unavailable    Generate summary anyway, embedding = NULL, retry next cycle
  Artifact files missing         Generate summary from metadata + title only, mark confidence < 0.5
  Rubric version bumped mid-run  Existing in-flight items continue at old version; new rubric applies next cycle

Search API

POST /api/artifacts/search
{
  "query": "heatmap microglial gene expression Alzheimer",
  "filters": {
    "artifact_type": ["figure", "notebook"],
    "methods_used": ["differential-expression"],
    "min_quality_score": 0.6,
    "applicable_domains": ["alzheimers"]
  },
  "limit": 25
}

Response:

{
  "results": [
    {
      "id": "figure-abc123",
      "title": "...",
      "summary": "...",
      "similarity": 0.847,
      "key_findings": [...],
      "url": "/artifact/<id>"
    },
    ...
  ]
}

Implementation: cosine similarity over summary_embedding, filtered by
the structured filters. Returns top-N ranked by similarity.

Companion endpoint: GET /api/artifacts/<id>/similar?limit=10 returns
artifacts with embedding similarity > 0.75 to this one.
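The ranking behind the search endpoint can be illustrated with an in-memory sketch; in production the ANN scan runs inside Postgres via the ivfflat index, so this is a model of the semantics only (field handling for min_quality_score is an assumption):

```python
import math

def cosine(a, b):
    # Cosine similarity; pgvector's vector_cosine_ops computes this in-database.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def passes(row, filters):
    for key, wanted in filters.items():
        if key == "min_quality_score":
            if row.get("quality_score", 0) < wanted:
                return False
        else:
            have = row.get(key)
            have = have if isinstance(have, list) else [have]
            if not set(have) & set(wanted):
                return False
    return True

def search(query_vec, rows, filters=None, limit=25):
    """rows: artifact dicts carrying summary_embedding plus facet columns."""
    hits = [r for r in rows if passes(r, filters or {})]
    hits.sort(key=lambda r: cosine(query_vec, r["summary_embedding"]), reverse=True)
    return hits[:limit]
```

The /similar endpoint is the same ranking seeded with an artifact's own embedding and a similarity floor of 0.75.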

"Find similar" UI surface

On every artifact detail page (Phase 1 of the folder migration adds this):
a "Similar artifacts" sidebar. Useful for reuse: a researcher viewing
an AD heatmap immediately sees related heatmaps they could derive from.

Compositional metadata extraction

Some metadata fields are extractable without an LLM:

• Notebook: parse .ipynb cells; extract import statements →
  methods_used candidates (e.g. import scanpy → 'single-cell-analysis')
• Dataset: parse .schema.json → column names + types as features
• Figure: read EXIF/metadata; parse caption if the PNG has an iTXt chunk
• Model: parse the model card if present

Run these before the LLM call; the LLM then receives pre-extracted
hints, which reduces hallucination risk.
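The notebook case above can be sketched as a parser over the .ipynb JSON; the import-to-method mapping here is an assumption for illustration (the real mapping would live in methods_taxonomy, not in code):

```python
import json
import re

# Hypothetical mapping from imported packages to taxonomy terms.
IMPORT_TO_METHOD = {
    "scanpy": "single-cell-analysis",
    "pydeseq2": "differential-expression",
}

def methods_from_notebook(ipynb_json: str):
    """Scan code cells for import statements and return methods_used
    candidates to feed the LLM as pre-extracted hints."""
    nb = json.loads(ipynb_json)
    found = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = "".join(cell.get("source", []))
        for m in re.finditer(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", src, re.M):
            term = IMPORT_TO_METHOD.get(m.group(1))
            if term:
                found.add(term)
    return sorted(found)
```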

Acceptance criteria

☐ All schema migrations applied
☐ artifact_summary_rubric v1 row inserted
☐ methods_taxonomy and data_sources_taxonomy seeded with ≥30 entries each
☐ Recurring driver running every-2h, processing 50/cycle
☐ 90% of artifacts with quality_score ≥ 0.6 have a summary within 14 days
☐ Search API returns relevant results: precision@10 ≥ 0.7 on a 30-query test set
☐ "Similar artifacts" sidebar live on artifact detail pages
☐ Weekly rubric audit job runs; v2 rubric proposed within 4 weeks
☐ Failure rate < 5% (parsing + LLM errors combined)

Dependencies

• quest_artifact_uuid_migration_spec.md Phase 0 (uses artifacts.id; id values are UUIDs for new artifacts)
• pgvector PostgreSQL extension
• quest_llm_routing_spec.md for provider selection

Dependents

• quest_artifact_reuse_provenance_qc_spec.md (uses summary in reuse signals)
• All consumers wanting artifact recommendation / dedup
• quest_paper_replication_starter_spec.md (semantic dedup of replication attempts)

Work Log

2026-04-28 — Spec authored

Bootstrap design. Versioned LLM rubric for summaries; pgvector for
embeddings; structured taxonomies for filterable facets. Recurring
driver pattern (50/cycle, every-2h). Search + similar APIs designed
but not implemented.

Open question: should embeddings be 1536-dim (OpenAI/Anthropic style)
or smaller? Storage at 11K artifacts is trivial either way; pick based
on a retrieval-quality benchmark.
