Rebuild theme <ID>: <name> as a continuous process


Template for "rebuild a retired theme" tasks. Each concrete rebuild
spec references this template rather than duplicating the acceptance
criteria. When you see {{PLACEHOLDER}}, fill it in from the per-theme
spec.

Context

One of the themes in docs/design/retired_scripts_patterns.md must be
rebuilt as a continuous process. The old implementation was a cluster
of ~N one-off scripts that rotted. This task replaces them with a
single, robust, LLM-first recurring process that gets better over time
without operator intervention.

Theme: {{THEME_ID}} — {{THEME_NAME}}
Layer: {{LAYER}}
Spec anchor: docs/design/retired_scripts_patterns.md#{{THEME_ID}}

Read these in order before starting:

  • docs/design/retired_scripts_patterns.md — the "Design principles
    for continuous processes" section. Every principle is load-bearing.
  • The theme's entry in the same doc.
  • CLAUDE.md + AGENTS.md for worktree rules and commit standards.

What you are NOT allowed to do

  • **No hardcoded entity lists, keyword lists, regex patterns, or
    canonical name tables for semantic work.** If you feel the urge to
    write KEYWORDS = ["therapeutic", "clinical", ...], stop and write
    an LLM rubric instead.
  • **No hardcoded table/column names for work that could be
    polymorphic.** Use information_schema to discover shape.
  • No calendar-driven sweeps. Every run must be gap-predicate
    driven. "If nothing is broken, do nothing."
  • No single-use scripts. Your output is a service + recurring
    task + CLI command + MCP tool, not a scripts/one_off_X.py.
  • No raw sqlite3.connect. SciDEX uses PG via
    scidex.core.database.get_db(); Orchestra via orchestra CLI/MCP.
  • **No writes without version stamp + audit log entry + idempotency
    key.** Any write that can't be safely replayed is a bug.
  • No blast radius exceeding 50 rows per run without operator gate.

What you MUST do

    1. LLM rubric, not rules

    The core judgment in this process is {{CORE_JUDGMENT}}. That
    judgment is made by a versioned LLM rubric:

    • Rubric lives in a PG table theme_rubric with columns
    (theme_id, version, prompt, output_schema_json, created_at).
    • Every rubric output row records the rubric_version that produced
    it. Upgrading the rubric is a config change, not a code change.
    • The rubric itself is LLM-writable: a meta-task ("review rubric_v3
    outputs, propose rubric_v4") is how the rubric evolves.

    Wherever you're tempted to reach for a regex, ask yourself: "will
    content three months from now still match this pattern?" If not,
    it's a rubric job.
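    A minimal sketch of the versioned rubric lookup this section implies. The dataclass and in-memory rows below are illustrative stand-ins for the theme_rubric table and the rubric_registry primitive, not actual SciDEX API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rubric:
    theme_id: str
    version: int
    prompt: str
    output_schema_json: str

# Stand-in for the theme_rubric PG table described above.
_RUBRICS = [
    Rubric("AG1", 1, "Rate prose completeness 0-1 ...", '{"type": "object"}'),
    Rubric("AG1", 2, "Rate prose completeness 0-1; penalize filler ...", '{"type": "object"}'),
]

def latest_rubric(theme_id: str) -> Rubric:
    """Load the highest-version rubric for a theme (rubric_registry sketch)."""
    candidates = [r for r in _RUBRICS if r.theme_id == theme_id]
    if not candidates:
        raise KeyError(f"no rubric for theme {theme_id}")
    return max(candidates, key=lambda r: r.version)
```

    Upgrading the rubric is then literally an INSERT of a new row: nothing in code changes.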

    2. Gap-predicate query, not calendar trigger

    Define the predicate for "this row needs work": {{GAP_PREDICATE}}. The process does:

    SELECT <minimal_columns>
    FROM <content_table>
    WHERE <gap_predicate>
    ORDER BY <priority_signal> DESC
    LIMIT <batch_size>

    Where priority_signal is itself a learned quality score, not
    hardcoded "newest first".
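    One way the priority_batch_selector primitive might assemble that query from theme_config values (a hedged sketch: identifier handling here is a crude regex guard; real code should quote identifiers through the PG driver, and the gap predicate is assumed to come from trusted config):

```python
import re

def build_gap_query(table, columns, gap_predicate, priority_signal, batch_size):
    """Assemble the gap-predicate batch query from config values."""
    ident = re.compile(r"^[a-z_][a-z0-9_]*$")
    for name in [table, priority_signal, *columns]:
        if not ident.match(name):  # crude guard; real code quotes via the driver
            raise ValueError(f"bad identifier: {name}")
    return (
        f"SELECT {', '.join(columns)} FROM {table} "
        f"WHERE {gap_predicate} "
        f"ORDER BY {priority_signal} DESC LIMIT {int(batch_size)}"
    )
```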

    3. Bounded batch, idempotent, version-stamped

    Batch size ≤ 50 per run. Writes:

    INSERT INTO <output_table> (..., rubric_version, model, run_id)
    VALUES (...)
    ON CONFLICT (<idempotency_key>) DO UPDATE SET ...
    WHERE EXCLUDED.rubric_version > <output_table>.rubric_version

    run_id is the row from <theme>_runs that produced this output;
    every output row can be traced back to its run.
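    The version-gated upsert semantics can be sketched in-memory (a stand-in for the versioned_upsert primitive; real code issues the ON CONFLICT statement above against PG):

```python
def versioned_upsert(table, key, row):
    """Insert row; on conflict, overwrite only if the new rubric_version is higher."""
    existing = table.get(key)
    if existing is None or row["rubric_version"] > existing["rubric_version"]:
        table[key] = row
        return True    # written
    return False       # stale rubric output: keep the existing row
```

    Replaying an old run is therefore a no-op: every write either creates the row or loses the version comparison.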

    4. Observability contract

    Three tables:

    • <theme>_runs(run_id, started_at, finished_at, items_considered,
    items_processed, items_skipped, items_errored, llm_calls,
    llm_cost_usd, rubric_version, duration_ms, error_summary)
    • <theme>_audit_log(id, run_id, entity_id, before_json, after_json,
    reason, created_at) — one row per state change.
    • <theme>_outcome_feedback(id, output_row_id, outcome_metric,
    outcome_value, observed_at) — how downstream signals judged this
    output. Populated by the outcome feedback loop, not by the process
    itself.
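    A sketch of the run_metrics context-manager shape implied by the <theme>_runs columns (the list `sink` stands in for the table; only a subset of columns is shown):

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def run_metrics(sink, rubric_version):
    """Open a run record, let the caller tally counts, emit the row on exit."""
    run = {
        "run_id": str(uuid.uuid4()),
        "started_at": time.time(),
        "rubric_version": rubric_version,
        "items_processed": 0,
        "items_skipped": 0,
        "items_errored": 0,
        "llm_calls": 0,
        "error_summary": None,
    }
    try:
        yield run
    except Exception as exc:  # the run row survives even on failure
        run["error_summary"] = repr(exc)
        raise
    finally:
        run["duration_ms"] = int((time.time() - run["started_at"]) * 1000)
        sink.append(run)
```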

    5. Graceful failure

    • LLM rate-limited or non-200: back off once, retry on a cheaper
    model, then skip-and-log.
    • External API down: skip-and-log; the gap-predicate will re-select
    the row next cycle.
    • Malformed LLM output: one retry with stricter JSON-mode prompt; if
    still bad, skip-and-log. NEVER write malformed output.
    • Never raise unhandled; every failure is logged + skipped.
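    The "back off, retry cheaper, skip-and-log" ladder can be sketched like this (names are illustrative; real code would also sleep between attempts and write the log line to <theme>_audit_log):

```python
def process_item(item, call_llm, fallback_model, log):
    """Try the primary model, then one cheaper fallback, then skip-and-log.

    Never raises: a failed item is logged and left for the gap predicate
    to re-select next cycle.
    """
    for model in ("primary", fallback_model):
        try:
            return call_llm(item, model)
        except Exception as exc:
            log.append(f"{item}: {model} failed: {exc}")
    return None  # skipped
```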

    6. Three surfaces

    • FastAPI route POST /api/{{LAYER_SLUG}}/{{THEME_SLUG}}/run
    operator-invocable, returns the run_id. Query param ?dry_run=1
    returns what WOULD be processed without writing.
    • Orchestra recurring task created in orchestra.db with cadence
    {{CADENCE}} and description pointing at this spec.
    • MCP tool {{LAYER_SLUG}}__{{THEME_SLUG}}_run so agents can
    invoke when they detect gap conditions themselves.

    All three call the same underlying scidex.{{LAYER_SLUG}}.{{THEME_SLUG}}.run_once() function. Zero code duplication.
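    The zero-duplication shape can be sketched as thin wrappers over one function (run_once's body is elided; the wrapper names and the dry_run query-param mapping are illustrative, not the actual FastAPI/MCP wiring):

```python
def run_once(theme, dry_run=False, batch_size=50):
    """Single shared implementation; every surface delegates here."""
    # ... select batch, judge with rubric, versioned-upsert, record run ...
    return {"theme": theme, "dry_run": dry_run, "batch_size": batch_size}

def api_run(theme, dry_run_param="0"):
    """FastAPI handler body: ?dry_run=1 maps to dry_run=True."""
    return run_once(theme, dry_run=dry_run_param == "1")

def mcp_tool_run(theme):
    """MCP tool entry point: same function, no duplicated logic."""
    return run_once(theme)
```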

    7. Progressive-improvement feedback loop

    A separate recurring task (one per theme, shared pattern) correlates
    this theme's outputs with downstream quality signals:

    • For A1 (KG edge extraction): did extracted edges get confirmed by
    later papers / retracted / scored well by agents?
    • For AG1 (thin-content enrichment): did expanded pages get higher
    quality scores / user engagement / citation counts?
    • For S3 (FTS coverage): did search usage hit the re-indexed content?

    The feedback worker writes <theme>_outcome_feedback rows. A second
    meta-worker proposes weekly rubric/threshold adjustments based on
    these signals, writing to theme_rubric_proposal for operator
    approval.

    **This loop is the difference between "process that gets better over
    time" and "process that stays at day-one quality forever."**
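    The simplest useful aggregation the meta-worker might start from: mean outcome per rubric version, so version upgrades can be compared on downstream signal (a sketch; rows stand in for joined <theme>_outcome_feedback + output rows):

```python
from collections import defaultdict

def mean_outcome_by_rubric(feedback_rows):
    """Average outcome_value per rubric_version that produced the output."""
    totals = defaultdict(lambda: [0.0, 0])  # version -> [sum, count]
    for row in feedback_rows:
        acc = totals[row["rubric_version"]]
        acc[0] += row["outcome_value"]
        acc[1] += 1
    return {v: s / n for v, (s, n) in totals.items()}
```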

    8. Self-calibrating thresholds

    Thresholds (batch size, priority weights, gap-predicate constants)
    live in a PG theme_config(theme_id, key, value, updated_at, reason)
    table. A meta-worker re-evaluates them monthly based on runs table
    metrics. Operators can tune via SQL without code deployment.

    Never write THINNESS_THRESHOLD = 500 as a module constant.
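    Reading a tunable then looks like this (a sketch over in-memory theme_config rows; the default argument exists only for the bootstrap case where a key has never been seeded):

```python
def get_threshold(config_rows, theme_id, key, default):
    """Read a tunable from theme_config; coerce to the default's type."""
    for row in config_rows:
        if row["theme_id"] == theme_id and row["key"] == key:
            return type(default)(row["value"])
    return default  # key never seeded yet
```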

    9. Composition over reimplementation

    Use (or create, once) these shared primitives:

    • scidex.core.external_source_client — throttled HTTP + cache + retry.
    • scidex.core.llm_rubric_judge — rubric+input → structured verdict,
    versioned, cost-tracked.
    • scidex.core.versioned_upsert — PG ON CONFLICT with version stamp.
    • scidex.core.priority_batch_selector — gap-predicate + priority
    → batch.
    • scidex.core.run_metrics — context manager that opens a run row,
    tallies counts, emits the row on exit.
    • scidex.core.rubric_registry — load rubric by theme+version.

    If your theme is the first to need one of these, build it as a
    first-class helper, not inline. Subsequent themes reuse it.

    10. Polymorphism over specialization

    The process should operate on "any table matching shape S", not "the
    hypotheses table specifically". Example: thin-content enrichment should
    take (table, prose_column, priority_column) as config and work over
    hypotheses, wiki pages, experiments, analyses uniformly. One process,
    N content types.
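    The config-over-code idea can be sketched as a shape filter: candidate (table, prose_column, priority_column) triples come from config, and only those whose columns actually exist are processed. Here `columns_by_table` stands in for an information_schema query:

```python
def enrichment_targets(shapes, columns_by_table):
    """Yield (table, prose_column, priority_column) triples whose columns
    actually exist -- shape is discovered, never hardcoded."""
    for table, prose_col, prio_col in shapes:
        cols = columns_by_table.get(table, set())
        if {prose_col, prio_col} <= cols:
            yield table, prose_col, prio_col
```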

    Acceptance criteria

    Reviewer checks before merge:

    ☐ No hardcoded entity / keyword lists or regex-for-semantic checks.
    ☐ No hardcoded column names that could be discovered.
    ☐ Gap-predicate query; no calendar-driven full-table scans.
    ☐ Idempotent upsert with rubric_version stamp.
    ☐ <theme>_runs + <theme>_audit_log tables created, populated.
    ☐ LLM calls go through llm_rubric_judge primitive (versioned).
    ☐ All three surfaces exist: FastAPI, orchestra recurring, MCP.
    ☐ Outcome-feedback plumbing: <theme>_outcome_feedback table
    exists and the feedback worker is registered.
    ☐ Thresholds configurable in theme_config, not module constants.
    ☐ Failure modes degrade gracefully (tested: LLM 429, API timeout,
    malformed LLM output).
    ☐ Batch size ≤ 50 or operator gate in place.
    ☐ Shared primitives used where they exist.
    ☐ Meta-task registered: "review rubric outputs weekly, propose
    rubric_vN+1".
    ☐ PR description explains how the process gets better over time.

    Bootstrapping order within the task

    1. Create the PG tables: <theme>_runs, <theme>_audit_log,
       <theme>_outcome_feedback, theme_rubric (if first theme); seed
       theme_rubric with v1.
    2. Implement scidex.{{LAYER_SLUG}}.{{THEME_SLUG}}.run_once() using
       shared primitives. Dry-run path first.
    3. FastAPI route.
    4. MCP tool registration.
    5. Orchestra recurring task registration (orchestra create ...).
    6. Register the outcome-feedback worker (another orchestra recurring
       task).
    7. Write integration test: dry-run, then real run with batch_size=3
       against staging; verify <theme>_runs row, audit log entries,
       idempotency.
    8. PR with measurements: before-state gap count, after-run gap delta,
       cost per run, wall-clock per run.

    What NOT to deliver

    • A script named {{THEME_SLUG}}.py that does the work. (Deliver a
    service.)
    • A one-shot "fix all rows right now" mode.
    • A rebuild that omits the outcome-feedback loop. (Without the loop,
    the process is day-one-quality forever.)
    • Hardcoded "top 10 entities" or "hypotheses ID list" anywhere.
    • More than ONE new table-backed config source. If you find yourself
    adding a second, consolidate into theme_config.

    References

    • docs/design/retired_scripts_patterns.md — design principles +
    theme entry.
    • docs/planning/specs/rebuild_theme_template_spec.md — this file.
    • Existing shared infra:
    - scidex.core.database (PG dispatcher)
    - scidex.core.db_connect (raw PG)
    - Orchestra CLI: orchestra get --id, orchestra list --search,
    orchestra cost report, orchestra backoff --status --json.

    File: rebuild_theme_template_spec.md
    Modified: 2026-04-25 22:00
    Size: 9.7 KB