Rebuild theme <ID>: <name> as a continuous process


Template for "rebuild a retired theme" tasks. Each concrete rebuild
spec references this template rather than duplicating the acceptance
criteria. When you see {{PLACEHOLDER}}, fill it in from the per-theme
spec.

Context

One of the themes in docs/design/retired_scripts_patterns.md must be
rebuilt as a continuous process. The old implementation was a cluster
of ~N one-off scripts that rotted. This task replaces them with a
single, robust, LLM-first recurring process that gets better over time
without operator intervention.

Theme: {{THEME_ID}} — {{THEME_NAME}}
Layer: {{LAYER}}
Spec anchor: docs/design/retired_scripts_patterns.md#{{THEME_ID}}

Read these in order before starting:

  • docs/design/retired_scripts_patterns.md — the "Design principles
    for continuous processes" section. Every principle is load-bearing.
  • The theme's entry in the same doc.
  • CLAUDE.md + AGENTS.md for worktree rules and commit standards.

What you are NOT allowed to do

  • **No hardcoded entity lists, keyword lists, regex patterns, or
    canonical name tables for semantic work.** If you feel the urge to
    write KEYWORDS = ["therapeutic", "clinical", ...], stop and write
    an LLM rubric instead.
  • **No hardcoded table/column names for work that could be
    polymorphic.** Use information_schema to discover shape.
  • No calendar-driven sweeps. Every run must be gap-predicate
    driven. "If nothing is broken, do nothing."
  • No single-use scripts. Your output is a service + recurring
    task + CLI command + MCP tool, not a scripts/one_off_X.py.
  • No raw sqlite3.connect. SciDEX uses PG via
    scidex.core.database.get_db(); Orchestra via orchestra CLI/MCP.
  • **No writes without version stamp + audit log entry + idempotency
    key.** Any write that can't be safely replayed is a bug.
  • No blast radius exceeding 50 rows per run without operator gate.

What you MUST do

    1. LLM rubric, not rules

    The core judgment in this process is {{CORE_JUDGMENT}}. That
    judgment is made by a versioned LLM rubric:

    • Rubric lives in a PG table theme_rubric with columns
    (theme_id, version, prompt, output_schema_json, created_at).
    • Every rubric output row records the rubric_version that produced
    it. Upgrading the rubric is a config change, not a code change.
    • The rubric itself is LLM-writable: a meta-task ("review rubric_v3
    outputs, propose rubric_v4") is how the rubric evolves.

    Wherever you're tempted to reach for a regex, ask yourself: "will
    content three months from now still match this pattern?" If not,
    it's a rubric job.
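    A minimal sketch of the versioned rubric lookup this section implies. The dataclass and in-memory rows below are illustrative stand-ins for the theme_rubric table and the rubric_registry primitive, not actual SciDEX API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rubric:
    theme_id: str
    version: int
    prompt: str
    output_schema_json: str

# Stand-in for the theme_rubric PG table described above.
_RUBRICS = [
    Rubric("AG1", 1, "Rate prose completeness 0-1 ...", '{"type": "object"}'),
    Rubric("AG1", 2, "Rate prose completeness 0-1; penalize filler ...", '{"type": "object"}'),
]

def latest_rubric(theme_id: str) -> Rubric:
    """Load the highest-version rubric for a theme (rubric_registry sketch)."""
    candidates = [r for r in _RUBRICS if r.theme_id == theme_id]
    if not candidates:
        raise KeyError(f"no rubric for theme {theme_id}")
    return max(candidates, key=lambda r: r.version)
```

    Upgrading the rubric is then literally an INSERT of a new row: nothing in code changes.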

    2. Gap-predicate query, not calendar trigger

    Define the predicate for "this row needs work": {{GAP_PREDICATE}}. The process does:

    SELECT <minimal_columns>
    FROM <content_table>
    WHERE <gap_predicate>
    ORDER BY <priority_signal> DESC
    LIMIT <batch_size>

    Where priority_signal is itself a learned quality score, not
    hardcoded "newest first".
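    One way the priority_batch_selector primitive might assemble that query from theme_config values (a hedged sketch: identifier handling here is a crude regex guard; real code should quote identifiers through the PG driver, and the gap predicate is assumed to come from trusted config):

```python
import re

def build_gap_query(table, columns, gap_predicate, priority_signal, batch_size):
    """Assemble the gap-predicate batch query from config values."""
    ident = re.compile(r"^[a-z_][a-z0-9_]*$")
    for name in [table, priority_signal, *columns]:
        if not ident.match(name):  # crude guard; real code quotes via the driver
            raise ValueError(f"bad identifier: {name}")
    return (
        f"SELECT {', '.join(columns)} FROM {table} "
        f"WHERE {gap_predicate} "
        f"ORDER BY {priority_signal} DESC LIMIT {int(batch_size)}"
    )
```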

    3. Bounded batch, idempotent, version-stamped

    Batch size ≤ 50 per run. Writes:

    INSERT INTO <output_table> (..., rubric_version, model, run_id)
    VALUES (...)
    ON CONFLICT (<idempotency_key>) DO UPDATE SET ...
    WHERE EXCLUDED.rubric_version > <output_table>.rubric_version

    run_id is the row from <theme>_runs that produced this output;
    every output row can be traced back to its run.
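    The version-gated upsert semantics can be sketched in-memory (a stand-in for the versioned_upsert primitive; real code issues the ON CONFLICT statement above against PG):

```python
def versioned_upsert(table, key, row):
    """Insert row; on conflict, overwrite only if the new rubric_version is higher."""
    existing = table.get(key)
    if existing is None or row["rubric_version"] > existing["rubric_version"]:
        table[key] = row
        return True    # written
    return False       # stale rubric output: keep the existing row
```

    Replaying an old run is therefore a no-op: every write either creates the row or loses the version comparison.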

    4. Observability contract

    Three tables:

    • <theme>_runs(run_id, started_at, finished_at, items_considered,
    items_processed, items_skipped, items_errored, llm_calls,
    llm_cost_usd, rubric_version, duration_ms, error_summary)
    • <theme>_audit_log(id, run_id, entity_id, before_json, after_json,
    reason, created_at) — one row per state change.
    • <theme>_outcome_feedback(id, output_row_id, outcome_metric,
    outcome_value, observed_at) — how downstream signals judged this
    output. Populated by the outcome feedback loop, not by the process
    itself.
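    A sketch of the run_metrics context-manager shape implied by the <theme>_runs columns (the list `sink` stands in for the table; only a subset of columns is shown):

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def run_metrics(sink, rubric_version):
    """Open a run record, let the caller tally counts, emit the row on exit."""
    run = {
        "run_id": str(uuid.uuid4()),
        "started_at": time.time(),
        "rubric_version": rubric_version,
        "items_processed": 0,
        "items_skipped": 0,
        "items_errored": 0,
        "llm_calls": 0,
        "error_summary": None,
    }
    try:
        yield run
    except Exception as exc:  # the run row survives even on failure
        run["error_summary"] = repr(exc)
        raise
    finally:
        run["duration_ms"] = int((time.time() - run["started_at"]) * 1000)
        sink.append(run)
```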

    5. Graceful failure

    • LLM rate-limited or non-200: back off once, retry on a cheaper
    model, then skip-and-log.
    • External API down: skip-and-log; the gap-predicate will re-select
    the row next cycle.
    • Malformed LLM output: one retry with stricter JSON-mode prompt; if
    still bad, skip-and-log. NEVER write malformed output.
    • Never raise unhandled; every failure is logged + skipped.
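    The "back off, retry cheaper, skip-and-log" ladder can be sketched like this (names are illustrative; real code would also sleep between attempts and write the log line to <theme>_audit_log):

```python
def process_item(item, call_llm, fallback_model, log):
    """Try the primary model, then one cheaper fallback, then skip-and-log.

    Never raises: a failed item is logged and left for the gap predicate
    to re-select next cycle.
    """
    for model in ("primary", fallback_model):
        try:
            return call_llm(item, model)
        except Exception as exc:
            log.append(f"{item}: {model} failed: {exc}")
    return None  # skipped
```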

    6. Three surfaces

    • FastAPI route POST /api/{{LAYER_SLUG}}/{{THEME_SLUG}}/run
    operator-invocable, returns the run_id. Query param ?dry_run=1
    returns what WOULD be processed without writing.
    • Orchestra recurring task created in orchestra.db with cadence
    {{CADENCE}} and description pointing at this spec.
    • MCP tool {{LAYER_SLUG}}__{{THEME_SLUG}}_run so agents can
    invoke when they detect gap conditions themselves.

    All three call the same underlying scidex.{{LAYER_SLUG}}.{{THEME_SLUG}}.run_once() function. Zero code duplication.
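    The zero-duplication shape can be sketched as thin wrappers over one function (run_once's body is elided; the wrapper names and the dry_run query-param mapping are illustrative, not the actual FastAPI/MCP wiring):

```python
def run_once(theme, dry_run=False, batch_size=50):
    """Single shared implementation; every surface delegates here."""
    # ... select batch, judge with rubric, versioned-upsert, record run ...
    return {"theme": theme, "dry_run": dry_run, "batch_size": batch_size}

def api_run(theme, dry_run_param="0"):
    """FastAPI handler body: ?dry_run=1 maps to dry_run=True."""
    return run_once(theme, dry_run=dry_run_param == "1")

def mcp_tool_run(theme):
    """MCP tool entry point: same function, no duplicated logic."""
    return run_once(theme)
```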

    7. Progressive-improvement feedback loop

    A separate recurring task (one per theme, shared pattern) correlates
    this theme's outputs with downstream quality signals:

    • For A1 (KG edge extraction): did extracted edges get confirmed by
    later papers / retracted / scored well by agents?
    • For AG1 (thin-content enrichment): did expanded pages get higher
    quality scores / user engagement / citation counts?
    • For S3 (FTS coverage): did search usage hit the re-indexed content?

    The feedback worker writes <theme>_outcome_feedback rows. A second
    meta-worker proposes weekly rubric/threshold adjustments based on
    these signals, writing to theme_rubric_proposal for operator
    approval.

    **This loop is the difference between "process that gets better over
    time" and "process that stays at day-one quality forever."**
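    The simplest useful aggregation the meta-worker might start from: mean outcome per rubric version, so version upgrades can be compared on downstream signal (a sketch; rows stand in for joined <theme>_outcome_feedback + output rows):

```python
from collections import defaultdict

def mean_outcome_by_rubric(feedback_rows):
    """Average outcome_value per rubric_version that produced the output."""
    totals = defaultdict(lambda: [0.0, 0])  # version -> [sum, count]
    for row in feedback_rows:
        acc = totals[row["rubric_version"]]
        acc[0] += row["outcome_value"]
        acc[1] += 1
    return {v: s / n for v, (s, n) in totals.items()}
```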

    8. Self-calibrating thresholds

    Thresholds (batch size, priority weights, gap-predicate constants)
    live in a PG theme_config(theme_id, key, value, updated_at, reason)
    table. A meta-worker re-evaluates them monthly based on runs table
    metrics. Operators can tune via SQL without code deployment.

    Never write THINNESS_THRESHOLD = 500 as a module constant.
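    Reading a tunable then looks like this (a sketch over in-memory theme_config rows; the default argument exists only for the bootstrap case where a key has never been seeded):

```python
def get_threshold(config_rows, theme_id, key, default):
    """Read a tunable from theme_config; coerce to the default's type."""
    for row in config_rows:
        if row["theme_id"] == theme_id and row["key"] == key:
            return type(default)(row["value"])
    return default  # key never seeded yet
```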

    9. Composition over reimplementation

    Use (or create, once) these shared primitives:

    • scidex.core.external_source_client — throttled HTTP + cache + retry.
    • scidex.core.llm_rubric_judge — rubric+input → structured verdict,
    versioned, cost-tracked.
    • scidex.core.versioned_upsert — PG ON CONFLICT with version stamp.
    • scidex.core.priority_batch_selector — gap-predicate + priority
    → batch.
    • scidex.core.run_metrics — context manager that opens a run row,
    tallies counts, emits the row on exit.
    • scidex.core.rubric_registry — load rubric by theme+version.

    If your theme is the first to need one of these, build it as a
    first-class helper, not inline. Subsequent themes reuse it.

    10. Polymorphism over specialization

    The process should operate on "any table matching shape S", not "the
    hypotheses table specifically". Example: thin-content enrichment should
    take (table, prose_column, priority_column) as config and work over
    hypotheses, wiki pages, experiments, analyses uniformly. One process,
    N content types.
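    The config-over-code idea can be sketched as a shape filter: candidate (table, prose_column, priority_column) triples come from config, and only those whose columns actually exist are processed. Here `columns_by_table` stands in for an information_schema query:

```python
def enrichment_targets(shapes, columns_by_table):
    """Yield (table, prose_column, priority_column) triples whose columns
    actually exist -- shape is discovered, never hardcoded."""
    for table, prose_col, prio_col in shapes:
        cols = columns_by_table.get(table, set())
        if {prose_col, prio_col} <= cols:
            yield table, prose_col, prio_col
```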

    Acceptance criteria

    Reviewer checks before merge:

    ☐ No hardcoded entity / keyword lists or regex-for-semantic checks.
    ☐ No hardcoded column names that could be discovered.
    ☐ Gap-predicate query; no calendar-driven full-table scans.
    ☐ Idempotent upsert with rubric_version stamp.
    ☐ <theme>_runs + <theme>_audit_log tables created, populated.
    ☐ LLM calls go through llm_rubric_judge primitive (versioned).
    ☐ All three surfaces exist: FastAPI, orchestra recurring, MCP.
    ☐ Outcome-feedback plumbing: <theme>_outcome_feedback table
    exists and the feedback worker is registered.
    ☐ Thresholds configurable in theme_config, not module constants.
    ☐ Failure modes degrade gracefully (tested: LLM 429, API timeout,
    malformed LLM output).
    ☐ Batch size ≤ 50 or operator gate in place.
    ☐ Shared primitives used where they exist.
    ☐ Meta-task registered: "review rubric outputs weekly, propose
    rubric_vN+1".
    ☐ PR description explains how the process gets better over time.

    Bootstrapping order within the task

    1. Create the PG tables: <theme>_runs, <theme>_audit_log,
       <theme>_outcome_feedback, theme_rubric (if first theme); seed
       theme_rubric with v1.
    2. Implement scidex.{{LAYER_SLUG}}.{{THEME_SLUG}}.run_once() using
       shared primitives. Dry-run path first.
    3. FastAPI route.
    4. MCP tool registration.
    5. Orchestra recurring task registration (orchestra create ...).
    6. Register the outcome-feedback worker (another orchestra recurring
       task).
    7. Write integration test: dry-run, then real run with batch_size=3
       against staging; verify <theme>_runs row, audit log entries,
       idempotency.
    8. PR with measurements: before-state gap count, after-run gap delta,
       cost per run, wall-clock per run.

    What NOT to deliver

    • A script named {{THEME_SLUG}}.py that does the work. (Deliver a
    service.)
    • A one-shot "fix all rows right now" mode.
    • A rebuild that omits the outcome-feedback loop. (Without the loop,
    the process is day-one-quality forever.)
    • Hardcoded "top 10 entities" or "hypotheses ID list" anywhere.
    • More than ONE new table-backed config source. If you find yourself
    adding a second, consolidate into theme_config.

    References

    • docs/design/retired_scripts_patterns.md — design principles +
    theme entry.
    • docs/planning/specs/rebuild_theme_template_spec.md — this file.
    • Existing shared infra:
    - scidex.core.database (PG dispatcher)
    - scidex.core.db_connect (raw PG)
    - Orchestra CLI: orchestra get --id, orchestra list --search,
    orchestra cost report, orchestra backoff --status --json.

    File: rebuild_theme_template_spec.md
    Modified: 2026-04-25 22:00
    Size: 9.7 KB