[Forge] Deterministic prompt/response cache - same input + model = identical output (done)

Content-addressed llm_response_cache table keyed on sha256(model+messages+tools+temp+seed); deterministic-replay mode.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Forge] Deterministic prompt/response cache — same input + model = identical output (#724), 2026-04-27
Spec File

Effort: thorough

Goal

Replaying a debate today re-calls the LLM with temperature > 0 and gets a
different response, so "rerun this artifact" can never be byte-identical when
the chain includes any LLM step. The deterministic-replay sandbox handles
code; this spec handles the LLM. Build a content-addressed prompt/response
cache keyed on sha256(model + system_prompt + messages + tool_defs +
temperature + seed) — when the cache hits, the exact prior response is
returned without calling the upstream provider; when it misses, the call
proceeds and the response is stored.
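
As a rough sketch of the keying scheme (the exact helper signature is pinned in the Acceptance Criteria below; the normalisation details here are assumptions), the key is just a SHA-256 over a canonical JSON rendering of the request:

import hashlib
import json

def compute_cache_key(provider, model, messages, tools, temperature, seed):
    """Sketch: content-address a request by hashing a canonical JSON form.

    Canonical = sorted dict keys, no insignificant whitespace, so the key is
    insensitive to dict key order but sensitive to message order and content.
    """
    request = {
        "provider": provider,
        "model": model,
        "messages": messages,
        "tools": tools,
        "temperature": temperature,
        "seed": seed,
    }
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()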

Acceptance Criteria

Cache table
migrations/<YYYYMMDD>_create_llm_response_cache.sql:

CREATE TABLE llm_response_cache (
    cache_key      text PRIMARY KEY,    -- sha256(...)
    provider       text NOT NULL,
    model          text NOT NULL,
    request_blob   jsonb NOT NULL,      -- full request, normalised
    response_blob  jsonb NOT NULL,      -- raw provider response
    usage          jsonb,               -- prompt_tokens etc.
    cached_at      timestamptz DEFAULT now(),
    last_hit_at    timestamptz,
    hit_count      int DEFAULT 0
);
CREATE INDEX idx_llm_cache_model ON llm_response_cache(model);
CREATE INDEX idx_llm_cache_cached_at ON llm_response_cache(cached_at);

Wrapper scidex/core/llm_cache.py:
- compute_cache_key(provider, model, messages, tools, temperature, seed) -> str
  — canonical JSON normalisation, stable key.
- get(key) -> dict | None and put(key, request, response, usage).
- wrap_llm_call(fn) decorator: looks up the cache; on hit returns the stored
  response and updates hit_count + last_hit_at; on miss calls the wrapped
  function and stores the response.
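
A minimal sketch of the get/put/decorator flow, assuming a psycopg2 connection and the compute_cache_key helper sketched under Goal; the real module's session handling, read/write mode checks, and NOCACHE guard are omitted here.

import functools
import os

import psycopg2
from psycopg2.extras import Json

# Placeholder connection: the real module would reuse scidex's DB layer.
conn = psycopg2.connect(os.environ["DATABASE_URL"])

def get(key):
    """Return the cached response for `key` (bumping hit stats), or None on a miss."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """UPDATE llm_response_cache
                  SET hit_count = hit_count + 1, last_hit_at = now()
                WHERE cache_key = %s
            RETURNING response_blob""",
            (key,),
        )
        row = cur.fetchone()
    return row[0] if row else None

def put(key, request, response, usage=None):
    """Store the response for `key`; the first write wins on a key collision."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """INSERT INTO llm_response_cache
                   (cache_key, provider, model, request_blob, response_blob, usage)
               VALUES (%s, %s, %s, %s, %s, %s)
               ON CONFLICT (cache_key) DO NOTHING""",
            (key, request["provider"], request["model"],
             Json(request), Json(response),
             Json(usage) if usage is not None else None),
        )

def wrap_llm_call(fn):
    """Decorator: serve a hit without calling upstream; store the response on a miss."""
    @functools.wraps(fn)
    def wrapper(*, provider, model, messages, tools=None, temperature=0.0,
                seed=None, **kwargs):
        # compute_cache_key is the canonical-JSON / SHA-256 helper sketched under Goal.
        key = compute_cache_key(provider, model, messages, tools, temperature, seed)
        cached = get(key)
        if cached is not None:
            return cached  # cache hit: the upstream provider is never called
        response = fn(provider=provider, model=model, messages=messages,
                      tools=tools, temperature=temperature, seed=seed, **kwargs)
        request = {"provider": provider, "model": model, "messages": messages,
                   "tools": tools, "temperature": temperature, "seed": seed}
        put(key, request, response, usage=response.get("usage"))
        return response
    return wrapper
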
Integration into scidex/core/llm.py (the LiteLLM facade): the existing
complete() / stream() paths gain a cache: Literal['off','read','write','both']
kwarg defaulting to env SCIDEX_LLM_CACHE_MODE (default "off" in prod, "both"
in deterministic-replay mode).
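
A hedged illustration of how the kwarg could resolve; complete()'s actual signature lives in scidex/core/llm.py and is not reproduced here.

import os
from typing import Literal, Optional

CacheMode = Literal["off", "read", "write", "both"]

def _resolve_cache_mode(explicit: Optional[CacheMode]) -> CacheMode:
    """An explicit kwarg wins; otherwise fall back to SCIDEX_LLM_CACHE_MODE (default 'off')."""
    return explicit or os.getenv("SCIDEX_LLM_CACHE_MODE", "off")

# Illustrative call sites (complete() is the existing facade entry point):
#   complete(model="gpt-4o", messages=msgs)                 # honours the env default
#   complete(model="gpt-4o", messages=msgs, cache="both")   # force read+write for this call
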
Replay mode wiring. The deterministic-mode wrapper in
forge/runtime.py (from q-sand-deterministic-replay) flips
the env var to both, so replays automatically hit cache.
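
The wiring itself is tiny; a sketch of what the flip could look like inside forge/runtime.py, where base_env and deterministic_replay are hypothetical names for the sandbox env dict and the replay flag from q-sand-deterministic-replay.

# Sketch of the deterministic-mode env setup in forge/runtime.py.
run_env = dict(base_env)                       # hypothetical copy of the sandbox environment
if deterministic_replay:                       # hypothetical flag from q-sand-deterministic-replay
    run_env["SCIDEX_LLM_CACHE_MODE"] = "both"  # replays read and write the LLM cache
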
Cost-control eviction. Nightly job
scripts/llm_cache_evict.py removes entries older than 90
days unless hit_count > 0 (never-replayed entries are
free to drop). Configurable via env.
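
A sketch of the eviction pass, assuming the same psycopg2 connection source; SCIDEX_LLM_CACHE_TTL_DAYS is an illustrative name for the env override.

#!/usr/bin/env python
"""Nightly LLM cache eviction (sketch): drop never-hit entries past their TTL."""
import os

import psycopg2

TTL_DAYS = int(os.getenv("SCIDEX_LLM_CACHE_TTL_DAYS", "90"))  # assumed env knob

def evict():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed connection source
    with conn, conn.cursor() as cur:
        cur.execute(
            """DELETE FROM llm_response_cache
                WHERE hit_count = 0
                  AND cached_at < now() - (%s * interval '1 day')""",
            (TTL_DAYS,),
        )
        print(f"evicted {cur.rowcount} entries older than {TTL_DAYS} days")

if __name__ == "__main__":
    evict()
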
Stats endpoint. GET /senate/llm-cache-stats returns
total entries, hit-rate over 7d, top-10 most-hit prompts,
total bytes stored, estimated $ saved (using the existing
cost model in scidex/forge/cost_budget.py).
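
A sketch of the aggregates behind the endpoint; the seven-day hit-rate shown is a crude proxy (the table only stores hit_count and last_hit_at), and the $-saved figure would come from the cost model in scidex/forge/cost_budget.py.

def llm_cache_stats(conn):
    """Sketch of the aggregates behind GET /senate/llm-cache-stats."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT count(*)                                        AS total_entries,
                   coalesce(sum(pg_column_size(request_blob)
                              + pg_column_size(response_blob)), 0) AS approx_bytes,
                   count(*) FILTER (WHERE last_hit_at > now() - interval '7 days')
                                                                   AS entries_hit_7d
              FROM llm_response_cache
        """)
        total, approx_bytes, hit_7d = cur.fetchone()
        cur.execute("""
            SELECT cache_key, model, hit_count
              FROM llm_response_cache
             ORDER BY hit_count DESC
             LIMIT 10
        """)
        top10 = cur.fetchall()
    return {
        "total_entries": total,
        "approx_bytes": approx_bytes,
        "hit_rate_7d": hit_7d / total if total else 0.0,  # crude proxy; the real calc may differ
        "top_10_most_hit": top10,
        # "estimated_usd_saved" would be derived via scidex/forge/cost_budget.py
    }
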
Privacy/security. Block caching of any request whose messages contain the
SCIDEX_NOCACHE marker (for sensitive experiments); document the marker in the
AGENTS.md Skills section.
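
A sketch of the guard the wrapper can apply before any cache read or write; the marker string comes from this spec, while the scan over message content assumes chat-style dicts.

NOCACHE_MARKER = "SCIDEX_NOCACHE"

def _is_nocache(messages) -> bool:
    """True if any message's content carries the privacy marker."""
    for msg in messages or []:
        content = msg.get("content") if isinstance(msg, dict) else msg
        if isinstance(content, str) and NOCACHE_MARKER in content:
            return True
    return False

# In wrap_llm_call: if _is_nocache(messages), skip get()/put() entirely and call upstream.
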
Tests tests/test_llm_cache.py:
- Same input twice → second call cache hit, no upstream call.
- Different temperature → different key → cache miss.
- cache='off' bypasses the cache entirely.
- Eviction respects hit_count.
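
A sketch of the key-normalisation tests; imports assume the module layout in this spec, and the hit/no-upstream-call and eviction cases would additionally need a stubbed upstream and a test database, omitted here.

from scidex.core.llm_cache import compute_cache_key

MSGS = [{"role": "user", "content": "hello"}]

def test_same_input_same_key():
    k1 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.7, 42)
    k2 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.7, 42)
    assert k1 == k2

def test_different_temperature_different_key():
    k1 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.2, 42)
    k2 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.9, 42)
    assert k1 != k2

def test_key_insensitive_to_dict_key_order():
    a = [{"role": "user", "content": "hi"}]
    b = [{"content": "hi", "role": "user"}]  # same message, different dict key order
    assert compute_cache_key("openai", "gpt-4o", a, None, 0.0, 1) == \
           compute_cache_key("openai", "gpt-4o", b, None, 0.0, 1)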

Approach

  • Cache key + table + simple get/put first; tests pin the normalisation
    (key insensitive to dict key order, sensitive to message order/content).
  • wrap_llm_call decorator; smoke against the scidex/core/llm.py completer.
  • Integration with deterministic mode is a 5-LoC env-var flip in the wave-1
    sandbox spec — keep this spec author's footprint inside forge/runtime.py
    minimal.
  • Stats endpoint reuses the /senate/quality-dashboard shell.
  • Eviction is one cron job.
Dependencies

• scidex/core/llm.py — LiteLLM facade.
• q-sand-deterministic-replay — env-var flag flip.

Dependents

• q-repro-rerun-artifact — gives byte-identical replay over the LLM portion of
  every chain.
• Cost: replays no longer pay LLM tokens, so they become materially cheaper.

Work Log

2026-04-27 — Implementation

Completed all acceptance criteria:

• Migration 132 (migrations/132_add_llm_response_cache.py): CREATE TABLE llm_response_cache with all spec columns + indexes. Applied successfully.
• scidex/core/llm_cache.py: compute_cache_key (SHA-256, canonical JSON), get (updates hit_count), put (upsert), wrap_llm_call decorator with SCIDEX_NOCACHE privacy guard, mode helpers, _cache_mode() reading the env var.
• scidex/core/llm.py: complete() and complete_with_tools() each gain a cache: Literal["off","read","write","both"] = None kwarg defaulting to the SCIDEX_LLM_CACHE_MODE env var.
• forge/runtime.py: SCIDEX_LLM_CACHE_MODE = "both" set in the deterministic run_env.
• scripts/llm_cache_evict.py: nightly eviction, respects hit_count > 0, configurable TTL via env.
• api_routes/senate.py: GET /api/senate/llm-cache-stats returns total entries, hit-rate over 7d, top-10, bytes stored, est. $ saved.
• AGENTS.md: documented the SCIDEX_NOCACHE marker in the Skills section.
• tests/test_llm_cache.py: 16 tests, all passing. Covers: same input → same key, different model/temp/msgs → different key, dict key order normalisation, NOCACHE bypass, cache=off bypass, hit count increment.

Files created (6 new):

• migrations/132_add_llm_response_cache.py
• scidex/core/llm_cache.py
• scripts/llm_cache_evict.py
• tests/test_llm_cache.py

Files modified (4):

• scidex/core/llm.py — added cache kwarg to complete() and complete_with_tools()
• forge/runtime.py — added SCIDEX_LLM_CACHE_MODE = "both" for deterministic runs
• api_routes/senate.py — added /api/senate/llm-cache-stats endpoint
• AGENTS.md — documented the SCIDEX_NOCACHE privacy marker
