[Forge] Deterministic prompt/response cache - same input + model = identical output (done)

Content-addressed llm_response_cache table keyed on sha256(model+messages+tools+temp+seed); deterministic-replay mode.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Forge] Deterministic prompt/response cache — same input + model = identical output (#724), 2026-04-27
Spec File

Effort: thorough

Goal

Replaying a debate today re-calls the LLM with temperature > 0 and gets a
different response, so "rerun this artifact" can never be byte-identical when
the chain includes any LLM step. The deterministic-replay sandbox handles
code; this spec handles the LLM. Build a content-addressed prompt/response
cache keyed on sha256(model + system_prompt + messages + tool_defs +
temperature + seed) — when the cache hits, the exact prior response is
returned without calling the upstream provider; when it misses, the call
proceeds and the response is stored.
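
As a rough sketch of the keying scheme (the exact helper signature is pinned in the Acceptance Criteria below; the normalisation details here are assumptions), the key is just a SHA-256 over a canonical JSON rendering of the request:

import hashlib
import json

def compute_cache_key(provider, model, messages, tools, temperature, seed):
    """Sketch: content-address a request by hashing a canonical JSON form.

    Canonical = sorted dict keys, no insignificant whitespace, so the key is
    insensitive to dict key order but sensitive to message order and content.
    """
    request = {
        "provider": provider,
        "model": model,
        "messages": messages,
        "tools": tools,
        "temperature": temperature,
        "seed": seed,
    }
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()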

Acceptance Criteria

Cache table
migrations/<YYYYMMDD>_create_llm_response_cache.sql:

CREATE TABLE llm_response_cache (
    cache_key      text PRIMARY KEY,    -- sha256(...)
    provider       text NOT NULL,
    model          text NOT NULL,
    request_blob   jsonb NOT NULL,      -- full request, normalised
    response_blob  jsonb NOT NULL,      -- raw provider response
    usage          jsonb,               -- prompt_tokens etc.
    cached_at      timestamptz DEFAULT now(),
    last_hit_at    timestamptz,
    hit_count      int DEFAULT 0
);
CREATE INDEX idx_llm_cache_model ON llm_response_cache(model);
CREATE INDEX idx_llm_cache_cached_at ON llm_response_cache(cached_at);

Wrapper scidex/core/llm_cache.py:
- compute_cache_key(provider, model, messages, tools, temperature, seed) -> str
  — canonical JSON normalisation, stable key.
- get(key) -> dict | None and put(key, request, response, usage).
- wrap_llm_call(fn) decorator: looks up the cache; on hit returns the stored
  response and updates hit_count + last_hit_at; on miss calls the wrapped
  function and stores the response.
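
A minimal sketch of the get/put/decorator flow, assuming a psycopg2 connection and the compute_cache_key helper sketched under Goal; the real module's session handling, read/write mode checks, and NOCACHE guard are omitted here.

import functools
import os

import psycopg2
from psycopg2.extras import Json

# Placeholder connection: the real module would reuse scidex's DB layer.
conn = psycopg2.connect(os.environ["DATABASE_URL"])

def get(key):
    """Return the cached response for `key` (bumping hit stats), or None on a miss."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """UPDATE llm_response_cache
                  SET hit_count = hit_count + 1, last_hit_at = now()
                WHERE cache_key = %s
            RETURNING response_blob""",
            (key,),
        )
        row = cur.fetchone()
    return row[0] if row else None

def put(key, request, response, usage=None):
    """Store the response for `key`; the first write wins on a key collision."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """INSERT INTO llm_response_cache
                   (cache_key, provider, model, request_blob, response_blob, usage)
               VALUES (%s, %s, %s, %s, %s, %s)
               ON CONFLICT (cache_key) DO NOTHING""",
            (key, request["provider"], request["model"],
             Json(request), Json(response),
             Json(usage) if usage is not None else None),
        )

def wrap_llm_call(fn):
    """Decorator: serve a hit without calling upstream; store the response on a miss."""
    @functools.wraps(fn)
    def wrapper(*, provider, model, messages, tools=None, temperature=0.0,
                seed=None, **kwargs):
        # compute_cache_key is the canonical-JSON / SHA-256 helper sketched under Goal.
        key = compute_cache_key(provider, model, messages, tools, temperature, seed)
        cached = get(key)
        if cached is not None:
            return cached  # cache hit: the upstream provider is never called
        response = fn(provider=provider, model=model, messages=messages,
                      tools=tools, temperature=temperature, seed=seed, **kwargs)
        request = {"provider": provider, "model": model, "messages": messages,
                   "tools": tools, "temperature": temperature, "seed": seed}
        put(key, request, response, usage=response.get("usage"))
        return response
    return wrapper
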
Integration into scidex/core/llm.py (the LiteLLM facade): the existing
complete() / stream() paths gain a cache: Literal['off','read','write','both']
kwarg defaulting to env SCIDEX_LLM_CACHE_MODE (default "off" in prod, "both"
in deterministic-replay mode).
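
A hedged illustration of how the kwarg could resolve; complete()'s actual signature lives in scidex/core/llm.py and is not reproduced here.

import os
from typing import Literal, Optional

CacheMode = Literal["off", "read", "write", "both"]

def _resolve_cache_mode(explicit: Optional[CacheMode]) -> CacheMode:
    """An explicit kwarg wins; otherwise fall back to SCIDEX_LLM_CACHE_MODE (default 'off')."""
    return explicit or os.getenv("SCIDEX_LLM_CACHE_MODE", "off")

# Illustrative call sites (complete() is the existing facade entry point):
#   complete(model="gpt-4o", messages=msgs)                 # honours the env default
#   complete(model="gpt-4o", messages=msgs, cache="both")   # force read+write for this call
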
Replay mode wiring. The deterministic-mode wrapper in
forge/runtime.py (from q-sand-deterministic-replay) flips
the env var to both, so replays automatically hit cache.
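
The wiring itself is tiny; a sketch of what the flip could look like inside forge/runtime.py, where base_env and deterministic_replay are hypothetical names for the sandbox env dict and the replay flag from q-sand-deterministic-replay.

# Sketch of the deterministic-mode env setup in forge/runtime.py.
run_env = dict(base_env)                       # hypothetical copy of the sandbox environment
if deterministic_replay:                       # hypothetical flag from q-sand-deterministic-replay
    run_env["SCIDEX_LLM_CACHE_MODE"] = "both"  # replays read and write the LLM cache
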
Cost-control eviction. Nightly job
scripts/llm_cache_evict.py removes entries older than 90
days unless hit_count > 0 (never-replayed entries are
free to drop). Configurable via env.
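
A sketch of the eviction pass, assuming the same psycopg2 connection source; SCIDEX_LLM_CACHE_TTL_DAYS is an illustrative name for the env override.

#!/usr/bin/env python
"""Nightly LLM cache eviction (sketch): drop never-hit entries past their TTL."""
import os

import psycopg2

TTL_DAYS = int(os.getenv("SCIDEX_LLM_CACHE_TTL_DAYS", "90"))  # assumed env knob

def evict():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed connection source
    with conn, conn.cursor() as cur:
        cur.execute(
            """DELETE FROM llm_response_cache
                WHERE hit_count = 0
                  AND cached_at < now() - (%s * interval '1 day')""",
            (TTL_DAYS,),
        )
        print(f"evicted {cur.rowcount} entries older than {TTL_DAYS} days")

if __name__ == "__main__":
    evict()
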
Stats endpoint. GET /senate/llm-cache-stats returns
total entries, hit-rate over 7d, top-10 most-hit prompts,
total bytes stored, estimated $ saved (using the existing
cost model in scidex/forge/cost_budget.py).
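
A sketch of the aggregates behind the endpoint; the seven-day hit-rate shown is a crude proxy (the table only stores hit_count and last_hit_at), and the $-saved figure would come from the cost model in scidex/forge/cost_budget.py.

def llm_cache_stats(conn):
    """Sketch of the aggregates behind GET /senate/llm-cache-stats."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT count(*)                                        AS total_entries,
                   coalesce(sum(pg_column_size(request_blob)
                              + pg_column_size(response_blob)), 0) AS approx_bytes,
                   count(*) FILTER (WHERE last_hit_at > now() - interval '7 days')
                                                                   AS entries_hit_7d
              FROM llm_response_cache
        """)
        total, approx_bytes, hit_7d = cur.fetchone()
        cur.execute("""
            SELECT cache_key, model, hit_count
              FROM llm_response_cache
             ORDER BY hit_count DESC
             LIMIT 10
        """)
        top10 = cur.fetchall()
    return {
        "total_entries": total,
        "approx_bytes": approx_bytes,
        "hit_rate_7d": hit_7d / total if total else 0.0,  # crude proxy; the real calc may differ
        "top_10_most_hit": top10,
        # "estimated_usd_saved" would be derived via scidex/forge/cost_budget.py
    }
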
Privacy/security. Block caching of any request whose messages contain the
SCIDEX_NOCACHE marker (for sensitive experiments); document the marker in the
AGENTS.md Skills section.
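
A sketch of the guard the wrapper can apply before any cache read or write; the marker string comes from this spec, while the scan over message content assumes chat-style dicts.

NOCACHE_MARKER = "SCIDEX_NOCACHE"

def _is_nocache(messages) -> bool:
    """True if any message's content carries the privacy marker."""
    for msg in messages or []:
        content = msg.get("content") if isinstance(msg, dict) else msg
        if isinstance(content, str) and NOCACHE_MARKER in content:
            return True
    return False

# In wrap_llm_call: if _is_nocache(messages), skip get()/put() entirely and call upstream.
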
Tests tests/test_llm_cache.py:
- Same input twice → second call cache hit, no upstream call.
- Different temperature → different key → cache miss.
- cache='off' bypasses the cache entirely.
- Eviction respects hit_count.
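
A sketch of the key-normalisation tests; imports assume the module layout in this spec, and the hit/no-upstream-call and eviction cases would additionally need a stubbed upstream and a test database, omitted here.

from scidex.core.llm_cache import compute_cache_key

MSGS = [{"role": "user", "content": "hello"}]

def test_same_input_same_key():
    k1 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.7, 42)
    k2 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.7, 42)
    assert k1 == k2

def test_different_temperature_different_key():
    k1 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.2, 42)
    k2 = compute_cache_key("openai", "gpt-4o", MSGS, None, 0.9, 42)
    assert k1 != k2

def test_key_insensitive_to_dict_key_order():
    a = [{"role": "user", "content": "hi"}]
    b = [{"content": "hi", "role": "user"}]  # same message, different dict key order
    assert compute_cache_key("openai", "gpt-4o", a, None, 0.0, 1) == \
           compute_cache_key("openai", "gpt-4o", b, None, 0.0, 1)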

Approach

  • Cache key + table + simple get/put first; tests pin the normalisation
    (key insensitive to dict key order, sensitive to message order/content).
  • wrap_llm_call decorator; smoke against the scidex/core/llm.py completer.
  • Integration with deterministic mode is a 5-LoC env-var flip in the wave-1
    sandbox spec — keep this spec author's footprint inside forge/runtime.py
    minimal.
  • Stats endpoint reuses the /senate/quality-dashboard shell.
  • Eviction is one cron job.
Dependencies

• scidex/core/llm.py — LiteLLM facade.
• q-sand-deterministic-replay — env-var flag flip.

Dependents

• q-repro-rerun-artifact — gives byte-identical replay over the LLM portion of
  every chain.
• Cost: replays no longer pay LLM tokens, so they become materially cheaper.

Work Log

2026-04-27 — Implementation

Completed all acceptance criteria:

• Migration 132 (migrations/132_add_llm_response_cache.py): CREATE TABLE llm_response_cache with all spec columns + indexes. Applied successfully.
• scidex/core/llm_cache.py: compute_cache_key (SHA-256, canonical JSON), get (updates hit_count), put (upsert), wrap_llm_call decorator with SCIDEX_NOCACHE privacy guard, mode helpers, _cache_mode() reading the env var.
• scidex/core/llm.py: complete() and complete_with_tools() each gain a cache: Literal["off","read","write","both"] = None kwarg defaulting to the SCIDEX_LLM_CACHE_MODE env var.
• forge/runtime.py: SCIDEX_LLM_CACHE_MODE = "both" set in the deterministic run_env.
• scripts/llm_cache_evict.py: nightly eviction, respects hit_count > 0, configurable TTL via env.
• api_routes/senate.py: GET /api/senate/llm-cache-stats returns total entries, hit-rate over 7d, top-10, bytes stored, est. $ saved.
• AGENTS.md: documented the SCIDEX_NOCACHE marker in the Skills section.
• tests/test_llm_cache.py: 16 tests, all passing. Covers: same input → same key, different model/temp/msgs → different key, dict key order normalisation, NOCACHE bypass, cache=off bypass, hit count increment.

Files created (6 new):

• migrations/132_add_llm_response_cache.py
• scidex/core/llm_cache.py
• scripts/llm_cache_evict.py
• tests/test_llm_cache.py

Files modified (4):

• scidex/core/llm.py — added cache kwarg to complete() and complete_with_tools()
• forge/runtime.py — added SCIDEX_LLM_CACHE_MODE = "both" for deterministic runs
• api_routes/senate.py — added /api/senate/llm-cache-stats endpoint
• AGENTS.md — documented the SCIDEX_NOCACHE privacy marker
