Agents repeatedly hit the same failure modes: pubmed_search returns
empty for an over-specific query; chembl_drug_targets 404s on a gene
symbol that needs UniProt resolution first; the LLM returns invalid
JSON because the prompt didn't say "respond with JSON only"; a skill
times out and the agent retries the same skill 3 more times before
giving up. Each of these has a known fix (broaden the query, resolve
the symbol, append "respond with JSON only", switch skill instead of
retrying), but agents rediscover them on every run. Build an
error-recovery memory that pattern-matches recurring failures and
auto-suggests (or auto-applies) the historical fix.
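
The failure-to-fix mapping described above can be pictured as a small lookup table with a suggest-or-apply gate. The sketch below is illustrative only. The error-class labels other than empty_result, the llm_call skill name, the confidence values, and the names KNOWN_FIXES, Hint, and decide are assumptions made for this example. The fix kinds, the 0.85 threshold, and the default-off auto-apply flag come from the criteria later in this spec.

```python
from dataclasses import dataclass

# The spec names broaden_query as an example of an auto-safe fix kind.
# Treating it as the only one is a conservative assumption of this sketch.
AUTO_SAFE = {"broaden_query"}
AUTO_APPLY_THRESHOLD = 0.85  # threshold stated in the spec


@dataclass(frozen=True)
class Hint:
    fix_kind: str
    confidence: float


# (skill_name, error_class) -> hint. Labels and confidences are invented
# for illustration; real entries would come from mined history.
KNOWN_FIXES = {
    ("pubmed_search", "empty_result"): Hint("broaden_query", 0.92),
    ("chembl_drug_targets", "http_404"): Hint("retry_with_resolver", 0.90),
    ("llm_call", "invalid_json"): Hint("reformat_prompt", 0.88),
    ("pubmed_search", "timeout"): Hint("switch_skill", 0.60),
}


def decide(skill_name, error_class, auto_apply_enabled=False):
    """Return (action, hint) where action is 'auto_apply', 'suggest', or 'none'."""
    hint = KNOWN_FIXES.get((skill_name, error_class))
    if hint is None:
        return "none", None
    if (auto_apply_enabled
            and hint.fix_kind in AUTO_SAFE
            and hint.confidence >= AUTO_APPLY_THRESHOLD):
        return "auto_apply", hint
    # Everything else is attached to the agent's next-step prompt as a hint.
    return "suggest", hint
```

With the flag left at its default, every known fix surfaces as a suggestion; enabling it only changes the outcome for high-confidence, auto-safe kinds.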
Effort: thorough
- scidex/agents/error_recovery.py::lookup_recovery(error_signature: str, agent_id: str, skill_name: str) -> RecoveryHint | None returns {fix_kind, fix_payload, confidence, n_observations, last_succeeded_at} when a known fix exists.
- error_signature = sha256 prefix over (skill_name, error_class, normalized_input_pattern). Normalization strips IDs/tokens but keeps query shape (e.g. pubmed_search:empty_result:single_token_query is one bucket).
- migrations/20260428_error_recovery_memory.sql: error_recovery_memory(id, error_signature TEXT, fix_kind TEXT CHECK (fix_kind IN ('broaden_query','retry_with_resolver','reformat_prompt','switch_skill','widen_window','reduce_specificity','escalate_model','give_up')), fix_payload JSONB, n_observations INT, n_successes INT, confidence REAL, last_observed_at, last_succeeded_at). UNIQUE(error_signature, fix_kind).
- economics_drivers/ci_error_recovery_mining.py scans the last 30 days of agent_skill_invocations for failure→retry→success sequences (same agent, same skill, within 60 s, distinct inputs) and extracts the input-delta as the candidate fix.
- scidex/agora/skill_evidence._call_skill() (or wherever skills are invoked) calls lookup_recovery() on every failure and either auto-applies the fix (if confidence ≥ 0.85 and fix_kind is auto-safe like broaden_query) or attaches the hint to the agent's next-step prompt.
- GET /api/skills/error-recovery?signature=<sig> returns the hint; GET /senate/error-recovery-memory HTML page lists top-50 frequent recoveries.
- tests/test_error_recovery.py: synthetic failure → retry → success in mined data → memory entry created with fix_kind='broaden_query'; lookup returns the hint; auto-apply path executes the fix; low-confidence hint surfaces as suggestion not auto-apply.
- fix_kind='escalate_model' is never auto-applied; it always surfaces as a suggestion (cost discipline).
- agent.py:_log_skill_invocation() to confirm the failure logging shape; extend it to write a normalized input_pattern column.
- recovery_auto_apply env flag (default off for the first week).
- agent_skill_invocations (shipped): failure log source.
- q-mem-agent-skill-preference-log: shares the agent-id keyed analytics surface.
- q-mem-evolving-prompt-suggestions: reformat_prompt recoveries feed prompt evolution.
{
"completion_shas": [
"a056062"
],
"completion_shas_checked_at": ""
}
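
A minimal sketch of the error_signature scheme and the failure-to-retry-to-success mining pass from the criteria above. Everything not named in the spec is an assumption: the token-count normalization, the 16-character prefix length, the bucket labels other than single_token_query, and the Invocation record shape (the real agent_skill_invocations columns are not shown in this spec). The sketch also treats the first qualifying success as the retry that worked, which is one possible reading of the sequence rule.

```python
import hashlib
from dataclasses import dataclass


def normalize_input_pattern(query: str) -> str:
    """Reduce a query to its shape: drop concrete tokens, keep structure."""
    tokens = query.split()
    if not tokens:
        return "empty_query"
    if len(tokens) == 1:
        return "single_token_query"
    # Hypothetical extra bucket for queries that use boolean operators.
    if any(t in {"AND", "OR", "NOT"} for t in tokens):
        return "boolean_query"
    return "multi_token_query"


def error_signature(skill_name: str, error_class: str, query: str,
                    prefix_len: int = 16) -> str:
    """sha256 prefix over (skill_name, error_class, normalized_input_pattern)."""
    key = ":".join((skill_name, error_class, normalize_input_pattern(query)))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:prefix_len]


@dataclass
class Invocation:
    # Hypothetical record shape for one logged skill call.
    agent_id: str
    skill_name: str
    ts: float            # seconds since epoch
    input_text: str
    ok: bool
    error_class: str = ""


def mine_recoveries(log, window_s: float = 60.0):
    """Return (signature, failed_input, succeeded_input) triples.

    A triple is recorded when a failure is followed, within window_s, by a
    success from the same agent on the same skill with a different input.
    """
    rows = sorted(log, key=lambda r: r.ts)
    found = []
    for i, fail in enumerate(rows):
        if fail.ok:
            continue
        for later in rows[i + 1:]:
            if later.ts - fail.ts > window_s:
                break  # rows are time-ordered, so nothing later can qualify
            if (later.ok
                    and later.agent_id == fail.agent_id
                    and later.skill_name == fail.skill_name
                    and later.input_text != fail.input_text):
                sig = error_signature(fail.skill_name, fail.error_class,
                                      fail.input_text)
                found.append((sig, fail.input_text, later.input_text))
                break  # keep only the first success after this failure
    return found
```

Because the signature hashes the normalized pattern rather than the raw input, two different single-token queries that fail the same way on the same skill land in one bucket, which is what lets observations accumulate per fix. Deriving the input-delta and a fix_kind from each failed and succeeded input pair is left out of this sketch.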