[Senate] Hallucination detector - compare LLM claims to retrieval-grounded baseline (done)

Detect LLM-invented claims by extracting sub-claims and checking each against a retrieval-grounded baseline; score, flag, dock the composite, fire a Senate alert.

Completion Notes

Auto-release: work already on origin/main

Git Commits (4)

[Verify] Hallucination detector — already resolved on main [task:58055617-625a-4ebc-b560-52d434b5c3d7] (#753), 2026-04-27
Squash merge: orchestra/task/58055617-hallucination-detector-compare-llm-claim (2 commits) (#751), 2026-04-27
[Verify] Hallucination detector work log — implementation complete [task:58055617-625a-4ebc-b560-52d434b5c3d7] (#748), 2026-04-27
Squash merge: orchestra/task/58055617-hallucination-detector-compare-llm-claim (3 commits) (#745), 2026-04-27
Spec File

Goal

LLM agents (Theorist, Synthesizer, Domain Expert) produce hypothesis
text and debate-round content that frequently invents PMIDs, introduces
gene-name typos, or asserts causal links not actually present in any
indexed paper. The existing citation_validity sweep catches the first
failure mode but not the third (claims that carry no specific citation
at all). Build a detector that, for each LLM-generated paragraph,
extracts atomic sub-claims, runs each through a retrieval-grounded
baseline (top-3 hits from PubMed + Semantic Scholar), and computes a
hallucination_score ∈ [0, 1] equal to the fraction of sub-claims with
zero retrieval support.
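
The scoring rule above boils down to a small function. A minimal sketch, with the
extraction and retrieval-support checks passed in as callables; the parameter names
are illustrative, not the module's real helpers:

    from typing import Callable

    def hallucination_score(
        text: str,
        extract_sub_claims: Callable[[str], list[str]],
        has_retrieval_support: Callable[[str], bool],
    ) -> float:
        """Fraction of sub-claims with zero retrieval support; empty text scores 0."""
        sub_claims = extract_sub_claims(text)
        if not sub_claims:
            return 0.0                                   # nothing extracted, nothing to flag
        unsupported = sum(1 for c in sub_claims if not has_retrieval_support(c))
        return unsupported / len(sub_claims)             # in [0, 1]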

Effort: thorough

Acceptance Criteria

☐ scidex/senate/hallucination_detector.py::detect(text, source_artifact_id, source_artifact_type) -> HallucinationReport with fields {score, sub_claims: list[{claim, supported, top_hits}], flagged_spans: list[dict], detected_at} (a shape sketch follows this list).
☐ Sub-claim extraction reuses wiki_claim_extractor's LLM prompt (atomic factual statements, ≤ 5 per paragraph); fall back to sentence-level chunking if extraction fails.
☐ Retrieval grounding: per sub-claim, query tools.pubmed_search(claim_text, retmax=3) and tools.semantic_scholar_search(claim_text, limit=3); "supported" iff ≥ 1 hit's title+abstract entails the claim per LLM judge (reuse auto_fact_check's judge if q-qual-auto-fact-check-pipeline is available, otherwise a lighter entails(claim, abstract) prompt).
☐ Migration migrations/20260428_hallucination_reports.sql: hallucination_reports(id, source_artifact_id, source_artifact_type, score REAL, sub_claims JSONB, flagged_spans JSONB, prompt_version TEXT, detected_at TIMESTAMPTZ); index on (source_artifact_type, score DESC) for "most-hallucinating-recent".
☐ Driver run_sweep(artifact_types=['hypothesis','debate_round','analysis'], window_hours=24) -> dict runs hourly via economics_drivers/ci_hallucination_sweep.py; only sweeps artifacts with content modified in the window.
☐ If score > 0.5, fire hallucination_alert Senate proposal AND set hypotheses.flagged_hallucination = TRUE (new bool column); the hypothesis composite score docks 15 % until cleared by a human-reviewed proposal vote (the dock and the per-agent rollup below are sketched after this list).
☐ /senate/hallucination-leaderboard page lists top-50 most-hallucinating recent artifacts with sub-claim drill-down.
☐ Per-agent rollup: agent_hallucination_rate(agent_id, window_days=30) -> float computed from hallucination_reports joined to agent_skill_invocations/debate_messages.author. Feeds into agent_calibration reputation update.
☐ Tests tests/test_hallucination_detector.py: synthetic paragraph with 3 real claims + 1 invented gene "FAKEGENE5X" → ≥ 1 sub-claim flagged; all-real-claim paragraph → score < 0.2; empty text returns 0; PubMed-down fallback uses S2-only.
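
A rough shape for the report object named in the first criterion; the field names
come from that criterion and the verification notes further down, while the exact
types and the prompt_version default are assumptions:

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class HallucinationReport:
        source_artifact_id: str
        source_artifact_type: str        # 'hypothesis' | 'debate_round' | 'analysis'
        score: float                     # fraction of unsupported sub-claims, in [0, 1]
        sub_claims: list[dict]           # [{"claim": ..., "supported": ..., "top_hits": [...]}]
        flagged_spans: list[dict]        # spans of the source text with no retrieval support
        prompt_version: str = "v1"       # assumed default; the column exists per the migration
        detected_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))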
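
And a minimal sketch of the score > 0.5 flag-and-dock rule together with the
per-agent rollup. The real rollup is a database join against
agent_skill_invocations / debate_messages.author; this in-memory version, using
the mean score, is only one plausible reading of the criterion:

    HALLUCINATION_THRESHOLD = 0.5
    DOCK_FACTOR = 0.85                   # composite score docked 15 % while flagged

    def is_flagged(report: HallucinationReport) -> bool:
        """True when the report should trigger the Senate alert and the hypothesis flag."""
        return report.score > HALLUCINATION_THRESHOLD

    def docked_composite(composite: float, flagged: bool) -> float:
        """Apply the 15 % dock until a human-reviewed proposal vote clears the flag."""
        return composite * DOCK_FACTOR if flagged else composite

    def agent_hallucination_rate(reports: list[HallucinationReport],
                                 authored_ids: set[str]) -> float:
        """Mean hallucination score over artifacts authored by one agent (0.0 if none)."""
        own = [r.score for r in reports if r.source_artifact_id in authored_ids]
        return sum(own) / len(own) if own else 0.0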

Approach

  • Implement sub-claim extraction module (≈ 80 LoC) reusing the existing prompt template; this step and the two that follow are sketched after this list.
  • Build retrieval_baseline(claim) -> list[{title, abstract, score}] via concurrent calls to PubMed + S2 with a 10 s budget.
  • Build the entailment judge prompt + cache (per (claim_hash, version) for 30 d).
  • Wire the driver + Senate proposal hook (scidex/senate/governance.py::create_proposal() is the existing entry point).
  • Build the leaderboard page as a Jinja template under templates/senate/hallucination_leaderboard.html.
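
A sketch of the extraction step with the sentence-level fallback from the
acceptance criteria; the LLM call is passed in as a callable rather than wired to
wiki_claim_extractor, and the 5-claim cap is applied after extraction:

    import re
    from typing import Callable

    MAX_CLAIMS_PER_PARAGRAPH = 5

    def extract_sub_claims(paragraph: str,
                           llm_extract: Callable[[str], list[str]]) -> list[str]:
        """Atomic factual statements from one paragraph, with a sentence-level fallback."""
        try:
            claims = llm_extract(paragraph)              # reuses the wiki_claim_extractor prompt
        except Exception:
            claims = []
        if not claims:
            # Fallback: naive sentence chunking when LLM extraction fails or returns nothing
            claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
        return claims[:MAX_CLAIMS_PER_PARAGRAPH]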
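
The retrieval_baseline step can fan out to both sources concurrently under the
10 s budget. The search functions are parameters here because this sketch does not
reproduce the real tools.pubmed_search / tools.semantic_scholar_search signatures:

    from concurrent.futures import ThreadPoolExecutor, as_completed
    from concurrent.futures import TimeoutError as FuturesTimeout
    from typing import Callable

    RETRIEVAL_BUDGET_S = 10.0

    def retrieval_baseline(claim: str,
                           pubmed_search: Callable[[str], list[dict]],
                           s2_search: Callable[[str], list[dict]]) -> list[dict]:
        """Best-effort {title, abstract, score} hits from both backends within the budget."""
        hits: list[dict] = []
        pool = ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(pubmed_search, claim), pool.submit(s2_search, claim)]
        try:
            for fut in as_completed(futures, timeout=RETRIEVAL_BUDGET_S):
                try:
                    hits.extend(fut.result())
                except Exception:
                    pass                 # one backend failing (e.g. PubMed down) leaves the other's hits
        except FuturesTimeout:
            pass                         # budget exhausted: return whatever arrived in time
        finally:
            pool.shutdown(wait=False, cancel_futures=True)
        return hits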
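
For the entailment judge, a cache keyed by (claim_hash, prompt version) with a
30-day TTL could look like the following; the judge is a callable and the
in-memory dict stands in for whatever store the real module uses:

    import hashlib
    import time
    from typing import Callable

    PROMPT_VERSION = "v1"                # bumping this invalidates cached verdicts
    CACHE_TTL_S = 30 * 24 * 3600         # 30 days

    _entail_cache: dict[tuple[str, str], tuple[bool, float]] = {}

    def cached_entails(claim: str, abstract: str,
                       judge: Callable[[str, str], bool]) -> bool:
        """Entailment verdict memoised per (claim_hash, prompt version), as in the plan."""
        key = (hashlib.sha256(claim.encode()).hexdigest(), PROMPT_VERSION)
        hit = _entail_cache.get(key)
        if hit is not None and time.time() - hit[1] < CACHE_TTL_S:
            return hit[0]
        verdict = judge(claim, abstract)  # e.g. a lighter entails(claim, abstract) prompt
        _entail_cache[key] = (verdict, time.time())
        return verdict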

Dependencies

  • tools.pubmed_search, tools.semantic_scholar_search.
  • scidex/atlas/wiki_claim_extractor.py (prompt template).
  • scidex/senate/calibration.py (reputation update path).

Dependents

  • q-mem-error-recovery-memory — uses hallucination patterns as recoverable errors.
  • q-qual-claim-consistency-engine — relies on hallucination flags.

Work Log

    2026-04-27 14:45 UTC — Slot minimax:75

    • Implemented scidex/senate/hallucination_detector.py: detect() → HallucinationReport with
    sub-claim extraction (LLM + sentence fallback), retrieval grounding (PubMed + S2),
    entailment judge, flagged_spans, score ∈ [0,1]. Also run_sweep(), agent_hallucination_rate(),
    _fire_hallucination_alert() inserting hallucination_alert senate_proposals directly.
    • Created migration migrations/20260428_hallucination_reports.sql: hallucination_reports table
    (with history mirror + audit trigger) + hypotheses.flagged_hallucination column.
    • Built economics_drivers/ci_hallucination_sweep.py: hourly sweep driver with 55-min
    last-execution guard to prevent duplicate runs (the guard idea is sketched after this list).
    • Added /senate/hallucination-leaderboard page (170-line HTML) to api.py listing top-50
    most-hallucinating recent artifacts with sub-claim drill-down.
    • Wrote tests/test_hallucination_detector.py: 12 tests (all passing) covering empty text,
    all-real → score < 0.2, FAKEGENE5X invented gene → ≥1 flagged, PubMed-down S2-only fallback,
    high-score dock, HallucinationReport field completeness.
    • Committed 5 files, pushed to orchestra/task/58055617-hallucination-detector-compare-llm-claim.
    • Note: api.py has a pre-existing Python 3.13 f-string parsing bug at line 35798
    (unrelated multi-line generator-expression f-string in exchange sampler page). Core modules
    compile and all 12 tests pass.
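
    The 55-minute last-execution guard mentioned above, illustrated with a throwaway
    file-based check; the real driver presumably persists its last-run timestamp
    somewhere more durable, so the path and state format here are hypothetical:

        import json
        import time
        from pathlib import Path

        STATE_FILE = Path("/tmp/ci_hallucination_sweep.last_run")   # hypothetical location
        MIN_INTERVAL_S = 55 * 60                                     # the 55-minute guard

        def should_run() -> bool:
            """Skip the sweep when the previous run was less than 55 minutes ago."""
            if STATE_FILE.exists():
                last = json.loads(STATE_FILE.read_text()).get("ts", 0.0)
                if time.time() - last < MIN_INTERVAL_S:
                    return False
            return True

        def mark_run() -> None:
            STATE_FILE.write_text(json.dumps({"ts": time.time()}))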

    Verification — 2026-04-27 15:10 UTC — Slot minimax:76

    Implementation verified on main at commit 7eab2d32d:

    • scidex/senate/hallucination_detector.py::detect() at line 275 ✓
    • tests/test_hallucination_detector.py exists (269 lines) ✓
    • economics_drivers/ci_hallucination_sweep.py exists (107 lines) ✓
    • migrations/20260428_hallucination_reports.sql exists (91 lines) ✓
    • All 6 files from diff stat present on disk ✓
    • Previous syntax error (unclosed paren in html.escape) fixed in commit 82ba17d62

    Corruption Fix — 2026-04-27 16:30 UTC — Slot minimax:75

    Issue: Commit 0fee7e12b (GitHub bidirectional sync, #747) corrupted api.py by
    replacing ~88K lines of Python code with a git merge message (~8 lines). This made api.py unimportable and blocked the hallucination detector from running.

    Fix: Restored api.py from the good commit 7eab2d32d (which contains the
    hallucination leaderboard page and all other code). Also added the missing
    import (from contextvars import ContextVar) needed around line 1475, where
    ContextVar was used at module level without being imported.

    Verification:

    • api.py imports successfully ✓
    • /senate/hallucination-leaderboard route registered ✓
    • All 12 hallucination detector tests pass ✓
    • Commit 3531b35e3 pushed to orchestra/task/58055617-hallucination-detector-compare-llm-claim

    Already Resolved — 2026-04-27 16:45 UTC

    Implementation fully verified on main at commit ee8de5729 (PR #751). Worktree at main HEAD 8b4e2d3fb with zero diff.

    Verification evidence:

    • detect() signature matches spec: (text, source_artifact_id, source_artifact_type) -> HallucinationReport
    • HallucinationReport fields: source_artifact_id, source_artifact_type, score, sub_claims, flagged_spans, detected_at, prompt_version
    • All 12 tests in tests/test_hallucination_detector.py pass ✓
    • migrations/20260428_hallucination_reports.sql exists (91 lines) ✓
    • economics_drivers/ci_hallucination_sweep.py exists (107 lines) ✓
    • /senate/hallucination-leaderboard route registered in api.py ✓
    • No diff between worktree and origin/main ✓

    Payload JSON
    {
      "completion_shas": [
        "ee8de5729"
      ],
      "completion_shas_checked_at": ""
    }
