[Atlas] Fake-citation honeypot for citation-validity sweep done

Plant reserved-range fake PMIDs daily; if validity sweep fails to flag within 48h, emit citation_sweep_degraded alert.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Atlas/Senate] Fake-citation honeypot for citation-validity sweep [task:5ab72d70-6da2-46c8-b94d-b3bde5612c93] (#762), 2026-04-27
Spec File

Goal

scidex/atlas/citation_validity.py runs a triple-redundant LLM check
on every (claim, PMID) pair to decide whether the cited paper actually
supports the claim. We trust this sweep to catch fabricated PMIDs (a
real failure mode in LLM-generated hypotheses), but we never test it.
This task plants honeypot citations — fabricated PMIDs in
fabricated-but-quarantined claims — periodically, then verifies that
the validity sweep catches them within one cycle. Honeypots that are not caught indicate the sweep has degraded (model drift, prompt
corruption, throughput cap hiding rows) and trigger a Senate alert.
This is a canary system, not a test suite — it runs in production
against the live sweep.

Effort: deep

Acceptance Criteria

☐ New module scidex/senate/citation_honeypot.py:
- plant(n=5) -> list[honeypot_id] — generates n fake PMIDs
in the explicit honeypot range (PMID 99000000-99999999
— guaranteed never to clash with real PubMed IDs which top
out around 39M today; reserve the range in
scidex/atlas/citation_validity.py as "always
contradicts/off_topic by construction"). Stores them in a
new citation_honeypot table with the synthetic claim text
and the expected verdict (contradicts).
- harvest() -> HarvestReport — checks each planted honeypot
and reports whether the validity sweep correctly flagged it
as contradicts/off_topic within the SLA window (default
24h).
☐ Migration migrations/20260428_citation_honeypot.sql:

CREATE TABLE citation_honeypot (
        id           TEXT PRIMARY KEY,
        fake_pmid    TEXT NOT NULL,
        claim_text   TEXT NOT NULL,
        target_artifact_id TEXT,    -- optional: synthetic decoy hypothesis
        planted_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        expected_verdict TEXT NOT NULL DEFAULT 'contradicts',
        first_caught_at  TIMESTAMPTZ,
        actual_verdict   TEXT,
        retired_at   TIMESTAMPTZ
      );

☐ Quarantine: citation_validity.fetch_abstract is patched
to short-circuit fake-honeypot PMIDs before any external call,
returning a deterministic placeholder abstract. This prevents
the honeypot from leaking into PubMed-quality metrics and
ensures the LLM judges actually see "this abstract is unrelated
to the claim" rather than a network 404.
☐ Honeypot artifacts are tagged lifecycle='honeypot' and
EXCLUDED from public listing pages (/hypotheses,
/exchange, dashboard) by default. Add a
WHERE lifecycle != 'honeypot' clause where missing.
☐ Recurring quest: plant 1 honeypot/day, harvest after 24h.
Honeypots not caught within 48h emit a senate_alerts row
kind='citation_sweep_degraded' and create a Senate
review task.
☐ Honeypot dashboard tile: Senate dashboard shows
"Honeypot detection rate (30d)", computed as count_caught_within_24h /
count_planted_30d. Healthy floor: ≥ 0.90.
☐ Cleanup: harvested honeypots are auto-retired after 7 days
(retired_at = NOW()); retired rows kept for the trailing 30d
detection-rate metric, then DELETE.
☐ Tests tests/test_citation_honeypot.py:
plant happy path, harvest with caught + uncaught + pending
cases, quarantine path verified (fetch_abstract returns
placeholder, never calls eutils), public-listing exclusion
query.
☐ Document the threat model in the spec body of
scidex/senate/citation_honeypot.py so future maintainers
do not re-purpose the honeypot range.
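
The reserved range and the quarantine short-circuit described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the constant and function names come from the work log below, but the placeholder wording and the injected real_fetch callable (standing in for the paper_cache lookup) are assumptions made so the example runs on its own.

```python
from typing import Callable, Optional

# Reserved honeypot range, matching the constants named in the work log.
HONEYPOT_PMID_MIN = 99_000_000
HONEYPOT_PMID_MAX = 99_999_999

# Assumption: the real placeholder text is not given in the spec.
HONEYPOT_ABSTRACT_PLACEHOLDER = (
    "[honeypot] Deterministic placeholder abstract, unrelated to any claim."
)


def is_honeypot_pmid(pmid) -> bool:
    """Return True when pmid parses as an integer inside the reserved range."""
    try:
        value = int(str(pmid).strip())
    except ValueError:
        return False
    return HONEYPOT_PMID_MIN <= value <= HONEYPOT_PMID_MAX


def fetch_abstract(
    pmid, real_fetch: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Short-circuit honeypot PMIDs before any external lookup is attempted.

    real_fetch is a hypothetical stand-in for the real abstract lookup.
    """
    if is_honeypot_pmid(pmid):
        return HONEYPOT_ABSTRACT_PLACEHOLDER
    return real_fetch(str(pmid))
```

Passing the real lookup in as a parameter is only a convenience for the sketch; it makes the "never calls eutils" property easy to assert with a stub that records its calls.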

Approach

  • Reserve the PMID range in citation_validity.py and patch
    fetch_abstract first (under unit test).
  • Migration; module; tests.
  • Recurring registration (mirror existing Senate scheduled jobs).
  • Smoke: plant 3, run the validity sweep manually, harvest, expect
    3/3 caught.
  • Tile + alert path.
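
For the smoke step, the planting half might look roughly like the sketch below. It builds rows in memory instead of inserting into citation_honeypot. The function name plant_rows, the claim templates, and the "hp-" id format are all hypothetical; the real plant() returns honeypot ids and writes to the database.

```python
import random
import uuid
from typing import Optional

HONEYPOT_PMID_MIN = 99_000_000
HONEYPOT_PMID_MAX = 99_999_999

# Hypothetical claim templates; the module's real templates are not shown here.
CLAIM_TEMPLATES = [
    "Compound X reverses condition Y in adult mice.",
    "Gene A is required for process B in human cells.",
]


def plant_rows(n: int = 5, rng: Optional[random.Random] = None) -> list[dict]:
    """Build n honeypot rows with fake PMIDs drawn from the reserved range."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = rng or random.Random()
    rows = []
    for _ in range(n):
        rows.append(
            {
                "id": f"hp-{uuid.uuid4().hex}",  # assumed id format
                "fake_pmid": str(rng.randint(HONEYPOT_PMID_MIN, HONEYPOT_PMID_MAX)),
                "claim_text": rng.choice(CLAIM_TEMPLATES),
                "expected_verdict": "contradicts",
            }
        )
    return rows


# Smoke-style usage: plant 3, then hand the rows to the sweep under test.
smoke_rows = plant_rows(3, random.Random(0))
```

Accepting an optional random.Random keeps the PMID and template draws reproducible in tests, while the default still yields fresh values on each scheduled run.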

Dependencies

  • scidex/atlas/citation_validity.py: read-path consumer.

Dependents

  • q-rt-falsifier-of-truth-cron: uses the same canary pattern for
    hypotheses.

Work Log

2026-04-27: Implementation complete

What was built:

  • migrations/20260428_citation_honeypot.sql: citation_honeypot table with all
    columns from spec; indexed on planted_at, first_caught_at, retired_at, fake_pmid.

  • scidex/atlas/citation_validity.py, added:
    - HONEYPOT_PMID_MIN = 99_000_000, HONEYPOT_PMID_MAX = 99_999_999 constants
    - HONEYPOT_ABSTRACT_PLACEHOLDER deterministic placeholder string
    - is_honeypot_pmid(pmid) -> bool helper
    - fetch_abstract(pmid) -> Optional[str] wrapper that short-circuits honeypot
      PMIDs without any external call; all other PMIDs delegate to paper_cache
    - sweep() and run_synthetic_test() both use fetch_abstract() instead of
      importing get_abstract directly

  • scidex/senate/citation_honeypot.py, new module with:
    - plant(n=5) -> list[str]: inserts n honeypots with random PMIDs from the
      reserved range and randomly selected claim templates
    - harvest() -> HarvestReport: sweeps active honeypots via _triple_evaluate,
      marks caught entries, emits alerts for honeypots uncaught after 48h, retires
      after 7d, deletes after 30d
    - get_detection_rate_30d() -> dict: metric for dashboard tile
    - _emit_alert(): writes senate_alerts row with alert_type='citation_sweep_degraded'
    - Full threat model documented in module docstring

  • scidex/senate/scheduled_tasks.py, two new tasks registered:
    - citation-honeypot-plant (daily, plants 1 honeypot)
    - citation-honeypot-harvest (daily, runs harvest cycle)

  • scidex/senate/quality_dashboard.py, new tile:
    - _honeypot_health_metric(): calls get_detection_rate_30d()
    - _render_honeypot_tile(hp): renders the "Citation-Validity Canary" card
    - Tile shows 30d detection rate, planted/caught counts, HEALTHY/DEGRADED status
    - Added to build_quality_dashboard() payload as honeypot_health

  • tests/test_citation_honeypot.py, 23 tests (all pass):
    - Quarantine path: fetch_abstract returns placeholder, never calls eutils
    - PMID boundary conditions
    - plant() happy path, default n, raises on n=0
    - harvest() caught-in-SLA, missed-SLA alert, pending cases
    - get_detection_rate_30d() healthy, degraded, zero data
    - Public-listing exclusion: plant() never touches hypotheses table
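
The dashboard metric (caught within 24h divided by planted in the last 30d, with a 0.90 healthy floor) could be computed along these lines. This is a sketch over in-memory records, not the body of get_detection_rate_30d(): the Honeypot dataclass, the returned dict keys, and the "NO_DATA" status for an empty window are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

SLA = timedelta(hours=24)  # caught-within window from the spec
WINDOW = timedelta(days=30)  # trailing window for the tile
HEALTHY_FLOOR = 0.90


@dataclass
class Honeypot:
    """Hypothetical in-memory view of one citation_honeypot row."""

    planted_at: datetime
    first_caught_at: Optional[datetime] = None


def detection_rate_30d(rows: list[Honeypot], now: datetime) -> dict:
    """Compute count_caught_within_24h / count_planted_30d and a status label."""
    recent = [r for r in rows if r.planted_at >= now - WINDOW]
    caught = sum(
        1
        for r in recent
        if r.first_caught_at is not None
        and r.first_caught_at - r.planted_at <= SLA
    )
    if not recent:
        # Assumption: the spec does not say how an empty window is reported.
        return {"planted": 0, "caught": 0, "rate": None, "status": "NO_DATA"}
    rate = caught / len(recent)
    status = "HEALTHY" if rate >= HEALTHY_FLOOR else "DEGRADED"
    return {"planted": len(recent), "caught": caught, "rate": rate, "status": status}
```

A honeypot caught later than 24h still counts as planted but not as caught, so slow detection lowers the rate just as a miss does.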

Design notes:

  • Honeypots live only in the citation_honeypot table (no fake hypothesis rows).
    This means they're naturally excluded from all hypothesis listing queries.
  • harvest() drives its own _triple_evaluate calls rather than waiting for
    the normal sweep; this tests LLM drift and prompt corruption directly.
  • The senate_alerts table uses alert_type (not alert_kind) per the live schema.
  • The lifecycle='honeypot' column was not needed since honeypots live in their own
    table, not in hypotheses/artifacts.
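
The three harvest outcomes exercised by the tests (caught, pending, missed) can be illustrated with a small classifier. This is a hedged sketch rather than the harvest() implementation: it takes the verdict as an input instead of calling _triple_evaluate, and the function name classify and its string return values are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Verdicts that count as the sweep flagging the honeypot, per the spec.
CAUGHT_VERDICTS = {"contradicts", "off_topic"}
ALERT_AFTER = timedelta(hours=48)  # uncaught past this point triggers the alert


def classify(planted_at: datetime, now: datetime, verdict: Optional[str]) -> str:
    """Return 'caught', 'missed', or 'pending' for a single honeypot."""
    if verdict in CAUGHT_VERDICTS:
        return "caught"
    if now - planted_at >= ALERT_AFTER:
        # The caller would emit a citation_sweep_degraded alert here.
        return "missed"
    return "pending"
```

A verdict such as "supports" on a honeypot is treated the same as no verdict at all: the row stays pending until the 48h mark and is then reported as missed.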

Payload JSON
    {
      "completion_shas": [
        "ecdf740b3c04dd6acf890d34914ab44030f4bc9a"
      ],
      "completion_shas_checked_at": ""
    }

    Sibling Tasks in Quest (Adversarial Science) ↗