[Agora] Gap resolution engine — auto-close knowledge gaps when hypothesis + evidence satisfies resolution criteria

Goal

SciDEX has 3,545 knowledge gaps of which only 12 are resolved (0.34% resolution rate, 2026-04-28).
Gaps are generated rapidly but never close — the gap system is a noise accumulator, not a progress
tracker. This task builds the resolution engine that closes gaps when sufficient evidence exists.

Why this matters

A gap resolution rate of 0.34% means:

  • The platform cannot demonstrate progress to researchers
  • The knowledge graph has no "proof of work" feedback
  • Market participants cannot price gap-resolution contributions
  • The system looks like it generates problems without solving them

Raising the resolution rate from 0.34% to 10% or more in one task would be the single largest
measurable improvement to SciDEX's scientific output value.

What to build

The resolution pipeline

A gap is considered resolved when all three conditions are met:

  • Hypothesis coverage: At least one hypothesis with composite_score ≥ 0.7 directly addresses
    the gap's topic (match by disease, target gene/pathway, or question text)
  • Evidence support: The matching hypothesis has evidence_for with ≥ 2 PubMed-cited entries
  • Debate engagement: At least one debate session exists for an analysis related to the hypothesis

Build a function resolve_matching_gaps(batch_size=50) that:

  • Queries open knowledge gaps (status='open')
  • For each gap, finds matching hypotheses using:
    - Direct analysis_id or target_gene foreign keys where available
    - Text similarity between gap title/description and hypothesis title/description
    - Disease field matching
  • Checks the three resolution conditions above
  • If conditions met: updates knowledge_gaps.status = 'resolved' with a resolution_summary
    JSON field containing hypothesis_id, debate_session_id, evidence_summary
  • Emits a KG edge: gap -[resolved_by]-> hypothesis
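
The three resolution conditions can be sketched as a single predicate. This is a minimal illustration, not the engine's actual code: the HypothesisRecord class, its field layout, and the helper names are hypothetical, and it assumes each evidence_for entry is a dict whose PubMed citation sits under a "pmid" key. Topic matching is assumed to have happened already.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MIN_COMPOSITE_SCORE = 0.7   # score threshold stated in the spec
MIN_EVIDENCE_COUNT = 2      # PubMed-cited entries required by the spec


@dataclass
class HypothesisRecord:
    """Hypothetical in-memory view of one hypotheses row (layout assumed)."""
    id: str
    composite_score: float
    evidence_for: List[Dict] = field(default_factory=list)
    debate_session_ids: List[str] = field(default_factory=list)


def pubmed_cited_count(hyp: HypothesisRecord) -> int:
    """Count evidence entries that carry a non-empty PMID."""
    return sum(1 for entry in hyp.evidence_for if entry.get("pmid"))


def meets_resolution_conditions(hyp: HypothesisRecord) -> bool:
    """Apply the three conditions to a hypothesis already matched to a gap."""
    return (
        hyp.composite_score >= MIN_COMPOSITE_SCORE            # hypothesis coverage
        and pubmed_cited_count(hyp) >= MIN_EVIDENCE_COUNT      # evidence support
        and len(hyp.debate_session_ids) >= 1                   # debate engagement
    )
```

Keeping the predicate separate from the matching step means the resolution bar is defined in one place and cannot be loosened by accident when the matching logic changes.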

What NOT to do

    • Do NOT manually write resolution summaries for each gap (that's row-count work)
    • Do NOT lower the resolution bar to artificially inflate the count
    • Do NOT create a new recurring driver for this. After building and testing the engine,
      register it with the existing [Agora] CI: Trigger debates for analyses with 0 debate sessions
      driver or propose integration with the Senate world-model driver

    Schema reference

    -- knowledge_gaps table: id, title, description, status, disease, analysis_id
    -- Update target: status = 'resolved', add resolution metadata to description

    Acceptance criteria

    ☑ resolve_matching_gaps() function implemented and tested
    ☑ Resolution conditions verified against real data (not loosened to inflate count)
    ☑ At least 50 gaps resolved in first run, with resolution summaries
    ☑ KG edges emitted for resolved gaps
    ☑ Resolution rate improves from 0.34% baseline
    ☑ Function registered as utility in scidex/agora/ — see scidex/agora/gap_resolution_engine.py

    Implementation (2026-04-28)

    Module: scidex/agora/gap_resolution_engine.py

    Algorithm

    A two-phase SQL query avoids an expensive cross-join.

    Phase 1: pre-filter qualifying hypotheses:

    • composite_score >= 0.7
    • evidence_for has ≥ 2 entries with pmid field
    • At least one debate_session exists for the hypothesis's analysis_id

    Phase 2: match gaps to hypotheses using four signals:

    • Direct FK chain (priority=100): analyses.gap_id = gap.id, linked through hypotheses.analysis_id
    • Exact domain/disease (priority=50): LOWER(hypothesis.disease) = LOWER(gap.domain)
    • Partial domain overlap (priority=25-30): substring containment in either direction
    • Text rank gate: ts_rank(hypothesis.search_vector, plainto_tsquery('english', gap.title)) >= 0.05

    The text rank gate (hypothesis search_vector vs gap title) prevents broad-domain false positives
    (e.g. a "gut microbiome/NLRP3" hypothesis resolving an "APOE4 lipid metabolism" gap).

    Thresholds:

    Parameter                        Value   Rationale
    MIN_COMPOSITE_SCORE              0.7     Top-tier hypotheses only
    MIN_EVIDENCE_COUNT               2       Multiple PubMed citations required
    MIN_TEXT_RANK_FOR_DOMAIN_MATCH   0.05    Non-trivial topical overlap required
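
    The Phase 2 signals and the text rank threshold can be sketched in Python as a priority function. The real engine does this in SQL; the function name and arguments here are hypothetical. Two assumptions are made explicit: direct FK matches bypass the text rank gate (suggested by the constant's name, not confirmed by the spec), and partial overlap uses a single priority of 25, because the spec gives the range 25-30 without saying how it is split.

```python
from typing import Optional

MIN_TEXT_RANK_FOR_DOMAIN_MATCH = 0.05  # threshold from the table above


def match_priority(
    gap_domain: Optional[str],
    hypothesis_disease: Optional[str],
    has_direct_fk: bool,
    text_rank: float,
) -> Optional[int]:
    """Return the match priority for a gap/hypothesis pair, or None if no match."""
    # Assumption: a direct FK chain is trusted without the text rank gate.
    if has_direct_fk:
        return 100

    # Domain-based matches must clear the text rank gate first.
    if text_rank < MIN_TEXT_RANK_FOR_DOMAIN_MATCH:
        return None

    gap = (gap_domain or "").strip().lower()
    disease = (hypothesis_disease or "").strip().lower()
    if not gap or not disease:
        return None

    if gap == disease:
        return 50  # exact domain/disease match
    if gap in disease or disease in gap:
        return 25  # partial overlap; the spec's 25-30 range is collapsed to 25 here
    return None


# A broad-domain pair with negligible text overlap is rejected by the gate.
print(match_priority("neurodegeneration", "neurodegeneration", False, 0.01))
```

    When several hypotheses match one gap, picking the highest priority first means a direct analysis link always wins over a domain-only match.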

    Resolution action per gap

  • knowledge_gaps.status = 'resolved', quality_status = 'resolved'
  • knowledge_gaps.evidence_summary — structured text with hypothesis_id, score, PMIDs
  • knowledge_edges row: gap -[resolved_by]-> hypothesis (evidence_strength = composite_score)
  • events row: event_type='gap_resolved' via event_bus.publish()
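
    As a rough sketch of the per-gap resolution action, the function below assembles the three records before anything is written. It is illustrative only: the function name is hypothetical, the source_id/target_id edge keys are assumed (only relation and evidence_strength appear in this spec), and JSON is one assumed encoding of the "structured text" stored in evidence_summary. The database writes and the event_bus.publish() call are left out.

```python
import json
from typing import Dict, List, Tuple


def build_resolution_records(
    gap_id: str,
    hypothesis_id: str,
    composite_score: float,
    pmids: List[str],
) -> Tuple[Dict, Dict, Dict]:
    """Build the gap update, KG edge, and event payload for one resolved gap."""
    # Structured text for knowledge_gaps.evidence_summary (JSON is an assumption).
    evidence_summary = json.dumps(
        {"hypothesis_id": hypothesis_id, "score": composite_score, "pmids": pmids},
        sort_keys=True,
    )
    gap_update = {
        "status": "resolved",
        "quality_status": "resolved",
        "evidence_summary": evidence_summary,
    }
    # gap -[resolved_by]-> hypothesis; key names other than "relation" are assumed.
    edge = {
        "source_id": gap_id,
        "relation": "resolved_by",
        "target_id": hypothesis_id,
        "evidence_strength": composite_score,
    }
    event = {
        "event_type": "gap_resolved",
        "payload": {"gap_id": gap_id, "hypothesis_id": hypothesis_id},
    }
    return gap_update, edge, event
```

    Building all three records from the same inputs keeps the gap row, the edge, and the event consistent, and makes a dry run possible by simply not persisting them.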

    Results (2026-04-28)

    Metric                   Before   After
    Total gaps               3,545    3,545
    Resolved gaps            12       299
    Resolution rate          0.34%    8.4%
    Open gaps                3,533    3,048
    KG edges (resolved_by)   0        287
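
    The resolution rates in the table follow directly from the resolved and total counts. A short check, using a hypothetical helper name:

```python
def resolution_rate(resolved: int, total: int) -> float:
    """Return resolved gaps as a percentage of total gaps."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


# Counts taken from the results table above.
print(f"before: {resolution_rate(12, 3545):.2f}%")   # before: 0.34%
print(f"after:  {resolution_rate(299, 3545):.1f}%")  # after:  8.4%
```

    The 287 new resolved_by edges account for the difference between the two resolved counts (12 + 287 = 299).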

    Work Log

    Created 2026-04-28 by task generator cycle 2

    3,545 gaps, 12 resolved = 0.34% resolution rate. The resolution loop has never been
    implemented. Building it would turn the gap system from noise into progress signal.

    2026-04-28 — Initial Implementation (task:31eeae8d-40b3-41c4-9032-ea028239662a)

    Agent: claude-sonnet-4-6 (commit bac57b22e)

  • Investigated knowledge_gaps (3,545 total, 12 resolved), hypotheses (1,886 total, 346 meeting
    score ≥ 0.7 + evidence ≥ 2), and debate_sessions (590 linked to analyses) tables
  • Designed two-phase SQL query: pre-filter qualifying hypotheses, then join gaps by domain match
  • Added text rank gate (ts_rank(hypothesis.search_vector, plainto_tsquery(gap.title)) >= 0.05)
    to prevent broad domain matches from creating false positives
  • Ran initial batch, identified 23 false positives (text_rank < 0.05), reverted, re-ran
  • Final: 287 new gaps resolved with text rank validation; total = 299 (8.4% resolution rate)

    Files:

    • scidex/agora/gap_resolution_engine.py — new module
    • docs/planning/specs/quest_agora_gap_resolution_engine_spec.md — this file

    2026-04-28 — Atlas closure pass (task:f4f7b129-0f43-4c84-abd8-20d4e701842d)

    Agent: codex

  • Staleness review found the original 12-resolved baseline was partly addressed
    by the initial implementation, but the live DB still had only 299 resolved
    gaps and 2,866 open gaps.
  • Added scidex/atlas/gap_closure_pipeline.py, a bounded closure pass that
    matches open gaps against accumulated hypotheses, debate sessions, and paper
    full-text vectors using direct analysis links plus conservative keyword
    scoring. The current schema has no resolution_summary column, so the
    pipeline writes structured resolution metadata into evidence_summary.
  • Dry-run result: 230 resolvable gaps, 1 partially addressed gap, 719 skipped
    because they lacked specific text or hypothesis coverage.
  • Production run result: 230 gaps moved from open to resolved, 1 moved to
    partially_addressed, and 230 gap_resolution KG edges inserted with
    relation='resolved_by'. A follow-up repair aligned quality_status on
    all 231 task-touched rows.
  • Final live status after this pass: 529 resolved, 308
    partially_addressed, and 2,635 open gaps. Resolved-gap rate is now
    approximately 14.9% of 3,545 total gaps.

    Files:

    • scidex/atlas/gap_closure_pipeline.py — reusable Atlas closure pipeline
    • scidex/senate/quality_checks.py — recognize the existing resolved_by
    resolution relation in KG edge quality checks
    • docs/planning/specs/quest_agora_gap_resolution_engine_spec.md — this log

    File: quest_agora_gap_resolution_engine_spec.md
    Modified: 2026-04-28 18:12
    Size: 7.6 KB