[Agora] Gap resolution engine — auto-close knowledge gaps when hypothesis + evidence satisfies resolution criteria

Goal

SciDEX has 3,545 knowledge gaps, of which only 12 are resolved (0.34% resolution rate as of 2026-04-28).
Gaps are generated rapidly but almost never close, so the gap system acts as a noise accumulator rather
than a progress tracker. This task builds the resolution engine that closes gaps when sufficient evidence exists.

Why this matters

A gap resolution rate of 0.34% means:

The platform cannot demonstrate progress to researchers
The knowledge graph has no "proof of work" feedback
Market participants cannot price gap-resolution contributions
The system looks like it generates problems without solving them

Raising the resolution rate from 0.34% to 10% or more in one task would be the single largest measurable
improvement to SciDEX's scientific output value.

What to build

The resolution pipeline

A gap is considered resolved when all three conditions are met:

Hypothesis coverage: At least one hypothesis with composite_score ≥ 0.7 directly addresses the gap's topic (match by disease, target gene/pathway, or question text)

Evidence support: The matching hypothesis has evidence_for with ≥ 2 PubMed-cited entries

Debate engagement: At least one debate session exists for an analysis related to the hypothesis
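
The three conditions above can be read as a single predicate. The following is a minimal sketch under stated assumptions, not the engine's actual code: the `Hypothesis` dataclass, its `debate_session_ids` field, and the `addresses_gap` flag are hypothetical stand-ins for database rows and for the topic-matching step.

```python
from dataclasses import dataclass, field

MIN_COMPOSITE_SCORE = 0.7  # hypothesis coverage threshold
MIN_EVIDENCE_COUNT = 2     # PubMed-cited evidence entries required


@dataclass
class Hypothesis:
    """Hypothetical in-memory stand-in for a hypotheses row."""
    id: str
    composite_score: float
    # Each entry is a dict; PubMed-cited entries carry a "pmid" key (assumed shape).
    evidence_for: list = field(default_factory=list)
    # Debate sessions on analyses related to this hypothesis (assumed field).
    debate_session_ids: list = field(default_factory=list)


def count_pubmed_evidence(hypothesis: Hypothesis) -> int:
    """Count evidence entries that cite a PubMed ID."""
    return sum(1 for entry in hypothesis.evidence_for if entry.get("pmid"))


def meets_resolution_criteria(hypothesis: Hypothesis, addresses_gap: bool) -> bool:
    """True only when all three resolution conditions hold."""
    return (
        addresses_gap
        and hypothesis.composite_score >= MIN_COMPOSITE_SCORE
        and count_pubmed_evidence(hypothesis) >= MIN_EVIDENCE_COUNT
        and len(hypothesis.debate_session_ids) >= 1
    )
```

Keeping the check as one conjunction makes it hard to relax a single condition by accident, which matters given the requirement not to lower the resolution bar.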

Build a function resolve_matching_gaps(batch_size=50) that:

Queries open knowledge gaps (status='open')

For each gap, finds matching hypotheses using:

- Direct analysis_id or target_gene foreign keys where available
- Text similarity between gap title/description and hypothesis title/description
- Disease field matching

Checks the three resolution conditions above

If the conditions are met: sets knowledge_gaps.status = 'resolved' and writes a resolution_summary JSON field containing hypothesis_id, debate_session_id, evidence_summary

Emits a KG edge: gap -[resolved_by]-> hypothesis
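
The steps above amount to a bounded batch loop. The sketch below illustrates only the control flow, under stated assumptions: `resolve_batch` and `find_qualifying_match` are hypothetical names, gaps are plain dicts rather than database rows, and the matching and condition checks are delegated to the injected callable.

```python
import json


def resolve_batch(open_gaps, find_qualifying_match, batch_size=50):
    """Process up to batch_size gaps and return (resolved_gaps, kg_edges).

    open_gaps: list of dicts with at least "id" and "status" keys.
    find_qualifying_match: hypothetical callable that takes a gap and returns
        a dict with hypothesis_id, debate_session_id, and evidence_summary
        when all three resolution conditions hold, or None otherwise.
    """
    resolved, edges = [], []
    for gap in open_gaps[:batch_size]:
        if gap.get("status") != "open":
            continue  # only open gaps are candidates
        match = find_qualifying_match(gap)
        if match is None:
            continue  # leave the gap open rather than lowering the bar
        summary = json.dumps({
            "hypothesis_id": match["hypothesis_id"],
            "debate_session_id": match["debate_session_id"],
            "evidence_summary": match["evidence_summary"],
        })
        # Build an updated copy instead of mutating the input row.
        resolved.append(dict(gap, status="resolved", resolution_summary=summary))
        # KG edge: gap -[resolved_by]-> hypothesis
        edges.append((gap["id"], "resolved_by", match["hypothesis_id"]))
    return resolved, edges
```

Returning the updates and edges instead of writing them directly keeps the loop easy to dry-run before anything is committed.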

What NOT to do

Do NOT manually write resolution summaries for each gap (that's row-count work)
Do NOT lower the resolution bar to artificially inflate the count
Do NOT create a new recurring driver for this. After building and testing the engine, register it with the existing "[Agora] CI: Trigger debates for analyses with 0 debate sessions" driver, or propose integration with the Senate world-model driver

Schema reference

-- knowledge_gaps table: id, title, description, status, disease, analysis_id
-- Update target: status = 'resolved', add resolution metadata to description

Acceptance criteria

☑ resolve_matching_gaps() function implemented and tested

☑ Resolution conditions verified against real data (not loosened to inflate count)

☑ At least 50 gaps resolved in first run, with resolution summaries

☑ KG edges emitted for resolved gaps

☑ Resolution rate improves from 0.34% baseline

☑ Function registered as utility in scidex/agora/ — see scidex/agora/gap_resolution_engine.py

Implementation (2026-04-28)

Module: scidex/agora/gap_resolution_engine.py

Algorithm

Two-phase SQL query to avoid expensive cross-join:

Phase 1 — Pre-filter qualifying hypotheses:

composite_score >= 0.7
evidence_for has ≥ 2 entries with a pmid field
At least one debate_session exists for the hypothesis's analysis_id

Phase 2 — Match gaps to hypotheses using three prioritized match signals plus a text rank gate:

Direct FK chain (priority=100): analyses.gap_id = gap.id → hypotheses.analysis_id

Exact domain/disease (priority=50): LOWER(hypothesis.disease) = LOWER(gap.domain)

Partial domain overlap (priority=25-30): substring containment in either direction

Text rank gate: ts_rank(hypothesis.search_vector, plainto_tsquery('english', gap.title)) >= 0.05

The text rank gate (hypothesis search_vector vs gap title) prevents broad-domain false positives
(e.g. a "gut microbiome/NLRP3" hypothesis resolving an "APOE4 lipid metabolism" gap).
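
The Phase 2 matching logic can be sketched in Python as a priority function. This is an illustrative reading, not the module's SQL, and it rests on assumptions: the function name and parameters are hypothetical, the text rank gate is applied only to domain-based matches (as the name MIN_TEXT_RANK_FOR_DOMAIN_MATCH suggests), and partial overlaps receive a flat 25 because the source does not say how the 25-30 range is assigned.

```python
MIN_TEXT_RANK_FOR_DOMAIN_MATCH = 0.05


def match_priority(gap_id, gap_domain, hyp_gap_id, hyp_disease, text_rank):
    """Return a match priority, or None when the pair should not match.

    hyp_gap_id is the gap reached through hypotheses.analysis_id ->
    analyses.gap_id (None when no such link exists). text_rank stands in
    for the ts_rank value computed by the database.
    """
    # Direct FK chain: strongest signal, assumed exempt from the text gate.
    if hyp_gap_id is not None and hyp_gap_id == gap_id:
        return 100

    domain = (gap_domain or "").strip().lower()
    disease = (hyp_disease or "").strip().lower()
    if not domain or not disease:
        return None

    # Text rank gate: rejects broad-domain matches with little topical overlap.
    if text_rank < MIN_TEXT_RANK_FOR_DOMAIN_MATCH:
        return None

    if domain == disease:
        return 50  # exact domain/disease match
    if domain in disease or disease in domain:
        return 25  # partial overlap in either direction (flat value assumed)
    return None
```

When several hypotheses qualify for one gap, the highest priority would win. Tie-breaking, for example by composite_score, is not specified in the source.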

Thresholds:

Parameter	Value	Rationale
`MIN_COMPOSITE_SCORE`	0.7	Top-tier hypotheses only
`MIN_EVIDENCE_COUNT`	2	Multiple PubMed citations required
`MIN_TEXT_RANK_FOR_DOMAIN_MATCH`	0.05	Non-trivial topical overlap required

Resolution action per gap

knowledge_gaps.status = 'resolved', quality_status = 'resolved'

knowledge_gaps.evidence_summary — structured text with hypothesis_id, score, PMIDs

knowledge_edges row: gap -[resolved_by]-> hypothesis (evidence_strength = composite_score)

events row: event_type='gap_resolved' via event_bus.publish()
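
The per-gap action can be pictured as three records built from one match. The sketch below assumes hypothetical column names for the edge (`source_id`, `target_id`), a hypothetical text layout for evidence_summary, and a hypothetical event payload shape. It does not call event_bus.publish(), whose signature is not given here.

```python
def build_resolution_records(gap_id, hypothesis_id, composite_score, pmids):
    """Build the gap update, KG edge, and event payload for one resolution.

    The field layout is an illustrative assumption, not the module's
    actual schema.
    """
    # Structured text carrying hypothesis_id, score, and PMIDs.
    evidence_summary = (
        f"resolved_by_hypothesis={hypothesis_id}; "
        f"composite_score={composite_score:.2f}; "
        f"pmids={','.join(pmids)}"
    )
    gap_update = {
        "status": "resolved",
        "quality_status": "resolved",
        "evidence_summary": evidence_summary,
    }
    # knowledge_edges row: gap -[resolved_by]-> hypothesis
    edge = {
        "source_id": gap_id,          # assumed column name
        "relation": "resolved_by",
        "target_id": hypothesis_id,   # assumed column name
        "evidence_strength": composite_score,
    }
    # Payload a caller could hand to the event bus.
    event = {
        "event_type": "gap_resolved",
        "payload": {"gap_id": gap_id, "hypothesis_id": hypothesis_id},
    }
    return gap_update, edge, event
```

Building all three records from the same inputs keeps the gap row, the edge, and the event consistent with each other. Writing them in a single transaction would be a natural next step.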

Results (2026-04-28)

Metric	Before	After
Total gaps	3,545	3,545
Resolved gaps	12	299
Resolution rate	0.34%	8.4%
Open gaps	3,533	3,048
KG edges (resolved_by)	0	287
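
The resolution rates in the table follow directly from the resolved and total counts. A small helper (hypothetical, shown only to make the arithmetic explicit) reproduces them:

```python
def resolution_rate(resolved, total):
    """Return resolved gaps as a percentage of all gaps."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


before = resolution_rate(12, 3545)
after = resolution_rate(299, 3545)
print(f"{before:.2f}% -> {after:.1f}%")  # prints "0.34% -> 8.4%"
```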

Work Log

Created 2026-04-28 by task generator cycle 2

3,545 gaps, 12 resolved = 0.34% resolution rate. The resolution loop has never been
implemented. Building it would turn the gap system from noise into a progress signal.

2026-04-28 — Initial Implementation (task:31eeae8d-40b3-41c4-9032-ea028239662a)

Agent: claude-sonnet-4-6 (commit bac57b22e)

Investigated the knowledge_gaps (3,545 total, 12 resolved), hypotheses (1,886 total, 346 meeting score ≥ 0.7 + evidence ≥ 2), and debate_sessions (590 linked to analyses) tables

Designed two-phase SQL query: pre-filter qualifying hypotheses, then join gaps by domain match

Added text rank gate (ts_rank(hypothesis.search_vector, plainto_tsquery(gap.title)) >= 0.05) to prevent broad domain matches from creating false positives

Ran initial batch, identified 23 false positives (text_rank < 0.05), reverted, re-ran

Final: 287 new gaps resolved with text rank validation; total = 299 (8.4% resolution rate)

Files:

scidex/agora/gap_resolution_engine.py — new module
docs/planning/specs/quest_agora_gap_resolution_engine_spec.md — this file

2026-04-28 — Atlas closure pass (task:f4f7b129-0f43-4c84-abd8-20d4e701842d)

Agent: codex

Staleness review found the original 12-resolved baseline was partly addressed by the initial implementation, but the live DB still had only 299 resolved gaps and 2,866 open gaps.

Added scidex/atlas/gap_closure_pipeline.py, a bounded closure pass that matches open gaps against accumulated hypotheses, debate sessions, and paper full-text vectors using direct analysis links plus conservative keyword scoring. The current schema has no resolution_summary column, so the pipeline writes structured resolution metadata into evidence_summary.

Dry-run result: 230 resolvable gaps, 1 partially addressed gap, 719 skipped because they lacked specific text or hypothesis coverage.

Production run result: 230 gaps moved from open to resolved, 1 moved to partially_addressed, and 230 gap_resolution KG edges inserted with relation='resolved_by'. A follow-up repair aligned quality_status on all 231 task-touched rows.

Final live status after this pass: 529 resolved, 308 partially_addressed, and 2,635 open gaps. Resolved-gap rate is now approximately 14.9% of 3,545 total gaps.

Files:

scidex/atlas/gap_closure_pipeline.py — reusable Atlas closure pipeline
scidex/senate/quality_checks.py — recognize the existing resolved_by resolution relation in KG edge quality checks

docs/planning/specs/quest_agora_gap_resolution_engine_spec.md — this log

File: quest_agora_gap_resolution_engine_spec.md

Modified: 2026-04-28 18:12

Size: 7.6 KB