SciDEX — Task: [Atlas] Mine open questions from 18,447 wiki pages

Heuristic+LLM miner that extracts open_question artifacts from wiki content_md with SimHash dedup and source_links.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (4)

Squash merge: orchestra/task/a4c450f7-biomni-analysis-parity-port-15-use-cases (87 commits) (#717) 2026-04-27

Squash merge: orchestra/task/b4cd6c06-mine-open-questions-from-18-447-wiki-pag (2 commits) (#652) 2026-04-27

[Atlas] Fix DB connection leak in mine_page; reuse one connection per question loop [task:b4cd6c06-626e-4fdd-a8be-35027858d328] 2026-04-27

[Atlas] Add open_question_miner_wiki: heuristic+LLM extraction from wiki pages [task:b4cd6c06-626e-4fdd-a8be-35027858d328] 2026-04-27

Spec File

Goal

Extract first-class open_question artifacts from the 18,447 existing artifact_type='wiki_page' records by parsing their content_md for explicit
question prompts ("It remains unclear...", "Whether X causes Y is unknown",
H2/H3 sections titled "Open questions" or "Unresolved", and bare interrogative
sentences ending in "?"). Today the only open_questions in artifacts come
from knowledge_gaps extraction (~hundreds), leaving the bulk of latent
question signal in our largest content corpus untapped. This task wires a
deterministic+LLM-graded pipeline so every wiki page contributes its open
questions to the per-field Elo leaderboards built by scidex/agora/open_question_tournament.py.

Acceptance Criteria

☐ New module scidex/agora/open_question_miner_wiki.py (≤700 LoC) with
mine_page(artifact_id) -> list[CandidateQuestion] and
run_corpus_mining(limit, since_updated_at) -> dict entrypoints.

☐ CLI: python -m scidex.agora.open_question_miner_wiki --batch 200.

☐ Heuristic extractor finds: explicit ## Open Questions / ## Unresolved
sections; sentences matching (?i)\b(remain[s]? (un)?(known|clear|resolved)|whether .{5,200}\?|it is (un)?known whether);
bare interrogatives ≥6 tokens ending in ? outside fenced code/quotes.

☐ LLM grader (uses scidex.core.llm.complete, model from
model_router.py cheap tier) labels each candidate with
{is_real_question, is_answerable_in_principle, field_tag,
tractability_score, potential_impact_score}, plus a dedup question_hash
computed via SimHash on the normalized question text.

☐ Inserts go through scidex.atlas.artifact_registry.register_open_question (or
a direct insert into artifacts with artifact_type='open_question' and
the recommended metadata fields listed at artifact_registry.py:92-95):

question_text, field_tag, importance_elo, evidence_summary, status,
      source_kind='wiki_page', source_id=<wiki_artifact_id>, question_hash

☐ source_id foreign-key check: every emitted question links back to
the originating wiki artifact via artifact_links (link_type=derived_from).

☐ Dedup: if question_hash matches an existing open_question whose
status NOT IN ('answered','retired'), append the new wiki page to its
derived_from links instead of creating a duplicate. Counter-test:
mining the same page twice produces zero new artifacts.

☐ Initial corpus run: process the 2,000 highest-edit-count wiki pages, write
summary JSON to data/scidex-artifacts/reports/openq_wiki_mining_<utc>.json
with counts per field, duplicate rate, LLM cost, and 10 random samples for
manual review.

☐ Pytest: tests/agora/test_open_question_miner_wiki.py covers heuristic
regex, LLM grader stub, dedup path, and source link creation.
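The heuristic extractor criterion above can be sketched roughly as follows. This is a minimal illustration, not the shipped module: the hedge regex is copied from the criterion, while `extract_candidates`, `SECTION_RE`, `FENCE_RE`, `ITEM_RE`, and the naive sentence splitter are hypothetical names and simplifications introduced here.

```python
import re

# Hedge pattern copied verbatim from the acceptance criterion.
HEDGE_RE = re.compile(
    r"(?i)\b(remain[s]? (un)?(known|clear|resolved)"
    r"|whether .{5,200}\?"
    r"|it is (un)?known whether)"
)
# Assumed helpers (not from the spec): H2/H3 section titles, fenced code,
# list-item prefixes, and a deliberately naive sentence splitter.
SECTION_RE = re.compile(r"(?im)^#{2,3}\s+(open questions|unresolved)\b")
FENCE_RE = re.compile(r"`{3}.*?`{3}", re.DOTALL)
ITEM_RE = re.compile(r"^([-*+]|\d+\.)\s+")
SENTENCE_RE = re.compile(r"[^.!?\n]+[.!?]")


def extract_candidates(content_md: str, min_tokens: int = 6) -> list[str]:
    """Return candidate open-question strings from one page's markdown."""
    text = FENCE_RE.sub("", content_md)  # ignore fenced code blocks
    candidates: list[str] = []
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(">"):  # skip blanks and block quotes
            continue
        if line.startswith("#"):
            # Track whether we are inside an "Open questions"/"Unresolved" section.
            in_section = bool(SECTION_RE.match(line))
            continue
        if in_section:
            # Every line inside such a section is treated as a candidate.
            candidates.append(ITEM_RE.sub("", line))
            continue
        for sentence in SENTENCE_RE.findall(line):
            sentence = sentence.strip()
            is_question = (
                sentence.endswith("?") and len(sentence.split()) >= min_tokens
            )
            if HEDGE_RE.search(sentence) or is_question:
                candidates.append(sentence)
    return candidates
```

The splitter breaks on every period, so abbreviations such as "e.g." would fragment sentences; a real extractor would likely need something sturdier before the LLM grading step.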

Approach

Read scidex/atlas/artifact_registry.py lines 1-200 and the
register_open_question function to learn the exact metadata contract.

Read scidex/agora/gap_pipeline.py for the existing gap-mining patterns
(cost throttling, batch scheduling, error logging).

Implement mine_page as: (a) regex/section pre-filter producing ≤30
candidates per page, (b) single LLM call per page batching all candidates
into one structured-JSON response, (c) emit via register_open_question.

Drive the 2k-page corpus run with a $3 cost ceiling (asserted via
scidex.exchange.cost_ledger).

Following the scidex-pubmed-pipeline.timer pattern, add a daily
scidex-openq-miner systemd timer scoped to wiki pages updated since the
last run (updated_at > last_high_water).
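Steps (a) and (b) of mine_page, together with the fallback mentioned in the work log, might look like the sketch below. It assumes a hypothetical `grade_fn` callable standing in for scidex.core.llm.complete (whose real signature is not shown in this spec), an assumed JSON response shape of one object per candidate, and placeholder default scores of 0.5.

```python
import json
from dataclasses import dataclass
from typing import Callable

MAX_CANDIDATES = 30  # per-page cap from step (a)


@dataclass
class CandidateQuestion:
    question_text: str
    # Defaults double as the heuristic fallback values (assumed, not from the spec).
    is_real_question: bool = True
    is_answerable_in_principle: bool = True
    field_tag: str = "unknown"
    tractability_score: float = 0.5
    potential_impact_score: float = 0.5


def grade_page(
    candidates: list[str], grade_fn: Callable[[str], str]
) -> list[CandidateQuestion]:
    """Grade all of a page's candidates with one batched call to grade_fn."""
    batch = candidates[:MAX_CANDIDATES]
    prompt = (
        "Grade each question; reply with a JSON list, one object per item.\n"
        + json.dumps(batch)
    )
    try:
        labels = json.loads(grade_fn(prompt))  # single call per page
        if not isinstance(labels, list) or len(labels) != len(batch):
            raise ValueError("unexpected grader response shape")
        graded = [
            CandidateQuestion(
                question_text=text,
                is_real_question=bool(label.get("is_real_question", True)),
                is_answerable_in_principle=bool(
                    label.get("is_answerable_in_principle", True)
                ),
                field_tag=str(label.get("field_tag", "unknown")),
                tractability_score=float(label.get("tractability_score", 0.5)),
                potential_impact_score=float(
                    label.get("potential_impact_score", 0.5)
                ),
            )
            for text, label in zip(batch, labels)
        ]
    except Exception:
        # Grader failed or returned junk: keep candidates with heuristic defaults.
        return [CandidateQuestion(question_text=q) for q in batch]
    # Drop candidates the grader rejected as not being real questions.
    return [g for g in graded if g.is_real_question]
```

Passing the grader in as a callable keeps the function easy to stub in tests, which matches the "LLM grader stub" coverage the acceptance criteria ask for.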

Dependencies

b2d85e76-51f3 — open_question artifact_type schema (done)
47ee9103-ccc0 — Elo tournament reads knowledge_gaps.importance_elo
scidex/agora/open_question_tournament.py — consumer of mined questions

Work Log

2026-04-27 — Implementation

Files created:

scidex/agora/open_question_miner_wiki.py (654 LoC) — main module
tests/agora/test_open_question_miner_wiki.py — 33 pytest tests (all passing)
tests/agora/__init__.py

Architecture notes:

Content source: wiki_pages.content_md (17,642 pages with content); wiki artifacts
in the artifacts table lack content_md, so the module queries wiki_pages directly.

mine_page(page_id) accepts both wp- (wiki_pages.id) and wiki- (artifact id) IDs.
For wiki-* IDs it joins artifacts → wiki_pages via slug matching.

SimHash: 64-bit, stored as 16-char hex in question_hash metadata field.
Near-duplicate threshold: Hamming distance ≤3 bits.

LLM grader: single call per page via scidex.core.llm.complete; falls back to
heuristic defaults on failure.

Dedup: loads all active question_hash values into memory once per batch run,
then checks each candidate before inserting.

artifact_links (derived_from): created when a matching wiki artifact exists
in the artifacts table (124 of 17,642 pages have one); otherwise source_id
is stored in metadata only, consistent with the existing open_question pattern.
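The SimHash and in-memory dedup notes above can be illustrated with a short sketch. The 64-bit width, 16-char hex encoding, and ≤3-bit Hamming threshold come from the notes; the normalization (lowercased alphanumeric word tokens), the use of BLAKE2b as the per-token hash, and the function names are assumptions made for illustration only.

```python
import hashlib
import re
from typing import Optional


def normalize(text: str) -> list[str]:
    """Assumed normalization: lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash64(text: str) -> str:
    """64-bit SimHash over word tokens, returned as 16-char hex."""
    weights = [0] * 64
    for token in normalize(text):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            value |= 1 << bit
    return f"{value:016x}"


def hamming(a: str, b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


def find_duplicate(
    question_hash: str, active_hashes: list[str], max_distance: int = 3
) -> Optional[str]:
    """Return the first active hash within max_distance bits, else None."""
    for existing in active_hashes:
        if hamming(question_hash, existing) <= max_distance:
            return existing
    return None
```

A linear scan over the in-memory hash list is the simplest reading of "loads all active question_hash values into memory once per batch run"; whether the module uses a scan or a bucketed index is not stated in the notes.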