Extract first-class open_question artifacts from the 18,447 existing
artifact_type='wiki_page' records by parsing their content_md for explicit
question prompts ("It remains unclear...", "Whether X causes Y is unknown",
H2/H3 sections titled "Open questions" or "Unresolved", and bare interrogative
sentences ending in "?"). Today the only open_questions in artifacts come
from knowledge_gaps extraction (~hundreds), leaving the bulk of latent
question signal in our largest content corpus untapped. This task wires a
deterministic+LLM-graded pipeline so every wiki page contributes its open
questions to the per-field Elo leaderboards built by
scidex/agora/open_question_tournament.py.

- scidex/agora/open_question_miner_wiki.py (≤700 LoC) with mine_page(artifact_id) -> list[CandidateQuestion] and run_corpus_mining(limit, since_updated_at) -> dict entrypoints.
- CLI: python -m scidex.agora.open_question_miner_wiki --batch 200.
- Heuristics: ## Open Questions / ## Unresolved section headings; the regex (?i)\b(remain[s]? (un)?(known|clear|resolved)|whether .{5,200}\?|it is (un)?known whether); and bare ? sentences outside fenced code/quotes.
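
The heuristics above could be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the hedge regex is copied from the criteria, while the function name find_candidates, the sentence splitter, and the rule that any heading closes an open-questions section are assumptions made for the example.

```python
import re

# Hedge regex copied verbatim from the acceptance criteria.
HEDGE_RE = re.compile(
    r"(?i)\b(remain[s]? (un)?(known|clear|resolved)"
    r"|whether .{5,200}\?"
    r"|it is (un)?known whether)"
)
# Assumed helpers: H2/H3 section titles, code fences, naive sentence splitting.
SECTION_RE = re.compile(r"^#{2,3}\s+(open questions|unresolved)\b", re.IGNORECASE)
FENCE_RE = re.compile(r"^\s*(`{3,}|~{3,})")
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.?!])\s+")


def find_candidates(content_md: str) -> list[str]:
    """Return candidate question sentences from one page's markdown."""
    candidates: list[str] = []
    in_fence = False
    in_question_section = False
    for line in content_md.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence  # toggle on opening/closing fence
            continue
        if in_fence or line.lstrip().startswith(">"):
            continue  # skip fenced code and block quotes
        if line.startswith("#"):
            # Any heading ends the previous section; only an H2/H3 titled
            # "Open Questions" or "Unresolved" opens a question section.
            in_question_section = bool(SECTION_RE.match(line))
            continue
        for sentence in SENTENCE_SPLIT_RE.split(line.strip()):
            sentence = sentence.lstrip("-* \t").rstrip()  # drop bullet markers
            if not sentence:
                continue
            if (
                in_question_section
                or sentence.endswith("?")
                or HEDGE_RE.search(sentence)
            ):
                candidates.append(sentence)
    return candidates
```

Everything inside a matching section is kept as a candidate here, on the assumption that the LLM grader filters out non-questions afterwards. The naive splitter will mis-split on abbreviations, so a real pre-filter would likely need something sturdier.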
- LLM grader (scidex.core.llm.complete, model from model_router.py cheap tier) labels each candidate with is_real_question, is_answerable_in_principle, field_tag, tractability_score, potential_impact_score, dedup question_hash.
- Questions are registered via scidex.atlas.artifact_registry.register_open_question (or […] artifacts with artifact_type='open_question' and the metadata listed at artifact_registry.py:92-95: question_text, field_tag, importance_elo, evidence_summary, status, source_kind='wiki_page', source_id=<wiki_artifact_id>, question_hash).
- source_id foreign-key check: every emitted question links back to its source wiki artifact via artifact_links (link_type=derived_from).
- If question_hash matches an existing open_question whose status NOT IN ('answered','retired'), append the new wiki to its derived_from links instead of creating a duplicate. Counter-test: […]
- Report: data/scidex-artifacts/reports/openq_wiki_mining_<utc>.json
- tests/agora/test_open_question_miner_wiki.py covers heuristic […]
- Read scidex/atlas/artifact_registry.py lines 1-200 + the register_open_question function to learn the exact metadata contract.
- Read scidex/agora/gap_pipeline.py for the existing gap-mining patterns.
- Implement mine_page as: (a) regex/section pre-filter producing ≤30 […] register_open_question.
- Cost ceiling: $3 (assert via scidex.exchange.cost_ledger).
- scidex-openq-miner […] updated_at > last_high_water).
- b2d85e76-51f3: open_question artifact_type schema (done)
- 47ee9103-ccc0: Elo tournament reads knowledge_gaps.importance_elo
- scidex/agora/open_question_tournament.py: consumer of mined questions

Files created:
- scidex/agora/open_question_miner_wiki.py (654 LoC): main module
- tests/agora/test_open_question_miner_wiki.py: 33 pytest tests (all passing)
- tests/agora/__init__.py

- Content is read from wiki_pages.content_md (17,642 pages with content); wiki artifacts in the artifacts table lack content_md, so the module queries wiki_pages directly.
- mine_page(page_id) accepts both wp- (wiki_pages.id) and wiki- (artifact id). For wiki-* IDs it joins artifacts → wiki_pages via slug matching.
- […] question_hash metadata field.
- […] scidex.core.llm.complete; falls back to […]
- […] question_hash values into memory once per batch run, […]
- artifact_links (derived_from): created when a matching wiki artifact exists in the artifacts table (124 of 17,642 pages have one); otherwise source_id […]
- […] sync_metadata table under key openq_miner_wiki_last_run.
- […] estimated_cost_usd >= cost_ceiling_usd ($3 default).

Acceptance criteria status:
- mine_page and run_corpus_mining entrypoints
- python -m scidex.agora.open_question_miner_wiki --batch 200
- […] question_hash
- register_artifact with all recommended metadata fields
- artifact_links derived_from created when wiki artifact exists
- […] data/scidex-artifacts/reports/
- […] --batch 2000 invocation; skipped
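
The question_hash dedup rule (merge into an existing question unless its status is 'answered' or 'retired') might look like the sketch below. The hashing scheme is an assumption: SHA-256 over a lowercased, punctuation-collapsed string is only one plausible choice, and the real hash is whatever the artifact registry defines. The helper names build_open_index and merge_or_create, and the in-memory dict standing in for artifact rows, are likewise hypothetical.

```python
import hashlib
import re

# Statuses that exempt an existing question from dedup, per the criteria.
CLOSED_STATUSES = {"answered", "retired"}


def question_hash(text: str) -> str:
    """Assumed normalization: lowercase, collapse non-alphanumerics, SHA-256."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_open_index(rows) -> dict[str, set[str]]:
    """Load hashes of still-open questions once per batch run.

    rows: iterable of (question_hash, status, source_ids) tuples,
    a hypothetical stand-in for a query over open_question artifacts.
    """
    return {
        qhash: set(source_ids)
        for qhash, status, source_ids in rows
        if status not in CLOSED_STATUSES
    }


def merge_or_create(open_index: dict[str, set[str]], text: str, wiki_id: str) -> bool:
    """Return True if a new question was created, False if merged."""
    key = question_hash(text)
    if key in open_index:
        # Existing open question: append this wiki as another derived_from source.
        open_index[key].add(wiki_id)
        return False
    open_index[key] = {wiki_id}
    return True
```

Keeping only open questions in the index means a question that was already answered or retired does not block a fresh one with the same hash, which is how the NOT IN ('answered','retired') condition reads.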