SciDEX — Task: [Atlas] Mine open questions from paper Discussion/

JATS section detector + LLM grader extracts open_questions from paper Discussion/Limitations/Future Work with PMID/DOI provenance.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/a4c450f7-biomni-analysis-parity-port-15-use-cases (87 commits) (#717) 2026-04-27

[Atlas] Mine open questions from paper Discussion/Limitations/Future-Work sections [task:864c03c1-2f40-44b6-985e-2053d5a2d178] (#665) 2026-04-27

Spec File

Goal

Published papers concentrate their open questions in Discussion, Limitations,
and Future Work sections. SciDEX has 15,000+ papers in papers and PMC
full-text in paper_cache. Most papers are never re-read after ingestion.
Extract their open questions into first-class open_question artifacts so
literature-grounded questions land in the per-field leaderboards alongside
internally-mined ones, with full PMID provenance.

Acceptance Criteria

☐ New module scidex/agora/open_question_miner_papers.py (≤600 LoC).

☐ Reads from papers joined to paper_cache.full_text_xml (PMC OA
full text where available), falling back to papers.abstract.

☐ Section detector recognizes JATS XML <sec sec-type="discussion">,

☐ LLM grader (cheap tier, JSON mode, batch ≤10 papers per call) returns
{question_text, field_tag, evidence_summary, page_anchor,
verbatim_excerpt, tractability_score, potential_impact_score}

☐ Each emitted open_question:

- metadata.source_kind='paper'
- metadata.source_id=<pmid>, metadata.source_doi, metadata.source_paper_id
- artifact_links row with link_type='derived_from' pointing to the paper
  artifact (or just store the PMID in metadata if paper artifacts are not
  yet seeded for that PMID)
- metadata.evidence_summary includes the verbatim excerpt + PMID for
  provenance display on the question detail page (api.py
  _render_open_question_detail at line ~26912).

☐ Throughput target: process the 500 highest-citation papers in the initial run
(SELECT pmid FROM papers ORDER BY cited_by_count DESC NULLS LAST LIMIT 500),
with a $5 cost ceiling enforced via scidex.exchange.cost_ledger.

☐ Dedup against existing open_question artifacts via the shared
_question_dedup.py SimHash util; expected dedup-rate ≥30% (papers
cite each other and ask similar questions).

☐ Pytest: covers JATS parsing, plain-text section detection, dedup, and
provenance metadata fields. Includes one fixture paper with no
discussion section (asserts no questions emitted).

☐ Output report data/scidex-artifacts/reports/openq_papers_<utc>.json
with counts per field, dup-rate, LLM cost, and 10 random samples.
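The section-detector criterion can be illustrated with a minimal sketch. It uses the standard-library xml.etree.ElementTree as a stand-in for lxml so that it runs without extra dependencies. The function name find_target_sections, the title keywords, and the returned dict shape are assumptions made for illustration, not the module's actual interface. The sec-type values are the ones named in the work log.

```python
import xml.etree.ElementTree as ET

# sec-type values named in the work log; matching on titles as well is an
# assumption, since many JATS files omit the sec-type attribute.
TARGET_SEC_TYPES = {"discussion", "conclusions", "limitations", "future-work", "outlook"}
TARGET_TITLE_WORDS = ("discussion", "limitation", "future work", "future directions", "outlook")


def _text(elem):
    """Flatten an element's text (including inline markup) and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def find_target_sections(jats_xml):
    """Return the <sec> elements likely to contain open questions.

    Hypothetical helper: each hit is a dict with sec_type, title, and the
    text of the section's direct <p> children.
    """
    root = ET.fromstring(jats_xml)
    found = []
    for sec in root.iter("sec"):
        sec_type = (sec.get("sec-type") or "").lower()
        title_el = sec.find("title")
        title = _text(title_el) if title_el is not None else ""
        title_hit = any(word in title.lower() for word in TARGET_TITLE_WORDS)
        if sec_type in TARGET_SEC_TYPES or title_hit:
            body = " ".join(_text(p) for p in sec.findall("p"))
            found.append({"sec_type": sec_type, "title": title, "text": body})
    return found
```

A paper whose XML has no matching section yields an empty list, which is the behavior the "no discussion section" fixture is meant to assert.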

Approach

Survey paper_cache schema in scidex/atlas/ to find the full_text_xml
column name and JATS structure.

Use lxml for JATS parsing (already a dep via biopython); regex fallback
for plain text.

Reuse _question_dedup.py from q-openq-mine-from-wiki-pages.

Schedule a systemd timer scidex-openq-papers.timer to incrementally
process new papers (papers.created_at > last_high_water).
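The regex fallback for plain text might look roughly like the sketch below. It treats a line as a heading when it consists only of an optional section number and a known section name, collects body lines under target headings, and stops at any other known heading. The heading vocabularies and the name split_plain_text_sections are assumptions for illustration. The actual module's patterns are not shown in this spec.

```python
import re

# Assumed vocabularies; a real detector would likely need a longer list.
TARGETS = {"discussion", "limitation", "limitations", "future work",
           "future directions", "conclusion", "conclusions", "outlook"}
STOPS = {"abstract", "introduction", "background", "methods",
         "materials and methods", "results", "references",
         "acknowledgements", "acknowledgments"}

# A heading line: optional numbering such as "4." or "4.2", then letters and
# spaces only, then an optional trailing colon.
_HEADING = re.compile(r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?(?P<name>[A-Za-z][A-Za-z ]*?)\s*:?\s*$")


def split_plain_text_sections(text):
    """Return (heading, body) pairs for target sections found in plain text."""
    sections = []
    current = None  # (name, lines) while inside a target section
    for line in text.splitlines():
        m = _HEADING.match(line)
        name = m.group("name").lower() if m else None
        if name in TARGETS:
            current = (name, [])
            sections.append(current)
        elif name in STOPS:
            current = None
        elif current is not None and line.strip():
            current[1].append(line.strip())
    # Drop target headings that had no body text under them.
    return [(name, " ".join(lines)) for name, lines in sections if lines]
```

Restricting headings to a fixed vocabulary trades recall for precision: unusual headings are missed, but ordinary sentences are unlikely to be mistaken for section boundaries.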

Dependencies

q-openq-mine-from-wiki-pages — dedup util
b2d85e76-51f3 — open_question schema

Work Log

2026-04-27: Implemented scidex/agora/open_question_miner_papers.py (779 LoC)

- JATS section detector via lxml (recognizes <sec sec-type="discussion">,
<sec sec-type="conclusions">, limitations, future-work, outlook)
- Plain-text section detection via regex on heading patterns
- Heuristic extractor: section headers, inline regex patterns, bare interrogatives
- LLM grader: batch grading via scidex.core.llm.complete, JSON mode, falls back to defaults
- SimHash dedup (64-bit, Hamming distance ≤3), same algorithm as wiki miner
- Registers via artifact_registry.register_artifact with source_kind='paper'
- Creates derived_from artifact_links when paper artifact exists
- Cost ceiling via estimated LLM cost tracking ($5 default)
- Report written to data/scidex-artifacts/reports/openq_papers_<utc>.json
- Added tests/agora/test_open_question_miner_papers.py (326 LoC, 22 passing tests)
- Covers JATS parsing, plain-text section detection, dedup, provenance metadata
- Fixed regex bug in _HEURISTIC_RE (unescaped ) in character class)
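
The dedup step in the log (64-bit SimHash, Hamming distance ≤3) can be sketched as follows. This is a generic SimHash over word tokens, not the code in _question_dedup.py: the tokenizer, the use of blake2b as the per-token hash, and the function names are all assumptions.

```python
import hashlib
import re


def simhash64(text):
    """Compute a 64-bit SimHash fingerprint over lowercase word tokens."""
    weights = [0] * 64
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # 64-bit hash per token; blake2b is an arbitrary choice for this sketch.
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(candidate, seen_fingerprints, max_distance=3):
    """True if the candidate is within max_distance bits of any seen fingerprint."""
    fp = simhash64(candidate)
    return any(hamming(fp, seen) <= max_distance for seen in seen_fingerprints)
```

Because tokens are lowercased and punctuation is dropped, two questions that differ only in case or punctuation map to the same fingerprint. A linear scan over seen fingerprints is the simplest option and may need an index for larger artifact sets.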

Sibling Tasks in Quest (Open Questions as Ranked Artifacts)

○ [Senate] Registry of papers that USED a SciDEX-generated hypothesis (P84)

○ [Atlas] Add pathway diagrams to 20 hypotheses missing mechanism maps (P83)

○ [Atlas] Link 25 wiki pages missing KG node mappings (P80)

○ [Atlas] Rank 25 analyses missing world-model impact scores (P80)

✓ [Atlas/UI] Per-field landing pages — aggregate landscape + gaps + open questions + proposals + experts at /science/ (P95)

✓ [Atlas/feat] open_question artifact_type — schema, populate, per-field views (P94)

✓ [Atlas/UI] open_question + proposal detail pages — render artifact_type=open_question and four proposal kinds with discussion + provenance tabs (P93)

[Atlas] Mine open questions from paper Discussion/Future-Work sections (15K papers) done