[Atlas] Mine open questions from paper Discussion/Future-Work sections (15K papers) done

Quest: Open Questions as Ranked Artifacts
A JATS section detector and an LLM grader extract open_question artifacts from paper Discussion, Limitations, and Future Work sections, with PMID/DOI provenance.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/a4c450f7-biomni-analysis-parity-port-15-use-cases (87 commits) (#717), 2026-04-27
[Atlas] Mine open questions from paper Discussion/Limitations/Future-Work sections [task:864c03c1-2f40-44b6-985e-2053d5a2d178] (#665), 2026-04-27

Spec File

Goal

Published papers concentrate their open questions in Discussion, Limitations,
and Future Work sections. SciDEX holds 15,000+ papers in the papers table, with
PMC full text in paper_cache. Most papers are never re-read after ingestion.
Extract their open questions into first-class open_question artifacts so
literature-grounded questions land in the per-field leaderboards alongside
internally-mined ones, with full PMID provenance.

Acceptance Criteria

☐ New module scidex/agora/open_question_miner_papers.py (≤600 LoC).
☐ Reads from papers joined to paper_cache.full_text_xml (PMC OA
full-text where available) or falls back to papers.abstract.
☐ Section detector recognizes JATS XML <sec sec-type="discussion">,
<sec sec-type="conclusions">, and headings matching
(?i)^(discussion|future (work|directions)|limitations|outlook|open questions)
in plain text.
☐ LLM grader (cheap tier, JSON mode, batch ≤10 papers per call) returns
{question_text, field_tag, evidence_summary, page_anchor,
verbatim_excerpt, tractability_score, potential_impact_score}.
☐ Each emitted open_question:
- metadata.source_kind='paper'
- metadata.source_id=<pmid>, metadata.source_doi, metadata.source_paper_id
- artifact_links row link_type='derived_from' pointing to the paper
artifact (or just store the PMID in metadata if paper artifacts not
yet seeded for that PMID)
- metadata.evidence_summary includes the verbatim excerpt + PMID for
provenance display in the question detail page (api.py
_render_open_question_detail at line ~26912).
☐ Throughput target: process 500 highest-citation papers in initial run
(SELECT pmid FROM papers ORDER BY cited_by_count DESC NULLS LAST LIMIT 500),
cost ceiling $5 enforced via scidex.exchange.cost_ledger.
☐ Dedup against existing open_question artifacts via the shared
_question_dedup.py SimHash util; expected dedup-rate ≥30% (papers
cite each other and ask similar questions).
☐ Pytest: covers JATS parsing, plain-text section detection, dedup, and
provenance metadata fields. Includes one fixture paper with no
discussion section (asserts no questions emitted).
☐ Output report data/scidex-artifacts/reports/openq_papers_<utc>.json
with counts per field, dup-rate, LLM cost, and 10 random samples.
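
The section-detection criterion above can be sketched as follows. This is a minimal illustration, not the shipped module: it uses the standard-library xml.etree.ElementTree in place of lxml to stay dependency-free, and the function names jats_sections and plain_text_sections, as well as the "short line without trailing punctuation" heading heuristic, are assumptions made for the example. Only the sec-type values and the heading regex come from the spec.

```python
import re
import xml.etree.ElementTree as ET

# sec-type values named in the acceptance criteria.
JATS_SEC_TYPES = {"discussion", "conclusions"}

# Heading pattern quoted from the acceptance criteria.
HEADING_RE = re.compile(
    r"(?i)^(discussion|future (work|directions)|limitations|outlook|open questions)"
)


def jats_sections(xml_text: str) -> list[tuple[str, str]]:
    """Return (label, text) for each <sec> matched by sec-type or title.

    Nested matches (e.g. a Limitations <sec> inside Discussion) are
    returned separately, so their text overlaps with the parent's.
    """
    root = ET.fromstring(xml_text)
    found = []
    for sec in root.iter("sec"):
        sec_type = (sec.get("sec-type") or "").lower()
        title = (sec.findtext("title") or "").strip()
        if sec_type in JATS_SEC_TYPES or HEADING_RE.match(title):
            # Flatten all descendant text and collapse whitespace.
            text = " ".join(" ".join(sec.itertext()).split())
            found.append((sec_type or title.lower(), text))
    return found


def _looks_like_heading(line: str) -> bool:
    # Assumed heuristic: a heading is a short line with no closing punctuation.
    return 0 < len(line) <= 60 and line[-1] not in ".?!,;:"


def plain_text_sections(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) for target sections in plain text.

    A section starts at a heading-like line matching HEADING_RE and runs
    until the next heading-like line of any kind.
    """
    sections = []
    label, body = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if _looks_like_heading(line):
            if label is not None:
                sections.append((label, " ".join(body)))
            label = line.lower() if HEADING_RE.match(line) else None
            body = []
        elif label is not None and line:
            body.append(line)
    if label is not None:
        sections.append((label, " ".join(body)))
    return sections
```

A paper with no matching section yields an empty list from both functions, which is the behaviour the "no discussion section" fixture is meant to assert.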

Approach

  • Survey paper_cache schema in scidex/atlas/ to find the full_text_xml
    column name and JATS structure.
  • Use lxml for JATS parsing (already a dep via biopython); regex fallback
    for plain text.
  • Reuse _question_dedup.py from q-openq-mine-from-wiki-pages.
  • Schedule a systemd timer scidex-openq-papers.timer to incrementally
    process new papers (papers.created_at > last_high_water).
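
The incremental step in the last bullet could look roughly like the sketch below. It assumes a DB-API connection with qmark parameters (sqlite3 is used here so the example runs standalone), that created_at is stored as a sortable ISO-8601 string, and that the caller persists the returned mark somewhere between timer runs. The function name fetch_new_papers is hypothetical; only papers, pmid, created_at, and last_high_water come from the spec.

```python
import sqlite3


def fetch_new_papers(conn, last_high_water: str, limit: int = 500):
    """Return (rows, new_high_water) for papers created after the mark.

    Rows are (pmid, created_at), oldest first. If nothing is new, the
    mark is returned unchanged so repeated runs are idempotent.
    Caveat: if several papers share one created_at value and the LIMIT
    cuts between them, the strict '>' comparison skips the remainder on
    the next run; a (created_at, pmid) cursor would avoid that.
    """
    rows = conn.execute(
        "SELECT pmid, created_at FROM papers "
        "WHERE created_at > ? ORDER BY created_at ASC LIMIT ?",
        (last_high_water, limit),
    ).fetchall()
    new_mark = rows[-1][1] if rows else last_high_water
    return rows, new_mark
```

Each timer invocation would call this once with the stored mark, mine the returned papers, and write back the new mark only after the batch succeeds.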

Dependencies

  • q-openq-mine-from-wiki-pages: dedup util
  • b2d85e76-51f3: open_question schema

Work Log

  • 2026-04-27: Implemented scidex/agora/open_question_miner_papers.py (779 LoC)
    - JATS section detector via lxml (recognizes <sec sec-type="discussion">,
      <sec sec-type="conclusions">, limitations, future-work, outlook)
    - Plain-text section detection via regex on heading patterns
    - Heuristic extractor: section headers, inline regex patterns, bare interrogatives
    - LLM grader: batch grading via scidex.core.llm.complete, JSON mode, falls back to defaults
    - SimHash dedup (64-bit, Hamming distance ≤3), same algorithm as wiki miner
    - Registers via artifact_registry.register_artifact with source_kind='paper'
    - Creates derived_from artifact_links when paper artifact exists
    - Cost ceiling via estimated LLM cost tracking ($5 default)
    - Report written to data/scidex-artifacts/reports/openq_papers_<utc>.json
    - Added tests/agora/test_open_question_miner_papers.py (326 LoC, 22 passing tests)
      covering JATS parsing, plain-text section detection, dedup, provenance metadata
    - Fixed regex bug in _HEURISTIC_RE (unescaped ")" in character class)
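
For readers unfamiliar with the dedup step, a 64-bit SimHash with a Hamming-distance threshold of 3 can be sketched as below. The fingerprint width and threshold are taken from the work log; the tokenizer, the use of blake2b as the per-token hash, and the function names are assumptions for illustration and may differ from what _question_dedup.py actually does.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Compute a 64-bit SimHash over lowercase alphanumeric tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    weights = [0] * 64
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_duplicate(candidate: int, seen: list[int], max_distance: int = 3) -> bool:
    """True if candidate is within max_distance bits of any seen fingerprint."""
    return any(hamming(candidate, fp) <= max_distance for fp in seen)
```

Questions that differ only in case or punctuation tokenize identically and so collide at distance 0, while unrelated questions land far apart with overwhelming probability. The linear scan in is_duplicate is fine for a sketch; a production index would typically bucket fingerprints by bit blocks.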
