[Agora] Mine open questions from debate transcripts (Skeptic + Synthesizer) done

Quest: Open Questions as Ranked Artifacts
Extract residual open questions from debate_sessions/analysis_sessions; idempotent via mined_open_questions_at column; cross-link to parent hypothesis.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Agora] Mine open questions from debate transcripts [task:9b3c6b05-d860-46e2-8317-b268348b5be3] (#714) 2026-04-27

Spec File

Goal

Multi-persona debates produce structured Skeptic critiques and Synthesizer
summaries that are dense with phrases like "we cannot resolve X without Y" or
"open question Z would discriminate between H1 and H2". Today these die in
the debate transcript with no first-class artifact representation. Mine debate
sessions for residual open questions so they land in the Q-OPENQ leaderboards
and become discussable, rankable artifacts cross-linked to the debate that
spawned them.

Acceptance Criteria

☐ New module scidex/agora/open_question_miner_debates.py (≤500 LoC).
☐ Reads from debate_sessions and analysis_sessions PostgreSQL tables;
restricts to sessions where synthesizer_output IS NOT NULL and
mined_open_questions_at IS NULL (new column added by this task).
☐ Migration migrations/q_openq_debate_mined_marker.sql adds
mined_open_questions_at TIMESTAMPTZ NULL to both
debate_sessions and analysis_sessions plus a partial index
WHERE mined_open_questions_at IS NULL.
☐ LLM extractor produces structured {question_text, field_tag,
tractability_score, potential_impact_score, supporting_persona,
verbatim_quote_offset} records by passing the synthesizer_output and
top-3 Skeptic turns into a single completion using
scidex.core.llm.complete (cheap tier, JSON-mode).
☐ Each emitted open_question artifact:
- has metadata.source_kind='debate_session',
metadata.source_id=<session_id>;
- has an artifact_links row with link_type='derived_from' pointing
at the debate artifact (when one exists) and a second row
link_type='counter_evidence_for' toward the hypothesis the debate
was about;
- inherits field_tag from the parent hypothesis when the LLM is unsure
(fallback rule).
☐ After successful mining, mined_open_questions_at = NOW() is set so
reruns are idempotent.
☐ One-time backfill over the existing ~310 hypothesis debates emits
≥150 new open_question artifacts; assert via SQL count and dedup pass.
☐ Pytest covering: empty synthesizer_output (no-op), idempotent rerun,
cross-link insertion, and dedup against question_hash produced by
task q-openq-mine-from-wiki-pages.
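
The extractor criterion and the field_tag fallback rule above can be sketched as a small validation step that runs on the model's reply. This is a minimal sketch, not the shipped miner: it assumes the reply is a JSON array of objects using the spec's field names, that scores fall in [0, 1], and that values such as "unsure" mark an uncertain field_tag. The names MinedQuestion, parse_llm_records, and _clamp01 are hypothetical, and the call to scidex.core.llm.complete is left out because its signature is not shown here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinedQuestion:
    """One structured record, using the field names listed in the spec."""
    question_text: str
    field_tag: str
    tractability_score: float
    potential_impact_score: float
    supporting_persona: str
    verbatim_quote_offset: Optional[int]


def _clamp01(value, default=0.5):
    """Coerce a score to a float in [0, 1]; use `default` if it is unusable."""
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return default


def parse_llm_records(raw_json, parent_field_tag):
    """Turn the model's JSON reply into validated MinedQuestion records.

    Assumption: the reply is a JSON array of objects. A malformed reply
    yields an empty list instead of raising, so one bad completion cannot
    abort a whole batch.
    """
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, list):
        return []

    records = []
    for item in payload:
        if not isinstance(item, dict):
            continue
        text = str(item.get("question_text") or "").strip()
        if not text:
            continue  # a record without a question carries no information
        # Fallback rule: inherit the parent hypothesis' tag when the model
        # gave nothing usable ("unsure"/"unknown" sentinels are assumed).
        tag = str(item.get("field_tag") or "").strip()
        if not tag or tag.lower() in {"unsure", "unknown"}:
            tag = parent_field_tag
        offset = item.get("verbatim_quote_offset")
        records.append(MinedQuestion(
            question_text=text,
            field_tag=tag,
            tractability_score=_clamp01(item.get("tractability_score")),
            potential_impact_score=_clamp01(item.get("potential_impact_score")),
            supporting_persona=str(item.get("supporting_persona") or "synthesizer"),
            verbatim_quote_offset=offset if isinstance(offset, int) else None,
        ))
    return records
```

Validating after the completion, rather than trusting JSON-mode alone, keeps the fallback rule in one place and makes the empty-output case a plain empty list.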

Approach

  • Read scidex/agora/synthesis_engine.py and the debate-session schema in
    scidex/core/database.py to understand session structure and where
    synthesizer_output lives (jsonb).
  • Pattern after scidex/agora/extraction_quality.py for the
    batch-process-and-mark-completed shape.
  • Reuse the question_hash + dedup helper from
    scidex.agora.open_question_miner_wiki (shared util in
    scidex/agora/_question_dedup.py).
  • Wire a one-shot backfill script scripts/backfill_openq_from_debates.py
    that the task itself runs; record outputs in
    data/scidex-artifacts/reports/.
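
The hash-based dedup step mentioned above can be illustrated as follows. The real question_hash helper lives in the wiki miner and its normalization is not shown in this spec, so the scheme below (lowercase, strip punctuation, collapse whitespace, SHA-256) is an assumption, and sketch_question_hash and drop_known_questions are hypothetical names.

```python
import hashlib
import re


def sketch_question_hash(text):
    """Hash a question after light normalization (assumed scheme)."""
    # Lowercase, replace anything that is not a letter, digit, or space,
    # then collapse runs of whitespace so trivial variants hash alike.
    normalized = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_known_questions(candidates, existing_hashes):
    """Return candidates whose hash is not already known, preserving order.

    `existing_hashes` plays the role of hashes loaded from earlier miners;
    hashes of accepted candidates are added so the batch dedups itself too.
    """
    seen = set(existing_hashes)
    fresh = []
    for text in candidates:
        digest = sketch_question_hash(text)
        if digest in seen:
            continue
        seen.add(digest)
        fresh.append(text)
    return fresh
```

Exact hashing only catches variants that normalize to the same string; paraphrases would need a separate near-duplicate check.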

Dependencies

  • q-openq-mine-from-wiki-pages: provides _question_dedup.py shared util
  • b2d85e76-51f3: open_question artifact schema

Work Log

2026-04-27: Implementation (task:9b3c6b05)

Files created:

  • migrations/q_openq_debate_mined_marker.sql: adds mined_open_questions_at TIMESTAMPTZ NULL to debate_sessions and analyses (spec called it analysis_sessions; actual table is analyses); partial indexes on both; migration applied live.
  • scidex/agora/open_question_miner_debates.py: 300 LoC miner; imports question_hash, _load_existing_hashes, _is_near_duplicate from open_question_miner_wiki; extracts synthesizer turn and top-3 skeptic turns from debate_sessions.transcript_json; passes to LLM (scidex.core.llm.complete) in JSON-mode; registers open_question artifacts via artifact_registry.register_artifact; creates derived_from link to debate artifact and counter_evidence_for link to hypothesis; stamps mined_open_questions_at = NOW() for idempotency; CLI: python -m scidex.agora.open_question_miner_debates --batch 500.
  • tests/test_open_question_miner_debates.py: 23 tests; all passing; covers: empty synthesizer (no-op + marks mined), idempotent rerun (already-mined skipped), dedup via exact question_hash, cross-link insertion, transcript parsing helpers, heuristic fallback.
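
The select-unmined, process, then stamp pattern behind the idempotency guarantee can be sketched as below. This is an illustration under stated assumptions, not the shipped module: an in-memory SQLite database stands in for PostgreSQL so the sketch runs anywhere, the column is TEXT rather than TIMESTAMPTZ, CURRENT_TIMESTAMP replaces NOW(), and make_demo_db, mine_batch, and the index name are hypothetical.

```python
import sqlite3

# SQLite stand-in for the PostgreSQL schema. The partial index mirrors the
# migration's "WHERE mined_open_questions_at IS NULL" so that lookups for
# pending sessions stay cheap as the mined set grows.
SCHEMA = """
CREATE TABLE debate_sessions (
    id INTEGER PRIMARY KEY,
    transcript_json TEXT,
    mined_open_questions_at TEXT NULL
);
CREATE INDEX idx_debate_sessions_unmined
    ON debate_sessions (id)
    WHERE mined_open_questions_at IS NULL;
"""


def make_demo_db(n_sessions=3):
    """Create an in-memory database with a few unmined sessions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO debate_sessions (transcript_json) VALUES (?)",
        [("[]",) for _ in range(n_sessions)],
    )
    conn.commit()
    return conn


def mine_batch(conn, batch, mine_one):
    """Process up to `batch` unmined sessions and stamp each one.

    `mine_one(session_id, transcript_json)` is a caller-supplied callback.
    A session is stamped only after the callback returns, so a failure
    leaves it eligible for the next run.
    """
    rows = conn.execute(
        "SELECT id, transcript_json FROM debate_sessions "
        "WHERE mined_open_questions_at IS NULL ORDER BY id LIMIT ?",
        (batch,),
    ).fetchall()
    for session_id, transcript_json in rows:
        mine_one(session_id, transcript_json)
        conn.execute(
            "UPDATE debate_sessions "
            "SET mined_open_questions_at = CURRENT_TIMESTAMP WHERE id = ?",
            (session_id,),
        )
    conn.commit()
    return len(rows)
```

Calling mine_batch twice on the same database processes every session on the first call and none on the second, which is the behavior the idempotent-rerun test checks for. Sessions with an empty synthesizer turn are stamped as well, matching the "no-op + marks mined" case.
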
Note on spec vs. reality:

  • The analysis_sessions table does not exist; the actual table is analyses, so the migration targets analyses instead.
  • synthesizer_output is not a standalone column; it lives inside transcript_json as the round where persona == 'synthesizer'. The miner extracts it with _extract_synthesizer_output(transcript).
  • The _question_dedup.py shared util was not created; the wiki miner's helpers are imported directly to avoid churn on existing code.
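
The transcript parsing described in the notes above might look roughly like this. It is a sketch under assumptions: the transcript is taken to be a list of dicts with "persona" and "content" keys, and "top-3" Skeptic turns are ranked by length because the ranking rule is not stated here. Both function names are hypothetical and are not the miner's actual helpers.

```python
def extract_synthesizer_output_sketch(transcript):
    """Return the content of the last synthesizer round, or None.

    Assumes each round is a dict with "persona" and "content" keys.
    """
    for turn in reversed(transcript):
        if str(turn.get("persona", "")).lower() == "synthesizer":
            return turn.get("content") or None
    return None


def top_skeptic_turns_sketch(transcript, k=3):
    """Return up to `k` Skeptic turns, longest first.

    Ranking by length is an assumption standing in for whatever rule the
    miner uses; ties keep their transcript order because sorted() is stable.
    """
    skeptic = [
        turn["content"]
        for turn in transcript
        if str(turn.get("persona", "")).lower() == "skeptic"
        and turn.get("content")
    ]
    return sorted(skeptic, key=len, reverse=True)[:k]
```

Returning None when no synthesizer round exists gives the caller a single check for the no-op path.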
