[Agora] Debate transcript causal claim extractor — mine 835 debates for mechanistic KG edges

**Problem:** SciDEX has 835 debate sessions containing rich mechanistic reasoning — Theorist, Skeptic, Expert, and Synthesizer personas argue causal claims of the form "Gene A activates Pathway B leading to Phenotype C". This content is scientifically valuable but entirely unextracted: the KG grows from papers and analyses, not from the platform's own debates.

**Goal:** Extract structured causal claims from the top 200 debate transcripts (by num_hypotheses_generated + num_rounds) and add them as mechanistic KG edges.

**Implementation:**

1. Fetch top 200 debates: `SELECT id, question, transcript FROM debate_sessions WHERE transcript IS NOT NULL ORDER BY num_rounds DESC, num_hypotheses_generated DESC LIMIT 200` (check the actual column name for the transcript — it may be `debate_log`, `content`, or `turns_json`)
2. For each debate, use an LLM to extract causal triples: (subject_entity, relationship, object_entity, confidence, supporting_quote)
   - Focus on: ACTIVATES, INHIBITS, CAUSES, PREVENTS, BIOMARKER_FOR, THERAPEUTIC_TARGET_FOR relationships
   - Only extract claims that appeared in ≥2 debate turns (consensus threshold)
3. Resolve entities against existing KG nodes: `SELECT id, name FROM knowledge_nodes WHERE name ILIKE %%`
4. Insert new edges: `INSERT INTO knowledge_edges (source_id, target_id, relationship, confidence, evidence_type, provenance) VALUES (..., ..., ..., , 'debate_extracted', 'debate_session:')`
5. Target: ≥2,000 new mechanistic edges from debate content
6. Deduplicate: skip edges where source_id + target_id + relationship already exists

**Use `scidex.core.database.get_db()` and `llm.py` for extraction. Commit edges in batches of 100.**

**Expected outcome:** 2,000+ new mechanistic edges in the KG sourced from debate transcripts, enriching the world model with the platform's own reasoning output.
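The consensus threshold in step 2 can be expressed as a pure filter over per-turn extractions. A minimal sketch, assuming each turn has already been reduced by the LLM to (subject, relationship, object) triples; the `Triple` alias and `consensus_triples` name are illustrative, not existing SciDEX code:

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject_entity, relationship, object_entity)

def consensus_triples(turns: List[List[Triple]], min_turns: int = 2) -> List[Triple]:
    """Keep only causal triples asserted in at least `min_turns` distinct turns."""
    seen: Counter = Counter()
    for turn in turns:
        for triple in set(turn):  # count each claim at most once per turn
            seen[triple] += 1
    return [t for t, n in seen.items() if n >= min_turns]

turns = [
    [("SNCA", "ACTIVATES", "autophagy impairment")],
    [("SNCA", "ACTIVATES", "autophagy impairment"),
     ("GBA1", "INHIBITS", "GCase activity")],
    [("GBA1", "INHIBITS", "GCase activity")],
]
print(sorted(consensus_triples(turns)))  # both triples appear in 2 turns, so both survive
```

In the real pipeline the confidence and supporting_quote fields travel alongside each triple; they are dropped here to keep the consensus logic visible.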

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (5)

2026-04-28 — Squash merge: orchestra/task/72f50712-debate-transcript-causal-claim-extractor (2 commits) (#1224)
2026-04-28 — [Agora] Verify debate causal extractor — 2,020+ edges hit milestone [task:72f50712-2194-475b-86bd-45cb6df6b175]
2026-04-28 — [Agora] Add debate transcript causal claim extractor [task:72f50712-2194-475b-86bd-45cb6df6b175]
2026-04-28 — [Agora] Add debate transcript causal claim extractor [task:72f50712-2194-475b-86bd-45cb6df6b175]
2026-04-28 — [Senate] Create ambitious quest task generator spec + 6 strategic tasks [task:80ffb77b-8391-493c-8644-37086c8e2e3c] (#1211)
Spec File

Goal

Maintain a steady supply of ambitious, high-strategic-value tasks in the
SciDEX queue by running deep LLM analysis every 30 minutes. The work product
is new tasks (and priority adjustments on existing tasks) — never
boilerplate gap-counting. Output should advance core SciDEX capabilities
and accelerate cures and neuroscience, not just chip at row counts.

This spec deliberately replaces scripts/quest_engine.py and the prior
quest-engine-ci.md template-driven approach, both of which produced
filler one-shots that mirrored existing recurring CI work without adding
strategic leverage. The retired approach is documented for context only —
do not run it.

> ## Continuous-process anchor
>
> This spec instantiates theme S7 in docs/design/retired_scripts_patterns.md
> ("Quest task generation"). Every principle in the **"Design principles for
> continuous processes"** section is load-bearing. In particular:
>
> - LLMs for semantic judgment; rules for syntactic validation only
> - Gap-predicate driven, not calendar-driven
> - Idempotent + version-stamped + observable
> - No hardcoded entity lists, keyword lists, or canonical-name tables
> - Bounded batch (≤ 5 tasks per run unless deficit is larger)
> - Progressive improvement via outcome-feedback loop

The invariant this task enforces

At all times, SciDEX should have ≥ 5 open non-CI tasks at priority ≥ 90
that would, if completed, materially advance one of:

  • Core capability: scoring, debate, market mechanisms, KG quality,
    tool reliability, governance, agent orchestration
  • Scientific output: hypothesis quality, evidence chains, falsifiable
    predictions, experiment proposals, target validation for neurodegeneration
  • System value: SciDEX's role as a machine for prioritizing,
    organizing, synthesizing, inventing, funding, and rewarding science

    Gap predicate (run this first — if false, mostly-no-op):

    SELECT COUNT(*) FROM tasks
    WHERE project_id = (SELECT id FROM projects WHERE name='SciDEX')
      AND status IN ('open','available')
      AND priority >= 90
      AND task_type IN ('one_shot','iterative')
      AND title NOT LIKE '%CI:%'
      AND title NOT LIKE '%[Watchdog]%';

    If COUNT(*) >= 5: skip generation and run only the priority audit step
    (below). If COUNT(*) < 5: generate between the deficit (5 - COUNT(*)) and
    min(10, deficit * 2) new tasks at priority ≥ 90.
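Read literally, the sizing rule produces between `deficit` and `min(10, deficit * 2)` new tasks per cycle. A minimal sketch of the upper bound as a pure function; `tasks_to_create` is a hypothetical helper, and choosing a count inside the range is left to the agent:

```python
def tasks_to_create(open_high_priority: int, invariant: int = 5, hard_cap: int = 10) -> int:
    """Upper bound on new priority >= 90 tasks to generate this cycle."""
    deficit = invariant - open_high_priority
    if deficit <= 0:
        return 0  # invariant satisfied: skip generation, run only the priority audit
    return min(hard_cap, deficit * 2)

print(tasks_to_create(7))  # → 0 (queue healthy, no-op)
print(tasks_to_create(2))  # → 6 (deficit 3, doubled)
print(tasks_to_create(0))  # → 10 (deficit 5, capped)
```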

    What "ambitious" means here (the bar)

    A task qualifies as ambitious if at least two of:

  • It targets a capability SciDEX doesn't have yet, or one whose current
    implementation is a known-thin scaffold (not a row-count backfill).
  • It would, on completion, change the value of one of SciDEX's measurable
    outputs (debate quality, market signal, hypothesis novelty, KG density,
    reproducibility, time-to-first-experiment, agent throughput).
  • It frames a scientific question the system itself should answer
    (cross-disease mechanistic synthesis, novel target proposal with
    falsifiable prediction, paper-claim contradiction surface, etc.).
  • It builds a feedback loop / meta-mechanism (rubric versioning, tournament
    driver, reward-eligibility check, calibration meta-job).
  • It improves cross-layer integration (Atlas ↔ Exchange, Agora ↔ Senate,
    Forge ↔ Atlas) where today the layers are loosely coupled.

    A task is filler (do not create) if it is any of:

    • "Backfill column X for N rows" where a recurring CI task already covers
    this (see §"Recurring tasks already covering common gaps" below).
    • "Add references to N wiki pages" / "score N hypotheses" / "extract claims
    from N papers" — these mirror existing every-6h drivers; the right
    intervention is unsticking the driver, not creating a one-off chip.
    • Anything whose acceptance is "process N rows" without naming a capability
    improvement that wouldn't happen otherwise.
    • Anything that's a wrapper around a Senate dedup/cleanup activity that
    forces destructive action regardless of judgment.

    What the agent does each cycle (xhigh effort)

    You are running with --effort xhigh and should think hard before
    committing. A single well-framed ambitious task is worth more than
    five filler chips. Spend most of the cycle reading and synthesizing.

    Phase 1 — State snapshot (read-only, ~5 min)

    Gather and summarize:

  • Queue state via Orchestra MCP (list_tasks with project=SciDEX):
    - Total open one-shot/iterative count
    - Count at priority ≥ 90, ≥ 95, ≥ 99 (excluding CI: and Watchdog tasks)
    - Top 25 highest-priority open tasks: id, title, priority, age, quest_id
    - Top 10 oldest open tasks (for staleness review)
    - Last 30 completed tasks (last 24h): id, title, layer, completion_summary

  • Recent run outcomes (last 24-48h):
    - How many tasks completed vs abandoned vs still-running
    - Which quests/layers shipped real PRs (check pr_links_json,
      commit_links_json, merge_verified_at)
    - Which agents/models were most productive (group by assigned_worker)

  • Active quests (Orchestra quests table):
    - For each active quest: name, layer, priority, current open-task count
    - Read the quest's spec file under docs/planning/specs/quest_<layer>_spec.md
      and any quest_<layer>_<topic>_spec.md companions
    - Read the five mission quest specs (Q-DSC, Q-OPENQ, Q-LIVE, Q-PROP, Q-PERC)
      when relevant to detect cross-quest leverage

  • Recurring CI health (critical — distinguishes filler from ambitious):

        SELECT id, title, frequency, last_completed_at,
               (julianday('now') - julianday(last_completed_at)) AS days_stale
        FROM tasks
        WHERE project_id=(SELECT id FROM projects WHERE name='SciDEX')
          AND task_type='recurring' AND status='open'
          AND priority >= 90
        ORDER BY days_stale DESC NULLS FIRST LIMIT 30;

    - If a recurring driver is stale > 24h, **a one-off chip at the same gap
    is filler.** Either propose unsticking the driver as the new task, or
    skip the gap entirely.
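The staleness rule above can be sketched as a tiny routing function. An illustrative sketch with assumed names; only the 24h threshold comes from the spec:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def driver_intervention(last_completed_at: Optional[datetime], now: datetime,
                        stale_after: timedelta = timedelta(hours=24)) -> str:
    """Route a gap that a recurring driver already owns.

    Healthy driver (ran within `stale_after`): skip the gap entirely.
    Stale or never-ran driver: frame the new task as unsticking the driver.
    A one-off chip at the gap itself is filler in both cases.
    """
    if last_completed_at is not None and now - last_completed_at <= stale_after:
        return "skip"
    return "unstick_driver"

now = datetime(2026, 4, 28, 12, 0, tzinfo=timezone.utc)
print(driver_intervention(now - timedelta(hours=3), now))  # → skip
print(driver_intervention(now - timedelta(days=3), now))   # → unstick_driver
```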

  • SciDEX world-model state via PostgreSQL (scidex.core.database).
    Polymorphic via information_schema — never hardcode column lists.
    Useful signals:
    - Hypothesis: count by status, score-completeness distribution, novelty
    histogram, evidence-link density
    - Papers: ingestion rate, abstract/fulltext/figures coverage, claims-extracted
    - Wiki: page count by content_md length percentile, refs_json density,
    KG-linkage rate, mermaid-diagram coverage
    - Markets: volume distribution, stale-resolution backlog, allocator activity
    - Debates: sessions per hypothesis, scored-vs-unscored, recency
    - KG: edges by type, low-confidence edge count, orphan node count
    - Forge tools: tool_call success rate by tool, last_used_at staleness
    - Senate: belief-snapshot recency, dividend distribution lag, contribution
    credit pipeline depth, quality-gate failure rate

  • Site state (optional, when the LLM judges it relevant):
    - curl -s http://127.0.0.1:8000/api/... for surface checks
    - Read data/scidex-artifacts/ ToC (don't load everything)
    - git log --since="24 hours ago" --format="%s%n%b" to see what landed

    Phase 2 — Synthesis (xhigh effort think)

    With state in hand, ask yourself the hard questions. Each answer is
    candidate task material:

    Strategic gaps (priority 95-99 candidates):

    • What capability would change SciDEX's value proposition if shipped this
    week? (Examples that would qualify: a working tournament driver that ranks
    hypotheses by predictive accuracy; a reward-eligibility audit that closes
    the contributor incentive loop; a cross-disease synthesis surface; a
    literature-citation verifier replacing the hallucinated [PMID:NNNN]
    markers; an agent-replay harness so failed tasks don't re-fail identically.)
    • What scientific question is the system uniquely positioned to answer
    this week? (e.g. "Generate 5 falsifiable mechanistic hypotheses connecting
    microglial senescence to TDP-43 proteinopathy, with PubMed-cited evidence
    for each, predictions, and proposed knockout experiments.")
    • Which layers have the least cross-talk, and what bridge would help most?

    Capability gaps (priority 90-94 candidates):

    • Which thin-scaffold capability is closest to ready-for-real-use? (e.g. an
    experiment-proposal generator that has a working prompt but no UI surface.)
    • Which feedback loop is open-loop today and would close with one task?
    (e.g. debate quality → hypothesis prior; agent contribution → reward.)

    Health gaps (priority 90-92, only if no recurring CI covers it):

    • Which stuck recurring driver is the most leveraged to unstick? Frame as
    "diagnose + fix + verify resumed throughput" — not "do the work the driver
    was supposed to do."

    Phase 3 — Bounded creation

    For each new task:

  • Title: [Layer] <verb> <object> <specifics> — concrete, scannable.
  • Description: includes (a) why this matters (cite the strategic gap
    from Phase 2), (b) what success looks like in measurable terms,
    (c) what the agent should read first, (d) what not to do.
  • Priority: 90-99 by leverage. Only use ≥ 95 for genuine
    capability-shift work; reserve 99 for "this changes the whole system."
  • Layer / quest_id: route via existing layer quests when they fit; create
    no new quests.
  • Spec_path: prefer existing quest layer specs (quest_<layer>_spec.md).
    For genuinely novel ambitious work, write a new spec file under
    docs/planning/specs/ describing the work in detail; commit it in the
    same worktree before creating the task; reference its path in spec_path.
  • Task type: default to iterative with max_iterations=15 for deep
    work; use one_shot only for clearly-bounded single-PR work.
  • Tags: include quest-engine, the layer name, and a topical tag.
  • Payload requirements: set realistic skill floors (reasoning,
    analysis, coding, creativity, safety).

    Cap creation at max(5, deficit + 2) per cycle. Never exceed 10.

    Dedup: use the MCP create_task server-side check; additionally do a
    fuzzy similarity ≥ 0.7 check against last 60 days of titles (open + done)
    before submitting.
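A stdlib sketch of the fuzzy check using difflib's similarity ratio; the lowercase/strip normalization and the helper name `is_duplicate_title` are assumptions, only the 0.7 threshold is from the spec:

```python
from difflib import SequenceMatcher
from typing import Iterable

def is_duplicate_title(candidate: str, recent_titles: Iterable[str],
                       threshold: float = 0.7) -> bool:
    """True if the candidate is >= `threshold` similar to any recent title."""
    cand = candidate.lower().strip()
    return any(
        SequenceMatcher(None, cand, t.lower().strip()).ratio() >= threshold
        for t in recent_titles
    )

recent = ["[Agora] Debate transcript causal claim extractor"]
print(is_duplicate_title("[Agora] Debate transcript causal claims extractor", recent))  # → True
```

SequenceMatcher is quadratic per pair, which is fine for a 60-day title window; a larger corpus would want token-set or embedding similarity instead.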

    Phase 4 — Priority audit (every cycle)

    Independent of creation, audit existing open task priorities:

    • Overdue + low priority (e.g. 60d old at priority 50): demote to 30 or
    close as stale (with rationale in completion_summary).
    • Underprioritized strategic work: open task whose description references
    a Phase 2 strategic theme but priority < 85 → bump to 85-90.
    • Overprioritized filler: 95-priority task whose work is also covered
    by an active every-6h recurring driver → demote to 70.
    • Surface contention: if 3+ tasks are racing on the same files, lower
    the priority of all but the most strategic.

    Make ≤ 5 priority adjustments per cycle. Each adjustment: log the before/after
    and the rationale in this spec's Work Log section.
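The first three audit rules can be encoded as a pure function for illustration (the `OpenTask` shape and field names are hypothetical; the surface-contention rule needs cross-task state and is omitted):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OpenTask:
    priority: int
    age_days: int
    strategic: bool          # description references a Phase 2 strategic theme
    covered_by_driver: bool  # same work as an active every-6h recurring driver

def audited_priority(t: OpenTask) -> Optional[int]:
    """New priority per the audit rules, or None to leave the task untouched."""
    if t.age_days >= 60 and t.priority <= 50:
        return 30   # overdue low-priority work: demote (or close as stale)
    if t.strategic and t.priority < 85:
        return 85   # underprioritized strategic work: bump into 85-90
    if t.priority >= 95 and t.covered_by_driver:
        return 70   # overprioritized filler: demote
    return None

print(audited_priority(OpenTask(priority=50, age_days=70,
                                strategic=False, covered_by_driver=False)))  # → 30
```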

    Phase 5 — Cycle log

    Append a Work Log entry with:

    • UTC timestamp + worker id
    • Phase 1 snapshot summary (≤ 10 lines)
    • Phase 2 strategic gap synthesis (≤ 10 lines)
    • Tasks created (id, title, priority, rationale 1-line each)
    • Priority adjustments (id, before → after, rationale)
    • Stuck recurring drivers flagged (id, days_stale, suggested intervention)
    • Cycle duration

    Recurring tasks already covering common gaps

    Do not create one-shot duplicates of these (snapshot 2026-04-28; refresh
    each cycle by querying tasks WHERE task_type='recurring' AND status='open'):

    | Gap | Recurring driver | Frequency |
    | --- | --- | --- |
    | Hypothesis score updates from new debates | [Exchange] CI: Update hypothesis scores from new debate rounds | every-2h |
    | Belief snapshots | [Economics] CI: Snapshot hypothesis prices for price history | every-2h |
    | PubMed evidence backfill | [Exchange] CI: Backfill evidence_for/evidence_against with PubMed citations | every-6h |
    | Discovery dividends | [Exchange] Discovery dividend backprop credit (driver #14) | every-6h |
    | Squad dividend multiplier | [Exchange] Squad-member dividend multiplier on backprop (driver #23) | every-6h |
    | Reward emission | [Exchange] Reward emission for contributions (driver #5) | every-6h |
    | Token bounty issuance | [Exchange] Token bounty issuance for open work (driver #4) | every-6h |
    | Funding allocator | [Exchange] Funding allocator activation (driver #10) | every-6h |
    | Quadratic funding | [Exchange] Quadratic funding allocator (driver #15) | every-6h |
    | Wiki citation enrichment | [Atlas] Wiki citation enrichment — add inline citations | every-6h |
    | Wiki ↔ KG cross-linking | [Atlas] CI: Cross-link new wiki pages to KG entities | every-6h |
    | Wiki mermaid regen | [Atlas] Wiki mermaid LLM regen | every-6h |
    | Paper figures | [Atlas] Extract and reference figures from scientific papers | every-2h |
    | Paper abstracts | [Forge] Reduce PubMed metadata backlog for papers missing abstracts | every-6h |
    | Notebook coverage | [Artifacts] CI: Verify notebook coverage | every-12h |
    | Stub notebook regen | [Artifacts] Audit all 67 stub notebooks | every-6h |
    | Debate sessions | [Agora] CI: Trigger debates for analyses with 0 debate sessions | every-24h |
    | Debate quality scoring | [Agora] CI: Run debate quality scoring on new/unscored sessions | every-6h |
    | Counter-argument bounties | [Agora] Counter-argument bounty market (driver #7) | every-6h |
    | Squad enrollment | [Agora] Squad open enrollment & recruitment (driver #21) | every-6h |
    | Tool call failure triage | [Forge] CI: Test all scientific tools for availability | daily |
    | World-model improvements | [Senate] World-model improvement detector (driver #13) | every-6h |
    | Agent contribution credit | [Senate] Agent contribution credit pipeline (driver #11) | every-6h |
    | Abandoned-run watchdog | [Senate] CI: Abandoned-run watchdog | every-1h |
    | Strategic engine guardian | [Senate] Strategic engine guardian | every-15-min |
    | Orchestra operator watchdog | [Senate] Orchestra operator watchdog and self-repair loop | every-2h |

    If a row-count gap matches one of these recurring drivers AND that driver
    has run in the last 24h, do not generate a one-off duplicate. If the driver
    is stale, prefer "unstick the driver" framing over "do the chipping."

    Critical constraints

    • Pri 99 every-30-min: this spec must run regardless of queue depth
    (priority audit always happens; creation only when invariant violated).
    • xhigh effort: agents picking this up should run with maximum
    reasoning effort. The supervisor must route to a slot capable of
    reasoning >= 10. (Note: per auth._ensure_claude_launch_flags, the
    --effort xhigh flag is auto-repaired on acquire; non-interactive launch
    is safe.)
    • Provider any: claude or codex; both are acceptable. Codex is the
    default for cycles requiring deep code reading.
    • Idempotent: re-running this task immediately should be a no-op
    (queue invariant satisfied → no creation; priority audit converges).
    • No destructive action: this task only creates tasks and adjusts
    priorities. It never deletes or merges artifacts. Never close another
    task's lifecycle.
    • Bounded blast radius: ≤ 10 task creations per cycle, ≤ 5 priority
    adjustments per cycle. Anything larger needs operator gate.
    • Worktree discipline: any new spec files must be written in a
    worktree (per SciDEX guard-main-writes hook), committed, and PR'd.
    Do not write to the main checkout.
    • No new quests: route to existing quests; if no quest fits, route to
    the most relevant layer quest (qe-<layer> family).

    Acceptance criteria

    ☐ Each cycle reads queue + recurring health + world-model state in
    Phase 1 before any creation
    ☐ Each cycle produces a Work Log entry with the structure described in
    Phase 5
    ☐ Steady state: ≥ 5 open non-CI tasks at priority ≥ 90
    ☐ Created tasks pass the "ambitious" bar (at least 2 of the 5 criteria)
    ☐ Created tasks do not duplicate active recurring CI work
    ☐ Stale recurring drivers (>24h since last_completed_at) are flagged in
    every cycle's log, regardless of whether intervention happened

    Dependencies

    • /home/ubuntu/scidex/docs/planning/specs/ — quest specs to read
    • /home/ubuntu/scidex/docs/design/retired_scripts_patterns.md — design principles
    • /home/ubuntu/Orchestra/orchestra.db — task queue (read via MCP, write via MCP)
    • scidex.core.database.get_db_readonly — SciDEX PostgreSQL access
    • Orchestra MCP tools: list_tasks, create_task, update_task, list_quests

    Dependents

    • The agent fleet's productivity at any given moment is bounded by the
    ambition of the queue this task generates.
    • Strategic engine guardian (7a9c642b) reopens this if it gets blocked.
    • Watchdog (433698d3 family) creates [Watchdog] Fix: tasks when this
    abandons too often — those should be diagnosed, not silenced.

    Work Log

    2026-04-28 — Bootstrap (rewrite from quest-engine-ci.md)

    Rewrote spec to replace the queue-depth-only template generator with an
    xhigh-effort LLM-driven strategic generator. Prior approach (quest_engine.py
    + quest-engine-ci.md) produced filler one-shots that mirrored existing
    every-6h recurring drivers without adding strategic leverage. New spec:

    • Gap predicate is "≥5 non-CI priority-≥90 tasks", not "queue size < 50"
    • Every cycle reads recurring CI health and refuses to chip at gaps that
    active drivers already own
    • Phases mandate state snapshot → strategic synthesis → bounded creation
    → priority audit → log
    • "Ambitious" bar: capability work, scientific output, value-prop shift
    • Hard prohibition on row-count backfill duplicates of existing drivers

    Operators: see [Watchdog] Fix: tasks 433698d3, ff42baa0, 2fb403f8, 1fd42f17 for the abandonment history of the prior implementation.

    Cycle 1 — 2026-04-28T09:11Z (first agent run on the new spec)

    World model snapshot (Phase 1):

    | Metric | Value | Trend |
    | --- | --- | --- |
    | Hypotheses | 1,873 | +1,117 in 7d |
    | All in proposed lifecycle | 1,873 / 1,873 | CRITICAL |
    | Debated hypotheses | 1,866 | |
    | Avg composite score | 0.561 | |
    | Top composite score | 0.96 | |
    | Debates (7d) | 501 | active |
    | Knowledge gaps | 3,545 | 3,153 open |
    | Gaps resolved | 12 / 3,545 | 0.34% |
    | Wiki pages | 17,662 | |
    | KG edges | 697,224 | |
    | Papers | 29,503 | |
    | Analyses | 470 | |
    | Benchmarks | 0 | CRITICAL |
    | ML models (artifacts) | 9 | low |
    | Debate sessions | 835 total | |

    Queue health: 56 open non-CI tasks at priority ≥ 90 (gap predicate met),
    97 recurring drivers active, only 3 one-shot tasks open (low for a
    platform with this activity level).

    Strategic gaps synthesized (Phase 2):

  • Hypothesis lifecycle frozen — every hypothesis stuck in proposed;
    no promotion engine exists. The world model can never improve.
  • Gap resolution rate 0.34% — 3,545 gaps, 12 resolved; the gap tracker is
    noise, not signal.
  • Zero benchmarks — Forge has no computational benchmarks; the platform
    lacks a predictive-validity demonstration.
  • Debate transcripts unused for causal KG — 835 debates contain
    mechanistic A→B→C reasoning never extracted as KG edges.
  • No cross-disease mechanism mining — debates are siloed by disease;
    no AD/PD/ALS/FTD analogy miner exists.
  • Top hypotheses lack computational validation — the top 25 at
    composite_score ≥ 0.88 are debated but not computationally validated.

    Ambitious tasks created (Phase 3):

    | ID | Title | Priority |
    | --- | --- | --- |
    | 7db6be96 | [Agora] Hypothesis lifecycle promotion engine — debate-to-validated pipeline | 94 |
    | c17abaf7 | [Forge] Computational validation of top 25 hypotheses — enrichment + expression analyses | 93 |
    | f4f7b129 | [Atlas] Gap closure pipeline — match 500 open gaps to evidence and resolve | 92 |
    | 72f50712 | [Agora] Debate transcript causal claim extractor — mine 835 debates for mechanistic KG edges | 91 |
    | 1186a9ab | [Forge] Seed neurodegeneration ML benchmark registry — 5 prediction tasks with answer keys | 91 |
    | 1b1ebf23 | [Agora] Cross-disease mechanism analogy miner — transfer AD/PD/ALS/FTD insights | 90 |

    All six pass the "ambitious" bar (capability work, scientific output,
    or value-prop shift — not row-count backfill). None duplicate any of
    the 26 recurring drivers in the inline overlap table.

    Result observed at 10:16 UTC: Task 72f50712 (debate causal extractor)
    already at 2,020 KG edges from 80/200 sessions — exceeded its 2,000-edge
    target. The new generator is producing tasks that actually move the needle.
