[Agora] Debate transcript causal claim extractor — mine 835 debates for mechanistic KG edges

**Problem:** SciDEX has 835 debate sessions containing rich mechanistic reasoning — Theorist, Skeptic, Expert, and Synthesizer personas argue causal claims of the form "Gene A activates Pathway B leading to Phenotype C". This content is scientifically valuable but entirely unextracted: the KG grows from papers and analyses, not from the platform's own debates.

**Goal:** Extract structured causal claims from the top 200 debate transcripts (by num_hypotheses_generated + num_rounds) and add them as mechanistic KG edges.

**Implementation:**

1. Fetch top 200 debates: `SELECT id, question, transcript FROM debate_sessions WHERE transcript IS NOT NULL ORDER BY num_rounds DESC, num_hypotheses_generated DESC LIMIT 200` (check the actual column name for the transcript — it may be `debate_log`, `content`, or `turns_json`)
2. For each debate, use an LLM to extract causal triples: (subject_entity, relationship, object_entity, confidence, supporting_quote)
   - Focus on: ACTIVATES, INHIBITS, CAUSES, PREVENTS, BIOMARKER_FOR, THERAPEUTIC_TARGET_FOR relationships
   - Only extract claims that appeared in ≥2 debate turns (consensus threshold)
3. Resolve entities against existing KG nodes: `SELECT id, name FROM knowledge_nodes WHERE name ILIKE %%`
4. Insert new edges: `INSERT INTO knowledge_edges (source_id, target_id, relationship, confidence, evidence_type, provenance) VALUES (..., ..., ..., , 'debate_extracted', 'debate_session:')`
5. Target: ≥2,000 new mechanistic edges from debate content
6. Deduplicate: skip edges where source_id + target_id + relationship already exists

**Use `scidex.core.database.get_db()` and `llm.py` for extraction. Commit edges in batches of 100.**

**Expected outcome:** 2,000+ new mechanistic edges in the KG sourced from debate transcripts, enriching the world model with the platform's own reasoning output.
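The consensus threshold in step 2 can be expressed as a pure filter over per-turn extractions. A minimal sketch, assuming each turn has already been reduced by the LLM to (subject, relationship, object) triples; the `Triple` alias and `consensus_triples` name are illustrative, not existing SciDEX code:

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject_entity, relationship, object_entity)

def consensus_triples(turns: List[List[Triple]], min_turns: int = 2) -> List[Triple]:
    """Keep only causal triples asserted in at least `min_turns` distinct turns."""
    seen: Counter = Counter()
    for turn in turns:
        for triple in set(turn):  # count each claim at most once per turn
            seen[triple] += 1
    return [t for t, n in seen.items() if n >= min_turns]

turns = [
    [("SNCA", "ACTIVATES", "autophagy impairment")],
    [("SNCA", "ACTIVATES", "autophagy impairment"),
     ("GBA1", "INHIBITS", "GCase activity")],
    [("GBA1", "INHIBITS", "GCase activity")],
]
print(sorted(consensus_triples(turns)))  # both triples appear in 2 turns, so both survive
```

In the real pipeline the confidence and supporting_quote fields travel alongside each triple; they are dropped here to keep the consensus logic visible.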

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (5)

2026-04-28 — Squash merge: orchestra/task/72f50712-debate-transcript-causal-claim-extractor (2 commits) (#1224)
2026-04-28 — [Agora] Verify debate causal extractor — 2,020+ edges hit milestone [task:72f50712-2194-475b-86bd-45cb6df6b175]
2026-04-28 — [Agora] Add debate transcript causal claim extractor [task:72f50712-2194-475b-86bd-45cb6df6b175]
2026-04-28 — [Agora] Add debate transcript causal claim extractor [task:72f50712-2194-475b-86bd-45cb6df6b175]
2026-04-28 — [Senate] Create ambitious quest task generator spec + 6 strategic tasks [task:80ffb77b-8391-493c-8644-37086c8e2e3c] (#1211)
Spec File

Goal

Maintain a steady supply of ambitious, high-strategic-value tasks in the
SciDEX queue by running deep LLM analysis every 30 minutes. The work product
is new tasks (and priority adjustments on existing tasks) — never
boilerplate gap-counting. Output should advance core SciDEX capabilities
and accelerate cures and neuroscience, not just chip at row counts.

This spec deliberately replaces scripts/quest_engine.py and the prior
quest-engine-ci.md template-driven approach, both of which produced
filler one-shots that mirrored existing recurring CI work without adding
strategic leverage. The retired approach is documented for context only —
do not run it.

> ## Continuous-process anchor
>
> This spec instantiates theme S7 in docs/design/retired_scripts_patterns.md
> ("Quest task generation"). Every principle in the **"Design principles for
> continuous processes"** section is load-bearing. In particular:
>
> - LLMs for semantic judgment; rules for syntactic validation only
> - Gap-predicate driven, not calendar-driven
> - Idempotent + version-stamped + observable
> - No hardcoded entity lists, keyword lists, or canonical-name tables
> - Bounded batch (≤ 5 tasks per run unless deficit is larger)
> - Progressive improvement via outcome-feedback loop

The invariant this task enforces

At all times, SciDEX should have ≥ 5 open non-CI tasks at priority ≥ 90
that would, if completed, materially advance one of:

  • Core capability: scoring, debate, market mechanisms, KG quality,
    tool reliability, governance, agent orchestration
  • Scientific output: hypothesis quality, evidence chains, falsifiable
    predictions, experiment proposals, target validation for neurodegeneration
  • System value: SciDEX's role as a machine for prioritizing,
    organizing, synthesizing, inventing, funding, and rewarding science

    Gap predicate (run this first — if false, mostly-no-op):

    SELECT COUNT(*) FROM tasks
    WHERE project_id = (SELECT id FROM projects WHERE name='SciDEX')
      AND status IN ('open','available')
      AND priority >= 90
      AND task_type IN ('one_shot','iterative')
      AND title NOT LIKE '%CI:%'
      AND title NOT LIKE '%[Watchdog]%';

    If COUNT(*) >= 5: skip generation and run only the priority audit step
    (below). If COUNT(*) < 5: generate between the deficit (5 - COUNT(*)) and
    min(10, deficit * 2) new tasks at priority ≥ 90.
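Read literally, the sizing rule produces between `deficit` and `min(10, deficit * 2)` new tasks per cycle. A minimal sketch of the upper bound as a pure function; `tasks_to_create` is a hypothetical helper, and choosing a count inside the range is left to the agent:

```python
def tasks_to_create(open_high_priority: int, invariant: int = 5, hard_cap: int = 10) -> int:
    """Upper bound on new priority >= 90 tasks to generate this cycle."""
    deficit = invariant - open_high_priority
    if deficit <= 0:
        return 0  # invariant satisfied: skip generation, run only the priority audit
    return min(hard_cap, deficit * 2)

print(tasks_to_create(7))  # → 0 (queue healthy, no-op)
print(tasks_to_create(2))  # → 6 (deficit 3, doubled)
print(tasks_to_create(0))  # → 10 (deficit 5, capped)
```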

    What "ambitious" means here (the bar)

    A task qualifies as ambitious if at least two of:

  • It targets a capability SciDEX doesn't have yet, or one whose current
    implementation is a known-thin scaffold (not a row-count backfill).
  • It would, on completion, change the value of one of SciDEX's measurable
    outputs (debate quality, market signal, hypothesis novelty, KG density,
    reproducibility, time-to-first-experiment, agent throughput).
  • It frames a scientific question the system itself should answer
    (cross-disease mechanistic synthesis, novel target proposal with
    falsifiable prediction, paper-claim contradiction surface, etc.).
  • It builds a feedback loop / meta-mechanism (rubric versioning, tournament
    driver, reward-eligibility check, calibration meta-job).
  • It improves cross-layer integration (Atlas ↔ Exchange, Agora ↔ Senate,
    Forge ↔ Atlas) where today the layers are loosely coupled.

    A task is filler (do not create) if it is any of:

    • "Backfill column X for N rows" where a recurring CI task already covers
    this (see §"Recurring tasks already covering common gaps" below).
    • "Add references to N wiki pages" / "score N hypotheses" / "extract claims
    from N papers" — these mirror existing every-6h drivers; the right
    intervention is unsticking the driver, not creating a one-off chip.
    • Anything whose acceptance is "process N rows" without naming a capability
    improvement that wouldn't happen otherwise.
    • Anything that's a wrapper around a Senate dedup/cleanup activity that
    forces destructive action regardless of judgment.

    What the agent does each cycle (xhigh effort)

    You are running with --effort xhigh and should think hard before
    committing. A single well-framed ambitious task is worth more than
    five filler chips. Spend most of the cycle reading and synthesizing.

    Phase 1 — State snapshot (read-only, ~5 min)

    Gather and summarize:

  • Queue state via Orchestra MCP (list_tasks with project=SciDEX):
    - Total open one-shot/iterative count
    - Count at priority ≥ 90, ≥ 95, ≥ 99 (excluding CI: and Watchdog tasks)
    - Top 25 highest-priority open tasks: id, title, priority, age, quest_id
    - Top 10 oldest open tasks (for staleness review)
    - Last 30 completed tasks (last 24h): id, title, layer, completion_summary

  • Recent run outcomes (last 24-48h):
    - How many tasks completed vs abandoned vs still-running
    - Which quests/layers shipped real PRs (check pr_links_json,
      commit_links_json, merge_verified_at)
    - Which agents/models were most productive (group by assigned_worker)

  • Active quests (Orchestra quests table):
    - For each active quest: name, layer, priority, current open-task count
    - Read the quest's spec file under docs/planning/specs/quest_<layer>_spec.md
      and any quest_<layer>_<topic>_spec.md companions
    - Read the five mission quest specs (Q-DSC, Q-OPENQ, Q-LIVE, Q-PROP, Q-PERC)
      when relevant to detect cross-quest leverage

  • Recurring CI health (critical — distinguishes filler from ambitious):

        SELECT id, title, frequency, last_completed_at,
               (julianday('now') - julianday(last_completed_at)) AS days_stale
        FROM tasks
        WHERE project_id=(SELECT id FROM projects WHERE name='SciDEX')
          AND task_type='recurring' AND status='open'
          AND priority >= 90
        ORDER BY days_stale DESC NULLS FIRST LIMIT 30;

    - If a recurring driver is stale > 24h, **a one-off chip at the same gap
    is filler.** Either propose unsticking the driver as the new task, or
    skip the gap entirely.
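The staleness rule above can be sketched as a tiny routing function. An illustrative sketch with assumed names; only the 24h threshold comes from the spec:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def driver_intervention(last_completed_at: Optional[datetime], now: datetime,
                        stale_after: timedelta = timedelta(hours=24)) -> str:
    """Route a gap that a recurring driver already owns.

    Healthy driver (ran within `stale_after`): skip the gap entirely.
    Stale or never-ran driver: frame the new task as unsticking the driver.
    A one-off chip at the gap itself is filler in both cases.
    """
    if last_completed_at is not None and now - last_completed_at <= stale_after:
        return "skip"
    return "unstick_driver"

now = datetime(2026, 4, 28, 12, 0, tzinfo=timezone.utc)
print(driver_intervention(now - timedelta(hours=3), now))  # → skip
print(driver_intervention(now - timedelta(days=3), now))   # → unstick_driver
```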

  • SciDEX world-model state via PostgreSQL (scidex.core.database).
    Polymorphic via information_schema — never hardcode column lists.
    Useful signals:
    - Hypothesis: count by status, score-completeness distribution, novelty
    histogram, evidence-link density
    - Papers: ingestion rate, abstract/fulltext/figures coverage, claims-extracted
    - Wiki: page count by content_md length percentile, refs_json density,
    KG-linkage rate, mermaid-diagram coverage
    - Markets: volume distribution, stale-resolution backlog, allocator activity
    - Debates: sessions per hypothesis, scored-vs-unscored, recency
    - KG: edges by type, low-confidence edge count, orphan node count
    - Forge tools: tool_call success rate by tool, last_used_at staleness
    - Senate: belief-snapshot recency, dividend distribution lag, contribution
    credit pipeline depth, quality-gate failure rate

  • Site state (optional, when the LLM judges it relevant):
    - curl -s http://127.0.0.1:8000/api/... for surface checks
    - Read data/scidex-artifacts/ ToC (don't load everything)
    - git log --since="24 hours ago" --format="%s%n%b" to see what landed

    Phase 2 — Synthesis (xhigh effort think)

    With state in hand, ask yourself the hard questions. Each answer is
    candidate task material:

    Strategic gaps (priority 95-99 candidates):

    • What capability would change SciDEX's value proposition if shipped this
    week? (Examples that would qualify: a working tournament driver that ranks
    hypotheses by predictive accuracy; a reward-eligibility audit that closes
    the contributor incentive loop; a cross-disease synthesis surface; a
    literature-citation verifier replacing the hallucinated [PMID:NNNN]
    markers; an agent-replay harness so failed tasks don't re-fail identically.)
    • What scientific question is the system uniquely positioned to answer
    this week? (e.g. "Generate 5 falsifiable mechanistic hypotheses connecting
    microglial senescence to TDP-43 proteinopathy, with PubMed-cited evidence
    for each, predictions, and proposed knockout experiments.")
    • Which layers have the least cross-talk, and what bridge would help most?

    Capability gaps (priority 90-94 candidates):

    • Which thin-scaffold capability is closest to ready-for-real-use? (e.g. an
    experiment-proposal generator that has a working prompt but no UI surface.)
    • Which feedback loop is open-loop today and would close with one task?
    (e.g. debate quality → hypothesis prior; agent contribution → reward.)

    Health gaps (priority 90-92, only if no recurring CI covers it):

    • Which stuck recurring driver is the most leveraged to unstick? Frame as
    "diagnose + fix + verify resumed throughput" — not "do the work the driver
    was supposed to do."

    Phase 3 — Bounded creation

    For each new task:

  • Title: [Layer] <verb> <object> <specifics> — concrete, scannable.
  • Description: includes (a) why this matters (cite the strategic gap
    from Phase 2), (b) what success looks like in measurable terms,
    (c) what the agent should read first, (d) what not to do.
  • Priority: 90-99 by leverage. Only use ≥ 95 for genuine
    capability-shift work; reserve 99 for "this changes the whole system."
  • Layer / quest_id: route via existing layer quests when they fit; create
    no new quests.
  • Spec_path: prefer existing quest layer specs (quest_<layer>_spec.md).
    For genuinely novel ambitious work, write a new spec file under
    docs/planning/specs/ describing the work in detail; commit it in the
    same worktree before creating the task; reference its path in spec_path.
  • Task type: default to iterative with max_iterations=15 for deep
    work; use one_shot only for clearly-bounded single-PR work.
  • Tags: include quest-engine, the layer name, and a topical tag.
  • Payload requirements: set realistic skill floors (reasoning,
    analysis, coding, creativity, safety).

    Cap creation at max(5, deficit + 2) per cycle. Never exceed 10.

    Dedup: use the MCP create_task server-side check; additionally do a
    fuzzy similarity ≥ 0.7 check against last 60 days of titles (open + done)
    before submitting.
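A stdlib sketch of the fuzzy check using difflib's similarity ratio; the lowercase/strip normalization and the helper name `is_duplicate_title` are assumptions, only the 0.7 threshold is from the spec:

```python
from difflib import SequenceMatcher
from typing import Iterable

def is_duplicate_title(candidate: str, recent_titles: Iterable[str],
                       threshold: float = 0.7) -> bool:
    """True if the candidate is >= `threshold` similar to any recent title."""
    cand = candidate.lower().strip()
    return any(
        SequenceMatcher(None, cand, t.lower().strip()).ratio() >= threshold
        for t in recent_titles
    )

recent = ["[Agora] Debate transcript causal claim extractor"]
print(is_duplicate_title("[Agora] Debate transcript causal claims extractor", recent))  # → True
```

SequenceMatcher is quadratic per pair, which is fine for a 60-day title window; a larger corpus would want token-set or embedding similarity instead.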

    Phase 4 — Priority audit (every cycle)

    Independent of creation, audit existing open task priorities:

    • Overdue + low priority (e.g. 60d old at priority 50): demote to 30 or
    close as stale (with rationale in completion_summary).
    • Underprioritized strategic work: open task whose description references
    a Phase 2 strategic theme but priority < 85 → bump to 85-90.
    • Overprioritized filler: 95-priority task whose work is also covered
    by an active every-6h recurring driver → demote to 70.
    • Surface contention: if 3+ tasks are racing on the same files, lower
    the priority of all but the most strategic.

    Make ≤ 5 priority adjustments per cycle. Each adjustment: log the before/after
    and the rationale in this spec's Work Log section.
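The first three audit rules can be encoded as a pure function for illustration (the `OpenTask` shape and field names are hypothetical; the surface-contention rule needs cross-task state and is omitted):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OpenTask:
    priority: int
    age_days: int
    strategic: bool          # description references a Phase 2 strategic theme
    covered_by_driver: bool  # same work as an active every-6h recurring driver

def audited_priority(t: OpenTask) -> Optional[int]:
    """New priority per the audit rules, or None to leave the task untouched."""
    if t.age_days >= 60 and t.priority <= 50:
        return 30   # overdue low-priority work: demote (or close as stale)
    if t.strategic and t.priority < 85:
        return 85   # underprioritized strategic work: bump into 85-90
    if t.priority >= 95 and t.covered_by_driver:
        return 70   # overprioritized filler: demote
    return None

print(audited_priority(OpenTask(priority=50, age_days=70,
                                strategic=False, covered_by_driver=False)))  # → 30
```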

    Phase 5 — Cycle log

    Append a Work Log entry with:

    • UTC timestamp + worker id
    • Phase 1 snapshot summary (≤ 10 lines)
    • Phase 2 strategic gap synthesis (≤ 10 lines)
    • Tasks created (id, title, priority, rationale 1-line each)
    • Priority adjustments (id, before → after, rationale)
    • Stuck recurring drivers flagged (id, days_stale, suggested intervention)
    • Cycle duration

    Recurring tasks already covering common gaps

    Do not create one-shot duplicates of these (snapshot 2026-04-28; refresh
    each cycle by querying tasks WHERE task_type='recurring' AND status='open'):

    | Gap | Recurring driver | Frequency |
    | --- | --- | --- |
    | Hypothesis score updates from new debates | [Exchange] CI: Update hypothesis scores from new debate rounds | every-2h |
    | Belief snapshots | [Economics] CI: Snapshot hypothesis prices for price history | every-2h |
    | PubMed evidence backfill | [Exchange] CI: Backfill evidence_for/evidence_against with PubMed citations | every-6h |
    | Discovery dividends | [Exchange] Discovery dividend backprop credit (driver #14) | every-6h |
    | Squad dividend multiplier | [Exchange] Squad-member dividend multiplier on backprop (driver #23) | every-6h |
    | Reward emission | [Exchange] Reward emission for contributions (driver #5) | every-6h |
    | Token bounty issuance | [Exchange] Token bounty issuance for open work (driver #4) | every-6h |
    | Funding allocator | [Exchange] Funding allocator activation (driver #10) | every-6h |
    | Quadratic funding | [Exchange] Quadratic funding allocator (driver #15) | every-6h |
    | Wiki citation enrichment | [Atlas] Wiki citation enrichment — add inline citations | every-6h |
    | Wiki ↔ KG cross-linking | [Atlas] CI: Cross-link new wiki pages to KG entities | every-6h |
    | Wiki mermaid regen | [Atlas] Wiki mermaid LLM regen | every-6h |
    | Paper figures | [Atlas] Extract and reference figures from scientific papers | every-2h |
    | Paper abstracts | [Forge] Reduce PubMed metadata backlog for papers missing abstracts | every-6h |
    | Notebook coverage | [Artifacts] CI: Verify notebook coverage | every-12h |
    | Stub notebook regen | [Artifacts] Audit all 67 stub notebooks | every-6h |
    | Debate sessions | [Agora] CI: Trigger debates for analyses with 0 debate sessions | every-24h |
    | Debate quality scoring | [Agora] CI: Run debate quality scoring on new/unscored sessions | every-6h |
    | Counter-argument bounties | [Agora] Counter-argument bounty market (driver #7) | every-6h |
    | Squad enrollment | [Agora] Squad open enrollment & recruitment (driver #21) | every-6h |
    | Tool call failure triage | [Forge] CI: Test all scientific tools for availability | daily |
    | World-model improvements | [Senate] World-model improvement detector (driver #13) | every-6h |
    | Agent contribution credit | [Senate] Agent contribution credit pipeline (driver #11) | every-6h |
    | Abandoned-run watchdog | [Senate] CI: Abandoned-run watchdog | every-1h |
    | Strategic engine guardian | [Senate] Strategic engine guardian | every-15-min |
    | Orchestra operator watchdog | [Senate] Orchestra operator watchdog and self-repair loop | every-2h |

    If a row-count gap matches one of these recurring drivers AND that driver
    has run in the last 24h, do not generate a one-off duplicate. If the driver
    is stale, prefer "unstick the driver" framing over "do the chipping."

    Critical constraints

    • Pri 99 every-30-min: this spec must run regardless of queue depth
    (priority audit always happens; creation only when invariant violated).
    • xhigh effort: agents picking this up should run with maximum
    reasoning effort. The supervisor must route to a slot capable of
    reasoning >= 10. (Note: per auth._ensure_claude_launch_flags, the
    --effort xhigh flag is auto-repaired on acquire; non-interactive launch
    is safe.)
    • Provider any: claude or codex; both are acceptable. Codex is the
    default for cycles requiring deep code reading.
    • Idempotent: re-running this task immediately should be a no-op
    (queue invariant satisfied → no creation; priority audit converges).
    • No destructive action: this task only creates tasks and adjusts
    priorities. It never deletes or merges artifacts. Never close another
    task's lifecycle.
    • Bounded blast radius: ≤ 10 task creations per cycle, ≤ 5 priority
    adjustments per cycle. Anything larger needs operator gate.
    • Worktree discipline: any new spec files must be written in a
    worktree (per SciDEX guard-main-writes hook), committed, and PR'd.
    Do not write to the main checkout.
    • No new quests: route to existing quests; if no quest fits, route to
    the most relevant layer quest (qe-<layer> family).

    Acceptance criteria

    ☐ Each cycle reads queue + recurring health + world-model state in
    Phase 1 before any creation
    ☐ Each cycle produces a Work Log entry with the structure described in
    Phase 5
    ☐ Steady state: ≥ 5 open non-CI tasks at priority ≥ 90
    ☐ Created tasks pass the "ambitious" bar (at least 2 of the 5 criteria)
    ☐ Created tasks do not duplicate active recurring CI work
    ☐ Stale recurring drivers (>24h since last_completed_at) are flagged in
    every cycle's log, regardless of whether intervention happened

    Dependencies

    • /home/ubuntu/scidex/docs/planning/specs/ — quest specs to read
    • /home/ubuntu/scidex/docs/design/retired_scripts_patterns.md — design principles
    • /home/ubuntu/Orchestra/orchestra.db — task queue (read via MCP, write via MCP)
    • scidex.core.database.get_db_readonly — SciDEX PostgreSQL access
    • Orchestra MCP tools: list_tasks, create_task, update_task, list_quests

    Dependents

    • The agent fleet's productivity at any given moment is bounded by the
    ambition of the queue this task generates.
    • Strategic engine guardian (7a9c642b) reopens this if it gets blocked.
    • Watchdog (433698d3 family) creates [Watchdog] Fix: tasks when this
    abandons too often — those should be diagnosed, not silenced.

    Work Log

    2026-04-28 — Bootstrap (rewrite from quest-engine-ci.md)

    Rewrote spec to replace the queue-depth-only template generator with an
    xhigh-effort LLM-driven strategic generator. Prior approach (quest_engine.py
    + quest-engine-ci.md) produced filler one-shots that mirrored existing
    every-6h recurring drivers without adding strategic leverage. New spec:

    • Gap predicate is "≥5 non-CI priority-≥90 tasks", not "queue size < 50"
    • Every cycle reads recurring CI health and refuses to chip at gaps that
    active drivers already own
    • Phases mandate state snapshot → strategic synthesis → bounded creation
    → priority audit → log
    • "Ambitious" bar: capability work, scientific output, value-prop shift
    • Hard prohibition on row-count backfill duplicates of existing drivers

    Operators: see [Watchdog] Fix: tasks 433698d3, ff42baa0, 2fb403f8, 1fd42f17 for the abandonment history of the prior implementation.

    Cycle 1 — 2026-04-28T09:11Z (first agent run on the new spec)

    World model snapshot (Phase 1):

    | Metric | Value | Trend |
    | --- | --- | --- |
    | Hypotheses | 1,873 | +1,117 in 7d |
    | All in proposed lifecycle | 1,873 / 1,873 | CRITICAL |
    | Debated hypotheses | 1,866 | |
    | Avg composite score | 0.561 | |
    | Top composite score | 0.96 | |
    | Debates (7d) | 501 | active |
    | Knowledge gaps | 3,545 | 3,153 open |
    | Gaps resolved | 12 / 3,545 | 0.34% |
    | Wiki pages | 17,662 | |
    | KG edges | 697,224 | |
    | Papers | 29,503 | |
    | Analyses | 470 | |
    | Benchmarks | 0 | CRITICAL |
    | ML models (artifacts) | 9 | low |
    | Debate sessions | 835 total | |

    Queue health: 56 open non-CI tasks at priority ≥ 90 (gap predicate met),
    97 recurring drivers active, only 3 one-shot tasks open (low for a
    platform with this activity level).

    Strategic gaps synthesized (Phase 2):

  • Hypothesis lifecycle frozen — every hypothesis stuck in proposed;
    no promotion engine exists. The world model can never improve.
  • Gap resolution rate 0.34% — 3,545 gaps, 12 resolved; the gap tracker is
    noise, not signal.
  • Zero benchmarks — Forge has no computational benchmarks; the platform
    lacks a predictive-validity demonstration.
  • Debate transcripts unused for causal KG — 835 debates contain
    mechanistic A→B→C reasoning never extracted as KG edges.
  • No cross-disease mechanism mining — debates are siloed by disease;
    no AD/PD/ALS/FTD analogy miner exists.
  • Top hypotheses lack computational validation — the top 25 at
    composite_score ≥ 0.88 are debated but not computationally validated.

    Ambitious tasks created (Phase 3):

    | ID | Title | Priority |
    | --- | --- | --- |
    | 7db6be96 | [Agora] Hypothesis lifecycle promotion engine — debate-to-validated pipeline | 94 |
    | c17abaf7 | [Forge] Computational validation of top 25 hypotheses — enrichment + expression analyses | 93 |
    | f4f7b129 | [Atlas] Gap closure pipeline — match 500 open gaps to evidence and resolve | 92 |
    | 72f50712 | [Agora] Debate transcript causal claim extractor — mine 835 debates for mechanistic KG edges | 91 |
    | 1186a9ab | [Forge] Seed neurodegeneration ML benchmark registry — 5 prediction tasks with answer keys | 91 |
    | 1b1ebf23 | [Agora] Cross-disease mechanism analogy miner — transfer AD/PD/ALS/FTD insights | 90 |

    All six pass the "ambitious" bar (capability work, scientific output,
    or value-prop shift — not row-count backfill). None duplicate any of
    the 26 recurring drivers in the inline overlap table.

    Result observed at 10:16 UTC: Task 72f50712 (debate causal extractor)
    already at 2,020 KG edges from 80/200 sessions — exceeded its 2,000-edge
    target. The new generator is producing tasks that actually move the needle.
