[Cross-cutting] Wire existing K-Dense skills into analyses + debates + persona contexts

Drive debate-engine analyses + experiment-extraction + hypothesis-enrichment to actually invoke the 23 existing K-Dense scientific skills under `scidex/skills/`. We've stopped building new tool wrappers (53+ duplicate tasks archived/deprioritized 2026-04-24); the focus shifts to making the tools we have part of the regular generation loop.

**Available skills** (each is a SKILL.md bundle the agent can call):

- allen-brain-expression
- alphafold-structure
- chembl-drug-targets
- clinvar-variants
- dgidb-drug-gene
- disgenet-gene-diseases
- drugbank-drug-info
- enrichr-analyze
- gnomad-gene-variants
- gtex-tissue-expression
- gwas-genetic-associations
- open-targets-associations
- openalex-works
- openfda-adverse-events
- paper-corpus-search
- paper-figures
- pubmed-search
- reactome-pathways
- research-topic
- search-trials
- semantic-scholar-search
- string-protein-interactions
- uniprot-protein-info

**Acceptance criteria (per iteration):**

1. Pick one analysis layer that does NOT yet call skills (e.g. agora hypothesis enrichment, debate evidence-grounding, experiment-extraction methods-section parsing, or persona research-context loading). Examples to grep: `scidex/agora/`, `scidex/forge/tools.py`, `scidex/ingest/experiment_extraction*.py`.
2. Add ≥1 skill invocation to the layer's flow with a real call site (not a stub). Each skill call must be wrapped in try/except + a fallback so a single API outage doesn't break the analysis.
3. Cite the skill output in the artifact (debate round, hypothesis evidence_for/against, experiment cited papers, persona research brief). The citation must include the source URL or paper PMID/DOI from the skill output so reviewers can audit.
4. Add a metric to `agent_skill_invocations` (or equivalent table; create if missing) recording skill name, called_from artifact_class + artifact_id, latency, success/error, output excerpt for downstream analytics.
5. Update an existing showcase artifact (Allen-experiment, recent hypothesis, top-debate, etc.) to demonstrate the skill citation in the UI.

**Per iteration, score how many skills are actually called per artifact-generation event** — target ≥2 distinct skills cited per debate round, ≥1 cited per hypothesis enrichment, ≥3 cited per experiment proposal.

**Why this matters now:** workers are generating artifacts without grounding them in real biomedical evidence the platform already has at hand. Inventions/hypotheses/experiments produced without skill calls score lower on the `gap_signal` and `utility_signal` axes (per the scidex_economy_design_spec.md §2 valuation signals) — they get auto-demoted by the meta-arena. Wiring the skills in is the cheapest move from 'generates plausible text' to 'generates evidence-grounded artifacts'.

Spec: docs/planning/specs/scidex_economy_design_spec.md (see §2 utility signal, §3 generation loop).
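
Criteria 2 and 4 can be sketched as one small wrapper. This is a minimal sketch, not the real call site: `call_skill_with_fallback`, the skill-function signature, and the `record` callback (which would INSERT into `agent_skill_invocations`) are all hypothetical stand-ins:

```python
import time

def call_skill_with_fallback(skill_fn, query, record, fallback=None):
    """Wrap one skill invocation: a single API outage must not break the
    analysis (criterion 2), and every attempt is recorded (criterion 4)."""
    start = time.monotonic()
    try:
        result = skill_fn(query)  # e.g. a pubmed-search skill function
        record(skill=getattr(skill_fn, "__name__", "skill"),
               latency=time.monotonic() - start,
               success=True,
               excerpt=str(result)[:200])
        return result
    except Exception as exc:  # any outage degrades to the fallback
        record(skill=getattr(skill_fn, "__name__", "skill"),
               latency=time.monotonic() - start,
               success=False,
               excerpt=repr(exc)[:200])
        return fallback

# Hypothetical usage: a flaky skill still yields a usable (empty) result,
# and the failure is captured for downstream analytics.
def flaky_pubmed_search(query):
    raise TimeoutError("NCBI eutils timed out")

rows = []
out = call_skill_with_fallback(flaky_pubmed_search, "APOE4 microglia",
                               lambda **kw: rows.append(kw), fallback=[])
```

The analysis layer keeps going with `out == []`, while `rows[0]` holds the skill name, latency, and error excerpt ready for the metrics table.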

Git Commits (9)

WIP on main: 1f0e35929 Squash merge: orchestra/task/b1a8e549-cross-cutting-wire-existing-k-dense-skil (2 commits) 2026-04-25
index on main: 1f0e35929 Squash merge: orchestra/task/b1a8e549-cross-cutting-wire-existing-k-dense-skil (2 commits) 2026-04-25
Squash merge: orchestra/task/b1a8e549-cross-cutting-wire-existing-k-dense-skil (2 commits) 2026-04-25
[Agora] Wire existing K-Dense-backed tools into debate orchestration [task:b1a8e549-6f31-43c5-80f5-7c4717c267e4] 2026-04-25
Squash merge: orchestra/task/b1a8e549-cross-cutting-wire-existing-k-dense-skil (1 commit) 2026-04-25
[Agora] Wire 3 missing tools into debate skill_functions, fix citation persistence bug [task:b1a8e549-6f31-43c5-80f5-7c4717c267e4] 2026-04-25
[Agora] Fix skeptic persona research: use theorist output, not undefined skeptic_response [task:b1a8e549-6f31-43c5-80f5-7c4717c267e4] 2026-04-24
[Agora] Fix PostgreSQL placeholders, wire hypothesis enrichment + experiment skills [task:b1a8e549-6f31-43c5-80f5-7c4717c267e4] 2026-04-24
[Agora] Wire K-Dense skills into debate engine with invocation tracking [task:b1a8e549-6f31-43c5-80f5-7c4717c267e4] 2026-04-24
Spec File

SciDEX economy — holistic design

> Purpose. Unify invention / experiment / gap / landscape / discovery / hypothesis / paper / target artifacts under one generation-and-valuation pipeline so the agent fleet does measurable, directional work instead of shipping plausibly-helpful-but-unranked output. This is the umbrella spec; each concrete quest spec (quest_inventions, quest_experiments, quest_gaps, quest_landscape_analyses, showcase UI) references this doc for shared mechanics.

The design is motivated by two observations from the 2026-04-24 audit, and the correction they point to:

  • The fleet has been generating lots of tasks that do not obviously advance any feature. Many land as "already resolved on main" no-ops, polish tweaks, or CI-watchdog fixes for symptoms of the agent fleet itself. This is the "junk/waste" pattern.
  • Existing specs cover many pieces (gap generation, gap prioritization, artifact lifecycle, market pricing, agora debates) but aren't explicitly wired into one pipeline. Each quest optimizes locally; nothing globally prioritizes the artifact frontier.
  • The correction is an economy — artifacts have explicit value signals, the quests compete for capacity, and the output is a ranked set of showcase artifacts with traceable provenance.

    ---

    1. Artifact types and what they are

    SciDEX produces eight artifact classes. Each is a discrete node in the world-model graph.

    | Class | One-line definition | Upstream of | Downstream of |
    | --- | --- | --- | --- |
    | Gap | A specific, actionable deficit in the world model (unknown mechanism, contested claim, uncovered population, missing connection) | experiments, inventions, targets | landscape analyses |
    | Landscape analysis | A living map of a scientific field — clusters, gaps, trends, unknowns | gaps | corpus snapshots, papers |
    | Invention | A novel concept, mechanism, method, or design that plausibly closes a gap | experiments, discoveries, patents | gaps, landscape analyses |
    | Experiment | A testable protocol with an expected information gain | discoveries | hypotheses, inventions, targets |
    | Hypothesis | A falsifiable scientific claim at a defined confidence | experiments, discoveries | debates, gaps |
    | Target | A disease target / pathway / molecule worth investigating | experiments, inventions | gaps, landscape analyses |
    | Discovery | A surprising, reproducible finding | papers, follow-up experiments | experiments, hypotheses |
    | Paper | A composed narrative that bundles discoveries + methods + context | citations, subsequent papers | discoveries, experiments |
    "Paper" is the public-facing composition of the other seven. A showcase paper is the canonical demo of end-to-end value.

    ---

    2. The economy — six signal sources that price every artifact

    Every artifact gets a composite value computed from six underlying signals. Each signal is independently produced; the composition is a learned weighted sum whose weights are themselves artifacts (meta-inventions tuned by epistemic rigor).

  • Gap signal. How well does the artifact close an identified gap? Needs the gap to exist and a projection that links the artifact to it.
    - Producer: quest_gaps + Atlas world-model graph edges.
    - Range: 0 … 1 (fraction of a named gap closed, upper-bounded at 1).
  • Landscape signal. Is this trodden ground or genuinely new? Requires a current landscape analysis covering the artifact's domain.
    - Producer: quest_landscape_analyses.
    - Range: 0 … 1 (1 = no prior art in the mapped literature, 0 = saturated).
  • Market signal. What do market participants bid when the artifact is listed?
    - Producer: Exchange quest + Market Participants quest (existing).
    - Range: 0 … ∞ (tokens).
  • Adversarial signal. Does it survive red-team challenge by the Senate?
    - Producer: Adversarial Science quest (existing).
    - Range: 0 … 1 (1 = cleanly survived, 0 = refuted).
  • Evolutionary signal. Does it win in arena tournaments against peer artifacts of the same class?
    - Producer: Evolutionary Arenas quest (existing — Elo over pairwise judgments).
    - Range: Elo score (1200 baseline, ±400 range).
  • Utility signal. Does the artifact measurably improve downstream artifacts or benchmarks when used?
    - Producer: quest_experiments (runs a planned utility test), Forge benchmarks.
    - Range: 0 … ∞ (domain-specific — e.g. % improvement on a benchmark, citations, deployed uses).

    Composite value V(artifact) = Σ w_i · normalize(signal_i) where weights are model artifacts owned by Epistemic Rigor. The weights themselves compete on a meta-arena (Elo among weight-vectors based on which vectors best predict long-horizon utility). This is what makes the system self-improving rather than hand-tuned.

    V is probability-like — it lives in [0, 1] and measures belief / quality / resolution-likelihood. It does NOT measure how much is at stake. A gap with V=0.9 and a gap with V=0.9 that is a thousand times more important look identical under V alone. §2a fixes that.

    ---

    2a. Size, market cap, volume, liquidity — the other axis

    V answers "how valid / probable is this artifact?". It says nothing about magnitude. To distinguish a plausible footnote from a plausible paradigm shift we add three class-calibrated dimensions to every artifact:

    2a.1 Size (a.k.a. impact)

    S(artifact) ∈ [0, ∞) — a scalar in class-appropriate units that estimates what's at stake if the artifact resolves positively. Size is independent of whether it WILL resolve (that's V's job). Size is a pure upside question.

    Per-class definition + units:

    ClassS unitHow to estimate
    Hypothesisexpected-citations-per-year (epy) at the 5-year markRegression on existing hypothesis cohort (citation-curve percentile × novelty)
    Gapfraction_of_world_model_improved × domain_weightWorld-model graph reach from the gap (centrality) × operator-set domain weights
    Inventionpotential applications — deployments_p10 … deployments_p90 (lognormal)Analogy to closest N prior inventions in the same landscape cell
    Experimentexpected_information_gain_bits × downstream_artifacts_enabledIIG from the existing spec × fan-out over the world model
    Discoverynovelty × reach (both [0, 1]; product in [0, 1] — note this class saturates)Embedding distance to nearest K discoveries × paper-cite fan-out
    Paperprojected-citations at 5-year mark (uses hypothesis estimator)Same estimator as hypothesis; paper's wrapper character gives more signal
    Targetdruggability × unmet_medical_needDruggability score (existing) × WHO/regulatory/disease-burden numbers
    Landscape analysisdomain_coverage × downstream_gap_rateFraction of domain mapped × gaps-produced-per-refresh
    Size is computed once at admission time and recomputed on weekly meta-arena cycles (so S drifts as the field evolves). The estimator for each class is itself a model artifact under Epistemic Rigor; competing estimators face off in a size-meta-arena just like the composite-weight vectors.

    2a.2 Market capitalization

    MarketCap(artifact) = V(artifact) × S(artifact).

    This is the expected impact-weighted value — probability × magnitude. It's the single scalar that answers "which artifacts should get the most agent-capacity?" more faithfully than V alone. The showcase UI's default ranking switches from V to MarketCap once size estimators exist for every class. V remains visible as the confidence component.

    Two artifacts with identical V:

    • paper-class showcase A: V = 0.85, S = 200 epy → MC = 170
    • paper-class showcase B: V = 0.85, S = 3 epy → MC = 2.55

    A dominates B under MarketCap even though they tie on V.

    2a.3 Open interest (shares outstanding analogue)

    OpenInterest(artifact) = total tokens committed across open market participant positions (sum of stakes on both YES and NO sides). This is the conviction dimension — it measures how much capital the market has bet for or against this artifact. High open interest + low V means "the market strongly disagrees with this artifact" rather than "nobody has paid attention".

    OI grows when new participants enter; decays when positions close at resolution. Stored per-artifact in the existing exch-qm-01-MEXT_extend_market_pricing_spec.md market rows.

    2a.4 Volume + liquidity

    • Volume_24h(artifact) = total tokens exchanged in bids/asks in the last 24 hours. Measures attention independent of conviction — an artifact can have high OI with zero recent volume (stable consensus) or low OI with high volume (new + thinly-traded).
    • Liquidity(artifact) = effective depth — the LMSR-b parameter for this artifact's market × pool-tokens. Proxy for "how much can be bet before the price moves materially". Low-liquidity artifacts' prices are noisy; the scheduler SHOULD NOT treat them as well-calibrated until liquidity exceeds a class floor.
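
The role of the LMSR b parameter as a depth proxy can be illustrated directly — a sketch of the standard LMSR price function, not the platform's actual market code:

```python
import math

def lmsr_price(q_yes, q_no, b):
    """Instantaneous YES price under a logarithmic market scoring rule.
    Larger b = deeper market = smaller price move per token bet."""
    e_yes, e_no = math.exp(q_yes / b), math.exp(q_no / b)
    return e_yes / (e_yes + e_no)

# The same 10-token YES position moves a thin market far more than a deep one.
thin = lmsr_price(10, 0, b=5)    # price jumps well past 0.5
deep = lmsr_price(10, 0, b=100)  # price barely moves off 0.5
```

This is why the scheduler should distrust prices below the class liquidity floor: with small b, a single position swings the price, so the price is noise rather than calibrated belief.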

    2a.5 Derived rankings (UI + scheduler)

    The showcase UI and the quest scheduler consume the following rankings, each answering a different question:

    | Ranking | Formula | Question it answers |
    | --- | --- | --- |
    | by_market_cap | V × S | Where should the most agent-capacity go? |
    | by_size_moonshot | S / (V + 0.1) | Which long-shots carry the most upside if we're wrong about probability? |
    | by_volume | Volume_24h | What's getting attention right now? |
    | by_conviction | OpenInterest | What does the market believe strongest in (either direction)? |
    | by_v_alone | V | The previous default — still useful for confidence-only views. |
    The default /showcase tab sorts by market_cap; alternate tabs expose the others. The scheduler's Phase A seeding in §3 is switched to market_cap × inverse_stock × capacity_available — previously it was urgency × novelty × capacity, which conflated size and probability.
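
The five orderings are mechanical once the per-artifact fields exist. A minimal sketch — the dict field names (`V`, `S`, `volume_24h`, `open_interest`) are assumptions, not the real schema:

```python
def rankings(artifacts):
    """Derive the five showcase orderings from per-artifact fields."""
    by = lambda key: sorted(artifacts, key=key, reverse=True)
    return {
        "by_market_cap":    by(lambda a: a["V"] * a["S"]),
        "by_size_moonshot": by(lambda a: a["S"] / (a["V"] + 0.1)),
        "by_volume":        by(lambda a: a["volume_24h"]),
        "by_conviction":    by(lambda a: a["open_interest"]),
        "by_v_alone":       by(lambda a: a["V"]),
    }

# The two paper-class showcases from §2a.2: identical V, very different S.
a = {"id": "A", "V": 0.85, "S": 200, "volume_24h": 10, "open_interest": 5}
b = {"id": "B", "V": 0.85, "S": 3,   "volume_24h": 50, "open_interest": 9}
r = rankings([a, b])
```

A tops `by_market_cap` (170 vs 2.55) while B tops `by_volume`, showing why the tabs answer different questions even over the same two artifacts.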

    2a.6 Calibration + drift

    Each size estimator is tested once per week against realized outcomes (paper citations actually accrued, experiments whose IIG was measurable in hindsight, etc.). An estimator whose S-predictions diverge ≥2σ from realizations across a rolling window gets deprecated and the second-place estimator in its meta-arena gets promoted. This is the same self-improving pattern as the composite-weight vector in §2.
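
One reading of the ≥2σ rule, as a sketch: treat the window's prediction residuals as a sample and deprecate when their mean sits at least two standard errors from zero. The exact statistic is an assumption to be pinned down in a child spec:

```python
import statistics

def estimator_drifted(predicted, realized, k=2.0):
    """Flag a size estimator whose mean prediction error over the rolling
    window is >= k standard errors from zero (one reading of the 2-sigma rule)."""
    residuals = [p - r for p, r in zip(predicted, realized)]
    if len(residuals) < 2:
        return False  # too little data to judge drift
    stderr = statistics.stdev(residuals) / len(residuals) ** 0.5
    if stderr == 0:
        return abs(statistics.mean(residuals)) > 0
    return abs(statistics.mean(residuals)) / stderr >= k

# A systematically over-predicting estimator trips the check;
# a well-calibrated one does not.
drifted = estimator_drifted([100, 120, 110, 130], [10, 12, 11, 13])
ok      = estimator_drifted([10, 12, 11, 13], [10.5, 11.5, 11, 13])
```

When `drifted` fires, the second-place estimator in the size-meta-arena is promoted in its place.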

    2a.7 Anti-gaming

    Four guardrails keep the market-cap axis from being manipulable:

  • Size estimator is owned by Epistemic Rigor, not the artifact's originator. You can't inflate your own artifact's size field.
  • OpenInterest is weighted by participant believability (existing exch-qm-02-PART — market participant accuracy track). A novice's shares count less than a proven-accurate participant's.
  • Volume has a spam floor — wash trading by the same participant within 10 minutes is deduplicated into one "intent" position before being added to Volume_24h.
  • Size decays on repeated non-resolution. If an artifact sits at V < 0.3 for 4 consecutive weekly windows without new evidence, S is damped by 0.8 per window; prevents forever-unresolved grandiose claims from hoarding capacity.
    2a.8 Where the fields live

    • S + MarketCap + OpenInterest + Volume_24h + Liquidity → new columns on the artifact row (or JSON within payload_json for artifact classes that don't have dedicated tables yet).
    • LMSR market rows (per the existing exch-qm-01-MEXT_extend_market_pricing_spec.md) now carry open_interest, volume_24h, liquidity_b, and the new size_estimate + market_cap fields derived from the linked artifact.
    • Size-estimator artifacts live under artifact_class = "size_estimator" (a sub-class of invention — they're literally inventions about measurement) so they get their own market pricing / meta-arena loop.

    ---

    3. The generation loop

    Every artifact is produced by one four-phase loop. The phases are the same regardless of artifact class; the inputs and acceptance criteria differ.

    Phase A — SEEDING
      Select (gap, landscape cell) pair to work on.
      Priority = urgency_from_gap × novelty_from_landscape × capacity_available
      Emits: proposal prompt + context bundle.
    
    Phase B — MULTI-AGENT DEBATE
      N agents with differentiated roles: Proposer, Critic, Synthesizer, Red-Teamer.
      Constrained rounds (4 default; see agora_debate_coverage specs).
      World-model context (Atlas) threaded into every round.
      Emits: candidate artifact + debate transcript + confidence.
    
    Phase C — ADVERSARIAL + MARKET
      Senate red-team runs standardized challenges against the candidate.
      Market participants bid on composite value.
      Arena tournament if there are ≥2 candidates in the same cell.
      Emits: adversarial_score, market_bid, arena_elo.
    
    Phase D — ITERATE OR RETIRE
      If V(artifact) exceeds the cell's current floor, it replaces the
      incumbent and becomes the new floor.
      If it's within the retry budget and below floor, feed the critique
      back into Phase A for a second iteration.
      If all budget burned and still below floor, retire to the archive
      (still indexed, still citable).

    The loop explicitly requires multiple agents and multiple iterations before an artifact is admitted. Tasks generated for this loop use task_type=multi_iter with fields max_iterations, required_participants, debate_rounds — see multi_iter_debate_tasks_spec.md (new).

    ---

    4. Quest choreography

    The quests compose as follows; each arrow represents data or task-generation flowing downstream.

    quest_landscape_analyses
             ↓  (cells + empty regions)
           quest_gaps
             ↓  (gap queue, prioritized)
       ┌─────┴──────┐
       ↓            ↓
     quest_         quest_
     inventions    experiments
       ↓            ↓
       └──► Artifact ◄─── market_participants (bid)
                │      ◄─── adversarial_science (red-team)
                │      ◄─── evolutionary_arenas (Elo)
                ↓
         composite value V
                ↓
         showcase / retire

    quest_gaps is the funnel. Gap quality is the most important input quality gate in the system — garbage-in yields garbage artifacts. The existing quest_gap_factory, gap_quality_scoring, gap_priority_debate_tasks, gap_governance_review_tasks, gap_prediction_markets specs are all load-bearing and stay; this spec adds the wiring that says gaps MUST be tagged with (domain, layer, confidence, expected_value) before they're dequeued by a downstream quest.

    quest_landscape_analyses is new. It scans corpora by field (Atlas literature index) and emits a living map. It is the ONLY gap-source that is allowed to manufacture truly novel gaps — the other gap-generators (debate-triggered, watchdog-triggered) reinforce existing gaps rather than discovering new territory. Without landscape, the system pattern-matches on what it already knows.

    quest_inventions and quest_experiments are new downstream quests. They are the most capacity-hungry and get the most agent slots once unpaused.

    ---

    5. Task shape — multi-iteration, multi-agent

    CI-style "one-shot script runs once" tasks are banned from the four downstream quests. Tasks in those quests instead carry:

    • task_type = multi_iter
    • max_iterations (default 3)
    • required_roles = ["proposer", "critic", "synthesizer", "red_teamer"]
    • artifact_class one of {invention, experiment, hypothesis, target, discovery, paper, gap, landscape}
    • target_cell = (domain, gap_id) the task is working in
    • acceptance_criteria = list of measurable thresholds (arena Elo ≥ baseline+50, adversarial ≥ 0.6, market bid ≥ median, etc.)

    Tasks run until any of:
    • all acceptance criteria met → artifact admitted
    • max_iterations reached → retire-and-archive
    • cell superseded by another task's winner → abandon cleanly

    This replaces the current one_shot default where a worker runs once, produces whatever, and closes. The new shape forces convergence.
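
The task shape and the three exit conditions above can be sketched as a dataclass. Field names come from §5; the types, defaults, and method are assumptions, not the real schema:

```python
from dataclasses import dataclass, field

@dataclass
class MultiIterTask:
    """Task row for the four downstream quests (field names per section 5)."""
    task_type: str = "multi_iter"
    max_iterations: int = 3
    required_roles: tuple = ("proposer", "critic", "synthesizer", "red_teamer")
    artifact_class: str = "invention"   # one of the classes listed in section 5
    target_cell: tuple = ("", "")       # (domain, gap_id)
    acceptance_criteria: list = field(default_factory=list)
    iterations_used: int = 0

    def should_stop(self, criteria_met: bool, cell_superseded: bool):
        """The three termination conditions from section 5, checked in order."""
        if criteria_met:
            return "admitted"
        if self.iterations_used >= self.max_iterations:
            return "retire_and_archive"
        if cell_superseded:
            return "abandon"
        return None  # keep iterating

t = MultiIterTask(acceptance_criteria=["arena_elo >= baseline+50"])
first = t.should_stop(criteria_met=False, cell_superseded=False)  # keep going
t.iterations_used = 3
last = t.should_stop(criteria_met=False, cell_superseded=False)   # budget burned
```

The point of the dataclass framing is that "runs until" is a property of the row, not of the worker: any worker picking up the task can evaluate `should_stop` and converge or retire deterministically.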

    ---

    6. Task generation — only one generator

    Per the 2026-04-24 directive: only quest task generation runs as a CI/cron job. All other recurring task-generators (watchdog auto-repair, CI checks, broken-link scanners, CI self-maintenance, stub audit) are paused. They can be re-enabled later; not now.

    The sole survivor is quest_engine.py (or equivalent) which:

    • polls the gap queue from quest_gaps
    • assigns open gaps to quest_inventions / quest_experiments based on gap tag
    • creates one multi_iter task per gap-cell per capacity slot
    • re-prioritizes tasks by V(expected) whenever the composite-value model changes

    Everything else stops generating tasks. Existing tasks that are in flight continue; new noise doesn't get created.
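
One cycle of the surviving generator might look like the following sketch. Every name here (`quest_engine_cycle`, the gap dict fields, the `create_task` and `quest_for_tag` callbacks) is a hypothetical stand-in for the real queue and DB calls:

```python
def quest_engine_cycle(gap_queue, capacity_slots, create_task, quest_for_tag):
    """Poll gaps, route each to a downstream quest by tag, and emit one
    multi_iter task per gap-cell per free capacity slot (section 6)."""
    created = []
    for gap in gap_queue:
        if capacity_slots <= 0:
            break  # no free slots: remaining gaps wait for the next cycle
        quest = quest_for_tag(gap["tag"])  # inventions vs. experiments
        created.append(create_task(quest=quest,
                                   task_type="multi_iter",
                                   target_cell=(gap["domain"], gap["id"])))
        capacity_slots -= 1
    return created

# Two gaps, one slot: only the first gap gets a task this cycle.
gaps = [{"tag": "mechanism", "domain": "neuro", "id": "g1"},
        {"tag": "protocol",  "domain": "neuro", "id": "g2"}]
route = lambda tag: "quest_experiments" if tag == "protocol" else "quest_inventions"
made = quest_engine_cycle(gaps, 1, lambda **kw: kw, route)
```

Re-prioritization under a new composite-value model would simply reorder `gap_queue` before the next cycle; the loop itself stays dumb.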

    ---

    7. Showcase artifacts — what they are, how they surface

    For each of the eight artifact classes, the system maintains ≥2 showcase artifacts at all times. A showcase artifact has:

    • composite value V above the floor for that class, stable across ≥3 consecutive weekly meta-arena runs
    • complete provenance chain: gap → landscape cell → debate transcript → adversarial outcome → market bids → arena matchups → composite score
    • a narrative wrapper (a paper artifact, short form — see papers class) explaining why a non-expert should care
    • a utility demonstration — a concrete application where using this artifact improved some measurable thing

    Showcase artifacts are the public face of SciDEX. They are pinned in the UI. Their provenance chain is fully drillable — clicking the invention opens the debate that produced it, the gap it closes, the landscape cell it occupies, and the arena history that ranked it there.

    7.1 Model artifacts

    A subset of showcase artifacts are model artifacts — ones whose utility is so clearly measurable that we mint them as references. The weight-vector for the composite-value function is one such model artifact. So are: the best-of-class invention that closed the biggest-impact gap, the experiment whose information-gain-per-dollar is highest, the landscape analysis most cited by other quests. Model artifacts get their own badge in the UI and are exempt from retirement as long as their signals hold.

    ---

    8. UI — showcase surface

    Scope for the companion spec showcase_artifact_ui_spec.md (new). Summary here for context:

    • Top-nav tab /showcase with one tab per artifact class plus a cross-class "Model artifacts" tab.
    • Each tab renders a card grid of the class's showcase artifacts. Card shows: name, one-line value prop, composite score V with bar, an icon strip for the six signals (gap / landscape / market / adversarial / arena / utility) each green/amber/red.
    • Detail view: full provenance chain as a vertical timeline with drill-ins (gap card → landscape map → debate transcript with round-by-round role attribution → adversarial challenges with pass/fail → market bid history → arena matchups table).
    • Cross-class "economy" dashboard at /showcase/economy: plots the weight-vector artifact's current weights, the floor values by class, the top-N rising artifacts per class, and the list of open gaps that have no artifact addressing them yet (prioritized).

    ---

    9. What changes in existing specs

    This spec references the existing specs; it does not replace them. Specific integration points:

    • exch-qm-03-LIFE_artifact_lifecycle_spec.md — artifact states (draft/debate/admitted/showcased/retired) align with this spec's Phase A-D outputs.
    • exch-qm-01-MEXT_extend_market_pricing_spec.md — the market signal in §2 feeds that pricing implementation.
    • quest_gap_factory_spec.md + siblings — the quest_gaps funnel is the union of those specs; no new spec needed there, just the task-shape requirements (gap must carry (domain, layer, confidence, expected_value) before it leaves the funnel).
    • q-ai-tools-landscape_spec.md — existing landscape spec is scoped to AI tools; quest_landscape_analyses generalizes the pattern to all scientific fields. That spec becomes a specialized case.
    • agora_debate_coverage + debate_quality_scoring specs — the Phase B loop in §3 uses these directly.
    • evolutionary_arenas quest — the arena signal in §2 is what that quest already produces.

    ---

    10. Milestones (first 30 days after unpause)

  • Week 1. quest_landscape_analyses emits its first 3 landscape analyses (molecular biology, neuroscience, clinical genetics). Each surfaces ≥10 tagged gaps into quest_gaps.
  • Week 2. quest_inventions + quest_experiments each run ≥20 multi-iter tasks against the surfaced gaps. At least 2 artifacts per class cross the admission floor.
  • Week 3. Showcase UI /showcase lands with ≥2 showcase artifacts per class, provenance chain drillable. Composite-value weight-vector artifact pinned as a model artifact.
  • Week 4. Meta-arena over weight-vectors runs first round; the winning weight-vector replaces the seed vector. Gap queue is re-prioritized under new weights. First cycle of self-improvement demonstrated.
  • If any milestone slips, the retrospective output is itself a landscape analysis + gap in the agent_ecosystem quest — the system uses its own machinery to improve itself.

    ---

    11. Open questions (to resolve in child specs)

    • How is the Phase A seeding priority score computed in practice, and where does it live? (Likely a view over tasks + gaps + artifacts.)
    • What stops a market participant from front-running the composite-value computation? (Bond-weighted bids, delayed reveal, or both — belongs in quest_market_participants_spec_v2.md.)
    • How does retirement interact with existing citations? Retired artifacts keep URLs; they stop being recommended.
    • Should Forge benchmarks be their own artifact class or a sub-class of target? (Proposed: sub-class. Revisit if benchmark valuation diverges meaningfully from target valuation.)
    • What does the CI-task whitelist look like as a concrete list? (See §6 — to be populated by the follow-up task-triage survey.)

    Child specs will resolve each of these.