[Forge] CI: Experiment claim driver — pick high-IIG experiments for execution

Recurring per quest_experiment_execution_participant_spec.md. Predicate: artifact_type='experiment' AND feasibility_score>=0.6 AND iig_per_dollar>=floor AND execution_mode='in_silico' AND qc_status='passed'. Batch: 5 claims/cycle. Writes an experiment_claims row per candidate with a 24h soft-lock, then spawns an iterative task per claim that the experiment-executor agent picks up.

Goal

Close the loop: SciDEX proposes falsifiable in-silico experiments
(via quest_experiments_generation_spec.md + quest_inventions_spec.md),
and an agent — operating as a participant in the SciDEX economy — claims
high-value, feasible ones, executes them in a sandbox, commits artifacts,
records results, and earns tokens. The system observes its own debate /
evidence-percolation / market-settlement loop end-to-end with real data
flowing through it.

This is core to SciDEX's reason-to-exist: a machine that prioritizes,
funds, executes, debates, and rewards scientific work, not just one that
generates proposals.

> ## Continuous-process anchor
>
> Two recurring sub-processes:
> 1. Claim driver — find high-value claimable experiments, route to
> capable agents, write claim rows (gap-predicate, bounded batch)
> 2. Result percolation driver — when an execution finishes, push
> results into hypothesis Bayesian update + market settlement +
> debate enrollment
>
> Execution itself is performed inside iterative tasks per claim — not
> a recurring driver, but a one-shot iterative artifact-producing task.

Why now

Today, SciDEX generates experiment proposals (788+ active experiments per
quest_experiment_extraction_spec.md), but very few are actually run through
SciDEX itself. Most "execution" is human researchers reading proposals and
running experiments offline, with results never flowing back into the system.

That breaks the compounding-value thesis. Every executed-and-validated
experiment should:

  • Mint tokens for the agent that ran it
  • Update hypothesis Bayesian scores
  • Settle market positions on linked predictions
  • Create artifacts that downstream analyses can depend on
  • Trigger debates on surprising results
  • Strengthen or weaken evidence for parent claims

If we can demonstrate this loop with even 5-10 experiments per week,
we exercise every Senate / Exchange / Atlas mechanic and prove the
incentive design.

Scope: what experiments qualify

Only in-silico, on-VM-feasible experiments for now (later extensions
may include cloud GPU and physical lab via Ginkgo / OpenTrons / Adaptyv).

Eligibility predicate:

SELECT * FROM artifacts
WHERE artifact_type = 'experiment'
  AND (metadata->>'feasibility_score')::numeric >= 0.6
  AND (metadata->>'iig_per_dollar')::numeric >= (SELECT current_floor FROM iig_config)
  AND metadata->>'execution_mode' = 'in_silico'
  AND (metadata->>'cost_estimate_usd')::numeric <= 5.00  -- conservative
  AND id NOT IN (SELECT experiment_artifact_id FROM experiment_claims
                  WHERE status IN ('claimed', 'running', 'completed'))
  AND qc_status = 'passed'                               -- must be vetted
ORDER BY (metadata->>'iig_per_dollar')::numeric DESC
LIMIT 20;

Out of scope (this quest): wet-lab, animal model, clinical trials,
cloud-only HPC. Those need additional infrastructure
(quest_analysis_sandboxing_spec.md extensions).

The participant agent

A new actor row, registered once at quest bootstrap:

INSERT INTO actors (id, actor_type, display_name, permissions, capabilities)
VALUES (
  'agent-experiment-executor-001',
  'ai_local',
  'Experiment Executor (default)',
  'contributor',
  '{"executes_in_silico": true, "max_runtime_seconds": 1800,
    "available_tools": ["scanpy","pydeseq2","biopython","reactome",
                        "string","gtex","alphafold","..."]}'::jsonb
);

INSERT INTO token_accounts (account_id, balance, total_earned, total_spent)
VALUES ('agent-experiment-executor-001', 1000, 0, 0);

The agent has its own ledger account (1000-token initial endowment per
quest_capital_markets_spec.md). It earns tokens for successful work and can
later spend them to prioritize certain experiments.

Multiple executor instances can be registered later (specialized:
"Executor (genomics)", "Executor (proteomics)"). v1 is one generalist.

End-to-end loop

Phase A — Claim

  • Recurring driver [Forge] CI: Experiment claim driver (every-2h):
    - Predicate from "Scope" above
    - Batch: 5 claims/cycle
    - For each candidate: write an experiment_claims row with
      status='claimed' and a 24h soft-lock (see the sketch after the
      table below)
  • Claims emit a Senate event so other agents/humans see it
  • If a claim expires unfulfilled, its status flips to 'expired' and the
    experiment is freed for the next cycle

CREATE TABLE experiment_claims (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_artifact_id UUID NOT NULL REFERENCES artifacts(id),
  claimant_actor_id TEXT NOT NULL REFERENCES actors(id),
  status TEXT NOT NULL CHECK (status IN
    ('claimed','running','completed','failed','expired','cancelled')),
  claimed_at TIMESTAMPTZ DEFAULT NOW(),
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  expires_at TIMESTAMPTZ DEFAULT (NOW() + interval '24 hours'),
  result_artifact_id UUID REFERENCES artifacts(id),
  failure_reason TEXT
);
CREATE INDEX idx_experiment_claims_status ON experiment_claims(status);
CREATE UNIQUE INDEX idx_experiment_claims_active
  ON experiment_claims(experiment_artifact_id)
  WHERE status IN ('claimed','running');
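
A minimal sketch of one claim-driver cycle, assuming the eligibility
predicate above is wrapped in a view named runnable_experiments (that view
name is illustrative, not an existing object; :params are placeholders):

-- Claim up to 5 candidates this cycle; the partial unique index above
-- guards against double-claims if two driver runs race.
INSERT INTO experiment_claims (experiment_artifact_id, claimant_actor_id, status)
SELECT id, 'agent-experiment-executor-001', 'claimed'
FROM runnable_experiments
ORDER BY (metadata->>'iig_per_dollar')::numeric DESC
LIMIT 5
ON CONFLICT DO NOTHING;

-- Free unfulfilled claims so the next cycle can re-offer them.
UPDATE experiment_claims
SET status = 'expired'
WHERE status = 'claimed' AND expires_at < NOW();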

Phase B — Execute

For each claim, an iterative task is created in Orchestra:

  • Title: [Forge] Execute experiment: <experiment_title>
  • Task type: iterative with max_iterations=10
  • Provider: any (codex preferred for code-heavy work)
  • spec_path: this spec
  • Payload: {"claim_id": "...", "experiment_artifact_id": "..."}

The agent picks up the task and reads:

  • The experiment artifact (protocol, predicted outcome, methods)
  • The linked hypothesis (what we're trying to confirm/falsify)
  • Available Forge tools matching the protocol's methods
  • Real datasets (via quest_real_data_pipeline_spec.md)

Then it:

  • Sets up a sandbox (quest_analysis_sandboxing_spec.md)
  • Writes execution code (notebook + scripts)
  • Runs the analysis → produces figures, tables, derived datasets
  • Records:
    - Actual outcome vs predicted outcome
    - Confidence in execution (was data adequate? methods correctly applied?)
    - Surprising findings (anything unanticipated)
  • Commits artifacts via commit_artifact():
    - Primary: notebook
    - Accessories: figures, output tables, manifest
    - parent_artifact_id = experiment_artifact_id
    - Sets up provenance symlinks
  • Updates the claim row: status='completed', result_artifact_id=...
    (see the sketch after this list)
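
A minimal sketch of the claim-row transitions the executor performs around
execution (:claim_id and :result_artifact_id are placeholders):

-- Mark the claim as running when the sandbox starts.
UPDATE experiment_claims
SET status = 'running', started_at = NOW()
WHERE id = :claim_id AND status = 'claimed';

-- After commit_artifact() returns the result artifact, close the claim.
UPDATE experiment_claims
SET status = 'completed', completed_at = NOW(),
    result_artifact_id = :result_artifact_id
WHERE id = :claim_id AND status = 'running';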

Phase C — Result percolation

Recurring driver [Senate] CI: Experiment result percolator (every-1h, pri 93):

For each newly-completed claim:

  • Insert into the experiment_results table:

CREATE TABLE experiment_results (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_artifact_id UUID REFERENCES artifacts(id),
  result_artifact_id UUID REFERENCES artifacts(id),
  hypothesis_id TEXT,
  predicted_outcome JSONB,
  actual_outcome JSONB,
  outcome_class TEXT CHECK (outcome_class IN
    ('confirmed','disconfirmed','partially_confirmed','inconclusive','technical_failure')),
  effect_size REAL,
  effect_direction TEXT,
  prediction_calibration_score REAL,  -- 0-1, how well prediction matched
  surprise_score REAL,                -- 0-1, novelty/unexpectedness
  recorded_at TIMESTAMPTZ DEFAULT NOW(),
  recorded_by_actor_id TEXT
);

  • Bayesian hypothesis update: the linked hypothesis's composite_score is
    updated based on outcome (see the sketch after this list):
    - confirmed + high calibration → score increases
    - disconfirmed + high calibration → score decreases (more learning!)
    - inconclusive → score unchanged, logged
    - technical_failure → no score impact, retry eligible
  • Market settlement: any open positions on linked predictions
    (hypothesis_predictions) settle at the determined outcome.
  • Debate enrollment:
    - Surprise score > 0.7 → auto-enroll in [Agora] Multi-participant
      debate orchestration
    - Disconfirmed result → auto-enroll a Skeptic-led debate
    - Otherwise → optional debate enrollment if Forge or Senate flags it
  • Token reward: see "Phase D — Reward" below.
  • Evidence percolation: result artifact added as evidence_for or
    evidence_against on the parent hypothesis.
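
A minimal sketch of the percolator's write path, assuming a hypotheses
table with a composite_score column; the ±0.05 step is purely illustrative
(the real Bayesian update lives in the hypothesis service), and :params
are placeholders filled from the result artifact's manifest:

-- Record the outcome.
INSERT INTO experiment_results
  (experiment_artifact_id, result_artifact_id, hypothesis_id,
   predicted_outcome, actual_outcome, outcome_class,
   prediction_calibration_score, surprise_score, recorded_by_actor_id)
VALUES
  (:experiment_artifact_id, :result_artifact_id, :hypothesis_id,
   :predicted, :actual, :outcome_class, :calibration, :surprise,
   :recorded_by_actor_id);

-- Nudge the linked hypothesis score in the direction the outcome implies.
UPDATE hypotheses
SET composite_score = composite_score +
    CASE :outcome_class
      WHEN 'confirmed'    THEN  0.05 * :calibration
      WHEN 'disconfirmed' THEN -0.05 * :calibration
      ELSE 0
    END
WHERE id = :hypothesis_id;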

Phase D — Reward

Token mint to the executor agent's account:

| Outcome | Base mint | Multiplier |
|---------|-----------|------------|
| confirmed (matches prediction) | 50 | × calibration_score |
| disconfirmed (overturns prediction) | 70 | × calibration_score (more valuable!) |
| partially_confirmed | 30 | × calibration_score |
| inconclusive (faithful execution, ambiguous result) | 20 | × execution_confidence |
| technical_failure | 0 (or 5 if root cause documented) | n/a |

Plus first-mover bonus: × 2 if this was the first execution attempt of
this experiment.

Plus reuse royalty: each downstream artifact citing this result mints
back-prop tokens at 15% × (0.33 ^ (depth-1)) per
quest_capital_markets_spec.md.

Plus debate-quality bonus: if the result triggers a high-quality debate
(judged ≥ 0.7 quality), the executor earns 10 additional tokens.

All ledger entries go via POST /api/ledger/mint with reason
experiment_executed and reference_id pointing to experiment_results.id.
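
An illustrative calculation of the base mint for one result, matching the
table above. :execution_confidence and :is_first_execution are stand-ins
for values carried in the result manifest, not existing columns, and the
actual mint still goes through POST /api/ledger/mint:

SELECT
  CASE r.outcome_class
    WHEN 'confirmed'           THEN 50 * r.prediction_calibration_score
    WHEN 'disconfirmed'        THEN 70 * r.prediction_calibration_score
    WHEN 'partially_confirmed' THEN 30 * r.prediction_calibration_score
    WHEN 'inconclusive'        THEN 20 * :execution_confidence
    ELSE 0
  END
  * CASE WHEN :is_first_execution THEN 2 ELSE 1 END AS mint_amount
FROM experiment_results r
WHERE r.id = :result_id;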

Why this exercises the full system

Within each cycle, this quest touches:

  • Atlas: experiment artifact lookup, result artifact creation, KG
    update via evidence_links
  • Forge: tool invocation, sandbox execution, real data pipeline
  • Agora: surprise-triggered debate enrollment
  • Exchange: market settlement on linked predictions, token mint
  • Senate: contribution credit, QC debate on result artifact, reuse
    tracking, governance event log

All five layers receive real signal flow with real data attached. This
is the test bed for every economic and percolation mechanic — if a
mechanic is broken, this quest reveals it because real tokens are at stake.

Failure modes & safeguards

| Failure | Response |
|---------|----------|
| Sandbox timeout | Mark technical_failure; claim expires; agent gets minimal credit if root cause documented |
| LLM-generated code crashes | Iterative task retries up to max_iterations |
| Result file too large (>100MB) | Reject; require summarization first |
| Agent fabricates results | QC debate catches it; result artifact qc_status='failed'; tokens clawed back from agent |
| Multiple agents claim the same experiment | Unique index on experiment_claims(experiment_artifact_id) WHERE status IN ('claimed','running') prevents it |
| Experiment is actually wet-lab | Eligibility predicate filters it out; if one slips through, the agent rejects it in iteration 1 as "out of scope" |

Token clawback: if QC reveals fabrication or methodological fraud, the
result artifact is marked qc_status='failed' and a clawback ledger entry
burns the awarded tokens. Repeat offenses → actor demoted / suspended.

Surfaces

  • GET /participant/leaderboard — agents ranked by cumulative
    experiment_executed token earnings
  • GET /experiments/runnable — currently claimable experiments,
    sorted by IIG/$
  • GET /experiments/<id>/result — result artifact + outcome
  • GET /agent/<id>/runs — actor's execution history
  • Dashboard widget: "Experiments executed today / week / total"

Acceptance criteria

☐ Schema applied (experiment_claims, experiment_results)
☐ Executor agent + token account registered
☐ Claim driver running every-2h, picking 5/cycle
☐ Result percolator running every-1h
☐ First 10 experiments executed end-to-end (claim → execute →
  result → settlement → reward)
☐ Linked hypothesis composite_score visibly updates after each result
☐ Tokens minted correctly; ledger queryable
☐ Debate enrolled for surprising results (≥1 example documented)
☐ Result artifacts pass QC pipeline (≥80% pass on first review)
☐ No fabrication incidents (or all caught by QC within 24h)
☐ After 4 weeks: ≥30 executed experiments, ≥3 disconfirmed, debate
  logs available, market settlements traceable (see the query sketch below)
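
A quick check of the 4-week counts against the tables defined above (a
sketch; counts only, without the debate-log or settlement joins):

SELECT
  COUNT(*)                                                AS executed,
  COUNT(*) FILTER (WHERE outcome_class = 'disconfirmed')  AS disconfirmed
FROM experiment_results
WHERE recorded_at > NOW() - interval '4 weeks';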

Dependencies

  • quest_artifact_uuid_migration_spec.md (Phase 1 deployed)
  • quest_artifact_metadata_semantic_spec.md (semantic search to find
    prior similar experiments)
  • quest_artifact_reuse_provenance_qc_spec.md (QC pipeline for results)
  • quest_real_data_pipeline_spec.md (real datasets to operate on)
  • quest_analysis_sandboxing_spec.md (sandbox to run in)
  • quest_experiments_generation_spec.md (source of executable experiments)
  • quest_capital_markets_spec.md (token ledger for rewards)
  • quest_market_participants_spec.md (participant model)
  • Forge tools

Dependents

  • quest_paper_replication_starter_spec.md (sister quest, reuses the same
    execution + reward infrastructure)
  • Future: cloud-GPU executor, wet-lab executor (Ginkgo bridge)

Work Log

2026-04-28 — Spec authored

Designed claim → execute → percolate → reward loop. Single executor
agent registered as system participant. In-silico-only scope;
sandboxed execution via existing infrastructure. Token economics
heavily tied to quest_capital_markets_spec.md (50/70/30/20-token
base mint × calibration); fabrication detected by QC pipeline with
clawback. End-to-end test: 10 experiments to validate the whole
mechanic before scaling executors.

Open question: should the executor agent also score-vote on its own
result before submission? (Decline at v1 — independent QC is the gate.)

Payload JSON

{
  "requirements": {
    "reasoning": 9,
    "coding": 9,
    "safety": 8
  }
}
