[Forge] CI: Experiment claim driver — pick high-IIG experiments for execution

Recurring per quest_experiment_execution_participant_spec.md. Predicate: artifact_type='experiment' AND feasibility_score>=0.6 AND iig_per_dollar>=floor AND execution_mode='in_silico' AND qc_status='passed'. Batch: 5 claims/cycle. Writes an experiment_claims row per candidate with a 24h soft-lock, then spawns an iterative task per claim that the experiment-executor agent picks up.

Goal

Close the loop: SciDEX proposes falsifiable in-silico experiments
(via quest_experiments_generation_spec.md + quest_inventions_spec.md),
and an agent — operating as a participant in the SciDEX economy — claims
high-value, feasible ones, executes them in a sandbox, commits artifacts,
records results, and earns tokens. The system observes its own debate /
evidence-percolation / market-settlement loop end-to-end with real data
flowing through it.

This is core to SciDEX's reason-to-exist: a machine that prioritizes,
funds, executes, debates, and rewards scientific work, not just one that
generates proposals.

> ## Continuous-process anchor
>
> Two recurring sub-processes:
> 1. Claim driver — find high-value claimable experiments, route to
> capable agents, write claim rows (gap-predicate, bounded batch)
> 2. Result percolation driver — when an execution finishes, push
> results into hypothesis Bayesian update + market settlement +
> debate enrollment
>
> Execution itself is performed inside iterative tasks per claim — not
> a recurring driver, but a one-shot iterative artifact-producing task.

Why now

Today, SciDEX generates experiment proposals (788+ active experiments per
quest_experiment_extraction_spec.md), but very few are actually run through
SciDEX itself. Most "execution" is human researchers reading proposals and
running experiments offline, with results never flowing back into the system.

That breaks the compounding-value thesis. Every executed-and-validated
experiment should:

  • Mint tokens for the agent that ran it
  • Update hypothesis Bayesian scores
  • Settle market positions on linked predictions
  • Create artifacts that downstream analyses can depend on
  • Trigger debates on surprising results
  • Strengthen or weaken evidence for parent claims

If we can demonstrate this loop with even 5-10 experiments per week,
we exercise every Senate / Exchange / Atlas mechanic and prove the
incentive design.

Scope: what experiments qualify

Only in-silico, on-VM-feasible experiments for now (later extensions
may include cloud GPU and physical lab via Ginkgo / OpenTrons / Adaptyv).

Eligibility predicate:

SELECT * FROM artifacts
WHERE artifact_type = 'experiment'
  AND (metadata->>'feasibility_score')::numeric >= 0.6
  AND (metadata->>'iig_per_dollar')::numeric >= (SELECT current_floor FROM iig_config)
  AND metadata->>'execution_mode' = 'in_silico'
  AND (metadata->>'cost_estimate_usd')::numeric <= 5.00  -- conservative
  AND id NOT IN (SELECT experiment_artifact_id FROM experiment_claims
                  WHERE status IN ('claimed', 'running', 'completed'))
  AND qc_status = 'passed'                               -- must be vetted
ORDER BY (metadata->>'iig_per_dollar')::numeric DESC
LIMIT 20;

Out of scope (this quest): wet-lab, animal model, clinical trials,
cloud-only HPC. Those need additional infrastructure
(quest_analysis_sandboxing_spec.md extensions).

The participant agent

A new actor row, registered once at quest bootstrap:

INSERT INTO actors (id, actor_type, display_name, permissions, capabilities)
VALUES (
  'agent-experiment-executor-001',
  'ai_local',
  'Experiment Executor (default)',
  'contributor',
  '{"executes_in_silico": true, "max_runtime_seconds": 1800,
    "available_tools": ["scanpy","pydeseq2","biopython","reactome",
                        "string","gtex","alphafold","..."]}'::jsonb
);

INSERT INTO token_accounts (account_id, balance, total_earned, total_spent)
VALUES ('agent-experiment-executor-001', 1000, 0, 0);

The agent has its own ledger account (1000-token initial endowment per
quest_capital_markets_spec.md). It earns tokens for successful work and can
later spend them to prioritize certain experiments.

Multiple executor instances can be registered later (specialized:
"Executor (genomics)", "Executor (proteomics)"). v1 is one generalist.

End-to-end loop

Phase A — Claim

  • Recurring driver [Forge] CI: Experiment claim driver (every-2h):
    - Predicate from "Scope" above
    - Batch: 5 claims/cycle
    - For each candidate: write an experiment_claims row with
      status='claimed' and a 24h soft-lock (see the sketch after the
      table below)
  • Claims emit a Senate event so other agents/humans see it
  • If a claim expires unfulfilled, its status flips to 'expired' and the
    experiment is freed for the next cycle

CREATE TABLE experiment_claims (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_artifact_id UUID NOT NULL REFERENCES artifacts(id),
  claimant_actor_id TEXT NOT NULL REFERENCES actors(id),
  status TEXT NOT NULL CHECK (status IN
    ('claimed','running','completed','failed','expired','cancelled')),
  claimed_at TIMESTAMPTZ DEFAULT NOW(),
  started_at TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  expires_at TIMESTAMPTZ DEFAULT (NOW() + interval '24 hours'),
  result_artifact_id UUID REFERENCES artifacts(id),
  failure_reason TEXT
);
CREATE INDEX idx_experiment_claims_status ON experiment_claims(status);
CREATE UNIQUE INDEX idx_experiment_claims_active
  ON experiment_claims(experiment_artifact_id)
  WHERE status IN ('claimed','running');
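
A minimal sketch of one claim-driver cycle, assuming the eligibility
predicate above is wrapped in a view named runnable_experiments (that view
name is illustrative, not an existing object; :params are placeholders):

-- Claim up to 5 candidates this cycle; the partial unique index above
-- guards against double-claims if two driver runs race.
INSERT INTO experiment_claims (experiment_artifact_id, claimant_actor_id, status)
SELECT id, 'agent-experiment-executor-001', 'claimed'
FROM runnable_experiments
ORDER BY (metadata->>'iig_per_dollar')::numeric DESC
LIMIT 5
ON CONFLICT DO NOTHING;

-- Free unfulfilled claims so the next cycle can re-offer them.
UPDATE experiment_claims
SET status = 'expired'
WHERE status = 'claimed' AND expires_at < NOW();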

Phase B — Execute

For each claim, an iterative task is created in Orchestra:

  • Title: [Forge] Execute experiment: <experiment_title>
  • Task type: iterative with max_iterations=10
  • Provider: any (codex preferred for code-heavy work)
  • spec_path: this spec
  • Payload: {"claim_id": "...", "experiment_artifact_id": "..."}

The agent picks up the task and reads:

  • The experiment artifact (protocol, predicted outcome, methods)
  • The linked hypothesis (what we're trying to confirm/falsify)
  • Available Forge tools matching the protocol's methods
  • Real datasets (via quest_real_data_pipeline_spec.md)

Then it:

  • Sets up a sandbox (quest_analysis_sandboxing_spec.md)
  • Writes execution code (notebook + scripts)
  • Runs the analysis → produces figures, tables, derived datasets
  • Records:
    - Actual outcome vs predicted outcome
    - Confidence in execution (was data adequate? methods correctly applied?)
    - Surprising findings (anything unanticipated)
  • Commits artifacts via commit_artifact():
    - Primary: notebook
    - Accessories: figures, output tables, manifest
    - parent_artifact_id = experiment_artifact_id
    - Sets up provenance symlinks
  • Updates the claim row: status='completed', result_artifact_id=...
    (see the sketch after this list)
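
A minimal sketch of the claim-row transitions the executor performs around
execution (:claim_id and :result_artifact_id are placeholders):

-- Mark the claim as running when the sandbox starts.
UPDATE experiment_claims
SET status = 'running', started_at = NOW()
WHERE id = :claim_id AND status = 'claimed';

-- After commit_artifact() returns the result artifact, close the claim.
UPDATE experiment_claims
SET status = 'completed', completed_at = NOW(),
    result_artifact_id = :result_artifact_id
WHERE id = :claim_id AND status = 'running';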

Phase C — Result percolation

Recurring driver [Senate] CI: Experiment result percolator (every-1h, pri 93):

For each newly-completed claim:

  • Insert into the experiment_results table:

CREATE TABLE experiment_results (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  experiment_artifact_id UUID REFERENCES artifacts(id),
  result_artifact_id UUID REFERENCES artifacts(id),
  hypothesis_id TEXT,
  predicted_outcome JSONB,
  actual_outcome JSONB,
  outcome_class TEXT CHECK (outcome_class IN
    ('confirmed','disconfirmed','partially_confirmed','inconclusive','technical_failure')),
  effect_size REAL,
  effect_direction TEXT,
  prediction_calibration_score REAL,  -- 0-1, how well prediction matched
  surprise_score REAL,                -- 0-1, novelty/unexpectedness
  recorded_at TIMESTAMPTZ DEFAULT NOW(),
  recorded_by_actor_id TEXT
);

  • Bayesian hypothesis update: the linked hypothesis's composite_score is
    updated based on outcome (see the sketch after this list):
    - confirmed + high calibration → score increases
    - disconfirmed + high calibration → score decreases (more learning!)
    - inconclusive → score unchanged, logged
    - technical_failure → no score impact, retry eligible
  • Market settlement: any open positions on linked predictions
    (hypothesis_predictions) settle at the determined outcome.
  • Debate enrollment:
    - Surprise score > 0.7 → auto-enroll in [Agora] Multi-participant
      debate orchestration
    - Disconfirmed result → auto-enroll a Skeptic-led debate
    - Otherwise → optional debate enrollment if Forge or Senate flags it
  • Token reward: see "Phase D — Reward" below.
  • Evidence percolation: result artifact added as evidence_for or
    evidence_against on the parent hypothesis.
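
A minimal sketch of the percolator's write path, assuming a hypotheses
table with a composite_score column; the ±0.05 step is purely illustrative
(the real Bayesian update lives in the hypothesis service), and :params
are placeholders filled from the result artifact's manifest:

-- Record the outcome.
INSERT INTO experiment_results
  (experiment_artifact_id, result_artifact_id, hypothesis_id,
   predicted_outcome, actual_outcome, outcome_class,
   prediction_calibration_score, surprise_score, recorded_by_actor_id)
VALUES
  (:experiment_artifact_id, :result_artifact_id, :hypothesis_id,
   :predicted, :actual, :outcome_class, :calibration, :surprise,
   :recorded_by_actor_id);

-- Nudge the linked hypothesis score in the direction the outcome implies.
UPDATE hypotheses
SET composite_score = composite_score +
    CASE :outcome_class
      WHEN 'confirmed'    THEN  0.05 * :calibration
      WHEN 'disconfirmed' THEN -0.05 * :calibration
      ELSE 0
    END
WHERE id = :hypothesis_id;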

Phase D — Reward

Token mint to the executor agent's account:

| Outcome | Base mint | Multiplier |
|---------|-----------|------------|
| confirmed (matches prediction) | 50 | × calibration_score |
| disconfirmed (overturns prediction) | 70 | × calibration_score (more valuable!) |
| partially_confirmed | 30 | × calibration_score |
| inconclusive (faithful execution, ambiguous result) | 20 | × execution_confidence |
| technical_failure | 0 (or 5 if root cause documented) | n/a |

Plus first-mover bonus: × 2 if this was the first execution attempt of
this experiment.

Plus reuse royalty: each downstream artifact citing this result mints
back-prop tokens at 15% × (0.33 ^ (depth-1)) per
quest_capital_markets_spec.md.

Plus debate-quality bonus: if the result triggers a high-quality debate
(judged ≥ 0.7 quality), the executor earns 10 additional tokens.

All ledger entries go via POST /api/ledger/mint with reason
experiment_executed and reference_id pointing to experiment_results.id.
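
An illustrative calculation of the base mint for one result, matching the
table above. :execution_confidence and :is_first_execution are stand-ins
for values carried in the result manifest, not existing columns, and the
actual mint still goes through POST /api/ledger/mint:

SELECT
  CASE r.outcome_class
    WHEN 'confirmed'           THEN 50 * r.prediction_calibration_score
    WHEN 'disconfirmed'        THEN 70 * r.prediction_calibration_score
    WHEN 'partially_confirmed' THEN 30 * r.prediction_calibration_score
    WHEN 'inconclusive'        THEN 20 * :execution_confidence
    ELSE 0
  END
  * CASE WHEN :is_first_execution THEN 2 ELSE 1 END AS mint_amount
FROM experiment_results r
WHERE r.id = :result_id;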

Why this exercises the full system

Within each cycle, this quest touches:

  • Atlas: experiment artifact lookup, result artifact creation, KG
    update via evidence_links
  • Forge: tool invocation, sandbox execution, real data pipeline
  • Agora: surprise-triggered debate enrollment
  • Exchange: market settlement on linked predictions, token mint
  • Senate: contribution credit, QC debate on result artifact, reuse
    tracking, governance event log

All five layers receive real signal flow with real data attached. This
is the test bed for every economic and percolation mechanic — if a
mechanic is broken, this quest reveals it because real tokens are at stake.

Failure modes & safeguards

| Failure | Response |
|---------|----------|
| Sandbox timeout | Mark technical_failure; claim expires; agent gets minimal credit if root cause documented |
| LLM-generated code crashes | Iterative task retries up to max_iterations |
| Result file too large (>100MB) | Reject; require summarization first |
| Agent fabricates results | QC debate catches it; result artifact qc_status='failed'; tokens clawed back from agent |
| Multiple agents claim the same experiment | Unique index on experiment_claims(experiment_artifact_id) WHERE status IN ('claimed','running') prevents it |
| Experiment is actually wet-lab | Eligibility predicate filters it out; if one slips through, the agent rejects it in iteration 1 as "out of scope" |

Token clawback: if QC reveals fabrication or methodological fraud, the
result artifact is marked qc_status='failed' and a clawback ledger entry
burns the awarded tokens. Repeat offenses → actor demoted / suspended.

Surfaces

  • GET /participant/leaderboard — agents ranked by cumulative
    experiment_executed token earnings
  • GET /experiments/runnable — currently claimable experiments,
    sorted by IIG/$
  • GET /experiments/<id>/result — result artifact + outcome
  • GET /agent/<id>/runs — actor's execution history
  • Dashboard widget: "Experiments executed today / week / total"

Acceptance criteria

☐ Schema applied (experiment_claims, experiment_results)
☐ Executor agent + token account registered
☐ Claim driver running every-2h, picking 5/cycle
☐ Result percolator running every-1h
☐ First 10 experiments executed end-to-end (claim → execute →
  result → settlement → reward)
☐ Linked hypothesis composite_score visibly updates after each result
☐ Tokens minted correctly; ledger queryable
☐ Debate enrolled for surprising results (≥1 example documented)
☐ Result artifacts pass QC pipeline (≥80% pass on first review)
☐ No fabrication incidents (or all caught by QC within 24h)
☐ After 4 weeks: ≥30 executed experiments, ≥3 disconfirmed, debate
  logs available, market settlements traceable (see the query sketch below)
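
A quick check of the 4-week counts against the tables defined above (a
sketch; counts only, without the debate-log or settlement joins):

SELECT
  COUNT(*)                                                AS executed,
  COUNT(*) FILTER (WHERE outcome_class = 'disconfirmed')  AS disconfirmed
FROM experiment_results
WHERE recorded_at > NOW() - interval '4 weeks';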

Dependencies

  • quest_artifact_uuid_migration_spec.md (Phase 1 deployed)
  • quest_artifact_metadata_semantic_spec.md (semantic search to find
    prior similar experiments)
  • quest_artifact_reuse_provenance_qc_spec.md (QC pipeline for results)
  • quest_real_data_pipeline_spec.md (real datasets to operate on)
  • quest_analysis_sandboxing_spec.md (sandbox to run in)
  • quest_experiments_generation_spec.md (source of executable experiments)
  • quest_capital_markets_spec.md (token ledger for rewards)
  • quest_market_participants_spec.md (participant model)
  • Forge tools

Dependents

  • quest_paper_replication_starter_spec.md (sister quest, reuses the same
    execution + reward infrastructure)
  • Future: cloud-GPU executor, wet-lab executor (Ginkgo bridge)

Work Log

2026-04-28 — Spec authored

Designed claim → execute → percolate → reward loop. Single executor
agent registered as system participant. In-silico-only scope;
sandboxed execution via existing infrastructure. Token economics
heavily tied to quest_capital_markets_spec.md (50/70/30/20-token
base mint × calibration); fabrication detected by QC pipeline with
clawback. End-to-end test: 10 experiments to validate the whole
mechanic before scaling executors.

Open question: should the executor agent also score-vote on its own
result before submission? (Decline at v1 — independent QC is the gate.)

Payload JSON

{
  "requirements": {
    "reasoning": 9,
    "coding": 9,
    "safety": 8
  }
}
