Quest: Experiments (generation)

> Goal. Design experiments with maximal expected information gain per dollar on the inventions/hypotheses coming out of the other downstream quests. An experiment here is a protocol with expected results, not an execution. Execution happens externally (in-silico via the existing experiment extraction pipeline, in-vitro via partners); this quest produces the protocol and the prediction, validates feasibility, and records results when they return.
>
> Distinct from the existing quest_experiment_extraction_spec.md, which extracts structured records from papers. This quest generates new experiment proposals.

Parent: [scidex_economy_design_spec.md](scidex_economy_design_spec.md).
Extraction counterpart: [quest_experiment_extraction_spec.md](quest_experiment_extraction_spec.md).
Experiment-results loop: [5f27f904_d33_experiment_results_spec.md](5f27f904_d33_experiment_results_spec.md).

---

Inputs

  • Admitted inventions + high-priority hypotheses that carry a utility_plan pointer.
  • Gap rows from quest_gaps with tractability and data_readiness scores.
  • Landscape cell data so the quest knows which experiments are already in the literature.
  • Cost model: token budget + (optional) externalized cost estimate for in-vitro work.
  • The prior probability of each hypothesis the experiment would test (from Atlas).

Outputs

  • Experiment artifacts with artifact_class = "experiment", carrying:
    - protocol (structured: method, sample size, controls, endpoints, duration, cost estimate)
    - predicted_outcome (probabilistic: P(confirm), P(falsify), P(inconclusive))
    - information_gain_bits (expected reduction in hypothesis uncertainty, per the prior + predicted outcome; computed as sketched after this list)
    - cost_estimate_usd
    - iig_per_dollar = information_gain_bits / cost_estimate_usd
    - target_invention_id / target_hypothesis_id (the target it is designed to probe)
    - feasibility_score (0-1) from the Senate adversarial pre-check
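
A minimal sketch of the information_gain_bits computation, assuming the proposer supplies outcome likelihoods under each world (the predicted_outcome marginals alone do not pin down the posterior); all names are illustrative, not a confirmed API:

```python
# Hypothetical sketch: information_gain_bits as the expected entropy reduction
# on the target hypothesis -- the mutual information, in bits, between the
# experiment outcome and the truth of the hypothesis.
from math import log2

OUTCOMES = ("confirm", "falsify", "inconclusive")

def entropy_bits(p: float) -> float:
    """Binary entropy of a hypothesis with prior P(true) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p))

def information_gain_bits(prior: float,
                          lik_true: dict[str, float],   # P(outcome | hypothesis true)
                          lik_false: dict[str, float],  # P(outcome | hypothesis false)
                          ) -> float:
    """H(prior) - E_outcome[H(posterior)]."""
    expected_posterior_entropy = 0.0
    for o in OUTCOMES:
        p_o = prior * lik_true[o] + (1.0 - prior) * lik_false[o]  # predicted_outcome marginal
        if p_o == 0.0:
            continue
        posterior = prior * lik_true[o] / p_o  # Bayes update on seeing outcome o
        expected_posterior_entropy += p_o * entropy_bits(posterior)
    return entropy_bits(prior) - expected_posterior_entropy
```

iig_per_dollar is then this value divided by cost_estimate_usd.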

When the experiment executes (externally or in-silico) and results come back, 5f27f904_d33_experiment_results_spec.md closes the loop — comparing predicted to actual, updating Bayesian scores on the target invention/hypothesis.

---

Task shape

task_type = multi_iter, same framework as quest_inventions, but with the following overrides (see the config sketch after this list):

  • artifact_class = "experiment"
  • required_roles = ["proposer", "methodologist", "statistician", "critic"]
  • max_iterations = 3
  • debate_rounds = 3 (one less than inventions — experiment protocols are more constrained so converge faster)
  • target_cell = (invention_id OR hypothesis_id) — not a gap cell; experiments are scoped to what they probe
  • acceptance_criteria:
    - iig_per_dollar ≥ class_floor
    - feasibility_score ≥ 0.5
    - market_bid ≥ median_for_class
    - no_redundant_prior_art (not already covered by a paper the landscape analysis flagged)
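
A minimal sketch of that configuration, assuming the multi_iter framework accepts a dict the way quest_inventions does (the exact schema is an assumption; field names follow this spec):

```python
# Hypothetical task config; the framework's real schema may differ.
EXPERIMENT_TASK = {
    "task_type": "multi_iter",
    "artifact_class": "experiment",
    "required_roles": ["proposer", "methodologist", "statistician", "critic"],
    "max_iterations": 3,
    "debate_rounds": 3,  # one less than inventions: protocols are more constrained
    "target_cell": "invention_id | hypothesis_id",  # never a gap cell
    "acceptance_criteria": {
        "iig_per_dollar_floor": "class_floor",   # rolling floor, see §3
        "feasibility_score_min": 0.5,
        "market_bid_min": "median_for_class",
        "no_redundant_prior_art": True,
    },
}
```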

1. Seeding

Priority formula:

P(experiment_slot) = hypothesis_prior_variance(h) × invention_value(i) × landscape_novelty(cell)
                   × 1/(existing_experiments_for_target + 1)

The hypothesis_prior_variance term is key: experiments on hypotheses where the field is evenly split (prior near 0.5, so Bernoulli variance near its 0.25 maximum) are maximally informative. Hypotheses where Atlas already has strong consensus (low variance) produce low-information experiments and should be de-prioritized.
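
The priority formula as code, a sketch assuming a Bernoulli prior per hypothesis (function and argument names are illustrative):

```python
# Hypothetical sketch of the seeding priority. Inputs come from Atlas (the
# hypothesis prior), quest_inventions (invention value), and the landscape index.
def hypothesis_prior_variance(p_true: float) -> float:
    """Bernoulli variance p(1 - p); peaks at 0.25 when the field is evenly split."""
    return p_true * (1.0 - p_true)

def experiment_slot_priority(p_true: float,
                             invention_value: float,
                             landscape_novelty: float,
                             existing_experiments_for_target: int) -> float:
    return (hypothesis_prior_variance(p_true)
            * invention_value
            * landscape_novelty
            / (existing_experiments_for_target + 1))
```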

2. Generation — three-round debate

Round 1 — Proposal. Proposer agents each draft a protocol. They must populate: method, sample size (with power calc reasoning), control arms, primary endpoint, secondary endpoints, expected duration, rough cost. Multiple proposals per agent encouraged.

Round 2 — Methodology + statistics review. Methodologist and Statistician agents review every proposal. Methodologist flags confounders, missing controls, wrong model systems. Statistician flags underpowered designs, wrong tests, multiple-comparison issues. Each protocol gets a methodology_score and statistics_score (both 0-1). Protocols below 0.5 on either are eliminated.

Round 3 — Critique + synthesis. Critic agent takes surviving protocols and synthesizes one final protocol that incorporates the best elements and addresses all methodology/stats flags. The Critic also produces the predicted_outcome distribution (explicitly, as probabilities summing to 1 over {confirm, falsify, inconclusive}).

The Senate adversarial quest then runs a standardized "would this experiment actually answer the question?" challenge; the outcome is the feasibility_score.
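
A sketch of the Round 2 gate and a sanity check on the Round 3 synthesis, under the scoring conventions above (names hypothetical):

```python
# Hypothetical: Round 2 elimination plus validation of the Critic's output.
def survives_round_2(protocol: dict) -> bool:
    """Protocols below 0.5 on either review score are eliminated."""
    return (protocol["methodology_score"] >= 0.5
            and protocol["statistics_score"] >= 0.5)

def validate_predicted_outcome(predicted: dict[str, float], tol: float = 1e-6) -> None:
    """The Critic must emit an explicit distribution over the three outcomes."""
    assert set(predicted) == {"confirm", "falsify", "inconclusive"}
    assert all(0.0 <= p <= 1.0 for p in predicted.values())
    assert abs(sum(predicted.values()) - 1.0) <= tol
```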

3. Admission criteria

ALL of:

  • iig_per_dollar ≥ class_floor (the quest maintains a rolling floor across admitted experiments)
  • feasibility_score ≥ 0.5
  • market_bid ≥ median — market participants specifically score the predicted_outcome calibration (a key reason to bid against an experiment is that you think the prediction is mis-calibrated)
  • no_redundant_prior_art — checked against the landscape cell's paper index

Below-floor experiments are NOT retried directly; their critique is fed to the target invention/hypothesis as a signal that the target might need refinement before it's worth experimenting on.
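
Admission as a single predicate, with one plausible reading of the rolling class floor (a trailing quantile of admitted iig_per_dollar; the actual floor policy is an assumption):

```python
# Hypothetical admission gate. class_floor here is the first quartile of
# previously admitted iig_per_dollar values -- one possible "rolling floor".
from statistics import quantiles

def rolling_class_floor(admitted_iig_per_dollar: list[float]) -> float:
    if len(admitted_iig_per_dollar) < 4:
        return 0.0  # bootstrap: no floor until there is some history
    return quantiles(admitted_iig_per_dollar, n=4)[0]  # first quartile

def admit(exp: dict, class_floor: float, median_bid_for_class: float) -> bool:
    return (exp["iig_per_dollar"] >= class_floor
            and exp["feasibility_score"] >= 0.5
            and exp["market_bid"] >= median_bid_for_class
            and not exp["redundant_prior_art"])
```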

4. Results loop

When the experiment runs (externally or in-silico), results arrive via the experiment_results path in 5f27f904_d33_experiment_results_spec.md. This quest:

  • Reads the actual outcome.
  • Compares to predicted_outcome and records calibration (was P(confirm) well-calibrated? one possible scoring is sketched after this list).
  • Updates target_invention_id / target_hypothesis_id Bayesian scores.
  • Emits an experiment_completed event that the market participants read to settle their bids.
  • If the calibration was bad, the experiment's Critic agent gets a negative reputation signal — this is how we tune experiment-design quality over time.
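
A sketch of the settlement step, reusing the Bernoulli update from the information_gain_bits sketch above (names hypothetical; surprisal is one possible calibration score):

```python
# Hypothetical results-loop step: score the prediction, update the target's prior.
from math import log2

def calibration_surprisal_bits(predicted: dict[str, float], actual: str) -> float:
    """Log loss of predicted_outcome on the actual outcome; lower is better.
    Persistently high surprisal feeds the Critic's negative reputation signal."""
    return -log2(max(predicted[actual], 1e-9))

def bayes_update(prior: float,
                 lik_true: dict[str, float],
                 lik_false: dict[str, float],
                 actual: str) -> float:
    """Posterior P(hypothesis true | actual outcome) for the target's Bayesian score."""
    p_actual = prior * lik_true[actual] + (1.0 - prior) * lik_false[actual]
    return prior * lik_true[actual] / p_actual
```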

5. Relationship to invention utility demos

Every admitted invention from quest_inventions carries a utility_plan, a pointer to a proposed experiment. That pointer becomes a task in this quest. Invention → experiment proposal → experiment admission → execution → results → invention composite-value update with real utility data.

The utility signal in the parent spec §2 is derived directly from the results of experiments spawned via this mechanism.

6. Capacity

  • Default: 4 concurrent multi-iter experiment tasks.
  • Experiment tasks are lighter than inventions (3 debate rounds vs 4); budget ~3-4 agent-hours per admission.

7. Interactions

  • quest_inventions: the principal source of experiment proposals via utility_plan.
  • quest_hypotheses (implicit; the hypothesis class is not a dedicated quest yet): candidate hypotheses surface via Agora debates; the top-N by prior variance enter this quest's queue.
  • quest_gaps: experiments targeting gaps with low data_readiness scores get deprioritized (hard to run).
  • quest_experiment_extraction: complementary; it fills the world model from existing literature, while this quest proposes new experiments to add to the world model.
  • Senate Adversarial Science: the feasibility pre-check after Round 3 above.
  • Exchange + market participants: bid on predicted outcomes; settle on actuals.

8. Metrics

  • Mean iig_per_dollar of admitted experiments (rising over time = design quality improving).
  • Calibration: binned predicted_outcome vs actual_outcome calibration curve (binning sketched below).
  • Experiments per gap closed (efficiency).
  • % of admitted inventions that got a utility experiment within 4 weeks (latency; inventions whose utility never gets tested drag on the floor).
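
A minimal binning sketch for the calibration curve (bin the predicted P(confirm), compare to the empirical confirm rate; ten bins is an assumption):

```python
# Hypothetical calibration-curve binning over completed experiments.
def calibration_curve(records: list[tuple[float, bool]], n_bins: int = 10):
    """records: (predicted P(confirm), actually confirmed) pairs.
    Returns (mean predicted, empirical confirm rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, confirmed in records:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, confirmed))
    return [(sum(p for p, _ in b) / len(b),
             sum(c for _, c in b) / len(b),
             len(b))
            for b in bins if b]
```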

9. Open questions

  • How do we handle in-vitro experiments that can't run internally? (Proposed: generate the protocol + predicted outcome + cost estimate; publish as a "proposed experiment" artifact; if external collaborators run it, results flow back through experiment_results. Still valuable as a proposal artifact even if unrun.)
  • What's the relationship between this quest and the existing quest_experiment_extraction_spec.md? (Extraction fills Atlas from papers; generation proposes new work. They feed each other but don't block each other.)
  • How do we avoid producing lots of cheap low-IIG experiments? (Floor on iig_per_dollar, not on information_gain_bits alone; the dollar denominator discourages padding.)
