[Forge] CI: Paper replication target selector open analysis:9 coding:9 reasoning:9 safety:8

Recurring per quest_paper_replication_starter_spec.md. Predicate: papers WHERE citation_count>=25 AND has_in_silico_method AND no replication_attempt yet. Batch 3/cycle. LLM rubric picks scoped target (Figure 2A heatmap, GSEA Section 4.2, etc.). Inserts replication_attempts row. Reuses execution loop from experiment-execution spec for claim/run/percolate/reward. Failed replications heavily rewarded (replication crisis amplification).
Spec File

Goal

For each high-value paper in the SciDEX corpus, reproduce one of its
core findings (a figure, a method, or a key statistical result), then extend it — different parameter, different cohort, additional control,
sensitivity analysis. The reproduction validates the paper; the
extension produces a new SciDEX-original artifact derived from a known
foundation.

This compounds artifact value in two ways:

  • Reproductions that match published results raise SciDEX's credibility
    and create vetted starting points for further work
  • Reproductions that fail to match surface replication-crisis signals
    that SciDEX is uniquely positioned to amplify

    > ## Continuous-process anchor
    >
    > Recurring driver: pick a paper from the prioritized backlog,
    > generate a replication task, queue it. Result handling reuses the
    > percolation pipeline from
    > quest_experiment_execution_participant_spec.md.
    >
    > Every principle in docs/design/retired_scripts_patterns.md applies.

    Why now

    • 29,425+ papers ingested but most flow through the system as text
    (PMID, abstract, figure thumbnails). Replication forces real engagement.
    • Replication-as-starting-point is a much higher-quality artifact source
    than de-novo "generate analysis" tasks because the ground truth is
    defined by the paper.
    • Failed replications are a known scientific value generator (see Open
    Science Collaboration 2015, RP:CB 2021, etc.). SciDEX's Senate +
    market layer is purpose-built to surface and weight these signals.
    • The user emphasized: "we can even replicate findings/methods/
    experiments/figures/etc. from other scientific papers as starting
    points for expanding analyses."

    Sister quest

    This quest reuses the execution loop (claim → run → submit → reward)
    from quest_experiment_execution_participant_spec.md. The differences
    are:

    | Aspect            | Experiment Execution                         | Paper Replication                                                      |
    |-------------------|----------------------------------------------|------------------------------------------------------------------------|
    | Input             | SciDEX-proposed experiment artifact          | Published paper (PMID + figure/finding)                                |
    | Predicted outcome | From experiment artifact's predicted_outcome | From paper's published result                                          |
    | Success criterion | Outcome matches prediction                   | Reproduction matches published result; extension proposes new finding  |
    | Failure mode      | Disconfirmed prediction (valuable!)          | Failed replication (also valuable: signals replication crisis)         |
    | Output            | Result artifact                              | Reproduction artifact + extension artifact                             |

    Scope: what papers qualify

    Eligibility predicate:

    SELECT p.* FROM papers p
    WHERE p.citation_count >= 25
      AND p.pmid NOT IN (SELECT paper_pmid FROM replication_attempts WHERE status IN ('claimed','running','completed'))
      AND p.metadata->'figures' IS NOT NULL
      AND jsonb_array_length(p.metadata->'figures') >= 1
      AND p.metadata->>'methods_summary' IS NOT NULL
      AND p.metadata->>'data_availability' IN ('public', 'supplementary')
      AND p.has_in_silico_method = TRUE     -- new computed column
    ORDER BY (p.citation_count * COALESCE(p.relevance_to_active_hypotheses, 0.5)) DESC
    LIMIT 20;

    has_in_silico_method is a new column populated by an LLM rubric that
    reads each paper's methods section and decides if any analysis step is
    in-silico-only (transcriptomics analysis, structural prediction,
    network analysis, etc.). Wet-lab-only methods → out of scope.
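
    A minimal sketch of that rubric call (the prompt wording and the
    llm_judge callable are illustrative assumptions, not part of this spec):

    # Sketch only: populating has_in_silico_method via an LLM rubric.
    RUBRIC = (
        "Read the methods summary below. Answer YES if any analysis step can "
        "be performed entirely in silico (e.g. transcriptomics analysis, "
        "structural prediction, network analysis); answer NO if the paper is "
        "wet-lab-only.\n\nMethods: {methods}\nAnswer (YES/NO):"
    )

    def has_in_silico_method(methods_summary: str, llm_judge) -> bool:
        answer = llm_judge(RUBRIC.format(methods=methods_summary))
        return answer.strip().upper().startswith("YES")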

    relevance_to_active_hypotheses is a similarity score between the
    paper's abstract embedding and the current hypothesis pool —
    prioritizes papers relevant to active SciDEX work.
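
    One way to compute that score, sketched under the assumption that
    abstracts and hypotheses are already embedded (in practice it may live in
    SQL, e.g. via pgvector, rather than Python):

    import numpy as np

    # Sketch: max cosine similarity between the paper's abstract embedding
    # and the active hypothesis pool; 0.5 matches the eligibility query's
    # COALESCE default when no hypotheses are active.
    def relevance_to_active_hypotheses(abstract_vec: np.ndarray,
                                       hypothesis_vecs: list[np.ndarray]) -> float:
        if not hypothesis_vecs:
            return 0.5
        sims = [float(np.dot(abstract_vec, h) /
                      (np.linalg.norm(abstract_vec) * np.linalg.norm(h)))
                for h in hypothesis_vecs]
        return max(0.0, max(sims))  # clamp: cosine similarity can be negative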

    End-to-end loop

    Phase A — Pick a target

    Recurring driver [Forge] CI: Replication target selector (every-6h, pri 91):

  • Query the eligibility predicate; pick the top 3
  • For each, an LLM rubric chooses a single, scoped replication target
    (not the whole paper):
    - "Reproduce Figure 2A's heatmap of microglial gene expression"
    - "Reproduce the GSEA enrichment in Section 4.2"
    - "Reproduce the survival curve in Supplementary Figure S5"
    The target must be:
    - A specific deliverable (named figure / table / metric)
    - Defined in the paper's methods + data availability
    - Estimated at under $5 and 30 min of runtime to attempt
  • Insert a replication_attempts row:

    CREATE TABLE replication_attempts (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      paper_pmid TEXT NOT NULL,
      target_description TEXT NOT NULL,    -- "Figure 2A heatmap"
      target_finding TEXT NOT NULL,        -- "Microglia in AD show 8-fold IFN-pathway upregulation"
      expected_method_summary TEXT,        -- from paper's methods
      expected_data_sources TEXT[],        -- from paper's data availability
      status TEXT CHECK (status IN ('proposed','claimed','running','completed','failed','expired')),
      claimed_by_actor_id TEXT,
      proposed_at TIMESTAMPTZ DEFAULT NOW(),
      claimed_at TIMESTAMPTZ,
      completed_at TIMESTAMPTZ,
      reproduction_artifact_id TEXT REFERENCES artifacts(id),
      extension_artifact_id TEXT REFERENCES artifacts(id),
      reproduction_match_score REAL,       -- 0-1, how close to published result
      extension_novelty_score REAL,        -- 0-1, how distinct extension is
      failure_reason TEXT
    );
    CREATE INDEX idx_replication_paper ON replication_attempts(paper_pmid);
    CREATE INDEX idx_replication_status ON replication_attempts(status);
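
    For concreteness, a hedged sketch of the selector's insert step, assuming
    a psycopg-style connection and a rubric output dict (field names are
    illustrative):

    # Sketch: Phase A driver inserting a proposed replication target.
    def propose_replication(conn, pmid, target):
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO replication_attempts
                    (paper_pmid, target_description, target_finding,
                     expected_method_summary, expected_data_sources, status)
                VALUES (%s, %s, %s, %s, %s, 'proposed')
                RETURNING id
                """,
                (pmid, target["description"], target["finding"],
                 target["method_summary"], target["data_sources"]),
            )
            return cur.fetchone()[0]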

    Phase B — Claim

    Reuses the claim mechanic from
    quest_experiment_execution_participant_spec.md. A new actor,
    agent-replication-executor-001, is registered with replication-tailored
    capabilities (pubmed_search, osf_data_download, the paper_figures skill,
    domain-specific tools).
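
    The claim itself can be a single atomic status transition; a sketch of
    that mechanic as assumed here (the authoritative version lives in the
    sister spec):

    # Sketch: compare-and-swap claim; returns False if another actor won.
    def claim_attempt(conn, attempt_id, actor_id):
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE replication_attempts
                SET status = 'claimed', claimed_by_actor_id = %s,
                    claimed_at = NOW()
                WHERE id = %s AND status = 'proposed'
                """,
                (actor_id, attempt_id),
            )
            return cur.rowcount == 1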

    Phase C — Reproduce

    An iterative task is created. The agent reads:

  • The paper (PMID + abstract + figures + methods section + supplementary)
  • The target description
  • Available datasets (Allen, GTEx, GEO, ArrayExpress as cataloged)
  • The Forge tool registry

    The agent then:

  • Fetches data per the paper's data availability
  • Implements the method as described
  • Runs the analysis in the sandbox
  • Compares output to the published figure / number
  • Writes a reproduction notebook + figure(s) + comparison report
  • Commits as an artifact:
    - parent_artifact_id = paper_artifact_id
    - Derivation type: reproduces
    - Includes a side-by-side comparison (original figure vs. reproduction)

    Reproduction match scoring (LLM judge):

    • 1.0: pixel-equivalent or numerical match within 5%
    • 0.7-0.9: same shape, magnitudes within 20%
    • 0.5-0.7: same direction, magnitudes off
    • 0.3-0.5: partial reproduction (some panels match, others don't)
    • < 0.3: failed reproduction (interesting signal!)
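
    For purely numeric targets, a deterministic pre-check can anchor the LLM
    judge's bands; a sketch with the same thresholds (all names are
    illustrative):

    # Sketch: map a published-vs-reproduced number onto the score bands.
    def numeric_match_score(published: float, reproduced: float) -> float:
        if published == 0:
            return 0.0  # degenerate case; defer to the LLM judge
        rel_err = abs(reproduced - published) / abs(published)
        if rel_err <= 0.05:
            return 1.0   # numerical match within 5%
        if rel_err <= 0.20:
            return 0.8   # magnitudes within 20%
        if (published > 0) == (reproduced > 0):
            return 0.6   # same direction, magnitudes off
        return 0.2       # failed reproduction (interesting signal!)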

    Phase D — Extend

    After reproduction, agent proposes 1-2 extensions:

    • Different cohort (e.g. the paper used SEA-AD; the extension applies to ABC-Atlas)
    • Different parameter (e.g. the paper used PCA; the extension uses UMAP)
    • Sensitivity analysis (e.g. vary the number of bootstrap permutations)
    • Additional control (e.g. paper compared AD vs control; extension adds
    sex stratification)
    • Negative control (does the method find spurious effects in null data?)

    Extension is committed as a separate artifact:
    • parent_artifact_id = reproduction_artifact_id
    • Derivation type: extends
    • Documents what's novel vs. paper

    Phase E — Percolate & reward

    Result handling reuses Phase C of
    quest_experiment_execution_participant_spec.md. Specifically:

    • Reproduction match → updates paper's replication_status field
    (existing on papers table per quest_experiment_extraction_spec.md)
    • Failed replication → high-priority debate enrollment (Skeptic-led),
    potential evidence_against link to any hypothesis the paper supported
    • Extension that produces a novel finding → potential new claim
    artifact + market position seedable
    • Tokens minted to executor:
    | Outcome                    | Reproduction mint           | Extension mint |
    |----------------------------|-----------------------------|----------------|
    | match_score ≥ 0.9          | 60                          | 40             |
    | 0.7-0.9                    | 50                          | 35             |
    | 0.5-0.7                    | 35                          | 30             |
    | 0.3-0.5                    | 25 (interesting!)           | 20             |
    | < 0.3 (failed replication) | 80 (highest! crisis signal) | n/a            |
    | technical_failure          | 0-5                         | 0              |

    Plus a first-mover multiplier (this paper not previously replicated): × 1.5

    Plus extension-novelty bonus: extension_novelty_score × 30 tokens.
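
    Putting the schedule together, a minimal sketch of the mint computation
    (resolving the 0-5 technical_failure band as "5 for a clean diagnosis" is
    an assumption, not spec text):

    # Sketch: token mint per the schedule above.
    def mint_tokens(match_score, technical_failure=False, first_mover=False,
                    extension_novelty_score=None):
        if technical_failure:
            repro, ext = 5, 0    # 0-5 band; assume max for a clean diagnosis
        elif match_score < 0.3:
            repro, ext = 80, 0   # failed replication: highest mint, no extension
        elif match_score < 0.5:
            repro, ext = 25, 20
        elif match_score < 0.7:
            repro, ext = 35, 30
        elif match_score < 0.9:
            repro, ext = 50, 35
        else:
            repro, ext = 60, 40
        if first_mover:
            repro = round(repro * 1.5)
        if extension_novelty_score is not None:
            ext += round(extension_novelty_score * 30)  # novelty bonus
        return repro, ext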

    Special case: replication crisis amplification

    When reproduction_match_score < 0.3 and the paper has > 100 citations
    or supports an active hypothesis:

  • Auto-enrolls in a Skeptic-led debate (Round 1: the agent presents the
    reproduction; Round 2: the Theorist defends the original; Round 3: the
    Methodologist arbitrates)
  • Senate event emitted: replication_failure_high_impact
  • The hypothesis's evidence_for weighting is reduced if the paper was
    supporting it
  • Featured on the /replication-crisis-tracker dashboard
  • Optional: open a market position on "Will an independent replication
    match the original?"; settles when 2+ replications converge
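
    A sketch of the trigger, with enroll_debate, emit_senate_event, and
    downweight_evidence standing in for pipeline calls defined elsewhere:

    # Sketch: replication-crisis amplification trigger.
    def maybe_amplify(attempt, paper, supported_hypotheses):
        if attempt.reproduction_match_score >= 0.3:
            return
        if paper.citation_count > 100 or supported_hypotheses:
            enroll_debate(attempt, lead="Skeptic")
            emit_senate_event("replication_failure_high_impact", attempt.id)
            for h in supported_hypotheses:
                downweight_evidence(h, paper.pmid)  # weaken evidence_for link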

    This is one of SciDEX's clearest value propositions: a system that incentivizes finding replication failures rather than burying them.

    Surfaces

    • GET /replications/runnable — currently proposed, claimable
    • GET /replications/<id> — full record (paper, target, reproduction,
    extension, comparison)
    • GET /papers/<pmid>/replications — all replication attempts on a paper
    • GET /replication-crisis-tracker — failed replications dashboard
    • POST /replications/<id>/dispute — reviewer challenges scoring
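
    A usage sketch for the claimable queue (the base URL and response shape
    are assumptions):

    import requests

    BASE = "https://scidex.example/api"  # hypothetical host

    # Sketch: poll the proposed, claimable replication targets.
    def runnable_replications():
        resp = requests.get(f"{BASE}/replications/runnable", timeout=30)
        resp.raise_for_status()
        return resp.json()  # assumed: list of proposed replication_attempts rows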

    Acceptance criteria

    ☐ Schema (replication_attempts) applied
    ☐ has_in_silico_method column populated for top 1000 cited papers
    ☐ Replication target selector running every-6h, proposing 3/cycle
    ☐ First 5 reproduction attempts complete end-to-end
    ☐ Each comes with a reproduction artifact + extension artifact
    ☐ Match scoring documented with side-by-side comparison
    ☐ At least 1 failed replication recorded and debated
    ☐ Tokens minted correctly; ledger trail clean
    ☐ Replication crisis dashboard live with at least 1 entry
    ☐ After 8 weeks: ≥20 reproduction attempts, ≥5 extensions, ≥1
    replication-crisis incident with full debate

    Dependencies

    • quest_artifact_uuid_migration_spec.md
    • quest_artifact_metadata_semantic_spec.md
    • quest_artifact_reuse_provenance_qc_spec.md
    • quest_experiment_execution_participant_spec.md (claim/result loop reused)
    • quest_real_data_pipeline_spec.md (data fetch from external sources)
    • quest_analysis_sandboxing_spec.md
    • quest_experiment_extraction_spec.md (papers table + figures metadata)
    • paper-figures skill (PMID → figure URLs)
    • paper-corpus-search, pubmed-search, paper-lookup skills

    Dependents

    • Future: cross-paper meta-analysis (when N replications of related
    findings accumulate)
    • Replication-crisis dashboard
    • Citation-quality scoring (papers with replicated findings rank higher)

    Work Log

    2026-04-28 — Spec authored

    Sister quest to quest_experiment_execution_participant_spec.md,
    reusing claim/result/reward mechanics. Reproduction-then-extend
    pattern: every replication produces 2 artifacts (the reproduction +
    the extension). Failed replications heavily rewarded (80 tokens,
    1.5x first-mover) and auto-enrolled in Skeptic debate. Replication
    crisis tracker as a public surface. In-silico-only at v1.

    Open question: how to handle papers whose data turns out to be non-public
    despite a "public" availability claim? Document the attempt with
    failure_reason='data_access', mint 5 tokens for the diagnosis, and move
    on. Tracking these failures is itself useful (which journals / labs
    actually honor their data-availability claims?).

    Payload JSON
    {
      "requirements": {
        "reasoning": 9,
        "analysis": 9,
        "coding": 9,
        "safety": 8
      }
    }
