[Forge] CI: Paper replication target selector open analysis:9 coding:9 reasoning:9 safety:8

Recurring per quest_paper_replication_starter_spec.md. Predicate: papers WHERE citation_count>=25 AND has_in_silico_method AND no replication_attempt yet. Batch 3/cycle. LLM rubric picks scoped target (Figure 2A heatmap, GSEA Section 4.2, etc.). Inserts replication_attempts row. Reuses execution loop from experiment-execution spec for claim/run/percolate/reward. Failed replications heavily rewarded (replication crisis amplification).
Spec File

Goal

For each high-value paper in the SciDEX corpus, reproduce one of its
core findings (a figure, a method, or a key statistical result), then extend it — different parameter, different cohort, additional control,
sensitivity analysis. The reproduction validates the paper; the
extension produces a new SciDEX-original artifact derived from a known
foundation.

This compounds artifact value in two ways:

  • Reproductions that match published results raise SciDEX's credibility
    and create vetted starting points for further work
  • Reproductions that fail to match surface replication-crisis signals
    that SciDEX is uniquely positioned to amplify

    > ## Continuous-process anchor
    >
    > Recurring driver: pick a paper from the prioritized backlog,
    > generate a replication task, queue it. Result handling reuses the
    > percolation pipeline from
    > quest_experiment_execution_participant_spec.md.
    >
    > Every principle in docs/design/retired_scripts_patterns.md applies.

    Why now

    • 29,425+ papers ingested but most flow through the system as text
    (PMID, abstract, figure thumbnails). Replication forces real engagement.
    • Replication-as-starting-point is a much higher-quality artifact source
    than de-novo "generate analysis" tasks because the ground truth is
    defined by the paper.
    • Failed replications are a known scientific value generator (see Open
    Science Collaboration 2015, RP:CB 2021, etc.). SciDEX's Senate +
    market layer is purpose-built to surface and weight these signals.
    • The user emphasized: "we can even replicate findings/methods/
    experiments/figures/etc. from other scientific papers as starting
    points for expanding analyses."

    Sister quest

    This quest reuses the execution loop (claim → run → submit → reward)
    from quest_experiment_execution_participant_spec.md. The differences
    are:

    | Aspect            | Experiment Execution                         | Paper Replication                                                      |
    |-------------------|----------------------------------------------|------------------------------------------------------------------------|
    | Input             | SciDEX-proposed experiment artifact          | Published paper (PMID + figure/finding)                                |
    | Predicted outcome | From experiment artifact's predicted_outcome | From paper's published result                                          |
    | Success criterion | Outcome matches prediction                   | Reproduction matches published result; extension proposes new finding  |
    | Failure mode      | Disconfirmed prediction (valuable!)          | Failed replication (also valuable: signals replication crisis)         |
    | Output            | Result artifact                              | Reproduction artifact + extension artifact                             |

    Scope: what papers qualify

    Eligibility predicate:

    SELECT p.* FROM papers p
    WHERE p.citation_count >= 25
      AND p.pmid NOT IN (SELECT paper_pmid FROM replication_attempts WHERE status IN ('claimed','running','completed'))
      AND p.metadata->'figures' IS NOT NULL
      AND jsonb_array_length(p.metadata->'figures') >= 1
      AND p.metadata->>'methods_summary' IS NOT NULL
      AND p.metadata->>'data_availability' IN ('public', 'supplementary')
      AND p.has_in_silico_method = TRUE     -- new computed column
    ORDER BY (p.citation_count * COALESCE(p.relevance_to_active_hypotheses, 0.5)) DESC
    LIMIT 20;

    has_in_silico_method is a new column populated by an LLM rubric that
    reads each paper's methods section and decides if any analysis step is
    in-silico-only (transcriptomics analysis, structural prediction,
    network analysis, etc.). Wet-lab-only methods → out of scope.
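
    A minimal sketch of that rubric call (the prompt wording and the
    llm_judge callable are illustrative assumptions, not part of this spec):

    # Sketch only: populating has_in_silico_method via an LLM rubric.
    RUBRIC = (
        "Read the methods summary below. Answer YES if any analysis step can "
        "be performed entirely in silico (e.g. transcriptomics analysis, "
        "structural prediction, network analysis); answer NO if the paper is "
        "wet-lab-only.\n\nMethods: {methods}\nAnswer (YES/NO):"
    )

    def has_in_silico_method(methods_summary: str, llm_judge) -> bool:
        answer = llm_judge(RUBRIC.format(methods=methods_summary))
        return answer.strip().upper().startswith("YES")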

    relevance_to_active_hypotheses is a similarity score between the
    paper's abstract embedding and the current hypothesis pool —
    prioritizes papers relevant to active SciDEX work.
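
    One way to compute that score, sketched under the assumption that
    abstracts and hypotheses are already embedded (in practice it may live in
    SQL, e.g. via pgvector, rather than Python):

    import numpy as np

    # Sketch: max cosine similarity between the paper's abstract embedding
    # and the active hypothesis pool; 0.5 matches the eligibility query's
    # COALESCE default when no hypotheses are active.
    def relevance_to_active_hypotheses(abstract_vec: np.ndarray,
                                       hypothesis_vecs: list[np.ndarray]) -> float:
        if not hypothesis_vecs:
            return 0.5
        sims = [float(np.dot(abstract_vec, h) /
                      (np.linalg.norm(abstract_vec) * np.linalg.norm(h)))
                for h in hypothesis_vecs]
        return max(0.0, max(sims))  # clamp: cosine similarity can be negative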

    End-to-end loop

    Phase A — Pick a target

    Recurring driver [Forge] CI: Replication target selector (every-6h, pri 91):

  • Query the eligibility predicate; pick the top 3
  • For each, an LLM rubric chooses a single, scoped replication target
    (not the whole paper):
    - "Reproduce Figure 2A's heatmap of microglial gene expression"
    - "Reproduce the GSEA enrichment in Section 4.2"
    - "Reproduce the survival curve in Supplementary Figure S5"
    The target must be:
    - A specific deliverable (named figure / table / metric)
    - Defined in the paper's methods + data availability
    - Estimated at under $5 and 30 min of runtime to attempt
  • Insert a replication_attempts row:

    CREATE TABLE replication_attempts (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      paper_pmid TEXT NOT NULL,
      target_description TEXT NOT NULL,    -- "Figure 2A heatmap"
      target_finding TEXT NOT NULL,        -- "Microglia in AD show 8-fold IFN-pathway upregulation"
      expected_method_summary TEXT,        -- from paper's methods
      expected_data_sources TEXT[],        -- from paper's data availability
      status TEXT CHECK (status IN ('proposed','claimed','running','completed','failed','expired')),
      claimed_by_actor_id TEXT,
      proposed_at TIMESTAMPTZ DEFAULT NOW(),
      claimed_at TIMESTAMPTZ,
      completed_at TIMESTAMPTZ,
      reproduction_artifact_id TEXT REFERENCES artifacts(id),
      extension_artifact_id TEXT REFERENCES artifacts(id),
      reproduction_match_score REAL,       -- 0-1, how close to published result
      extension_novelty_score REAL,        -- 0-1, how distinct extension is
      failure_reason TEXT
    );
    CREATE INDEX idx_replication_paper ON replication_attempts(paper_pmid);
    CREATE INDEX idx_replication_status ON replication_attempts(status);
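
    For concreteness, a hedged sketch of the selector's insert step, assuming
    a psycopg-style connection and a rubric output dict (field names are
    illustrative):

    # Sketch: Phase A driver inserting a proposed replication target.
    def propose_replication(conn, pmid, target):
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO replication_attempts
                    (paper_pmid, target_description, target_finding,
                     expected_method_summary, expected_data_sources, status)
                VALUES (%s, %s, %s, %s, %s, 'proposed')
                RETURNING id
                """,
                (pmid, target["description"], target["finding"],
                 target["method_summary"], target["data_sources"]),
            )
            return cur.fetchone()[0]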

    Phase B — Claim

    Reuses the claim mechanic from
    quest_experiment_execution_participant_spec.md. A new actor,
    agent-replication-executor-001, is registered with replication-tailored
    capabilities (pubmed_search, osf_data_download, the paper_figures skill,
    domain-specific tools).
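
    The claim itself can be a single atomic status transition; a sketch of
    that mechanic as assumed here (the authoritative version lives in the
    sister spec):

    # Sketch: compare-and-swap claim; returns False if another actor won.
    def claim_attempt(conn, attempt_id, actor_id):
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE replication_attempts
                SET status = 'claimed', claimed_by_actor_id = %s,
                    claimed_at = NOW()
                WHERE id = %s AND status = 'proposed'
                """,
                (actor_id, attempt_id),
            )
            return cur.rowcount == 1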

    Phase C — Reproduce

    An iterative task is created. The agent reads:

  • The paper (PMID + abstract + figures + methods section + supplementary)
  • The target description
  • Available datasets (Allen, GTEx, GEO, ArrayExpress as cataloged)
  • The Forge tool registry

    The agent then:

  • Fetches data per the paper's data availability
  • Implements the method as described
  • Runs the analysis in the sandbox
  • Compares output to the published figure / number
  • Writes a reproduction notebook + figure(s) + comparison report
  • Commits as an artifact:
    - parent_artifact_id = paper_artifact_id
    - Derivation type: reproduces
    - Includes a side-by-side comparison (original figure vs. reproduction)

    Reproduction match scoring (LLM judge):

    • 1.0: pixel-equivalent or numerical match within 5%
    • 0.7-0.9: same shape, magnitudes within 20%
    • 0.5-0.7: same direction, magnitudes off
    • 0.3-0.5: partial reproduction (some panels match, others don't)
    • < 0.3: failed reproduction (interesting signal!)
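
    For purely numeric targets, a deterministic pre-check can anchor the LLM
    judge's bands; a sketch with the same thresholds (all names are
    illustrative):

    # Sketch: map a published-vs-reproduced number onto the score bands.
    def numeric_match_score(published: float, reproduced: float) -> float:
        if published == 0:
            return 0.0  # degenerate case; defer to the LLM judge
        rel_err = abs(reproduced - published) / abs(published)
        if rel_err <= 0.05:
            return 1.0   # numerical match within 5%
        if rel_err <= 0.20:
            return 0.8   # magnitudes within 20%
        if (published > 0) == (reproduced > 0):
            return 0.6   # same direction, magnitudes off
        return 0.2       # failed reproduction (interesting signal!)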

    Phase D — Extend

    After reproduction, agent proposes 1-2 extensions:

    • Different cohort (e.g. the paper used SEA-AD; the extension applies to ABC-Atlas)
    • Different parameter (e.g. the paper used PCA; the extension uses UMAP)
    • Sensitivity analysis (e.g. vary the number of bootstrap permutations)
    • Additional control (e.g. paper compared AD vs control; extension adds
    sex stratification)
    • Negative control (does the method find spurious effects in null data?)

    Extension is committed as a separate artifact:
    • parent_artifact_id = reproduction_artifact_id
    • Derivation type: extends
    • Documents what's novel vs. paper

    Phase E — Percolate & reward

    Result handling reuses Phase C of
    quest_experiment_execution_participant_spec.md. Specifically:

    • Reproduction match → updates paper's replication_status field
    (existing on papers table per quest_experiment_extraction_spec.md)
    • Failed replication → high-priority debate enrollment (Skeptic-led),
    potential evidence_against link to any hypothesis the paper supported
    • Extension that produces a novel finding → potential new claim
    artifact + market position seedable
    • Tokens minted to executor:
    | Outcome                    | Reproduction mint           | Extension mint |
    |----------------------------|-----------------------------|----------------|
    | match_score ≥ 0.9          | 60                          | 40             |
    | 0.7-0.9                    | 50                          | 35             |
    | 0.5-0.7                    | 35                          | 30             |
    | 0.3-0.5                    | 25 (interesting!)           | 20             |
    | < 0.3 (failed replication) | 80 (highest! crisis signal) | n/a            |
    | technical_failure          | 0-5                         | 0              |

    Plus a first-mover multiplier (this paper not previously replicated): × 1.5

    Plus extension-novelty bonus: extension_novelty_score × 30 tokens.
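
    Putting the schedule together, a minimal sketch of the mint computation
    (resolving the 0-5 technical_failure band as "5 for a clean diagnosis" is
    an assumption, not spec text):

    # Sketch: token mint per the schedule above.
    def mint_tokens(match_score, technical_failure=False, first_mover=False,
                    extension_novelty_score=None):
        if technical_failure:
            repro, ext = 5, 0    # 0-5 band; assume max for a clean diagnosis
        elif match_score < 0.3:
            repro, ext = 80, 0   # failed replication: highest mint, no extension
        elif match_score < 0.5:
            repro, ext = 25, 20
        elif match_score < 0.7:
            repro, ext = 35, 30
        elif match_score < 0.9:
            repro, ext = 50, 35
        else:
            repro, ext = 60, 40
        if first_mover:
            repro = round(repro * 1.5)
        if extension_novelty_score is not None:
            ext += round(extension_novelty_score * 30)  # novelty bonus
        return repro, ext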

    Special case: replication crisis amplification

    When reproduction_match_score < 0.3 and the paper has > 100 citations
    or supports an active hypothesis:

  • Auto-enrolls in a Skeptic-led debate (Round 1: the agent presents the
    reproduction; Round 2: the Theorist defends the original; Round 3: the
    Methodologist arbitrates)
  • Senate event emitted: replication_failure_high_impact
  • The hypothesis's evidence_for weighting is reduced if the paper was
    supporting it
  • Featured on the /replication-crisis-tracker dashboard
  • Optional: open a market position on "Will an independent replication
    match the original?"; settles when 2+ replications converge
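
    A sketch of the trigger, with enroll_debate, emit_senate_event, and
    downweight_evidence standing in for pipeline calls defined elsewhere:

    # Sketch: replication-crisis amplification trigger.
    def maybe_amplify(attempt, paper, supported_hypotheses):
        if attempt.reproduction_match_score >= 0.3:
            return
        if paper.citation_count > 100 or supported_hypotheses:
            enroll_debate(attempt, lead="Skeptic")
            emit_senate_event("replication_failure_high_impact", attempt.id)
            for h in supported_hypotheses:
                downweight_evidence(h, paper.pmid)  # weaken evidence_for link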

    This is one of SciDEX's clearest value propositions: a system that incentivizes finding replication failures rather than burying them.

    Surfaces

    • GET /replications/runnable — currently proposed, claimable
    • GET /replications/<id> — full record (paper, target, reproduction,
    extension, comparison)
    • GET /papers/<pmid>/replications — all replication attempts on a paper
    • GET /replication-crisis-tracker — failed replications dashboard
    • POST /replications/<id>/dispute — reviewer challenges scoring
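
    A usage sketch for the claimable queue (the base URL and response shape
    are assumptions):

    import requests

    BASE = "https://scidex.example/api"  # hypothetical host

    # Sketch: poll the proposed, claimable replication targets.
    def runnable_replications():
        resp = requests.get(f"{BASE}/replications/runnable", timeout=30)
        resp.raise_for_status()
        return resp.json()  # assumed: list of proposed replication_attempts rows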

    Acceptance criteria

    ☐ Schema (replication_attempts) applied
    ☐ has_in_silico_method column populated for top 1000 cited papers
    ☐ Replication target selector running every-6h, proposing 3/cycle
    ☐ First 5 reproduction attempts complete end-to-end
    ☐ Each comes with a reproduction artifact + extension artifact
    ☐ Match scoring documented with side-by-side comparison
    ☐ At least 1 failed replication recorded and debated
    ☐ Tokens minted correctly; ledger trail clean
    ☐ Replication crisis dashboard live with at least 1 entry
    ☐ After 8 weeks: ≥20 reproduction attempts, ≥5 extensions, ≥1
    replication-crisis incident with full debate

    Dependencies

    • quest_artifact_uuid_migration_spec.md
    • quest_artifact_metadata_semantic_spec.md
    • quest_artifact_reuse_provenance_qc_spec.md
    • quest_experiment_execution_participant_spec.md (claim/result loop reused)
    • quest_real_data_pipeline_spec.md (data fetch from external sources)
    • quest_analysis_sandboxing_spec.md
    • quest_experiment_extraction_spec.md (papers table + figures metadata)
    • paper-figures skill (PMID → figure URLs)
    • paper-corpus-search, pubmed-search, paper-lookup skills

    Dependents

    • Future: cross-paper meta-analysis (when N replications of related
    findings accumulate)
    • Replication-crisis dashboard
    • Citation-quality scoring (papers with replicated findings rank higher)

    Work Log

    2026-04-28 — Spec authored

    Sister quest to quest_experiment_execution_participant_spec.md,
    reusing claim/result/reward mechanics. Reproduction-then-extend
    pattern: every replication produces 2 artifacts (the reproduction +
    the extension). Failed replications heavily rewarded (80 tokens,
    1.5x first-mover) and auto-enrolled in Skeptic debate. Replication
    crisis tracker as a public surface. In-silico-only at v1.

    Open question: how to handle papers whose data turns out to be non-public
    despite a "public" availability claim? Document the attempt with
    failure_reason='data_access', mint 5 tokens for the diagnosis, and move
    on. Tracking these failures is itself useful (which journals / labs
    actually honor their data-availability claims?).

    Payload JSON
    {
      "requirements": {
        "reasoning": 9,
        "analysis": 9,
        "coding": 9,
        "safety": 8
      }
    }
