Quest: Epistemic Rigor
Vision
Every claim in SciDEX should be falsifiable, traceable, versioned, and trust-scored.
Today, hypotheses are scored by composite metrics but lack explicit testable predictions.
Evidence is stored as JSON blobs without provenance. Knowledge graph edges have no trust scores.
There's no dependency structure between hypotheses, experiments, and evidence. Score changes
happen without structured justification.
This quest transforms SciDEX from "we scored this hypothesis 0.73" to "this hypothesis
predicts X, experiment Y tested it, the result was Z, which updated our confidence because
of evidence chain A->B->C, each link traceable to ground truth with trust score T."
Current State (What Exists)
| Component | Status | Gap |
|---|---|---|
| 10-dimension hypothesis scoring | Working | No explicit predictions or falsification criteria |
| Evidence for/against (JSON) | Working | Unstructured, no provenance, no methodology |
| Evidence validation (PMID relevance) | Working | Scores relevance, not trustworthiness |
| Belief snapshots (time series) | Working | Tracks score evolution but not WHY scores changed |
| Debate quality scoring | Working | Includes falsifiability dimension, but not structured |
| Persona believability | Working | Per-dimension credibility, but no update from outcomes |
| Experiments table | Working | No results storage, no prediction-vs-reality comparison |
| Knowledge graph edges | Working | evidence_strength field but no trust model |
| Price history with events | Working | event_source is free text, not structured provenance |
| Quality gates | Working | Code quality, not epistemic quality |
Architecture: 8 Tasks in Dependency Order
Task 1: Predictions Table ──┬──> Task 2: Experiment Results
│
Task 3: Evidence Chains ────┼──> Task 4: Trust Scores on KG
│
├──> Task 5: Dependency Graph
│
Task 6: Versioning/Audit ───┘
Task 7: Knowledge Units (depends on 3, 4, 6)
Task 8: Epistemic Dashboard (depends on all above)
Key Design Principles
Ground truth anchoring: Every claim must trace to a paper, experiment, or dataset
Bayesian updating: New evidence should update confidence via structured reasoning, not just recompute
Falsifiability first: Hypotheses without predictions are speculation, not science
Trust propagation: Downstream conclusions inherit (diminished) trust from upstream evidence
Audit completeness: Every score change has a structured justification
Composability: Evidence blocks are atomic, addressable, and combinable
Incremental delivery: Each task is independently valuable and deployable
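The Bayesian-updating and trust-propagation principles can be made concrete with a small sketch. This is an illustrative extrapolation, not an existing SciDEX function: a log-odds confidence update where each evidence item's likelihood ratio is tempered by the trust of its source chain, and chain trust decays with each hop (the `decay` constant is a hypothetical parameter):

```python
import math

def update_confidence(prior: float, likelihood_ratio: float, trust: float) -> float:
    """Bayesian log-odds update, tempered by source trust in [0, 1].

    trust = 1.0 applies the full likelihood ratio; trust = 0.0 leaves
    the prior unchanged (fully untrusted evidence carries no weight).
    """
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += trust * math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

def chain_trust(link_trusts: list[float], decay: float = 0.9) -> float:
    """Downstream trust as the product of upstream link trusts, diminished
    by a per-hop decay, so a conclusion at the end of chain A -> B -> C
    can never be more trusted than its weakest upstream link."""
    trust = 1.0
    for link in link_trusts:
        trust *= link * decay
    return trust
```

For example, evidence with likelihood ratio 3.0 arriving over a two-link chain would update a 0.5 prior via `update_confidence(0.5, 3.0, chain_trust([0.9, 0.8]))`, landing well below the fully-trusted posterior of 0.75.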
Experiment-boosted ranking: Hypotheses with explicit falsifiable predictions AND high-quality associated experiments (feasible, impactful) should receive a composite score boost. The system should reward hypotheses that are not just well-scored but actively testable with concrete, feasible experiments.
Hypothesis-Experiment Scoring Feedback Loop
The composite score formula should incorporate a testability bonus that rewards hypotheses linked to high-quality experiments:
Scoring Considerations
Falsifiability bonus: Hypotheses with explicit, testable predictions (via the hypothesis_predictions table) receive a score bonus. A hypothesis that merely claims "X causes Y" ranks lower than one that predicts "If X, then Y should be measurable as Z with effect size > threshold."
Experiment quality signal: When a hypothesis has associated experiments (via experiments.hypothesis_ids), the experiment's own quality scores feed back into the hypothesis ranking:
- Feasibility: A hypothesis testable by a practical, affordable experiment is more valuable than one requiring impossible resources
- Impact: A hypothesis whose experiment would significantly update the world model (high information gain) ranks higher
- Experiment composite = feasibility × 0.4 + impact × 0.4 + novelty × 0.2
Combined boost formula:
    testability_bonus = 0.0
    if has_falsifiable_predictions:
        testability_bonus += 0.05
    if has_associated_experiments:
        avg_experiment_quality = mean(exp.composite for exp in linked_experiments)
        testability_bonus += 0.10 * avg_experiment_quality
    adjusted_composite = base_composite + testability_bonus
Virtuous cycle: This creates a feedback loop where:
- Hypotheses with predictions attract experiment design
- High-quality experiments boost hypothesis ranking
- Higher-ranked hypotheses get more attention and resources
- More attention produces better predictions and experiments
Implementation Notes
- The testability bonus should be computed in post_process.py alongside the existing composite score
- Requires Task 1 (predictions table) and Task 2 (experiment results) to be complete
- The bonus is additive, not multiplicative, to avoid runaway scores
- Experiments without results still provide a feasibility/impact signal
- Cap the total bonus at 0.15 to prevent gaming
Key Files (Existing)
exchange.py — 10-dim scoring, believability weighting, allocation
senate_proposals.py — Evidence strength scoring (papers + citations + recency)
evidence_validator.py — PMID relevance scoring via Claude Haiku
belief_tracker.py — Temporal belief snapshots, convergence metrics
backfill_debate_quality.py — 4-dim debate quality (includes falsifiability)
quality_gates.py — Pre-merge, post-completion, prevention gates
market_dynamics.py — LMSR price model with event logging
Key Tables (Existing)
hypotheses — evidence_for/against (JSON), evidence_validation_score
hypothesis_papers — Junction: hypothesis <-> PMID with direction/claim/strength
papers — PubMed metadata (pmid, title, abstract, citations, year)
knowledge_edges — source/target with evidence_strength (REAL)
experiments — hypothesis_ids (JSON), protocol, expected_outcomes, status
debate_sessions — transcript_json, quality_score
debate_rounds — evidence_cited (JSON), hypotheses_referenced
belief_snapshots — Time series of hypothesis score + evidence count
price_history — Score changes with event_type and event_source
persona_believability — Per-persona, per-dimension credibility
edit_history — Generic audit log (actor, content_type, diff, reason)
Workstreams
WS-rigor-ruleset — Absorb Alpha1 Science's 8-dimension biomedical rigor rubric
Absorb Alpha1 Science's 8-dimension rigor rubric (scientific premise,
study design, blinding, power analysis, resource identification,
statistical reporting, data availability + 1 TBD) grounded in
NIH / MDAR / ARRIVE 2.0 / CONSORT / EQUATOR biomedical reporting
guidelines. Every hypothesis and analysis gets a rigor score card.
Every score carries an evidence citation pointing to the exact
text it was derived from. Score card becomes a new artifact type;
optionally publishable to the SciDEX community view. See
[docs/bio_competitive/alpha1_science_profile.md](../../bio_competitive/alpha1_science_profile.md).
Deliverables:
- Rubric dictionary: JSON mapping NIH / MDAR / ARRIVE 2.0 / CONSORT /
EQUATOR items to the 8 dimensions, with specific guideline-item
pointers per dimension.
- Two-agent independent evaluation pipeline. Reuse the Skeptic
persona; the second agent is a separately-seeded Skeptic instance
(different prompt seed or different provider) to preserve
independence — matches Alpha1's 2-independent-agent design.
- Rigor score card as a first-class artifact type in the artifacts table, with lineage to the hypothesis or analysis it scores.
- Evidence-citation schema: every score row carries a source_quote + source_location (PMID / page / paragraph) so a reviewer can verify the score without re-reading the whole paper.
- Community-publish surface: optional Atlas wiki page per score card
for the ones authors opt to publish.
- Recurring Senate-layer task to produce score cards for the backlog; see [docs/planning/specs/task-id-pending_rigor_score_card_spec.md](task-id-pending_rigor_score_card_spec.md).
Dependency on existing tasks:
- Task 3 (Evidence Chains) provides the structured-evidence table
the rigor score card cites.
- Task 4 (Trust Scores on KG) consumes the rigor score to weight KG
edges drawn from papers with a published rigor score.
- Task 7 (Knowledge Units) can include a rigor-score summary as one
of its atomic evidence blocks.
Success metric: every hypothesis and every ≥50KB analysis in the
trailing 30 days has a rigor score card with ≥95% of scores carrying
an evidence citation; inter-rater agreement between the two
independent agents ≥0.7 (Cohen's κ or equivalent).
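The ≥0.7 agreement target between the two Skeptic instances can be checked with a standard Cohen's κ computation over their per-item scores. A minimal sketch, assuming categorical labels (e.g. binary pass/fail per rubric dimension); this is the textbook formula, not an existing SciDEX utility:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from marginal label frequencies."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in labels)
    if p_e == 1.0:  # both raters constant and identical
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

κ corrects raw agreement for chance: two raters who each mark "pass" half the time at random would agree 50% of the time, yet score κ = 0.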
Related Quests
| Quest | Relationship |
|---|---|
| Experiment Extraction (q-experiment-extraction) | Structured experiments are the ground-truth anchors for evidence chains |
| Artifact Debates (q-artifact-debates) | Any artifact can be debated, accumulating evidence about its quality |
| Schema Governance (q-schema-governance) | Evidence schemas evolve through governance to maintain integrity |
| Artifacts (8db4834c-51e) | All evidence is stored as artifacts with lineage and provenance |
| Competitive Biotools (q-competitive-biotools) | Tracks Alpha1 Science + PRISM; the WS-rigor-ruleset workstream absorbs Alpha1's 8-dim rubric into this quest |
How These Quests Interlock
The epistemic rigor vision depends on three supporting quests:
Experiment Extraction provides the ground truth — structured experimental results
with p-values, effect sizes, and methodology that anchor evidence chains to reality.
Artifact Debates provides the self-correction mechanism — when evidence is
contested, structured debates accumulate arguments for and against, and quality scores
update based on debate outcomes.
Schema Governance provides the integrity guarantee — as we learn what evidence
structure is useful, schemas evolve through agent governance rather than ad-hoc changes,
ensuring data remains queryable and trustworthy.
Together: experiments ground claims in data, debates correct errors, governance maintains integrity.
Success Metrics
- Falsifiability coverage: >80% of hypotheses have explicit testable predictions
- Provenance coverage: >90% of evidence claims traced to source (paper/experiment/debate)
- Trust score coverage: 100% of KG edges have computed trust scores
- Audit completeness: 100% of score changes have structured justifications
- Dependency graph: All hypothesis-experiment links are bidirectional and queryable
- Experiment grounding: >500 structured experiments extracted, each traceable to source paper
- Debate breadth: >5 artifact types have been debated (not just hypotheses)
- Schema integrity: All artifact types have governed schemas with validation
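The coverage metrics above lend themselves to simple SQL checks. A sketch of the falsifiability-coverage metric, assuming a SQLite backend and the Task 1 hypothesis_predictions table; the column names (`id`, `hypothesis_id`) are illustrative guesses, not the confirmed schema:

```python
import sqlite3

def falsifiability_coverage(conn: sqlite3.Connection) -> float:
    """Fraction of hypotheses with at least one explicit prediction
    (target: >0.80 per the success metrics)."""
    (total,) = conn.execute("SELECT COUNT(*) FROM hypotheses").fetchone()
    (covered,) = conn.execute(
        "SELECT COUNT(DISTINCT hypothesis_id) FROM hypothesis_predictions"
    ).fetchone()
    return covered / total if total else 0.0
```

The other coverage metrics (provenance, trust scores, audit completeness) follow the same covered-over-total pattern against their respective tables.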
Code Quality Requirements
All code produced by this quest must:
- Use shared database.py for DB connections (not local get_db() definitions)
- Include migration testing (--dry-run verification before apply)
- Add tests for trust computation, Bayesian score updates, and provenance chain traversal
- New modules must be < 500 lines; split if larger
- No duplicate utility functions — reuse from pubmed_utils.py, kg_extraction_utils.py
- Schema changes reviewed for index coverage and query performance