[Agora] CRITICAL: Hypothesis QUALITY over quantity — reliable high-quality science loop done

The goal is NOT 10 hypotheses/day. The goal is RELIABLY PRODUCING HIGH-QUALITY hypotheses that meaningfully improve the world model. One well-grounded, evidence-backed, peer-reviewed hypothesis is worth more than 100 vague ones.

Reframed priorities:

1. QUALITY GATES BEFORE VOLUME: The existing quality gates (evidence_gate, score_gate, specificity_gate) must be enforced rigorously. Hypotheses that fail gates should NOT enter the world model. The gap quality scoring (specificity, evidence coverage, actionability) must filter BEFORE debates — don't waste compute on vague gaps.
2. DEBATE AS QUALITY MECHANISM: Multi-agent debate (Theorist vs Skeptic vs Expert vs Synthesizer) IS the quality mechanism. The debate should be HARDER to pass, not easier. A hypothesis that survives rigorous skepticism is valuable; one that doesn't should be archived, not promoted.
3. RELIABLE PIPELINE: The 10-bug stall showed the pipeline is fragile. Fixes: post_process handles only new analyses, explicit db.commit after each hypothesis, case-insensitive analysis matching, reverse-sort for freshness. These are correctness fixes, not throughput optimizations.
4. EVIDENCE GROUNDING: Every hypothesis must cite specific evidence (PMIDs, dataset analyses, KG edges). The evidence_gate should be STRICT — no citations, no hypothesis. Data-driven evidence (computational analysis results) should be weighted HIGHER than literature-only claims.
5. WORLD MODEL CURATION: The world model is not just the KG — it includes ALL artifacts (hypotheses, analyses, datasets, papers, debates, notebooks). The world model improves when: (a) a high-quality hypothesis is promoted, (b) a low-quality one is archived, (c) a gap is filled with evidence, (d) duplicates are merged, (e) contradictions are resolved through debate. CURATION is as important as generation.
6. ECONOMICS AS QUALITY SIGNAL: Token rewards should scale with QUALITY, not quantity. A hypothesis that gets promoted to the world model earns 10x a hypothesis that gets archived. Gap bounties should reward resolution quality, not just completion speed. This makes economics the incentive layer for quality, not volume.

Pipeline fixes still needed: post_process correctness (done), scidex-agent reliability (done), LLM provider fallback (in progress), gap quality pre-filter (TODO). But framed as RELIABILITY, not throughput.
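The economics rule in priority 6 can be sketched in a few lines. This is a minimal illustration, not the actual award logic: the function name, base value, and outcome labels are hypothetical; only the 10x promoted-vs-archived ratio comes from the text above.

```python
# Hypothetical sketch of quality-scaled token rewards. Only the 10x
# ratio between promoted and archived outcomes is taken from the spec;
# BASE_REWARD and the names here are illustrative.
BASE_REWARD = 100

def scaled_reward(outcome: str) -> int:
    """A promoted hypothesis earns 10x what an archived one does."""
    multipliers = {"promoted": 1.0, "archived": 0.1}
    return int(BASE_REWARD * multipliers[outcome])
```

The key design point is that the multiplier applies at promotion/archival time, so generating many hypotheses that fail the gates yields little reward.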

Completion Notes

Auto-release: non-recurring task produced no commits this iteration; requeuing for next cycle

Git Commits (4)

[Agora] Verify quality loop already on main; update spec work log [task:6956d3d9-a7e6-4297-98a3-1b8387a9f784] (2026-04-13)
[Agora] Hypothesis quality: evidence scoring, 10x token scaling, gap feedback loop [task:6956d3d9-a7e6-4297-98a3-1b8387a9f784] (2026-04-13)
[Agora] hypothesis quality loop: gap feedback + strict token scaling + evidence weighting (2026-04-11)
[Senate] Alignment: quality over quantity, world model as all artifacts (2026-04-11)
Spec File

Spec: Hypothesis Quality Over Quantity

Task ID: 6956d3d9-a7e6-4297-98a3-1b8387a9f784 Layer: Agora Status: Implemented

Goal

Ensure the hypothesis pipeline produces reliably high-quality, evidence-grounded hypotheses rather than volume. Quality gates must be enforced strictly, token economics must scale with quality, and hypothesis generation must feed back into gap quality scores.

Acceptance Criteria

  • compute_evidence_quality_score() in post_process.py scores evidence composition, weighting data-driven evidence (GWAS, RNA-seq, datasets) 2x vs literature
  • award_hypothesis_tokens() applies a 10x quality multiplier: quality_verified=1 → normal rewards; quality_verified=0 → 0.1x penalty
  • After each hypothesis commit, the originating gap's hypothesis_density and debate_depth scores are updated via gap_quality.update_gap_from_hypothesis()
  • update_gap_from_hypothesis() in scidex/agora/gap_quality.py implements the feedback loop

Implementation

    post_process.py

    • Added compute_evidence_quality_score(evidence_list) — scores evidence by data-driven vs literature-only composition
    • Modified award_hypothesis_tokens() — new quality_verified and evidence_quality params; unverified hypotheses earn 10x fewer tokens; data-driven evidence earns 2x bonus
    • Updated call site to pass quality_verified and ev_quality = compute_evidence_quality_score(evidence_for_list)
    • Added gap quality feedback block after hypothesis commit: looks up gap_id from analysis, calls gap_quality.update_gap_from_hypothesis()
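The post_process.py changes above can be sketched as follows. This is a hedged approximation: the evidence field name (`kind`), the normalization, and the base-reward interaction are assumptions; the 2x data-driven weighting and the 10x (0.1x) quality penalty come from the acceptance criteria.

```python
# Sketch of the described post_process.py additions. Field names and
# normalization are assumptions; the 2x data-driven weight and the 10x
# unverified penalty are taken from the acceptance criteria.
DATA_DRIVEN = {"gwas", "rna_seq", "dataset_analysis"}

def compute_evidence_quality_score(evidence_list):
    """Score evidence composition: data-driven items count 2x literature."""
    if not evidence_list:
        return 0.0
    weights = [2.0 if ev.get("kind") in DATA_DRIVEN else 1.0
               for ev in evidence_list]
    # Normalize so an all-data-driven evidence list scores 1.0
    return sum(weights) / (2.0 * len(weights))

def award_hypothesis_tokens(base, quality_verified, evidence_quality):
    """Unverified hypotheses earn 10x fewer tokens; evidence quality
    scales the remainder as a bonus."""
    multiplier = 1.0 if quality_verified else 0.1
    return base * multiplier * (1.0 + evidence_quality)
```

With this shape, a verified hypothesis backed entirely by computational analyses earns the full bonus, while an unverified, literature-only one earns an order of magnitude less.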

    scidex/agora/gap_quality.py

    • Added update_gap_from_hypothesis(db, gap_id, hypothesis_score, quality_verified, debate_survived) — incrementally updates hypothesis_density (+0.15 × score × quality_mult) and debate_depth (+0.2 × score if debate_survived) then recomputes composite gap_quality_score
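The incremental update above can be written out directly from the stated formulas. Database access is elided and the gap is modeled as a plain dict; the quality multiplier value for unverified hypotheses and the composite recomputation are assumptions, while the +0.15 and +0.2 increments come from the description.

```python
# Sketch of update_gap_from_hypothesis. The +0.15 and +0.2 increments
# are from the spec text; the unverified multiplier (0.5) and the
# composite averaging are illustrative assumptions, and db access is
# replaced with a plain dict.
def update_gap_from_hypothesis(gap, hypothesis_score, quality_verified,
                               debate_survived):
    quality_mult = 1.0 if quality_verified else 0.5  # assumed penalty
    gap["hypothesis_density"] += 0.15 * hypothesis_score * quality_mult
    if debate_survived:
        gap["debate_depth"] += 0.2 * hypothesis_score
    # Recompute the composite score (weighting here is illustrative)
    gap["gap_quality_score"] = (gap["hypothesis_density"]
                                + gap["debate_depth"]) / 2.0
    return gap
```

Because the increments are multiplied by the hypothesis score and quality multiplier, low-quality hypotheses barely move the gap's density, which is what makes this a feedback loop rather than a simple counter.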

Work Log

    • 2026-04-13: Verified all implementation on main (926b1e126). Workspace clean — nothing to commit. Task claimed by slot but work was already completed by prior commits (38fc0a898, 72dcc23cb). All 4 acceptance criteria confirmed present: compute_evidence_quality_score(), award_hypothesis_tokens() with 10x quality scaling, gap_quality.update_gap_from_hypothesis(), and gap quality feedback block in post_process.py.
    • 2026-04-13: Implemented all changes. Prior attempt (72dcc23cb) failed merge due to zombie_sweeper stale heartbeat, not code issues. Re-applied changes against current main using scidex/agora/gap_quality.py (package path) instead of root-level shim.

Payload JSON
    {
      "_stall_skip_providers": [
        "minimax",
        "max_outlook",
        "codex",
        "max_gmail"
      ],
      "_stall_requeued_by": "max_outlook",
      "_stall_requeued_at": "2026-04-15 22:04:02",
      "_stall_skip_at": {
        "max_outlook": "2026-04-15T22:04:02.341234+00:00",
        "codex": "2026-04-14T20:54:53.706413+00:00",
        "max_gmail": "2026-04-14T20:57:48.363309+00:00"
      },
      "_stall_skip_pruned_at": "2026-04-14T10:37:14.022390+00:00"
    }
