The goal is NOT 10 hypotheses/day. The goal is RELIABLY PRODUCING HIGH-QUALITY hypotheses that meaningfully improve the world model. One well-grounded, evidence-backed, peer-reviewed hypothesis is worth more than 100 vague ones.
Reframed priorities:
1. QUALITY GATES BEFORE VOLUME: The existing quality gates (evidence_gate, score_gate, specificity_gate) must be enforced rigorously. Hypotheses that fail gates should NOT enter the world model. The gap quality scoring (specificity, evidence coverage, actionability) must filter BEFORE debates — don't waste compute on vague gaps.
2. DEBATE AS QUALITY MECHANISM: Multi-agent debate (Theorist vs Skeptic vs Expert vs Synthesizer) IS the quality mechanism. The debate should be HARDER to pass, not easier. A hypothesis that survives rigorous skepticism is valuable. One that doesn't should be archived, not promoted.
3. RELIABLE PIPELINE: The 10-bug stall showed the pipeline is fragile. Fixes: post_process handles only new analyses; explicit db.commit after each hypothesis; case-insensitive analysis matching; reverse-sorting for freshness. These are correctness fixes, not throughput optimizations.
4. EVIDENCE GROUNDING: Every hypothesis must cite specific evidence (PMIDs, dataset analyses, KG edges). The evidence_gate should be STRICT — no citations, no hypothesis. Data-driven evidence (computational analysis results) should be weighted HIGHER than literature-only claims.
5. WORLD MODEL CURATION: The world model is not just the KG — it includes ALL artifacts (hypotheses, analyses, datasets, papers, debates, notebooks). The world model improves when: (a) a high-quality hypothesis is promoted, (b) a low-quality one is archived, (c) a gap is filled with evidence, (d) duplicates are merged, (e) contradictions are resolved through debate. CURATION is as important as generation.
6. ECONOMICS AS QUALITY SIGNAL: Token rewards should scale with QUALITY, not quantity. A hypothesis that gets promoted to the world model earns 10x a hypothesis that gets archived. Gap bounties should reward resolution quality, not just completion speed. This makes economics the incentive layer for quality, not volume.
Pipeline fixes still needed: post_process correctness (done), scidex-agent reliability (done), LLM provider fallback (in progress), gap quality pre-filter (TODO). But framed as RELIABILITY not throughput.
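The strict evidence gate in priority 4 could look something like the following minimal sketch. The `Evidence` dataclass and `evidence_gate` signature are assumptions for illustration, not the actual pipeline API; the only rule encoded is the one stated above: no citations, no hypothesis.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    kind: str        # e.g. "pmid", "dataset_analysis", "kg_edge" (illustrative kinds)
    reference: str   # a specific PMID, analysis id, or KG edge id

def evidence_gate(evidence: list) -> bool:
    """Strict gate: a hypothesis passes only if it cites at least one
    specific, non-empty reference. No citations -> no hypothesis."""
    return any(ev.reference.strip() for ev in evidence)
```

A hypothesis with an empty or whitespace-only reference list is rejected outright, which is the cheap pre-filter that keeps vague hypotheses from ever reaching the more expensive debate stage.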
Completion Notes
Auto-release: non-recurring task produced no commits this iteration; requeuing for next cycle
Git Commits (4)
[Agora] Verify quality loop already on main; update spec work log [task:6956d3d9-a7e6-4297-98a3-1b8387a9f784] (2026-04-13)
Ensure the hypothesis pipeline produces reliably high-quality, evidence-grounded hypotheses rather than volume. Quality gates must be enforced strictly, token economics must scale with quality, and hypothesis generation must feed back into gap quality scores.
Acceptance Criteria
compute_evidence_quality_score() in post_process.py scores evidence composition, weighting data-driven evidence (GWAS, RNA-seq, datasets) 2x vs literature
award_hypothesis_tokens() applies a 10x quality multiplier: quality_verified=1 → full rewards; quality_verified=0 → 0.1x rewards
After each hypothesis commit, the originating gap's hypothesis_density and debate_depth scores are updated via gap_quality.update_gap_from_hypothesis()
update_gap_from_hypothesis() in scidex/agora/gap_quality.py implements the feedback loop
Implementation
post_process.py
Added compute_evidence_quality_score(evidence_list) — scores evidence by data-driven vs literature-only composition
Modified award_hypothesis_tokens() — new quality_verified and evidence_quality params; unverified hypotheses earn one-tenth the tokens (0.1x); data-driven evidence earns a 2x bonus
Updated call site to pass quality_verified and ev_quality = compute_evidence_quality_score(evidence_for_list)
Added gap quality feedback block after hypothesis commit: looks up gap_id from analysis, calls gap_quality.update_gap_from_hypothesis()
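A minimal sketch of the two post_process.py functions described above, under stated assumptions: the evidence-item shape (dicts with a "kind" key), the set of data-driven kinds, the normalization, and the exact bonus formula are all illustrative, not the actual implementation. Only the two stated invariants are encoded: data-driven evidence is weighted 2x vs literature, and unverified hypotheses earn one-tenth the tokens.

```python
# Assumed set of evidence kinds treated as data-driven (illustrative).
DATA_DRIVEN = {"gwas", "rna_seq", "dataset", "dataset_analysis"}

def compute_evidence_quality_score(evidence_list):
    """Score evidence composition in [0, 1]: data-driven items weigh 2x
    relative to literature-only citations. All-literature -> 0.5,
    all-data-driven -> 1.0, no evidence -> 0.0."""
    if not evidence_list:
        return 0.0
    weights = [2.0 if ev.get("kind") in DATA_DRIVEN else 1.0
               for ev in evidence_list]
    return sum(weights) / (2.0 * len(weights))

def award_hypothesis_tokens(base_reward, quality_verified, evidence_quality):
    """Unverified hypotheses earn one-tenth the tokens (0.1x vs 1.0x);
    evidence_quality adds a bonus that reaches 2x for all-data-driven
    evidence (assumed bonus formula)."""
    quality_mult = 1.0 if quality_verified else 0.1
    return base_reward * quality_mult * (1.0 + evidence_quality)
```

With this shape, a verified hypothesis backed entirely by computational analyses earns 10x what the same hypothesis would earn unverified, which is the economics-as-quality-signal behavior the acceptance criteria call for.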
scidex/agora/gap_quality.py
Added update_gap_from_hypothesis(db, gap_id, hypothesis_score, quality_verified, debate_survived) — incrementally updates hypothesis_density (+0.15 × score × quality_mult) and debate_depth (+0.2 × score if debate_survived) then recomputes composite gap_quality_score
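The increments above could be sketched as follows. This simplified version operates on an in-memory gap dict instead of taking (db, gap_id), and the quality_mult value, the clamping to [0, 1], and the composite recompute (a simple mean) are assumptions; only the +0.15 × score × quality_mult and +0.2 × score increments come from the description above.

```python
def update_gap_from_hypothesis(gap, hypothesis_score, quality_verified,
                               debate_survived):
    """Feedback loop: a committed hypothesis raises its originating gap's
    hypothesis_density (and debate_depth, if it survived debate), then the
    composite gap_quality_score is recomputed."""
    quality_mult = 1.0 if quality_verified else 0.5  # assumed penalty
    gap["hypothesis_density"] = min(
        1.0, gap["hypothesis_density"] + 0.15 * hypothesis_score * quality_mult)
    if debate_survived:
        gap["debate_depth"] = min(
            1.0, gap["debate_depth"] + 0.2 * hypothesis_score)
    # Assumed composite: mean of the two component scores.
    gap["gap_quality_score"] = (
        gap["hypothesis_density"] + gap["debate_depth"]) / 2.0
    return gap
```

The increments are deliberately small and clamped, so a gap's quality score rises gradually as verified, debate-surviving hypotheses accumulate against it rather than jumping on a single result.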
Work Log
2026-04-13: Verified all implementation on main (926b1e126). Workspace clean — nothing to commit. Task claimed by slot but work was already completed by prior commits (38fc0a898, 72dcc23cb). All 4 acceptance criteria confirmed present: compute_evidence_quality_score(), award_hypothesis_tokens() with 10x quality scaling, gap_quality.update_gap_from_hypothesis(), and gap quality feedback block in post_process.py.
2026-04-13: Implemented all changes. Prior attempt (72dcc23cb) failed merge due to zombie_sweeper stale heartbeat, not code issues. Re-applied changes against current main using scidex/agora/gap_quality.py (package path) instead of root-level shim.