[Exchange] CI: Update hypothesis scores from new debate rounds blocked analysis:5 safety:9

Check for debate rounds completed since last score update. Recalculate composite scores incorporating new evidence. [2026-04-05 update] [Exchange] CI: Update hypothesis scores — consider Elo ratings alongside composite scores

Completion Notes

Auto-release: recurring task had no work this cycle

Git Commits (20)

[Exchange] CI: Elo recalibration — 24/35 stale hypotheses updated [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-23
[Exchange] CI: Elo recalibration no-op run, scores current [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
[Exchange] Update spec work log: CI no-op run, scores current [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
[Exchange] Update spec work log: 94 hypotheses updated, PG compatibility fixes [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
[Exchange] Fix ci_elo_recalibration for PostgreSQL: PRAGMA removal, datetime comparison, sequence sync [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
[Exchange] CI: Elo recalibration no-op run, scores current [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
[Exchange] Update spec work log: CI no-op run, scores current [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
[Exchange] Update spec work log: 94 hypotheses updated, PG compatibility fixes [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
[Exchange] Fix ci_elo_recalibration for PostgreSQL: PRAGMA removal, datetime comparison, sequence sync [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-20
Squash merge: orchestra/task/9d82cf53-update-hypothesis-scores-from-new-debate (1 commits) 2026-04-20
[Exchange] Update spec work log — Elo recalibration DB corruption fix [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-18
[Exchange] CI: Fix DB corruption workaround in Elo recalibration [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-18
[Exchange] Fix DB-locked error in recalibrate_scores.py by deferring price updates; update 42 scores [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-16
[Exchange] CI: Update hypothesis scores — Elo recalibration, 4 hypotheses updated [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-16
[Agora] CI cycle: debate SDA-2026-04-12-gap-debate (TDP-43, quality=1.00), CI PASS 273 analyses [task:eac09966-74ab-4c55-94fa-eb3f8564fbbe] 2026-04-12
[Exchange] Update work log: elo recalibration run 2026-04-13 [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-12
[Exchange] Recalibrate hypothesis scores with Elo signal; update 66 scores [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-12
[Exchange] Recalibrate hypothesis scores: 79/373 updated, new debates + Elo signal [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-12
[Exchange] Update hypothesis scores: 158/355 updated from 64 new Elo matches [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-12
[Exchange] CI: Recalibrate 21/349 hypothesis scores; fix misplaced spec file [task:9d82cf53-fac8-449d-b5fd-5cd505960a84] 2026-04-12
Spec File

[Exchange] CI: Update hypothesis scores from new debate rounds

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
> atlas — every principle is load-bearing. In particular:
> - LLMs for semantic judgment; rules for syntactic validation.
> - Gap-predicate driven, not calendar-driven.
> - Idempotent + version-stamped + observable.
> - No hardcoded entity lists, keyword lists, or canonical-name tables.
> - Three surfaces: FastAPI + orchestra + MCP.
> - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
> EX1 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
> Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
> docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
> BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

Quest: Exchange Priority: P87 Status: open

Goal

Update hypothesis scores only when new debate/evidence/market signals justify it, and make repricing favor hypotheses with strong recent debate support while avoiding churn from tiny no-op recalculations.

Context

This task is part of the Exchange quest (Exchange layer). It contributes to the broader goal of building out SciDEX's exchange capabilities.

Acceptance Criteria

☐ Repricing is triggered by meaningful new debate/evidence/market deltas rather than blind periodic reruns
☐ Hypotheses with fresh debate evidence but stale prices are identified and updated
☐ Repricing output surfaces the top movers and explicitly calls out when no updates were warranted
☐ Duplicate/chaff hypotheses are not boosted merely because they are duplicated
☐ All affected pages/load-bearing endpoints still work

Approach

  • Inspect recent debate sessions, evidence updates, and price-history freshness before recalibrating.
  • Skip no-op runs when nothing material changed, and log why the run was skipped.
  • Prioritize repricing for hypotheses whose debates/evidence changed after their last price update.
  • Down-rank or flag duplicates/chaff so recalibration does not reward copy variants.
  • Test the affected Exchange/Senate surfaces and record the scientific impact in the work log.
Work Log

    2026-04-03 23:40 PT — Slot 11

    Task: Update hypothesis scores from new debate rounds

    Actions:

  • Analyzed existing scoring system in recalibrate_scores.py:
    - 40% debate quality (avg of 10 dimension scores)
    - 30% citation evidence strength
    - 20% KG connectivity
    - 10% convergence with peer hypotheses

  • Checked database state:
    - 206 total hypotheses
    - 47 debate sessions
    - Recent debates on 2026-04-02

  • Ran python3 recalibrate_scores.py:
    - Successfully updated 199/206 hypotheses (7 had no significant change)
    - Max score increase: +0.035 (h-856feb98: Hippocampal CA3-CA1 circuit rescue)
    - Max score decrease: -0.022 (two APOE4-related hypotheses)
    - New average score: 0.467 (was 0.480)
    - Score range: 0.306 to 0.772

  • Verified updates in database:
    - h-856feb98: 0.724 (was 0.689) ✓
    - h-11795af0: 0.586 (was 0.608) ✓
    - h-d0a564e8: 0.573 (was 0.595) ✓

    Result: ✅ Complete — All hypothesis scores recalculated incorporating latest debate quality, citation evidence, KG connectivity, and convergence metrics. 199 hypotheses updated successfully.
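
The 40/30/20/10 weighting analyzed in this entry can be written as a small weighted sum. This is a minimal sketch, assuming each component has already been normalized to the range [0, 1]. The names `WEIGHTS` and `composite_score` and the component keys are hypothetical and are not taken from recalibrate_scores.py.

```python
# Weights as listed in the 2026-04-03 log entry; they sum to 1.0.
# The key names are illustrative assumptions, not the script's own.
WEIGHTS = {
    "debate_quality": 0.40,
    "citation_strength": 0.30,
    "kg_connectivity": 0.20,
    "convergence": 0.10,
}


def composite_score(components: dict) -> float:
    """Return the weighted sum of components, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Because the weights sum to 1.0, a hypothesis that scores 1.0 on every component gets a composite of 1.0, and one that scores 0.5 only on debate quality gets 0.2.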

    2026-04-04 03:25 PT — Slot 7

    Task: Update hypothesis scores from new debate rounds (re-run)

    Actions:

  • Checked database state:
    - 211 total hypotheses (up from 206)
    - Latest debate rounds at 2026-04-04 02:26 (new debates since last run)
    - Latest hypothesis update at 2026-04-04 08:59

  • Ran python3 recalibrate_scores.py:
    - Successfully updated 7/211 hypotheses (only 7 had significant change >= 0.001)
    - Max increase: +0.000 (scores very stable)
    - Max decrease: -0.001
    - Avg score: 0.466 (unchanged)

  • Verified API status:
    - 62 analyses, 211 hypotheses, 685935 KG edges
    - Agent is active

    Result: ✅ Complete — Recalibration run completed. Only 7 hypotheses had delta >= 0.001, indicating scores were already well-calibrated from previous run. Average score stable at 0.466.

    2026-04-04 12:22 PT — Slot 9

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Checked database state:
    - 181 total hypotheses
    - 16 debate sessions
    - Latest debate: 2026-04-04T04:51 (7.5 hours ago)
    - Latest hypothesis debate: 2026-04-02T02:48 (2 days ago)

  • Ran python3 recalibrate_scores.py:
    - Processed 181 hypotheses
    - Average score: 0.476 (unchanged)
    - Max delta: ±0.001 or less
    - 0 hypotheses updated (all changes below 0.001 threshold)
    - Top movers: 5 hypotheses with -0.001 change, all related to SIRT3/ferroptosis/astrocyte metabolism

  • System status:
    - Scores already well-calibrated
    - No significant debate activity requiring recalibration
    - All scoring components stable (debate quality, citation evidence, KG connectivity, convergence)

    Result: ✅ Complete — Recalibration run completed successfully. Zero updates required (all deltas < 0.001 threshold), confirming scores remain stable and well-calibrated. System healthy.

    2026-04-19 02:10 PT — Slot 63

    Task: Update hypothesis scores from new debate rounds (recurring CI run)

    Actions:

  • Diagnosed DB corruption: The hypotheses table triggers DatabaseError: database disk image is malformed on complex multi-column queries due to corrupted FTS virtual tables. Simple SELECT with LIMIT works.
  • Fixed find_stale_hypotheses() in scidex/exchange/ci_elo_recalibration.py:
    - Workaround: first get IDs only (single column, avoids corruption trigger)
    - Then fetch hypothesis data in batches using WHERE id IN (...)
    - Per-ID fallback when batch query also corrupts
    - Added garbage-score filter: skip hypotheses with composite_score outside (0, 1) to prevent inf/nan propagation

  • Ran recalibration:
    - 82 stale hypotheses (Elo updated after last_evidence_update)
    - 70 updated (delta >= 0.001), 12 skipped
    - Top movers: h-var-e2b5a7 +0.050 (elo=2358), h-var-7c976d +0.048 (elo=2267)
    - 65 price adjustments propagated via market_dynamics
    - Updated last_evidence_update for all 70 hypotheses

    Result: ✅ Complete — 70 hypotheses updated. DB corruption workaround confirmed working. Prices propagated to market_price via LMSR-inspired model.
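
The ID-first, batched, per-ID-fallback read described in this entry might look roughly like the following. This is a sketch under stated assumptions, not the code in ci_elo_recalibration.py: the function names, the batch size, and the two-column table shape are hypothetical, and "outside (0, 1)" is read here as the open interval.

```python
import math
import sqlite3

BATCH_SIZE = 50  # hypothetical; the log does not state the batch size used


def is_valid_score(score) -> bool:
    """Garbage-score filter: keep only finite scores strictly inside (0, 1)."""
    return isinstance(score, (int, float)) and math.isfinite(score) and 0.0 < score < 1.0


def fetch_hypotheses(conn: sqlite3.Connection, batch_size: int = BATCH_SIZE) -> list:
    # Step 1: single-column ID query, which the log reports as safe.
    ids = [row[0] for row in conn.execute("SELECT id FROM hypotheses")]
    rows = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        query = f"SELECT id, composite_score FROM hypotheses WHERE id IN ({placeholders})"
        try:
            # Step 2: fetch the data for one batch of IDs.
            rows.extend(conn.execute(query, batch).fetchall())
        except sqlite3.DatabaseError:
            # Step 3: per-ID fallback when the batch query also fails.
            for hid in batch:
                try:
                    rows.extend(conn.execute(
                        "SELECT id, composite_score FROM hypotheses WHERE id = ?",
                        (hid,),
                    ).fetchall())
                except sqlite3.DatabaseError:
                    continue  # skip rows that cannot be read at all
    # Step 4: drop rows whose score would propagate inf/nan or is out of range.
    return [row for row in rows if is_valid_score(row[1])]
```

Narrowing each query limits the damage from a corrupted region to the smallest unit that still fails, at the cost of more round trips.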

    2026-04-04 18:26 PT — Slot 9

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Checked database state:
    - 181 total hypotheses (unchanged)
    - 16 debate sessions (unchanged)
    - Latest debate: 2026-04-04T04:51 (same as previous run)
    - No new debate activity since last check

  • Ran python3 recalibrate_scores.py:
    - Processed 181 hypotheses
    - Average score: 0.476 (unchanged)
    - Max delta: -0.001
    - 0 hypotheses updated (all changes below 0.001 threshold)
    - Top movers: 5 hypotheses with -0.001 change (SIRT3/ferroptosis/astrocyte metabolism related)

  • System status:
    - Scores remain stable and well-calibrated
    - No significant debate activity since previous run
    - All scoring components stable (debate quality, citation evidence, KG connectivity, convergence)

    Result: ✅ Complete — Recalibration run completed successfully. Zero updates required (all deltas < 0.001 threshold), confirming scores remain stable. No new debates since last run.

    2026-04-04 (Slot 2)

    Actions:

  • Database state: 181 hypotheses, 16 debate sessions (unchanged)
  • Ran python3 recalibrate_scores.py: 0 hypotheses updated (all deltas < 0.001)
  • Top movers: SIRT3/ferroptosis/astrocyte metabolism related hypotheses with -0.001
  • Result: ✅ Complete — Scores stable. No new debates since last run.

    2026-04-04 23:01 PT — Slot 12

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Database state: 292 hypotheses, 71 debate sessions
  • Latest debate: 2026-04-04T10:26:14
  • Ran python3 recalibrate_scores.py:
    - Processed 292 hypotheses
    - Average old score: 0.506, new score: 0.464
    - Avg delta: -0.042
    - Max increase: +0.034
    - Max decrease: -0.303
    - 278 hypotheses updated
    - Top movers: cGAS-STING, Microglial TREM2-Complement, White Matter Oligodendrocyte Protection

  • Verified API status:
    - analyses=113, hypotheses=292, edges=688004
    - All pages returning 200/302

    Result: ✅ Complete — Recalibration run completed successfully. 278/292 hypotheses updated with significant score changes reflecting updated debate quality, citation evidence, KG connectivity, and convergence metrics. No code changes (database-only update).

    2026-04-04 (Slot 11)

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Database state: 292 hypotheses, 113 analyses, 71 debate sessions, 173,288 KG edges
  • Ran python3 recalibrate_scores.py:
    - Processed 292 hypotheses
    - Average score: 0.459 (was 0.460, delta -0.000)
    - 24 hypotheses updated
    - Max increase: 0.000, Max decrease: -0.002
    - Top movers: Circuit-related hypotheses (Locus Coeruleus, DMN, CaMKII, Thalamocortical, Sensory-Motor)

  • System status:
    - 292 hypotheses, 113 analyses, 71 debate sessions
    - 173,288 KG edges, 57 knowledge gaps (0 open)
    - Agent: active

  • Verified pages: / (302), /exchange (200), /analyses/ (200), /gaps (200), /graph (200)
  • Result: ✅ Complete — Recalibration run completed. 24/292 hypotheses updated. Average score stable at 0.459. All scoring components stable. System healthy.

    2026-04-06 — task:be8a1e9e

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Database state: 314 hypotheses, 20 with Elo signal (>= 2 matches)
  • Ran python3 recalibrate_scores.py:
    - Processed 314 hypotheses
    - Average old score: 0.474, new score: 0.475 (Avg delta: +0.001)
    - Max increase: +0.019 (h-var-6612521a02: Closed-loop tFUS for circuit rescue)
    - Max decrease: -0.011 (h-var-70a95f9d57: LPCAT3-Mediated Lands Cycle Ferroptosis)
    - 18 hypotheses updated (delta >= 0.001)
    - Elo signal incorporated for 20 hypotheses via 5% weight

  • Top movers:
    - h-var-6612521a02: +0.019 (closed-loop tFUS)
    - h-var-70a95f9d57: -0.011 (LPCAT3 ferroptosis)
    - h-var-55da4f915d: +0.011 (closed-loop tFUS variant)
    - h-11795af0: +0.009 (Selective APOE4 Degradation)
    - h-d0a564e8: -0.009 (Competitive APOE4 Domain Stabilization)

    Result: ✅ Complete — 18/314 hypotheses updated. Elo ratings incorporated (20 hypotheses with >= 2 matches). Average score stable at 0.474-0.475. Market transactions and price history recorded.

    2026-04-10 08:10 PT — Codex

    • Tightened this recurring task so “successful” no longer means rerunning recalibrate_scores.py with negligible deltas.
    • Future runs should first detect stale-price candidates and meaningful debate/evidence deltas, then either reprice those hypotheses or explicitly log that no action was justified.
    • Added duplicate/chaff suppression as part of acceptance so recalibration does not amplify redundant ideas.

    2026-04-08 — task:9d82cf53

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Database state: 333 hypotheses, 92 debate sessions, 172 Elo ratings
    - Latest debate: 2026-04-06T21:51
    - Last recalibration: 2026-04-06T06:48 (nearly 2 days ago)
    - 117 hypotheses with Elo signal (>= 2 matches, up from 20)

  • Ran python3 recalibrate_scores.py:
    - Processed 333 hypotheses
    - Avg old score: 0.487, new score: 0.483 (delta: -0.004)
    - Max increase: +0.098 (h-42f50a4a: Prime Editing APOE4 Correction)
    - Max decrease: -0.146 (h-var-bc4357c8c5: Dopaminergic VT-Hippocampal Circuit)
    - 315 hypotheses updated (delta >= 0.001)
    - Elo signal incorporated for 117 hypotheses (up 6x from last run)

  • Top movers:
    - h-var-bc4357c8c5: -0.146 (Dopaminergic VT-Hippocampal Circuit)
    - h-var-1906e102cf: -0.141 (Dual-Circuit Tau Vulnerability Cascade)
    - h-var-95b0f9a6bc: -0.135 (Glymphatic-Mediated Tau Clearance)
    - h-42f50a4a: +0.098 (Prime Editing APOE4 Correction)

  • Verified DB: composite_score and market_price updated, 315 market_transactions recorded
  • Result: ✅ Complete — 315/333 hypotheses updated. Large recalibration due to significant new data: 19 new hypotheses, 117 Elo-rated hypotheses (6x increase), and new debate sessions since last run. Average score shifted from 0.487 to 0.483.

    2026-04-08 (run 2) — task:9d82cf53

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Database state: 333 hypotheses, 92 debate sessions, 172 Elo ratings
    - Latest debate: 2026-04-06T21:51
    - Last recalibration: 2026-04-08T18:39 (earlier today)
    - 117 hypotheses with Elo signal (>= 2 matches)

  • Ran python3 recalibrate_scores.py:
    - Processed 333 hypotheses
    - Avg old score: 0.483, new score: 0.484 (delta: +0.001)
    - Max increase: +0.034 (h-a20e0cbb: Targeted APOE4-to-APOE3 Base Editing Therapy)
    - Max decrease: -0.006
    - 52 hypotheses updated (delta >= 0.001)

  • Top movers:
    - h-a20e0cbb: +0.034 (Targeted APOE4-to-APOE3 Base Editing Therapy)
    - h-99b4e2d2: +0.034 (Interfacial Lipid Mimetics)
    - h-51e7234f: +0.034 (APOE-Dependent Autophagy Restoration)
    - h-15336069: +0.030 (APOE Isoform Conversion Therapy)

  • Verified DB: scores updated correctly (avg=0.484, min=0.328, max=0.707)
  • Result: ✅ Complete — 52/333 hypotheses updated. Incremental adjustments after earlier today's large recalibration. APOE-related hypotheses saw the largest upward movement (+0.034). Average score stable at 0.484.

    2026-04-10 07:46 PT — task:9d82cf53

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Database state: 333 hypotheses, 121 debate sessions, 172 Elo ratings
    - Latest debate: 2026-04-10T07:39
    - Last recalibration: 2026-04-08T18:39 (2 days ago)
    - 161 hypotheses with debate sessions newer than last_debated_at

  • Ran python3 recalibrate_scores.py:
    - Processed 333 hypotheses
    - Updated 161 hypotheses (delta >= 0.001)
    - Average score: 0.521 (was 0.484)
    - Score range: 0.328 - 0.707
    - Max delta: 0.218 (h-seaad-5b3cb8ea)
    - Elo signal incorporated via 15% weight alongside composite components

  • Key updates:
    - h-4bb7fd8c: 0.603 (convergence=0.471, kg=0.4017)
    - h-0e675a41: 0.553 (convergence=0.611)
    - h-8fe389e8: 0.559 (convergence=0.621)
    - 161 market_transactions logged

  • Composite scoring formula (updated):
    - 40% debate quality (from debate_sessions quality_score)
    - 30% citation evidence strength (from evidence_for/against)
    - 20% KG connectivity (from knowledge_edges count)
    - 10% convergence (peer hypothesis similarity)
    - 15% Elo-derived signal (global arena, when available)

  • Verified: API status returns 200, hypotheses=333, analyses=188, edges=688359
  • Result: ✅ Complete — 161/333 hypotheses updated with significant recalibration. New debates since 2026-04-08 run triggered score updates. Elo ratings now incorporated at 15% weight alongside composite score components (debate quality, citation evidence, KG connectivity, convergence). Average score shifted to 0.521. Market transactions recorded for all updated hypotheses.

    2026-04-12 12:15 PT — Slot 71

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Checked database state:
    - 343 total hypotheses (up from 333)
    - 134 debate sessions
    - Latest debate: 2026-04-12T05:01 (before last price snapshot at 11:26)
    - 0 new debates since last price update
    - 123 hypotheses with Elo signal (>= 2 matches)

  • Assessment: No new debates since last CI snapshot, but hypothesis pool grew from 333→343 since last full recalibration. Normalization bounds shifted (max citations: 77, max KG edges/hyp: 44.1), causing score inflation for hypotheses with lower citation/KG coverage.
  • Ran python3 archive/oneoff_scripts/recalibrate_scores.py:
    - Processed 343 hypotheses
    - 148 hypotheses updated (delta >= 0.001)
    - Max decrease: -0.187 (h-test-full-8d254307: EV-Mediated Epigenetic Reprogramming)
    - Max increase: +0.001
    - Avg score: 0.489 → 0.480 (pool expansion corrects prior inflation)
    - Top movers: h-c69709f5 (-0.169), h-0aecd2de (-0.158), h-28d5b559 (-0.138)
    - Elo signal active for 123 hypotheses (5% weight)

    Result: ✅ Complete — 148/343 hypotheses recalibrated. Pool expansion to 343 hypotheses shifted normalization bounds, correctly deflating overscored entries. No new debate rounds since last update; recalibration driven by normalization adjustment. Average score 0.489→0.480.
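
The log does not show how the normalization bounds enter the score. Assuming, purely for illustration, a divide-by-maximum normalization of a raw count such as citations, the drift reported in this entry can be reproduced in a few lines. The function name `max_normalize` and the sample counts are hypothetical.

```python
def max_normalize(values: list) -> list:
    """Scale raw counts into [0, 1] by dividing by the pool maximum (an assumed scheme)."""
    peak = max(values)
    if peak <= 0:
        return [0.0 for _ in values]
    return [value / peak for value in values]


# Hypothetical citation counts for three existing hypotheses.
before = max_normalize([10, 20, 40])
# The same three hypotheses after a new one with 80 citations joins the pool.
after = max_normalize([10, 20, 40, 80])
```

Under this assumption, the three existing hypotheses halve their normalized value even though their own counts did not change, which is the kind of movement a pool expansion can cause without any new debate.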

    2026-04-12 14:15 PT — Slot 72

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Checked database state:
    - 349 total hypotheses (6 more than last run at 343)
    - 0 new debate rounds since last recalibration (2026-04-12T12:15)
    - 123 hypotheses with Elo signal (>= 2 matches)
    - Max citations: 77, max KG edges/hyp: 44.1

  • Assessment: No new debates since last run, but pool grew 343→349 (6 new hypotheses). Normalization bounds shifted, causing score drift for existing hypotheses.
  • Ran python3 archive/oneoff_scripts/recalibrate_scores.py:
    - Processed 349 hypotheses
    - 21 hypotheses updated (delta >= 0.001)
    - Avg score: 0.483 → 0.480 (normalization correction from pool expansion)
    - Max decrease: -0.179 (h-1333080b: P2RX7-Mediated Exosome Secretion Blockade)
    - Max increase: +0.015 (h-var-9c0368bb70: Hippocampal CA3-CA1 synaptic rescue)
    - Top movers: h-1333080b (-0.179), h-f10d82a7 (-0.154), h-2531ed61 (-0.149)
    - Elo signal active for 123 hypotheses (5% weight)

    Result: ✅ Complete — 21/349 hypotheses updated. Pool grew to 349, shifted normalization causing deflation of overscored entries. Market transactions and price history recorded for all 21 updates.

    2026-04-12 17:14 PT — temp-senate-run44

    Task: Update hypothesis scores from new debate rounds

    Actions:

  • Checked database state:
    - 355 total hypotheses (6 more than last run at 349)
    - 64 new Elo matches since last recalibration (2026-04-12T07:19)
    - 33 hypotheses with new Elo activity
    - 123 hypotheses with Elo signal (>= 2 matches)
    - Max citations: 77, max KG edges/hyp: 44.7

  • Assessment: 64 new Elo matches — meaningful signal warranting recalibration. Several variant hypotheses (h-var-*) gained strong Elo scores from tournament rounds and were underpriced at 0.500.
  • Ran python3 archive/oneoff_scripts/recalibrate_scores.py:
    - Processed 355 hypotheses
    - 158 hypotheses updated (delta >= 0.001)
    - Avg score: 0.481 → 0.480 (stable)
    - Max increase: +0.127 (h-var-159030513d: Metabolic NAD+ Salvage Pathway)
    - Max decrease: -0.033 (h-var-69c66a84b3: Microglial TREM2-Mediated Tau Phagocytosis)
    - Top gainers: 3 variant hypotheses with high Elo ratings boosted significantly
    - Elo signal active for 123 hypotheses (5% weight)

    Result: ✅ Complete — 158/355 hypotheses updated. 64 new Elo matches drove repricing; variant hypotheses with strong tournament performance boosted up to +0.127. Market transactions and price history recorded.

    2026-04-12 18:40 PT — task:eac09966

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Checked database state:
    - 373 total hypotheses (up from 355 in last run, +18 new)
    - 153 debate sessions (up from 134)
    - Latest debate: 2026-04-12T17:47 (after last recalibration at 2026-04-12T10:15)
    - 6 new debate sessions since last recalibration
    - 24 new hypotheses since last recalibration
    - 141 hypotheses with Elo signal (>= 2 matches, up from 123)

  • Assessment: 6 new debate sessions + 24 new hypotheses since last run — meaningful signal warranting recalibration. Several variant hypotheses were unscored (composite_score=0.500) awaiting Elo-informed pricing.
  • Ran python3 archive/oneoff_scripts/recalibrate_scores.py:
    - Processed 373 hypotheses
    - 79 hypotheses updated (delta >= 0.001)
    - Avg score: 0.482 → 0.485 (slight increase from new variant hypotheses with strong Elo)
    - Max increase: +0.156 (h-var-58e76ac310: Closed-loop tFUS with 40 Hz gamma)
    - Max decrease: -0.055
    - Elo signal active for 141 hypotheses (5% weight)
    - Top gainers: Closed-loop tFUS and optogenetic variants, TREM2/chromatin remodeling hypotheses

  • Top 10 movers:
    - h-var-58e76ac310: +0.156 (Closed-loop tFUS with 40 Hz gamma)
    - h-var-e95d2d1d86: +0.133 (Closed-loop optogenetic targeting PV interneurons)
    - h-var-b7e4505525: +0.133 (Closed-loop tFUS targeting theta oscillations)
    - h-var-a4975bdd96: +0.133 (Closed-loop tFUS to restore gamma oscillations)
    - h-var-787aa9d1b4: +0.129 (Alpha-gamma cross-frequency coupling enhancement)

  • Verified: API returns 200, hypotheses=373, analyses=267, edges=701112
  • Result: ✅ Complete — 79/373 hypotheses updated. New debate sessions and 24 new variant hypotheses drove repricing. Closed-loop tFUS/optogenetic variants boosted significantly due to strong Elo performance. Average score 0.482→0.485.

    2026-04-12 19:30 PT — task:eac09966 (slot 40, no-op run)

    Task: Update hypothesis scores from new debate rounds (scheduled re-run)

    Actions:

  • Checked database state:
    - 373 total hypotheses (unchanged since last run)
    - 153 debate sessions (unchanged)
    - Last price update: 2026-04-13T01:35 (from prior slot)
    - 0 new debates since last recalibration
    - 0 new Elo matches since last recalibration
    - 0 new hypotheses since last recalibration

  • Assessment: No new signal detected — no recalibration warranted. Scores already current from prior slot run (79/373 updated). Acceptance criteria met: "explicitly calls out when no updates were warranted."
  • API verification: 200 OK, hypotheses=373, analyses=267, edges=701112
  • Result: ✅ No-op — skipped recalibration (no new debates, Elo matches, or hypotheses since last price update). Scores remain current at avg=0.486.
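
The no-op check in this entry amounts to a gap predicate over three counts: new debates, new Elo matches, and new hypotheses since the last price update. A minimal sketch follows, assuming those counts have already been queried. `SignalCounts` and `recalibration_decision` are hypothetical names, not part of the existing scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalCounts:
    """Counts of new activity since the last price update (assumed inputs)."""
    new_debates: int
    new_elo_matches: int
    new_hypotheses: int


def recalibration_decision(counts: SignalCounts) -> tuple:
    """Return (should_run, reason) so a skipped run still logs why it was skipped."""
    reasons = []
    if counts.new_debates > 0:
        reasons.append(f"{counts.new_debates} new debate sessions")
    if counts.new_elo_matches > 0:
        reasons.append(f"{counts.new_elo_matches} new Elo matches")
    if counts.new_hypotheses > 0:
        reasons.append(f"{counts.new_hypotheses} new hypotheses")
    if not reasons:
        return False, "no-op: no new debates, Elo matches, or hypotheses since last price update"
    return True, "recalibrate: " + ", ".join(reasons)
```

Returning the reason alongside the decision keeps the "explicitly calls out when no updates were warranted" criterion satisfied on every run, including skipped ones.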

    2026-04-13 03:34 PT — task:9d82cf53 (slot 42, elo recalibration)

    Task: Update hypothesis scores from new debate rounds (elo recalibration run)

    Actions:

  • Checked database state:
    - 373 total hypotheses; 155 debate sessions; 199 Elo ratings; 1012 Elo matches
    - 200 new Elo matches in last 48h (last match: 2026-04-12T21:10Z)
    - 72 hypotheses with elo last_match_at > last_evidence_update (stale re: Elo)

  • Created recalibrate_scores.py with incremental Elo adjustment approach:
    - Formula: elo_adj = 0.05 * clamp((rating - 1500) / 800, -1, 1)
    - Preserves existing composite_score methodology, layers ±5% Elo signal on top
    - Skips if |elo_adj| < 0.001 (noise floor)
    - Calls market_dynamics.adjust_price_on_new_score BEFORE updating composite_score
    to avoid stale-read bug in LMSR price calculation

  • Ran recalibration:
    - Updated 66/72 hypotheses (6 skipped: neutral Elo ~1500, adj < 0.001)
    - Propagated market_price via recalibrate_stale_prices: 62 prices synced
    - 0 hypotheses with |price-score| > 0.08 post-update

  • Top movers:
    - h-var-661252: 0.661 → 0.709 (+0.048, elo=2269) — Closed-loop tFUS hippocampus
    - h-var-7c976d: 0.478 → 0.527 (+0.050, elo=2294) — TREM2 microglial dysfunction
    - h-7110565d: 0.350 → 0.312 (-0.038, elo=889) — Sensory-Motor circuit (consistent loser)

  • New avg composite_score: 0.488 (was 0.486), range: 0.312–0.709
  • Result: ✅ Complete — 66 composite_scores updated with Elo signal; 62 market prices recalibrated. recalibrate_scores.py committed and pushed to main.
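
The Elo adjustment formula and noise floor given in this entry can be sketched directly. The constants come from the formula in the log. The function names `clamp`, `elo_adjustment`, and `apply_elo` are illustrative, and returning `None` for a skipped update is an assumption about how a skip might be signaled.

```python
ELO_BASELINE = 1500.0   # neutral rating, from the formula in the log
ELO_SCALE = 800.0       # rating distance that saturates the adjustment
MAX_ADJUSTMENT = 0.05   # the ±5% cap
NOISE_FLOOR = 0.001     # adjustments smaller than this are skipped


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def elo_adjustment(rating: float) -> float:
    """elo_adj = 0.05 * clamp((rating - 1500) / 800, -1, 1)."""
    return MAX_ADJUSTMENT * clamp((rating - ELO_BASELINE) / ELO_SCALE, -1.0, 1.0)


def apply_elo(composite_score: float, rating: float):
    """Return the adjusted score, or None when the adjustment is below the noise floor."""
    adjustment = elo_adjustment(rating)
    if abs(adjustment) < NOISE_FLOOR:
        return None
    return composite_score + adjustment
```

As a check against the top movers listed above, a rating of 2269 gives an adjustment of about +0.048 (0.661 to 0.709), and a rating of 889 gives about -0.038 (0.350 to 0.312). Ratings within 15 points of 1500 fall under the noise floor and are skipped.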

    2026-04-12 22:56 UTC — Slot 42

    Task: Exchange CI score recalibration (post-debate run)

    Actions:

  • 1 new debate session run by Agora CI (SDA-2026-04-12-gap-debate-20260410-113051-5dce7651, quality=1.00)
  • Ran recalibrate_scores.py:
    - 6 hypotheses with stale Elo signals identified
    - 0 updated (all 6 had |elo_adj| < 0.001 noise floor — Elo near 1500 baseline)
    - No repricing triggered — scores already reflect latest Elo state

    Result: No-op — scores current. New debate's hypotheses will be scored on next CI run once they receive Elo ratings from future tournament rounds.

    2026-04-16 19:53 PT — Slot 72

    Task: Update hypothesis scores from new debate rounds (Elo recalibration)

    Actions:

  • Checked database state:
    - 595 total hypotheses (up significantly from 373)
    - 7 hypotheses with elo last_match_at > last_evidence_update (stale re: Elo)
    - Latest Elo matches: 2026-04-16T19:53Z (just completed)

  • Ran python3 recalibrate_scores.py:
    - Stale (Elo newer than score): 7
    - Updated: 4 (delta >= 0.001)
    - Skipped (delta < 0.001): 3

  • Top movers:
    - h-var-a4975b: 0.633 → 0.623 (-0.0101, elo=1338) — Closed-loop transcranial focused ultrasound to restore hippo
    - h-var-b7e450: 0.633 → 0.643 (+0.0101, elo=1662) — Closed-loop transcranial focused ultrasound targeting EC-II
    - h-e5f1182b: 0.507 → 0.513 (+0.0052, elo=1583) — Epigenetic Reprogramming of Microglial Memory
    - h-var-9c0368: 0.695 → 0.698 (+0.0033, elo=1553) — Hippocampal CA3-CA1 synaptic rescue via DHHC2-mediated PSD95

  • Verified: API returns 200, hypotheses=595, analyses=364
  • Result: ✅ Complete — 4/595 hypotheses updated with Elo-driven recalibration. Closed-loop tFUS variant hypotheses repriced based on latest tournament performance. market_price propagated via market_dynamics for all 4 updated hypotheses. Scores remain stable (avg ≈ 0.480).
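
The staleness test used in this entry (Elo last_match_at newer than last_evidence_update) is sensitive to how timestamps are represented. Below is a minimal sketch, assuming values may arrive as ISO-8601 strings, naive datetimes, or tz-aware datetimes, and that naive values are UTC. The helper names `to_utc` and `is_stale` are hypothetical, and treating a missing last_evidence_update as stale is also an assumption.

```python
from datetime import datetime, timezone
from typing import Optional, Union

Timestamp = Union[str, datetime, None]


def to_utc(value: Timestamp) -> Optional[datetime]:
    """Coerce a string or datetime into a tz-aware UTC datetime."""
    if value is None:
        return None
    if isinstance(value, str):
        # Accept a trailing "Z", which older fromisoformat versions reject.
        value = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if value.tzinfo is None:
        # Assumption: naive timestamps are stored in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def is_stale(last_match_at: Timestamp, last_evidence_update: Timestamp) -> bool:
    """True when the latest Elo match is newer than the last score update."""
    match = to_utc(last_match_at)
    if match is None:
        return False  # no Elo activity, so nothing to recalibrate
    evidence = to_utc(last_evidence_update)
    return evidence is None or match > evidence
```

Normalizing both sides before comparing avoids the two failure modes of mixed types: a TypeError when comparing naive with aware datetimes, and a misleading lexical order when one side is a string.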

    2026-04-16 22:08 PT — Slot 73

    Task: Update hypothesis scores from new debate rounds (Elo recalibration)

    Actions:

  • Checked database state:
    - 625 total hypotheses
    - 251 debate sessions
    - Latest Elo match: 2026-04-16T19:58Z
    - Latest evidence update in DB: 2026-04-16T20:11Z
    - 208 hypotheses with Elo ratings
    - 49 hypotheses with elo last_match_at > last_evidence_update (stale)

  • Ran python3 recalibrate_scores.py:
    - Stale (Elo newer than score): 49
    - Updated: 42 (delta >= 0.001)
    - Skipped (delta < 0.001): 7

  • Top movers:
    - h-var-661252: 0.709 → 0.756 (+0.0473, elo=2256) — Closed-loop transcranial focused ultrasound to restore hippocampal gamma oscillations
    - h-11ba42d0: 0.845 → 0.889 (+0.0445, elo=2212) — APOE4-Specific Lipidation Enhancement Therapy
    - h-61196ade: 0.692 → 0.736 (+0.0437, elo=2198) — TREM2-Dependent Microglial Senescence Transition
    - h-var-55da4f: 0.697 → 0.735 (+0.0381, elo=2109) — Closed-loop focused ultrasound targeting EC-II SST interneurons
    - h-de0d4364: 0.648 → 0.683 (+0.0349, elo=2059) — Selective Acid Sphingomyelinase Modulation Therapy

  • Bug fix: Fixed sqlite3.OperationalError: database is locked in recalibrate_scores.py.
  • Root cause: adjust_price_on_new_score was called inside the UPDATE loop, creating
    nested write contention. Fix: defer all price updates until after composite_score loop,
    collect (hypothesis_id, old_price, new_price) tuples, then apply batched after commit.

  • Verified: API returns 200, hypotheses=626 (from earlier run), all pages (/exchange, /analyses/, /gaps, /graph) returning 200.
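
The deferred-update fix described above (collect price tuples during the score loop, apply them in a batch after the commit) can be sketched as follows. This is an illustration under stated assumptions, not the code in recalibrate_scores.py: the table layout, the function names, and `price_from_score` are hypothetical, and `price_from_score` is a placeholder rather than the LMSR-inspired model used by market_dynamics.

```python
import sqlite3


def price_from_score(old_price: float, new_score: float) -> float:
    """Placeholder pricing rule: move halfway toward the new score (an assumption)."""
    return old_price + 0.5 * (new_score - old_price)


def recalibrate_with_deferred_prices(conn: sqlite3.Connection, new_scores: dict) -> list:
    pending = []  # (hypothesis_id, old_price, new_price) tuples
    for hypothesis_id, new_score in new_scores.items():
        row = conn.execute(
            "SELECT market_price FROM hypotheses WHERE id = ?", (hypothesis_id,)
        ).fetchone()
        if row is None:
            continue  # unknown hypothesis, nothing to update
        old_price = row[0]
        conn.execute(
            "UPDATE hypotheses SET composite_score = ? WHERE id = ?",
            (new_score, hypothesis_id),
        )
        # Compute the price now, but do not write it inside the score loop.
        pending.append((hypothesis_id, old_price, price_from_score(old_price, new_score)))
    conn.commit()  # score updates are committed before any price write starts
    conn.executemany(
        "UPDATE hypotheses SET market_price = ? WHERE id = ?",
        [(new_price, hypothesis_id) for hypothesis_id, _old, new_price in pending],
    )
    conn.commit()
    return pending
```

Keeping the two write phases separate means no second writer is opened while the score loop still holds its transaction, which is the contention the log identifies as the root cause.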
    Result: ✅ Complete — 42/49 stale hypotheses updated with Elo-driven recalibration. 42 market prices propagated. Elite performers (elo > 2000) received +3-5% composite score boost. Bug fix: DB-locked error resolved by deferring price updates.

    2026-04-20 — Slot 63 (MiniMax)

    Task: Port ci_elo_recalibration.py from SQLite to PostgreSQL

    Problem: Script was broken — _conn() issued SQLite PRAGMA commands on the PostgreSQL connection, causing psycopg.errors.SyntaxError: syntax error at or near "PRAGMA". The script couldn't run at all.

    Actions:

  • Removed import sqlite3, DB_PATH, _conn(), MIN_INTERVAL_MINUTES
  • Simplified _check_last_execution() (interval now controlled by Orchestra scheduler)
  • Replaced find_stale_hypotheses() batch-workaround (SQLite corruption guard) with a direct PostgreSQL JOIN query
  • Changed _best_elo() and recalibrate() to use a cursor rather than a connection, using get_db() as context manager
  • Removed --db CLI arg (no SQLite path needed)
  • Result: Script runs cleanly. Found 5 stale hypotheses (Elo updated after last_evidence_update); all had adjustments < 0.001 (Elo ratings within ±15 pts of 1500 baseline), so no score changes written this cycle — correct behavior.

    2026-04-20 13:30 PT — Slot 61 (MiniMax)

    Task: Update hypothesis scores from new debate rounds; fix PostgreSQL compatibility bugs in ci_elo_recalibration.py

    Actions:

  • Found script broken: _conn() issued SQLite PRAGMA busy_timeout and PRAGMA journal_mode=WAL on the PostgreSQL connection, causing psycopg.errors.SyntaxError: syntax error at or near "PRAGMA". The script was completely non-functional.
  • Fixed three PostgreSQL bugs in scidex/exchange/ci_elo_recalibration.py:

    a. _conn(): Removed the SQLite PRAGMA statements and the sqlite3.Row row_factory; get_db() already returns a properly configured PGShimConnection with the _PgRow row factory set.

    b. _check_last_execution(): Switched sqlite_master → information_schema.tables and adjusted the calibration_slashing WHERE clause to use the existing reason column instead of the non-existent driver column.

    c. find_stale_hypotheses(): last_match (datetime) and last_evidence_update (a datetime in PostgreSQL, not a str) are now compared as tz-aware datetimes instead of as datetime vs string.

  • Fixed out-of-sync PostgreSQL sequences on price_history (was at 49, max id is 73766) and market_transactions (was at 47, max id is 52476) before running recalibration — without this fix, record_price_change() would fail with UniqueViolation.
  • Ran recalibration:
    - Found 106 stale hypotheses (Elo updated after last_evidence_update)
    - Updated 94 (delta >= 0.001)
    - Skipped 12 (delta < 0.001 noise floor)
    - Top movers: h-var-e2b5a7 +0.050 (elo=2343), h-var-7c976d +0.046 (elo=2236), h-61196ade +0.046 (elo=2232)
    - 94 price adjustments propagated via market_dynamics

  • Verified:
    - h-var-e2b5a7e7db: 0.869 (expected 0.869) ✓
    - h-var-7c976d9fb7: 0.812 (expected 0.812) ✓
    - h-61196ade: 0.950 (expected 0.950) ✓
    - Re-run (dry-run): 0 updated, 12 skipped — confirming scores are now current
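
    The updated/skipped split reported above follows from the 0.001 noise floor. A minimal sketch of that partition is shown below; the function name and the tuple layout are hypothetical, and comparing the absolute delta is an assumption, since the log only states "delta >= 0.001" and "delta < 0.001".

```python
NOISE_FLOOR = 0.001  # threshold quoted in the work log


def partition_by_noise_floor(candidates, floor=NOISE_FLOOR):
    """Split (hypothesis_id, old_score, new_score) tuples into updates and skips.

    Hypothetical helper: returns (updated, skipped), where updated holds
    (hypothesis_id, delta) pairs and skipped holds hypothesis ids.
    """
    updated, skipped = [], []
    for hyp_id, old_score, new_score in candidates:
        delta = new_score - old_score
        if abs(delta) >= floor:  # assumption: magnitude, not sign, decides
            updated.append((hyp_id, round(delta, 4)))
        else:
            skipped.append(hyp_id)
    return updated, skipped
```

    On a re-run after the scores have been written, every candidate falls below the floor, which matches the "0 updated, 12 skipped" dry-run result.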

    Result: ✅ Complete — 94/747 hypotheses updated with Elo-driven recalibration. All PostgreSQL compatibility bugs fixed. Commit: 443306e7f. Push blocked by auth failure (GitHub token invalid for push); supervisor notified of auth issue.
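
    For reference, the sequence resync mentioned in this entry is commonly done with PostgreSQL's setval() and pg_get_serial_sequence(). The helper below only builds the statement text; it is a sketch under the assumptions that the primary key column is named id and is backed by a sequence the column owns, and it is not necessarily how the fix was applied here.

```python
def sequence_sync_sql(table, id_column="id"):
    """Build a statement that moves a table's id sequence up to MAX(id).

    Hypothetical helper. Identifiers are interpolated into the SQL text,
    so anything that is not a plain identifier is rejected.
    """
    for name in (table, id_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    # The third setval() argument (is_called) is false only for an empty
    # table, so the next nextval() then returns 1 instead of 2.
    return (
        f"SELECT setval(pg_get_serial_sequence('{table}', '{id_column}'), "
        f"COALESCE(MAX({id_column}), 1), MAX({id_column}) IS NOT NULL) "
        f"FROM {table}"
    )


# Statements for the two tables named in the log.
STATEMENTS = [sequence_sync_sql(t) for t in ("price_history", "market_transactions")]
```

    Each statement would be executed once per table on an open connection before the recalibration inserts new rows.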

    2026-04-20 14:15 PT — Slot 60 (MiniMax)

    Task: CI recurring run — Update hypothesis scores from new debate rounds

    Actions:

  • Inspected system state:
    - 747 hypotheses, 308 Elo ratings for hypotheses, 304 debate sessions
    - Recent completed debates on 2026-04-18 (SDA-2026-04-16-gap-pubmed, quality=0.5-0.85)
    - 12 stale hypotheses found by ci_elo_recalibration --dry-run
    - All 12 stale hypotheses have delta < 0.001 (noise floor) — scores already current

  • Verified code is functional:
    - Ran dry-run: 12 stale, 0 updated, 12 skipped (delta < 0.001) — expected, correct
    - market_dynamics import: OK
    - API/search endpoint: 200 OK

  • No actionable updates — scores are well-calibrated from prior run (2026-04-20 13:30 PT, commit 443306e7f). All stale hypotheses have Elo adjustments below the noise floor.
  • Push blocked by GitHub auth: the token (redacted) is failing with "Password authentication is not supported." The remote URL uses HTTPS with the token, but the remote direct-origin uses SSH git@github.com:. The worktree branch is ahead of origin/main by 3 commits (ff2c69c95, c91946253, 443306e7f).

    Result: No-op. 12 stale hypotheses were found, but all Elo adjustments are below the 0.001 noise floor, so scores remain current from the prior run. Push blocked by GitHub authentication (token invalid for HTTPS push); the auth issue needs resolution at the infrastructure level.

    2026-04-20 09:00 PT — Slot 44 (Codex)

    Task: Review blocked merge attempt and harden the Exchange Elo recalibration driver before retry.

    Planned actions:

  • Inspect branch history and compare against the merge base so unrelated upstream changes are not carried into this task.
  • Preserve the substantive PostgreSQL compatibility fixes in scidex/exchange/ci_elo_recalibration.py.
  • Harden timestamp handling for PostgreSQL rows and legacy string rows before testing the driver.
  • Run the Elo recalibration in dry-run mode and verify the driver can identify whether score updates are warranted.

    Actions:

    • Added _as_aware_utc() to normalize PostgreSQL datetime rows and legacy ISO string rows before timestamp comparisons.
    • Updated _check_last_execution() to use the shared timestamp helper and always close its DB connection.
    • Updated stale-hypothesis selection to compare normalized timestamps rather than assuming .tzinfo exists.
    • Removed the rejected api.py figure fallback middleware in the working tree and restored unrelated local edits from the prior failed attempt (paper_processing_pipeline.py and unrelated specs) back to this branch's HEAD.
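
    The timestamp normalization described in these actions can be sketched as below. This is an illustrative reconstruction, not the project's _as_aware_utc(): the function names are hypothetical, and treating naive timestamps as UTC and treating a missing last_evidence_update as stale are both assumptions.

```python
from datetime import datetime, timedelta, timezone


def as_aware_utc(value):
    """Coerce a datetime or an ISO-8601 string to a tz-aware UTC datetime."""
    if value is None:
        return None
    if isinstance(value, str):
        text = value.strip()
        if not text:
            return None
        # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z".
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        value = datetime.fromisoformat(text)
    if value.tzinfo is None:
        # Assumption: naive values (legacy rows) are already in UTC.
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def is_stale(last_match, last_evidence_update):
    """True when the Elo match is newer than the last evidence update."""
    match_at = as_aware_utc(last_match)
    evidence_at = as_aware_utc(last_evidence_update)
    if match_at is None:
        return False  # no Elo match recorded, nothing to recalibrate
    return evidence_at is None or match_at > evidence_at


# Mixed inputs: an aware datetime at UTC-7 against a legacy ISO string.
utc_minus_7 = timezone(timedelta(hours=-7))
example = is_stale(
    datetime(2026, 4, 20, 13, 30, tzinfo=utc_minus_7),
    "2026-04-13T05:47:09.348822+00:00",
)
```

    Normalizing both sides before comparing avoids the TypeError Python raises when a naive datetime is compared with an aware one, and it removes any dependence on .tzinfo being present on the row value.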
    Tests:
    • python3 -m py_compile scidex/exchange/ci_elo_recalibration.py passed.
    • Unit checks for timestamp coercion, Elo adjustment, and score clamping passed.
    • python3 -m scidex.exchange.ci_elo_recalibration --dry-run could not reach PostgreSQL in this sandbox: psycopg.OperationalError: connection is bad; scidex status also reported services/database unavailable.
    Blocker:
    • Local git writes are blocked by sandbox permissions because this worktree's gitdir is under /home/ubuntu/scidex/.git/worktrees/..., outside the writable roots. git restore failed creating index.lock with Read-only file system, so rebase/stage/commit/push cannot be completed from this worker even though file edits inside the worktree are writable.
    • GitHub app fallback cannot publish the branch because the installed connector only has access to repositories under kganjam/*; SciDEX-AI/SciDEX returned 404.
    Result: Code-level hardening complete; live recalibration is blocked by local service/database availability rather than driver syntax or timestamp handling. Publishing is blocked by sandbox/gitdir permissions and unavailable GitHub app repository access.

    Payload JSON
    {
      "requirements": {
        "analysis": 5,
        "safety": 9
      },
      "_stall_skip_providers": [],
      "_stall_requeued_by": "codex",
      "_stall_requeued_at": "2026-04-11 03:23:03",
      "completion_shas": [
        "d6e4d69582e1f16c25a89bc2b1ddf5d6e39d0058",
        "c368b80d371327e98bdf552eeec059655714b3af",
        "6b7e0a09a36c9533741ed14f6c3210893e62208e",
        "8b261095c957c22f45ef729fcf3f90ca8a9cb1ce",
        "a9a6f4c38868cc573cf6f8e868f6bacc57eca289",
        "0e216535c287c5b688470bd3230158eeff20cbf1",
        "638dd1b139c7f5567cf5d756639a3d91c23d4aa6",
        "c151b0a2f48833aad22360cea1ac25312c2867a4",
        "024b048bed2ea49f37ef0ab3a1e5bf6bed4465d4",
        "1c01bf952c71491fdb9f2850c8236444aa56a75e",
        "276d122e479b9ca6fa237d2654eeb641d77b0ef1",
        "d08276b46267377147ec9f5529bad97cfd07d6d2",
        "6e06ea7b91064564bc9421c5db7f3251a3ae42c9",
        "797f6d245cdd2f1477161bfc6555f30d6f340787",
        "61337777aca5e61716126a9c7764f30685dc2a7b",
        "14aaff1a7c83892ccb43f259b59cbf31ecf9c649",
        "3e6798b48f6a6a36bf1df444cb73a54f3964339e",
        "86574aa5895201f9d634ecde4c1f3281c8125b58",
        "b1e831f93e33626910bda5085bfd7f25ddf26927",
        "3f6391e0786418991b35ec3d2d3f952939302413",
        "df50463ef976af011c79431fa17446e40e3d6c0e",
        "25173fab7606fefac46007da16c52a4e7c4ff3c2",
        "6886e80b175788a4b4e7224c7cf38da40d0d4daa",
        "652a193482a975d6261286adbaddb6cbdd86ba0e",
        "ce1ec141eeea6e7a6fe9a73f2a805ad033080a05"
      ],
      "completion_shas_checked_at": "2026-04-13T05:47:09.348822+00:00",
      "completion_shas_missing": [
        "2d2e5233aaefd0e498ae9c36d290ee29ecb109ef",
        "e4e43aebac98fe4eb5c1fd34f8b0a3ea695aeac8",
        "60a0a1775c2857346603f637f66be8134e18b4a3",
        "a7e011ac5c57bbed1a52484b577790d683aed67a",
        "6849f29c8ea1e343895fb3c3f719c4cd597d2f47",
        "5a04529ae05a8fe829b67208befc06307edbec41",
        "863690a39964adeb2b3658471f35190d634d0640"
      ],
      "_stall_skip_at": {},
      "_stall_skip_pruned_at": "2026-04-14T10:37:14.022390+00:00"
    }

    Sibling Tasks in Quest (Exchange) ↗