is_latest=1 only after it clears a
promotion_state so agents can diagnose + iterate
scidex_tools/model_eval_gate.py with the canonical gate runner:
def run_eval_gate(candidate_artifact_id: str) -> dict:
    """Run the benchmark suite declared for this model, compare to
    parent, decide promotion. Writes outcome to model_versions +
    emits world_model_improvements event.
    """
model_versions rows.
benchmark_id (from candidate; if null, falls back to
artifacts/models/{candidate_id}/eval.py and
artifacts/models/{parent_id}/eval.py on the same split (shared
(candidate_id, parent_id)) to
delta ≥ 0 and 95% CI excludes 0 → promote.
delta < 0 or CI includes 0 → reject, unless the
tradeoff_justification with an
"param_count": "-50%", "latency": "-4x") → promote with tradeoff.
model_versions.promotion_notes.
artifacts.is_latest=1, artifacts.lifecycle_state='active', model_versions.promotion_state='promoted'.
artifacts.is_latest=0, artifacts.lifecycle_state='superseded', artifacts.superseded_by=<candidate_id>, model_versions.promotion_state='superseded'.
world_model_improvements row with event_type='model_version_promoted', target_artifact_id=<candidate_id>, magnitude = delta magnitude bucket, detection_metadata containing the full metric table + CI.
artifact_lifecycle_history row on parent.
is_latest=0, lifecycle_state='candidate', model_versions.promotion_state='rejected', promotion_notes populated with metric table + CI + reason.
candidate-rejected ticket for the author
governance_artifacts/model_promotion_policy.md describing the gate,
run_eval_gate() executes on ≥1 candidate version (produced by WS3);
world_model_improvements event emitted, detail page (once UI lands)
lifecycle_state='candidate';
promotion_notes is populated; no orphan is_latest=1 flip.
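The promote and reject outcomes described above amount to a fixed set of column updates, which can be sketched as plain data. This is an illustration only: the function name `outcome_updates` and the nested-dict shape are hypothetical, and it assumes that the unprefixed `is_latest` and `lifecycle_state` in the reject case are `artifacts` columns, as they are in the promote case. The tradeoff outcome is left out because the text does not state its resulting column values.

```python
from typing import Any, Dict


def outcome_updates(decision: str, candidate_id: str,
                    notes: str = "") -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Return {row: {table: {column: value}}} for a gate decision (hypothetical helper)."""
    if decision == "promote":
        return {
            "candidate": {
                "artifacts": {"is_latest": 1, "lifecycle_state": "active"},
                "model_versions": {"promotion_state": "promoted"},
            },
            "parent": {
                "artifacts": {
                    "is_latest": 0,
                    "lifecycle_state": "superseded",
                    "superseded_by": candidate_id,
                },
                "model_versions": {"promotion_state": "superseded"},
            },
        }
    if decision == "reject":
        # The parent is untouched, so there is no orphan is_latest=1 flip.
        return {
            "candidate": {
                "artifacts": {"is_latest": 0, "lifecycle_state": "candidate"},
                "model_versions": {
                    "promotion_state": "rejected",
                    "promotion_notes": notes,  # metric table + CI + reason
                },
            },
            "parent": {},
        }
    raise ValueError(f"unsupported decision: {decision!r}")
```

Keeping the mapping as data makes it easy to check the invariant that at most one of candidate and parent ends up with `is_latest=1`.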
metadata.tradeoff_justification and negative primary
eval.py from the subtree on a
quest_quality_standards_spec.md: no one-sided eval; the
quest_senate_spec.md for the governance-policy filing
force_promote(). Human
@log_tool_call records every gate invocation with the candidate +
Task deemed COMPLETE. Implementation was merged to main in prior work.
Files verified on origin/main:
scidex_tools/model_eval_gate.py (31711 bytes) exists and implements:
- run_eval_gate(candidate_artifact_id) with @log_tool_call decorator (line 387)
- _bootstrap_delta_cis() with deterministic seed via _deterministic_seed() (lines 118-152)
- _run_eval_in_sandbox() using bwrap sandbox (line 157+)
- run_eval_gate_quick() for dry-run with 100 resamples (line 725)
governance_artifacts/model_promotion_policy.md (7327 bytes, 192 lines) exists and covers:
Success criteria check:
run_eval_gate() implemented, ready for WS3 candidate versions
_deterministic_seed(candidate_id, parent_id)
_has_valid_tradeoff()
from scidex_tools.model_eval_gate import run_eval_gate, run_eval_gate_quick: OK
Note: Cannot execute end-to-end gate (no WS3 candidate artifacts exist yet). Gate fails gracefully when benchmark_id or eval.py is not found.
quest_model_artifacts_spec.md
promotion_state column), WS2 (eval.py in subtree),
quest_senate_spec.md, quest_quality_standards_spec.md, quest_evolutionary_arenas_spec.md (arena Elo is analogous; this is
project_economics_v2_credit_backprop_2026-04-10: model_version_promoted events drive dividend payouts.
benchmarks table exists, world_model_improvements table exists
scidex_tools/model_eval_gate.py:
- run_eval_gate(candidate_artifact_id) with @log_tool_call decorator
- _bootstrap_delta_cis() for 95% CI via 1000-resample bootstrap with deterministic seed
- _run_eval_in_sandbox() using bwrap with minimal sandbox (ro /usr, bind model dir, tmpfs /tmp)
- _load_candidate_and_parent(): loads candidate/parent model_versions rows
- _resolve_benchmark(): benchmark_id from candidate or parent fallback
- _has_valid_tradeoff(): parses changelog/metadata for allowlisted tradeoff keys
- _ensure_universal_artifact(): registers model in universal_artifacts (FK integrity for artifact_lifecycle_history)
- run_eval_gate_quick() for CI pre-check (100 resamples, no DB write)
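The 1000-resample bootstrap with a seed derived from `(candidate_id, parent_id)` can be sketched as follows. This is a minimal stdlib illustration of the idea, not the actual `_bootstrap_delta_cis()` or `_deterministic_seed()`: the function names, the SHA-256 seed derivation, the paired per-example resampling, and the percentile method are all assumptions made for the example.

```python
import hashlib
import random
from statistics import mean
from typing import Sequence, Tuple


def deterministic_seed(candidate_id: str, parent_id: str) -> int:
    """Derive a stable seed from the two IDs (assumed scheme: SHA-256, first 8 bytes)."""
    digest = hashlib.sha256(f"{candidate_id}:{parent_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bootstrap_delta_ci(candidate_scores: Sequence[float],
                       parent_scores: Sequence[float],
                       seed: int,
                       n_resamples: int = 1000,
                       alpha: float = 0.05) -> Tuple[float, float, float]:
    """Return (delta, ci_low, ci_high) for mean(candidate - parent) on a shared split."""
    if not candidate_scores or len(candidate_scores) != len(parent_scores):
        raise ValueError("scores must be non-empty and aligned on the same split")

    rng = random.Random(seed)  # private RNG, so global random state is untouched
    diffs = [c - p for c, p in zip(candidate_scores, parent_scores)]
    n = len(diffs)

    # Resample per-example differences with replacement and record each mean.
    resampled = sorted(
        mean(diffs[rng.randrange(n)] for _ in range(n))
        for _ in range(n_resamples)
    )

    # Percentile interval, with indices clamped to the valid range.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled[lo_idx], resampled[hi_idx]
```

Because the seed depends only on the two IDs, re-running the gate on the same candidate and parent reproduces the same interval, which keeps a promotion decision auditable. A quick pre-check in the spirit of `run_eval_gate_quick()` would pass `n_resamples=100`.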
governance_artifacts/model_promotion_policy.md (192 lines, 8 sections):
promote_with_tradeout → promote_with_tradeoff
{
"_reset_note": "This task was reset after a database incident on 2026-04-17.\n\n**Context:** SciDEX migrated from SQLite to PostgreSQL after recurring DB\ncorruption. Some work done during Apr 16-17 may have been lost.\n\n**Before starting work:**\n1. Check if the task's goal is ALREADY satisfied (run the relevant checks)\n2. Check `git log --all --grep=task:YOUR_TASK_ID` for prior commits\n3. If complete, verify and mark done. If partial, continue. If not done, proceed.\n\n**DB change:** SciDEX now uses PostgreSQL. `get_db()` auto-detects via\nSCIDEX_DB_BACKEND=postgres env var.",
"_reset_at": "2026-04-18T06:29:22.046013+00:00",
"_reset_from_status": "done"
}