[Atlas] CI: Verify experiment extraction quality metrics and extract from new papers

blocked · analysis:6 · coding:7 · reasoning:6 · safety:6

Check experiment extraction quality: (1) count papers with structured experiments (goal: >500), (2) check field completeness ratio (goal: >80%), (3) check KG entity link precision via sampling, (4) run extraction pipeline on papers added since last run that have 0 experiment records, (5) generate summary report. See: experiment_extractor.py, ci_notebook_coverage.py pattern for CI style. Spec: docs/planning/specs/quest_experiment_extraction_spec.md

Completion Notes

Auto-release: recurring task had no work this cycle

Git Commits (2)

[Atlas] Improve experiment extraction CI targeting and worktree DB access [task:8f7dc2dc-829a-41e0-8824-c5c872cde977] (2026-04-09)
[Atlas] CI: experiment extraction quality check + incremental pipeline [task:8f7dc2dc-829a-41e0-8824-c5c872cde977] (2026-04-06)
Spec File

[Atlas] CI: Verify experiment extraction quality metrics and extract from new papers

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
> atlas — every principle is load-bearing. In particular:
> - LLMs for semantic judgment; rules for syntactic validation.
> - Gap-predicate driven, not calendar-driven.
> - Idempotent + version-stamped + observable.
> - No hardcoded entity lists, keyword lists, or canonical-name tables.
> - Three surfaces: FastAPI + orchestra + MCP.
> - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
> AG2 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
> Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
> docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
> BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

ID: 8f7dc2dc-829a-41e0-8824-c5c872cde977 · Priority: 92 · Frequency: weekly · Status: open

Goal

Check experiment extraction quality: (1) count papers with structured experiments (goal: >500), (2) check field completeness ratio (goal: >80%), (3) check KG entity link precision via sampling, (4) run extraction pipeline on papers added since last run that have 0 experiment records, (5) generate summary report. See: experiment_extractor.py, ci_notebook_coverage.py pattern for CI style. Spec: docs/planning/specs/quest_experiment_extraction_spec.md

This recurring check should explicitly suppress chaff. Candidate selection and
reporting should distinguish empirical studies from reviews/editorials, so the
system spends extraction effort on papers that can actually ground analyses
and mechanistic claims.
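
A minimal sketch of how these five steps could hang together, assuming a SQLite store; the table and column names (papers, experiments, entity_links, payload) are illustrative, not the schema ci_experiment_extraction.py actually uses:

```python
import json
import random
import sqlite3

GOALS = {"papers_with_experiments": 500, "field_completeness": 0.80, "kg_precision": 0.90}

def papers_with_experiments(conn: sqlite3.Connection) -> int:
    # (1) Count distinct papers that have at least one structured experiment.
    return conn.execute("SELECT COUNT(DISTINCT paper_id) FROM experiments").fetchone()[0]

def field_completeness(conn: sqlite3.Connection, fields: list[str]) -> float:
    # (2) Share of non-empty values across the tracked experiment fields.
    filled = total = 0
    for (payload,) in conn.execute("SELECT payload FROM experiments"):
        record = json.loads(payload)
        for field in fields:
            total += 1
            filled += bool(record.get(field))
    return filled / total if total else 0.0

def kg_link_sample(conn: sqlite3.Connection, n: int = 75, seed: int = 0) -> list:
    # (3) Draw a reproducible sample of KG entity links; precision is the
    # fraction judged correct by a reviewer or LLM outside this sketch.
    links = conn.execute("SELECT entity, kg_id FROM entity_links").fetchall()
    random.Random(seed).shuffle(links)
    return links[:n]

def unprocessed_paper_ids(conn: sqlite3.Connection, limit: int) -> list[str]:
    # (4) Papers with zero experiment records, candidates for the next batch.
    rows = conn.execute(
        "SELECT paper_id FROM papers WHERE paper_id NOT IN "
        "(SELECT DISTINCT paper_id FROM experiments) LIMIT ?",
        (limit,),
    ).fetchall()
    return [r[0] for r in rows]
```

Step (5), the summary report, would then render these numbers against GOALS into the dated markdown file the work log references.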

Acceptance Criteria

☑ Concrete deliverables created
☑ Work log updated with timestamped entry

Work Log

2026-04-06 — Initial CI run (task:8f7dc2dc-829a-41e0-8824-c5c872cde977)

Created ci_experiment_extraction.py following the ci_notebook_coverage.py pattern.

Metrics from this run:

| Check | Result | Goal | Status |
| --- | --- | --- | --- |
| Papers with experiments | 26 | ≥ 500 | ❌ Far below target |
| Field completeness | 95.1% avg | ≥ 80% | ✅ Meets target |
| KG entity precision | 94.7% (71/75 entities) | ≥ 90% | ✅ Meets target |

Notes:
  • 14,819 papers have abstracts but no extracted experiments.
  • Extraction batch of 5 papers ran but yielded 0 experiments — the top-cited unprocessed papers appear to be review articles (expected; extractor only extracts from empirical papers with clear methods/results).
  • Multi-entity string issue: 13 artifacts store multiple genes as "SQSTM1, CALCOCO2" in a single array element instead of separate elements. These still resolve correctly after comma-splitting; a normalization sketch follows these notes.
  • disease field completeness is 74.8% (below 80%) but all other fields are ≥ 82%, bringing the average to 95.1%.
  • Report saved to: docs/code_health/experiment_extraction_ci_2026-04-06.md
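
A minimal sketch of that comma-splitting normalization, assuming the multi-entity values arrive as plain Python lists; the function name is illustrative, not part of experiment_extractor.py:

```python
def split_multi_entity(values: list[str]) -> list[str]:
    # Expand elements like "SQSTM1, CALCOCO2" that pack several genes into
    # one array slot, preserving order and dropping blanks and duplicates.
    seen: list[str] = []
    for value in values:
        for part in (p.strip() for p in value.split(",")):
            if part and part not in seen:
                seen.append(part)
    return seen

assert split_multi_entity(["SQSTM1, CALCOCO2", "MAP1LC3B"]) == ["SQSTM1", "CALCOCO2", "MAP1LC3B"]
```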

2026-04-09 20:38 PDT — Slot 20

  • Re-ran the experiment extraction CI in a worktree after reviewing ci_experiment_extraction.py, experiment_extractor.py, and the quest spec.
  • Identified the main selection problem: the batch extractor was ordering unprocessed papers by citations only, which kept surfacing reviews and guidelines instead of empirical studies.
  • Added empirical-study ranking heuristics in experiment_extractor.py so candidate selection now boosts method/result-style abstracts and penalizes review-like titles and journals (sketched after this entry).
  • Extended ci_experiment_extraction.py reporting to show selected candidate papers and post-run coverage so the report reflects the actual state after extraction.
  • Found and fixed a worktree-specific DB bug in artifact_registry.py: its default DB_PATH was a relative SQLite path, so runs inside a worktree created and queried an empty local database, failing with `no such table: artifacts` during registration (path-resolution sketch after this entry).
  • Validated the fix with live extraction runs:
- ci_experiment_extraction.py --limit 2 registered 11 experiments across PMIDs 41268978 and 40693377
- ci_experiment_extraction.py --limit 1 registered 2 experiments for PMID 40446574
- ci_experiment_extraction.py --limit 1 registered 5 experiments for PMID 40020261
  • Final report: docs/code_health/experiment_extraction_ci_2026-04-10.md
  • Post-run metrics from the latest validation:
- Papers with extracted experiments: 30
- Field completeness: 95.5%
- KG entity precision: 94.7%
  • Result: recurring CI now selects experiment-like papers and successfully registers extracted experiments from an Orchestra worktree.
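
A sketch of the ranking idea behind the candidate-selection fix above; the cue patterns and weights are illustrative assumptions, not the actual heuristics in experiment_extractor.py:

```python
import re

# Illustrative cue patterns; the real heuristics may use different signals.
EMPIRICAL_CUES = re.compile(
    r"we (measured|performed|treated|quantified)|n\s*=\s*\d+|knockdown|western blot",
    re.IGNORECASE,
)
REVIEW_CUES = re.compile(
    r"\b(review|meta-analysis|perspective|editorial|guidelines?)\b", re.IGNORECASE
)

def empirical_score(title: str, journal: str, abstract: str, citations: int) -> float:
    # Start from citations, then reshape the ordering so empirical studies
    # outrank highly cited reviews instead of sorting by citations alone.
    score = float(citations)
    if EMPIRICAL_CUES.search(abstract):
        score *= 2.0   # boost method/result-style abstracts
    if REVIEW_CUES.search(title) or REVIEW_CUES.search(journal):
        score *= 0.1   # penalize review-like titles and journals
    return score
```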
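
And a sketch of the worktree-safe DB path fix; the environment-variable name and default location are assumptions, not artifact_registry.py's actual values:

```python
import os
from pathlib import Path

# Resolve the artifact DB relative to this module (or an explicit override)
# rather than the process working directory, so a run launched from a git
# worktree cannot silently create and query an empty local database.
DB_PATH = Path(
    os.environ.get("ARTIFACT_DB_PATH")
    or Path(__file__).resolve().parent / "data" / "artifacts.db"
)
```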

Payload JSON

```json
{
  "requirements": {
    "coding": 7,
    "reasoning": 6,
    "analysis": 6,
    "safety": 6
  }
}
```

Sibling Tasks in Quest (Experiment Extraction)