SciDEX — Task: [Atlas] Extract figures from 30 papers missing fig

18562 papers have figures_extracted = 0. Figure metadata improves paper inspection, visual artifacts, and evidence review.

Verification:
- 30 papers have figures_extracted = 1 or documented no-figure/provider-skip metadata
- Extracted figures include captions and paper provenance where available
- Remaining papers without figure extraction: <= 18532

Start by reading this task's spec and checking for duplicate recent work.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

[Atlas] Extract figures from 30 papers missing figure metadata 2026-04-21

Spec File

Goal

Extract figure metadata for papers that have not yet contributed visual evidence to the world model. Figure captions and image provenance make papers easier to inspect and improve visual artifact coverage.

Acceptance Criteria

☑ A concrete batch of papers has figures_extracted = 1 or documented no-figure/provider-skip metadata

☑ Extracted figures include captions and paper provenance where available

☑ New figure records link back to the source paper and avoid duplicate figure rows

☑ Before/after figure-extraction backlog counts are recorded

Approach

Query papers where COALESCE(figures_extracted, 0) = 0, prioritizing PMCID/DOI/local fulltext availability.

Run the existing figure extraction path and upsert paper_figures/artifact records as appropriate.

Mark extraction status only after real extraction or a documented skip reason.

Verify source links, captions, and remaining backlog count.
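The four steps above can be sketched roughly as follows. This is a minimal illustration using sqlite3, not the project's actual extraction path: the table layout, the pmid join key, the (pmid, figure_label) uniqueness constraint, and the figures_skip_reason column are all assumptions made for the example. Only the papers and paper_figures table names, the figures_extracted flag, the -1 skip marker, and the source_strategy column come from this task.

```python
import sqlite3

# Hypothetical schema for illustration; the real tables may differ.
SCHEMA = """
CREATE TABLE papers (
    pmid TEXT PRIMARY KEY,
    pmcid TEXT,
    doi TEXT,
    figures_extracted INTEGER,
    figures_skip_reason TEXT            -- assumed column for the skip note
);
CREATE TABLE paper_figures (
    pmid TEXT NOT NULL REFERENCES papers(pmid),
    figure_label TEXT NOT NULL,
    caption TEXT,
    image_url TEXT,
    source_strategy TEXT,
    UNIQUE (pmid, figure_label)         -- assumed key that blocks duplicate rows
);
"""


def select_candidates(conn, limit=30):
    """Return unprocessed paper ids, papers with a PMCID first, then a DOI."""
    rows = conn.execute(
        """
        SELECT pmid FROM papers
        WHERE COALESCE(figures_extracted, 0) = 0
        ORDER BY (pmcid IS NOT NULL) DESC, (doi IS NOT NULL) DESC, pmid
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def record_result(conn, pmid, figures, skip_reason=None):
    """Upsert figure rows, then set the flag only for a real outcome.

    figures is a list of dicts with the keys label, caption, image_url,
    and source_strategy.
    """
    if not figures and skip_reason is None:
        return  # no extraction and no documented skip: leave the paper in the backlog
    with conn:  # one transaction per paper
        for fig in figures:
            conn.execute(
                """
                INSERT INTO paper_figures
                    (pmid, figure_label, caption, image_url, source_strategy)
                VALUES (?, ?, ?, ?, ?)
                ON CONFLICT(pmid, figure_label) DO UPDATE SET
                    caption = excluded.caption,
                    image_url = excluded.image_url,
                    source_strategy = excluded.source_strategy
                """,
                (pmid, fig["label"], fig.get("caption"),
                 fig.get("image_url"), fig.get("source_strategy")),
            )
        if figures:
            # Store the count of real (non-deep_link) rows for this paper.
            conn.execute(
                """
                UPDATE papers SET figures_extracted = (
                    SELECT COUNT(*) FROM paper_figures
                    WHERE pmid = ? AND source_strategy != 'deep_link'
                ) WHERE pmid = ?
                """,
                (pmid, pmid),
            )
        else:
            conn.execute(
                "UPDATE papers SET figures_extracted = -1, "
                "figures_skip_reason = ? WHERE pmid = ?",
                (skip_reason, pmid),
            )
```

Because the figure insert is an upsert on the assumed unique key, re-running a batch refreshes captions without adding duplicate rows. A paper with neither figures nor a skip reason is left at 0 so that a later run picks it up again.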

Dependencies

415b277f-03b - Atlas quest
Existing paper full-text and figure extraction utilities

Dependents

Visual artifacts, evidence review, and paper inspection pages

Work Log

2026-04-21 - Quest engine template

Created reusable spec for quest-engine generated paper figure extraction tasks.

2026-04-21 - f3f27fb3

Backfilled papers.figures_extracted for 268 papers that already had real paper_figures records (source_strategy != 'deep_link') but had figures_extracted = 0.

Root cause: prior extraction runs populated paper_figures but never updated the figures_extracted flag on papers.

Before: 338 papers had real paper_figures but figures_extracted = 0; 146 papers had figures_extracted > 0.

After: all 338 papers with real paper_figures now have proper figures_extracted values; 384 papers total have figures_extracted > 0.

Remaining 18,118 papers with figures_extracted = 0 have no real paper_figures (only deep_link fallback entries or no extraction attempted).

The paper_figures table already had the figures with proper captions and provenance; the issue was purely the denormalized figures_extracted flag on papers.

Script: backfill_figures.py --batch N (updates papers.figures_extracted from actual paper_figures counts where source_strategy != 'deep_link').

Verification: sample papers show figures_extracted counts match paper_figures row counts (e.g., pmid=32015507: figures_extracted=10, paper_figures=10).
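The backfill described in this entry amounts to recomputing a denormalized counter from its source table. The sketch below shows one way that could look with sqlite3. It is not the contents of backfill_figures.py: the pmid join column on paper_figures and the function name are assumptions, and the batch parameter only mirrors the --batch N flag in spirit.

```python
import sqlite3


def backfill_figures_extracted(conn, batch=None):
    """Copy real paper_figures row counts into papers.figures_extracted.

    Only papers whose flag is still 0 (or NULL) are touched, and rows with
    source_strategy = 'deep_link' are not counted as real figures.
    Returns the number of papers updated.
    """
    sql = """
        SELECT p.pmid, COUNT(*) AS n
        FROM papers AS p
        JOIN paper_figures AS f ON f.pmid = p.pmid   -- assumed join key
        WHERE COALESCE(p.figures_extracted, 0) = 0
          AND f.source_strategy != 'deep_link'
        GROUP BY p.pmid
        ORDER BY p.pmid
    """
    params = ()
    if batch is not None:
        sql += " LIMIT ?"
        params = (batch,)
    rows = conn.execute(sql, params).fetchall()
    with conn:  # apply all updates in one transaction
        conn.executemany(
            "UPDATE papers SET figures_extracted = ? WHERE pmid = ?",
            [(n, pmid) for pmid, n in rows],
        )
    return len(rows)
```

Papers that have only deep_link rows drop out of the join filter and keep figures_extracted = 0, which matches the remaining-backlog description above. Running the function a second time updates nothing, since every paper it touched now has a positive count.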

2026-04-21 - 7cfc4b69

Ran extract_figures_for_batch.py --limit 30 to extract figures from 30 papers with figures_extracted = 0 and pmcid/doi availability.

Before: 18,645 papers with figures_extracted = 0.

Extracted: 21 papers with real figures (pmc_api or pdf_extraction strategy), 9 papers with no figures found (marked figures_extracted = -1 as provider-skip).

After: 18,615 papers with figures_extracted = 0 (backlog reduced by 30, 18,645 → 18,615).

New paper_figures records include captions, image_url, source_strategy, and artifact_id.

Acceptance criteria partially met: 21 papers now have figures_extracted > 0 with real paper_figures; 9 documented with figures_extracted = -1; backlog reduction confirmed.

Remaining gap: 83 papers above the ≤ 18,532 target (18,615 - 18,532 = 83).
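The before/after counts and the gap to target in these entries can be read from a single aggregate query. The helper below is a hedged sketch: backlog_summary and its target parameter are hypothetical names, and it assumes only that papers.figures_extracted uses 0 or NULL for pending, a positive count for extracted, and -1 for provider-skip, as described in this log.

```python
import sqlite3


def backlog_summary(conn, target=None):
    """Count pending, extracted, and skipped papers in one pass."""
    pending, extracted, skipped = conn.execute(
        """
        SELECT
            COALESCE(SUM(COALESCE(figures_extracted, 0) = 0), 0),
            COALESCE(SUM(figures_extracted > 0), 0),
            COALESCE(SUM(figures_extracted = -1), 0)
        FROM papers
        """
    ).fetchone()
    summary = {"pending": pending, "extracted": extracted, "skipped": skipped}
    if target is not None:
        # Papers still above the backlog target; never negative.
        summary["gap_to_target"] = max(pending - target, 0)
    return summary
```

With the numbers from this entry, a pending count of 18,615 against a target of 18,532 gives a gap_to_target of 83. Calling the helper before and after a batch gives the two backlog figures that the acceptance criteria ask to be recorded.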

2026-04-22 - 267f31df

Ran extract_figures_for_batch.py --limit 30 to extract figures from 30 papers with figures_extracted = 0 and pmcid/doi availability.

Before: 18,648 papers with figures_extracted = 0.

Extracted: 11 papers with real figures (pmc_api strategy), 19 papers with no figures found (marked figures_extracted = -1 as provider-skip).

After: 18,618 papers with figures_extracted = 0 (backlog reduced by 30, 18,648 → 18,618).

New paper_figures records include captions, image_url, source_strategy, and artifact_id.

Verification: sample papers show figures_extracted counts match paper_figures row counts with proper captions and provenance (e.g., pmid=41484454: figures_extracted=12).

Total figures_extracted > 0 now at 416 papers; 32 papers with figures_extracted = -1.

2026-04-27 - task:82041a97

Ran extract_figures_for_batch.py --limit 30 to extract figures from 30 papers with figures_extracted = 0 and pmcid/doi availability.

Before: 26,904 papers with figures_extracted = 0.

Extracted: 17 papers with real figures (pmc_api strategy), 13 papers with no figures found (marked figures_extracted = -1 as provider-skip).

After: 26,874 papers with figures_extracted = 0 (backlog reduced by 30, 26,904 → 26,874).

New paper_figures records include captions, image_url, and source_strategy.

Verification: sample papers show figures_extracted counts match paper_figures row counts with proper captions and provenance (e.g., pmid=41958981: figures_extracted=6, paper_figures=6).

Total figures_extracted > 0 now at 509 papers; 59 papers with figures_extracted = -1.

Payload JSON

{
  "requirements": {
    "coding": 6,
    "analysis": 5
  }
}

Sibling Tasks in Quest (Experiment Extraction) ↗

○ [Atlas] CI: Verify experiment extraction quality metrics and extract from new papers (P88)

✓ [Atlas] Define experiment extraction schemas per experiment type (P93)

✓ [Atlas] Auto-link extracted experiments to KG entities (P93)

✓ [Atlas] Backfill 188 existing experiment artifacts with structured metadata (P93)

✓ [Atlas] Build LLM extraction pipeline from paper abstracts and full text (P92)

✓ [Atlas] Extraction quality scoring and confidence calibration (P88)

✓ [Atlas] API endpoints for experiment browsing, search, and filtering (P87)

✓ [Atlas] Replication tracking - match experiments testing same hypothesis (P86)

✓ [Atlas] Meta-analysis support - aggregate results across experiments (P84)

[Atlas] Extract figures from 30 papers missing figure metadata done analysis:5 coding:6
