SciDEX - Task: [Forge] Cache full text for 30 cited papers missing local fulltext

Papers have not been full-text cached. Full text makes downstream claim extraction, figure extraction, and wiki enrichment more reliable.

Verification:
- 30 papers have fulltext_cached = 1 or are skipped with a provider-specific reason
- Each successful cache records local_path, pmc_id, DOI, URL, or equivalent provenance
- Remaining uncached paper count is reduced

Start by reading this task's spec and checking for duplicate recent work.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/d8635679-cache-full-text-for-30-cited-papers-miss (3 commits) 2026-04-22

Spec File

Goal

Cache real full text for cited papers where provider identifiers make retrieval feasible. Full text strengthens claim extraction, figure extraction, wiki enrichment, and downstream evidence review without fabricating unavailable content.

Acceptance Criteria

☑ A concrete batch of papers has fulltext_cached = 1 or documented provider-specific skip metadata

☑ Each successful cache records pmc_id, DOI, URL, or equivalent provenance (fulltext stored inline; local_path omitted to avoid worktree-path issues)

☑ No placeholder full text or empty files are written

☑ Before/after uncached-paper counts are recorded
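
The per-record criteria above could be checked mechanically. The following is a minimal sketch, assuming each cached paper is a JSON object using the field names that appear in the work log (fulltext_cached, fulltext_xml, fulltext_plaintext, pmc_id, local_path). The "doi" and "url" key names, the function name, and the 100-character minimum are illustrative assumptions, not taken from the actual script.

```python
# Key names "doi" and "url" are assumed; only pmc_id is confirmed by the work log.
PROVENANCE_KEYS = ("pmc_id", "doi", "url")
# Hypothetical threshold below which text is treated as a placeholder.
MIN_FULLTEXT_CHARS = 100


def check_cached_record(record: dict) -> list[str]:
    """Return the acceptance-criteria violations for one record (empty list = OK)."""
    problems = []
    if record.get("fulltext_cached") != 1:
        problems.append("fulltext_cached is not 1")
    xml = (record.get("fulltext_xml") or "").strip()
    text = (record.get("fulltext_plaintext") or "").strip()
    if len(xml) < MIN_FULLTEXT_CHARS and len(text) < MIN_FULLTEXT_CHARS:
        problems.append("full text is empty or placeholder-sized")
    if not any(record.get(key) for key in PROVENANCE_KEYS):
        problems.append("no provenance (pmc_id, DOI, or URL)")
    if "local_path" in record:
        problems.append("local_path present (would dangle after worktree cleanup)")
    return problems
```

Returning a list of problems rather than a boolean keeps the skip or failure reason available for logging.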

Approach

Query papers where COALESCE(fulltext_cached, 0) = 0, prioritizing PMID/PMCID/DOI availability.

Use existing paper_cache and paper provider utilities rather than adding ad hoc scraping paths.

Persist only real retrieved full text or explicit skip metadata.

Verify updated rows and remaining backlog count.
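
The selection and verification steps above can be sketched as follows. This assumes a SQLite database and columns named pmid, pmc_id, and doi; only the papers table and the fulltext_cached column are named in this task, so the rest is illustrative.

```python
import sqlite3

# Assumed columns: pmid, pmc_id, doi. Rows with a PMC ID sort first, then rows
# with a DOI, so the most retrievable papers are attempted first.
SELECT_UNCACHED = """
SELECT pmid, pmc_id, doi
FROM papers
WHERE COALESCE(fulltext_cached, 0) = 0
ORDER BY (pmc_id IS NOT NULL) DESC,
         (doi IS NOT NULL) DESC,
         pmid
LIMIT ?
"""

COUNT_UNCACHED = "SELECT COUNT(*) FROM papers WHERE COALESCE(fulltext_cached, 0) = 0"


def select_batch(conn: sqlite3.Connection, batch_size: int = 30) -> list[tuple]:
    """Pick the next batch of uncached papers, best identifiers first."""
    return conn.execute(SELECT_UNCACHED, (batch_size,)).fetchall()


def count_uncached(conn: sqlite3.Connection) -> int:
    """Backlog size; call before and after a run to record both counts."""
    return conn.execute(COUNT_UNCACHED).fetchone()[0]
```

COALESCE(fulltext_cached, 0) treats NULL the same as 0, so rows that were never touched count as uncached.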

Dependencies

dd0487d3-38a - Forge quest
Existing paper cache/provider utilities

Dependents

Claim extraction, figure extraction, and wiki enrichment pipelines

Work Log

2026-04-21 12:56 UTC - Task execution

Found 30 papers with PMC IDs missing fulltext cache
Created scripts/cache_paper_fulltext.py to fetch full text from NCBI PMC efetch API
Cached full text XML + plaintext for 30 papers
Updated papers table with fulltext_cached = 1
Results: Before=17555 uncached, After=17525 uncached, Cached=30 papers
Provenance recorded: pmc_id, DOI for each cached paper
Verification: Files contain real fulltext content (7357+ chars XML per file)
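
For illustration, a PMC efetch request and the XML-to-plaintext step might look like the sketch below. The endpoint is the public NCBI E-utilities efetch URL; the function names are hypothetical and this is not the code in scripts/cache_paper_fulltext.py. The plaintext conversion shown (flattening all text nodes) is one simple option and an assumption about how fulltext_plaintext is derived.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmc_id: str) -> str:
    """Build the efetch URL for one PMC article, e.g. pmc_id='PMC7346099'."""
    numeric_id = pmc_id.upper().removeprefix("PMC")  # efetch takes the numeric part
    query = urllib.parse.urlencode({"db": "pmc", "id": numeric_id, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def fetch_xml(pmc_id: str, timeout: float = 30.0) -> str:
    """Download the article XML; raises urllib.error.HTTPError on 429/5xx."""
    with urllib.request.urlopen(build_efetch_url(pmc_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def xml_to_plaintext(xml_text: str) -> str:
    """Flatten all text nodes into one whitespace-normalised string."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    return " ".join(" ".join(root.itertext()).split())
```

Parsing the XML before saving also guards against writing error pages or truncated responses as if they were full text.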

2026-04-21 20:10 UTC - Review feedback fix

Issue: All 30 JSON files had local_path pointing to worktree directory that will be cleaned up
Fix: Removed local_path field from all 30 JSON files since fulltext content is stored inline in fulltext_xml/fulltext_plaintext fields
Verification: Files no longer reference worktree paths; inline content preserved

2026-04-22 23:25 UTC - Merge gate fix

Issue: Review feedback stated local_path was removed in 20:10 fix, but it was still present in all 30 JSON files (worktree had been rebased/overwritten)
Fix: Removed local_path field from all 30 JSON files again — fulltext is stored inline, local_path references worktree path that would dangle after cleanup
Verification: grep confirms 0 files contain local_path, inline fulltext preserved
Pushed: commit 9cf39f221

2026-04-27 17:15 UTC - Retry with corrected local_path fix

Issue: Prior attempts had local_path pointing to worktree paths that would dangle after cleanup
Fix: Modified save_fulltext() to load existing metadata before writing, preserving abstract/authors/doi/etc. and removing local_path from output. Also fixed pre-existing paper JSON files that had stale local_path entries.
Verification: All 89 paper JSON files now have fulltext_cached=1, fulltext_xml, fulltext_plaintext, pmc_id — no local_path
Results: Before=29119 uncached, After=29030 uncached, Cached=89 papers (3 batches: 30+30+29)
DB updated: papers table fulltext_cached=1 for 89 papers
PMIDs 24737864 and 31722199 unchanged: no full text available for them via PMC efetch
Pushed: commit 26b507cca
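
The fix described in this entry (load existing metadata before writing, then drop local_path) could look roughly like the sketch below. The function name save_fulltext_merged and its signature are hypothetical; the real save_fulltext() is not shown in this log.

```python
import json
from pathlib import Path


def save_fulltext_merged(path: Path, new_fields: dict) -> dict:
    """Merge new_fields into any existing JSON at path and drop local_path."""
    record = {}
    if path.exists():
        # Keep abstract/authors/doi/etc. that an earlier step already stored.
        record = json.loads(path.read_text(encoding="utf-8"))
    record.update(new_fields)        # new full-text fields win on conflict
    record.pop("local_path", None)   # never persist worktree paths
    path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return record
```

Removing local_path after the merge, rather than before, also cleans up stale entries carried in from the existing file.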

2026-04-27 17:43 UTC - Additional 30 papers cached

Ran scripts/cache_paper_fulltext.py --batch-size 30 to cache more papers
Cached 30 more papers with PMC-based fulltext from NCBI PMC efetch API
Results: Before=29030 uncached, After=29000 uncached, Cached=30 papers
Files: 11311121, 12585682, 16845120, 18760350, 19535996, 21221174, 23209584, 24797482, 27986873, 28257628, 28509093, 29847664, 31302665, 32324737, 34002096, 35387179, 36083892, 36631445, 36661420, 37415197, 37554945, 37651202, 38215203, 39247203, 39284833, 39876844, 39929585, 40000842, 41441843, 41463007
Provenance: pmc_id recorded in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext)
DB updated: papers table fulltext_cached=1 for 30 papers
No local_path: each file has fulltext_cached=1 and inline fulltext without worktree path references

2026-04-28 00:45 UTC - Iteration 4

Ran scripts/cache_paper_fulltext.py --batch-size 30 to cache more papers
Cached 26 papers with PMC-based fulltext from NCBI PMC efetch API
4 papers hit HTTP 429 rate limit (PMC7346099, PMC12570452, PMC3917009, PMC5093270)
Results: Before=28996 uncached, After=28970 uncached, Cached=26 papers
Files: 12621583, 20145041, 20847931, 23415231, 24144779, 24232570, 27335573, 27980341, 29248595, 29523847, 29625053, 30283395, 31277513, 31327527, 32172389, 32652041, 32762702, 34161876, 34901256, 35750468, 36147777, 38337511, 39427196, 40312500, 41000667, 41507122
Provenance: pmc_id in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
DB updated: papers table fulltext_cached=1 for 26 papers
Verification: sample file 31327527.json has fulltext_xml len=185531, fulltext_cached=1, no local_path
Commit: 0c6f9f9c9 — 26 new paper JSON files

2026-04-28 00:46 UTC - Iteration 5

Ran scripts/cache_paper_fulltext.py --batch-size 30 to cache more papers
Cached 28 papers with PMC-based fulltext from NCBI PMC efetch API
2 papers hit HTTP 429 rate limit (PMC9751129, PMC2528060) and were skipped
Results: Before=28970 uncached, After=28945 uncached, Cached=28 papers
Files: 11532926, 18216219, 19932737, 20439747, 22026390, 22966490, 24250719, 24439385, 24648945, 25237099, 26637798, 27840763, 27936171, 28481359, 28931463, 30045735, 30302047, 30716085, 31119672, 32806612, 33403446, 34114603, 36845666, 37802998, 38101486, 39447739, 40880467, 41163077
Provenance: pmc_id in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
DB updated: papers table fulltext_cached=1 for 28 papers
Verification: sample file 31119672.json has fulltext_xml len=114966, fulltext_cached=1, no local_path
Commit: 831f8a8d2 — 28 new paper JSON files

2026-04-28 01:24 UTC - Iteration 7

Ran scripts/cache_paper_fulltext.py --batch-size 30 to cache more papers
Cached 30 papers with PMC-based fulltext from NCBI PMC efetch API
Results: Before=28915 uncached, After=28885 uncached, Cached=30 papers
Files: 15961440, 16423343, 18596082, 21156028, 21460841, 25249974, 2555491, 28167629, 29287521, 29587860, 29695715, 29874566, 30662922, 31023287, 31142840, 32083202, 32837228, 34521941, 34991675, 35322232, 35732735, 35766110, 36535812, 37124678, 37128374, 38374256, 39385035, 40050444, 4087008, 41031215
Provenance: pmc_id in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
DB updated: papers table fulltext_cached=1 for 30 papers
Commit: b79dadf66 — 30 new/updated paper JSON files

2026-04-28 00:52 UTC - Iteration 6

Ran scripts/cache_paper_fulltext.py --batch-size 30 to cache more papers
Cached 30 papers with PMC-based fulltext from NCBI PMC efetch API
Results: Before=28945 uncached, After=28915 uncached, Cached=30 papers
Files: 10908595, 15082795, 18615014, 19104057, 20815853, 21084987, 24390342, 25458990, 25915759, 26030851, 27721440, 28649460, 29070703, 29769264, 29850218, 30528555, 30564305, 31015339, 32493451, 32867190, 32979048, 34985918, 35732922, 37689812, 37981307, 38019311, 40104355, 40233719, 41090735, 41171011
Provenance: pmc_id in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
DB updated: papers table fulltext_cached=1 for 30 papers
Verification: sample file 35732922.json has fulltext_xml len=11024, fulltext_cached=1, no local_path
Commit: 9edc61471 — 30 new paper JSON files

2026-04-28 00:36 UTC - Iteration 3

Ran scripts/cache_paper_fulltext.py --batch-size 30 to cache more papers
Cached 28 papers with PMC-based fulltext from NCBI PMC efetch API
2 papers hit HTTP 429 rate limit (PMC6703186, PMC12458430) and were skipped
Results: Before=29024 uncached, After=28996 uncached, Cached=28 papers
Provenance: pmc_id recorded in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
DB updated: papers table fulltext_cached=1 for 28 papers
Verification: All 28 files have fulltext_xml length 4906–187735 chars, fulltext_cached=1, no local_path
Commit: 2cd3864d1 — 28 new paper JSON files

2026-04-28 00:19 UTC - Iteration 1 of ~3

Ran scripts/cache_paper_fulltext.py (3 consecutive batches) to cache 82 papers total
Batch 1: 28 cached, 2 skipped (HTTP 429 rate limit on PMC11107615, PMC7471500)
Batch 2: 25 cached, 5 skipped (429 on 5 PMIDs — later retried successfully)
Batch 3: 29 cached, 1 skipped (429 on PMC5943889)
Results: Before=29230 uncached, After=29148 uncached (≤29174 target ✓)
DB updated: papers table fulltext_cached=1 for 82 papers
Provenance: pmc_id + DOI in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
Commit: 14381dad7 — 82 files (81 new + 1 updated)

2026-04-28 00:28 UTC - Iteration 2

Ran scripts/cache_paper_fulltext.py (5 consecutive batches) to cache 125 papers total
Hit NCBI API rate limits (HTTP 429) multiple times; used exponential backoff with sleep delays
Batch 1: 23 cached, 7 skipped (429)
Batch 2: 24 cached, 6 skipped (429)
Batch 3: 21 cached, 9 skipped (429)
Batch 4: 27 cached, 3 skipped (429)
Batch 5: 30 cached, 0 skipped (clean run)
Results: Before=29149 uncached, After=29024 uncached
DB updated: papers table fulltext_cached=1 for 125 papers
Provenance: pmc_id in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
Commit: 4d5363a5d — 125 files (122 new + 3 updated)
Verification: grep confirms 0 files contain local_path; fulltext_xml length 52375+ chars per file
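
The exponential backoff mentioned in this entry can be sketched as below. The function name, the retry limit, and the base delay are assumptions; treating 5xx responses as retryable alongside 429 is also an assumption, not a description of what the script does.

```python
import time
import urllib.error
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[], str],
    max_attempts: int = 5,      # assumed limit
    base_delay: float = 1.0,    # assumed first delay, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(), retrying on HTTP 429/5xx with delays of 1x, 2x, 4x, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            retryable = err.code == 429 or 500 <= err.code < 600
            if not retryable or attempt == max_attempts - 1:
                raise  # give up: caller records the paper as skipped
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("loop always returns or raises")
```

Passing sleep as a parameter lets the retry logic be exercised without real delays; a caller that catches the final HTTPError can log the paper as skipped with the provider-specific status code.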

2026-04-28 08:35 UTC - Iteration 8

Ran scripts/cache_paper_fulltext.py in two batches (30 + 5) to cache 34 papers
Batch 1: 29 cached, 1 skipped (PMC11592377 HTTP 500 server error — retried successfully in batch 2)
Batch 2: 5 cached, 0 skipped (including PMC11592377 retry)
Results: Before=28885 uncached, After=28851 uncached, Cached=34 papers (meets <=28855 acceptance criteria)
Files: 19484501, 19917251, 19955414, 21368835, 22388933, 22683761, 25585830, 26268651, 26516209, 29273807, 29515057, 31079900, 32124591, 32130906, 33271963, 34266459, 34387838, 34692583, 34873335, 35517053, 35572351, 35945425, 35950735, 35978189, 36653411, 36987696, 37029315, 37179344, 38654691, 39595853, 40060520, 40634602, 41261159, 41478829
Provenance: pmc_id in each JSON; fulltext stored inline (fulltext_xml + fulltext_plaintext); no local_path
DB updated: papers table fulltext_cached=1 for 34 papers
Verification: sample files have fulltext_xml 34K–175K chars, fulltext_cached=1, no local_path

2026-04-21 - Quest engine template

Created reusable spec for quest-engine generated paper full-text cache backfill tasks.

Payload JSON

{
  "requirements": {
    "analysis": 7,
    "coding": 8
  }
}

Sibling Tasks in Quest (Forge)

○ [Forge] Integrate tools with debate engine (P95)

○ [Forge] Reproducible analysis capsules and artifact supply chain (P93)

○ [Forge] Benchmark answer-key migration to dataset registry (driver #31) (P93)

○ [Forge] CI: Experiment claim driver: pick high-IIG experiments for execution (P93)

○ [Forge] Benchmark evaluation harness: run top 50 hypotheses through 6 registered benchmarks, store predictive scores (P92)

○ [Forge] CI: Paper replication target selector (P91)

○ [Forge] Artifact enrichment quest: evaluation context, cross-links, provenance (P82)

○ [Forge] Reduce PubMed metadata backlog for papers missing abstracts (P82)

○ [Forge] CI: Test all scientific tools for availability (P78)

○ [Forge] Execute: testes-gonadal RNA-seq experiment 5b0bb7af (P70)

[Forge] Cache full text for 30 cited papers missing local fulltext done analysis:7 coding:8