A total of 28885 papers do not yet have cached full text. Full text makes downstream claim extraction, figure extraction, and wiki enrichment more reliable.
## Acceptance criteria (recommended — see 'Broader latitude' below)
- 30 papers have fulltext_cached = 1 or are skipped with a provider-specific reason
- Each successful cache records local_path, pmc_id, DOI, URL, or equivalent provenance
- Remaining uncached paper count is <= 28855
## Before starting
1. Read this task's spec file and check for duplicate recent work.
2. Evaluate whether the gap and acceptance criteria target the right problem. If you see a better framing, propose it in your work log and — if appropriate — reframe before executing.
3. Check adjacent SciDEX layers (Agora, Atlas, Forge, Exchange, Senate): does your work need cross-linking? Do you see a pattern spanning multiple gaps that could become a platform improvement?
## Broader latitude (explicitly welcome)
You are a scientific discoverer, not just a task executor. Beyond the acceptance criteria above, you're invited to:
- **Question the framing.** If the gap's premise is weak, the acceptance criteria miss the point, or the methodology is the wrong frame entirely — say so. Propose a reframe with justification.
- **Propose structural improvements.** If you notice a recurring pattern across tasks that would benefit from a new tool, scoring dimension, debate mode, or governance rule — flag it in your work log with a concrete proposal (file a Senate task or add to the Forge tool backlog as appropriate).
- **Propose algorithmic improvements.** If the scoring algorithm, ranking method, matching heuristic, or quality rubric seems misaligned with the data you're seeing — document a specific improvement with before/after examples.
- **Strengthen artifacts beyond the minimum.** Iterate toward a SOTA-quality notebook/analysis/benchmark rather than the lowest bar that passes the checks. Fewer high-quality artifacts beat many shallow ones.
Document each such contribution in your commit messages (``[Senate] proposal:`` / ``[Forge] tool-sketch:`` / ``[Meta] algorithm-critique:``) so operators can triage.
Git Commits (2)
[Forge] Cache full text for 34 more papers via PMC efetch [task:31105c0c-7e9c-40e3-a9fe-87067a56c639] (#1180) 2026-04-28
[Forge] Cache full text for 34 more papers via PMC efetch [task:31105c0c-7e9c-40e3-a9fe-87067a56c639] (#1180) 2026-04-28
Spec File
Goal
Cache real full text for cited papers where provider identifiers make retrieval feasible. Full text strengthens claim extraction, figure extraction, wiki enrichment, and downstream evidence review without fabricating unavailable content.
Acceptance Criteria
☑ A concrete batch of papers has fulltext_cached = 1 or documented provider-specific skip metadata
☑ Each successful cache records pmc_id, DOI, URL, or equivalent provenance (fulltext stored inline; local_path omitted to avoid worktree-path issues)
☑ No placeholder full text or empty files are written
☑ Before/after uncached-paper counts are recorded
Approach
Query papers where COALESCE(fulltext_cached, 0) = 0, prioritizing PMID/PMCID/DOI availability.
Use existing paper_cache and paper provider utilities rather than adding ad hoc scraping paths.
Persist only real retrieved full text or explicit skip metadata.
Verify updated rows and remaining backlog count.
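The selection and verification steps above can be sketched as a pair of queries. This is a minimal illustration, not the project's actual code: the table name `papers`, the `paper_id` and `pmid` columns, and the helper names `select_batch` and `count_uncached` are assumptions; only the `COALESCE(fulltext_cached, 0) = 0` filter comes from the spec.

```python
import sqlite3

# Hypothetical schema: `papers`, `paper_id`, and `pmid` are assumed names.
# Only the COALESCE(fulltext_cached, 0) = 0 filter is taken from the spec.
SELECT_BATCH = """
SELECT paper_id, pmc_id, pmid, doi
FROM papers
WHERE COALESCE(fulltext_cached, 0) = 0
  AND (pmc_id IS NOT NULL OR pmid IS NOT NULL OR doi IS NOT NULL)
ORDER BY (pmc_id IS NULL), (pmid IS NULL), (doi IS NULL), paper_id
LIMIT ?
"""

COUNT_UNCACHED = (
    "SELECT COUNT(*) FROM papers WHERE COALESCE(fulltext_cached, 0) = 0"
)


def select_batch(conn: sqlite3.Connection, limit: int = 30) -> list:
    """Return uncached papers that carry at least one identifier.

    Rows with a PMC ID sort first, then PMID-only rows, then DOI-only rows.
    """
    return conn.execute(SELECT_BATCH, (limit,)).fetchall()


def count_uncached(conn: sqlite3.Connection) -> int:
    """Count the remaining backlog; call before and after a caching run."""
    return conn.execute(COUNT_UNCACHED).fetchone()[0]
```

Calling `count_uncached` before and after a run gives the before/after figures the acceptance criteria ask for, and treating NULL as 0 keeps never-touched rows in the backlog.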
Dependencies
dd0487d3-38a - Forge quest
Existing paper cache/provider utilities
Dependents
Claim extraction, figure extraction, and wiki enrichment pipelines
Work Log
2026-04-21 12:56 UTC - Task execution
Found 30 papers that have PMC IDs but no cached full text
Created scripts/cache_paper_fulltext.py to fetch full text from NCBI PMC efetch API
Provenance recorded: pmc_id, DOI for each cached paper
Verification: Files contain real full-text content (at least 7357 characters of XML per file)
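As a rough illustration of the fetch and verification steps in this entry, the sketch below builds an NCBI E-utilities efetch URL for a PMC ID and applies a simple content check before anything is persisted. It is not the contents of scripts/cache_paper_fulltext.py: the function names, the `min_chars` threshold, and the `<body` heuristic are all assumptions made for illustration, and no network request is sent.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities efetch endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_pmc_efetch_url(pmc_id: str) -> str:
    """Build an efetch URL for one PMC article (no request is sent here)."""
    numeric_id = pmc_id.strip()
    # Accept both "PMC1234567" and "1234567"; pass only the numeric part.
    if numeric_id.upper().startswith("PMC"):
        numeric_id = numeric_id[3:]
    return f"{EFETCH_BASE}?{urlencode({'db': 'pmc', 'id': numeric_id})}"


def looks_like_real_fulltext(xml_text: str, min_chars: int = 1000) -> bool:
    """Heuristic check (an assumption, not the project's rule).

    Rejects empty or very short responses and responses without a <body>
    element, which would indicate metadata-only XML.
    """
    text = xml_text.strip()
    return len(text) >= min_chars and "<body" in text
```

A caller would fetch the URL with whatever HTTP utility the existing paper provider code already uses, then write the result only if `looks_like_real_fulltext` returns True, recording a skip reason otherwise.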
2026-04-21 20:10 UTC - Review feedback fix
Issue: All 30 JSON files had local_path pointing to a worktree directory that will be cleaned up
Fix: Removed local_path field from all 30 JSON files since fulltext content is stored inline in fulltext_xml/fulltext_plaintext fields
Verification: Files no longer reference worktree paths; inline content preserved
2026-04-22 23:25 UTC - Merge gate fix
Issue: Review feedback stated local_path was removed in the 20:10 fix, but it was still present in all 30 JSON files (the worktree had been rebased/overwritten)
Fix: Removed the local_path field from all 30 JSON files again. Full text is stored inline, and local_path references a worktree path that would dangle after cleanup
2026-04-27 17:15 UTC - Retry with corrected local_path fix
Issue: Prior attempts had local_path pointing to worktree paths that would dangle after cleanup
Fix: Modified save_fulltext() to load existing metadata before writing, preserving abstract/authors/doi/etc. and removing local_path from output. Also fixed pre-existing paper JSON files that had stale local_path entries.
Verification: All 89 paper JSON files now have fulltext_cached=1, fulltext_xml, fulltext_plaintext, pmc_id — no local_path
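The load-then-merge behaviour described in the 2026-04-27 fix might look roughly like the sketch below. The real `save_fulltext()` signature is not shown in this log, so the parameters here are hypothetical; the field names `fulltext_cached`, `fulltext_xml`, `fulltext_plaintext`, `pmc_id`, and `local_path` are taken from the entries above.

```python
import json
from pathlib import Path


def save_fulltext(path, fulltext_xml, fulltext_plaintext, pmc_id):
    """Merge full text into an existing paper JSON file.

    Hypothetical signature. Existing metadata (abstract, authors, doi, ...)
    is preserved, and any stale local_path entry is dropped.
    """
    if not fulltext_xml.strip() or not fulltext_plaintext.strip():
        # Never write placeholder or empty full text.
        raise ValueError("refusing to cache empty full text")

    path = Path(path)
    record = {}
    if path.exists():
        # Load what is already on disk so the write does not clobber it.
        record = json.loads(path.read_text(encoding="utf-8"))

    # Worktree paths dangle after cleanup; the content is stored inline.
    record.pop("local_path", None)
    record.update(
        {
            "fulltext_cached": 1,
            "fulltext_xml": fulltext_xml,
            "fulltext_plaintext": fulltext_plaintext,
            "pmc_id": pmc_id,
        }
    )
    path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return record
```

Reading the existing file first means a rerun on an already-populated record is idempotent, and removing `local_path` in the writer rather than in a one-off cleanup keeps the field from reappearing after a rebase.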