[Forge] Cache full text for 30 cited papers missing local fulltext
done · analysis:5 · coding:6

17553 papers have not been full-text cached. Full text makes downstream claim extraction, figure extraction, and wiki enrichment more reliable.

Verification:

  • 30 papers have fulltext_cached = 1 or are skipped with a provider-specific reason
  • Each successful cache records local_path, pmc_id, DOI, URL, or equivalent provenance
  • Remaining uncached paper count is <= 17523

Start by reading this task's spec and checking for duplicate recent work.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (4)

[Forge] Strip local_path from cached paper JSON to avoid worktree path references [task:37ae932d-04e8-4b56-b49e-01b38dc9e168] 2026-04-21
[Forge] Cache full text for 30 cited papers via PMC efetch [task:37ae932d-04e8-4b56-b49e-01b38dc9e168] 2026-04-21
[Forge] Strip local_path from cached paper JSON to avoid worktree path references [task:37ae932d-04e8-4b56-b49e-01b38dc9e168] 2026-04-21
[Forge] Cache full text for 30 cited papers via PMC efetch [task:37ae932d-04e8-4b56-b49e-01b38dc9e168] 2026-04-21

Spec File

Goal

Cache real full text for cited papers where provider identifiers make retrieval feasible. Full text strengthens claim extraction, figure extraction, wiki enrichment, and downstream evidence review without fabricating unavailable content.

Acceptance Criteria

☑ A concrete batch of papers has fulltext_cached = 1 or documented provider-specific skip metadata
☑ Each successful cache records pmc_id, DOI, URL, or equivalent provenance (fulltext stored inline; local_path omitted to avoid worktree-path issues)
☑ No placeholder full text or empty files are written
☑ Before/after uncached-paper counts are recorded

Approach

  • Query papers where COALESCE(fulltext_cached, 0) = 0, prioritizing PMID/PMCID/DOI availability.
  • Use existing paper_cache and paper provider utilities rather than adding ad hoc scraping paths.
  • Persist only real retrieved full text or explicit skip metadata.
  • Verify updated rows and remaining backlog count.
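The selection and counting steps above can be sketched as follows. This is a minimal illustration against an in-memory SQLite database, not the project's actual query: only `papers`, `fulltext_cached`, `pmc_id`, and DOI are named in this task, so the `paper_id` and `pmid` columns, the exact priority order, and the helper names are assumptions.

```python
import sqlite3

# Hypothetical schema: paper_id and pmid are assumed column names.
# Rows with a PMC ID sort first, then PMID, then DOI.
SELECT_BATCH = """
SELECT paper_id, pmid, pmc_id, doi
FROM papers
WHERE COALESCE(fulltext_cached, 0) = 0
ORDER BY (pmc_id IS NULL), (pmid IS NULL), (doi IS NULL), paper_id
LIMIT ?
"""

COUNT_UNCACHED = "SELECT COUNT(*) FROM papers WHERE COALESCE(fulltext_cached, 0) = 0"


def uncached_count(conn):
    """Return the backlog size; NULL and 0 both count as uncached."""
    return conn.execute(COUNT_UNCACHED).fetchone()[0]


def select_batch(conn, limit=30):
    """Return up to `limit` uncached papers, best identifiers first."""
    return conn.execute(SELECT_BATCH, (limit,)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, pmid TEXT, "
        "pmc_id TEXT, doi TEXT, fulltext_cached INTEGER)"
    )
    # Illustrative rows only; identifiers are made up.
    conn.executemany(
        "INSERT INTO papers VALUES (?, ?, ?, ?, ?)",
        [
            (1, None, None, "10.1000/example-a", None),
            (2, "111", "PMC0000001", None, 0),
            (3, "222", None, None, None),
            (4, "333", "PMC0000002", "10.1000/example-b", 1),
        ],
    )
    before = uncached_count(conn)
    ids = [row[0] for row in select_batch(conn, limit=2)]
    # Stand-in for a successful cache: mark the selected rows as cached.
    conn.executemany(
        "UPDATE papers SET fulltext_cached = 1 WHERE paper_id = ?",
        [(i,) for i in ids],
    )
    after = uncached_count(conn)
    return before, ids, after
```

Recording `before` and `after` around the update gives the before/after backlog counts that the acceptance criteria ask for.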
Dependencies

    • dd0487d3-38a - Forge quest
    • Existing paper cache/provider utilities

    Dependents

    • Claim extraction, figure extraction, and wiki enrichment pipelines

    Work Log

    2026-04-21 12:56 UTC - Task execution

    • Found 30 papers with PMC IDs missing fulltext cache
    • Created scripts/cache_paper_fulltext.py to fetch full text from NCBI PMC efetch API
    • Cached full text XML + plaintext for 30 papers
    • Updated papers table with fulltext_cached = 1
    • Results: Before=17555 uncached, After=17525 uncached, Cached=30 papers
    • Provenance recorded: pmc_id, DOI for each cached paper
    • Verification: Files contain real full-text content (at least 7357 characters of XML per file)
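A rough sketch of the fetch-and-validate step described in this entry. The contents of scripts/cache_paper_fulltext.py are not shown here, so the function names and the `min_chars` threshold are hypothetical. The endpoint is NCBI's public E-utilities efetch URL with `db=pmc`. The record keys `fulltext_xml` and `fulltext_plaintext` follow the field names mentioned in the review notes. The parser assumes the response contains a JATS-style `<body>` element, which may be absent when a publisher does not permit full-text download.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmc_id):
    """Build an efetch URL for one PMC article, dropping the 'PMC' prefix."""
    numeric_id = pmc_id[3:] if pmc_id.upper().startswith("PMC") else pmc_id
    query = urllib.parse.urlencode({"db": "pmc", "id": numeric_id, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def fetch_xml(pmc_id, timeout=30):
    """Download the article XML (network call; not exercised in the demo)."""
    with urllib.request.urlopen(build_efetch_url(pmc_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def xml_to_plaintext(xml_text):
    """Return whitespace-normalized body text, or '' if there is no <body>."""
    root = ET.fromstring(xml_text)
    body = root.find(".//body")
    if body is None:
        return ""
    return " ".join(" ".join(body.itertext()).split())


def build_record(pmc_id, doi, xml_text, min_chars=1):
    """Return a cache record, or None so that no placeholder text is written."""
    plaintext = xml_to_plaintext(xml_text)
    if len(plaintext) < min_chars:
        return None  # caller should store skip metadata instead
    return {
        "pmc_id": pmc_id,
        "doi": doi,
        "fulltext_xml": xml_text,
        "fulltext_plaintext": plaintext,
    }
```

Returning `None` for articles without body text keeps the "no placeholder full text or empty files" criterion enforceable at a single point; the caller can then record a provider-specific skip reason.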

    2026-04-21 20:10 UTC - Review feedback fix

    • Issue: All 30 JSON files had a local_path field pointing to a worktree directory that will be cleaned up
    • Fix: Removed the local_path field from all 30 JSON files, since the full-text content is stored inline in the fulltext_xml/fulltext_plaintext fields
    • Verification: Files no longer reference worktree paths; inline content preserved

    2026-04-22 23:25 UTC - Merge gate fix

    • Issue: The review feedback entry stated that local_path was removed in the 20:10 fix, but the field was still present in all 30 JSON files (the worktree had been rebased/overwritten)
    • Fix: Removed the local_path field from all 30 JSON files again; the full text is stored inline, and local_path referenced a worktree path that would dangle after cleanup
    • Verification: grep confirms 0 files contain local_path, inline fulltext preserved
    • Pushed: commit 9cf39f221
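The fix and its verification can be approximated in a few lines. This is a sketch under assumptions: the cache directory layout, the one-JSON-object-per-file format, and the helper names are hypothetical, and the check parses each file instead of using grep.

```python
import json
import tempfile
from pathlib import Path


def strip_local_path(cache_dir):
    """Delete the top-level local_path key from every JSON file; return files changed."""
    changed = 0
    for path in sorted(Path(cache_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if "local_path" in record:
            del record["local_path"]
            path.write_text(json.dumps(record, indent=2), encoding="utf-8")
            changed += 1
    return changed


def files_with_local_path(cache_dir):
    """List files that still carry local_path (expected to be empty after the fix)."""
    return [
        p.name
        for p in sorted(Path(cache_dir).glob("*.json"))
        if "local_path" in json.loads(p.read_text(encoding="utf-8"))
    ]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up sample records for illustration.
        sample = {
            "pmc_id": "PMC0000001",
            "doi": "10.1000/example",
            "local_path": "/tmp/worktree/papers/PMC0000001.json",
            "fulltext_plaintext": "Real text.",
        }
        Path(tmp, "a.json").write_text(json.dumps(sample), encoding="utf-8")
        Path(tmp, "b.json").write_text(json.dumps({"pmc_id": "PMC0000002"}), encoding="utf-8")
        changed = strip_local_path(tmp)
        remaining = files_with_local_path(tmp)
        kept = json.loads(Path(tmp, "a.json").read_text(encoding="utf-8"))
        return changed, remaining, kept
```

Running a check like `files_with_local_path` as a merge gate, after any rebase, would catch the regression noted in this entry, where an earlier fix was overwritten.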

    2026-04-21 - Quest engine template

    • Created reusable spec for quest-engine generated paper full-text cache backfill tasks.

    Payload JSON
    {
      "requirements": {
        "coding": 6,
        "analysis": 5
      }
    }
