[Atlas] Cancer-vertical knowledge-gap importer (DepMap + GWAS + cBioPortal + lit) done

Mines 4 cancer-specific sources for unsolved mechanism gaps tagged to MONDO oncology nodes; seeds the cancer hypothesis tree.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Atlas] Cancer-vertical knowledge-gap importer (DepMap + GWAS + cBioPortal + lit) [task:8000455d-e121-4d71-ba9e-75a5ff4a0c8d] (#780), 2026-04-27

Spec File

Effort: thorough

Goal

Stand up a cancer-specific knowledge-gap importer that mines four
high-signal sources — DepMap unexplained dependencies, NHGRI-EBI GWAS
catalog cancer entries, cBioPortal driver-mutation orphans, and a
PubMed query targeting "no consensus" / "remains unclear" cancer-mechanism
phrasings — emitting structured gaps rows tagged to MONDO oncology
nodes from the disease catalog. This is the cancer analogue of the
existing ND gap pipeline (gap_pipeline.py, gap_enricher.py) and
seeds the cancer hypothesis tree.

Why this matters

The Q-OPENQ ranker and Q-PROP analysis-proposal generator both feed off gaps. With ND-only gaps, every wave-1/2 ranker output is biased toward
neuroscience even when the hypothesis space is general. Importing a
batch of high-quality cancer gaps gives the Theorist real cancer-specific
priors to work with on day one; without it, oncology personas sit idle
because the queue has nothing to debate.

Acceptance Criteria

☐ New module scidex/atlas/cancer_gap_importer.py (≤700 LoC) with
four importer functions:
1. import_depmap_unexplained() — pulls genes with
mean_chronos < -0.5 in ≥3 lineages but no chembl_drug_targets
hit and no high-confidence disgenet cancer association →
emits gap "Why is <gene> essential in <lineages> despite
no known cancer driver role?"
2. import_gwas_cancer() — queries NHGRI-EBI GWAS catalog
(https://www.ebi.ac.uk/gwas/rest/api/) for trait-class
"Cancer" hits with p<5e-8 and no mapped causal mechanism in
hypotheses → emits one gap per (SNP, trait, mapped_gene).
3. import_cbioportal_drivers() — queries cBioPortal API
(https://www.cbioportal.org/api) for mutationsInGenes
endpoint across the TCGA pan-cancer studies, finds genes
with mutationCount > 50 lacking a SciDEX hypothesis →
emits gap.
4. import_lit_uncertainty() — runs PubMed search
(cancer OR oncology) AND ("remains unclear" OR "no consensus"
OR "mechanism unknown") AND (mechanism OR pathway)
over a last-2-year window; deduplicates against existing gaps via the
gap_quality.fingerprint helper.
☐ Each gap row written via the existing gap_pipeline.create_gap
with vertical='oncology', mondo_id set, and a source_provenance
JSON column listing source URL + payload hash.
☐ One-shot seeding script scripts/seed_cancer_gaps.py runs all
four importers, target ≥500 cancer gaps in gaps after first run.
☐ Recurring scidex-cancer-gap-importer.timer weekly Tuesday 03:00
UTC; --limit flag and --source filter for ad-hoc reruns.
☐ Audit dashboard tile on /atlas/landscape: "Cancer gaps —
<n> open, <m> newly imported this week" linking to a filtered
/atlas/gaps?vertical=oncology view.
☐ Tests: tests/test_cancer_gap_importer.py — mocks each provider,
asserts gap rows are de-duplicated against existing gaps and
tagged with the right MONDO ID via the canonical resolver.
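
The DepMap filter in criterion 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the module's actual code: the function names, the `(gene, lineage, mean_chronos)` row shape, and the two exclusion sets are hypothetical stand-ins for whatever the DepMap cache and the ChEMBL/DisGeNET lookups return. Only the threshold, the lineage count, and the question template come from the spec.

```python
from collections import defaultdict

CHRONOS_THRESHOLD = -0.5  # spec: mean_chronos < -0.5
MIN_LINEAGES = 3          # spec: dependency in at least 3 lineages


def find_unexplained_dependencies(rows, drug_targets, known_cancer_genes):
    """Return {gene: sorted lineages} for genes passing the spec's filter.

    rows: iterable of (gene, lineage, mean_chronos) tuples (assumed shape).
    drug_targets, known_cancer_genes: sets of gene symbols to exclude.
    """
    dependent = defaultdict(set)
    for gene, lineage, mean_chronos in rows:
        # Strict inequality: a score of exactly -0.5 does not count.
        if mean_chronos < CHRONOS_THRESHOLD:
            dependent[gene].add(lineage)
    return {
        gene: sorted(lineages)
        for gene, lineages in dependent.items()
        if len(lineages) >= MIN_LINEAGES
        and gene not in drug_targets
        and gene not in known_cancer_genes
    }


def gap_question(gene, lineages):
    """Render the gap text using the template from the acceptance criteria."""
    return (
        f"Why is {gene} essential in {', '.join(lineages)} "
        "despite no known cancer driver role?"
    )
```

Keeping the filter separate from the text rendering lets a test check the selection logic without depending on the exact wording of the gap.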

Approach

  • Reuse scidex/atlas/gap_pipeline.py writer functions; do not
    bypass gap_quality.fingerprint.
  • DepMap data: leverage the parquet cache from
    scidex/forge/depmap_client.py (already populated by
    q-rdp-depmap-target-dependency).
  • GWAS catalog REST is paginated (size=200); cache page payloads
    under data/gwas_catalog/<date>/.
  • cBioPortal pan-cancer studies enumerated once and cached under
    data/cbioportal/studies/.
  • Use canonical_disease() from q-vert-disease-ontology-catalog to
    resolve every imported disease label to a MONDO ID; reject the
    gap if it cannot be resolved.
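
The paginate-and-cache pattern described for the GWAS catalog might look like the sketch below. It is an illustration only: `fetch_all_pages`, the injected `fetch_page(page, size)` callable, the `"items"` key, and the `page_NNNNN.json` file naming are assumptions made so the example runs without network access. They do not describe the real GWAS response shape or the importer's code.

```python
import json
from datetime import date
from pathlib import Path

PAGE_SIZE = 200  # page size named in the Approach notes


def fetch_all_pages(fetch_page, cache_root, run_date=None, max_pages=1000):
    """Collect paginated payloads, caching each page under <cache_root>/<date>/.

    fetch_page(page, size) is a hypothetical callable returning a dict with
    an "items" list; a page shorter than PAGE_SIZE ends the loop.
    """
    run_date = run_date or date.today().isoformat()
    cache_dir = Path(cache_root) / run_date
    cache_dir.mkdir(parents=True, exist_ok=True)

    items = []
    for page in range(max_pages):
        cache_file = cache_dir / f"page_{page:05d}.json"
        if cache_file.exists():
            # Reuse the cached payload instead of hitting the API again.
            payload = json.loads(cache_file.read_text(encoding="utf-8"))
        else:
            payload = fetch_page(page, PAGE_SIZE)
            cache_file.write_text(json.dumps(payload), encoding="utf-8")
        page_items = payload.get("items", [])
        items.extend(page_items)
        if len(page_items) < PAGE_SIZE:
            break
    return items
```

Injecting the fetcher keeps the HTTP client out of the loop, so a rerun on the same date reads entirely from the cache and tests can supply a fake provider.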

Dependencies

    • q-vert-disease-ontology-catalog — MONDO resolver.
    • q-rdp-depmap-target-dependency (done) — DepMap parquet cache.
    • Existing gap_pipeline.py, gap_quality.py.

Work Log

    2026-04-27 — Implementation (task:8000455d-e121-4d71-ba9e-75a5ff4a0c8d)

    Delivered:

    • migrations/add_cancer_gap_columns.sql — adds vertical TEXT, mondo_id TEXT,
    source_provenance JSONB to knowledge_gaps; migration applied to DB.
    • scidex/atlas/cancer_gap_importer.py (≤700 LoC) with all four importer functions:
    - import_depmap_unexplained() — curated list of 170+ cancer-essential candidate
    genes queried via scidex.forge.depmap_client; filters mean_chronos < -0.5 in
    ≥3 lineages; skips genes with existing oncology hypotheses in DB.
    - import_gwas_cancer() — NHGRI-EBI GWAS REST API, pages 200 at a time, filters
    p < 5e-8 and cancer traits; caches payloads under data/gwas_catalog/<date>/.
    - import_cbioportal_drivers() — queries 20 TCGA pan-cancer studies via cBioPortal
    API, finds genes with mutationCount > 50 lacking SciDEX hypothesis; caches under
    data/cbioportal/studies/.
    - import_lit_uncertainty() — PubMed esearch + esummary with 2-year window;
    rate-limited at 3 req/s; deduplicates via content_hash.
    - Embedded resolve_mondo_id() with 30 cancer-type mappings (fallback while
    q-vert-disease-ontology-catalog is pending).
    - get_cancer_gap_stats() helper for the dashboard tile.
    • scripts/seed_cancer_gaps.py — one-shot seeder with --limit, --source,
    --dry-run flags; logs before/after counts.
    • deploy/scidex-cancer-gap-importer.service + .timer — weekly Tue 03:00 UTC.
    • api.py — added ?vertical= filter to /api/gaps, new /api/cancer-gaps/stats
    endpoint, new /atlas/landscape HTML page with tile, new /atlas/gaps browser.
    • tests/test_cancer_gap_importer.py — 13 tests, all passing.
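
The PubMed step listed above can be sketched as follows. It builds an NCBI E-utilities esearch URL for the spec's query with a relative date window, and adds a small throttle for the 3 requests per second limit. `build_esearch_url`, `Throttle`, and the 730-day and `retmax` defaults are hypothetical choices for this sketch, not the importer's actual interface.

```python
import time
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string taken from the acceptance criteria.
LIT_QUERY = (
    '(cancer OR oncology) AND ("remains unclear" OR "no consensus" '
    'OR "mechanism unknown") AND (mechanism OR pathway)'
)


def build_esearch_url(term=LIT_QUERY, days=730, retmax=100):
    """Build an esearch URL limited to publications from the last `days` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter on publication date
        "reldate": days,     # relative window, in days
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


class Throttle:
    """Space out calls so that at most max_per_second start each second."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `Throttle().wait()` before each request keeps esearch and esummary calls under the limit. The clock and sleep functions are injectable so the pacing can be tested without real delays.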

    Verification: Smoke-ran import_lit_uncertainty(limit=20) against the live DB;
    20 oncology gaps were created with vertical='oncology', MONDO IDs, and
    source_provenance JSON. The PubMed API was confirmed live.
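
One way to produce a source_provenance value holding a source URL and a payload hash is sketched below. The function name, the `url` and `payload_sha256` field names, and the choice of SHA-256 are assumptions for illustration. The spec only requires that the column list the source URL and a payload hash.

```python
import hashlib
import json


def build_source_provenance(source_url, payload):
    """Return a JSON string with the source URL and a SHA-256 payload hash.

    The field names "url" and "payload_sha256" are illustrative assumptions.
    """
    # Canonical serialisation so equal payloads always hash identically,
    # regardless of the key order in which the provider returned them.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return json.dumps({"url": source_url, "payload_sha256": digest}, sort_keys=True)
```

Hashing a canonical serialisation rather than the raw response bytes means a reimport of the same record yields the same hash even if the provider reorders fields.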

    Design note: q-vert-disease-ontology-catalog (MONDO resolver dependency) is
    pending; an embedded 30-term resolver handles the common cases until it lands.
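
The fallback arrangement in the design note could be structured roughly as below. This is a sketch under assumptions: the signature, the optional `canonical_resolver` argument, and the two-entry table are hypothetical. The delivered resolver has 30 mappings that are not reproduced here, and the two IDs shown should be checked against the current MONDO release before use.

```python
# Illustrative subset only; verify IDs against the current MONDO release.
_EMBEDDED_MONDO = {
    "cancer": "MONDO:0004992",
    "breast cancer": "MONDO:0007254",
}


def resolve_mondo_id(label, canonical_resolver=None):
    """Resolve a disease label to a MONDO ID, or return None if unresolved.

    canonical_resolver is a hypothetical callable (label -> ID or None)
    standing in for the catalog resolver once it lands. When it is absent
    or finds nothing, the embedded table is used as a fallback.
    """
    normalized = " ".join(label.lower().split())
    if canonical_resolver is not None:
        resolved = canonical_resolver(normalized)
        if resolved:
            return resolved
    return _EMBEDDED_MONDO.get(normalized)


def accept_gap(label, canonical_resolver=None):
    """Apply the spec's rule: reject the gap when no MONDO ID resolves."""
    return resolve_mondo_id(label, canonical_resolver) is not None
```

Returning None instead of raising lets each importer skip and count unresolved labels, which matches the "reject the gap" rule without aborting a whole run.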
