Effort: thorough
Stand up a cancer-specific knowledge-gap importer that mines four
high-signal sources — DepMap unexplained dependencies, NHGRI-EBI GWAS
catalog cancer entries, cBioPortal driver-mutation orphans, and a
PubMed query targeting "no consensus" / "remains unclear" cancer-mechanism
phrasings — emitting structured gaps rows tagged to MONDO oncology
nodes from the disease catalog. This is the cancer analogue of the
existing ND gap pipeline (gap_pipeline.py, gap_enricher.py) and
seeds the cancer hypothesis tree.
The Q-OPENQ ranker and Q-PROP analysis-proposal generator both feed off
gaps. With ND-only gaps, every wave-1/2 ranker output is biased toward
neuroscience even when the hypothesis space is general. Importing a
batch of high-quality cancer gaps gives the Theorist real cancer-specific
priors to work with on day one; without it, oncology personas sit idle
because the queue has nothing to debate.
- scidex/atlas/cancer_gap_importer.py (≤700 LoC) with:
  - import_depmap_unexplained(): pulls genes with mean_chronos < -0.5 in ≥3 lineages but no chembl_drug_targets […] disgenet cancer association → <gene> essential in <lineages> despite […]
  - import_gwas_cancer(): queries NHGRI-EBI GWAS catalog (https://www.ebi.ac.uk/gwas/rest/api/) for trait-class […] hypotheses → emits one gap per (SNP, trait, mapped_gene).
  - import_cbioportal_drivers(): queries cBioPortal API (https://www.cbioportal.org/api) for mutationsInGenes […] mutationCount > 50 lacking a SciDEX hypothesis → […]
  - import_lit_uncertainty(): runs PubMed search (cancer OR oncology) AND ("remains unclear" OR "no consensus" OR "mechanism unknown") AND (mechanism OR pathway) last-2-year […] gap_quality.fingerprint helper.
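The DepMap criterion above (mean_chronos < -0.5 in ≥3 lineages, with no existing drug-target or disease association) can be sketched as a small pure function. This is a minimal illustration only: the LineageScore record, the unexplained_dependencies name, and the known_targets set (standing in for whatever the chembl_drug_targets and disgenet lookups return) are assumptions, not the importer's actual interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class LineageScore(NamedTuple):
    """Hypothetical record: one mean Chronos score per (gene, lineage)."""
    gene: str
    lineage: str
    mean_chronos: float


def unexplained_dependencies(
    scores: Iterable[LineageScore],
    known_targets: Set[str],
    threshold: float = -0.5,   # cutoff named in the spec
    min_lineages: int = 3,     # lineage count named in the spec
) -> Dict[str, List[str]]:
    """Map each qualifying gene to the sorted lineages where it is essential.

    A gene qualifies when its score is strictly below `threshold` in at least
    `min_lineages` distinct lineages and it is absent from `known_targets`.
    """
    essential = defaultdict(set)
    for row in scores:
        if row.mean_chronos < threshold:
            essential[row.gene].add(row.lineage)
    return {
        gene: sorted(lineages)
        for gene, lineages in essential.items()
        if len(lineages) >= min_lineages and gene not in known_targets
    }
```

Each returned entry would then feed one gap row; keeping the filter free of I/O makes it straightforward to test against mocked provider data.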
- […] gap_pipeline.create_gap […] vertical='oncology', mondo_id set, and a source_provenance […]
- scripts/seed_cancer_gaps.py runs all […] gaps after first run.
- scidex-cancer-gap-importer.timer weekly Tuesday 03:00 […] --limit flag and --source filter for ad-hoc reruns.
- /atlas/landscape: "Cancer gaps — <n> open, <m> newly imported this week" linking to a filtered /atlas/gaps?vertical=oncology view.
- tests/test_cancer_gap_importer.py: mocks each provider, […]
- […] scidex/atlas/gap_pipeline.py writer functions; do not […] gap_quality.fingerprint.
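The two numbers on the /atlas/landscape tile could come from a pair of counting queries over knowledge_gaps. The sketch below is hedged on several points: it uses sqlite3 only to stay self-contained, the status and created_at columns are hypothetical (the spec names only vertical, mondo_id and source_provenance), "this week" is read as the trailing seven days, and cancer_gap_tile_counts is an illustrative name rather than the actual helper.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def cancer_gap_tile_counts(
    conn: sqlite3.Connection, now: Optional[datetime] = None
) -> Dict[str, int]:
    """Return open and newly imported oncology gap counts for the tile.

    Assumes created_at holds UTC ISO-8601 strings in one consistent format,
    so that string comparison matches chronological order.
    """
    now = now or datetime.now(timezone.utc)
    week_ago = (now - timedelta(days=7)).isoformat()

    # Hypothetical 'status' column: count gaps still marked open.
    open_count = conn.execute(
        "SELECT COUNT(*) FROM knowledge_gaps "
        "WHERE vertical = 'oncology' AND status = 'open'"
    ).fetchone()[0]

    # Hypothetical 'created_at' column: count gaps imported in the last 7 days.
    new_count = conn.execute(
        "SELECT COUNT(*) FROM knowledge_gaps "
        "WHERE vertical = 'oncology' AND created_at >= ?",
        (week_ago,),
    ).fetchone()[0]

    return {"open": open_count, "new_this_week": new_count}
```

Passing now explicitly keeps the function deterministic under test; the same vertical = 'oncology' predicate would back the filtered /atlas/gaps view.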
- scidex/forge/depmap_client.py (already populated by q-rdp-depmap-target-dependency).
- […] size=200); cache page payloads […] data/gwas_catalog/<date>/.
- […] data/cbioportal/studies/.
- canonical_disease() from q-vert-disease-ontology-catalog to […]

- q-vert-disease-ontology-catalog: MONDO resolver.
- q-rdp-depmap-target-dependency (done): DepMap parquet cache.
- […] gap_pipeline.py, gap_quality.py.

Delivered:
- migrations/add_cancer_gap_columns.sql: adds vertical TEXT, mondo_id TEXT, source_provenance JSONB to knowledge_gaps; migration applied to DB.
- scidex/atlas/cancer_gap_importer.py (≤700 LoC) with all four importer functions:
  - import_depmap_unexplained(): curated list of 170+ cancer-essential candidate […] scidex.forge.depmap_client; filters mean_chronos < -0.5 in […]
  - import_gwas_cancer(): NHGRI-EBI GWAS REST API, pages 200 at a time, filters […] data/gwas_catalog/<date>/.
  - import_cbioportal_drivers(): queries 20 TCGA pan-cancer studies via cBioPortal […] data/cbioportal/studies/.
  - import_lit_uncertainty(): PubMed esearch + esummary with 2-year window; rate- […]
  - resolve_mondo_id() with 30 cancer-type mappings (fallback while q-vert-disease-ontology-catalog is pending).
  - get_cancer_gap_stats() helper for the dashboard tile.
- scripts/seed_cancer_gaps.py: one-shot seeder with --limit, --source, --dry-run flags; logs before/after counts.
- deploy/scidex-cancer-gap-importer.service + .timer: weekly Tue 03:00 UTC.
- api.py: added ?vertical= filter to /api/gaps, new /api/cancer-gaps/stats […] /atlas/landscape HTML page with tile, new /atlas/gaps browser.
- tests/test_cancer_gap_importer.py: 13 tests, all passing.
- […] import_lit_uncertainty(limit=20) against live DB; […] vertical='oncology', MONDO IDs, and source_provenance JSON. PubMed API confirmed live.

Design note: q-vert-disease-ontology-catalog (MONDO resolver dependency) is pending; an embedded 30-term resolver handles the common cases until it lands.
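
The GWAS import pages through the catalog 200 records at a time and caches page payloads under data/gwas_catalog/<date>/. One way to structure that loop is sketched below, with the network call injected so it can be mocked. The names iter_cached_pages and fetch_page, the page_NNNNN.json file naming, and the assumption that each payload carries a page.totalPages field are all illustrative and should be checked against the real API response.

```python
import json
from pathlib import Path
from typing import Callable, Iterator

PAGE_SIZE = 200  # page size named in the notes above


def iter_cached_pages(
    fetch_page: Callable[[int, int], dict],  # hypothetical: (page, size) -> payload
    cache_root: str,
    run_date: str,
    page_size: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield page payloads in order, reading from the on-disk cache when present.

    Pages are stored as <cache_root>/<run_date>/page_NNNNN.json. The loop stops
    once the page index reaches the payload's page.totalPages value (assumed
    field); a payload without that field ends iteration after one page.
    """
    cache_dir = Path(cache_root) / run_date
    cache_dir.mkdir(parents=True, exist_ok=True)

    page = 0
    while True:
        cache_file = cache_dir / f"page_{page:05d}.json"
        if cache_file.exists():
            # Cache hit: no network call for this page.
            payload = json.loads(cache_file.read_text())
        else:
            payload = fetch_page(page, page_size)
            cache_file.write_text(json.dumps(payload))
        yield payload

        total_pages = payload.get("page", {}).get("totalPages", 0)
        page += 1
        if page >= total_pages:
            break
```

A rerun on the same date replays the cached pages without touching the API, which keeps ad-hoc reruns cheap; the test suite can pass a fake fetch_page in place of the HTTP client.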