[Atlas] Per-disease ontology + entity catalog (cancer, cardio, infectious, metabolic, immunology) done

MONDO/DOID/EFO catalog across 5 verticals with a canonical_disease() resolver: the foundation that every other wave-4 vertical task builds on.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (1)

[Atlas] Disease ontology catalog — MONDO/DOID/EFO resolver across 5 verticals [task:520d6abd-efca-4891-8998-a6acc9b2b5fe] (#778)
2026-04-27
Spec File

Effort: thorough

Goal

Today the disease side of the SciDEX KG is anchored on neurodegeneration
(ND): hypotheses.disease, wiki_entities, and the disease-landing dashboard
all assume an ND vocabulary. Build a **multi-vertical disease ontology
catalog** that imports MONDO, DOID, and EFO IDs for the top entities in
five new verticals (oncology, cardiovascular, infectious, metabolic,
immunology), maps them to canonical entity rows, and exposes a single
canonical_disease(slug) resolver that every cross-cutting feature can reuse.
Without this, every wave-4 vertical task reinvents its own disease ID
plumbing.

Why this matters

Five wave-4 specs (per-vertical landing pages, persona injection,
gap importers, cross-disease analogy engine, priority scoring) all need
the same answer to "is colorectal-cancer the same node as MONDO:0005575?"
Today there is no canonical resolver: disease columns hold free text
("AD", "Alzheimer's disease", "alzheimers-disease",
"Alzheimer disease (G30)"). A central catalog keyed on MONDO IDs resolves
that ambiguity once, so every other vertical task can build on it.

Acceptance Criteria

☐ Migration disease_ontology_catalog(mondo_id PRIMARY KEY,
label, vertical TEXT CHECK (vertical IN ('oncology','cardiovascular',
'infectious','metabolic','immunology','neurodegeneration','other')),
doid_id, efo_id, icd10, mesh_id, parent_mondo_id, synonyms_json,
gard_id, omim_id, n_known_genes, n_clinical_trials, n_papers,
catalog_version, fetched_at). Index on vertical and on
lower(label).
☐ New module scidex/atlas/disease_ontology.py (≤500 LoC) with:
- import_from_mondo() — pulls MONDO OWL via OLS REST
(https://www.ebi.ac.uk/ols4/api/ontologies/mondo) and walks
the five vertical sub-trees (MONDO:0004992 cancer, MONDO:0005267
heart disease, MONDO:0005550 infectious, MONDO:0005066
metabolic, MONDO:0005046 immune system).
- canonical_disease(query: str) -> CanonicalDisease | None — fuzzy
resolves a free-text disease name (uses rapidfuzz against
synonyms_json) returning the catalog row.
- vertical_for(mondo_id) and subtree(mondo_id) helpers.
☐ Backfill script scripts/backfill_canonical_disease.py walks
hypotheses.disease, wiki_entities WHERE entity_type='disease',
and wiki_pages WHERE category='disease', calls
canonical_disease(), and writes a new join table
entity_disease_canonical(entity_id, mondo_id, confidence, resolved_at).
Reports the unresolved rate per vertical.
☐ Seed top-50 MONDO terms per vertical into the catalog so the
first migration run lands with ≥250 rows ready, then schedule
scidex-disease-ontology-refresh.timer weekly on Mondays 04:00
UTC to pull MONDO updates.
☐ New /api/disease/{mondo_id} JSON endpoint returns the catalog
row plus aggregate counts (hypotheses, papers, debates linked).
/atlas/diseases page lists the 250+ catalog rows grouped by
vertical, sortable by n_papers and n_hypotheses.
☐ Existing /disease-landing/<slug> route
(api.py per q-synth-disease-landing) is updated to look the
slug up via the new resolver instead of the hard-coded ND list.
☐ Unit tests in tests/test_disease_ontology.py:
"alzheimers-disease" / "AD" / "Alzheimer disease" all resolve
to MONDO:0004975; "colorectal cancer" → MONDO:0005575;
"type 2 diabetes" → MONDO:0005148; an unknown string returns
None rather than raising an exception.
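
As a rough illustration of the migration criterion in the checklist above, the sketch below creates the catalog table in an in-memory SQLite database. SQLite itself, the column types, the index names, and the try_insert helper are assumptions made for illustration; the spec fixes only the column names, the allowed vertical values, and the two indexes.

```python
import sqlite3

# Column names follow the acceptance criteria; the types and index names
# are assumptions for this sketch.
CATALOG_DDL = """
CREATE TABLE disease_ontology_catalog (
    mondo_id          TEXT PRIMARY KEY,
    label             TEXT,
    vertical          TEXT CHECK (vertical IN (
                          'oncology', 'cardiovascular', 'infectious',
                          'metabolic', 'immunology', 'neurodegeneration',
                          'other')),
    doid_id           TEXT,
    efo_id            TEXT,
    icd10             TEXT,
    mesh_id           TEXT,
    parent_mondo_id   TEXT,
    synonyms_json     TEXT,
    gard_id           TEXT,
    omim_id           TEXT,
    n_known_genes     INTEGER,
    n_clinical_trials INTEGER,
    n_papers          INTEGER,
    catalog_version   TEXT,
    fetched_at        TEXT
);
CREATE INDEX idx_catalog_vertical ON disease_ontology_catalog (vertical);
CREATE INDEX idx_catalog_label_lower ON disease_ontology_catalog (lower(label));
"""


def create_catalog(conn: sqlite3.Connection) -> None:
    """Create the catalog table and its two indexes."""
    conn.executescript(CATALOG_DDL)


def try_insert(conn: sqlite3.Connection, mondo_id: str, label: str, vertical: str) -> bool:
    """Insert one row; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO disease_ontology_catalog (mondo_id, label, vertical) "
            "VALUES (?, ?, ?)",
            (mondo_id, label, vertical),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised for a duplicate mondo_id or a vertical outside the CHECK list.
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_catalog(conn)
    print(try_insert(conn, "MONDO:0005575", "colorectal cancer", "oncology"))
    print(try_insert(conn, "MONDO:0005148", "type 2 diabetes mellitus", "dermatology"))
```

The CHECK constraint means a row with a vertical outside the seven allowed values is rejected at write time, so a typo in an importer fails on insert instead of producing a row that no vertical page will list.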

Approach

  • Use OLS4 REST (no auth): /ontologies/mondo/terms?obo_id=MONDO:...&size=200,
    walking each subtree by paging children. Cache page payloads under
    data/disease_ontology/mondo/v<version>/.
  • Synonyms come from MONDO's oboSynonym annotation; ICD/OMIM/MeSH
    xrefs come from MONDO databaseCrossReferences.
  • Resolver uses rapidfuzz.process.extractOne with score cutoff 85.
  • Aggregate counts are populated by triggers plus a nightly refresh job
    so the /api/disease/{mondo_id} payload is fast.
  • Mirror the canonical-entity pattern from
    scidex/atlas/canonical_entity_links.py so wave-4 personas and
    landing pages have a uniform interface.
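
The resolver behaviour described above (normalise the query, score it against every label and synonym, apply a cutoff of 85, return None on a miss) might look roughly like the sketch below. To stay dependency-free it scores with the standard library's difflib.SequenceMatcher as a stand-in for rapidfuzz.process.extractOne, so the scores will not match rapidfuzz's. The three catalog rows and their synonym lists are hand-written for illustration using the MONDO IDs named in the acceptance criteria, not fetched from MONDO; the CanonicalDisease fields and the _normalize helper are also assumptions.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CanonicalDisease:
    # Hypothetical subset of the catalog columns.
    mondo_id: str
    label: str
    vertical: str


# Illustrative seed rows; a real catalog would load synonyms_json from the table.
_SEED: List[Tuple[CanonicalDisease, List[str]]] = [
    (CanonicalDisease("MONDO:0004975", "Alzheimer disease", "neurodegeneration"),
     ["Alzheimer's disease", "AD"]),
    (CanonicalDisease("MONDO:0005575", "colorectal cancer", "oncology"),
     ["colorectal carcinoma"]),
    (CanonicalDisease("MONDO:0005148", "type 2 diabetes mellitus", "metabolic"),
     ["type 2 diabetes", "T2D"]),
]


def _normalize(text: str) -> str:
    """Lowercase, drop apostrophes, and collapse punctuation to single spaces."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def _build_index() -> Dict[str, CanonicalDisease]:
    """Map every normalized label and synonym to its catalog row."""
    index: Dict[str, CanonicalDisease] = {}
    for row, synonyms in _SEED:
        for name in [row.label, *synonyms]:
            index[_normalize(name)] = row
    return index


_INDEX = _build_index()


def canonical_disease(query: str, cutoff: float = 85.0) -> Optional[CanonicalDisease]:
    """Return the best-matching catalog row, or None if nothing scores >= cutoff."""
    key = _normalize(query)
    if not key:
        return None
    best_row: Optional[CanonicalDisease] = None
    best_score = 0.0
    for name, row in _INDEX.items():
        # SequenceMatcher.ratio() is in [0, 1]; scale to the 0-100 range of the cutoff.
        score = 100.0 * SequenceMatcher(None, key, name).ratio()
        if score > best_score:
            best_row, best_score = row, score
    return best_row if best_score >= cutoff else None


if __name__ == "__main__":
    for q in ["alzheimers-disease", "AD", "type 2 diabetes", "xyzzy-unknown"]:
        print(q, "->", canonical_disease(q))
```

With rapidfuzz installed, the scoring loop could be replaced by a single rapidfuzz.process.extractOne call with score_cutoff=85, which returns None when no choice reaches the cutoff. Normalising before matching is what lets "alzheimers-disease" and "Alzheimer's disease" hit the same index key exactly, leaving fuzzy scoring for genuine near-misses.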

Dependencies

  • q-synth-disease-landing: the landing route looks up slugs via the new resolver.
  • Existing wiki_entities, canonical_entity_links infrastructure.

Work Log
