[Atlas] Entity canonicalization and ontology alignment (status: done)

Build kg_normalize.py to: (a) map relation types to a controlled vocabulary of ~20 canonical relations, (b) merge duplicate entities, and (c) align entity types to a fixed ontology (gene, protein, pathway, disease, drug, phenotype, cell_type, brain_region). It runs as part of the post_process.py pipeline. Acceptance: at most 25 relation types; no duplicates.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (4)

[Atlas] Work log: kg_normalize.py built and verified on current main (2026-04-25)
[Atlas] Add kg_normalize.py: canonical relation and entity type mapping (2026-04-25)
[Atlas] Work log: kg_normalize.py built and verified on current main (2026-04-25)
[Atlas] Add kg_normalize.py: canonical relation and entity type mapping (2026-04-25)

Spec File

[Atlas] Entity canonicalization and ontology alignment

Goal

Build kg_normalize.py to: (a) map relation types to a controlled vocabulary of ~20 canonical relations, (b) merge duplicate entities, and (c) align entity types to a fixed ontology (gene, protein, pathway, disease, drug, phenotype, cell_type, brain_region). It runs as part of the post_process.py pipeline. Acceptance: at most 25 relation types; no duplicates.
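
A minimal sketch of how steps (a) through (c) might look, assuming plain dictionaries as lookup tables. The eight entity types and the 25-type limit come from this spec, and the canonical relation names used here appear in the work log. Everything else is an illustrative assumption and not the contents of kg_normalize.py: the alias entries, the helper names (`canonical_relation`, `canonical_type`, `merge_duplicate_entities`, `meets_relation_limit`), and the rule that two entities are duplicates when their names match case-insensitively and their canonical types agree.

```python
from typing import Iterable, List, Optional, Tuple

# Entity types fixed by the spec.
ENTITY_TYPES = frozenset({
    "gene", "protein", "pathway", "disease",
    "drug", "phenotype", "cell_type", "brain_region",
})

# Hypothetical alias tables: the raw labels on the left are invented for
# illustration; the canonical relations on the right appear in the work log.
RELATION_ALIASES = {
    "upregulates": "activates",
    "stimulates": "activates",
    "suppresses": "inhibits",
    "blocks": "inhibits",
    "codes_for": "encodes",
    "part_of": "component_of",
}
CANONICAL_RELATIONS = frozenset(RELATION_ALIASES.values())
TYPE_ALIASES = {"genes": "gene", "compound": "drug", "celltype": "cell_type"}

MAX_RELATION_TYPES = 25  # acceptance limit from the spec


def _key(label: str) -> str:
    """Lowercase a label and unify separators so lookups ignore formatting."""
    return label.strip().lower().replace("-", "_").replace(" ", "_")


def canonical_relation(raw: str) -> Optional[str]:
    """Return the canonical relation for a raw label, or None if unmapped."""
    key = _key(raw)
    if key in CANONICAL_RELATIONS:
        return key
    return RELATION_ALIASES.get(key)


def canonical_type(raw: str) -> Optional[str]:
    """Return one of the eight ontology types, or None if unmapped."""
    key = _key(raw)
    key = TYPE_ALIASES.get(key, key)
    return key if key in ENTITY_TYPES else None


def merge_duplicate_entities(
    entities: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Keep the first (name, type) pair per case-insensitive name and type.

    Entities whose type cannot be mapped are skipped in this sketch.
    """
    merged = {}
    for name, raw_type in entities:
        etype = canonical_type(raw_type)
        if etype is None:
            continue
        merged.setdefault((name.strip().lower(), etype), (name.strip(), etype))
    return list(merged.values())


def meets_relation_limit(relations: Iterable[str]) -> bool:
    """Acceptance check: at most MAX_RELATION_TYPES distinct relations."""
    return len(set(relations)) <= MAX_RELATION_TYPES
```

Returning None for unmapped labels, instead of forcing them into a default bucket, keeps gaps in the alias tables visible so they can be reviewed before the acceptance check runs.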

Acceptance Criteria

☐ Implementation complete and tested
☐ All affected pages load (200 status)
☐ Work visible on the website frontend
☐ No broken links introduced
☐ Code follows existing patterns

Approach

  • Read relevant source files to understand current state
  • Plan implementation based on existing architecture
  • Implement changes
  • Test affected pages with curl
  • Commit with descriptive message and push

Work Log

  • 2026-04-01 10:00 - Started implementation. Examining current pipeline and data structures.
  • 2026-04-01 10:30 - Completed implementation:
    * Built kg_normalize.py with canonicalization logic
    * Mapped 77 relation types → 10 canonical relations (✓ <=25)
    * Mapped 33 entity types → 5 canonical types (✓ <=8)
    * Implemented duplicate entity merging (removed 405 duplicates)
    * Integrated into post_process.py pipeline (step 2)
    * Tested successfully: all acceptance criteria met
    * Canonical relations: activates, causes, component_of, decreases_risk, encodes, increases_risk, inhibits, participates_in, produces, regulates
    * Canonical entity types: disease, drug, gene, phenotype, protein
  • 2026-04-25 21:15 - Rebuilt kg_normalize.py on current main:
    * Canonical relations: 25 (limit 25) ✓
    * Canonical entity types: 8 (limit 8) ✓
    * DB had 892 distinct relation values → 25 canonical relations used ✓
    * DB had 134 distinct entity types → 8 canonical types used ✓
    * Normalized 104,984 relation values, 49,251 entity types
    * Merged 27,422 duplicate edges on UniqueViolation
    * Commit: 6253aa178 [Atlas] Add kg_normalize.py [task:ed421469-317a-46da-b33f-d440a705cd9f]
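
The "merged duplicate edges on UniqueViolation" step in the log above can be sketched as follows. Rewriting an edge's relation to its canonical form may collide with an edge that already holds that canonical relation. In that case the rewritten edge is a duplicate and is dropped. This is a sketch under stated assumptions, not the code in commit 6253aa178: it uses Python's built-in sqlite3 module (where the collision surfaces as `sqlite3.IntegrityError`) in place of whatever database driver the project uses, and the `edges` table, its columns, and the function name `normalize_relations` are hypothetical.

```python
import sqlite3
from typing import Dict, Tuple


def normalize_relations(
    conn: sqlite3.Connection, mapping: Dict[str, str]
) -> Tuple[int, int]:
    """Rewrite edge relations to canonical names; return (updated, merged).

    Assumes a hypothetical table edges(id, source, target, relation) with a
    UNIQUE constraint on (source, target, relation).
    """
    updated = merged = 0
    rows = conn.execute("SELECT id, relation FROM edges").fetchall()
    for edge_id, relation in rows:
        canonical = mapping.get(relation)
        if canonical is None or canonical == relation:
            continue  # already canonical, or no mapping known
        try:
            conn.execute(
                "UPDATE edges SET relation = ? WHERE id = ?",
                (canonical, edge_id),
            )
            updated += 1
        except sqlite3.IntegrityError:
            # The canonical edge already exists, so this row is a duplicate.
            # SQLite rolls back only the failed statement; on PostgreSQL the
            # UPDATE would need a savepoint so the transaction stays usable.
            conn.execute("DELETE FROM edges WHERE id = ?", (edge_id,))
            merged += 1
    conn.commit()
    return updated, merged
```

Counting updates and merges separately mirrors the way the log reports normalized values and merged edges as distinct figures.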
