[Atlas] Cross-reference tabular datasets with KG: bidirectional entity linking
Status: done | analysis:5 coding:5 safety:9

When a tabular dataset is registered, automatically identify columns that map to existing KG entities (genes, proteins, diseases, brain regions, etc.) and create artifact_links. When browsing a KG entity, show linked datasets that mention it. When browsing a dataset, show KG context for each column. This creates a bridge between the graph world model and structured tabular observations. Depends on: atl-ds-01-GROW.

Completion Notes

Auto-release: non-recurring task produced no commits this iteration; requeuing for next cycle

Git Commits (20)

[Atlas] Cross-reference tabular datasets with KG: bidirectional entity linking [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Work log: verify complete and update spec status [task:atl-ds-02-XREF] 2026-04-26
[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Cross-reference tabular datasets with KG [task:atl-ds-02-XREF] 2026-04-26
[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26
[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26
[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26
[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26
[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26
[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26
[Atlas] Restore api.py dataset-KG cross-reference wiring [task:atl-ds-02-XREF] 2026-04-25
[Atlas] Resolve merge conflicts and push cross-reference linking [task:atl-ds-02-XREF] 2026-04-25
[Atlas] Implement dataset-KG bidirectional cross-reference linking — complete, tests passing [task:atl-ds-02-XREF]2026-04-25
Spec File

[Atlas] Cross-reference tabular datasets with KG: bidirectional entity linking

Goal

Create a seamless bridge between the graph world model and tabular observations. When a researcher browses a gene in the KG, they should see which datasets contain measurements for that gene. When browsing a dataset, they should see rich KG context for each column.

Bidirectional Linking

KG → Datasets ("What data exists for this entity?")

When viewing a KG entity page (e.g., gene TREM2):

Datasets containing TREM2:
┌─────────────────────────────┬──────────────┬──────────┬────────────┐
│ Dataset                      │ Column       │ Rows     │ Values     │
├─────────────────────────────┼──────────────┼──────────┼────────────┤
│ SEA-AD Differential Expr.    │ gene_symbol  │ 15,000   │ log_fc=2.3 │
│ Allen Brain Cell Atlas       │ gene         │ 50,000   │ expr=high  │
│ AD GWAS Summary Stats        │ gene_name    │ 8,000    │ p=1.2e-8   │
└─────────────────────────────┴──────────────┴──────────┴────────────┘

Implementation:

def get_datasets_for_entity(entity_name, entity_type=None):
    """Query artifact_links and column metadata to find datasets
    containing this entity. Returns dataset artifacts with the 
    specific column and summary statistics for that entity."""
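
A minimal sketch of how get_datasets_for_entity could be backed by a query, assuming the links live in a single SQLite table. The table layout (dataset_id, dataset_name, column_name, entity_name, entity_type, row_count), the extra conn parameter, the ds-1/ds-2 identifiers, and the demo_connection helper are all hypothetical; the real artifact_links schema is not shown in this spec. Per-entity summary values (the "Values" column in the mock-up above) are left out.

```python
import sqlite3

# Hypothetical schema for illustration only; the real artifact_links
# table and column metadata may be organised differently.
SCHEMA = """
CREATE TABLE artifact_links (
    dataset_id   TEXT NOT NULL,
    dataset_name TEXT NOT NULL,
    column_name  TEXT NOT NULL,
    entity_name  TEXT NOT NULL,
    entity_type  TEXT NOT NULL,
    row_count    INTEGER NOT NULL
);
"""


def get_datasets_for_entity(conn, entity_name, entity_type=None):
    """Return one dict per (dataset, column) link that mentions the entity."""
    sql = (
        "SELECT dataset_id, dataset_name, column_name, row_count "
        "FROM artifact_links WHERE entity_name = ?"
    )
    params = [entity_name]
    if entity_type is not None:
        # Optional narrowing, e.g. when a gene and a protein share a name.
        sql += " AND entity_type = ?"
        params.append(entity_type)
    sql += " ORDER BY dataset_name"
    cur = conn.execute(sql, params)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def demo_connection():
    """Build an in-memory database seeded with rows from the mock-up table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO artifact_links VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("ds-1", "SEA-AD Differential Expr.", "gene_symbol", "TREM2", "gene", 15000),
            ("ds-2", "Allen Brain Cell Atlas", "gene", "TREM2", "gene", 50000),
            ("ds-2", "Allen Brain Cell Atlas", "gene", "APOE", "gene", 50000),
        ],
    )
    return conn
```

Calling get_datasets_for_entity(demo_connection(), "TREM2") yields one row per dataset column that links to TREM2, which is the shape the "Related Datasets" table needs.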

Datasets → KG ("What does the KG know about this data?")

When viewing a tabular dataset page:

Column: gene_symbol (linked to: gene entities)
  Known genes in KG: 12,450 / 15,000 (83% coverage)
  Top entities by KG connectivity:
    TREM2 (142 edges) | APOE (138 edges) | APP (95 edges) | MAPT (89 edges)
  
Column: cell_type (linked to: cell_type entities)
  Known cell types in KG: 8 / 8 (100% coverage)
  microglia (256 edges) | astrocyte (198 edges) | neuron (312 edges)

Implementation:

def get_kg_context_for_dataset(dataset_artifact_id):
    """For each linked column in the dataset, resolve entities against
    the KG and return coverage stats and top entities by connectivity."""
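
The coverage and connectivity figures above could be computed roughly as follows. This is a sketch under stated assumptions: it works on in-memory data instead of a dataset_artifact_id, it measures coverage over distinct column values (the spec does not say whether rows or distinct values are counted), and the kg_index structure and the kg_context_for_column helper are hypothetical.

```python
def kg_context_for_column(values, kg_edge_counts, top_n=4):
    """Coverage and top entities for one column.

    values: iterable of raw column values.
    kg_edge_counts: dict mapping entity name -> number of KG edges
    (hypothetical stand-in for a real KG lookup).
    """
    distinct = set(values)
    known = {v for v in distinct if v in kg_edge_counts}
    coverage = len(known) / len(distinct) if distinct else 0.0
    # Sort by edge count descending, then by name so ties are deterministic.
    top = sorted(known, key=lambda e: (-kg_edge_counts[e], e))[:top_n]
    return {
        "known": len(known),
        "total": len(distinct),
        "coverage": coverage,
        "top_entities": [(e, kg_edge_counts[e]) for e in top],
    }


def get_kg_context_for_dataset(linked_columns, kg_index):
    """Per-column KG context.

    linked_columns: {column_name: (entity_type, values)}
    kg_index: {entity_type: {entity_name: edge_count}}
    """
    return {
        column: {
            "entity_type": entity_type,
            **kg_context_for_column(values, kg_index.get(entity_type, {})),
        }
        for column, (entity_type, values) in linked_columns.items()
    }


# Usage with the edge counts quoted in the example above.
kg_index = {"gene": {"TREM2": 142, "APOE": 138, "APP": 95, "MAPT": 89}}
columns = {"gene_symbol": ("gene", ["TREM2", "APOE", "APP", "MAPT", "NOT_IN_KG"])}
context = get_kg_context_for_dataset(columns, kg_index)
```

Here four of the five distinct values resolve, so the gene_symbol column reports 80% coverage with TREM2 ranked first.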

Auto-Linking on Registration

When a tabular dataset is registered (via register_tabular_dataset):
  • For each column with linked_entity_type:
    - Sample N values from the column
    - Resolve each against KG entities of that type
    - Create artifact_links from dataset to matched entities
  • Compute and store coverage statistics in dataset metadata
  • Flag columns with low coverage for review
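
The registration steps above can be sketched for a single column as follows. The function name auto_link_column, the link dict layout, the default sample size of 100, and the 0.5 review threshold are assumptions for illustration; the spec does not fix N or define what counts as low coverage.

```python
import random


def auto_link_column(dataset_id, column_name, values, kg_entities,
                     sample_size=100, low_coverage_threshold=0.5, seed=0):
    """Sample a column, resolve the sample against the KG, and build links.

    kg_entities: set of entity names of the column's linked_entity_type
    (hypothetical stand-in for a real KG resolver).
    """
    distinct = sorted(set(values))
    rng = random.Random(seed)  # seeded so repeated registrations agree
    if len(distinct) <= sample_size:
        sample = distinct
    else:
        sample = rng.sample(distinct, sample_size)

    matched = [v for v in sample if v in kg_entities]
    coverage = len(matched) / len(sample) if sample else 0.0

    # One link per matched entity, pointing from the dataset to the entity.
    links = [
        {"source": dataset_id, "column": column_name, "target": entity}
        for entity in matched
    ]
    return {
        "links": links,
        "coverage": coverage,
        "needs_review": coverage < low_coverage_threshold,
    }


result = auto_link_column(
    "ds-1", "gene_symbol",
    ["TREM2", "APOE", "XYZ", "ABC"],
    kg_entities={"TREM2"},
)
```

With one of four sampled values matched, the column gets a single link, 25% coverage, and is flagged for review.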

UI Components

  • KG entity page: "Related Datasets" section with table of matching datasets
  • Dataset detail page: "KG Coverage" section per linked column
  • Click-through: clicking a dataset row for an entity navigates to the dataset filtered to that entity
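
One way the click-through target could be built is a dataset URL carrying the column and entity as query parameters. The /datasets/<id> route and the column and value parameter names are hypothetical; the actual routes in api.py are not described here.

```python
from urllib.parse import quote, urlencode


def dataset_filter_url(dataset_id, column_name, entity_name):
    """Build a link to a dataset view filtered to one entity in one column."""
    # urlencode escapes spaces and reserved characters in names.
    query = urlencode({"column": column_name, "value": entity_name})
    return f"/datasets/{quote(dataset_id, safe='')}?{query}"


url = dataset_filter_url("ds-1", "gene_symbol", "TREM2")
```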

Acceptance Criteria

  ☑ get_datasets_for_entity() returns datasets containing a given entity
  ☑ get_kg_context_for_dataset() returns KG coverage per linked column
  ☐ Auto-linking runs on tabular dataset registration
  ☐ Coverage statistics stored in dataset metadata
  ☐ KG entity pages show "Related Datasets" section
  ☐ Dataset pages show "KG Coverage" per column
  ☐ Click-through navigation between dataset and KG views
  ☐ Work log updated with timestamped entry

Dependencies

  • atl-ds-01-GROW (KG edge extraction pipeline)

Dependents

  • d16-22-TABD0001 (demo: gene expression cross-referencing)

Work Log

Payload JSON
{
  "requirements": {
    "coding": 5,
    "analysis": 5,
    "safety": 9
  },
  "completion_shas": [
    "915c9692b",
    "520dbcc2e"
  ],
  "completion_shas_checked_at": ""
}
