[Artifacts] Tabular dataset support: schema tracking, column metadata, linkage to KG
done, analysis:6, coding:7

Support tabular datasets as first-class artifacts. Store column schemas (name, dtype, description, linked_entity_type) in the artifact's metadata JSON. Implement register_tabular_dataset(title, columns, source, row_count), which creates an artifact with type=tabular_dataset. Allow individual columns to be linked to KG entities (e.g. the column gene_symbol maps to entity type gene). Tabular datasets contribute to the world model by providing structured observations that complement graph edges. Depends on: a17-19-TYPE0001, a17-21-EXTD0001.

Git Commits (1)

[Atlas] Tabular dataset schema tracking, entity linkage queries [task:a17-22-TABL0001] 2026-04-24
Spec File

[Artifacts] Tabular dataset support: schema tracking, column metadata, linkage to KG

Goal

Tabular datasets (CSV, TSV, Parquet) are a major form of scientific data that complement the knowledge graph. A gene expression matrix, a clinical trial results table, or a protein interaction screen all contain structured observations that should be queryable alongside graph data. This task makes tabular datasets first-class citizens with column-level schema tracking and entity linking.

Design

Column Schema Format

Each tabular dataset artifact stores column definitions in its metadata:

{
  "columns": [
    {
      "name": "gene_symbol",
      "dtype": "string",
      "description": "HGNC gene symbol",
      "linked_entity_type": "gene",
      "linked_entity_resolver": "hgnc",
      "sample_values": ["TREM2", "APOE", "APP"]
    },
    {
      "name": "cell_type",
      "dtype": "string", 
      "description": "Cell type classification",
      "linked_entity_type": "cell_type",
      "sample_values": ["microglia", "astrocyte", "neuron"]
    },
    {
      "name": "log_fold_change",
      "dtype": "float",
      "description": "Log2 fold change vs control",
      "linked_entity_type": null
    },
    {
      "name": "p_value",
      "dtype": "float",
      "description": "Adjusted p-value (BH correction)",
      "linked_entity_type": null
    }
  ],
  "row_count": 15000,
  "source": "derived",
  "parent_dataset_id": "dataset-allen_brain-SEA-AD",
  "format": "csv"
}
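A column list in this format could be sanity-checked before registration. The sketch below is illustrative only: the helper name validate_columns and the REQUIRED_KEYS tuple are assumptions (the four keys are taken from the acceptance criteria), and the spec itself does not define a validator.

```python
# Hypothetical helper, not part of the spec: checks that each column
# definition carries the four keys named in the acceptance criteria.
REQUIRED_KEYS = ("name", "dtype", "description", "linked_entity_type")


def validate_columns(columns):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    seen = set()
    for i, col in enumerate(columns):
        label = col.get("name", f"<column {i}>")
        for key in REQUIRED_KEYS:
            if key not in col:
                problems.append(f"{label}: missing '{key}'")
        if label in seen:
            problems.append(f"{label}: duplicate column name")
        seen.add(label)
        # sample_values is optional, but must be a list when present
        if not isinstance(col.get("sample_values", []), list):
            problems.append(f"{label}: sample_values must be a list")
    return problems


columns = [
    {"name": "gene_symbol", "dtype": "string",
     "description": "HGNC gene symbol", "linked_entity_type": "gene",
     "sample_values": ["TREM2", "APOE", "APP"]},
    {"name": "p_value", "dtype": "float",
     "description": "Adjusted p-value (BH correction)",
     "linked_entity_type": None},
]
print(validate_columns(columns))
```

An unlinked column still needs an explicit linked_entity_type of null (None in Python), which matches the example metadata above.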

Entity Linking

Columns with a non-null linked_entity_type create a bridge to the KG:
  • On registration, for each linked column, create artifact_links from the tabular_dataset artifact to the KG entities that match the column's sample_values
  • When browsing a KG entity (e.g., gene "TREM2"), surface tabular datasets that contain it
  • When browsing a tabular dataset, show KG context for linked columns
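The registration-time linking step could look roughly like the following. This is a sketch under stated assumptions: build_entity_links, the link record fields, the "mentions" link type, and the kg_entities set (a stand-in for a real KG lookup) are all hypothetical names, not part of the spec.

```python
def build_entity_links(dataset_id, columns, kg_entities):
    """Return one link record per sample value that matches a known KG entity.

    kg_entities is an assumed stand-in for a KG lookup: a set of
    (entity_type, entity_name) tuples.
    """
    links = []
    for col in columns:
        entity_type = col.get("linked_entity_type")
        if not entity_type:
            continue  # unlinked column (e.g. log_fold_change)
        for value in col.get("sample_values", []):
            if (entity_type, value) not in kg_entities:
                continue  # sample value has no matching KG entity
            links.append({
                "source_id": dataset_id,
                "entity_type": entity_type,
                "entity_name": value,
                "column": col["name"],
                "link_type": "mentions",  # assumed link type
            })
    return links


columns = [
    {"name": "gene_symbol", "dtype": "string",
     "linked_entity_type": "gene",
     "sample_values": ["TREM2", "APOE", "APP"]},
    {"name": "log_fold_change", "dtype": "float",
     "linked_entity_type": None},
]
kg_entities = {("gene", "TREM2"), ("gene", "APOE")}  # toy KG contents

links = build_entity_links("tab-001", columns, kg_entities)
for link in links:
    print(link)
```

Because only sample_values are inspected, links built this way cover a sample of the dataset's entities rather than every row.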

Implementation

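    register_tabular_dataset(title, columns, source, row_count, parent_dataset_id, format)

    def register_tabular_dataset(title, columns, source="derived",
                                  row_count=None, parent_dataset_id=None,
                                  format="csv"):
        """Register a tabular dataset artifact and return its artifact ID."""
        # Distinct entity types across linked columns, in first-seen order,
        # so two gene columns do not record "gene" twice.
        linked_entity_types = list(dict.fromkeys(
            c["linked_entity_type"]
            for c in columns
            if c.get("linked_entity_type")
        ))
        metadata = {
            "columns": columns,
            "row_count": row_count,
            "source": source,
            "parent_dataset_id": parent_dataset_id,
            "format": format,
            "linked_entity_types": linked_entity_types,
        }
        # register_artifact and create_link are assumed to be provided by
        # the artifact registry.
        artifact_id = register_artifact(
            artifact_type="tabular_dataset",
            title=title,
            metadata=metadata,
            quality_score=0.7,  # default score for newly registered datasets
        )
        # Auto-link to parent dataset if specified
        if parent_dataset_id:
            create_link(artifact_id, parent_dataset_id, "derives_from")
        return artifact_id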

    get_datasets_for_entity(entity_name, entity_type) → list
    Query artifact_links to find tabular datasets containing a given entity.
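One way this lookup might be written, shown against an in-memory SQLite table. The artifact_links column layout and the extra conn parameter are assumptions made so the sketch is self-contained; the real table schema is not given in this spec.

```python
import sqlite3


def get_datasets_for_entity(conn, entity_name, entity_type):
    """Return IDs of datasets linked to the given entity, sorted by ID."""
    rows = conn.execute(
        """
        SELECT DISTINCT source_id
        FROM artifact_links
        WHERE entity_name = ? AND entity_type = ?
        ORDER BY source_id
        """,
        (entity_name, entity_type),
    ).fetchall()
    return [row[0] for row in rows]


# Toy data with an assumed table layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifact_links "
    "(source_id TEXT, entity_type TEXT, entity_name TEXT, column_name TEXT)"
)
conn.executemany(
    "INSERT INTO artifact_links VALUES (?, ?, ?, ?)",
    [
        ("tab-001", "gene", "TREM2", "gene_symbol"),
        ("tab-001", "gene", "APOE", "gene_symbol"),
        ("tab-002", "gene", "TREM2", "gene"),
        ("tab-002", "cell_type", "microglia", "cell_type"),
    ],
)
print(get_datasets_for_entity(conn, "TREM2", "gene"))
```

A real implementation would also need to restrict results to artifacts of type tabular_dataset; the toy table here contains only such rows.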

    get_entity_context_for_dataset(dataset_id) → dict
    For each linked column, return matching KG entities with their graph neighborhoods.
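A rough sketch of the per-column context lookup, using plain dicts and an edge list in place of the real artifact store and KG. The datasets and kg_edges parameters, the one-hop definition of "neighborhood", and the shape of the returned dict are all assumptions for illustration.

```python
def get_entity_context_for_dataset(dataset_id, datasets, kg_edges):
    """Map each linked column to its sample entities and their 1-hop neighbors.

    datasets: dict of dataset_id -> metadata (assumed in-memory store).
    kg_edges: iterable of (source, relation, target) triples (assumed KG view).
    """
    context = {}
    for col in datasets[dataset_id]["columns"]:
        entity_type = col.get("linked_entity_type")
        if not entity_type:
            continue  # skip unlinked columns
        per_entity = {}
        for value in col.get("sample_values", []):
            # Neighbors in either edge direction, deduplicated and sorted.
            neighbors = sorted(
                {dst for src, _rel, dst in kg_edges if src == value}
                | {src for src, _rel, dst in kg_edges if dst == value}
            )
            if neighbors:
                per_entity[value] = neighbors
        context[col["name"]] = {
            "entity_type": entity_type,
            "entities": per_entity,
        }
    return context


# Toy data for illustration only.
datasets = {
    "tab-001": {"columns": [
        {"name": "gene_symbol", "linked_entity_type": "gene",
         "sample_values": ["TREM2", "APOE", "APP"]},
        {"name": "log_fold_change", "linked_entity_type": None},
    ]}
}
kg_edges = [
    ("TREM2", "expressed_in", "microglia"),
    ("APOE", "associated_with", "Alzheimer's disease"),
]
context = get_entity_context_for_dataset("tab-001", datasets, kg_edges)
print(context)
```

Sample values with no edges in the toy KG (APP here) are omitted from the result rather than returned with an empty neighborhood; either convention would satisfy the description above.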

    Acceptance Criteria

    register_tabular_dataset() creates artifact with type=tabular_dataset
    ☐ Column schema stored in metadata with name, dtype, description, linked_entity_type
    ☐ Auto-linking to parent dataset via derives_from
    get_datasets_for_entity() returns datasets containing an entity
    get_entity_context_for_dataset() returns KG context per linked column
    ☐ Sample values stored for entity resolution
    ☐ Work log updated with timestamped entry

    Dependencies

    • a17-19-TYPE0001 (tabular_dataset type must be registered)
    • a17-21-EXTD0001 (parent datasets may be external references)

    Dependents

    • atl-ds-01-GROW (KG edge extraction from tabular data)
    • d16-22-TABD0001 (demo: gene expression table)

    Work Log

    Payload JSON
    {
      "requirements": {
        "coding": 7,
        "analysis": 6
      }
    }
