[Artifacts] Tabular dataset support: schema tracking, column metadata, linkage to KG
Goal
Tabular datasets (CSV, TSV, Parquet) are a major form of scientific data that complement the knowledge graph. A gene expression matrix, a clinical trial results table, or a protein interaction screen all contain structured observations that should be queryable alongside graph data. This task makes tabular datasets first-class citizens with column-level schema tracking and entity linking.
Design
Column Schema Format
Each tabular dataset artifact stores column definitions in its metadata:
{
  "columns": [
    {
      "name": "gene_symbol",
      "dtype": "string",
      "description": "HGNC gene symbol",
      "linked_entity_type": "gene",
      "linked_entity_resolver": "hgnc",
      "sample_values": ["TREM2", "APOE", "APP"]
    },
    {
      "name": "cell_type",
      "dtype": "string",
      "description": "Cell type classification",
      "linked_entity_type": "cell_type",
      "sample_values": ["microglia", "astrocyte", "neuron"]
    },
    {
      "name": "log_fold_change",
      "dtype": "float",
      "description": "Log2 fold change vs control",
      "linked_entity_type": null
    },
    {
      "name": "p_value",
      "dtype": "float",
      "description": "Adjusted p-value (BH correction)",
      "linked_entity_type": null
    }
  ],
  "row_count": 15000,
  "source": "derived",
  "parent_dataset_id": "dataset-allen_brain-SEA-AD",
  "format": "csv"
}
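A minimal sketch of how a column list in this format could be checked before registration. The required keys follow the acceptance criteria below; the helper name `validate_columns` and the `ALLOWED_DTYPES` set are assumptions for illustration (the example above only shows "string" and "float").

```python
# Hypothetical pre-registration check; not part of the existing codebase.
ALLOWED_DTYPES = {"string", "float", "int", "bool"}  # assumed set of dtypes
REQUIRED_KEYS = ("name", "dtype", "description", "linked_entity_type")


def validate_columns(columns):
    """Return a list of problems found; an empty list means the schema passed."""
    problems = []
    seen_names = set()
    for index, col in enumerate(columns):
        label = col.get("name") or f"<column {index}>"
        # Every column must carry the four keys named in the acceptance criteria.
        for key in REQUIRED_KEYS:
            if key not in col:
                problems.append(f"{label}: missing '{key}'")
        dtype = col.get("dtype")
        if dtype is not None and dtype not in ALLOWED_DTYPES:
            problems.append(f"{label}: unknown dtype '{dtype}'")
        if label in seen_names:
            problems.append(f"{label}: duplicate column name")
        seen_names.add(label)
        # Linked columns need sample values, since those drive entity resolution.
        if col.get("linked_entity_type") and not col.get("sample_values"):
            problems.append(f"{label}: linked column has no sample_values")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every schema issue in a single pass.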
Entity Linking
Columns with linked_entity_type create a bridge to the KG:
- On registration, for each linked column, create artifact_links from the tabular_dataset to KG entities matching sample values
- When browsing a KG entity (e.g., gene "TREM2"), surface tabular datasets that contain it
- When browsing a tabular dataset, show KG context for linked columns
Implementation
register_tabular_dataset(title, columns, source, row_count, **kwargs)
def register_tabular_dataset(title, columns, source="derived",
                             row_count=None, parent_dataset_id=None,
                             format="csv"):
    metadata = {
        "columns": columns,
        "row_count": row_count,
        "source": source,
        "parent_dataset_id": parent_dataset_id,
        "format": format,
        "linked_entity_types": [c["linked_entity_type"]
                                for c in columns
                                if c.get("linked_entity_type")]
    }
    artifact_id = register_artifact(
        artifact_type="tabular_dataset",
        title=title,
        metadata=metadata,
        quality_score=0.7
    )
    # Auto-link to parent dataset if specified
    if parent_dataset_id:
        create_link(artifact_id, parent_dataset_id, "derives_from")
    return artifact_id
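The function above links only to the parent dataset; the registration-time entity linking described under Entity Linking is not shown. One way that step might look is sketched here. The helper name `link_columns_to_entities`, the `find_entity` lookup, and the "mentions" link type are all assumptions; `create_link` is passed in and is assumed to take the same three positional arguments as in the call above.

```python
# Hypothetical helper; find_entity and the "mentions" link type are assumptions.
def link_columns_to_entities(artifact_id, columns, find_entity, create_link):
    """Link a dataset to KG entities that match sample values of linked columns.

    find_entity(value, entity_type) should return an entity ID or None.
    create_link(source_id, target_id, link_type) should persist one link.
    Returns the (column name, sample value, entity ID) triples that were linked.
    """
    created = []
    for col in columns:
        entity_type = col.get("linked_entity_type")
        if not entity_type:
            continue  # unlinked columns (e.g. p_value) are skipped
        for value in col.get("sample_values", []):
            entity_id = find_entity(value, entity_type)
            if entity_id is None:
                continue  # sample value has no matching KG entity
            create_link(artifact_id, entity_id, "mentions")
            created.append((col["name"], value, entity_id))
    return created
```

Passing the lookup and link functions as arguments keeps the sketch independent of the real KG and artifact store, so it can be exercised with plain dictionaries and lists.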
get_datasets_for_entity(entity_name, entity_type) → list
Query artifact_links to find tabular datasets containing a given entity.
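A rough sketch of such a query against SQLite, for illustration only. The table and column names (`artifacts`, `artifact_links`, `source_artifact_id`, `target_artifact_id`), the extra `conn` parameter, and the `<entity_type>-<entity_name>` entity ID convention are assumptions, not the actual schema.

```python
import sqlite3


def get_datasets_for_entity(conn, entity_name, entity_type):
    """Return tabular datasets that have a link to the given KG entity."""
    entity_id = f"{entity_type}-{entity_name}"  # assumed entity ID convention
    rows = conn.execute(
        """
        SELECT DISTINCT a.id, a.title
        FROM artifact_links AS l
        JOIN artifacts AS a ON a.id = l.source_artifact_id
        WHERE l.target_artifact_id = ?
          AND a.artifact_type = 'tabular_dataset'
        ORDER BY a.id
        """,
        (entity_id,),
    ).fetchall()
    return [{"id": row[0], "title": row[1]} for row in rows]


def demo_connection():
    """Build an in-memory database with the assumed schema and a few rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE artifacts (id TEXT PRIMARY KEY, artifact_type TEXT, title TEXT);
        CREATE TABLE artifact_links (
            source_artifact_id TEXT, target_artifact_id TEXT, link_type TEXT
        );
        INSERT INTO artifacts VALUES ('tab-1', 'tabular_dataset', 'Microglia DE genes');
        INSERT INTO artifacts VALUES ('fig-1', 'figure', 'Volcano plot');
        INSERT INTO artifact_links VALUES ('tab-1', 'gene-TREM2', 'mentions');
        INSERT INTO artifact_links VALUES ('fig-1', 'gene-TREM2', 'mentions');
        """
    )
    return conn
```

Filtering on artifact_type in the join keeps other artifact kinds that link to the same entity (the figure in the demo data) out of the result.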
get_entity_context_for_dataset(dataset_id) → dict
For each linked column, return matching KG entities with their graph neighborhoods.
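The shape of the returned dict is not specified here, so the following is one possible layout rather than the intended one. This sketch takes the column list directly instead of a dataset_id (it assumes the caller has already loaded the artifact metadata), and `resolve_entity` and `neighbors` are hypothetical callables standing in for the KG lookup and graph traversal.

```python
# Hypothetical sketch; the return layout and both callables are assumptions.
def get_entity_context_for_dataset(columns, resolve_entity, neighbors):
    """Map each linked column name to its matched KG entities and their neighbors.

    resolve_entity(value, entity_type) returns an entity ID or None.
    neighbors(entity_id) returns an iterable of adjacent entity IDs.
    """
    context = {}
    for col in columns:
        entity_type = col.get("linked_entity_type")
        if not entity_type:
            continue  # only linked columns get KG context
        entities = []
        for value in col.get("sample_values", []):
            entity_id = resolve_entity(value, entity_type)
            if entity_id is None:
                continue
            entities.append({
                "value": value,
                "entity_id": entity_id,
                # Sorted so the output is stable regardless of graph storage order.
                "neighbors": sorted(neighbors(entity_id)),
            })
        context[col["name"]] = {"entity_type": entity_type, "entities": entities}
    return context
```

Only direct neighbors are collected in this sketch; a deeper neighborhood would need a depth-limited traversal in place of the single `neighbors` call.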
Acceptance Criteria
☐ register_tabular_dataset() creates artifact with type=tabular_dataset
☐ Column schema stored in metadata with name, dtype, description, linked_entity_type
☐ Auto-linking to parent dataset via derives_from
☐ get_datasets_for_entity() returns datasets containing an entity
☐ get_entity_context_for_dataset() returns KG context per linked column
☐ Sample values stored for entity resolution
☐ Work log updated with timestamped entry
Dependencies
- a17-19-TYPE0001 (tabular_dataset type must be registered)
- a17-21-EXTD0001 (parent datasets may be external references)
Dependents
- atl-ds-01-GROW (KG edge extraction from tabular data)
- d16-22-TABD0001 (demo: gene expression table)
Work Log