[Atlas] Dataset-driven knowledge growth: extract KG edges from tabular datasets
Goal
Papers are currently the primary source of knowledge graph (KG) edges, but structured datasets (gene expression matrices, protein interaction screens, clinical trial tables) contain dense, high-confidence observations that could substantially expand the KG. This task builds a pipeline that automatically extracts KG edges from registered tabular datasets.
Design
Extraction Templates
Define reusable templates that map dataset schemas to KG edge patterns:
EXTRACTION_TEMPLATES = {
    "gene_expression": {
        "description": "Extract expressed_in edges from gene expression data",
        "required_columns": {"gene": "gene", "tissue_or_cell": "cell_type|tissue|brain_region"},
        "optional_columns": {"expression_level": "float", "p_value": "float"},
        "edge_pattern": {
            "source_type": "gene",
            "source_column": "gene",
            "relation": "expressed_in",
            "target_type": "cell_type",
            "target_column": "tissue_or_cell",
            "weight_column": "expression_level",
            "confidence_column": "p_value"
        }
    },
    "protein_interaction": {
        "description": "Extract interacts_with edges from PPI data",
        "required_columns": {"protein_a": "protein", "protein_b": "protein"},
        "optional_columns": {"interaction_score": "float", "method": "string"},
        "edge_pattern": {
            "source_type": "protein",
            "source_column": "protein_a",
            "relation": "interacts_with",
            "target_type": "protein",
            "target_column": "protein_b",
            "weight_column": "interaction_score"
        }
    },
    "disease_gene_association": {
        "description": "Extract associated_with edges from GWAS/association data",
        "required_columns": {"gene": "gene", "disease": "disease"},
        "optional_columns": {"odds_ratio": "float", "p_value": "float"},
        "edge_pattern": {
            "source_type": "gene",
            "source_column": "gene",
            "relation": "associated_with",
            "target_type": "disease",
            "target_column": "disease",
            "weight_column": "odds_ratio",
            "confidence_column": "p_value"
        }
    }
}
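The templates above declare which semantic types each role needs, so matching a dataset against a template comes down to comparing its column schema with `required_columns`. A minimal sketch of that check follows. The function name `match_template_columns`, the shape of `dataset_schema` (column name mapped to semantic type), and reading `|` as a separator between alternative accepted types are all assumptions made for illustration; the spec does not define them.

```python
def match_template_columns(dataset_schema, required_columns):
    """Map each template role to a dataset column with an accepted type.

    dataset_schema: dict of column name -> semantic type (assumed shape).
    required_columns: dict of role -> accepted types, "|"-separated
        (assumed interpretation of the template syntax).
    Returns (mapping, missing_roles).
    """
    mapping, missing = {}, []
    used = set()  # a dataset column may fill only one role
    for role, accepted in required_columns.items():
        accepted_types = set(accepted.split("|"))
        match = next(
            (col for col, sem_type in dataset_schema.items()
             if sem_type in accepted_types and col not in used),
            None,
        )
        if match is None:
            missing.append(role)
        else:
            mapping[role] = match
            used.add(match)
    return mapping, missing


# Hypothetical dataset schema, for illustration only.
schema = {"symbol": "gene", "region": "brain_region", "tpm": "float"}
required = {"gene": "gene", "tissue_or_cell": "cell_type|tissue|brain_region"}
mapping, missing = match_template_columns(schema, required)
```

Tracking used columns matters for templates such as protein_interaction, where two roles accept the same type and must not both bind to the same dataset column.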
Extraction Pipeline
def extract_kg_edges_from_dataset(dataset_artifact_id, template_name,
                                  confidence_threshold=0.05):
    """
    1. Load tabular dataset artifact metadata (column schema)
    2. Match columns to template requirements
    3. For each row passing the confidence threshold (confidence_column
       holds p-values in the templates above, so lower values mean
       higher confidence):
        a. Resolve source entity against existing KG nodes
        b. Resolve target entity against existing KG nodes
        c. Create KG edge with weight and provenance
    4. Register extracted edges as versioned KG edge artifacts
    5. Link edges to source dataset via artifact_links (derives_from)
    6. Return extraction summary (edges created, entities resolved, etc.)
    """
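Step 3 of the docstring above could look roughly like the sketch below, which works on in-memory rows rather than real artifacts. The helper name `extract_edges`, the `resolve` callable, the row format (a list of dicts), and the summary keys are hypothetical. It also assumes the confidence column is a p-value, so a row is skipped when its value exceeds the threshold; whether the comparison should be strict is left open by the spec.

```python
def extract_edges(rows, edge_pattern, resolve, confidence_threshold=0.05):
    """Filter rows, resolve both endpoints, and build edge dicts."""
    edges, unresolved, skipped = [], [], 0
    conf_col = edge_pattern.get("confidence_column")
    weight_col = edge_pattern.get("weight_column")
    for row in rows:
        # Assumption: confidence is a p-value, so smaller is better.
        if conf_col and row.get(conf_col) is not None:
            if row[conf_col] > confidence_threshold:
                skipped += 1
                continue
        source = resolve(row[edge_pattern["source_column"]],
                         edge_pattern["source_type"])
        target = resolve(row[edge_pattern["target_column"]],
                         edge_pattern["target_type"])
        if source is None or target is None:
            unresolved.append(row)  # kept for manual review
            continue
        edges.append({
            "source": source,
            "relation": edge_pattern["relation"],
            "target": target,
            "weight": row.get(weight_col) if weight_col else None,
        })
    summary = {
        "edges_created": len(edges),
        "rows_skipped": skipped,
        "rows_unresolved": len(unresolved),
    }
    return edges, summary


# Toy data with placeholder identifiers, not real observations.
pattern = {
    "source_type": "gene", "source_column": "gene",
    "relation": "associated_with",
    "target_type": "disease", "target_column": "disease",
    "weight_column": "odds_ratio", "confidence_column": "p_value",
}
known = {("GENE_A", "gene"): "gene:GENE_A",
         ("disease_x", "disease"): "disease:disease_x"}
rows = [
    {"gene": "GENE_A", "disease": "disease_x", "odds_ratio": 2.5, "p_value": 0.001},
    {"gene": "GENE_A", "disease": "disease_x", "odds_ratio": 1.1, "p_value": 0.4},
    {"gene": "GENE_B", "disease": "disease_x", "odds_ratio": 2.0, "p_value": 0.01},
]
edges, summary = extract_edges(rows, pattern, lambda name, t: known.get((name, t)))
```

Templates without a `confidence_column`, such as protein_interaction, pass every row through the filter in this sketch; a real pipeline might want a separate rule for them.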
Entity Resolution
When extracting edges, map column values to existing KG entities:
- Exact match on entity name/identifier
- Fuzzy match with confidence score (for synonyms, abbreviations)
- Create a new entity if no match is found and confidence is high (threshold configurable)
- Log unresolved entities for manual review
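The exact-then-fuzzy order in the list above can be sketched with the standard library's `difflib`. The function name `resolve_entity`, the `fuzzy_cutoff` default of 0.85, and the case-insensitive exact match are assumptions for illustration. Plain string similarity only catches near-spellings; true synonyms and abbreviations would likely need a curated alias table, which this sketch does not attempt.

```python
from difflib import SequenceMatcher


def resolve_entity(value, kg_names, fuzzy_cutoff=0.85):
    """Return (entity_name, score, method).

    method is "exact", "fuzzy", or "unresolved"; entity_name is None
    when nothing scores at or above fuzzy_cutoff.
    """
    normalized = value.strip().lower()
    by_lower = {name.lower(): name for name in kg_names}
    # Exact match (case-insensitive here, which is an assumption).
    if normalized in by_lower:
        return by_lower[normalized], 1.0, "exact"
    # Fuzzy match: keep the best-scoring candidate.
    best_name, best_score = None, 0.0
    for lower, name in by_lower.items():
        score = SequenceMatcher(None, normalized, lower).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= fuzzy_cutoff:
        return best_name, best_score, "fuzzy"
    return None, best_score, "unresolved"


kg_names = ["Microglia", "Astrocyte"]
exact = resolve_entity("microglia", kg_names)
fuzzy = resolve_entity("astrocytes", kg_names)
unresolved = resolve_entity("neuron", kg_names)
```

Returning the score alongside the match lets the caller log low-scoring candidates for manual review instead of silently dropping them.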
Versioning
Each extraction run creates versioned KG edge artifacts:
- Re-running extraction on the same dataset creates a new version
- If dataset is updated (new version), re-extraction picks up changes
- Provenance chain: dataset v2 → extraction run → KG edges v2
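To make the versioning behaviour concrete, here is a small in-memory sketch in which every extraction run appends a new version instead of overwriting or duplicating earlier edges. `EdgeSetVersion`, `EdgeRegistry`, and their fields are hypothetical names; the real artifact registry and `artifact_links` table are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeSetVersion:
    version: int
    dataset_id: str        # the dataset these edges derive from
    dataset_version: int   # provenance: which dataset version was read
    edges: frozenset       # (source, relation, target) triples


@dataclass
class EdgeRegistry:
    # (dataset_id, template_name) -> list of EdgeSetVersion, oldest first
    versions: dict = field(default_factory=dict)

    def register(self, dataset_id, dataset_version, template_name, edges):
        history = self.versions.setdefault((dataset_id, template_name), [])
        # frozenset removes duplicate triples within a single run.
        new = EdgeSetVersion(len(history) + 1, dataset_id,
                             dataset_version, frozenset(edges))
        history.append(new)
        return new

    def latest(self, dataset_id, template_name):
        return self.versions[(dataset_id, template_name)][-1]


# Placeholder identifiers, for illustration only.
registry = EdgeRegistry()
v1 = registry.register("ds-1", 1, "protein_interaction",
                       [("P1", "interacts_with", "P2")])
v2 = registry.register("ds-1", 2, "protein_interaction",
                       [("P1", "interacts_with", "P2"),
                        ("P1", "interacts_with", "P3")])
added = v2.edges - v1.edges  # edges introduced by the dataset update
```

Keeping each version as an immutable set makes the difference between two runs a plain set operation, which is one way to report what a re-extraction changed.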
Acceptance Criteria
☐ At least 3 extraction templates defined (gene_expression, protein_interaction, disease_gene_association)
☐ Pipeline matches dataset columns to template requirements
☐ Entity resolution against existing KG nodes (exact + fuzzy)
☐ Confidence threshold filtering (skip low-confidence rows)
☐ Extracted edges registered as versioned artifacts
☐ Provenance links: edges derives_from dataset
☐ Extraction summary returned (counts, unresolved entities)
☐ Re-extraction creates new version, not duplicates
☐ Work log updated with timestamped entry
Dependencies
- a17-22-TABL0001 (tabular dataset support with column schemas)
Dependents
- atl-ds-02-XREF (bidirectional cross-referencing)
- d16-22-TABD0001 (demo exercises this pipeline)
Work Log