SciDEX — Task: [Atlas] Dataset-driven knowledge growth: extract K

Build a pipeline that takes registered tabular datasets and extracts KG edges from them. For example, a gene expression dataset with columns (gene, tissue, expression_level) can generate edges like (gene)-[expressed_in]->(tissue) with expression_level as edge weight. Define extraction templates per dataset schema pattern. Each extraction run creates versioned KG edge artifacts linked to the source dataset. This lets us grow the knowledge graph from structured data, not just papers. Depends on: a17-22-TABL0001 (Artifacts quest).

Git Commits (16)

[Atlas] Cross-reference tabular datasets with KG: bidirectional entity linking [task:atl-ds-02-XREF] 2026-04-26

[Atlas] Work log: verify complete and update spec status [task:atl-ds-02-XREF] 2026-04-26

[Atlas] api.py + registry dataset-KG bidirectional cross-links [task:atl-ds-02-XREF] 2026-04-26

[Atlas] Restore api.py dataset-KG cross-reference wiring [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Resolve merge conflicts and push cross-reference linking [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Implement dataset-KG bidirectional cross-reference linking — complete, tests passing [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Work log: push resolution and branch state [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Ignore .orchestra/audit/ directory [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Resolve .orchestra-slot.json from rebase [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Work log: branch cleanup [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Implement dataset-KG bidirectional cross-reference linking [task:atl-ds-02-XREF] 2026-04-25

[Verify] Dataset-driven KG extraction already resolved on main [task:atl-ds-01-GROW] 2026-04-25

[Atlas] Work log: re-verify after rebase to current main [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Work log: verify rebase + push complete [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Implement dataset-KG bidirectional cross-reference linking [task:atl-ds-02-XREF] 2026-04-25

[Atlas] Implement dataset-KG cross-reference: bidirectional linking [task:atl-ds-02-XREF] 2026-04-03

Spec File

[Atlas] Dataset-driven knowledge growth: extract KG edges from tabular datasets

Goal

Papers are currently the primary source of knowledge graph edges. But structured datasets (gene expression matrices, protein interaction screens, clinical trial tables) contain dense, high-confidence observations that can dramatically expand the KG. This task builds a pipeline to automatically extract KG edges from registered tabular datasets.

Design

Extraction Templates

Define reusable templates that map dataset schemas to KG edge patterns:

EXTRACTION_TEMPLATES = {
    "gene_expression": {
        "description": "Extract expressed_in edges from gene expression data",
        "required_columns": {"gene": "gene", "tissue_or_cell": "cell_type|tissue|brain_region"},
        "optional_columns": {"expression_level": "float", "p_value": "float"},
        "edge_pattern": {
            "source_type": "gene",
            "source_column": "gene",
            "relation": "expressed_in",
            "target_type": "cell_type",
            "target_column": "tissue_or_cell",
            "weight_column": "expression_level",
            "confidence_column": "p_value"
        }
    },
    "protein_interaction": {
        "description": "Extract interacts_with edges from PPI data",
        "required_columns": {"protein_a": "protein", "protein_b": "protein"},
        "optional_columns": {"interaction_score": "float", "method": "string"},
        "edge_pattern": {
            "source_type": "protein",
            "source_column": "protein_a",
            "relation": "interacts_with",
            "target_type": "protein",
            "target_column": "protein_b",
            "weight_column": "interaction_score"
        }
    },
    "disease_gene_association": {
        "description": "Extract associated_with edges from GWAS/association data",
        "required_columns": {"gene": "gene", "disease": "disease"},
        "optional_columns": {"odds_ratio": "float", "p_value": "float"},
        "edge_pattern": {
            "source_type": "gene",
            "source_column": "gene",
            "relation": "associated_with",
            "target_type": "disease",
            "target_column": "disease",
            "weight_column": "odds_ratio",
            "confidence_column": "p_value"
        }
    }
}
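
One way the pipeline might match a dataset's columns to a template's required_columns is sketched below. It assumes the dataset schema is available as a plain dict of column name to semantic type, and that a "|" in a template value lists alternative accepted types; both are readings of the templates above rather than confirmed behavior. The match_columns helper is hypothetical, not part of the existing codebase.

```python
from typing import Dict, Optional


def match_columns(schema: Dict[str, str],
                  required: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Map each template role to a distinct dataset column of an accepted type.

    schema:   {column_name: semantic_type} (assumed shape of the column schema)
    required: {role: "type_a|type_b"} as in a template's required_columns
    Returns {role: column_name}, or None if any required role has no match.
    """
    mapping: Dict[str, str] = {}
    used = set()
    for role, type_pattern in required.items():
        accepted = set(type_pattern.split("|"))
        candidates = [col for col, col_type in schema.items()
                      if col_type in accepted and col not in used]
        if not candidates:
            return None  # template does not apply to this dataset
        # Prefer a column literally named after the role, else the first match.
        chosen = role if role in candidates else candidates[0]
        mapping[role] = chosen
        used.add(chosen)
    return mapping


# Toy schemas for illustration only.
expr_schema = {"gene_symbol": "gene", "region": "brain_region", "tpm": "float"}
expr_required = {"gene": "gene", "tissue_or_cell": "cell_type|tissue|brain_region"}
print(match_columns(expr_schema, expr_required))
```

The assignment is greedy and tracks used columns so that a template with two roles of the same type (protein_a and protein_b) binds two different columns. A greedy pass can miss a valid assignment when accepted types overlap between roles, so a fuller implementation might need proper bipartite matching.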

Extraction Pipeline

def extract_kg_edges_from_dataset(dataset_artifact_id, template_name, 
                                   confidence_threshold=0.05):
    """
    1. Load tabular dataset artifact metadata (column schema)
    2. Match columns to template requirements
    3. For each row passing confidence threshold:
       a. Resolve source entity against existing KG nodes
       b. Resolve target entity against existing KG nodes
       c. Create KG edge with weight and provenance
    4. Register extracted edges as versioned KG edge artifacts
    5. Link edges to source dataset via artifact_links (derives_from)
    6. Return extraction summary (edges created, entities resolved, etc.)
    """
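
As a rough illustration of step 3, the sketch below turns in-memory rows into edge dicts using an edge_pattern. It is not the real pipeline: entity resolution, artifact registration, and artifact_links are left out, and extract_edges is a hypothetical name. It also assumes the confidence column holds a p-value, so a row passes when its value is at or below confidence_threshold, and that rows without a confidence value are kept.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def extract_edges(rows: List[Row], edge_pattern: Dict[str, str],
                  confidence_threshold: float = 0.05
                  ) -> Tuple[List[Row], Dict[str, int]]:
    """Build edge dicts from rows and return them with a small summary."""
    conf_col = edge_pattern.get("confidence_column")
    weight_col = edge_pattern.get("weight_column")
    edges: List[Row] = []
    skipped = 0
    for row in rows:
        # Assumption: the confidence column is a p-value, so smaller is better.
        p_value = row.get(conf_col) if conf_col else None
        if p_value is not None and p_value > confidence_threshold:
            skipped += 1
            continue
        edges.append({
            "source": row[edge_pattern["source_column"]],
            "source_type": edge_pattern["source_type"],
            "relation": edge_pattern["relation"],
            "target": row[edge_pattern["target_column"]],
            "target_type": edge_pattern["target_type"],
            "weight": row.get(weight_col) if weight_col else None,
        })
    summary = {"rows_seen": len(rows), "edges_created": len(edges),
               "rows_skipped": skipped}
    return edges, summary


pattern = {
    "source_type": "gene", "source_column": "gene",
    "relation": "expressed_in",
    "target_type": "cell_type", "target_column": "tissue_or_cell",
    "weight_column": "expression_level", "confidence_column": "p_value",
}
# Toy rows with made-up values, for illustration only.
rows = [
    {"gene": "TREM2", "tissue_or_cell": "microglia",
     "expression_level": 8.2, "p_value": 0.001},
    {"gene": "GFAP", "tissue_or_cell": "astrocyte",
     "expression_level": 1.1, "p_value": 0.40},
]
edges, summary = extract_edges(rows, pattern)
print(edges, summary)
```

Whether rows with a missing confidence value should pass or be skipped is a policy choice the spec does not settle; the sketch keeps them so that templates without a confidence_column (such as protein_interaction) still yield edges.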

Entity Resolution

When extracting edges, map column values to existing KG entities:

Exact match on entity name/identifier
Fuzzy match with confidence score (for synonyms, abbreviations)
Create new entity if no match and confidence is high (configurable)
Log unresolved entities for manual review
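
The exact-then-fuzzy fallback above could look something like the following minimal sketch. It uses Python's standard difflib for the fuzzy step, which is a stand-in choice and not necessarily what the project uses. The resolve_entity function, the shape of known_entities (name to entity id), and the 0.85 cutoff are all assumptions made for illustration.

```python
import difflib
from typing import Dict, Optional, Tuple


def resolve_entity(value: str, known_entities: Dict[str, str],
                   fuzzy_cutoff: float = 0.85
                   ) -> Tuple[Optional[str], str, float]:
    """Return (entity_id, method, score); entity_id is None when unresolved."""
    key = value.strip().lower()
    # Case-insensitive index over known names (assumed normalization).
    index = {name.lower(): entity_id
             for name, entity_id in known_entities.items()}
    if key in index:
        return index[key], "exact", 1.0
    # Fuzzy fallback: best candidate whose similarity ratio meets the cutoff.
    close = difflib.get_close_matches(key, list(index), n=1,
                                      cutoff=fuzzy_cutoff)
    if close:
        score = difflib.SequenceMatcher(None, key, close[0]).ratio()
        return index[close[0]], "fuzzy", score
    return None, "unresolved", 0.0


# Hypothetical KG node names and ids.
known = {"APOE": "gene:APOE", "microglia": "cell_type:microglia"}
unresolved_log = []
for raw in ["apoe", "microglial", "TREM2"]:
    entity_id, method, score = resolve_entity(raw, known)
    if entity_id is None:
        unresolved_log.append(raw)  # queue for manual review
    print(raw, entity_id, method)
```

Character-level similarity is risky for short identifiers such as gene symbols, where a one-letter difference can name a different gene, so a stricter cutoff or a synonym table per entity type may be a safer choice there.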

Versioning

Each extraction run creates versioned KG edge artifacts:

Re-running extraction on the same dataset creates a new version
If dataset is updated (new version), re-extraction picks up changes
Provenance chain: dataset v2 → extraction run → KG edges v2
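
The versioning rules above can be illustrated with a small in-memory registry. This is only a sketch under stated assumptions: the real artifact registry and artifact_links storage are not shown in this spec, so EdgeArtifact, EdgeArtifactRegistry, and the "dataset@vN" link format are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class EdgeArtifact:
    dataset_id: str
    dataset_version: int
    template_name: str
    version: int          # version of the extracted edge set
    edges: List[tuple]
    derives_from: str     # hypothetical provenance link, "dataset_id@vN"


class EdgeArtifactRegistry:
    """Keeps one version history per (dataset, template) pair."""

    def __init__(self) -> None:
        self._runs: Dict[Tuple[str, str], List[EdgeArtifact]] = {}

    def register(self, dataset_id: str, dataset_version: int,
                 template_name: str, edges: List[tuple]) -> EdgeArtifact:
        history = self._runs.setdefault((dataset_id, template_name), [])
        artifact = EdgeArtifact(
            dataset_id=dataset_id,
            dataset_version=dataset_version,
            template_name=template_name,
            version=len(history) + 1,  # re-running appends a new version
            edges=list(edges),
            derives_from=f"{dataset_id}@v{dataset_version}",
        )
        history.append(artifact)
        return artifact

    def latest(self, dataset_id: str,
               template_name: str) -> Optional[EdgeArtifact]:
        history = self._runs.get((dataset_id, template_name), [])
        return history[-1] if history else None


registry = EdgeArtifactRegistry()
# Toy edges and a hypothetical dataset id.
first = registry.register("ds-1", 1, "gene_expression",
                          [("TREM2", "expressed_in", "microglia")])
second = registry.register("ds-1", 2, "gene_expression",
                           [("TREM2", "expressed_in", "microglia"),
                            ("GFAP", "expressed_in", "astrocyte")])
print(second.version, second.derives_from)
```

Keying the history by dataset and template means a repeat run supersedes the previous edge set instead of adding a parallel copy, while earlier versions stay available for provenance queries.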

Acceptance Criteria

☐ At least 3 extraction templates defined (gene_expression, protein_interaction, disease_gene_association)

☐ Pipeline matches dataset columns to template requirements

☐ Entity resolution against existing KG nodes (exact + fuzzy)

☐ Confidence threshold filtering (skip low-confidence rows)

☐ Extracted edges registered as versioned artifacts

☐ Provenance links: edges derives_from dataset

☐ Extraction summary returned (counts, unresolved entities)

☐ Re-extraction creates new version, not duplicates

☐ Work log updated with timestamped entry

Dependencies

a17-22-TABL0001 (tabular dataset support with column schemas)

Dependents

atl-ds-02-XREF (bidirectional cross-referencing)
d16-22-TABD0001 (demo exercises this pipeline)

Work Log

Payload JSON

{
  "requirements": {
    "analysis": 5
  }
}

Sibling Tasks in Quest (Atlas)

○ [Atlas] Squad findings bubble-up driver (driver #20) P94

○ [Atlas] Install Dolt server + migrate first dataset (driver #26) P92

○ [Atlas] Dataset PR review & merge driver (driver #27) P92

○ [Atlas] Wiki mermaid LLM regen — 50 pages/run, parallel agents P92

○ [Atlas] Gap closure pipeline — match 500 open gaps to accumulated evidence and resolve P92

○ [Atlas] CI: Drive artifact folder migration backfill P92

○ [Atlas] Versioned tabular datasets — overall coordination quest P90

○ [Atlas] KG ↔ dataset cross-link driver (driver #30) P90

○ [Atlas] CI: Generate semantic metadata for unsummarized artifacts P90

○ [Atlas] PubMed evidence update pipeline P87

Task Dependencies

↓ Referenced by (downstream)

✓ [Atlas] Cross-reference tabular datasets with KG: bidirectional entity linking P83 Atlas

[Atlas] Dataset-driven knowledge growth: extract KG edges from tabular datasets done analysis:5