World Model Multi-Representation Framework — Atlas Layer

Goal

Design a unified architecture for SciDEX's world model that bridges multiple representations of scientific knowledge (knowledge graphs, wiki pages, papers, hypotheses, causal models, notebooks, ontologies). Each representation captures different aspects of understanding — structured relationships, natural language context, primary evidence, predictions, causality, computation, and taxonomic hierarchies. By linking all representations through canonical entity IDs and measuring completeness via composite scores, SciDEX can systematically identify knowledge gaps and direct research toward maximally underexplored regions.

Acceptance Criteria

☐ Define 7 core representation types with their unique strengths
☐ Specify unification layer architecture with canonical entity_id linking
☐ Design world model score formula measuring entity understanding completeness
☐ Define gap detection algorithm identifying representation mismatches
☐ Design artifact registry table schema for cross-representation tracking
☐ Provide examples showing how the framework integrates across layers
☐ Document migration path from current architecture

Approach

1. Representation Types

Define seven complementary knowledge representations:

a. Knowledge Graph (Neo4j)
  • Strengths: Structured relationships, graph traversal, inference chains, path finding
  • Current State: knowledge_edges table in SQLite; planned Neo4j migration
  • Entity Types: gene, protein, pathway, disease, drug, phenotype, cell_type, brain_region
  • Edge Types: REGULATES, CAUSES, INHIBITS, EXPRESSES, ASSOCIATES_WITH, TARGETS
  • Query Patterns: "Find all genes regulating APOE", "Shortest path between MAPT and tau aggregation"
b. Wiki Pages (Markdown)
  • Strengths: Natural language depth, narrative context, synthesis, accessibility
  • Current State: NeuroWiki integration (16K+ pages), /wiki/{entity} route in api.py
  • Content: Background, mechanisms, clinical relevance, controversies, references
  • Linking: WikiLinks [[APOE]] map to entity_id, bidirectional references
c. Papers (PubMed)
  • Strengths: Primary evidence, citations, temporal provenance, experimental detail
  • Current State: papers table (PMID, title, abstract, authors, year, citations)
  • Linking: Papers cite entities; analyses cite papers; hypotheses cite papers
  • Metadata: Journal, impact factor, citation count, MeSH terms
d. Hypotheses (Scored)
  • Strengths: Actionable predictions, testable claims, market-scored confidence
  • Current State: hypotheses table with composite_score (0-1), evidence_for/against
  • Scoring: 10-dimension Exchange scoring (testability, plausibility, novelty, impact, ...)
  • Lifecycle: Generated (Agora) → Scored (Exchange) → Validated (Forge) → Integrated (Atlas)
e. Causal Models
  • Strengths: Directed causal edges with confidence, interventional reasoning, counterfactuals
  • Current State: Some causal edges in knowledge_edges (relation='causes')
  • Enhancement: Add confidence, directionality, evidence_pmids[], extracted_from_analysis_id
  • Query Patterns: "What causes tau aggregation?", "If we inhibit X, what downstream effects?"
f. Notebooks (Jupyter)
  • Strengths: Computational artifacts, reproducible analysis, data visualizations, code provenance
  • Current State: Planned integration (not yet implemented)
  • Storage: Notebooks stored in site/notebooks/, metadata in artifact registry
  • Linking: Notebooks link to analyses that generated them, entities analyzed, datasets used
g. Ontologies
  • Strengths: Canonical type hierarchies, is-a relationships, semantic reasoning
  • Current State: Implicit in entity_type field; no formal ontology
  • Structure: gene > protein > protein_complex > pathway > biological_process > disease
  • Standards: Map to GO (Gene Ontology), DO (Disease Ontology), UBERON (anatomy)
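The seven representations can be pinned down in code as an enumeration whose values match the `artifact_type` strings used later in the artifact registry. A minimal sketch (the `Representation` class name and `STRENGTHS` table are illustrative assumptions, not an existing SciDEX module):

```python
from enum import Enum

class Representation(str, Enum):
    """The seven representation types; values match the artifact_type
    strings used in the artifact_registry schema below."""
    WIKI_PAGE = "wiki_page"
    KG_EDGE = "kg_edge"
    PAPER = "paper"
    HYPOTHESIS = "hypothesis"
    CAUSAL_EDGE = "causal_edge"
    NOTEBOOK = "notebook"
    ONTOLOGY_ENTRY = "ontology_entry"

# Each representation's primary strength, e.g. for dashboards or logs
STRENGTHS = {
    Representation.WIKI_PAGE: "natural language depth",
    Representation.KG_EDGE: "structured relationships",
    Representation.PAPER: "primary evidence",
    Representation.HYPOTHESIS: "testable predictions",
    Representation.CAUSAL_EDGE: "interventional reasoning",
    Representation.NOTEBOOK: "computational provenance",
    Representation.ONTOLOGY_ENTRY: "type hierarchies",
}
```

Using a `str`-backed enum keeps database writes (`artifact_type = Representation.HYPOTHESIS`) interchangeable with the raw strings already stored in SQLite.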

2. Unification Layer

Every scientific entity has a canonical entity_id (string, globally unique across SciDEX); every artifact references one or more of these IDs.

Entity ID Format
  • Genes: GENE:APOE, GENE:MAPT, GENE:PSEN1
  • Proteins: PROTEIN:tau, PROTEIN:amyloid-beta
  • Diseases: DISEASE:alzheimers, DISEASE:parkinsons
  • Pathways: PATHWAY:apoptosis, PATHWAY:autophagy
  • Custom: ENTITY:{name} for emerging concepts
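A hedged sketch of constructing and parsing IDs in this format. The spec only fixes the `PREFIX:name` shape; the normalization rules below (genes upper-cased, other names lower-cased with spaces collapsed to underscores) are assumptions inferred from the examples above:

```python
import re

KNOWN_PREFIXES = {"GENE", "PROTEIN", "DISEASE", "PATHWAY", "ENTITY"}

def make_entity_id(kind: str, name: str) -> str:
    """Build a canonical entity_id like GENE:APOE or ENTITY:lipid_trafficking.

    Assumed normalization: gene symbols upper-cased, other names
    lower-cased, whitespace collapsed to underscores.
    """
    prefix = kind.upper()
    if prefix not in KNOWN_PREFIXES:
        prefix = "ENTITY"  # emerging concepts fall back to ENTITY:{name}
    norm = re.sub(r"\s+", "_", name.strip())
    norm = norm.upper() if prefix == "GENE" else norm.lower()
    return f"{prefix}:{norm}"

def split_entity_id(entity_id: str) -> tuple:
    """Split 'GENE:APOE' into ('GENE', 'APOE')."""
    prefix, _, name = entity_id.partition(":")
    return prefix, name
```

Centralizing this in one helper keeps every layer (registry writes, KG edges, wiki links) agreeing on capitalization, which matters once entity_ids are used as join keys.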
Cross-Representation Links

For a single entity (e.g., GENE:APOE):

GENE:APOE
├── Wiki Page: /wiki/apoe
│   └── Content: markdown with [[WikiLinks]]
├── KG Neighborhood: /entity/APOE
│   ├── 47 edges (32 outgoing, 15 incoming)
│   └── Types: REGULATES(12), ASSOCIATES_WITH(18), ...
├── Hypotheses: 23 mentioning APOE
│   ├── "APOE4 increases Aβ aggregation via impaired clearance" (score: 0.87)
│   └── "APOE2 protects via enhanced autophagy" (score: 0.72)
├── Papers: 1,847 PubMed results
│   ├── Top cited: PMID:12345678 (2,341 citations)
│   └── Recent: 89 papers in last 12 months
├── Causal Edges: 8 causal claims
│   ├── APOE4 → impaired_lipid_transport (confidence: 0.91)
│   └── APOE2 → enhanced_autophagy (confidence: 0.68)
├── Notebooks: 3 analyses
│   └── "APOE4 lipid dysfunction meta-analysis" (2026-03-15)
└── Ontology: gene > protein > lipid_transport_protein

Implementation

Database Schema Addition:

-- Add entity_ids JSON column to existing tables
ALTER TABLE hypotheses ADD COLUMN entity_ids TEXT; -- JSON array
ALTER TABLE analyses ADD COLUMN entity_ids TEXT;   -- JSON array
ALTER TABLE papers ADD COLUMN entity_ids TEXT;      -- JSON array (extracted from abstract/MeSH)

-- Update knowledge_edges to use canonical IDs
ALTER TABLE knowledge_edges ADD COLUMN source_entity_id TEXT;
ALTER TABLE knowledge_edges ADD COLUMN target_entity_id TEXT;
CREATE INDEX idx_ke_source_entity ON knowledge_edges(source_entity_id);
CREATE INDEX idx_ke_target_entity ON knowledge_edges(target_entity_id);
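Once the columns exist they need backfilling. A minimal sketch of populating `papers.entity_ids` by lexicon matching against abstracts — the `LEXICON` dict and `extract_entity_ids`/`backfill_papers` names are illustrative assumptions; a production backfill would use MeSH terms and a curated gene/protein dictionary rather than a hand-written map:

```python
import json
import re

# Illustrative symbol lexicon; a real backfill would draw on MeSH terms
# and a curated gene/protein dictionary.
LEXICON = {
    "APOE": "GENE:APOE",
    "MAPT": "GENE:MAPT",
    "tau": "PROTEIN:tau",
    "Alzheimer": "DISEASE:alzheimers",
}

def extract_entity_ids(text: str) -> list:
    """Naive extraction: match lexicon terms as whole words."""
    found = []
    for term, entity_id in LEXICON.items():
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            found.append(entity_id)
    return sorted(set(found))

def backfill_papers(conn) -> None:
    """Populate the new papers.entity_ids column from abstracts."""
    rows = conn.execute("SELECT pmid, abstract FROM papers").fetchall()
    for pmid, abstract in rows:
        ids = extract_entity_ids(abstract or "")
        conn.execute("UPDATE papers SET entity_ids = ? WHERE pmid = ?",
                     (json.dumps(ids), pmid))
```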

API Route:

@app.get("/entity/{entity_id}")
async def get_entity_unified_view(entity_id: str):
    """Return all SciDEX knowledge about this entity."""
    return {
        "entity_id": entity_id,
        "wiki_page": fetch_wiki_page(entity_id),
        "kg_neighborhood": fetch_kg_edges(entity_id),
        "hypotheses": fetch_hypotheses_mentioning(entity_id),
        "papers": fetch_papers_mentioning(entity_id),
        "causal_edges": fetch_causal_edges(entity_id),
        "notebooks": fetch_notebooks_analyzing(entity_id),
        "ontology_path": fetch_ontology_hierarchy(entity_id),
        "world_model_score": calculate_world_model_score(entity_id)
    }

3. World Model Score

Composite metric for how well-understood an entity is across all representations.

Formula

world_model_score(entity) = weighted_sum([
    wiki_depth_score,
    kg_connectivity_score,
    hypothesis_coverage_score,
    evidence_count_score,
    causal_clarity_score,
    computational_validation_score,
    ontology_integration_score
])

Component Definitions

def calculate_world_model_score(entity_id: str) -> float:
    """Calculate composite world model score (0-1)."""
    
    # 1. Wiki depth (0-1): Does the entity have rich wiki content?
    wiki_page = fetch_wiki_page(entity_id)
    wiki_depth = min(len(wiki_page["content"]) / 5000, 1.0) if wiki_page else 0.0
    
    # 2. KG connectivity (0-1): How well-connected in the knowledge graph?
    edges = fetch_kg_edges(entity_id)
    kg_connectivity = min(len(edges) / 50, 1.0)  # Normalize to 50 edges
    
    # 3. Hypothesis coverage (0-1): How many scored hypotheses mention it?
    hypotheses = fetch_hypotheses_mentioning(entity_id)
    hypothesis_coverage = min(len(hypotheses) / 20, 1.0)
    
    # 4. Evidence count (0-1): How many papers provide evidence?
    papers = fetch_papers_mentioning(entity_id)
    evidence_count = min(len(papers) / 100, 1.0)
    
    # 5. Causal clarity (0-1): Do we have high-confidence causal edges?
    causal_edges = fetch_causal_edges(entity_id)
    if causal_edges:
        avg_confidence = sum(e["confidence"] for e in causal_edges) / len(causal_edges)
        causal_clarity = avg_confidence * min(len(causal_edges) / 10, 1.0)
    else:
        causal_clarity = 0.0
    
    # 6. Computational validation (0-1): Has analysis code verified claims?
    notebooks = fetch_notebooks_analyzing(entity_id)
    computational_validation = min(len(notebooks) / 5, 1.0)
    
    # 7. Ontology integration (0-1): Is it properly typed in hierarchy?
    ontology_path = fetch_ontology_hierarchy(entity_id)
    ontology_integration = 1.0 if ontology_path else 0.0
    
    # Weighted sum
    score = (
        0.15 * wiki_depth +
        0.20 * kg_connectivity +
        0.15 * hypothesis_coverage +
        0.20 * evidence_count +
        0.15 * causal_clarity +
        0.10 * computational_validation +
        0.05 * ontology_integration
    )
    
    return round(score, 3)

Interpretation

  • 0.00-0.30: Barely known — entity mentioned but not studied
  • 0.31-0.60: Partially understood — some evidence, few connections
  • 0.61-0.80: Well-studied — rich across multiple representations
  • 0.81-1.00: Deeply understood — comprehensive multi-modal knowledge
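The bands above can be encoded as a small helper so dashboards and agent prompts use consistent labels (the function name is an assumption; the band boundaries and labels come directly from the table):

```python
def interpret_world_model_score(score: float) -> str:
    """Map a composite world model score (0-1) to its interpretation band."""
    if score <= 0.30:
        return "barely known"
    if score <= 0.60:
        return "partially understood"
    if score <= 0.80:
        return "well-studied"
    return "deeply understood"
```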

4. Gap Detection

Identify entities with representation mismatches to prioritize research.

Gap Types

from typing import Dict, List

def detect_knowledge_gaps() -> List[Dict]:
    """Find entities with representation mismatches."""
    gaps = []
    
    # Get all entities with any presence in SciDEX
    all_entities = get_all_entity_ids()
    
    for entity_id in all_entities:
        scores = calculate_component_scores(entity_id)
        total_score = calculate_world_model_score(entity_id)
        
        # Gap Type 1: Wiki-rich but KG-poor
        if scores["wiki_depth"] > 0.7 and scores["kg_connectivity"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "wiki_rich_kg_poor",
                "priority": scores["wiki_depth"] - scores["kg_connectivity"],
                "recommendation": "Extract structured relationships from wiki to populate KG"
            })
        
        # Gap Type 2: KG-rich but wiki-poor
        if scores["kg_connectivity"] > 0.7 and scores["wiki_depth"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "kg_rich_wiki_poor",
                "priority": scores["kg_connectivity"] - scores["wiki_depth"],
                "recommendation": "Generate wiki page synthesizing KG neighborhood"
            })
        
        # Gap Type 3: Evidence-rich but hypothesis-poor
        if scores["evidence_count"] > 0.7 and scores["hypothesis_coverage"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "evidence_rich_hypothesis_poor",
                "priority": scores["evidence_count"] - scores["hypothesis_coverage"],
                "recommendation": "Run Agora debate to generate testable hypotheses"
            })
        
        # Gap Type 4: Hypothesis-rich but causal-poor
        if scores["hypothesis_coverage"] > 0.7 and scores["causal_clarity"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "hypothesis_rich_causal_poor",
                "priority": scores["hypothesis_coverage"] - scores["causal_clarity"],
                "recommendation": "Extract causal claims from hypotheses, score confidence"
            })
        
        # Gap Type 5: Claims-rich but computation-poor
        if total_score > 0.6 and scores["computational_validation"] < 0.2:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "claims_rich_computation_poor",
                "priority": total_score - scores["computational_validation"],
                "recommendation": "Run Forge analysis to computationally validate claims"
            })
    
    # Sort by priority (highest mismatch first)
    gaps.sort(key=lambda x: x["priority"], reverse=True)
    
    return gaps
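The five rules can also be factored into a pure function over precomputed component scores, which is easier to unit-test than the full database loop. A sketch under that assumption (the `classify_gaps` name and keyword thresholds mirror, but are not part of, the code above):

```python
def classify_gaps(scores: dict, total_score: float,
                  high: float = 0.7, low: float = 0.3) -> list:
    """Return the gap types triggered by one entity's component scores.

    `scores` holds the components keyed as in calculate_world_model_score;
    thresholds default to those used in detect_knowledge_gaps.
    """
    rules = [
        ("wiki_rich_kg_poor", "wiki_depth", "kg_connectivity"),
        ("kg_rich_wiki_poor", "kg_connectivity", "wiki_depth"),
        ("evidence_rich_hypothesis_poor", "evidence_count", "hypothesis_coverage"),
        ("hypothesis_rich_causal_poor", "hypothesis_coverage", "causal_clarity"),
    ]
    gaps = []
    for gap_type, rich, poor in rules:
        if scores[rich] > high and scores[poor] < low:
            gaps.append({"gap_type": gap_type,
                         "priority": scores[rich] - scores[poor]})
    # Rule 5 compares the composite score against computational validation
    if total_score > 0.6 and scores["computational_validation"] < 0.2:
        gaps.append({"gap_type": "claims_rich_computation_poor",
                     "priority": total_score - scores["computational_validation"]})
    return sorted(gaps, key=lambda g: g["priority"], reverse=True)
```

Factoring the rules out this way also makes it cheap to experiment with thresholds without touching the database loop.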

Integration with Knowledge Gaps Table

def sync_gaps_to_db():
    """Write detected gaps to knowledge_gaps table for agent consumption."""
    detected_gaps = detect_knowledge_gaps()
    
    for gap in detected_gaps[:50]:  # Top 50 gaps
        conn.execute("""
            INSERT OR IGNORE INTO knowledge_gaps 
            (title, description, priority_score, status, entity_ids, gap_type)
            VALUES (?, ?, ?, 'open', ?, ?)
        """, (
            f"Improve {gap['entity_id']} world model: {gap['gap_type']}",
            gap['recommendation'],
            gap['priority'],
            json.dumps([gap['entity_id']]),
            gap['gap_type']
        ))

5. Artifact Registry Table

Central table tracking all knowledge artifacts with their entity associations.

Schema

CREATE TABLE IF NOT EXISTS artifact_registry (
    id TEXT PRIMARY KEY,                    -- UUID
    artifact_type TEXT NOT NULL,            -- 'wiki_page', 'kg_edge', 'paper', 'hypothesis', 'notebook', 'causal_edge', 'ontology_entry'
    entity_ids TEXT NOT NULL,               -- JSON array of canonical entity IDs
    content_hash TEXT,                      -- SHA256 of content for deduplication
    source_table TEXT,                      -- Where the artifact lives ('hypotheses', 'papers', 'analyses', etc)
    source_id TEXT,                         -- Foreign key to source table
    created_by TEXT,                        -- 'agent', 'orchestrator', 'user', 'import_script'
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    quality_score REAL,                     -- 0-1, artifact-specific quality metric
    provenance TEXT,                        -- JSON: what analysis/debate/import generated this
    metadata TEXT                           -- JSON: artifact-specific fields
);

CREATE INDEX idx_ar_type ON artifact_registry(artifact_type);
CREATE INDEX idx_ar_entity_ids ON artifact_registry(entity_ids);  -- note: plain B-tree index; JSON containment queries still scan rows
CREATE INDEX idx_ar_created_at ON artifact_registry(created_at);
CREATE INDEX idx_ar_quality ON artifact_registry(quality_score);

Usage Examples

import json
import uuid
from typing import Dict, List

# Register a new hypothesis
def register_hypothesis(hypothesis_id: str, entity_ids: List[str], quality_score: float):
    conn.execute("""
        INSERT INTO artifact_registry 
        (id, artifact_type, entity_ids, source_table, source_id, created_by, quality_score)
        VALUES (?, 'hypothesis', ?, 'hypotheses', ?, 'orchestrator', ?)
    """, (str(uuid.uuid4()), json.dumps(entity_ids), hypothesis_id, quality_score))

# Find all artifacts for an entity (exact match via json_each avoids
# substring false positives, e.g. GENE:APO matching GENE:APOE)
def get_artifacts_for_entity(entity_id: str) -> List[Dict]:
    results = conn.execute("""
        SELECT ar.* FROM artifact_registry ar
        WHERE EXISTS (
            SELECT 1 FROM json_each(ar.entity_ids)
            WHERE json_each.value = ?
        )
        ORDER BY ar.quality_score DESC, ar.created_at DESC
    """, (entity_id,)).fetchall()
    return [dict(r) for r in results]

# Get artifact counts by type
def get_artifact_summary() -> Dict[str, int]:
    results = conn.execute("""
        SELECT artifact_type, COUNT(*) as count 
        FROM artifact_registry 
        GROUP BY artifact_type
    """).fetchall()
    return {r["artifact_type"]: r["count"] for r in results}

6. Cross-Layer Integration Examples

Example 1: Agora → Atlas Pipeline

1. [Agora] Debate generates hypothesis:
   "APOE4 impairs mitochondrial function via disrupted lipid trafficking"

2. [Artifact Registry] Register hypothesis artifact:
   - artifact_type: 'hypothesis'
   - entity_ids: ['GENE:APOE', 'ENTITY:mitochondrial_function', 'ENTITY:lipid_trafficking']
   - quality_score: composite_score from Exchange

3. [Atlas] Extract knowledge edges:
   - APOE4 --[IMPAIRS]--> mitochondrial_function
   - APOE4 --[DISRUPTS]--> lipid_trafficking
   - lipid_trafficking --[REGULATES]--> mitochondrial_function

4. [World Model Score] Recalculate for all 3 entities
   - APOE: +0.02 (new hypothesis, new causal edges)
   - mitochondrial_function: +0.05 (was KG-poor, now connected)
   - lipid_trafficking: +0.03 (bridge entity revealed)

5. [Gap Detection] Identify new gap:
   "mitochondrial_function is hypothesis-rich but evidence-poor"
   → Trigger Forge analysis to validate via PubMed/data
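Step 3 of this pipeline (Atlas extracting knowledge edges from hypothesis text) can be sketched with simple verb patterns. This is an illustrative assumption — the spec does not say how Atlas extracts edges, and a real extractor would presumably be model-based — but it shows the intended input/output shape:

```python
import re

# Verb → edge-type mapping for the relations used in this spec.
RELATION_PATTERNS = {
    "impairs": "IMPAIRS",
    "disrupts": "DISRUPTS",
    "regulates": "REGULATES",
    "causes": "CAUSES",
    "inhibits": "INHIBITS",
}

def extract_edges(hypothesis: str) -> list:
    """Pull (source, EDGE_TYPE, target) triples from simple
    'X <verb> Y' clauses. Illustrates the Atlas step only; a real
    extractor would need proper entity linking."""
    edges = []
    for verb, edge_type in RELATION_PATTERNS.items():
        for m in re.finditer(rf"(\w+)\s+{verb}\s+([\w ]+?)(?:\s+via|[.,]|$)",
                             hypothesis, re.IGNORECASE):
            source, target = m.group(1), m.group(2).strip().replace(" ", "_")
            edges.append((source, edge_type, target))
    return edges
```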

Example 2: Senate → Gap Closure Loop

7. Migration Path

Phase 1: Foundation (Week 1)
☐ Add entity_ids JSON column to hypotheses, analyses, papers tables
☐ Create artifact_registry table
☐ Implement calculate_world_model_score() function
☐ Add /entity/{entity_id} unified view route to api.py

Phase 2: Gap Detection (Week 2)
☐ Implement detect_knowledge_gaps() function
☐ Add gap_type column to knowledge_gaps table
☐ Create /gaps/world-model dashboard showing gap distribution
☐ Integrate gap detection with agent task selection

Phase 3: Artifact Registry Population (Week 3)
☐ Backfill artifact_registry with existing hypotheses
☐ Backfill artifact_registry with existing papers
☐ Backfill artifact_registry with existing knowledge_edges
☐ Add artifact registration to post_process.py pipeline

Phase 4: Neo4j Migration (Week 4+)
☐ Set up Neo4j instance
☐ Migrate knowledge_edges → Neo4j graph
☐ Update API to query Neo4j for KG neighborhood
☐ Implement graph algorithms (PageRank, community detection)
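Phase 4's graph algorithms can be prototyped against the SQLite edge list before Neo4j exists. A minimal power-iteration PageRank over (source, target) pairs, pure Python with no graph library assumed — a sketch for ranking entities by centrality, not the production implementation:

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            if not targets:  # dangling node: spread its rank evenly
                for n in nodes:
                    new_rank[n] += damping * rank[src] / len(nodes)
            else:
                for dst in targets:
                    new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank

# e.g. rank entities by centrality in the causal subgraph:
# pagerank([("GENE:APOE", "ENTITY:lipid_trafficking"), ...])
```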

Work Log

2026-04-02 09:00 PT — Slot 2

  • Started task: Design world model multi-representation framework
  • Read AGENTS.md to understand spec format and architecture
  • Created comprehensive spec with 7 representation types, unification layer, scoring formula, gap detection algorithm, artifact registry, and integration examples
  • Next: Review spec, then begin Phase 1 implementation

File: world_model_framework_spec.md
Modified: 2026-04-25 17:55
Size: 18.5 KB