Design a unified architecture for SciDEX's world model that bridges multiple representations of scientific knowledge (knowledge graphs, wiki pages, papers, hypotheses, causal models, notebooks, ontologies). Each representation captures different aspects of understanding — structured relationships, natural language context, primary evidence, predictions, causality, computation, and taxonomic hierarchies. By linking all representations through canonical entity IDs and measuring completeness via composite scores, SciDEX can systematically identify knowledge gaps and direct research toward maximally underexplored regions.
Define seven complementary knowledge representations:
- Knowledge graph: knowledge_edges table in SQLite; planned Neo4j migration
- Wiki pages: /wiki/{entity} route in api.py; [[WikiLinks]] such as [[APOE]] map to entity_id, with bidirectional references
- Papers: papers table (PMID, title, abstract, authors, year, citations)
- Hypotheses: hypotheses table with composite_score (0-1), evidence_for/against
- Causal models: knowledge_edges rows with relation='causes', plus confidence, directionality, evidence_pmids[], extracted_from_analysis_id
- Notebooks: site/notebooks/, metadata in artifact registry
- Ontologies: taxonomic hierarchy gene > protein > protein_complex > pathway > biological_process > disease

Every scientific artifact has a canonical entity_id (string, globally unique across SciDEX).
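A minimal sketch of how such IDs might be minted. The `make_entity_id` helper and its normalization rules are illustrative assumptions, not part of SciDEX:

```python
# Hypothetical helper for minting canonical entity IDs; the
# normalization rules here are illustrative assumptions, not SciDEX's.
KNOWN_NAMESPACES = {"GENE", "PROTEIN", "DISEASE", "PATHWAY", "ENTITY"}

def make_entity_id(namespace: str, name: str) -> str:
    """Build a canonical entity_id such as 'GENE:APOE'."""
    ns = namespace.strip().upper()
    if ns not in KNOWN_NAMESPACES:
        ns = "ENTITY"  # fall back for emerging concepts
    # Gene symbols are conventionally uppercased; other names are
    # lowercased with spaces collapsed to underscores (an assumption).
    if ns == "GENE":
        norm = name.strip().upper()
    else:
        norm = name.strip().lower().replace(" ", "_")
    return f"{ns}:{norm}"
```

For example, `make_entity_id("concept", "lipid trafficking")` falls back to the `ENTITY:` namespace and yields `ENTITY:lipid_trafficking`.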
- GENE:APOE, GENE:MAPT, GENE:PSEN1
- PROTEIN:tau, PROTEIN:amyloid-beta
- DISEASE:alzheimers, DISEASE:parkinsons
- PATHWAY:apoptosis, PATHWAY:autophagy
- ENTITY:{name} for emerging concepts

For a single entity (e.g., GENE:APOE):
GENE:APOE
├── Wiki Page: /wiki/apoe
│   └── Content: markdown with [[WikiLinks]]
├── KG Neighborhood: /entity/APOE
│   ├── 47 edges (32 outgoing, 15 incoming)
│   └── Types: REGULATES(12), ASSOCIATES_WITH(18), ...
├── Hypotheses: 23 mentioning APOE
│   ├── "APOE4 increases Aβ aggregation via impaired clearance" (score: 0.87)
│   └── "APOE2 protects via enhanced autophagy" (score: 0.72)
├── Papers: 1,847 PubMed results
│   ├── Top cited: PMID:12345678 (2,341 citations)
│   └── Recent: 89 papers in last 12 months
├── Causal Edges: 8 causal claims
│   ├── APOE4 → impaired_lipid_transport (confidence: 0.91)
│   └── APOE2 → enhanced_autophagy (confidence: 0.68)
├── Notebooks: 3 analyses
│   └── "APOE4 lipid dysfunction meta-analysis" (2026-03-15)
└── Ontology: gene > protein > lipid_transport_protein

Database Schema Addition:
-- Add entity_ids JSON column to existing tables
ALTER TABLE hypotheses ADD COLUMN entity_ids TEXT; -- JSON array
ALTER TABLE analyses ADD COLUMN entity_ids TEXT; -- JSON array
ALTER TABLE papers ADD COLUMN entity_ids TEXT; -- JSON array (extracted from abstract/MeSH)
-- Update knowledge_edges to use canonical IDs
ALTER TABLE knowledge_edges ADD COLUMN source_entity_id TEXT;
ALTER TABLE knowledge_edges ADD COLUMN target_entity_id TEXT;
CREATE INDEX idx_ke_source_entity ON knowledge_edges(source_entity_id);
CREATE INDEX idx_ke_target_entity ON knowledge_edges(target_entity_id);

API Route:
@app.get("/entity/{entity_id}")
async def get_entity_unified_view(entity_id: str):
    """Return all SciDEX knowledge about this entity."""
    return {
        "entity_id": entity_id,
        "wiki_page": fetch_wiki_page(entity_id),
        "kg_neighborhood": fetch_kg_edges(entity_id),
        "hypotheses": fetch_hypotheses_mentioning(entity_id),
        "papers": fetch_papers_mentioning(entity_id),
        "causal_edges": fetch_causal_edges(entity_id),
        "notebooks": fetch_notebooks_analyzing(entity_id),
        "ontology_path": fetch_ontology_hierarchy(entity_id),
        "world_model_score": calculate_world_model_score(entity_id)
    }

The world model score is a composite metric for how well-understood an entity is across all representations.
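As a toy illustration of how such a weighted composite behaves: the weights below mirror the implementation in this section, while the component values are invented for the example.

```python
# Toy sketch of the composite score. Weights mirror the
# implementation in this section; component values are invented.
WEIGHTS = {
    "wiki_depth": 0.15,
    "kg_connectivity": 0.20,
    "hypothesis_coverage": 0.15,
    "evidence_count": 0.20,
    "causal_clarity": 0.15,
    "computational_validation": 0.10,
    "ontology_integration": 0.05,
}

def weighted_sum(components: dict) -> float:
    """Combine per-representation scores into one 0-1 number."""
    return round(sum(WEIGHTS[k] * v for k, v in components.items()), 3)

toy_components = {
    "wiki_depth": 0.8,
    "kg_connectivity": 0.9,
    "hypothesis_coverage": 0.5,
    "evidence_count": 1.0,
    "causal_clarity": 0.4,
    "computational_validation": 0.2,
    "ontology_integration": 1.0,
}
```

An entity that is strong on wiki, KG, and evidence but weak on causal clarity and computation lands mid-range (here 0.705) — exactly the kind of uneven profile the gap detector below is designed to surface.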
world_model_score(entity) = weighted_sum([
    wiki_depth_score,
    kg_connectivity_score,
    hypothesis_coverage_score,
    evidence_count_score,
    causal_clarity_score,
    computational_validation_score,
    ontology_integration_score
])

def calculate_world_model_score(entity_id: str) -> float:
    """Calculate composite world model score (0-1)."""
    # 1. Wiki depth (0-1): does the entity have rich wiki content?
    wiki_page = fetch_wiki_page(entity_id)
    wiki_depth = min(len(wiki_page["content"]) / 5000, 1.0) if wiki_page else 0.0

    # 2. KG connectivity (0-1): how well-connected in the knowledge graph?
    edges = fetch_kg_edges(entity_id)
    kg_connectivity = min(len(edges) / 50, 1.0)  # normalize to 50 edges

    # 3. Hypothesis coverage (0-1): how many scored hypotheses mention it?
    hypotheses = fetch_hypotheses_mentioning(entity_id)
    hypothesis_coverage = min(len(hypotheses) / 20, 1.0)

    # 4. Evidence count (0-1): how many papers provide evidence?
    papers = fetch_papers_mentioning(entity_id)
    evidence_count = min(len(papers) / 100, 1.0)

    # 5. Causal clarity (0-1): do we have high-confidence causal edges?
    causal_edges = fetch_causal_edges(entity_id)
    if causal_edges:
        avg_confidence = sum(e["confidence"] for e in causal_edges) / len(causal_edges)
        causal_clarity = avg_confidence * min(len(causal_edges) / 10, 1.0)
    else:
        causal_clarity = 0.0

    # 6. Computational validation (0-1): has analysis code verified claims?
    notebooks = fetch_notebooks_analyzing(entity_id)
    computational_validation = min(len(notebooks) / 5, 1.0)

    # 7. Ontology integration (0-1): is it properly typed in the hierarchy?
    ontology_path = fetch_ontology_hierarchy(entity_id)
    ontology_integration = 1.0 if ontology_path else 0.0

    # Weighted sum (weights total 1.0)
    score = (
        0.15 * wiki_depth +
        0.20 * kg_connectivity +
        0.15 * hypothesis_coverage +
        0.20 * evidence_count +
        0.15 * causal_clarity +
        0.10 * computational_validation +
        0.05 * ontology_integration
    )
    return round(score, 3)

Gap detection identifies entities with representation mismatches to prioritize research.
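One of the mismatch rules can be sketched on toy component scores. The 0.7/0.3 thresholds match the detector in this section; the score values are invented:

```python
# Toy sketch of gap rule 1 (wiki-rich but KG-poor). The 0.7 / 0.3
# thresholds match the detector in this section; scores are invented.
def wiki_rich_kg_poor(scores: dict) -> bool:
    return scores["wiki_depth"] > 0.7 and scores["kg_connectivity"] < 0.3

toy_scores = {"wiki_depth": 0.85, "kg_connectivity": 0.10}
is_gap = wiki_rich_kg_poor(toy_scores)
# Priority is the size of the mismatch between the two representations.
priority = toy_scores["wiki_depth"] - toy_scores["kg_connectivity"]
```

A large spread between the two component scores yields a high priority, so the most lopsided entities are worked on first.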
import json
from typing import Dict, List

def detect_knowledge_gaps() -> List[Dict]:
    """Find entities with representation mismatches."""
    gaps = []
    # Get all entities with any presence in SciDEX
    all_entities = get_all_entity_ids()
    for entity_id in all_entities:
        scores = calculate_component_scores(entity_id)
        total_score = calculate_world_model_score(entity_id)

        # Gap Type 1: wiki-rich but KG-poor
        if scores["wiki_depth"] > 0.7 and scores["kg_connectivity"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "wiki_rich_kg_poor",
                "priority": scores["wiki_depth"] - scores["kg_connectivity"],
                "recommendation": "Extract structured relationships from wiki to populate KG"
            })

        # Gap Type 2: KG-rich but wiki-poor
        if scores["kg_connectivity"] > 0.7 and scores["wiki_depth"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "kg_rich_wiki_poor",
                "priority": scores["kg_connectivity"] - scores["wiki_depth"],
                "recommendation": "Generate wiki page synthesizing KG neighborhood"
            })

        # Gap Type 3: evidence-rich but hypothesis-poor
        if scores["evidence_count"] > 0.7 and scores["hypothesis_coverage"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "evidence_rich_hypothesis_poor",
                "priority": scores["evidence_count"] - scores["hypothesis_coverage"],
                "recommendation": "Run Agora debate to generate testable hypotheses"
            })

        # Gap Type 4: hypothesis-rich but causal-poor
        if scores["hypothesis_coverage"] > 0.7 and scores["causal_clarity"] < 0.3:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "hypothesis_rich_causal_poor",
                "priority": scores["hypothesis_coverage"] - scores["causal_clarity"],
                "recommendation": "Extract causal claims from hypotheses, score confidence"
            })

        # Gap Type 5: claims-rich but computation-poor
        if total_score > 0.6 and scores["computational_validation"] < 0.2:
            gaps.append({
                "entity_id": entity_id,
                "gap_type": "claims_rich_computation_poor",
                "priority": total_score - scores["computational_validation"],
                "recommendation": "Run Forge analysis to computationally validate claims"
            })

    # Sort by priority (highest mismatch first)
    gaps.sort(key=lambda x: x["priority"], reverse=True)
    return gaps

def sync_gaps_to_db():
    """Write detected gaps to knowledge_gaps table for agent consumption."""
    # Assumes `conn` is the shared SQLite connection.
    detected_gaps = detect_knowledge_gaps()
    for gap in detected_gaps[:50]:  # top 50 gaps
        conn.execute("""
            INSERT OR IGNORE INTO knowledge_gaps
                (title, description, priority_score, status, entity_ids, gap_type)
            VALUES (?, ?, ?, 'open', ?, ?)
        """, (
            f"Improve {gap['entity_id']} world model: {gap['gap_type']}",
            gap['recommendation'],
            gap['priority'],
            json.dumps([gap['entity_id']]),
            gap['gap_type']
        ))
    conn.commit()  # persist the batch

The artifact_registry is the central table tracking all knowledge artifacts with their entity associations.
CREATE TABLE IF NOT EXISTS artifact_registry (
    id TEXT PRIMARY KEY,           -- UUID
    artifact_type TEXT NOT NULL,   -- 'wiki_page', 'kg_edge', 'paper', 'hypothesis', 'notebook', 'causal_edge', 'ontology_entry'
    entity_ids TEXT NOT NULL,      -- JSON array of canonical entity IDs
    content_hash TEXT,             -- SHA256 of content for deduplication
    source_table TEXT,             -- where the artifact lives ('hypotheses', 'papers', 'analyses', etc.)
    source_id TEXT,                -- foreign key to source table
    created_by TEXT,               -- 'agent', 'orchestrator', 'user', 'import_script'
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    quality_score REAL,            -- 0-1, artifact-specific quality metric
    provenance TEXT,               -- JSON: what analysis/debate/import generated this
    metadata TEXT                  -- JSON: artifact-specific fields
);

CREATE INDEX idx_ar_type ON artifact_registry(artifact_type);
CREATE INDEX idx_ar_entity_ids ON artifact_registry(entity_ids); -- JSON search
CREATE INDEX idx_ar_created_at ON artifact_registry(created_at);
CREATE INDEX idx_ar_quality ON artifact_registry(quality_score);

# Register a new hypothesis
def register_hypothesis(hypothesis_id: str, entity_ids: List[str], quality_score: float):
    # Assumes `conn` is the shared SQLite connection; `json` and `uuid` imported.
    conn.execute("""
        INSERT INTO artifact_registry
            (id, artifact_type, entity_ids, source_table, source_id, created_by, quality_score)
        VALUES (?, 'hypothesis', ?, 'hypotheses', ?, 'orchestrator', ?)
    """, (str(uuid.uuid4()), json.dumps(entity_ids), hypothesis_id, quality_score))

# Find all artifacts for an entity
def get_artifacts_for_entity(entity_id: str) -> List[Dict]:
    # Exact membership test via json_each; a plain LIKE '%GENE:APOE%'
    # would also match GENE:APOE4.
    results = conn.execute("""
        SELECT * FROM artifact_registry
        WHERE EXISTS (
            SELECT 1 FROM json_each(artifact_registry.entity_ids)
            WHERE json_each.value = ?
        )
        ORDER BY quality_score DESC, created_at DESC
    """, (entity_id,)).fetchall()
    return [dict(r) for r in results]

# Get artifact counts by type
def get_artifact_summary() -> Dict[str, int]:
    results = conn.execute("""
        SELECT artifact_type, COUNT(*) AS count
        FROM artifact_registry
        GROUP BY artifact_type
    """).fetchall()
    return {r["artifact_type"]: r["count"] for r in results}

1. [Agora] Debate generates hypothesis:
"APOE4 impairs mitochondrial function via disrupted lipid trafficking"
2. [Artifact Registry] Register hypothesis artifact:
- artifact_type: 'hypothesis'
- entity_ids: ['GENE:APOE', 'ENTITY:mitochondrial_function', 'ENTITY:lipid_trafficking']
- quality_score: composite_score from Exchange
3. [Atlas] Extract knowledge edges:
- APOE4 --[IMPAIRS]--> mitochondrial_function
- APOE4 --[DISRUPTS]--> lipid_trafficking
- lipid_trafficking --[REGULATES]--> mitochondrial_function
4. [World Model Score] Recalculate for all 3 entities
- APOE: +0.02 (new hypothesis, new causal edges)
- mitochondrial_function: +0.05 (was KG-poor, now connected)
- lipid_trafficking: +0.03 (bridge entity revealed)
5. [Gap Detection] Identify new gap:
"mitochondrial_function is hypothesis-rich but evidence-poor"
   → Trigger Forge analysis to validate via PubMed/data

Implementation checklist:

- Add entity_ids JSON column to hypotheses, analyses, papers tables
- Add artifact_registry table
- Add calculate_world_model_score() function
- Add /entity/{entity_id} unified view route to api.py
- Add detect_knowledge_gaps() function
- Add gap_type column to knowledge_gaps table
- Add /gaps/world-model dashboard showing gap distribution
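The registry lookup can be sanity-checked end to end with a self-contained sketch against an in-memory SQLite database. The table is trimmed to the columns the query touches, and the rows are invented test data:

```python
import json
import sqlite3

# Minimal in-memory sketch of artifact_registry (columns trimmed);
# demonstrates exact-match entity lookup via SQLite's json_each.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("""
    CREATE TABLE artifact_registry (
        id TEXT PRIMARY KEY,
        artifact_type TEXT NOT NULL,
        entity_ids TEXT NOT NULL,
        quality_score REAL
    )
""")
rows = [
    ("a1", "hypothesis", json.dumps(["GENE:APOE", "ENTITY:lipid_trafficking"]), 0.87),
    ("a2", "notebook", json.dumps(["GENE:APOE4"]), 0.70),  # must NOT match GENE:APOE
    ("a3", "paper", json.dumps(["GENE:APOE"]), 0.55),
]
conn.executemany("INSERT INTO artifact_registry VALUES (?, ?, ?, ?)", rows)

def get_artifacts_for_entity(entity_id: str):
    # json_each expands the JSON array so membership is exact,
    # avoiding the substring false-positives of a LIKE query.
    return [dict(r) for r in conn.execute("""
        SELECT * FROM artifact_registry
        WHERE EXISTS (
            SELECT 1 FROM json_each(artifact_registry.entity_ids)
            WHERE json_each.value = ?
        )
        ORDER BY quality_score DESC
    """, (entity_id,))]
```

Querying `GENE:APOE` returns a1 and a3 in quality order and correctly skips the `GENE:APOE4` notebook, which a substring match would have included.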