Spec: Add infoboxes to top 20 most-connected wiki entities

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
> atlas — every principle is load-bearing. In particular:
> - LLMs for semantic judgment; rules for syntactic validation.
> - Gap-predicate driven, not calendar-driven.
> - Idempotent + version-stamped + observable.
> - No hardcoded entity lists, keyword lists, or canonical-name tables.
> - Three surfaces: FastAPI + orchestra + MCP.
> - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
> A5 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
> Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
> docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
> BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

Task ID: 466d13f2-ec31-44b6-8230-20eb239b6b3a

Layer: Atlas

Status: Done

Goal

536+ wiki entities had thin infobox_data (only auto-generated fields: name, type,
knowledge_graph_edges, top_relations, associated_diseases). Enrich the 20 most-connected
of these with structured data from the knowledge graph, canonical_entities, wiki_pages,
and the targets table.

Approach

1. Query knowledge_edges (700K+ edges) to compute per-entity connection counts.
2. Join with wiki_entities to find entities with thin infoboxes (<=5 filled fields).
3. Sort by connection count descending and pick the top 20 unique entities.
4. For each entity, enrich the infobox using:
   - canonical_entities: aliases, external IDs (UniProt, NCBI, Ensembl), description
   - wiki_pages: definition paragraph extraction
   - knowledge_edges: associated genes, diseases, drugs, pathways, cell types, interactions, brain regions
   - targets: protein name, mechanism, target class (for gene/protein entities)
5. Merge new data into the existing infobox, preserving any existing non-empty values.
6. Write via db_writes.save_wiki_entity for journaled updates.
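The merge step can be sketched as follows. This is a minimal illustration, not the retired script's actual code: the helper names `is_filled` and `merge_infobox` and the sample field values are hypothetical, and the definition of "non-empty" (not None, not a blank string, not an empty container) is an assumption.

```python
def is_filled(value):
    """Assumed definition of a filled field: not None, not blank, not an empty container."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True  # numbers (including 0) and booleans count as filled


def merge_infobox(existing, new):
    """Return a copy of `existing` with its gaps filled from `new`.

    Existing non-empty values always win, and unfilled values in `new` are skipped.
    """
    merged = dict(existing)
    for key, value in new.items():
        if is_filled(value) and not is_filled(merged.get(key)):
            merged[key] = value
    return merged


# Illustrative values only; these are not rows from the real database.
existing = {"name": "Microglia", "type": "cell_type", "aliases": [], "description": ""}
new = {
    "name": "microglial cell",          # ignored: `name` is already filled
    "aliases": ["microglial cells"],    # fills an empty list
    "description": "Example definition paragraph.",
    "uniprot_id": None,                 # ignored: nothing to add
}
merged = merge_infobox(existing, new)
print(merged)
```

Because filled values are never overwritten, running the merge a second time with the same input leaves the infobox unchanged, which fits the idempotency principle in the anchor above.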

Enriched Entities

| Entity | Type | Connections | Fields before | Fields after |
|---|---|---|---|---|
| Genes | index | 8142 | 5 | 7 |
| cancer | disease | 8021 | 5 | 11 |
| autophagy | biological_process | 7580 | 5 | 11 |
| Alzheimer'S Disease | disease | 7323 | 5 | 10 |
| APOPTOSIS | phenotype | 7055 | 5 | 6 |
| inflammation | pathway | 6960 | 5 | 10 |
| Microglia | cell_type | 5972 | 5 | 10 |
| AGING | phenotype | 4818 | 5 | 6 |
| TUMOR | disease | 4721 | 5 | 9 |
| neuroinflammation | biological_process | 4676 | 5 | 11 |
| CYTOKINES | concept | 4349 | 5 | 9 |
| mitochondria | organelle | 4157 | 5 | 8 |
| ASTROCYTES | concept | 4155 | 5 | 10 |
| NEURODEGENERATIVE DISEASES | concept | 4117 | 5 | 10 |
| ferroptosis | phenotype | 3201 | 5 | 7 |
| MITOPHAGY | biological_process | 3135 | 5 | 10 |
| MITOCHONDRIAL DYSFUNCTION | concept | 3048 | 5 | 10 |
| neurons | cell_type | 2783 | 5 | 7 |
| Astrocyte | cell_type | 2685 | 5 | 10 |
| Erk | protein | 2554 | 5 | 9 |

Files

enrichment/enrich_top20_connected_infoboxes.py — enrichment script

Work Log

2026-04-16: Initial investigation. Found all 13,640 wiki_entities already have infobox_data (auto-generated), but 1,187 have thin data (<=5 filled fields). The knowledge_edges table has 700K+ edges for computing connection counts.

2026-04-16: Updated THIN_THRESHOLD from 5 to 6 to capture entities with exactly 5 boilerplate fields. Increased scan limit from 500 to 2000 to find 20 thin entities among the most connected.

2026-04-16: Ran enrichment script successfully. All 20 entities enriched from 5 fields to 6-11 fields each, adding structured data from KG edges, canonical entities, and wiki page content.
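The selection logic behind the log entries above (rank by connection count, scan the top of the ranking, keep entities below the thin threshold) might look roughly like the sketch below. It runs against an in-memory SQLite database with toy data. The column names (`source_id`, `target_id`, `entity_name`), the JSON encoding of `infobox_data`, and the function names are assumptions, since this spec does not give the real schema. Only the values `THIN_THRESHOLD = 6` and the scan limit of 2000 come from the log.

```python
import json
import sqlite3

THIN_THRESHOLD = 6   # from the work log: thin means fewer than 6 filled fields
SCAN_LIMIT = 2000    # from the work log: how many top-ranked entities to scan

# Assumed schema: an edge counts once for each of its two endpoints.
RANK_SQL = """
SELECT entity, COUNT(*) AS connections
FROM (
    SELECT source_id AS entity FROM knowledge_edges
    UNION ALL
    SELECT target_id AS entity FROM knowledge_edges
)
GROUP BY entity
ORDER BY connections DESC, entity
LIMIT ?
"""


def count_filled(infobox):
    """Count fields whose value is not None, an empty string, or an empty container."""
    return sum(1 for v in infobox.values() if v not in (None, "", [], {}))


def pick_thin_entities(conn, top_n=20, scan_limit=SCAN_LIMIT, threshold=THIN_THRESHOLD):
    """Walk entities in descending connection order and keep the first thin ones."""
    picked = []
    for entity, connections in conn.execute(RANK_SQL, (scan_limit,)):
        row = conn.execute(
            "SELECT infobox_data FROM wiki_entities WHERE entity_name = ?", (entity,)
        ).fetchone()
        if row is None:
            continue  # edge endpoint without a wiki entity
        filled = count_filled(json.loads(row[0] or "{}"))
        if filled < threshold:
            picked.append((entity, connections, filled))
            if len(picked) == top_n:
                break
    return picked


def boilerplate(name, **extra):
    """Toy infobox holding the five auto-generated fields plus optional extras."""
    base = {"name": name, "type": "concept", "knowledge_graph_edges": 1,
            "top_relations": ["r"], "associated_diseases": ["d"]}
    return json.dumps({**base, **extra})


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knowledge_edges (source_id TEXT, target_id TEXT);
CREATE TABLE wiki_entities (entity_name TEXT PRIMARY KEY, infobox_data TEXT);
""")
# Degrees in this toy graph: a=4, b=3, c=2, d=2, e=1.
conn.executemany("INSERT INTO knowledge_edges VALUES (?, ?)",
                 [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("b", "d")])
# "a" already has six filled fields; "d" has no wiki entity at all.
conn.executemany("INSERT INTO wiki_entities VALUES (?, ?)",
                 [("a", boilerplate("a", aliases=["x"])), ("b", boilerplate("b")),
                  ("c", boilerplate("c")), ("e", boilerplate("e"))])

print(pick_thin_entities(conn, top_n=2))               # [('b', 3, 5), ('c', 2, 5)]
print(pick_thin_entities(conn, top_n=2, threshold=5))  # []
print(pick_thin_entities(conn, top_n=2, scan_limit=1)) # []
```

The last two calls mirror the two adjustments in the log: with a threshold of 5, entities holding exactly five boilerplate fields are never selected, and with a scan limit that is too small, the scan ends before enough thin entities are found.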


File: 466d13f2_ec3_spec.md

Modified: 2026-04-25 23:40
