Spec: Add infoboxes to top 20 most-connected wiki entities

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
>    atlas; every principle is load-bearing. In particular:
>    - LLMs for semantic judgment; rules for syntactic validation.
>    - Gap-predicate driven, not calendar-driven.
>    - Idempotent + version-stamped + observable.
>    - No hardcoded entity lists, keyword lists, or canonical-name tables.
>    - Three surfaces: FastAPI + orchestra + MCP.
>    - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
>    A5 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
>    Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
>    docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
>    BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

Task ID: 466d13f2-ec31-44b6-8230-20eb239b6b3a | Layer: Atlas | Status: Done

Goal

536+ wiki entities had thin infobox_data (only auto-generated fields: name, type,
knowledge_graph_edges, top_relations, associated_diseases). Enrich the 20 most-connected
of these with structured data from the knowledge graph, canonical_entities, wiki_pages,
and the targets table.

Approach

  • Query knowledge_edges (700K+ edges) to compute per-entity connection counts
  • Join with wiki_entities to find entities with thin infoboxes (<=5 filled fields)
  • Sort by connection count descending, pick top 20 unique entities
  • For each entity, enrich the infobox using:
    - canonical_entities: aliases, external IDs (UniProt, NCBI, Ensembl), description
    - wiki_pages: definition paragraph extraction
    - knowledge_edges: associated genes, diseases, drugs, pathways, cell types, interactions, brain regions
    - targets: protein name, mechanism, target class (for gene/protein entities)
  • Merge new data into existing infobox (preserving any existing non-empty values)
  • Write via db_writes.save_wiki_entity for journaled updates
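The selection and merge steps above can be sketched in plain Python. This is a minimal illustration, not the retired script: the function names (`connection_counts`, `pick_thin_top_n`, `merge_infobox`), the in-memory inputs, and the choice to count an edge once for each endpoint are all assumptions made for the example. The real data lives in the knowledge_edges and wiki_entities tables, and writes go through db_writes.save_wiki_entity.

```python
from collections import Counter


def is_filled(value):
    """Treat None, blank strings, and empty containers as unfilled (assumed rule)."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def connection_counts(edges):
    """Count edges per entity; each (source, target) pair adds one to both ends."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


def pick_thin_top_n(edges, infoboxes, n=20, max_filled=5):
    """Walk entities from most to least connected and keep the first n thin ones."""
    picked = []
    for name, count in connection_counts(edges).most_common():
        infobox = infoboxes.get(name)
        if infobox is None:
            continue  # entity has edges but no wiki entry
        filled = sum(1 for v in infobox.values() if is_filled(v))
        if filled <= max_filled:
            picked.append((name, count))
        if len(picked) == n:
            break
    return picked


def merge_infobox(existing, new_fields):
    """Add new filled values without overwriting existing non-empty ones."""
    merged = dict(existing)
    for key, value in new_fields.items():
        if is_filled(value) and not is_filled(merged.get(key)):
            merged[key] = value
    return merged
```

Because `merge_infobox` only fills fields that are currently empty, running it twice with the same inputs yields the same result, which fits the idempotency principle named in the anchor above.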
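Enriched Entities

| Entity | Type | Connections | Fields before | Fields after |
|---|---|---|---|---|
| Genes | index | 8142 | 5 | 7 |
| cancer | disease | 8021 | 5 | 11 |
| autophagy | biological_process | 7580 | 5 | 11 |
| Alzheimer'S Disease | disease | 7323 | 5 | 10 |
| APOPTOSIS | phenotype | 7055 | 5 | 6 |
| inflammation | pathway | 6960 | 5 | 10 |
| Microglia | cell_type | 5972 | 5 | 10 |
| AGING | phenotype | 4818 | 5 | 6 |
| TUMOR | disease | 4721 | 5 | 9 |
| neuroinflammation | biological_process | 4676 | 5 | 11 |
| CYTOKINES | concept | 4349 | 5 | 9 |
| mitochondria | organelle | 4157 | 5 | 8 |
| ASTROCYTES | concept | 4155 | 5 | 10 |
| NEURODEGENERATIVE DISEASES | concept | 4117 | 5 | 10 |
| ferroptosis | phenotype | 3201 | 5 | 7 |
| MITOPHAGY | biological_process | 3135 | 5 | 10 |
| MITOCHONDRIAL DYSFUNCTION | concept | 3048 | 5 | 10 |
| neurons | cell_type | 2783 | 5 | 7 |
| Astrocyte | cell_type | 2685 | 5 | 10 |
| Erk | protein | 2554 | 5 | 9 |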

Files

  • enrichment/enrich_top20_connected_infoboxes.py (enrichment script)

Work Log

  • 2026-04-16: Initial investigation. Found that all 13,640 wiki_entities already have
    infobox_data (auto-generated), but 1,187 have thin data (<=5 filled fields).
    The knowledge_edges table has 700K+ edges for computing connection counts.
  • 2026-04-16: Updated THIN_THRESHOLD from 5 to 6 to capture entities with exactly
    5 boilerplate fields. Increased the scan limit from 500 to 2000 to find 20 thin entities
    among the most connected.
  • 2026-04-16: Ran the enrichment script successfully. All 20 entities were enriched from 5
    fields to 6-11 fields each, adding structured data from KG edges, canonical entities,
    and wiki page content.
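The THIN_THRESHOLD change in the second log entry is easiest to see with a small example. The sketch below assumes the script compares with a strict `<`, which is inferred from the log entry rather than confirmed by it. `count_filled`, `is_thin`, and the sample values in `boilerplate` are hypothetical; only the five field names come from this spec.

```python
THIN_THRESHOLD = 6  # assumed semantics: thin means strictly fewer filled fields than this


def count_filled(infobox):
    """Count fields whose value is not None, an empty string, or an empty container."""
    return sum(1 for v in infobox.values() if v not in (None, "", [], {}))


def is_thin(infobox, threshold=THIN_THRESHOLD):
    """Return True when the infobox has fewer filled fields than the threshold."""
    return count_filled(infobox) < threshold


# An infobox holding only the five auto-generated fields (values are illustrative).
boilerplate = {
    "name": "cancer",
    "type": "disease",
    "knowledge_graph_edges": 8021,
    "top_relations": ["associated_with"],
    "associated_diseases": ["tumor"],
}

print(is_thin(boilerplate, threshold=5))  # old threshold: the entity is skipped
print(is_thin(boilerplate))               # new threshold: the entity is selected
```

Under this reading, a threshold of 5 would miss exactly the entities the task targets, since every auto-generated infobox already has five filled fields.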

Tasks using this spec (1)
[Atlas] Add infoboxes to top 20 most-connected wiki entities
Atlas done P88
File: 466d13f2_ec3_spec.md
Modified: 2026-04-25 23:40
Size: 4.1 KB