[Atlas] Add infoboxes to top 20 most-connected wiki entities done

536 wiki entities lack infobox_data. Add structured infoboxes (JSON with type, description, key properties) to the 20 most-connected entities that don't have one yet.

## REOPENED TASK: CRITICAL CONTEXT

This task was previously marked 'done' but the audit could not verify the work actually landed on main. The original work may have been:

- Lost to an orphan branch / failed push
- Only a spec-file edit (no code changes)
- Already addressed by other agents in the meantime
- Made obsolete by subsequent work

**Before doing anything else:**

1. **Re-evaluate the task in light of CURRENT main state.** Read the spec and the relevant files on origin/main NOW. The original task may have been written against a state of the code that no longer exists.
2. **Verify the task still advances SciDEX's aims.** If the system has evolved past the need for this work (different architecture, different priorities), close the task with reason "obsolete: " instead of doing it.
3. **Check if it's already done.** Run `git log --grep=''` and read the related commits. If real work landed, complete the task with `--no-sha-check --summary 'Already done in '`.
4. **Make sure your changes don't regress recent functionality.** Many agents have been working on this codebase. Before committing, run `git log --since='24 hours ago' -- ` to see what changed in your area, and verify you don't undo any of it.
5. **Stay scoped.** Only do what this specific task asks for. Do not refactor, do not "fix" unrelated issues, do not add features that weren't requested. Scope creep at this point is regression risk.

If you cannot do this task safely (because it would regress, conflict with current direction, or the requirements no longer apply), escalate via `orchestra escalate` with a clear explanation instead of committing.

Git Commits (2)

2026-04-21  [Verify] Add infoboxes to top 20 most-connected wiki entities — already resolved [task:466d13f2-ec31-44b6-8230-20eb239b6b3a]
2026-04-16  Squash merge: orchestra/task/466d13f2-add-infoboxes-to-top-20-most-connected-w (1 commits)

Spec File

Spec: Add infoboxes to top 20 most-connected wiki entities

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
> atlas — every principle is load-bearing. In particular:
> - LLMs for semantic judgment; rules for syntactic validation.
> - Gap-predicate driven, not calendar-driven.
> - Idempotent + version-stamped + observable.
> - No hardcoded entity lists, keyword lists, or canonical-name tables.
> - Three surfaces: FastAPI + orchestra + MCP.
> - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
> A5 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
> Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
> docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
> BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

Task ID: 466d13f2-ec31-44b6-8230-20eb239b6b3a Layer: Atlas Status: Done

Goal

536+ wiki entities had thin infobox_data (only auto-generated fields: name, type,
knowledge_graph_edges, top_relations, associated_diseases). Enrich the 20 most-connected
of these with structured data from the knowledge graph, canonical_entities, wiki_pages,
and the targets table.

Approach

  • Query knowledge_edges (700K+ edges) to compute per-entity connection counts
  • Join with wiki_entities to find entities with thin infoboxes (<=5 filled fields)
  • Sort by connection count descending, pick top 20 unique entities
  • For each entity, enrich infobox using:
    - canonical_entities: aliases, external IDs (UniProt, NCBI, Ensembl), description
    - wiki_pages: definition paragraph extraction
    - knowledge_edges: associated genes, diseases, drugs, pathways, cell types, interactions, brain regions
    - targets: protein name, mechanism, target class (for gene/protein entities)
  • Merge new data into existing infobox (preserving any existing non-empty values)
  • Write via db_writes.save_wiki_entity for journaled updates
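
The selection and merge steps above can be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not the retired script and not the continuous process itself. The edge pairs, the `infoboxes` dict, and the helper names `is_filled`, `count_connections`, `pick_thin_top_n`, and `merge_infobox` are all hypothetical. It also assumes that an edge counts once for each of its two endpoints. The real process would read from knowledge_edges and wiki_entities and write through db_writes.save_wiki_entity.

```python
from collections import Counter


def is_filled(value):
    """Treat None and empty strings/lists/dicts as unfilled (assumed definition)."""
    return value is not None and value != "" and value != [] and value != {}


def count_connections(edges):
    """Count edges per entity; each edge is a (source, target) pair."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


def pick_thin_top_n(counts, infoboxes, n=20, max_filled=5):
    """Walk entities by descending connection count and keep the first n
    whose infobox has at most max_filled filled fields."""
    picked = []
    for entity, _ in counts.most_common():
        box = infoboxes.get(entity)
        if box is None:
            continue  # no wiki entity for this graph node
        if sum(is_filled(v) for v in box.values()) <= max_filled:
            picked.append(entity)
            if len(picked) == n:
                break
    return picked


def merge_infobox(existing, new):
    """Add new fields without overwriting existing non-empty values."""
    merged = dict(existing)
    for key, value in new.items():
        if is_filled(value) and not is_filled(merged.get(key)):
            merged[key] = value
    return merged
```

Because `merge_infobox` only fills keys that are missing or empty, running it twice with the same input leaves the infobox unchanged, which fits the idempotence principle in the anchor section.
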

Enriched Entities

    Entity                      Type                Connections  Before  After
    Genes                       index                      8142       5      7
    cancer                      disease                    8021       5     11
    autophagy                   biological_process         7580       5     11
    Alzheimer'S Disease         disease                    7323       5     10
    APOPTOSIS                   phenotype                  7055       5      6
    inflammation                pathway                    6960       5     10
    Microglia                   cell_type                  5972       5     10
    AGING                       phenotype                  4818       5      6
    TUMOR                       disease                    4721       5      9
    neuroinflammation           biological_process         4676       5     11
    CYTOKINES                   concept                    4349       5      9
    mitochondria                organelle                  4157       5      8
    ASTROCYTES                  concept                    4155       5     10
    NEURODEGENERATIVE DISEASES  concept                    4117       5     10
    ferroptosis                 phenotype                  3201       5      7
    MITOPHAGY                   biological_process         3135       5     10
    MITOCHONDRIAL DYSFUNCTION   concept                    3048       5     10
    neurons                     cell_type                  2783       5      7
    Astrocyte                   cell_type                  2685       5     10
    Erk                         protein                    2554       5      9

    Files

    • enrichment/enrich_top20_connected_infoboxes.py — enrichment script

    Work Log

    • 2026-04-16: Initial investigation. Found all 13,640 wiki_entities already have
    infobox_data (auto-generated), but 1,187 have thin data (<=5 filled fields).
    The knowledge_edges table has 700K+ edges for computing connection counts.
    • 2026-04-16: Updated THIN_THRESHOLD from 5 to 6 to capture entities with exactly
    5 boilerplate fields. Increased scan limit from 500 to 2000 to find 20 thin entities
    among the most connected.
    • 2026-04-16: Ran enrichment script successfully. All 20 entities enriched from 5
    fields to 6-11 fields each, adding structured data from KG edges, canonical entities,
    and wiki page content.
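
The threshold and scan-limit changes in the log can be illustrated with a small sketch. It assumes the script compared the filled-field count with a strict `<` against THIN_THRESHOLD, which would explain why a value of 5 missed entities holding exactly the five auto-generated fields. The helpers `filled_count`, `is_thin`, and `scan_for_thin` are hypothetical names, not the script's own.

```python
THIN_THRESHOLD = 6  # value after the 2026-04-16 change (was 5)
SCAN_LIMIT = 2000   # value after the 2026-04-16 change (was 500)


def filled_count(infobox):
    """Number of fields that are not None or an empty string/list/dict."""
    return sum(1 for v in infobox.values() if v not in (None, "", [], {}))


def is_thin(infobox, threshold=THIN_THRESHOLD):
    # Assumption: strict comparison, so threshold=5 excludes five-field boxes.
    return filled_count(infobox) < threshold


def scan_for_thin(ranked_entities, infoboxes, wanted=20,
                  scan_limit=SCAN_LIMIT, threshold=THIN_THRESHOLD):
    """Look only at the first scan_limit entities (already ranked by
    connection count) and stop once `wanted` thin ones are found."""
    found = []
    for entity in ranked_entities[:scan_limit]:
        if entity in infoboxes and is_thin(infoboxes[entity], threshold):
            found.append(entity)
            if len(found) == wanted:
                break
    return found
```

With a scan limit that is too small, the loop can run out of candidates before reaching the wanted count, which matches the log's reason for raising the limit from 500 to 2000.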

    Payload JSON
    {
      "_reset_note": "This task was reset after a database incident on 2026-04-17.\n\n**Context:** SciDEX migrated from SQLite to PostgreSQL after recurring DB\ncorruption. Some work done during Apr 16-17 may have been lost.\n\n**Before starting work:**\n1. Check if the task's goal is ALREADY satisfied (run the relevant checks)\n2. Check `git log --all --grep=task:YOUR_TASK_ID` for prior commits\n3. If complete, verify and mark done. If partial, continue. If not done, proceed.\n\n**DB change:** SciDEX now uses PostgreSQL. `get_db()` auto-detects via\nSCIDEX_DB_BACKEND=postgres env var.",
      "_reset_at": "2026-04-18T06:29:22.046013+00:00",
      "_reset_from_status": "done"
    }
