Spec: [Atlas] CI: Cross-link new wiki pages to KG entities

> ## Continuous-process anchor
>
> This spec describes an instance of one of the retired-script themes
> documented in docs/design/retired_scripts_patterns.md. Before
> implementing, read:
>
> 1. The "Design principles for continuous processes" section of that
> atlas — every principle is load-bearing. In particular:
> - LLMs for semantic judgment; rules for syntactic validation.
> - Gap-predicate driven, not calendar-driven.
> - Idempotent + version-stamped + observable.
> - No hardcoded entity lists, keyword lists, or canonical-name tables.
> - Three surfaces: FastAPI + orchestra + MCP.
> - Progressive improvement via outcome-feedback loop.
> 2. The theme entry in the atlas matching this task's capability:
> A4 (pick the closest from Atlas A1–A7, Agora AG1–AG5,
> Exchange EX1–EX4, Forge F1–F2, Senate S1–S8, Cross-cutting X1–X2).
> 3. If the theme is not yet rebuilt as a continuous process, follow
> docs/planning/specs/rebuild_theme_template_spec.md to scaffold it
> BEFORE doing the per-instance work.
>
> **Specific scripts named below in this spec are retired and must not
> be rebuilt as one-offs.** Implement (or extend) the corresponding
> continuous process instead.

Task ID: d20e0e93-fdbc-4487-b99e-0132b3e31684
Priority: 50
Type: recurring (daily)
Layer: Atlas

Goal

Find wiki_pages that have a kg_node_id set but no corresponding entry in node_wiki_links, and create the missing links. Also attempt to match wiki pages lacking kg_node_id to KG entities via title/slug patterns.

Acceptance Criteria

  • All wiki_pages with kg_node_id set have a corresponding node_wiki_links row (kg_node_id, wiki_entity_name = title)
  • Wiki pages without kg_node_id are matched to KG nodes where possible via title/slug heuristics
  • Run completes idempotently (re-running does not duplicate links)
  • Reports count of new links created
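
The title/slug heuristic in the second criterion is not specified further in this spec. A minimal sketch of one possible reading follows, assuming KG node IDs are plain strings (for example the union of knowledge_edges.source_id and target_id) and that a match means case-insensitive equality with the title, the slug, or the slug with its first hyphen-separated segment dropped. The function names `candidate_keys` and `match_page` are hypothetical.

```python
def candidate_keys(slug, title):
    """Return lowercase strings that might equal a KG node ID (assumed heuristic)."""
    keys = [title.strip().lower(), slug.strip().lower()]
    # Drop the leading slug segment, e.g. "entities-parkinson" -> "parkinson".
    _head, sep, tail = slug.strip().partition("-")
    if sep and tail:
        keys.append(tail.lower())
    return keys


def match_page(slug, title, kg_node_ids):
    """Return the single KG node ID matching the page, or None.

    Ambiguous pages (more than one distinct node ID matches) are left
    unmatched rather than guessed.
    """
    by_lower = {}
    for node_id in kg_node_ids:
        by_lower.setdefault(node_id.lower(), set()).add(node_id)

    hits = set()
    for key in candidate_keys(slug, title):
        hits |= by_lower.get(key, set())
    return next(iter(hits)) if len(hits) == 1 else None
```

Under these assumptions, `match_page("phenotypes-neurodegeneration", "Neurodegeneration", {"neurodegeneration"})` returns `"neurodegeneration"`, while a page such as MITF Gene with no equal node ID returns `None`. The prefix is stripped generically so that no list of known prefixes or entity names is hardcoded.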

Background

Tables involved:

  • wiki_pages: has kg_node_id column (set for ~13K of 17K pages)
  • node_wiki_links(kg_node_id, wiki_entity_name): join table for KG node ↔ wiki page links
  • knowledge_edges(source_id, target_id, ...): defines KG node IDs

Gap discovered:

  • 10,401 wiki_pages have kg_node_id set but no matching node_wiki_links entry
  • 3,778 wiki_pages have no kg_node_id (excluding index/navigation pages)
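
The two gap counts above can be expressed as a gap predicate over the tables listed in this section. The sketch below is illustrative only: it assumes a simplified schema (the slug column and column types are guesses), uses SQLite via Python's standard library, and does not model the exclusion of index/navigation pages.

```python
import sqlite3

# Assumed minimal schema for illustration; the real tables have more columns.
SCHEMA = """
CREATE TABLE wiki_pages (slug TEXT PRIMARY KEY, title TEXT, kg_node_id TEXT);
CREATE TABLE node_wiki_links (kg_node_id TEXT, wiki_entity_name TEXT);
"""

MISSING_LINKS_SQL = """
SELECT COUNT(*) FROM wiki_pages wp
WHERE wp.kg_node_id IS NOT NULL
  AND NOT EXISTS (
      SELECT 1 FROM node_wiki_links nwl
      WHERE nwl.kg_node_id = wp.kg_node_id
        AND nwl.wiki_entity_name = wp.title
  )
"""

NO_NODE_SQL = "SELECT COUNT(*) FROM wiki_pages WHERE kg_node_id IS NULL"


def count_gap(conn):
    """Return (pages with kg_node_id but no link row, pages without kg_node_id)."""
    missing_links = conn.execute(MISSING_LINKS_SQL).fetchone()[0]
    no_node = conn.execute(NO_NODE_SQL).fetchone()[0]
    return missing_links, no_node


def demo_db():
    """Build a tiny in-memory database; the slugs here are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO wiki_pages VALUES (?, ?, ?)",
        [
            ("genes-apoe", "APOE Gene", "ent-gene-apoe"),
            ("genes-mapt", "MAPT Gene", "ent-gene-mapt"),
            ("genes-mitf", "MITF Gene", None),
        ],
    )
    conn.execute(
        "INSERT INTO node_wiki_links VALUES (?, ?)", ("ent-gene-apoe", "APOE Gene")
    )
    return conn
```

With the demo data, `count_gap(demo_db())` reports one page missing its link (MAPT Gene) and one page without a kg_node_id (MITF Gene). A process driven by this predicate has nothing to do whenever the first count is zero.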

Implementation

Run crosslink_wiki_remaining.py (or equivalent inline script) that:

  • Selects wiki_pages rows WHERE kg_node_id IS NOT NULL AND NOT EXISTS (a matching node_wiki_links row)
  • Runs INSERT OR IGNORE INTO node_wiki_links(kg_node_id, wiki_entity_name) VALUES (kg_node_id, title) for each such page
  • For pages without kg_node_id, tries slug-pattern matching to find KG nodes
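
The select-and-insert steps above can be sketched as a single idempotent statement. This is a minimal sketch under stated assumptions, not the retired script: the schema is simplified, the function names are hypothetical, and a UNIQUE constraint on (kg_node_id, wiki_entity_name) is assumed, because INSERT OR IGNORE only skips rows that would violate a constraint.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wiki_pages (slug TEXT PRIMARY KEY, title TEXT, kg_node_id TEXT);
        CREATE TABLE node_wiki_links (
            kg_node_id TEXT NOT NULL,
            wiki_entity_name TEXT NOT NULL,
            UNIQUE (kg_node_id, wiki_entity_name)  -- assumed constraint
        );
        """
    )
    # Slugs are hypothetical; the node IDs and titles appear in the work log.
    conn.executemany(
        "INSERT INTO wiki_pages VALUES (?, ?, ?)",
        [
            ("genes-apoe", "APOE Gene", "ent-gene-apoe"),
            ("genes-mapt", "MAPT Gene", "ent-gene-mapt"),
            ("genes-mitf", "MITF Gene", None),
        ],
    )
    return conn


def create_missing_links(conn):
    """Insert one link per page that has a kg_node_id but no link row.

    Returns the number of rows actually inserted, so a re-run reports 0.
    """
    before = conn.total_changes
    conn.execute(
        """
        INSERT OR IGNORE INTO node_wiki_links (kg_node_id, wiki_entity_name)
        SELECT wp.kg_node_id, wp.title
        FROM wiki_pages wp
        WHERE wp.kg_node_id IS NOT NULL
          AND NOT EXISTS (
              SELECT 1 FROM node_wiki_links nwl
              WHERE nwl.kg_node_id = wp.kg_node_id
                AND nwl.wiki_entity_name = wp.title
          )
        """
    )
    conn.commit()
    return conn.total_changes - before
```

Calling `create_missing_links` twice on the demo database inserts two rows the first time and none the second, which is the idempotency property the acceptance criteria ask for. `INSERT OR IGNORE` is SQLite syntax; on PostgreSQL the equivalent form is `INSERT ... ON CONFLICT DO NOTHING`.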

Work Log

    2026-04-21 10:28 PT - Slot 54 (codex)

    Starting: Re-checked the PostgreSQL wiki/KG link backlog for this recurring run.

    Found:

    • 17,575 wiki_pages total; 16,669 have kg_node_id; 906 have no kg_node_id.
    • 0 pages with kg_node_id are missing exact node_wiki_links(kg_node_id, slug) rows.
    • 2 non-skipped pages without kg_node_id (entities-parkinson, phenotypes-neurodegeneration) already have exact existing node_wiki_links that can safely promote into wiki_pages.kg_node_id.

    Approach: Tighten scripts/ci_crosslink_wiki_kg.py Step 2 to promote only exact existing node-wiki links for candidate scientific pages, including phenotype and generic entity pages, then run the driver and verify idempotency.

    Action: Updated scripts/ci_crosslink_wiki_kg.py so Step 1 verifies exact (kg_node_id, slug) joins and Step 2 promotes only exact existing node_wiki_links into wiki_pages.kg_node_id. Ran the driver and updated 2 pages: entities-parkinson -> parkinson, phenotypes-neurodegeneration -> neurodegeneration.

    Verification: Re-ran python3 scripts/ci_crosslink_wiki_kg.py; second run reported step1=0 step2=0. Verified 0 exact missing node-wiki links, 16,671 wiki_pages with kg_node_id, 904 without, and 0 non-skipped unmatched pages.

    2026-04-04 05:15 PT — Slot 4

    Starting: Investigating state of wiki_pages and node_wiki_links.

    Found:

    • 14 wiki_pages with kg_node_id set
    • 2 of these (APOE Gene, MAPT Gene) were missing node_wiki_links entries
    • 1737 wiki_pages without kg_node_id

    Action: Added missing node_wiki_links for:

    • ent-gene-apoe -> APOE Gene
    • ent-gene-mapt -> MAPT Gene

    Verification: All 14 wiki_pages with kg_node_id now have node_wiki_links (0 missing).

    Slug-pattern matching: Ran a script to match wiki pages without kg_node_id to KG entities via title/slug heuristics. It found 0 high-confidence matches: none of the remaining 1,737 wiki pages is a canonical entity page matching a KG entity. Gene pages without kg_node_id (MITF Gene, ST6GALNAC5 Gene, TFEC Gene) don't correspond to any KG entity.

    Created: scripts/crosslink_wiki_to_kg.py for future matching.

    Result: Done — Primary acceptance criteria met. All wiki_pages with kg_node_id have corresponding node_wiki_links.

    2026-04-06 04:31 UTC — Slot 2

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,346 total pages, 3,836 without kg_node_id).

    Runs executed:

    • crosslink_wiki_to_kg.py: 16 new node_wiki_links (35,016 → 35,032)
    • crosslink_wiki_all.py: 13,911 new artifact_links (1,220,543 → 1,234,454), unlinked pages 85 → 62
    • crosslink_wiki_hypotheses.py: 1,905 new links/edges (381 hypothesis, 1,036 analysis, 407 KG edges, 81 entity-ID)

    Result: All acceptance criteria met.

    2026-04-10 08:28 UTC — Slot 5

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,435 total, 15,960 with kg_node_id, 1,475 without).

    Runs executed:

    • crosslink_wiki_to_kg.py: 0 new node_wiki_links (49,413 stable)
    • crosslink_wiki_all.py: blocked by pre-existing DB corruption in artifacts table (sqlite database disk image is malformed) — unrelated to node_wiki_links

    Note: DB corruption in artifacts table (multiple bad indexes, tree page reference errors) affects crosslink_wiki_all.py experiment crosslinking but does NOT affect node_wiki_links or wiki_pages tables which are intact.

    Result: All acceptance criteria met. Primary task (node_wiki_links for kg_node_id pages) is satisfied.

    2026-04-11 14:15 UTC — Slot 61 (minimax)

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,538 total, 15,991 with kg_node_id, 1,547 without).

    Runs executed:

    • crosslink_wiki_to_kg.py: 0 new node_wiki_links (50,653 stable)
    • crosslink_wiki_all.py: 4,382 new artifact_links (72 hypothesis, 1,405 analysis, 2,784 experiment, 121 KG-edge)

    Result: All acceptance criteria met. node_wiki_links stable at 50,653.

    2026-04-12 01:25 UTC — Slot 62 (minimax)

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,538 total, 15,991 with kg_node_id, 1,547 without).

    Runs executed:

    • crosslink_wiki_to_kg.py: 0 new node_wiki_links (50,653 stable)
    • crosslink_wiki_all.py: 146 new artifact_links (0 hypothesis, 146 analysis, 0 experiment, 0 KG-edge)
    • crosslink_wiki_hypotheses.py: 10 new links/edges (0 hypothesis, 10 analysis, 0 KG edges, 0 entity-ID)

    Result: All acceptance criteria met. node_wiki_links stable at 50,653. 156 new artifact_links created.

    2026-04-13 00:58 UTC — Slot (minimax)

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,539 total, 16,637 with kg_node_id, 861 without).

    Runs executed:

    • ci_crosslink_wiki_kg.py: 0 new node_wiki_links (52,966 stable)
    • Step 1: No missing links for pages with kg_node_id (already up to date)
    • Step 2: 861 pages without kg_node_id — 0 new matches found

    Result: All acceptance criteria met. node_wiki_links stable at 52,966. CI pass, nothing actionable.

    2026-04-13 11:45 UTC — Slot (minimax)

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,539 total, 16,637 with kg_node_id, 902 without).

    Runs executed:

    • crosslink_wiki_to_kg.py: 0 new node_wiki_links (54,416 stable)
    • Step 2: 902 pages without kg_node_id — 0 new matches found

    Result: All acceptance criteria met. node_wiki_links stable at 54,416. CI pass, nothing actionable.

    2026-04-13 11:45 UTC — Slot (minimax)

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,539 total, 16,637 with kg_node_id, 861 without).

    Runs executed:

    • ci_crosslink_wiki_kg.py: 0 new node_wiki_links (54,416 stable)
    • Step 1: No missing links for pages with kg_node_id (already up to date)
    • Step 2: 861 pages without kg_node_id — 0 new matches found

    Result: All acceptance criteria met. node_wiki_links stable at 54,416. CI pass, nothing actionable.

    2026-04-13 12:15 UTC — Slot (minimax)

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links (17,539 total, 16,637 with kg_node_id, 902 without).

    Runs executed:

    • crosslink_wiki_all.py: 64 new artifact_links (57 hypothesis, 7 KG-edge via analysis)
    • node_wiki_links: stable at 54,416

    Result: All acceptance criteria met. node_wiki_links stable at 54,416. Artifact links created: 57 Wiki-Hypothesis, 7 Wiki-Analysis via KG edges.

    2026-04-17 11:56 UTC — Slot 51 (glm-5)

    Verification: 0 wiki_pages with kg_node_id missing node_wiki_links after run (16,664 with kg_node_id, DB corruption prevents total count).

    Runs executed:

    • ci_crosslink_wiki_kg.py: 2 new node_wiki_links (54,467 → 54,469)
    • Step 1: Found 2 pages with kg_node_id missing node_wiki_links — inserted
    • Step 2: Blocked by DB corruption in knowledge_edges table (malformed)

    DB corruption note: wiki_pages table still has corruption from prior runs (total count fails), but node_wiki_links reads/writes succeed. artifact_entity_crosslink.py fails on corrupted artifacts table.

    Result: 2 new links created. All pages with kg_node_id now have node_wiki_links (0 unlinked).

    2026-04-17 18:10 UTC — Slot 51 (glm-5)

    Starting state: 17,574 total wiki_pages, 16,663 with kg_node_id, 911 without. 5 pages with kg_node_id missing node_wiki_links.

    Actions:

    • Step 1: Fixed 5 missing node_wiki_links (APOE, MAPT, LabDAO, Autophagy-Related Genes, Diabetic Retinopathy)
    • Step 2: Case-insensitive title matching found 8 wiki_pages without kg_node_id that match KG entities. Set kg_node_id and created links for: gamma-secretase, Investment Landscape, ALZHEIMER, AUTOPHAGY, Inflammation, neurodegeneration, Parkinson, Akt
    • 3 additional node_wiki_links created for newly-matched pages (INFLAMMATION, PARKINSON, AKT titles)

    Totals: 8 new node_wiki_links (54,469 → 54,479), 8 wiki_pages gained kg_node_id (16,663 → 16,671). Remaining 903 pages without kg_node_id are mostly institutions, researchers, and companies with no corresponding KG entity.

    Result: 0 pages with kg_node_id missing node_wiki_links. All acceptance criteria met.

    Tasks using this spec (1)
    [Atlas] CI: Cross-link new wiki pages to KG entities
    Atlas open P80
    File: d20e0e93-fdbc-4487-b99e-0132b3e31684_spec.md
    Modified: 2026-04-25 23:40
    Size: 10.8 KB