[Senate] Link isolated artifacts into the governance graph


Goal

Link isolated artifacts into the artifact governance graph. Artifacts without artifact_links cannot participate in provenance, lifecycle review, or discovery-dividend backpropagation.

Acceptance Criteria

☐ A concrete batch of isolated artifacts gains artifact_links edges or documented no-link rationale
☐ Each link is derived from entity_ids, parent_version_id, dependencies, provenance_chain, or related DB rows
☐ Low-confidence name-only guesses are not inserted
☐ Before/after isolated artifact counts are recorded

Approach

  • Select artifacts with no incoming or outgoing links, ordered by usage_score, citation_count, and recency.
  • Infer relationships from entity_ids, provenance_chain, dependencies, versions, analyses, or hypotheses.
  • Insert only high-confidence artifact_links through the standard DB path.
  • Verify governance graph connectivity counts and inspect a sample.
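
The selection step above can be sketched as follows. This is a minimal illustration, not the backfill script itself: an in-memory SQLite database stands in for the live PostgreSQL instance, the link_type column and the sample rows are assumptions, and select_isolated is a hypothetical helper name. On PostgreSQL the placeholder would be %s and the ordering would normally add NULLS LAST.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the live PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifacts (
    id TEXT PRIMARY KEY,
    usage_score REAL,
    citation_count INTEGER,
    created_at TEXT
);
CREATE TABLE artifact_links (
    source_artifact_id TEXT,
    target_artifact_id TEXT,
    link_type TEXT  -- assumed column name
);
INSERT INTO artifacts VALUES
    ('figure-a',   0.9, 3, '2026-04-01'),
    ('figure-b',   0.5, 7, '2026-04-02'),
    ('figure-c',   0.5, 7, '2026-04-03'),
    ('analysis-x', 0.7, 1, '2026-04-01');
INSERT INTO artifact_links VALUES ('figure-a', 'analysis-x', 'derives_from');
""")

# An artifact is isolated when it appears on neither side of any link.
ISOLATED_SQL = """
SELECT a.id
FROM artifacts a
WHERE NOT EXISTS (
    SELECT 1 FROM artifact_links l
    WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
)
ORDER BY a.usage_score DESC, a.citation_count DESC, a.created_at DESC
LIMIT ?
"""


def select_isolated(conn, limit):
    """Return IDs of isolated artifacts, highest priority first."""
    return [row[0] for row in conn.execute(ISOLATED_SQL, (limit,))]


isolated = select_isolated(conn, 50)
print(isolated)
```

In this toy data set figure-a and analysis-x are connected to each other, so only figure-b and figure-c are returned, with the more recent one first because their scores and citation counts tie.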

    Dependencies

    • 58079891-7a5 - Senate quest

    Dependents

    • Artifact lifecycle governance, provenance, and discovery dividends

    Work Log

    2026-04-21 - Quest engine template

    • Created reusable spec for quest-engine generated artifact link backfill tasks.

    2026-04-21 20:30 UTC — Task ebdcb998 (slot 40)

    Infrastructure blocker: The Bash tool is completely non-functional in this agent session.
    Every shell command fails immediately with EROFS: read-only file system, mkdir
    '/home/ubuntu/Orchestra/data/claude_creds/max_outlook/session-env/9c56830e-4629-43e4-ab78-e0bffcf06cb4'.
    The pre-exec harness hook cannot create the session-env directory because that path
    lives on a read-only filesystem. Sub-agents spawned via the Agent tool have the same
    issue. Python scripts, git commands, and orchestra CLI are all inaccessible.

    Work completed despite blocker:

  • Read AGENTS.md, this spec, and artifact-governance.md to understand the system.
  • Analysed the artifacts, artifact_links, hypotheses, analyses, notebooks, and
    knowledge_edges table schemas.
  • Read scidex/atlas/artifact_registry.py, backfill/backfill_artifacts.py,
    scidex/core/database.py, and quest_engine.py (lines 1490–1544) to understand
    the query for counting isolated artifacts and how links should be inserted.
  • Wrote a complete, production-quality backfill script at
    scripts/backfill_isolated_artifact_links.py. The script:
    - Counts isolated artifacts BEFORE (query matches quest_engine.py's isolation query)
    - Processes the top-50 isolated artifacts ordered by quality_score DESC, created_at DESC
    - Uses 12 high-confidence inference strategies (no name-only guesses):
      a. parent_version_id → derives_from (strength 1.0)
      b. provenance_chain JSON entries → typed links (strength 0.9)
      c. metadata.analysis_id / metadata.source_analysis_id → derives_from (1.0)
      d. metadata.hypothesis_id → mentions (0.9)
      e. metadata.gap_id → mentions (0.85)
      f. metadata.source_notebook_id → derives_from (0.9)
      g. Cross-table: hypotheses.analysis_id → derives_from (1.0)
      h. Cross-table: analyses.gap_id → extends (0.9)
      i. Cross-table: notebooks.associated_analysis_id → derives_from (1.0)
      j. Cross-table: knowledge_edges.analysis_id → derives_from (1.0)
      k. Cross-table: hypothesis evidence for/against PMID → cites analysis (0.85)
      l. entity_ids → wiki artifact mentions (0.8, only when wiki artifact confirmed)
    - Uses PostgreSQL ON CONFLICT DO NOTHING for safe upserts
    - Uses scidex.core.database.JournalContext for provenance tracking
    - Counts isolated artifacts AFTER and prints a summary report
    - Supports --dry-run and --limit N flags
  • Could not execute the script (Bash blocked) or commit it (git blocked).
  • Script to run when bash is restored:

    cd /home/ubuntu/scidex
    python3 scripts/backfill_isolated_artifact_links.py --dry-run --limit 50
    python3 scripts/backfill_isolated_artifact_links.py --limit 50
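
    The metadata-driven strategies listed above amount to a small lookup table from a field name to a link type and strength. The sketch below shows that idea only. CandidateLink, METADATA_RULES, and infer_candidates are hypothetical names, the example artifact is invented, and the real script may structure this differently. Candidates still have to be resolved to an existing artifact before anything is inserted.

```python
from typing import NamedTuple


class CandidateLink(NamedTuple):
    source_id: str
    target_ref: str   # raw ID from the artifact; must still be resolved to an artifact
    link_type: str
    strength: float
    evidence: str     # which field produced the candidate


# (metadata key, link type, strength), copied from the strategy list in this log entry.
METADATA_RULES = [
    ("analysis_id", "derives_from", 1.0),
    ("source_analysis_id", "derives_from", 1.0),
    ("hypothesis_id", "mentions", 0.9),
    ("gap_id", "mentions", 0.85),
    ("source_notebook_id", "derives_from", 0.9),
]


def infer_candidates(artifact: dict) -> list[CandidateLink]:
    """Collect candidate links from explicit ID fields only, never from names or titles."""
    links = []
    parent = artifact.get("parent_version_id")
    if parent:
        links.append(
            CandidateLink(artifact["id"], str(parent), "derives_from", 1.0, "parent_version_id")
        )
    metadata = artifact.get("metadata") or {}
    for key, link_type, strength in METADATA_RULES:
        value = metadata.get(key)
        if value:
            links.append(
                CandidateLink(artifact["id"], str(value), link_type, strength, f"metadata.{key}")
            )
    return links


# Hypothetical artifact: the title is ignored because name-only guesses are not allowed.
example = {
    "id": "figure-123",
    "parent_version_id": None,
    "metadata": {"analysis_id": "sda-1", "gap_id": "gap-9", "title": "TREM2 figure"},
}
candidates = infer_candidates(example)
for link in candidates:
    print(link)
```

    Keeping the rules in a table makes it easy to audit which fields can produce links and at what strength.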

    Before count: Unknown (could not query DB). Quest engine spawned this task because
    count was > 0 at task creation time (2026-04-21T19:54:00Z).

    Next steps for follow-on agent:

  • Verify session-env directory issue is resolved (try echo test in Bash)
  • cd to worktree or main repo
  • Run the backfill script (dry-run first, then real run)
  • Confirm before/after counts in summary output
  • Commit: [Senate] Backfill artifact_links for 50 isolated artifacts [task:ebdcb998-cfec-4280-ba56-12f0ff280bea]
  • Push branch and let supervisor auto-complete

    2026-04-21 - Task ebdcb998 retry (slot 40, second attempt)

    Infrastructure blocker persists — root cause identified:

    Orchestra sets CLAUDE_CONFIG_DIR=/home/ubuntu/Orchestra/data/claude_creds/max_outlook/
    in the subprocess environment (see orchestra/auth.py lines 918–926). Claude Code's
    bridge REPL v2 (tengu_bridge_repl_v2: true) then attempts to create {CLAUDE_CONFIG_DIR}/session-env/<UUID>/ for shell-state persistence. This fails with EROFS: read-only file system because that path lives on a read-only mount.

    Fix required (by human operator or supervisor):

    • Option A (preferred): mkdir -p /home/ubuntu/Orchestra/data/claude_creds/max_outlook/session-env
    and ensure the mount is writable, OR
    • Option B: Change config_dir for the max_outlook account in Orchestra's DB to a
    writable directory (e.g. /tmp/claude-max-outlook), OR
    • Option C: Disable tengu_bridge_repl_v2 feature for automated workers by setting
    CLAUDE_DISABLE_BRIDGE_REPL=1 (if that env var is respected).
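
    Whichever option is chosen, a follow-on agent or operator can confirm the fix by probing whether a directory can be created under the configured path. The sketch below is one way to do that in Python. probe_session_env is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real CLAUDE_CONFIG_DIR.

```python
import errno
import os
import tempfile
import uuid


def probe_session_env(config_dir: str) -> tuple[bool, str]:
    """Try to create, then remove, a throwaway directory under <config_dir>/session-env."""
    probe = os.path.join(config_dir, "session-env", f"probe-{uuid.uuid4()}")
    try:
        os.makedirs(probe)
    except OSError as exc:
        if exc.errno == errno.EROFS:
            return False, "read-only file system"
        return False, exc.strerror or str(exc)
    os.rmdir(probe)  # remove only the probe; session-env itself is left in place
    return True, "writable"


# Demonstration against a temporary directory (stand-in for the real config dir).
with tempfile.TemporaryDirectory() as tmp:
    ok, reason = probe_session_env(tmp)
    print(ok, reason)
```

    A False result with "read-only file system" would correspond to the EROFS failure described above.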

    Work completed this session:

    • Confirmed scripts/backfill_isolated_artifact_links.py exists and is complete (written
    by prior agent slot 40, first attempt)
    • Confirmed Bash and Write tools both fail with EROFS on the session-env path
    • Traced root cause to CLAUDE_CONFIG_DIR pointing at read-only filesystem
    • Could not commit or run the script — waiting on infrastructure fix

    2026-04-21 20:43 UTC — Task ebdcb998 completion

    • Reviewed prior related branch orchestra/task/98628b02-link-50-isolated-artifacts-into-the-gove; it had a separate verification-only spec update and was not present on origin/main for this task.
    • Confirmed live PostgreSQL schema uses artifacts.id and artifact_links.source_artifact_id / target_artifact_id; artifact_links has no natural unique constraint, so the backfill script checks duplicates explicitly before insert.
    • Added scripts/backfill_isolated_artifact_links.py to scan isolated artifacts by usage_score, citation_count, and created_at, infer only high-confidence links from metadata, entity IDs, provenance/dependencies, and related rows, and stop after 50 artifacts gain links.
    • Dry run: python3 scripts/backfill_isolated_artifact_links.py --dry-run --limit 50 --scan-limit 1000 scanned 443 isolated artifacts and found 50 linkable figure artifacts, with 61 candidate links.
    • Executed: python3 scripts/backfill_isolated_artifact_links.py --limit 50 --scan-limit 1000.
    • Before isolated count: 17,088
    • After isolated count: 17,035
    • Reduction: 53 isolated artifacts, because 50 source figures gained links and three previously isolated target wiki artifacts also became connected by inbound links.
    • Rows inserted: 61 artifact_links rows: 50 derives_from links from figure metadata analysis_id to existing analysis artifacts, plus 11 mentions links from direct entity_ids to existing wiki artifacts.
    • Sample verification queries confirmed:
    - figure-7f7b14e2f8bc -> analysis-sda-2026-04-01-gap-008, derives_from, evidence metadata.analysis_id = sess_sda-2026-04-01-gap-008
    - figure-31940a5cb4cd -> analysis-sda-2026-04-01-gap-20260401231108, derives_from, and wiki-neurodegeneration, mentions
    - figure-2ef32bff5b51 -> analysis-sda-2026-04-01-gap-v2-68d9c9c1, wiki-TAU, and wiki-TFEB
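
    Because artifact_links has no unique constraint to lean on, the insert path has to check for an existing row itself. The sketch below illustrates that check together with the exact-ID target check. It is not the script's code: SQLite stands in for PostgreSQL, the link_type and strength columns are assumed, and insert_link_if_new is a hypothetical helper.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; column names beyond
# source_artifact_id / target_artifact_id are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifacts (id TEXT PRIMARY KEY);
CREATE TABLE artifact_links (
    source_artifact_id TEXT,
    target_artifact_id TEXT,
    link_type TEXT,
    strength REAL
);
INSERT INTO artifacts VALUES ('figure-1'), ('analysis-1');
""")


def insert_link_if_new(conn, source_id, target_id, link_type, strength, dry_run=False):
    """Return True when a link was inserted (or would be, in dry-run mode)."""
    # Only link to a target whose artifact ID exists exactly.
    target = conn.execute("SELECT 1 FROM artifacts WHERE id = ?", (target_id,)).fetchone()
    if target is None:
        return False
    # Explicit duplicate check, since the table has no unique constraint.
    duplicate = conn.execute(
        "SELECT 1 FROM artifact_links "
        "WHERE source_artifact_id = ? AND target_artifact_id = ? AND link_type = ?",
        (source_id, target_id, link_type),
    ).fetchone()
    if duplicate is not None:
        return False
    if not dry_run:
        conn.execute(
            "INSERT INTO artifact_links VALUES (?, ?, ?, ?)",
            (source_id, target_id, link_type, strength),
        )
    return True


results = [
    insert_link_if_new(conn, "figure-1", "analysis-1", "derives_from", 1.0),
    insert_link_if_new(conn, "figure-1", "analysis-1", "derives_from", 1.0),        # duplicate
    insert_link_if_new(conn, "figure-1", "analysis-missing", "derives_from", 1.0),  # unknown target
]
row_count = conn.execute("SELECT COUNT(*) FROM artifact_links").fetchone()[0]
print(results, row_count)
```

    A check-then-insert sequence like this is only safe when a single writer runs at a time; concurrent runs would need a unique index or a transaction-level lock.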

    Acceptance Criteria Status — Task ebdcb998

    ☑ A concrete batch of isolated artifacts gains artifact_links edges — 50 artifacts gained 61 links.
    ☑ Each link is derived from entity_ids, parent_version_id, dependencies, provenance_chain, or related DB rows — this batch used metadata.analysis_id and direct entity_ids only.
    ☑ Low-confidence name-only guesses are not inserted — the script only inserts links when the target artifact ID exists exactly.
    ☑ Before/after isolated artifact counts are recorded — 17,088 before, 17,035 after.

    2026-04-21 20:49 UTC — Task ebdcb998 live run (slot 71)

    Script executed against live PostgreSQL. No infrastructure blockers this run.

    Execution results:

    BEFORE: 17035 isolated artifacts
    Scanned: 444 isolated artifacts (scan-limit=500)
    Artifacts that gained links: 50
    Total links inserted: 68
    AFTER: 16985 isolated artifacts
    Reduction: 50 (artifacts now connected to governance graph)

    Links by type:

    • derives_from (metadata.analysis_id, provenance_chain, parent_version_id): majority
    • mentions (entity_ids → wiki artifacts): 18 links (TREM2, TYROBP, microglia, neurodegeneration)

    Sample inserted links:

    • figure-2eeef7deaf70 → analysis-SDA-2026-04-01-gap-001 (derives_from, strength 1.0, metadata.analysis_id)
    • figure-62c5cb7b0edc → wiki-TREM2, wiki-TYROBP (mentions, strength 0.8, entity_ids)

    Verification query:

    SELECT COUNT(*) FROM artifacts a
    WHERE NOT EXISTS (SELECT 1 FROM artifact_links l
      WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id)
    -- Result: 16985 (was 17035 before this run)

    Acceptance criteria status:

    ☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 68 total edges
    ☑ Each link derived from entity_ids, parent_version_id, dependencies, provenance_chain, or related DB rows
    ☑ No low-confidence name-only guesses — all targets verified to exist before insert
    ☑ Before/after counts recorded — 17035 → 16985

    2026-04-21 14:50 UTC — Task fde80239 (slot 73)

    Bug fixed: _infer_from_paper_citations used LIKE on JSONB columns evidence_for and evidence_against, which fails silently with operator does not exist: jsonb ~~ unknown. Fixed by casting to text: evidence_for::text LIKE %s.
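
    The shape of the corrected query can be sketched as below. This assumes a psycopg-style driver with %s placeholders, as the fix quoted above suggests. build_pmid_evidence_query is a hypothetical helper, and the selected columns are illustrative. The function only builds the SQL and parameters, so it can be inspected without a database.

```python
def build_pmid_evidence_query(pmid: str) -> tuple[str, tuple[str, str]]:
    """Build a query that finds hypotheses whose evidence JSON mentions a PMID.

    LIKE is not defined for jsonb, so each column is cast to text first.
    """
    sql = (
        "SELECT id, analysis_id FROM hypotheses "
        "WHERE evidence_for::text LIKE %s OR evidence_against::text LIKE %s"
    )
    # The wildcards travel inside the bound parameter, not in the SQL string.
    pattern = f"%{pmid}%"
    return sql, (pattern, pattern)


sql, params = build_pmid_evidence_query("12345678")
print(sql)
print(params)
```

    A substring match on the text form can also hit longer numbers that contain the PMID, so results from a query like this are best confirmed against the parsed JSON before a link is inserted.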

    Execution results:

    BEFORE: 17101 isolated artifacts
    Scanned: 561 isolated artifacts (scan-limit=2000)
    Artifacts that gained links: 50
    Total links inserted: 70
    AFTER: 17050 isolated artifacts
    Reduction: 51

    Note: 116 new isolated artifacts were added since ebdcb998 ran (which ended at 16985). This run's 50-artifact batch recovers most of that drift and advances the graph connectivity.

    Links by type:

    • derives_from (metadata.analysis_id): majority
    • mentions (entity_ids → wiki artifacts): TREM2, TYROBP, neurodegeneration, PI3K, TFEB, APOE

    Sample inserted links:

    • figure-c29e2fec5b3e → analysis-SDA-2026-04-01-gap-001 (derives_from, strength 1.0, metadata.analysis_id)
    • figure-c29e2fec5b3e → wiki-neurodegeneration, wiki-TREM2 (mentions, strength 0.8, entity_ids)

    Acceptance criteria status:

    ☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 70 total edges
    ☑ Each link derived from metadata.analysis_id, entity_ids, provenance_chain, parent_version_id
    ☑ No low-confidence name-only guesses — all targets verified to exist before insert
    ☑ Before/after counts recorded — 17101 → 17050

    2026-04-22 14:15 UTC — Task e6e84211 (slot 73)

    Task: Link 40 isolated artifacts into provenance governance graph.

    Key findings:

    • Top isolated artifacts by created_at DESC are rigor_score_cards and paper_figures with UUID-style PMIDs — most have no linkable targets
    • rigor_score_cards: scored_entity_id (hypothesis ID) exists in hypotheses table, but hypothesis artifact often doesn't exist; can link via hypothesis→analysis chain
    • paper_figures: UUID PMIDs don't map to paper artifacts; only figures with numeric PMIDs can have paper artifact targets, and only 6% of those do
    • Figures (12K+ isolated) are linkable via metadata.analysis_id but appear after 3K+ paper_figures in created_at ordering

    New script: scripts/backfill_task_e6e84211.py

    • Uses created_at DESC ordering (matching task query)
    • Adds new strategies:
      - rigor_score_card: scored_entity_id → hypothesis table → analysis_id → derives_from to analysis artifact (strength 1.0)
      - paper_figure: pmid → paper artifact lookup (cites, strength 0.9)
    • Fixes case-sensitivity bug in _artifact_candidates for sda-/SDA- analysis IDs
    • Scans up to 10,000 isolated artifacts to find 40 linkable ones

    Execution results:

    BEFORE: 19538 isolated artifacts
    Scanned: 3005 isolated artifacts
    Artifacts that gained links: 40
    Total links inserted: 41
    AFTER: 19497 isolated artifacts
    Reduction: 41

    Links by type:

    • rigor_score_card → analysis: 6 links (via scored_entity→hypothesis→analysis chain)
    • paper_figure → paper: 28 links (via numeric PMID matching paper artifact)
    • figure → analysis/wiki: 7 links (via metadata.analysis_id and entity_ids)

    Verification:

    -- Isolated count after run: 19497 (was 19538)
    SELECT COUNT(*) FROM artifacts a
    WHERE NOT EXISTS (
        SELECT 1 FROM artifact_links l
        WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
    )
    -- Result: 19497

    Acceptance criteria status:

    ☑ 40 isolated artifacts gain artifact_links edges — 40 gained links, 41 total edges
    ☑ Each link derived from scored_entity_id, pmid, metadata.analysis_id, entity_ids
    ☑ No low-confidence name-only guesses — all targets verified to exist before insert
    ☑ Before/after counts recorded — 19538 → 19497

    2026-04-22 18:07 UTC — Task 3a5b980b (slot 70)

    Problem identified: The backfill script's ordering caused it to scan thousands of isolated paper_figures (top of usage_score DESC NULLS LAST ordering with all having 0.5 usage_score) before finding linkable figure artifacts. Paper_figures with UUID PMIDs can't be linked to paper artifacts.

    Fixes applied to scripts/backfill_isolated_artifact_links.py:

  • Replaced global ordering with per-type processing: process artifacts by type in priority order (figure, notebook, analysis, hypothesis first — then paper_figure/wiki_page last).
  • Added case-insensitivity fix in _artifact_candidates for sess_SDA- prefix (stripped sess_ leaves SDA- uppercase; lowercased to sda- and added upper variant).
  • Each type is scanned in its own query with usage_score DESC NULLS LAST, citation_count DESC NULLS LAST, created_at DESC NULLS LAST ordering, limiting to --scan-limit per type.
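
    The case-insensitivity fix can be pictured as generating a short list of candidate artifact IDs for one raw analysis ID and then checking each for an exact match. The sketch below is an illustration, not the script's _artifact_candidates. The function name is hypothetical, and the analysis- prefix and sess_ handling are inferred from the sample links recorded in this log.

```python
def analysis_artifact_candidates(raw_id: str) -> list[str]:
    """Return candidate analysis artifact IDs for a raw analysis or session ID."""
    base = raw_id.strip()
    # Session IDs carry a sess_ prefix that the artifact IDs do not have.
    if base.lower().startswith("sess_"):
        base = base[len("sess_"):]
    variants = [base, base.lower()]
    # Some analysis IDs are stored with an upper-case SDA- prefix.
    if base.lower().startswith("sda-"):
        variants.append("SDA-" + base[len("sda-"):])
    # De-duplicate while preserving order.
    seen = set()
    candidates = []
    for variant in variants:
        candidate = f"analysis-{variant}"
        if candidate not in seen:
            seen.add(candidate)
            candidates.append(candidate)
    return candidates


candidates = analysis_artifact_candidates("sess_SDA-2026-04-01-gap-001")
print(candidates)
```

    Each candidate would still be matched exactly against artifacts.id, so trying several spellings does not loosen the rule against name-only guesses.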

    Execution results:

    BEFORE: 19631 isolated artifacts
    Scanned: 96 isolated figure artifacts
    Artifacts that gained links: 50
    Total links inserted: 67
    AFTER: 19581 isolated artifacts
    Reduction: 50

    Links by type:

    • figure → analysis: derives_from via metadata.analysis_id (strength 1.0 via general metadata handling + 0.95 via figure-specific handling)
    • figure → wiki: mentions via entity_ids (strength 0.8)
    • Sample: figure-d8b07236d415 → analysis-analysis-SEAAD-20260402
    • Sample: figure-27be44fcaf91 → analysis-SDA-2026-04-16-gap-pubmed-20260411-082446-2c1c9e2d + wiki-neurodegeneration + wiki-TREM2

    Verification:

    SELECT COUNT(*) FROM artifacts a
    WHERE NOT EXISTS (
        SELECT 1 FROM artifact_links l
        WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
    )
    -- Result: 19581 (was 19631 before this run)

    Acceptance criteria status:

    ☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 67 total edges
    ☑ Each link derived from metadata.analysis_id, entity_ids
    ☑ No low-confidence name-only guesses — all targets verified to exist before insert
    ☑ Before/after counts recorded — 19631 → 19581

    2026-04-22 15:52 UTC — Task 0fd31858 (slot 76)

    Task: Link 50 isolated artifacts into the governance graph.

    Problem identified: The backfill script's _infer_from_metadata only looked for paper_id in metadata, but paper_figure artifacts store the PMID under the key pmid (not paper_id). Additionally, figure artifacts (15K+ isolated) were not being linked despite having metadata.analysis_id that maps to existing analysis artifacts.

    Fixes applied to scripts/backfill_isolated_artifact_links.py:

  • Added pmid handling in _infer_from_metadata: when pmid is a numeric string (not a UUID), link paper_figure to paper-{pmid} artifact with cites link type (strength 0.9).
  • Added figure artifact type handling in _infer_from_metadata: link figure artifacts to analysis via metadata.analysis_id with derives_from link type (strength 0.95).
  • Increased default --scan-limit from 500 to 5000 because top-scored isolated artifacts are mostly unlinkable paper_figures with UUID PMIDs; need to scan deeper to find linkable figures and notebooks.
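
    The numeric-versus-UUID distinction behind the pmid fix is simple to state in code. The sketch below is illustrative only: paper_target_for_pmid is a hypothetical helper, and the returned ID still has to exist as an artifact before a cites link is inserted.

```python
import uuid


def paper_target_for_pmid(pmid):
    """Return 'paper-<pmid>' for numeric PMIDs, or None for UUIDs and anything else."""
    if pmid is None:
        return None
    text = str(pmid).strip()
    # Only plain ASCII digit strings count as real PMIDs.
    if text.isascii() and text.isdigit():
        return f"paper-{text}"
    return None


numeric = paper_target_for_pmid("31234567")
uuid_style = paper_target_for_pmid(str(uuid.uuid4()))
print(numeric, uuid_style)
```

    Returning None for UUID-style values lets the caller skip those paper_figures quickly instead of issuing a lookup that cannot succeed.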

    Execution results:

    BEFORE: 19513 isolated artifacts
    Scanned: 3072 isolated artifacts (scan-limit=5000)
    Artifacts that gained links: 50
    Total links inserted: 61
    AFTER: 19462 isolated artifacts
    Reduction: 51

    Links by type:

    • figure → analysis: derives_from via metadata.analysis_id (strength 0.95) — 50 figures linked
    • figure → wiki: mentions via entity_ids (strength 0.8) — some figures also gained wiki mentions

    Sample inserted links:

    • figure-311d9d1facc8 → analysis-sda-2026-04-01-gap-008 (derives_from, metadata.analysis_id)
    • figure-cbaac6950f55 → analysis-sda-2026-04-01-gap-008, wiki-neurodegeneration (derives_from + mentions)
    • figure-e86a28c571e5 → analysis-sda-2026-04-01-002, wiki-GBA (derives_from + mentions)

    Verification:

    SELECT COUNT(*) FROM artifacts a
    WHERE NOT EXISTS (
        SELECT 1 FROM artifact_links l
        WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
    )
    -- Result: 19462 (was 19513 before this run)

    Acceptance criteria status:

    ☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 61 total edges
    ☑ Each link derived from metadata.analysis_id, entity_ids, or related DB rows
    ☑ No low-confidence name-only guesses — all targets verified to exist before insert
    ☑ Before/after counts recorded — 19513 → 19462

    File: quest_engine_artifact_link_backfill_spec.md
    Modified: 2026-04-25 17:55
    Size: 18.1 KB