[Senate] Link isolated artifacts into the governance graph

Goal

Link isolated artifacts into the artifact governance graph. Artifacts without artifact_links cannot participate in provenance, lifecycle review, or discovery-dividend backpropagation.

Acceptance Criteria

☐ A concrete batch of isolated artifacts gains artifact_links edges or documented no-link rationale

☐ Each link is derived from entity_ids, parent_version_id, dependencies, provenance_chain, or related DB rows

☐ Low-confidence name-only guesses are not inserted

☐ Before/after isolated artifact counts are recorded

Approach

Select artifacts with no incoming or outgoing links, ordered by usage_score, citation_count, and recency.

Infer relationships from entity_ids, provenance_chain, dependencies, versions, analyses, or hypotheses.

Insert only high-confidence artifact_links through the standard DB path.

Verify governance graph connectivity counts and inspect a sample.
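
A minimal sketch of the selection step above, assuming the artifacts and artifact_links columns named elsewhere in this spec. SQLite stands in for the live PostgreSQL database so the example runs on its own; the link_type column, the demo rows, and the helper names select_isolated and demo are illustrative assumptions, not the real script's API.

```python
import sqlite3

# Isolation predicate from this spec's verification query, plus the ordering
# described in the approach. SQLite is only a stand-in for PostgreSQL, so the
# "NULLS LAST" modifiers used in the real ordering are omitted here.
ISOLATED_SQL = """
SELECT a.id FROM artifacts a
WHERE NOT EXISTS (
    SELECT 1 FROM artifact_links l
    WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
)
ORDER BY a.usage_score DESC, a.citation_count DESC, a.created_at DESC
LIMIT ?
"""


def select_isolated(conn, limit):
    """Return ids of artifacts that have no incoming or outgoing links."""
    return [row[0] for row in conn.execute(ISOLATED_SQL, (limit,))]


def demo():
    """Build a tiny in-memory graph and list its isolated artifacts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE artifacts (id TEXT PRIMARY KEY, usage_score REAL,
                                citation_count INTEGER, created_at TEXT);
        CREATE TABLE artifact_links (source_artifact_id TEXT,
                                     target_artifact_id TEXT,
                                     link_type TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO artifacts VALUES (?, ?, ?, ?)",
        [
            ("figure-a", 0.9, 2, "2026-04-01"),
            ("figure-b", 0.5, 0, "2026-04-02"),
            ("analysis-x", 0.7, 5, "2026-04-01"),
            ("wiki-y", 0.5, 1, "2026-04-03"),
        ],
    )
    # figure-a and analysis-x are connected, so neither is isolated.
    conn.execute(
        "INSERT INTO artifact_links VALUES ('figure-a', 'analysis-x', 'derives_from')"
    )
    return select_isolated(conn, 50)
```

Calling demo() returns the two unlinked artifacts, with the higher citation_count first because their usage_score values tie.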

Dependencies

58079891-7a5 - Senate quest

Dependents

Artifact lifecycle governance, provenance, and discovery dividends

Work Log

2026-04-21 - Quest engine template

Created reusable spec for quest-engine generated artifact link backfill tasks.

2026-04-21 20:30 UTC — Task ebdcb998 (slot 40)

Infrastructure blocker: The Bash tool is completely non-functional in this agent session.
Every shell command fails immediately with EROFS: read-only file system, mkdir '/home/ubuntu/Orchestra/data/claude_creds/max_outlook/session-env/9c56830e-4629-43e4-ab78-e0bffcf06cb4'.
The pre-exec harness hook cannot create the session-env directory because that path
lives on a read-only filesystem. Sub-agents spawned via the Agent tool have the same
issue. Python scripts, git commands, and orchestra CLI are all inaccessible.

Work completed despite blocker:

Read AGENTS.md, this spec, and artifact-governance.md to understand the system.

Analysed the artifacts, artifact_links, hypotheses, analyses, notebooks, and knowledge_edges table schemas.

Read scidex/atlas/artifact_registry.py, backfill/backfill_artifacts.py, scidex/core/database.py, and quest_engine.py (lines 1490–1544) to understand the query for counting isolated artifacts and how links should be inserted.

Wrote a complete, production-quality backfill script at scripts/backfill_isolated_artifact_links.py. The script:
- Counts isolated artifacts BEFORE (query matches quest_engine.py's isolation query)
- Processes the top-50 isolated artifacts ordered by quality_score DESC, created_at DESC
- Uses 12 high-confidence inference strategies (no name-only guesses):
a. parent_version_id → derives_from (strength 1.0)
b. provenance_chain JSON entries → typed links (strength 0.9)
c. metadata.analysis_id / metadata.source_analysis_id → derives_from (1.0)
d. metadata.hypothesis_id → mentions (0.9)
e. metadata.gap_id → mentions (0.85)
f. metadata.source_notebook_id → derives_from (0.9)
g. Cross-table: hypotheses.analysis_id → derives_from (1.0)
h. Cross-table: analyses.gap_id → extends (0.9)
i. Cross-table: notebooks.associated_analysis_id → derives_from (1.0)
j. Cross-table: knowledge_edges.analysis_id → derives_from (1.0)
k. Cross-table: hypothesis evidence for/against PMID → cites analysis (0.85)
l. entity_ids → wiki artifact mentions (0.8, only when wiki artifact confirmed)
- Uses PostgreSQL ON CONFLICT DO NOTHING for safe upserts
- Uses scidex.core.database.JournalContext for provenance tracking
- Counts isolated artifacts AFTER and prints a summary report
- Supports --dry-run and --limit N flags
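
A few of the strategies listed above can be sketched as a pure function. This is an illustration only: infer_links and keep_existing_targets are hypothetical names, the artifact is modelled as a plain dict, and the step that maps raw analysis, hypothesis, or gap IDs onto artifact IDs is left out. The link types and strengths follow items a, c, d, and e of the list.

```python
import json


def infer_links(artifact):
    """Return (target_id, link_type, strength) candidates for one artifact dict."""
    links = []

    # Strategy a: parent_version_id -> derives_from (strength 1.0).
    parent = artifact.get("parent_version_id")
    if parent:
        links.append((parent, "derives_from", 1.0))

    # Metadata may arrive as a JSON string or as an already-decoded dict.
    metadata = artifact.get("metadata") or {}
    if isinstance(metadata, str):
        metadata = json.loads(metadata)

    # Strategy c: analysis_id / source_analysis_id -> derives_from (1.0).
    analysis_id = metadata.get("analysis_id") or metadata.get("source_analysis_id")
    if analysis_id:
        links.append((analysis_id, "derives_from", 1.0))

    # Strategy d: hypothesis_id -> mentions (0.9).
    if metadata.get("hypothesis_id"):
        links.append((metadata["hypothesis_id"], "mentions", 0.9))

    # Strategy e: gap_id -> mentions (0.85).
    if metadata.get("gap_id"):
        links.append((metadata["gap_id"], "mentions", 0.85))

    return links


def keep_existing_targets(links, known_ids):
    """Drop candidates whose target is not a known ID (no guessed targets)."""
    return [link for link in links if link[0] in known_ids]
```

Filtering against a set of known IDs is what keeps name-only guesses out: a candidate survives only when its target matches an existing ID exactly.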

Could not execute the script (Bash blocked) or commit it (git blocked).

Script to run when bash is restored:

cd /home/ubuntu/scidex
python3 scripts/backfill_isolated_artifact_links.py --dry-run --limit 50
python3 scripts/backfill_isolated_artifact_links.py --limit 50

Before count: Unknown (could not query DB). Quest engine spawned this task because
count was > 0 at task creation time (2026-04-21T19:54:00Z).

Next steps for follow-on agent:

Verify session-env directory issue is resolved (try echo test in Bash)

cd to worktree or main repo

Run the backfill script (dry-run first, then real run)

Confirm before/after counts in summary output

Commit: [Senate] Backfill artifact_links for 50 isolated artifacts [task:ebdcb998-cfec-4280-ba56-12f0ff280bea]

Push branch and let supervisor auto-complete

2026-04-21 — Task ebdcb998 retry (slot 40, second attempt)

Infrastructure blocker persists — root cause identified:

Orchestra sets CLAUDE_CONFIG_DIR=/home/ubuntu/Orchestra/data/claude_creds/max_outlook/
in the subprocess environment (see orchestra/auth.py lines 918–926). Claude Code's
bridge REPL v2 (tengu_bridge_repl_v2: true) then attempts to create {CLAUDE_CONFIG_DIR}/session-env/<UUID>/ for shell-state persistence. This fails with EROFS: read-only file system because that path lives on a read-only mount.

Fix required (by human operator or supervisor):

Option A (preferred): mkdir -p /home/ubuntu/Orchestra/data/claude_creds/max_outlook/session-env and ensure the mount is writable, OR

Option B: Change config_dir for the max_outlook account in Orchestra's DB to a writable directory (e.g. /tmp/claude-max-outlook), OR

Option C: Disable the tengu_bridge_repl_v2 feature for automated workers by setting CLAUDE_DISABLE_BRIDGE_REPL=1 (if that env var is respected).

Work completed this session:

Confirmed scripts/backfill_isolated_artifact_links.py exists and is complete (written by prior agent slot 40, first attempt)

Confirmed Bash and Write tools both fail with EROFS on the session-env path
Traced root cause to CLAUDE_CONFIG_DIR pointing at read-only filesystem
Could not commit or run the script — waiting on infrastructure fix

2026-04-21 20:43 UTC — Task ebdcb998 completion

Reviewed prior related branch orchestra/task/98628b02-link-50-isolated-artifacts-into-the-gove; it had a separate verification-only spec update and was not present on origin/main for this task.
Confirmed live PostgreSQL schema uses artifacts.id and artifact_links.source_artifact_id / target_artifact_id; artifact_links has no natural unique constraint, so the backfill script checks duplicates explicitly before insert.
Added scripts/backfill_isolated_artifact_links.py to scan isolated artifacts by usage_score, citation_count, and created_at, infer only high-confidence links from metadata, entity IDs, provenance/dependencies, and related rows, and stop after 50 artifacts gain links.
Dry run: python3 scripts/backfill_isolated_artifact_links.py --dry-run --limit 50 --scan-limit 1000 scanned 443 isolated artifacts and found 50 linkable figure artifacts, with 61 candidate links.
Executed: python3 scripts/backfill_isolated_artifact_links.py --limit 50 --scan-limit 1000.
Before isolated count: 17,088
After isolated count: 17,035
Reduction: 53 isolated artifacts, because 50 source figures gained links and three previously isolated target wiki artifacts also became connected by inbound links.
Rows inserted: 61 artifact_links rows: 50 derives_from links from figure metadata analysis_id to existing analysis artifacts, plus 11 mentions links from direct entity_ids to existing wiki artifacts.
Sample verification queries confirmed:

- figure-7f7b14e2f8bc -> analysis-sda-2026-04-01-gap-008, derives_from, evidence metadata.analysis_id = sess_sda-2026-04-01-gap-008
- figure-31940a5cb4cd -> analysis-sda-2026-04-01-gap-20260401231108, derives_from, and wiki-neurodegeneration, mentions
- figure-2ef32bff5b51 -> analysis-sda-2026-04-01-gap-v2-68d9c9c1, wiki-TAU, and wiki-TFEB
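
Because artifact_links has no natural unique constraint, the explicit duplicate check noted in this entry might look roughly like the sketch below. SQLite stands in for PostgreSQL so the example is self-contained; the link_type and strength columns and the helper names insert_link_if_new and make_demo_conn are assumptions for illustration.

```python
import sqlite3


def insert_link_if_new(conn, source_id, target_id, link_type, strength):
    """Insert one artifact_links row unless an identical edge already exists."""
    exists = conn.execute(
        "SELECT 1 FROM artifact_links "
        "WHERE source_artifact_id = ? AND target_artifact_id = ? "
        "AND link_type = ? LIMIT 1",
        (source_id, target_id, link_type),
    ).fetchone()
    if exists:
        return False  # edge already present, nothing inserted
    conn.execute(
        "INSERT INTO artifact_links "
        "(source_artifact_id, target_artifact_id, link_type, strength) "
        "VALUES (?, ?, ?, ?)",
        (source_id, target_id, link_type, strength),
    )
    return True


def make_demo_conn():
    """Create an in-memory table shaped like the assumed artifact_links schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE artifact_links (source_artifact_id TEXT, "
        "target_artifact_id TEXT, link_type TEXT, strength REAL)"
    )
    return conn
```

Running the same insert twice leaves a single row, which is what makes a repeated backfill run safe without relying on ON CONFLICT.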

Acceptance Criteria Status — Task ebdcb998

☑ A concrete batch of isolated artifacts gains artifact_links edges — 50 artifacts gained 61 links.

☑ Each link is derived from entity_ids, parent_version_id, dependencies, provenance_chain, or related DB rows — this batch used metadata.analysis_id and direct entity_ids only.

☑ Low-confidence name-only guesses are not inserted — the script only inserts links when the target artifact ID exists exactly.

☑ Before/after isolated artifact counts are recorded — 17,088 before, 17,035 after.

2026-04-21 20:49 UTC — Task ebdcb998 live run (slot 71)

Script executed against live PostgreSQL. No infrastructure blockers this run.

Execution results:

BEFORE: 17035 isolated artifacts
Scanned: 444 isolated artifacts (scan-limit=500)
Artifacts that gained links: 50
Total links inserted: 68
AFTER: 16985 isolated artifacts
Reduction: 50 (artifacts now connected to governance graph)

Links by type:

derives_from (metadata.analysis_id, provenance_chain, parent_version_id): majority
mentions (entity_ids → wiki artifacts): 18 links (TREM2, TYROBP, microglia, neurodegeneration)

Sample inserted links:

figure-2eeef7deaf70 → analysis-SDA-2026-04-01-gap-001 (derives_from, strength 1.0, metadata.analysis_id)
figure-62c5cb7b0edc → wiki-TREM2, wiki-TYROBP (mentions, strength 0.8, entity_ids)

Verification query:

SELECT COUNT(*) FROM artifacts a
WHERE NOT EXISTS (SELECT 1 FROM artifact_links l
  WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id)
-- Result: 16985 (was 17035 before this run)

Acceptance criteria status:

☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 68 total edges

☑ Each link derived from entity_ids, parent_version_id, dependencies, provenance_chain, or related DB rows

☑ No low-confidence name-only guesses — all targets verified to exist before insert

☑ Before/after counts recorded — 17035 → 16985

2026-04-21 14:50 UTC — Task fde80239 (slot 73)

Bug fixed: _infer_from_paper_citations used LIKE on JSONB columns evidence_for and evidence_against, which fails silently with operator does not exist: jsonb ~~ unknown. Fixed by casting to text: evidence_for::text LIKE %s.
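
The cast described above can be illustrated with a small query builder. This is a sketch, not the script's actual code: build_pmid_evidence_query is a hypothetical helper, the selected column and the %pmid% wildcard pattern are assumptions, and the %s placeholders follow the style quoted in the fix.

```python
def build_pmid_evidence_query(pmid):
    """Return (sql, params) matching hypotheses whose evidence mentions a PMID.

    LIKE is not defined for jsonb operands in PostgreSQL, so both JSONB
    columns are cast to text before pattern matching.
    """
    pattern = f"%{pmid}%"  # assumed substring match on the serialized JSON
    sql = (
        "SELECT analysis_id FROM hypotheses "
        "WHERE evidence_for::text LIKE %s "
        "OR evidence_against::text LIKE %s"
    )
    return sql, (pattern, pattern)
```

Passing the pattern as a bound parameter, rather than formatting it into the SQL string, keeps the wildcard characters out of the statement text.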

Execution results:

BEFORE: 17101 isolated artifacts
Scanned: 561 isolated artifacts (scan-limit=2000)
Artifacts that gained links: 50
Total links inserted: 70
AFTER: 17050 isolated artifacts
Reduction: 51

Note: 116 new isolated artifacts were added since ebdcb998 ran (which ended at 16985). This run's 50-artifact batch offsets 51 of them, a little under half of that drift, and advances graph connectivity.

Links by type:

derives_from (metadata.analysis_id): majority
mentions (entity_ids → wiki artifacts): TREM2, TYROBP, neurodegeneration, PI3K, TFEB, APOE

Sample inserted links:

figure-c29e2fec5b3e → analysis-SDA-2026-04-01-gap-001 (derives_from, strength 1.0, metadata.analysis_id)
figure-c29e2fec5b3e → wiki-neurodegeneration, wiki-TREM2 (mentions, strength 0.8, entity_ids)

Acceptance criteria status:

☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 70 total edges

☑ Each link derived from metadata.analysis_id, entity_ids, provenance_chain, parent_version_id

☑ No low-confidence name-only guesses — all targets verified to exist before insert

☑ Before/after counts recorded — 17101 → 17050

2026-04-22 14:15 UTC — Task e6e84211 (slot 73)

Task: Link 40 isolated artifacts into provenance governance graph.

Key findings:

Top isolated artifacts by created_at DESC are rigor_score_cards and paper_figures with UUID-style PMIDs — most have no linkable targets
rigor_score_cards: scored_entity_id (hypothesis ID) exists in hypotheses table, but hypothesis artifact often doesn't exist; can link via hypothesis→analysis chain
paper_figures: UUID PMIDs don't map to paper artifacts; among figures with numeric PMIDs, only 6% have a paper artifact target
Figures (12K+ isolated) are linkable via metadata.analysis_id but appear after 3K+ paper_figures in created_at ordering

New script: scripts/backfill_task_e6e84211.py

Uses created_at DESC ordering (matching task query)
Adds new strategies:

- rigor_score_card: scored_entity_id → hypothesis table → analysis_id → derives_from to analysis artifact (strength 1.0)
- paper_figure: pmid → paper artifact lookup (cites, strength 0.9)

Fixes case-sensitivity bug in _artifact_candidates for sda-/SDA- analysis IDs
Scans up to 10,000 isolated artifacts to find 40 linkable ones
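
The rigor_score_card chain above can be sketched with in-memory lookups. Everything here is illustrative: link_rigor_score_card is a hypothetical name, the dict and set stand in for the hypotheses table and the artifacts table, and the analysis-<id> artifact ID format is inferred from the sample links in this log.

```python
def link_rigor_score_card(scored_entity_id, hypothesis_to_analysis, artifact_ids):
    """Follow scored_entity_id -> hypothesis -> analysis_id -> analysis artifact.

    Returns (target_id, link_type, strength), or None when any hop is missing.
    """
    analysis_id = hypothesis_to_analysis.get(scored_entity_id)
    if analysis_id is None:
        return None  # hypothesis row missing or has no analysis_id
    target = f"analysis-{analysis_id}"  # assumed artifact ID format
    if target not in artifact_ids:
        return None  # never link to an artifact that does not exist
    return (target, "derives_from", 1.0)
```

Returning None on any broken hop matches the rule that a link is inserted only when the target artifact is confirmed to exist.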

Execution results:

BEFORE: 19538 isolated artifacts
Scanned: 3005 isolated artifacts
Artifacts that gained links: 40
Total links inserted: 41
AFTER: 19497 isolated artifacts
Reduction: 41

Links by type:

rigor_score_card → analysis: 6 links (via scored_entity→hypothesis→analysis chain)
paper_figure → paper: 28 links (via numeric PMID matching paper artifact)
figure → analysis/wiki: 7 links (via metadata.analysis_id and entity_ids)

Verification:

-- Isolated count after run: 19497 (was 19538)
SELECT COUNT(*) FROM artifacts a
WHERE NOT EXISTS (
    SELECT 1 FROM artifact_links l
    WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
)
-- Result: 19497

Acceptance criteria status:

☑ 40 isolated artifacts gain artifact_links edges — 40 gained links, 41 total edges

☑ Each link derived from scored_entity_id, pmid, metadata.analysis_id, entity_ids

☑ No low-confidence name-only guesses — all targets verified to exist before insert

☑ Before/after counts recorded — 19538 → 19497

2026-04-22 18:07 UTC — Task 3a5b980b (slot 70)

Problem identified: The backfill script's global ordering (usage_score DESC NULLS LAST) placed thousands of isolated paper_figures, all with a usage_score of 0.5, at the top, so the scan worked through them before reaching linkable figure artifacts. paper_figures with UUID PMIDs can't be linked to paper artifacts.

Fixes applied to scripts/backfill_isolated_artifact_links.py:

Replaced global ordering with per-type processing: process artifacts by type in priority order (figure, notebook, analysis, hypothesis first — then paper_figure/wiki_page last).

Added a case-insensitivity fix in _artifact_candidates for the sess_SDA- prefix: stripping sess_ leaves an uppercase SDA- prefix, so the function now tries the lowercased sda- form and also adds the uppercase variant.

Each type is scanned in its own query with usage_score DESC NULLS LAST, citation_count DESC NULLS LAST, created_at DESC NULLS LAST ordering, limiting to --scan-limit per type.
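
The case-variant fix described above could be sketched as follows. This is not the real _artifact_candidates; analysis_artifact_candidates is a hypothetical stand-in, and the analysis-<id> artifact ID format is an assumption based on the sample links in this log.

```python
def analysis_artifact_candidates(analysis_id):
    """Return candidate analysis artifact IDs for a metadata.analysis_id value.

    Strips a leading "sess_" and, for SDA-style IDs, tries both the lowercase
    and uppercase prefix so that a case difference does not hide a match.
    """
    raw = analysis_id.strip()
    if raw.startswith("sess_"):
        raw = raw[len("sess_"):]

    variants = [raw]
    if raw.lower().startswith("sda-"):
        rest = raw[len("sda-"):]
        variants += ["sda-" + rest, "SDA-" + rest]

    # De-duplicate while preserving order; the original casing is tried first.
    candidates = []
    for variant in variants:
        candidate = f"analysis-{variant}"
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates
```

Each candidate would still need to be checked against existing artifact IDs before a link is inserted, so the extra variants widen the lookup without lowering confidence.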

Execution results:

BEFORE: 19631 isolated artifacts
Scanned: 96 isolated figure artifacts
Artifacts that gained links: 50
Total links inserted: 67
AFTER: 19581 isolated artifacts
Reduction: 50

Links by type:

figure → analysis: derives_from via metadata.analysis_id (strength 1.0 via general metadata handling + 0.95 via figure-specific handling)
figure → wiki: mentions via entity_ids (strength 0.8)
Sample: figure-d8b07236d415 → analysis-analysis-SEAAD-20260402
Sample: figure-27be44fcaf91 → analysis-SDA-2026-04-16-gap-pubmed-20260411-082446-2c1c9e2d + wiki-neurodegeneration + wiki-TREM2

Verification:

SELECT COUNT(*) FROM artifacts a
WHERE NOT EXISTS (
    SELECT 1 FROM artifact_links l
    WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
)
-- Result: 19581 (was 19631 before this run)

Acceptance criteria status:

☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 67 total edges

☑ Each link derived from metadata.analysis_id, entity_ids

☑ No low-confidence name-only guesses — all targets verified to exist before insert

☑ Before/after counts recorded — 19631 → 19581

2026-04-22 15:52 UTC — Task 0fd31858 (slot 76)

Task: Link 50 isolated artifacts into the governance graph.

Problem identified: The backfill script's _infer_from_metadata only looked for paper_id in metadata, but paper_figure artifacts store the PMID under the key pmid (not paper_id). Additionally, figure artifacts (15K+ isolated) were not being linked despite having metadata.analysis_id that maps to existing analysis artifacts.

Fixes applied to scripts/backfill_isolated_artifact_links.py:

Added pmid handling in _infer_from_metadata: when pmid is a numeric string (not a UUID), link paper_figure to paper-{pmid} artifact with cites link type (strength 0.9).

Added figure artifact type handling in _infer_from_metadata: link figure artifacts to analysis via metadata.analysis_id with derives_from link type (strength 0.95).

Increased default --scan-limit from 500 to 5000 because top-scored isolated artifacts are mostly unlinkable paper_figures with UUID PMIDs; need to scan deeper to find linkable figures and notebooks.
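
The pmid handling described above can be sketched as a small helper. paper_link_from_pmid is a hypothetical name, and metadata is modelled as a plain dict; the paper-{pmid} target, the cites link type, and the 0.9 strength come from the fix description.

```python
def paper_link_from_pmid(metadata):
    """Return a cites link for a numeric PMID, or None for UUIDs and blanks."""
    pmid = str(metadata.get("pmid") or "").strip()
    # Only plain ASCII digit strings count as PMIDs; UUID-style values are skipped.
    if not (pmid.isascii() and pmid.isdigit()):
        return None
    return (f"paper-{pmid}", "cites", 0.9)
```

The returned target would still have to exist as a paper artifact before the link is inserted, since only a small share of numeric-PMID figures have one.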

Execution results:

BEFORE: 19513 isolated artifacts
Scanned: 3072 isolated artifacts (scan-limit=5000)
Artifacts that gained links: 50
Total links inserted: 61
AFTER: 19462 isolated artifacts
Reduction: 51

Links by type:

figure → analysis: derives_from via metadata.analysis_id (strength 0.95) — 50 figures linked
figure → wiki: mentions via entity_ids (strength 0.8) — some figures also gained wiki mentions

Sample inserted links:

figure-311d9d1facc8 → analysis-sda-2026-04-01-gap-008 (derives_from, metadata.analysis_id)
figure-cbaac6950f55 → analysis-sda-2026-04-01-gap-008, wiki-neurodegeneration (derives_from + mentions)
figure-e86a28c571e5 → analysis-sda-2026-04-01-002, wiki-GBA (derives_from + mentions)

Verification:

SELECT COUNT(*) FROM artifacts a
WHERE NOT EXISTS (
    SELECT 1 FROM artifact_links l
    WHERE l.source_artifact_id = a.id OR l.target_artifact_id = a.id
)
-- Result: 19462 (was 19513 before this run)

Acceptance criteria status:

☑ 50 isolated artifacts gain artifact_links edges — 50 gained links, 61 total edges

☑ Each link derived from metadata.analysis_id, entity_ids, or related DB rows

☑ No low-confidence name-only guesses — all targets verified to exist before insert

☑ Before/after counts recorded — 19513 → 19462
