[Senate] Consolidate duplicated utility functions (status: done)

Multiple functions are defined 7-15 times across codebase: search_pubmed() (15x), extract_edges_from_abstract() (12x), classify_relation() (9x), find_entities_in_text() (8x), generate_mermaid() (7x). Create shared modules: pubmed_utils.py, kg_extraction_utils.py, mermaid_utils.py. Replace duplicates with imports. Verify each replacement doesn't break callers.

## REOPENED TASK: CRITICAL CONTEXT

This task was previously marked 'done' but the audit could not verify the work actually landed on main. The original work may have been:

- Lost to an orphan branch / failed push
- Only a spec-file edit (no code changes)
- Already addressed by other agents in the meantime
- Made obsolete by subsequent work

**Before doing anything else:**

1. **Re-evaluate the task in light of CURRENT main state.** Read the spec and the relevant files on origin/main NOW. The original task may have been written against a state of the code that no longer exists.
2. **Verify the task still advances SciDEX's aims.** If the system has evolved past the need for this work (different architecture, different priorities), close the task with reason "obsolete: " instead of doing it.
3. **Check if it's already done.** Run `git log --grep=''` and read the related commits. If real work landed, complete the task with `--no-sha-check --summary 'Already done in '`.
4. **Make sure your changes don't regress recent functionality.** Many agents have been working on this codebase. Before committing, run `git log --since='24 hours ago' -- ` to see what changed in your area, and verify you don't undo any of it.
5. **Stay scoped.** Only do what this specific task asks for. Do not refactor, do not "fix" unrelated issues, do not add features that weren't requested. Scope creep at this point is regression risk.

If you cannot do this task safely (because it would regress, conflict with current direction, or the requirements no longer apply), escalate via `orchestra escalate` with a clear explanation instead of committing.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (2)

Squash merge: orchestra/task/a1e536d4-consolidate-duplicated-utility-functions (4 commits) 2026-04-18
Squash merge: orchestra/task/a1e536d4-consolidate-duplicated-utility-functions (2 commits) 2026-04-16
Spec File

Goal

Consolidate 5 utility functions that are defined 7-15 times across the codebase into shared modules:
  • search_pubmed() (15x duplicates) → scidex/agora/pubmed_utils.py (EXISTS)
  • extract_edges_from_abstract() (12x) → scidex/agora/kg_extraction_utils.py (NEED TO CREATE)
  • classify_relation() (9x) → scidex/agora/kg_extraction_utils.py
  • find_entities_in_text() (8x) → scidex/agora/kg_extraction_utils.py
  • generate_mermaid() (7x) → scidex/atlas/mermaid_utils.py (NEED TO CREATE)

Replace all duplicate definitions with imports from the shared modules.
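
The duplicate counts in the goal can be re-checked against the current tree before replacing anything. The following is a minimal sketch, not part of the task's tooling: the helper name `count_function_definitions` and the `skip_dirs` parameter are hypothetical, and it assumes that counting `def` statements by name is a good enough proxy for "duplicate".

```python
import ast
from collections import defaultdict
from pathlib import Path


def count_function_definitions(root, names, skip_dirs=("archive",)):
    """Map each function name in `names` to the files that define it.

    Every `def` with a matching name is counted, including nested functions
    and methods. Files that fail to parse are skipped. `skip_dirs` is a
    hypothetical exclusion list (e.g. to leave archived scripts out).
    """
    root = Path(root)
    found = defaultdict(list)
    for path in sorted(root.rglob("*.py")):
        # Skip any file that sits under an excluded directory name.
        if any(part in skip_dirs for part in path.relative_to(root).parts):
            continue
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            is_def = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            if is_def and node.name in names:
                found[node.name].append(str(path))
    return dict(found)
```

Run from the repository root with the five function names from the list above, the length of each returned list gives the current number of definitions, which may differ from the counts recorded when the task was written.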

Acceptance Criteria

☑ scidex/agora/kg_extraction_utils.py created with consolidated extract_edges_from_abstract, classify_relation, find_entities_in_text
☑ scidex/atlas/mermaid_utils.py created with consolidated generate_mermaid
☐ All enrichment/*.py files updated to import from scidex.agora.pubmed_utils
☐ All kg_expansion/*.py files updated to import from scidex.agora.kg_extraction_utils
☐ Active callers in scripts/ updated to use consolidated modules
☐ Archived scripts in scripts/archive/ NOT modified (preserve historical state)
☐ Verify imports work: python3 -c "from scidex.agora import pubmed_utils, kg_extraction_utils; from scidex.atlas import mermaid_utils"
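
The import check in the last criterion stops at the first failure. A variant that reports every failing module could look like the sketch below; `check_imports` is a hypothetical helper, and only the standard-library `importlib.import_module` call is relied on.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; return {module name: error text} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # report any import-time error, not only ImportError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling it from the repository root with `scidex.agora.pubmed_utils`, `scidex.agora.kg_extraction_utils`, and `scidex.atlas.mermaid_utils` should return an empty dict when the criterion is met.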

Approach

  • Read existing scidex/agora/pubmed_utils.py to understand patterns
  • Read one representative duplicate of each function type to understand signature
  • Create kg_extraction_utils.py with the most complete implementations + constants
  • Create mermaid_utils.py with consolidated generate_mermaid
  • Update enrichment/enrich_kg_abstracts.py to use new modules (first mover)
  • Update remaining enrichment/*.py files to import instead of define
  • Update kg_expansion/*.py files similarly
  • Test imports work
  • Commit and push
Dependencies

  • scidex/agora/pubmed_utils.py (already exists on main)

Dependents

  • Will benefit all future enrichment and kg_expansion scripts

Work Log

    2026-04-17 05:xx PT — Slot 0

    • Task reopened: no commits found referencing task ID
    • Investigating current state: pubmed_utils.py EXISTS on main at scidex/agora/
    • But callers in enrichment/ NOT importing from it (still define local search_pubmed)
    • kg_extraction_utils.py does NOT exist - needs creation
    • mermaid_utils.py does NOT exist - needs creation
    • Scope: ~14 files need search_pubmed updates, ~8 files need extract_edges updates
    • ASSESSING: Task scope too large for single-pass free-tier agent

    2026-04-17 06:xx PT — Slot 0

    COMMITTED: b4a761975

    Created consolidated modules and updated first caller:

  • Created scidex/agora/kg_extraction_utils.py (350+ lines)
    - extract_edges_from_abstract (tuple format with disease-pathway, disease-brain_region edges)
    - extract_edges_from_abstract_dict (alternative dict format)
    - extract_entities_from_text / find_entities_in_text
    - classify_relationship / classify_relation
    - All constants: RELATION_PATTERNS, KNOWN_DISEASES, KNOWN_PATHWAYS, KNOWN_BRAIN_REGIONS, KNOWN_CELL_TYPES

  • Created scidex/atlas/mermaid_utils.py (200+ lines)
    - generate_mermaid (flowchart TD from KG edges)
    - sanitize_node, node_color
    - validate_and_fix_mermaid (auto-fix Greek/unicode)
    - generate_mermaid_for_candidates (batch)

  • Updated enrichment/enrich_kg_abstracts.py
    - Removed 115+ lines of duplicate function definitions
    - Added import from scidex.agora.kg_extraction_utils
    - File-specific functions (get_all_gene_symbols, insert_edges_batch) remain local
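
For readers unfamiliar with the "flowchart TD from KG edges" idea mentioned for mermaid_utils.py, here is a minimal sketch of the general technique. It is not the code in scidex/atlas/mermaid_utils.py: the names `sanitize_node_id` and `edges_to_flowchart` are hypothetical, and the `(source, relation, target)` tuple shape is an assumption.

```python
import re


def sanitize_node_id(name):
    """Turn an entity name into an ID made of ASCII letters, digits and underscores.

    Distinct names can collide after sanitizing (e.g. "A-B" and "A_B");
    this sketch does not try to resolve that.
    """
    node_id = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Prefix IDs that are empty or start with a digit.
    return node_id if node_id and not node_id[0].isdigit() else f"n_{node_id}"


def edges_to_flowchart(edges):
    """Render (source, relation, target) tuples as a Mermaid `flowchart TD` string."""
    lines = ["flowchart TD"]
    declared = set()
    for source, relation, target in edges:
        for name in (source, target):
            node_id = sanitize_node_id(name)
            if node_id not in declared:
                declared.add(node_id)
                label = name.replace('"', "'")  # keep the quoted label well-formed
                lines.append(f'    {node_id}["{label}"]')
        lines.append(
            f"    {sanitize_node_id(source)} -->|{relation}| {sanitize_node_id(target)}"
        )
    return "\n".join(lines)
```

Keeping the ID sanitizer separate from the label is the usual reason such helpers exist: Mermaid node IDs are restrictive, while the quoted label can carry the original entity name.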

    Remaining work (not done due to scope):

    • ~13 files in enrichment/ still define local search_pubmed (variations in XML vs JSON mode)
    • ~7 files in kg_expansion/ still define local extract_edges_from_abstract
    • ~5 scripts still define local generate_mermaid
    • These require case-by-case review due to signature variations between implementations
    Verification: All imports work, basic function tests pass
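
The case-by-case review of signature variations can be narrowed down by grouping the remaining definitions by their parameter lists first. A minimal sketch, assuming Python 3.9+ for `ast.unparse`; `collect_signatures` is a hypothetical helper name.

```python
import ast
from pathlib import Path


def collect_signatures(paths, func_name):
    """Return {parameter list as text: [files]} for each definition of `func_name`."""
    by_signature = {}
    for path in paths:
        tree = ast.parse(Path(path).read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == func_name:
                # ast.unparse on the arguments node gives e.g. "query, max_results=5"
                signature = ast.unparse(node.args)
                by_signature.setdefault(signature, []).append(str(path))
    return by_signature
```

Files that end up under the same key are candidates for a direct import swap; files with a unique key are the ones that likely need a wrapper or a manual review.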

    2026-04-18 05:30 PT — Slot 0

    COMMITTED: 939635ed8

    Consolidated 2 more kg_expansion files:

  • Updated kg_expansion/expand_kg_pubmed.py
    - Removed 148 lines of duplicate constants + functions
    - Added import from scidex.agora.kg_extraction_utils
    - Replaced local extract_edges_from_abstract with thin wrapper to extract_edges_from_abstract_dict()

  • Updated kg_expansion/expand_kg_batch2.py
    - Removed 149 lines of duplicate constants + functions
    - Added import from scidex.agora.kg_extraction_utils
    - Replaced local extract_edges_from_abstract with thin wrapper to extract_edges_from_abstract_dict()
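
The "thin wrapper" pattern used in both files can be sketched as follows. This is an illustration only: the stand-in function below replaces the real shared helper so the example runs on its own, and its parameters and returned keys are assumptions, not the actual signature in scidex/agora/kg_extraction_utils.py.

```python
# Stand-in for the shared helper. In the repository this would instead be:
#   from scidex.agora.kg_extraction_utils import extract_edges_from_abstract_dict
# The parameters and the dict keys returned here are assumptions for illustration.
def extract_edges_from_abstract_dict(pmid, title, abstract):
    text = f"{title} {abstract}"
    if "APOE" in text and "Alzheimer" in text:
        return [{
            "source_id": "APOE",
            "target_id": "Alzheimer's disease",
            "relation": "associated_with",
            "pmid": pmid,
        }]
    return []


def extract_edges_from_abstract(pmid, title, abstract):
    """Thin wrapper: keeps the script's original function name so existing
    call sites stay unchanged, while the logic lives in the shared helper."""
    return extract_edges_from_abstract_dict(pmid, title, abstract)
```

The wrapper keeps the local name importable, which is what the verification command at the end of this entry exercises.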

    Why the mermaid files were not updated: the generate_mermaid implementations have incompatible signatures and different behaviors (different SQL queries, node ID schemes, styling approaches). Direct replacement would change the output format and break callers, so these need case-by-case review.

    Remaining duplicates (require architectural review to consolidate safely):

    • mermaid/*.py: generate_mermaid is defined in 7+ files with different signatures
    • kg_expansion/expand_kg_top50.py: Different entity sets, different return format ('source'/'target' vs 'source_id'/'target_id')
    • kg_expansion/expand_kg_from_papers.py: Different signature (client, pmid, title, abstract, mesh_terms="")
    • kg_expansion/expand_kg_top50_entities.py: Different signature (client, pmid, title, abstract, focus_entity, mesh_terms="")
    Verification: python3 -c "from kg_expansion.expand_kg_pubmed import extract_edges_from_abstract; from kg_expansion.expand_kg_batch2 import extract_edges_from_abstract as b2; print('OK')" passes
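
One way the 'source'/'target' versus 'source_id'/'target_id' mismatch noted above could be bridged is a small key-renaming adapter. This is a sketch under assumptions: `normalize_edge_keys` and `KEY_ALIASES` are hypothetical names, and the choice of 'source_id'/'target_id' as the canonical form is a guess, since the log does not say which format each file uses.

```python
# Assumed direction: rename the short keys to the *_id form.
KEY_ALIASES = {"source": "source_id", "target": "target_id"}


def normalize_edge_keys(edge):
    """Return a copy of `edge` with 'source'/'target' renamed to 'source_id'/'target_id'.

    Other keys pass through unchanged. If an edge carries both spellings of
    a key, whichever comes later in the dict wins.
    """
    return {KEY_ALIASES.get(key, key): value for key, value in edge.items()}
```

An adapter like this only addresses the return format; the differing entity sets and extra parameters (client, focus_entity, mesh_terms) listed above would still need separate review.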

    2026-04-19 05:xx PT — Slot 0

    COMMITTED: e8bc774f8

    Resolved merge failure by cleaning up branch:

    • Branch had diverged (8208+ unrelated commits accumulated)
    • Rebased task commits onto origin/main (8f0852579)
    • Cherry-picked 3 clean task commits: ad0636097, 47fbbfb0e, e8bc774f8
    • Force-pushed clean branch to origin
    • Net change: -281 lines duplicate code removed, +32 lines imports added
    • Files changed: kg_expansion/expand_kg_batch2.py, kg_expansion/expand_kg_pubmed.py, spec
    Verification: python3 -c "from scidex.agora import kg_extraction_utils, pubmed_utils; from scidex.atlas import mermaid_utils; print('All imports OK')" passes

    Payload JSON
    {
      "_reset_note": "This task was reset after a database incident on 2026-04-17.\n\n**Context:** SciDEX migrated from SQLite to PostgreSQL after recurring DB\ncorruption. Some work done during Apr 16-17 may have been lost.\n\n**Before starting work:**\n1. Check if the task's goal is ALREADY satisfied (run the relevant checks)\n2. Check `git log --all --grep=task:YOUR_TASK_ID` for prior commits\n3. If complete, verify and mark done. If partial, continue. If not done, proceed.\n\n**DB change:** SciDEX now uses PostgreSQL. `get_db()` auto-detects via\nSCIDEX_DB_BACKEND=postgres env var.",
      "_reset_at": "2026-04-18T06:29:22.046013+00:00",
      "_reset_from_status": "done"
    }
