[Atlas] Expand knowledge graph — add edges from PubMed abstracts for top entities
Quest: Atlas
Priority: P90
Status: running
Goal
Expand knowledge graph — add edges from PubMed abstracts for top entities
Context
This task is part of the Atlas quest (Atlas layer). It contributes to the broader goal of building out SciDEX's atlas capabilities.
Acceptance Criteria
☐ Implementation complete and tested
☐ All affected pages load (200 status)
☐ Work visible on the website frontend
☐ No broken links introduced
☐ Code follows existing patterns
Approach
Identify top 50 real gene/protein entities by KG degree (filtering noise like "AND", "GENES")
Use NLP pattern matching to extract edges from 16K+ existing paper abstracts
Fetch fresh PubMed papers for top 30 entities via E-utilities
Deduplicate against 708K existing edges (case-insensitive)
Insert with evidence (PMID, excerpt) and edge_type pubmed_abstract_top50_v4
Work Log
2026-04-25 — v6 extraction completed
- Identified top 50 real gene entities: APP (6,437), AKT (5,993), MTOR (5,936), TAU (5,813), TNF (5,675), BDNF (5,555), APOE (4,952), NLRP3 (4,898), AMPK (4,651), SQSTM1 (3,882), TREM2 (3,707), and 38 more
- Processed 16,802 papers with abstracts; filtered to those mentioning top entities
- Extracted 110,776 raw edges → 24,340 unique → 8,518 new (not in existing KG)
- 17 relation types, including: activates (1,048), expresses (828), co_discussed (802), regulates (762), enhances (760), associated_with (549), causes (546), inhibits (528), interacts_with (483), markers (472), mediates (425), targets (368)
- Target type distribution: gene (5,613), cell_type (1,216), pathway (1,118), disease (571)
- All edges carry PMID evidence in evidence_sources JSON; strength 0.6-0.9 based on co-occurrence count
- Edge_type: pubmed_abstract_top50_v6
- KG grew from 717,199 to 725,717 edges (+8,518)
- Script: extract_kg_top50_v6.py (461 lines)
- Commit: 9cfca1074
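
A minimal sketch of how the extract, deduplicate, and score steps logged above could fit together. It is not taken from extract_kg_top50_v6.py: the regex patterns, the function names, the row layout, the stepwise 0.6 to 0.9 strength mapping, and the toy PMIDs and abstracts are all assumptions made for illustration.

```python
import json
import re

# Hypothetical relation patterns; the real script's patterns are not shown in this log.
RELATION_PATTERNS = {
    "activates": re.compile(r"\b(\w+) activates (\w+)\b", re.IGNORECASE),
    "inhibits": re.compile(r"\b(\w+) inhibits (\w+)\b", re.IGNORECASE),
    "regulates": re.compile(r"\b(\w+) regulates (\w+)\b", re.IGNORECASE),
}


def extract_raw_edges(pmid, abstract, top_entities):
    """Yield (source, relation, target, pmid, excerpt) where the subject is a top entity."""
    top = {name.upper() for name in top_entities}
    for relation, pattern in RELATION_PATTERNS.items():
        for match in pattern.finditer(abstract):
            source, target = match.group(1), match.group(2)
            if source.upper() in top:
                yield source.upper(), relation, target, pmid, match.group(0)


def edge_key(source, relation, target):
    """Case-insensitive identity used for deduplication."""
    return (source.casefold(), relation.casefold(), target.casefold())


def collapse(raw_edges):
    """Group raw edges by case-insensitive key, keeping one excerpt per PMID."""
    grouped = {}
    for source, relation, target, pmid, excerpt in raw_edges:
        entry = grouped.setdefault(
            edge_key(source, relation, target),
            {"source": source, "relation": relation, "target": target, "evidence": {}},
        )
        entry["evidence"].setdefault(pmid, excerpt)
    return grouped


def strength(cooccurrence_count):
    """Map a co-occurrence count to 0.6-0.9. The step size is an assumption."""
    return (0.6, 0.7, 0.8, 0.9)[min(max(cooccurrence_count, 1), 4) - 1]


def new_edges(grouped, existing_keys, edge_type):
    """Drop edges already in the KG and build insertable rows with PMID evidence."""
    rows = []
    for key, edge in grouped.items():
        if key in existing_keys:
            continue
        evidence = [
            {"pmid": pmid, "excerpt": excerpt}
            for pmid, excerpt in sorted(edge["evidence"].items())
        ]
        rows.append({
            "source": edge["source"],
            "relation": edge["relation"],
            "target": edge["target"],
            "edge_type": edge_type,
            "strength": strength(len(evidence)),
            "evidence_sources": json.dumps(evidence),
        })
    return rows


# Toy data: made-up PMIDs and abstracts, not real papers.
abstracts = {
    "111": "We show that TREM2 activates SYK in microglia.",
    "222": "In this model, Trem2 activates Syk, and MTOR inhibits autophagy.",
}
existing = {edge_key("mtor", "inhibits", "Autophagy")}  # pretend this edge is already in the KG

raw = [
    edge
    for pmid, text in abstracts.items()
    for edge in extract_raw_edges(pmid, text, ["TREM2", "MTOR"])
]
rows = new_edges(collapse(raw), existing, "pubmed_abstract_top50_v6")
print(len(raw), "raw ->", len(collapse(raw)), "unique ->", len(rows), "new")
```

The toy run mirrors the raw → unique → new funnel in the log on a tiny scale: two mentions of the same TREM2 edge collapse into one row with two PMIDs, and the MTOR edge is dropped because a differently cased copy already exists.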