[Artifacts] External dataset references: track datasets like we track papers
Goal
Just as SciDEX tracks papers by PubMed ID without hosting the full text, we need to track external datasets by their repository IDs without hosting the data. This enables analyses to reference specific datasets, link them to hypotheses, and track when datasets are updated upstream.
Design Principles
- Reference, don't replicate: Store metadata and URLs, not the actual data
- Mirror the paper model: register_dataset() should feel like register_paper()
- Lightweight: Don't become HuggingFace — just enough metadata to know what the dataset contains and how to find it
- Linkable: Datasets connect to KG entities, hypotheses, and analyses via artifact_links
Supported Sources (initial)
| Source | ID Format | Metadata API | Typical Content |
|---|---|---|---|
| GEO | GSE123456 | NCBI E-utils | Gene expression arrays |
| Allen Brain | allen:SEA-AD | brain-map.org API | Cell atlas data |
| Zenodo | 10.5281/zenodo.XXX | Zenodo API | General scientific data |
| Figshare | 10.6084/m9.figshare.XXX | Figshare API | Supplementary data |
| UniProt | UP000005640 | UniProt API | Proteome datasets |
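The ID formats in the table can be checked before registration. The sketch below is one possible validator; the lowercase source keys, the regular expressions, and the function name `is_valid_external_id` are assumptions inferred from the table, not part of the existing registry, and the real repositories may accept more variants.

```python
import re

# Hypothetical patterns inferred from the "ID Format" column above.
ID_PATTERNS = {
    "geo": re.compile(r"GSE\d+"),
    "allen": re.compile(r"allen:[A-Za-z0-9_-]+"),
    "zenodo": re.compile(r"10\.5281/zenodo\.\d+"),
    "figshare": re.compile(r"10\.6084/m9\.figshare\.\d+(\.v\d+)?"),
    "uniprot": re.compile(r"UP\d{9}"),
}


def is_valid_external_id(source: str, external_id: str) -> bool:
    """Return True if external_id has the expected shape for source."""
    pattern = ID_PATTERNS.get(source.lower())
    if pattern is None:
        return False  # unknown source: reject rather than guess
    return pattern.fullmatch(external_id) is not None
```

Rejecting unknown sources keeps malformed IDs out of the canonical artifact ID, at the cost of requiring a new pattern for every source that is added later.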
Implementation
register_dataset(source, external_id, **kwargs) → artifact_id
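from datetime import datetime, timezone

def register_dataset(source, external_id, url=None, title=None,
                     description=None, license=None, schema_summary=None,
                     row_count=None, format=None):
    # Canonical ID: the same source + external_id always maps to the same artifact.
    artifact_id = f"dataset-{source}-{external_id}"
    metadata = {
        "source": source,
        "external_id": external_id,
        # Fall back to the source-specific canonical URL when none is given.
        "url": url or _resolve_url(source, external_id),
        "license": license,
        "format": format,
        "row_count": row_count,
        "schema_summary": schema_summary,
        # Timezone-aware UTC timestamp.
        "last_checked_at": datetime.now(timezone.utc).isoformat(),
    }
    return register_artifact(
        # Assumes register_artifact accepts an explicit artifact_id.
        artifact_id=artifact_id,
        artifact_type="dataset",
        title=title or f"{source}:{external_id}",
        metadata=metadata,
        created_by="import",
        quality_score=0.8,
    )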
_resolve_url(source, external_id) → str
Map source + ID to a canonical URL:
- GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={id}
- Allen Brain: https://portal.brain-map.org/atlases-and-data/...
- Zenodo: https://zenodo.org/record/{id}
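The mapping above can be sketched as a small lookup function. This is an illustration under stated assumptions, not the code in artifact_registry.py: the function name `resolve_url`, the lowercase source keys, the extraction of the Zenodo record number from the DOI suffix, and the doi.org fallback for other DOI-shaped IDs are all assumptions. Allen Brain is deliberately left unresolved because only a truncated path is given.

```python
from typing import Optional


def resolve_url(source: str, external_id: str) -> Optional[str]:
    """Map (source, external_id) to a landing-page URL, or None if unknown."""
    source = source.lower()
    if source == "geo":
        return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={external_id}"
    if source == "zenodo":
        # Assumption: the record number is the part after "zenodo."
        # in a DOI such as 10.5281/zenodo.1234567.
        record_id = external_id.rsplit("zenodo.", 1)[-1]
        return f"https://zenodo.org/record/{record_id}"
    if external_id.startswith("10."):
        # DOI-shaped IDs (for example Figshare) resolve through doi.org.
        return f"https://doi.org/{external_id}"
    # No template available for this source (includes Allen Brain here).
    return None
```

Returning `None` for unknown sources lets the caller decide whether a missing URL is an error or whether an explicit `url` argument must be supplied.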
Freshness Check (optional, low priority)
A periodic task re-checks external URLs for 404s or updated metadata, updates the last_checked_at field, and flags stale datasets (those not checked in 30+ days).
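The staleness rule and the re-check step could look roughly like the sketch below. The names `is_stale` and `refresh`, the injected `url_ok` callable, and the `reachable` metadata field are hypothetical; only the 30-day threshold and the last_checked_at field come from the text above. Injecting the URL check keeps the logic testable without network access.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

STALE_AFTER = timedelta(days=30)  # threshold taken from the text above


def is_stale(last_checked_at: str, now: Optional[datetime] = None) -> bool:
    """True if the ISO-8601 timestamp is 30 or more days older than now."""
    now = now or datetime.now(timezone.utc)
    checked = datetime.fromisoformat(last_checked_at)
    if checked.tzinfo is None:
        checked = checked.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return now - checked >= STALE_AFTER


def refresh(metadata: dict, url_ok: Callable[[str], bool],
            now: Optional[datetime] = None) -> dict:
    """Re-check the dataset URL and return updated metadata (no data download)."""
    now = now or datetime.now(timezone.utc)
    updated = dict(metadata)  # copy so the caller's dict is not mutated
    updated["reachable"] = url_ok(metadata["url"])  # hypothetical field
    updated["last_checked_at"] = now.isoformat()
    return updated
```

In a real task, `url_ok` might issue an HTTP HEAD request and treat a 404 as unreachable; that part is left to the caller here.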
REST Endpoints
GET /api/datasets — list all registered external datasets with filtering by source
GET /api/dataset/{id} — dataset detail with linked artifacts (hypotheses, analyses)
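The payloads behind the two endpoints can be sketched independently of any web framework. Everything below is illustrative: the `ARTIFACTS` and `ARTIFACT_LINKS` sample data, the field names, and the function names `list_datasets` and `dataset_detail` are assumptions, and the actual routes in api.py may be structured differently.

```python
from typing import Optional

# Hypothetical in-memory data standing in for the artifact registry.
ARTIFACTS = [
    {"id": "dataset-geo-GSE123456", "type": "dataset", "source": "geo"},
    {"id": "dataset-uniprot-UP000005640", "type": "dataset", "source": "uniprot"},
    {"id": "hypothesis-1", "type": "hypothesis"},
]
# (from_id, to_id) pairs standing in for rows of artifact_links.
ARTIFACT_LINKS = [("hypothesis-1", "dataset-geo-GSE123456")]


def list_datasets(source: Optional[str] = None) -> list[dict]:
    """Payload for GET /api/datasets, optionally filtered by source."""
    rows = [a for a in ARTIFACTS if a["type"] == "dataset"]
    if source is not None:
        rows = [a for a in rows if a["source"] == source]
    return rows


def dataset_detail(dataset_id: str) -> Optional[dict]:
    """Payload for GET /api/dataset/{id}; None would map to a 404 response."""
    match = next(
        (a for a in ARTIFACTS if a["id"] == dataset_id and a["type"] == "dataset"),
        None,
    )
    if match is None:
        return None
    # Collect the artifact on the other end of every link touching this dataset.
    linked_ids = {
        src if dst == dataset_id else dst
        for src, dst in ARTIFACT_LINKS
        if dataset_id in (src, dst)
    }
    linked = [a for a in ARTIFACTS if a["id"] in linked_ids]
    return {**match, "linked_artifacts": linked}
```

One point to watch: artifact IDs built from DOIs (Zenodo, Figshare) contain a slash, so a route shaped like /api/dataset/{id} would likely need the ID URL-encoded or matched as a full path segment.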
Acceptance Criteria
☐ register_dataset() creates artifact with type=dataset
☐ Artifact ID format: dataset-{source}-{external_id}
☐ URL auto-resolution for GEO, Allen Brain, Zenodo
☐ Dataset metadata includes source, external_id, url, schema_summary
☐ Duplicate registration (same source+external_id) returns existing artifact
☐ GET /api/datasets returns list with source filtering
☐ GET /api/dataset/{id} returns detail with linked artifacts
☐ Work log updated with timestamped entry
Dependencies
- a17-19-TYPE0001 (dataset artifact type must be registered)
Dependents
- a17-22-TABL0001 (tabular datasets reference external sources)
- d16-21-DSET0001 (demo: Allen Brain Cell Atlas)
Work Log
2026-04-24 11:25 PT — Codex
- Read the external-dataset spec and existing artifact registry / API routes before editing.
- Extended scidex/atlas/artifact_registry.py so external datasets get canonical IDs (dataset-{source}-{external_id}), source-aware URL resolution, duplicate detection by source + external_id, and helper functions for listing/detail payloads.
- Added GET /api/datasets with source filtering and GET /api/dataset/{id} for dataset detail plus linked artifacts in api.py.
- Added focused unit coverage in scidex/atlas/test_external_dataset_registry.py for canonical registration, duplicate reuse, source filtering, and linked-artifact enrichment.
- Verified with python -m py_compile api.py scidex/atlas/artifact_registry.py scidex/atlas/test_external_dataset_registry.py and pytest -q scidex/atlas/test_external_dataset_registry.py (4 passed).
2026-04-24 12:05 PT — Codex
- Added dataset freshness helpers in scidex/atlas/artifact_registry.py so registered datasets expose last_checked_at / stale status and can be re-checked periodically without downloading the underlying data.
- Added coverage for freshness refresh behavior in scidex/atlas/test_external_dataset_registry.py.
2026-04-24 19:50 PT — GLM-5 (slot 63)
- Verified all acceptance criteria met:
  - register_dataset() creates artifact with type=dataset ✓
  - Artifact ID format dataset-{source}-{external_id} ✓
  - URL auto-resolution for GEO, Allen Brain, Zenodo, Figshare, UniProt, Dryad, Synapse ✓
  - Metadata includes source, external_id, url, schema_summary ✓
  - Duplicate registration returns existing artifact ✓
  - GET /api/datasets with source filtering ✓
  - GET /api/dataset/{id} with linked artifacts ✓
- All 5 tests pass (pytest: 5 passed in 1.05s)
- All Python files compile cleanly
- Import patterns match existing api.py conventions (34 existing import artifact_registry references)
- Ready for merge and push.