[Artifacts] External dataset references: track datasets like we track papers
done · analysis:6 · coding:7

Create external dataset tracking using artifacts with type=dataset. Fields in metadata: source (zenodo, figshare, geo, allen_brain, etc.), external_id, url, license, description, schema_summary, last_checked_at. Implement register_dataset(source, external_id, metadata) similar to register_paper(). Add periodic freshness check. Do NOT download or host data -- just track references and metadata. Link datasets to KG entities via artifact_links. Depends on: a17-19-TYPE0001.

Completion Notes

Auto-completed by supervisor after successful deploy to main

Git Commits (3)

[Atlas] External dataset references: register_dataset, freshness check, API endpoints [task:a17-21-EXTD0001] (#32) · 2026-04-24
[Atlas] Update spec work log — all acceptance criteria verified [task:a17-21-EXTD0001] · 2026-04-24
[Atlas] External dataset references: register_dataset, freshness check, API endpoints [task:a17-21-EXTD0001] · 2026-04-24
Spec File

[Artifacts] External dataset references: track datasets like we track papers

Goal

Just as SciDEX tracks papers by PubMed ID without hosting the full text, we need to track external datasets by their repository IDs without hosting the data. This enables analyses to reference specific datasets, link them to hypotheses, and track when datasets are updated upstream.

Design Principles

  • Reference, don't replicate: Store metadata and URLs, not the actual data
  • Mirror the paper model: register_dataset() should feel like register_paper()
  • Lightweight: Don't become HuggingFace — just enough metadata to know what the dataset contains and how to find it
  • Linkable: Datasets connect to KG entities, hypotheses, and analyses via artifact_links

Supported Sources (initial)

Source      | ID Format               | Metadata API      | Example
GEO         | GSE123456               | NCBI E-utils      | Gene expression arrays
Allen Brain | allen:SEA-AD            | brain-map.org API | Cell atlas data
Zenodo      | 10.5281/zenodo.XXX      | Zenodo API        | General scientific data
Figshare    | 10.6084/m9.figshare.XXX | Figshare API      | Supplementary data
UniProt     | UP000005640             | UniProt API       | Proteome datasets

Implementation

register_dataset(source, external_id, **kwargs) → artifact_id

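from datetime import datetime, timezone

def register_dataset(source, external_id, url=None, title=None,
                     description=None, license=None, schema_summary=None,
                     row_count=None, format=None):
    # Canonical ID: the same source + external_id always maps to the same artifact.
    artifact_id = f"dataset-{source}-{external_id}"
    metadata = {
        "source": source,
        "external_id": external_id,
        # Fall back to the canonical URL for the source when none is given.
        "url": url or _resolve_url(source, external_id),
        "license": license,
        "description": description,
        "format": format,
        "row_count": row_count,
        "schema_summary": schema_summary,
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        "last_checked_at": datetime.now(timezone.utc).isoformat(),
    }
    return register_artifact(
        artifact_id=artifact_id,  # assumes register_artifact() accepts an explicit ID
        artifact_type="dataset",
        title=title or f"{source}:{external_id}",
        metadata=metadata,
        created_by="import",
        quality_score=0.8,
    )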

_resolve_url(source, external_id) → str

Map source+ID to canonical URL:
  • GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={id}
  • Allen Brain: https://portal.brain-map.org/atlases-and-data/...
  • Zenodo: https://zenodo.org/record/{id} (here {id} is the numeric record ID, i.e. the XXX in 10.5281/zenodo.XXX, not the full DOI)
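
The mapping above could be sketched as a small template lookup. This is an illustration under stated assumptions, not the code in artifact_registry.py: the name resolve_url, the lowercase source keys, and returning None for an unknown source are all hypothetical choices. The Allen Brain template is omitted because its URL is elided in the spec, and the Zenodo template is assumed to take the numeric record ID.

```python
from typing import Optional
from urllib.parse import quote

# Only the templates spelled out in full in the spec. The Allen Brain URL is
# truncated there, so it is deliberately left out of this sketch.
_URL_TEMPLATES = {
    "geo": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={id}",
    "zenodo": "https://zenodo.org/record/{id}",
}


def resolve_url(source: str, external_id: str) -> Optional[str]:
    """Return the canonical URL for a source + ID, or None if the source is unknown."""
    template = _URL_TEMPLATES.get(source.lower())
    if template is None:
        # Assumption: unknown sources yield None so the caller must pass url= explicitly.
        return None
    # Percent-encode the ID so it cannot alter the URL structure.
    return template.format(id=quote(external_id, safe=""))


print(resolve_url("geo", "GSE123456"))
print(resolve_url("allen_brain", "SEA-AD"))
```

Keeping the templates in a dict means adding a source is a one-line change, and the unknown-source case stays explicit rather than producing a malformed URL.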

Freshness Check (optional, low priority)

A periodic task re-checks each registered dataset's URL for 404s or changed upstream metadata and updates the last_checked_at field. Datasets that have not been checked in 30 or more days are flagged as stale.
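
The staleness rule can be sketched as a pure function over the stored timestamp. The name is_stale and the constant STALE_AFTER are hypothetical, and the sketch covers only the 30-day flag, not the HTTP re-check itself. It assumes last_checked_at is an ISO 8601 string and treats a timestamp without an offset as UTC.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# "Not checked in 30+ days" from the spec.
STALE_AFTER = timedelta(days=30)


def is_stale(last_checked_at: str, now: Optional[datetime] = None) -> bool:
    """Return True if the dataset was last checked 30 or more days before `now`."""
    checked = datetime.fromisoformat(last_checked_at)
    if checked.tzinfo is None:
        # Assumption: naive timestamps were written in UTC.
        checked = checked.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - checked >= STALE_AFTER


reference = datetime(2026, 5, 30, tzinfo=timezone.utc)
print(is_stale("2026-04-24T19:25:00", now=reference))
print(is_stale("2026-05-20T00:00:00+00:00", now=reference))
```

Passing now explicitly keeps the check deterministic in tests; a periodic job would call it with the default.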

REST Endpoints

  • GET /api/datasets — list all registered external datasets with filtering by source
  • GET /api/dataset/{id} — dataset detail with linked artifacts (hypotheses, analyses)
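
The payloads behind these two endpoints could be built by framework-agnostic helpers along the following lines. Everything here is an assumption for illustration: the function names, the dict shape of an artifact (id, artifact_type, metadata), and the source_id / target_id fields on an artifact_links row are hypothetical and may not match the real schema.

```python
def list_datasets(artifacts, source=None):
    """Return dataset artifacts, optionally restricted to a single source."""
    return [
        a for a in artifacts
        if a["artifact_type"] == "dataset"
        and (source is None or a["metadata"].get("source") == source)
    ]


def dataset_detail(artifact_id, artifacts, links):
    """Return one dataset plus the artifacts linked to it, or None if not found."""
    by_id = {a["id"]: a for a in artifacts}
    dataset = by_id.get(artifact_id)
    if dataset is None or dataset["artifact_type"] != "dataset":
        return None
    linked_ids = []
    for link in links:
        # Follow links in either direction (assumed link shape).
        if link["source_id"] == artifact_id:
            linked_ids.append(link["target_id"])
        elif link["target_id"] == artifact_id:
            linked_ids.append(link["source_id"])
    linked = [by_id[i] for i in linked_ids if i in by_id]
    return {**dataset, "linked_artifacts": linked}


artifacts = [
    {"id": "dataset-geo-GSE123456", "artifact_type": "dataset", "metadata": {"source": "geo"}},
    {"id": "dataset-zenodo-1234567", "artifact_type": "dataset", "metadata": {"source": "zenodo"}},
    {"id": "hypothesis-1", "artifact_type": "hypothesis", "metadata": {}},
]
links = [{"source_id": "hypothesis-1", "target_id": "dataset-geo-GSE123456"}]

geo_only = list_datasets(artifacts, source="geo")
detail = dataset_detail("dataset-geo-GSE123456", artifacts, links)
print([a["id"] for a in geo_only])
print([a["id"] for a in detail["linked_artifacts"]])
```

A route handler would then only need to translate a None result from dataset_detail into a 404 response.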

Acceptance Criteria

☐ register_dataset() creates artifact with type=dataset
☐ Artifact ID format: dataset-{source}-{external_id}
☐ URL auto-resolution for GEO, Allen Brain, Zenodo
☐ Dataset metadata includes source, external_id, url, schema_summary
☐ Duplicate registration (same source+external_id) returns existing artifact
☐ GET /api/datasets returns list with source filtering
☐ GET /api/dataset/{id} returns detail with linked artifacts
☐ Work log updated with timestamped entry

Dependencies

  • a17-19-TYPE0001 (dataset artifact type must be registered)

Dependents

  • a17-22-TABL0001 (tabular datasets reference external sources)
  • d16-21-DSET0001 (demo: Allen Brain Cell Atlas)

Work Log

2026-04-24 11:25 PT — Codex

  • Read the external-dataset spec and existing artifact registry / API routes before editing.
  • Extended scidex/atlas/artifact_registry.py so external datasets get canonical IDs (dataset-{source}-{external_id}), source-aware URL resolution, duplicate detection by source + external_id, and helper functions for listing/detail payloads.
  • Added GET /api/datasets with source filtering and GET /api/dataset/{id} for dataset detail plus linked artifacts in api.py.
  • Added focused unit coverage in scidex/atlas/test_external_dataset_registry.py for canonical registration, duplicate reuse, source filtering, and linked-artifact enrichment.
  • Verified with python -m py_compile api.py scidex/atlas/artifact_registry.py scidex/atlas/test_external_dataset_registry.py and pytest -q scidex/atlas/test_external_dataset_registry.py (4 passed).

2026-04-24 12:05 PT — Codex

  • Added dataset freshness helpers in scidex/atlas/artifact_registry.py so registered datasets expose last_checked_at / stale status and can be re-checked periodically without downloading the underlying data.
  • Added coverage for freshness refresh behavior in scidex/atlas/test_external_dataset_registry.py.

2026-04-24 19:50 PT — GLM-5 (slot 63)

  • Verified all acceptance criteria met:
    - register_dataset() creates artifact with type=dataset ✓
    - Artifact ID format dataset-{source}-{external_id} ✓
    - URL auto-resolution for GEO, Allen Brain, Zenodo, Figshare, UniProt, Dryad, Synapse ✓
    - Metadata includes source, external_id, url, schema_summary ✓
    - Duplicate registration returns existing artifact ✓
    - GET /api/datasets with source filtering ✓
    - GET /api/dataset/{id} with linked artifacts ✓
  • All 5 tests pass (pytest: 5 passed in 1.05s)
  • All Python files compile cleanly
  • Import patterns match existing api.py conventions (34 existing import artifact_registry references)
  • Ready for merge and push.

Payload JSON
{
  "requirements": {
    "coding": 7,
    "analysis": 6
  },
  "completion_shas": [
    "026d14776963db90c0a0cf6db06c282f7405e607"
  ],
  "completion_shas_checked_at": ""
}
