[Atlas] Extract and catalog datasets cited in 20 high-impact analyses

Goal

Identify and catalog datasets referenced in the top 20 high-impact analyses (ranked by kg_impact_score),
creating datasets rows for new datasets and dataset_citations rows linking them to analyses.
This enriches the Atlas layer's dataset registry and enables dataset-level provenance tracking,
cross-analysis discovery, and future discovery dividend attribution.

Acceptance Criteria

☑ ≥15 analyses with linked datasets via dataset_citations
☑ ≥10 new dataset_citations rows created
☑ New dataset entries include name, description, source_url (stored in the description), version, and license
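
The two numeric criteria above can be checked with a pair of counting queries. The following is a minimal sketch, not the check actually used for this task: it assumes dataset_citations has an `analysis_id` column (a hypothetical name, since the spec does not show the schema), and the function name `check_acceptance` is likewise illustrative. The caller passes the citation count recorded before the run.

```python
import sqlite3

MIN_LINKED_ANALYSES = 15  # threshold from the first acceptance criterion
MIN_NEW_CITATIONS = 10    # threshold from the second acceptance criterion


def check_acceptance(conn: sqlite3.Connection, citations_before: int) -> dict:
    """Compare current dataset_citations counts against the two thresholds.

    Assumes a hypothetical `analysis_id` column on dataset_citations.
    `citations_before` is the row count recorded before the cataloging run.
    """
    linked = conn.execute(
        "SELECT COUNT(DISTINCT analysis_id) FROM dataset_citations"
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM dataset_citations").fetchone()[0]
    new = total - citations_before
    return {
        "linked_analyses": linked,
        "new_citations": new,
        "passed": linked >= MIN_LINKED_ANALYSES and new >= MIN_NEW_CITATIONS,
    }
```

Counting distinct `analysis_id` values rather than rows keeps an analysis that cites several datasets from being counted more than once.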

Approach

  • Query the top 20 analyses by kg_impact_score (used as a proxy for "high-impact" because the analyses table has no citation_count column)
  • Parse debate content (debate_rounds.content) and analysis questions for dataset names/URLs
  • Identify known neurodegeneration datasets: Allen Brain Atlas, SEA-AD, PPMI, HCP, AMP-AD, ENCODE, etc.
  • For each new dataset found, create a datasets row with id, name, title, description, license, storage_backend=remote
  • For each dataset-analysis pair, create a dataset_citations row linking them
  • Verify acceptance criteria: ≥15 analyses linked, ≥10 new citations
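
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of scripts/catalog_analysis_datasets.py: the alias table, the function names, the `id` and `question` columns on analyses, the `analysis_id` column on debate_rounds and dataset_citations, and the uniqueness constraint on (dataset_id, analysis_id) are all hypothetical. Only the table names, kg_impact_score, debate_rounds.content, and storage_backend=remote come from this spec.

```python
import re
import sqlite3

# Hypothetical alias table: dataset id -> phrases that signal a mention.
DATASET_ALIASES = {
    "ppmi-biomarker": ["PPMI"],
    "hcp-connectome": ["Human Connectome Project", "HCP"],
    "amp-ad-portal": ["AMP-AD"],
}


def find_datasets(text: str) -> set[str]:
    """Return ids of datasets whose aliases appear in text as whole words."""
    found = set()
    for dataset_id, aliases in DATASET_ALIASES.items():
        for alias in aliases:
            if re.search(rf"\b{re.escape(alias)}\b", text, flags=re.IGNORECASE):
                found.add(dataset_id)
                break
    return found


def catalog_top_analyses(conn: sqlite3.Connection, limit: int = 20) -> int:
    """Link the top `limit` analyses to mentioned datasets.

    Returns the number of dataset_citations rows added by this call.
    Column names other than kg_impact_score and content are assumptions.
    """
    count_sql = "SELECT COUNT(*) FROM dataset_citations"
    before = conn.execute(count_sql).fetchone()[0]
    top = conn.execute(
        "SELECT id, question FROM analyses "
        "ORDER BY kg_impact_score DESC LIMIT ?",
        (limit,),
    ).fetchall()
    for analysis_id, question in top:
        rounds = conn.execute(
            "SELECT content FROM debate_rounds WHERE analysis_id = ?",
            (analysis_id,),
        ).fetchall()
        # Search the analysis question and all debate round content together.
        text = " ".join([question or ""] + [row[0] or "" for row in rounds])
        for dataset_id in sorted(find_datasets(text)):
            # OR IGNORE skips datasets and citations that already exist.
            conn.execute(
                "INSERT OR IGNORE INTO datasets (id, name, storage_backend) "
                "VALUES (?, ?, 'remote')",
                (dataset_id, dataset_id),
            )
            conn.execute(
                "INSERT OR IGNORE INTO dataset_citations "
                "(dataset_id, analysis_id) VALUES (?, ?)",
                (dataset_id, analysis_id),
            )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before
```

Whole-word matching avoids false hits on short aliases such as HCP inside longer tokens. With a primary key on datasets.id and a uniqueness constraint on the citation pair, `INSERT OR IGNORE` makes a second run add nothing, so the function is safe to rerun.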

Dependencies

None; uses the existing datasets, dataset_citations, and analyses tables.

Dependents

  • Future dataset quality scoring tasks can build on these citations
  • Discovery dividend driver will back-propagate rewards to dataset authors when analyses validate

Work Log

2026-04-26 22:00 UTC - Slot 41 (claude-auto)

  • Read AGENTS.md and CLAUDE.md for context
  • Investigated DB schema: analyses has no citation_count column; used kg_impact_score as proxy
  • Queried top 20 analyses by kg_impact_score (range: 651 down to 211)
  • Found existing datasets (36): seaad-snRNAseq, seaad-spatial, abc-atlas-spatial, gmrepo-pd,
    adni-longitudinal, adni-biomarker, gwas-catalog-ad, opentargets-ad, clinicaltrials-gov-ad,
    rosmap-rnaseq, amp-ad-mayo, etc.
  • Identified 17 new datasets to create by parsing analysis questions and debate content:
    - allen-aging-mouse-brain (Allen Aging Mouse Brain Atlas)
    - allen-neural-dynamics (Allen Institute Neural Dynamics)
    - ppmi-biomarker (PPMI Longitudinal Cohort)
    - hcp-connectome (Human Connectome Project)
    - amp-ad-portal (AMP-AD Synapse Knowledge Portal)
    - encode-neuro-chipseq (ENCODE neuronal ChIP-seq/ATAC-seq)
    - niagads-ad-genomics (NIAGADS AD genomics)
    - gut-microbiome-pd-16s (PD 16S rRNA microbiome studies)
    - braak-staging-neuropath (Braak staging reference data)
    - sleep-eeg-neurodegeneration (polysomnography cohorts)
    - bbb-transcytosis-proteomics (BBB surface proteome data)
    - senescence-scrnaseq-atlas (Brain senescence scRNA-seq atlas)
    - digital-biomarker-nd (Wearable/digital biomarker studies)
    - crispr-neuro-screen-data (CRISPR neurodegeneration screens)
    - lipid-raft-lipidomics (Synaptic lipid raft lipidomics)
    - glymphatic-mri-ad (Glymphatic clearance MRI data)
    - tau-strain-cryo-em (Tau filament cryo-EM structures)
  • Wrote and ran scripts/catalog_analysis_datasets.py
  • Results:
    - 17 new datasets inserted (total: 53)
    - 47 new dataset_citations created (total: 53)
    - 25 analyses now have linked datasets (≥15 ✓)
    - 47 new citations (≥10 ✓)
  • Committed and pushed to task branch

File: 6f78b335_atlas_extract_catalog_datasets_spec.md
Modified: 2026-04-26 08:21
Size: 3.4 KB