Goal
Identify and catalog datasets referenced in the top 20 high-impact analyses (by kg_impact_score),
creating datasets rows for new datasets and dataset_citations linking them to analyses.
This enriches the Atlas layer's dataset registry and enables dataset-level provenance tracking,
cross-analysis discovery, and future discovery dividend attribution.
Acceptance Criteria
☑ ≥15 analyses with linked datasets via dataset_citations
☑ ≥10 new dataset_citations rows created
☑ New dataset entries include name, description, source_url (in description), version, license
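The first two criteria reduce to two counts over dataset_citations. A minimal sketch of such a check follows. It assumes a SQLite connection and assumes dataset_citations has dataset_id and analysis_id columns; neither the database engine nor the column names are confirmed by this spec. The helper names check_acceptance and build_demo_db are hypothetical.

```python
import sqlite3


def check_acceptance(conn, citations_before):
    """Compute the counts behind the first two acceptance criteria.

    citations_before is the dataset_citations row count recorded before
    the cataloging run, so that "new" rows can be derived by subtraction.
    Column names here are assumptions, not the confirmed schema.
    """
    analyses_linked = conn.execute(
        "SELECT COUNT(DISTINCT analysis_id) FROM dataset_citations"
    ).fetchone()[0]
    total_citations = conn.execute(
        "SELECT COUNT(*) FROM dataset_citations"
    ).fetchone()[0]
    new_citations = total_citations - citations_before
    return {
        "analyses_linked": analyses_linked,
        "new_citations": new_citations,
        "passed": analyses_linked >= 15 and new_citations >= 10,
    }


def build_demo_db():
    """Create an in-memory table with 16 made-up citation rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset_citations (dataset_id TEXT, analysis_id TEXT)"
    )
    rows = [("demo-dataset", f"analysis-{i}") for i in range(16)]
    conn.executemany("INSERT INTO dataset_citations VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    print(check_acceptance(demo, citations_before=0))
```

The third criterion (required fields on new dataset entries) is not covered here, since it depends on the actual datasets columns.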
Approach
Query top 20 analyses by kg_impact_score (used as a proxy for "high-impact", since the analyses table has no citation_count column)
Parse debate content (debate_rounds.content) and analysis questions for dataset names/URLs
Identify known neurodegeneration datasets: Allen Brain Atlas, SEA-AD, PPMI, HCP, AMP-AD, ENCODE, etc.
For each new dataset found, create a datasets row with id, name, title, description, license, storage_backend=remote
For each dataset-analysis pair, create a dataset_citations row linking them
Verify acceptance criteria: ≥15 analyses linked, ≥10 new citations
Dependencies
None — uses existing datasets, dataset_citations, and analyses tables.
Dependents
- Future dataset quality scoring tasks can build on these citations
- Discovery dividend driver will back-propagate rewards to dataset authors once analyses are validated
Work Log
2026-04-26 22:00 UTC — Slot 41 (claude-auto)
- Read AGENTS.md and CLAUDE.md for context
- Investigated DB schema: analyses has no citation_count column; used kg_impact_score as proxy
- Queried top 20 analyses by kg_impact_score (range: 651 down to 211)
- Found existing datasets (36): seaad-snRNAseq, seaad-spatial, abc-atlas-spatial, gmrepo-pd,
adni-longitudinal, adni-biomarker, gwas-catalog-ad, opentargets-ad, clinicaltrials-gov-ad,
rosmap-rnaseq, amp-ad-mayo, etc.
- Identified 17 new datasets to create by parsing analysis questions and debate content:
- allen-aging-mouse-brain (Allen Aging Mouse Brain Atlas)
- allen-neural-dynamics (Allen Institute Neural Dynamics)
- ppmi-biomarker (PPMI Longitudinal Cohort)
- hcp-connectome (Human Connectome Project)
- amp-ad-portal (AMP-AD Synapse Knowledge Portal)
- encode-neuro-chipseq (ENCODE neuronal ChIP-seq/ATAC-seq)
- niagads-ad-genomics (NIAGADS AD genomics)
- gut-microbiome-pd-16s (PD 16S rRNA microbiome studies)
- braak-staging-neuropath (Braak staging reference data)
- sleep-eeg-neurodegeneration (polysomnography cohorts)
- bbb-transcytosis-proteomics (BBB surface proteome data)
- senescence-scrnaseq-atlas (Brain senescence scRNA-seq atlas)
- digital-biomarker-nd (Wearable/digital biomarker studies)
- crispr-neuro-screen-data (CRISPR neurodegeneration screens)
- lipid-raft-lipidomics (Synaptic lipid raft lipidomics)
- glymphatic-mri-ad (Glymphatic clearance MRI data)
- tau-strain-cryo-em (Tau filament cryo-EM structures)
- Wrote and ran scripts/catalog_analysis_datasets.py
- Results:
- 17 new datasets inserted (total: 53)
- 47 new dataset_citations created (total: 53)
- 25 analyses now have linked datasets (≥15 ✓)
- 47 new citations (≥10 ✓)
- Committed and pushed to task branch
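
The parse-and-link steps from the Approach can be illustrated with a small sketch. This is not the content of scripts/catalog_analysis_datasets.py, which is not reproduced in this log. The dataset ids come from the list above, but the alias phrases, the function names find_datasets and link_citations, the dataset_citations column names, and the use of SQLite are all assumptions made for illustration.

```python
import re
import sqlite3

# Hypothetical alias table: dataset id -> phrases that count as a mention.
# The ids appear in this work log; the phrases are illustrative guesses.
ALIASES = {
    "ppmi-biomarker": ["PPMI"],
    "hcp-connectome": ["Human Connectome Project", "HCP"],
    "encode-neuro-chipseq": ["ENCODE"],
}


def find_datasets(text):
    """Return sorted dataset ids whose aliases occur in text.

    Matching is case-sensitive and on word boundaries, so the verb
    "encode" does not count as a mention of ENCODE.
    """
    found = set()
    for dataset_id, phrases in ALIASES.items():
        for phrase in phrases:
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                found.add(dataset_id)
                break
    return sorted(found)


def link_citations(conn, analysis_id, text):
    """Insert one citation row per dataset mentioned in text.

    INSERT OR IGNORE is SQLite syntax; together with a primary key on
    (dataset_id, analysis_id) it makes re-running the step idempotent.
    """
    dataset_ids = find_datasets(text)
    conn.executemany(
        "INSERT OR IGNORE INTO dataset_citations (dataset_id, analysis_id) "
        "VALUES (?, ?)",
        [(dataset_id, analysis_id) for dataset_id in dataset_ids],
    )
    return dataset_ids


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset_citations ("
        "dataset_id TEXT, analysis_id TEXT, "
        "PRIMARY KEY (dataset_id, analysis_id))"
    )
    sample = "We compared PPMI cohorts with Human Connectome Project scans."
    print(link_citations(conn, "analysis-1", sample))
```

Keyword matching of this kind is only a first pass: it would miss datasets referenced by URL alone or under unlisted names, so the matches would still need manual review.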