Populate quality scores for registered datasets so dataset reuse, citation rewards, and governance can prioritize well-documented scientific data. Scores should consider schema completeness, provenance, license clarity, citation coverage, and reuse readiness.
- quality_score values between 0 and 1
- schema_json, canonical_path, license, and citation metadata.
- quest-engine-ci - Generates this task when queue depth is low and unscored datasets exist.

- quality_score IS NULL; the task's original batch size of 8 is still actionable as a bounded scoring batch, but it no longer exhausts all unscored datasets.
- The datasets table has quality_score but no quality_notes column; this batch will add a nullable quality_notes text column so the requested rationale can live with the score.
- Score 8 datasets (wrap-biomarker, ad-trial-tracker, ukb-ad-gwas, allen-aging-mouse-brain, allen-neural-dynamics, amp-ad-portal, bbb-transcytosis-proteomics, braak-staging-neuropath) using a 4-part rubric: provenance completeness, schema conformance, spot-check accuracy, and domain completeness.
- Write a dataset_versions audit row per scored dataset, and file specific improvement tasks for any dataset scored below 0.5.

- scripts/score_dataset_quality_batch_9df5913c.py; added nullable live DB column datasets.quality_notes because the task required notes but the table only had quality_score.
- quality_score IS NULL; 8 scored datasets now have non-null quality_score and quality_notes.
- ukb-ad-gwas: 0.36 - generic UKB URL only, no accession/phenotype/schema/row provenance.
- ad-trial-tracker: 0.40 - broad Alzheimer’s Association pages, no structured extract/schema/trial IDs.
- allen-neural-dynamics: 0.42 - broad institutional reference, no pinned release/schema/direct ND scope.
- allen-aging-mouse-brain: 0.46 - no schema/local rows; naming/source normalization needed.
- wrap-biomarker: 0.46 - real WRAP cohort, but no schema/data dictionary/row provenance.
- bbb-transcytosis-proteomics: 0.52 - plausible cited target set, but no row-level curation/schema.
- braak-staging-neuropath: 0.57 - accurate staging references, but no tabular schema/row evidence tiers.
- amp-ad-portal: 0.66 - specific Synapse portal and strong source provenance, but controlled access/no local schema.
- dataset_versions audit rows with rubric components, source URLs, notes, and task ID in diff_stat.
- Improvement tasks: 790cfae9-e501-4fb5-be66-34052fe06760 (WRAP), 0e79463f-4092-406c-b902-002ba3b1ae6b (AD trial tracker), c43a0413-2405-47e6-a25c-6b8c7a95d3b4 (UKB AD GWAS), 1d173c30-6e6f-47d0-bb95-84f29b3a4e8d (Allen aging mouse brain), 64f00534-b410-440b-945f-3dcd9b0fc813 (Allen neural dynamics).
- python3 -m py_compile scripts/score_dataset_quality_batch_9df5913c.py; SQL check confirmed 53 total datasets, 12 unscored, 8 scored-with-notes for this batch, and 8 task-specific dataset_versions rows.
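
The rubric and the 0.5 follow-up threshold can be sketched roughly as below. This is an illustration only, not the contents of scripts/score_dataset_quality_batch_9df5913c.py: the equal weighting of the four rubric parts, the function names combine_rubric and below_threshold, and the component key names are all assumptions. Only the final per-dataset scores are taken from the log above.

```python
from statistics import fmean

# The four rubric parts named in this task; the key spellings are hypothetical.
RUBRIC_PARTS = (
    "provenance_completeness",
    "schema_conformance",
    "spot_check_accuracy",
    "domain_completeness",
)


def combine_rubric(components: dict[str, float]) -> float:
    """Combine four component scores in [0, 1] into one quality_score.

    Assumption: the parts are weighted equally. The real script may differ.
    """
    missing = [part for part in RUBRIC_PARTS if part not in components]
    if missing:
        raise ValueError(f"missing rubric parts: {missing}")
    for part in RUBRIC_PARTS:
        if not 0.0 <= components[part] <= 1.0:
            raise ValueError(f"{part} must be between 0 and 1")
    return round(fmean(components[part] for part in RUBRIC_PARTS), 2)


def below_threshold(scores: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the dataset names whose score is strictly below the threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


# Final scores as reported in the work log for this batch.
REPORTED_SCORES = {
    "ukb-ad-gwas": 0.36,
    "ad-trial-tracker": 0.40,
    "allen-neural-dynamics": 0.42,
    "allen-aging-mouse-brain": 0.46,
    "wrap-biomarker": 0.46,
    "bbb-transcytosis-proteomics": 0.52,
    "braak-staging-neuropath": 0.57,
    "amp-ad-portal": 0.66,
}

# Datasets that would need a follow-up improvement task under the 0.5 rule.
flagged = below_threshold(REPORTED_SCORES)
```

With the reported scores, five datasets fall below 0.5, which matches the five improvement task IDs listed above.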