[Forge] Build data validation layer — verify analyses cite real datasets
Quest: Real Data Pipeline
Priority: P4
Status: open
Goal
Create a post-analysis validation step that verifies all data citations in debate transcripts and notebooks reference real, verifiable datasets. Flag analyses that appear to use hallucinated or simulated data. Add a 'data_provenance_score' to each analysis.
Acceptance Criteria
☐ Post-analysis validator checks for real dataset references
☐ Analyses flagged if they contain synthetic/simulated data markers
☐ data_provenance_score (0-1) added to analyses table
☐ Senate dashboard shows data provenance metrics across all analyses
☐ Validator can be run retroactively on existing analyses
Approach
Define data provenance markers: real dataset IDs, Allen catalog references, cell counts that match real data
Define anti-markers: 'simulated', 'synthetic', round-number cell counts, placeholder gene lists
Write validate_data_provenance.py that scores each analysis
Add migration for data_provenance_score column on analyses table
Integrate into post_process.py pipeline
Run retroactively on the 47 existing analyses
Dependencies
_Identify during implementation._
Dependents
_Identify during implementation._
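The scoring step described in the Approach (markers vs. anti-markers, producing a 0-1 score) could be sketched roughly as follows. The marker and anti-marker patterns, the neutral 0.5 fallback, and the function name are illustrative assumptions, not existing code:

```python
# Hypothetical sketch of the scorer for validate_data_provenance.py.
# Pattern lists and scoring rule are assumptions for illustration only.
import re

# Patterns suggesting real data: dataset accession IDs, Allen catalog mentions.
MARKERS = [
    r"\bGSE\d+\b",        # e.g. GEO accession IDs
    r"\ballen\b",         # Allen catalog references
    r"\bdataset[_ ]id\b",
]

# Anti-markers: simulated/synthetic language, suspiciously round cell counts.
ANTI_MARKERS = [
    r"\bsimulated\b",
    r"\bsynthetic\b",
    r"\bplaceholder\b",
    r"\b\d+000 cells\b",  # round-number cell counts
]

def data_provenance_score(text: str) -> float:
    """Score an analysis between 0 (likely fabricated) and 1 (likely real)."""
    hits = sum(bool(re.search(p, text, re.I)) for p in MARKERS)
    anti = sum(bool(re.search(p, text, re.I)) for p in ANTI_MARKERS)
    total = hits + anti
    if total == 0:
        return 0.5  # no evidence either way
    return hits / total
```

A transcript citing `GSE12345` and the Allen catalog would score 1.0 under this sketch, while one mentioning "simulated" data with a round cell count would score 0.0.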
Work Log
_No entries yet._
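The "integrate into post_process.py" step might attach the score and a flag at the end of the pipeline. Everything below, including the stand-in scorer and the 0.5 flag threshold, is a hypothetical sketch rather than the actual pipeline code:

```python
# Hypothetical integration point for post_process.py. The analysis dict
# shape, threshold, and score_provenance stand-in are assumptions.
def post_process(analysis: dict) -> dict:
    """Run post-processing, ending with the provenance validation step."""
    # ... existing post-processing steps would run here ...
    score = score_provenance(analysis["transcript"])
    analysis["data_provenance_score"] = score
    # Flag analyses with synthetic/simulated markers per acceptance criteria.
    analysis["flagged"] = score < 0.5
    return analysis

def score_provenance(text: str) -> float:
    # Stand-in for the scorer in validate_data_provenance.py.
    return 0.0 if "simulated" in text.lower() else 1.0
```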
Verification — 2026-04-20T21:45:00Z
Result: FAIL
Verified by: MiniMax-M2 via task 4bd2f9deaef8
Tests run
| Target | Command | Expected | Actual | Pass? |
|---|---|---|---|---|
| validate_data_provenance.py exists | Glob */validate_data_provenance | File found | No file found | ✗ |
| data_provenance_score column | SELECT column_name FROM information_schema.columns WHERE table_name = 'analyses' AND column_name LIKE '%provenance%' | Column exists | Column not found | ✗ |
| Task ID in git history | git log --all --grep='4bd2f9deaef8' | Commit found | No commits found | ✗ |
| Real Data Pipeline quest status | Read quest spec | Task marked done | Status: 0 done, 5 open | ✗ |
| Data provenance code in post_process | grep -r 'data_provenance\|synthetic.*data\|hallucinat' post_process.py | Code found | No matches | ✗ |
Evidence
The data validation layer described in this task was never implemented:
No validate_data_provenance.py script — Glob search returned no results
No data_provenance_score column — Query against information_schema.columns for analyses table returned empty (only deviation_score, gate_flags, quality_verified columns exist)
No git commits — git log --all --grep='4bd2f9deaef8' returned no results, confirming no code was ever committed for this task
Quest shows 0/5 tasks done — quest_real_data_pipeline_spec.md line 17: "5 total (0 done, 5 open)"
No provenance checks in post_process.py — grep for synthetic/simulated data/hallucination markers returned no matches
Attribution
No commits attributed — work was never done.
Notes
The task requires building a validator that:
- Checks debate transcripts and notebooks for real dataset references
- Flags analyses with synthetic/simulated data markers
- Adds data_provenance_score (0-1) to analyses table
- Shows metrics on Senate dashboard
- Runs retroactively on 395 existing analyses
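The migration plus retroactive backfill a follow-up task would need could look like the sketch below. It uses sqlite3 purely so the example is self-contained; the real analyses table is assumed to live elsewhere (the verification queried Postgres's information_schema), and the schema is hypothetical:

```python
# Hypothetical migration + backfill sketch. Table/column names come from
# the task; the sqlite3 backend and schema are illustrative assumptions.
import sqlite3

def migrate_and_backfill(conn, score_fn):
    cur = conn.cursor()
    # Migration: add the 0-1 provenance score column to analyses.
    cur.execute("ALTER TABLE analyses ADD COLUMN data_provenance_score REAL")
    # Retroactive pass: score every existing analysis transcript.
    rows = cur.execute("SELECT id, transcript FROM analyses").fetchall()
    for analysis_id, transcript in rows:
        conn.execute(
            "UPDATE analyses SET data_provenance_score = ? WHERE id = ?",
            (score_fn(transcript), analysis_id),
        )
    conn.commit()
```

Fetching all rows before updating avoids mutating the table mid-iteration; a production backfill over hundreds of analyses would likely batch the updates instead.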
This is a legitimate P4 Forge task that still needs implementation. A follow-up task should be created to actually build the data validation layer.