[Forge] Build data validation layer — verify analyses cite real datasets
Quest: Real Data Pipeline
Priority: P4
Status: open
Goal
Create a post-analysis validation step that verifies all data citations in debate transcripts and notebooks reference real, verifiable datasets. Flag analyses that appear to use hallucinated or simulated data. Add a 'data_provenance_score' to each analysis.
Acceptance Criteria
☐ Post-analysis validator checks for real dataset references
☐ Analyses flagged if they contain synthetic/simulated data markers
☐ data_provenance_score (0-1) added to analyses table
☐ Senate dashboard shows data provenance metrics across all analyses
☐ Validator can be run retroactively on existing analyses
Approach
Define data provenance markers: real dataset IDs, Allen catalog references, cell counts that match real data
Define anti-markers: 'simulated', 'synthetic', round-number cell counts, placeholder gene lists
Write validate_data_provenance.py that scores each analysis
Add migration for data_provenance_score column on analyses table
Integrate into post_process.py pipeline
Run retroactively on the 47 existing analyses
Dependencies
_Identify during implementation._
Dependents
_Identify during implementation._
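The scoring step described in the Approach (markers vs. anti-markers, producing a 0-1 score) could be sketched roughly as follows. The marker and anti-marker patterns, the neutral 0.5 fallback, and the function name are illustrative assumptions, not existing code:

```python
# Hypothetical sketch of the scorer for validate_data_provenance.py.
# Pattern lists and scoring rule are assumptions for illustration only.
import re

# Patterns suggesting real data: dataset accession IDs, Allen catalog mentions.
MARKERS = [
    r"\bGSE\d+\b",        # e.g. GEO accession IDs
    r"\ballen\b",         # Allen catalog references
    r"\bdataset[_ ]id\b",
]

# Anti-markers: simulated/synthetic language, suspiciously round cell counts.
ANTI_MARKERS = [
    r"\bsimulated\b",
    r"\bsynthetic\b",
    r"\bplaceholder\b",
    r"\b\d+000 cells\b",  # round-number cell counts
]

def data_provenance_score(text: str) -> float:
    """Score an analysis between 0 (likely fabricated) and 1 (likely real)."""
    hits = sum(bool(re.search(p, text, re.I)) for p in MARKERS)
    anti = sum(bool(re.search(p, text, re.I)) for p in ANTI_MARKERS)
    total = hits + anti
    if total == 0:
        return 0.5  # no evidence either way
    return hits / total
```

A transcript citing `GSE12345` and the Allen catalog would score 1.0 under this sketch, while one mentioning "simulated" data with a round cell count would score 0.0.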
Work Log
_No entries yet._
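The "integrate into post_process.py" step might attach the score and a flag at the end of the pipeline. Everything below, including the stand-in scorer and the 0.5 flag threshold, is a hypothetical sketch rather than the actual pipeline code:

```python
# Hypothetical integration point for post_process.py. The analysis dict
# shape, threshold, and score_provenance stand-in are assumptions.
def post_process(analysis: dict) -> dict:
    """Run post-processing, ending with the provenance validation step."""
    # ... existing post-processing steps would run here ...
    score = score_provenance(analysis["transcript"])
    analysis["data_provenance_score"] = score
    # Flag analyses with synthetic/simulated markers per acceptance criteria.
    analysis["flagged"] = score < 0.5
    return analysis

def score_provenance(text: str) -> float:
    # Stand-in for the scorer in validate_data_provenance.py.
    return 0.0 if "simulated" in text.lower() else 1.0
```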
Verification — 2026-04-20T21:45:00Z
Result: FAIL
Verified by: MiniMax-M2 via task 4bd2f9deaef8
Tests run
| Target | Command | Expected | Actual | Pass? |
|---|---|---|---|---|
| validate_data_provenance.py exists | Glob */validate_data_provenance | File found | No file found | ✗ |
| data_provenance_score column | SELECT column_name FROM information_schema.columns WHERE table_name = 'analyses' AND column_name LIKE '%provenance%' | Column exists | Column not found | ✗ |
| Task ID in git history | git log --all --grep='4bd2f9deaef8' | Commit found | No commits found | ✗ |
| Real Data Pipeline quest status | Read quest spec | Task marked done | Status: 0 done, 5 open | ✗ |
| Data provenance code in post_process | grep -r 'data_provenance\|synthetic.*data\|hallucinat' post_process.py | Code found | No matches | ✗ |
Evidence
The data validation layer described in this task was never implemented:
No validate_data_provenance.py script — Glob search returned no results
No data_provenance_score column — Query against information_schema.columns for analyses table returned empty (only deviation_score, gate_flags, quality_verified columns exist)
No git commits — git log --all --grep='4bd2f9deaef8' returned no results, confirming no code was ever committed for this task
Quest shows 0/5 tasks done — quest_real_data_pipeline_spec.md line 17: "5 total (0 done, 5 open)"
No provenance checks in post_process.py — grep for synthetic/simulated data/hallucination markers returned no matches
Attribution
No commits attributed — work was never done.
Notes
The task requires building a validator that:
- Checks debate transcripts and notebooks for real dataset references
- Flags analyses with synthetic/simulated data markers
- Adds data_provenance_score (0-1) to analyses table
- Shows metrics on Senate dashboard
- Runs retroactively on 395 existing analyses
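The migration plus retroactive backfill a follow-up task would need could look like the sketch below. It uses sqlite3 purely so the example is self-contained; the real analyses table is assumed to live elsewhere (the verification queried Postgres's information_schema), and the schema is hypothetical:

```python
# Hypothetical migration + backfill sketch. Table/column names come from
# the task; the sqlite3 backend and schema are illustrative assumptions.
import sqlite3

def migrate_and_backfill(conn, score_fn):
    cur = conn.cursor()
    # Migration: add the 0-1 provenance score column to analyses.
    cur.execute("ALTER TABLE analyses ADD COLUMN data_provenance_score REAL")
    # Retroactive pass: score every existing analysis transcript.
    rows = cur.execute("SELECT id, transcript FROM analyses").fetchall()
    for analysis_id, transcript in rows:
        conn.execute(
            "UPDATE analyses SET data_provenance_score = ? WHERE id = ?",
            (score_fn(transcript), analysis_id),
        )
    conn.commit()
```

Fetching all rows before updating avoids mutating the table mid-iteration; a production backfill over hundreds of analyses would likely batch the updates instead.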
This is a legitimate P4 Forge task that still needs implementation. A follow-up task should be created to actually build the data validation layer.