[Forge] Build data validation layer — verify analyses cite real datasets


Quest: Real Data Pipeline
Priority: P4
Status: open

Goal

Create a post-analysis validation step that verifies all data citations in debate transcripts and notebooks reference real, verifiable datasets. Flag analyses that appear to use hallucinated or simulated data. Add a 'data_provenance_score' to each analysis.

Acceptance Criteria

☐ Post-analysis validator checks for real dataset references
☐ Analyses flagged if they contain synthetic/simulated data markers
☐ data_provenance_score (0-1) added to analyses table
☐ Senate dashboard shows data provenance metrics across all analyses
☐ Validator can be run retroactively on existing analyses
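The score column implies a schema change. A minimal migration sketch, assuming the analyses table lives in PostgreSQL (the verification section below queries information_schema); the CHECK constraint is an added assumption enforcing the 0-1 range from the criteria:

```sql
-- Hypothetical migration: add the 0-1 provenance score to analyses.
-- NULL means "not yet validated"; the CHECK keeps stored scores in range.
ALTER TABLE analyses
    ADD COLUMN data_provenance_score REAL
        CHECK (data_provenance_score >= 0 AND data_provenance_score <= 1);
```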

Approach

  • Define data provenance markers: real dataset IDs, Allen catalog references, cell counts that match real data
  • Define anti-markers: 'simulated', 'synthetic', round-number cell counts, placeholder gene lists
  • Write validate_data_provenance.py that scores each analysis
  • Add migration for data_provenance_score column on analyses table
  • Integrate into post_process.py pipeline
  • Run retroactively on existing 47 analyses
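The marker/anti-marker scoring could look like the sketch below. The marker lists and the scoring rule are illustrative assumptions, not the final heuristics of validate_data_provenance.py:

```python
"""Sketch of the scoring core of a hypothetical validate_data_provenance.py."""
import re

# Signals that an analysis cites real data (assumed examples).
REAL_MARKERS = [
    r"\ballen\b",            # Allen catalog references
    r"\bdataset[_ ]?id\b",   # explicit dataset identifiers
]

# Signals that data may be synthetic or placeholder.
ANTI_MARKERS = [
    r"\bsimulated\b",
    r"\bsynthetic\b",
    r"\bplaceholder\b",
]

def data_provenance_score(text: str) -> float:
    """Score a transcript/notebook: 1.0 looks real, 0.0 looks fabricated."""
    t = text.lower()
    real = sum(bool(re.search(p, t)) for p in REAL_MARKERS)
    anti = sum(bool(re.search(p, t)) for p in ANTI_MARKERS)
    if real + anti == 0:
        return 0.5  # no evidence either way
    return real / (real + anti)
```

Under this rule, a transcript that mentions only "simulated cells" scores 0.0 and gets flagged, while one citing an Allen dataset ID with no anti-markers scores 1.0.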
Dependencies

_Identify during implementation._

Dependents

_Identify during implementation._

Work Log

_No entries yet._

Verification — 2026-04-20T21:45:00Z

Result: FAIL
Verified by: MiniMax-M2 via task 4bd2f9deaef8

Tests run

| Target | Command | Expected | Actual | Pass? |
| --- | --- | --- | --- | --- |
| validate_data_provenance.py exists | Glob `*/validate_data_provenance` | File found | No file found | No |
| data_provenance_score column | `SELECT column_name FROM information_schema.columns WHERE table_name = 'analyses' AND column_name LIKE '%provenance%'` | Column exists | Column not found | No |
| Task ID in git history | `git log --all --grep='4bd2f9deaef8'` | Commit found | No commits found | No |
| Real Data Pipeline quest status | Read quest spec | Task marked done | Status: 0 done, 5 open | No |
| Data provenance code in post_process | `grep -r 'data_provenance\|synthetic.*data\|hallucinat' post_process.py` | Code found | No matches | No |
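The failed checks can be re-run locally with a script along these lines. Paths assume the repo root, and the grep alternation is an assumption reconstructed from the pattern shown in the table:

```shell
# Hypothetical re-verification script; each line mirrors a row of the
# table and prints the recorded failure message when the check still
# finds nothing.
ls */validate_data_provenance* 2>/dev/null || echo "No file found"
git log --all --grep='4bd2f9deaef8' --oneline 2>/dev/null | grep . || echo "No commits found"
grep -nE 'data_provenance|synthetic.*data|hallucinat' post_process.py 2>/dev/null || echo "No matches"
```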

Evidence

The data validation layer described in this task was never implemented:

  • No validate_data_provenance.py script — Glob search returned no results
  • No data_provenance_score column — Query against information_schema.columns for the analyses table returned empty (only deviation_score, gate_flags, quality_verified columns exist)
  • No git commits — git log --all --grep='4bd2f9deaef8' returned no results, confirming no code was ever committed for this task
  • Quest shows 0/5 tasks done — quest_real_data_pipeline_spec.md line 17: "5 total (0 done, 5 open)"
  • No provenance checks in post_process.py — grep for synthetic/simulated data/hallucination markers returned no matches

Attribution

No commits attributed — work was never done.

Notes

The task requires building a validator that:

  • Checks debate transcripts and notebooks for real dataset references
  • Flags analyses with synthetic/simulated data markers
  • Adds data_provenance_score (0-1) to analyses table
  • Shows metrics on Senate dashboard
  • Runs retroactively on 395 existing analyses

This is a legitimate P4 Forge task that still needs implementation. A follow-up task should be created to actually build the data validation layer.

Tasks using this spec (1)

[Forge] Build data validation layer — verify analyses cite real datasets
File: 4bd2f9deaef8_forge_build_data_validation_layer_verif_spec.md
Modified: 2026-04-25 23:40
Size: 3.7 KB