SciDEX has 3,741 hypothesis_predictions: 3,719 pending, 9 confirmed, 5 falsified. Only 14 (0.37%) have been evaluated. Each prediction is a falsifiable claim tied to a hypothesis. Evaluating predictions against the literature demonstrates predictive validity, which underpins the platform's scientific credibility.
Infrastructure exists (status field, evidence_pmids). Missing: the evaluation pipeline.
What to do:
1. Start with predictions from hypotheses with composite_score >= 0.8 (88 hypotheses, highest signal-to-noise)
2. Generate search terms per prediction, query PubMed via paper_cache.search_papers()
3. Use an LLM to assess each paper's relevance and the direction of its evidence (supporting vs. contradicting)
4. Update hypothesis_predictions.status (confirmed/falsified) + add evidence PMIDs
5. Feed confirmed predictions back into hypothesis evidence_validation_score
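Steps 1-4 can be sketched as a single batch loop. This is a minimal sketch, not the pipeline itself: the dict keys (`id`, `text`, `hypothesis_score`), the `limit` parameter, and the `search`, `assess`, and `decide` callables are all hypothetical names chosen for illustration. In practice `search` would wrap paper_cache.search_papers(), whose exact signature is not given here, and `assess` would wrap the LLM call. Step 5 is left out because the scoring formula for evidence_validation_score is not specified.

```python
from typing import Callable, Iterable

MIN_COMPOSITE_SCORE = 0.8  # from step 1: start with high-scoring hypotheses


def evaluate_batch(
    predictions: Iterable[dict],
    search: Callable[[str], list],         # hypothetical wrapper around the paper search
    assess: Callable[[str, list], dict],   # hypothetical wrapper around the LLM assessment
    decide: Callable[[dict], str],         # maps an assessment to a status string
    limit: int = 50,                       # assumed batch size, matching the per-iteration goal
) -> list:
    """Return proposed updates for pending predictions; nothing is written to a database."""
    updates = []
    for pred in predictions:
        if len(updates) >= limit:
            break
        # Step 1: only pending predictions from high-scoring hypotheses.
        if pred["status"] != "pending" or pred["hypothesis_score"] < MIN_COMPOSITE_SCORE:
            continue
        # Step 2: search the literature (here the prediction text doubles as the query).
        papers = search(pred["text"])
        # Step 3: assess relevance and direction of the evidence.
        verdict = assess(pred["text"], papers)
        # Step 4: derive the new status and attach evidence PMIDs only on a status change.
        status = decide(verdict)
        pmids = sorted(set(verdict["pmids"])) if status != "pending" else []
        updates.append({"id": pred["id"], "status": status, "evidence_pmids": pmids})
    return updates
```

Passing the search, assessment, and decision steps in as callables keeps the loop testable with stubs before it is wired to the real paper cache and LLM.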
Confidence threshold: update status only if evidence strength >= 0.75. A confirmed status additionally requires 2+ independent PMIDs.
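The threshold rule can be written as a small pure function. This is a sketch under stated assumptions: the `Assessment` type and its field names are hypothetical, "independent" is approximated by counting distinct PMIDs, and requiring at least one PMID for a falsified status is an assumption, since only the minimum for confirmed is stated.

```python
from dataclasses import dataclass

STRENGTH_THRESHOLD = 0.75   # from the spec: minimum evidence strength to change status
MIN_CONFIRMING_PMIDS = 2    # from the spec: 2+ independent PMIDs for "confirmed"


@dataclass(frozen=True)
class Assessment:
    """Hypothetical container for the LLM's verdict on one prediction."""
    direction: str   # "supporting" or "contradicting"
    strength: float  # evidence strength in [0, 1]
    pmids: tuple     # PMIDs of the papers backing the verdict


def decide_status(a: Assessment) -> str:
    """Return 'confirmed', 'falsified', or 'pending' (meaning: leave unchanged)."""
    if a.strength < STRENGTH_THRESHOLD:
        return "pending"
    distinct = set(a.pmids)  # assumption: distinct PMIDs stand in for independence
    if a.direction == "supporting":
        return "confirmed" if len(distinct) >= MIN_CONFIRMING_PMIDS else "pending"
    if a.direction == "contradicting":
        # Assumption: a single strong contradicting paper is enough to falsify.
        return "falsified" if distinct else "pending"
    return "pending"  # unknown or mixed direction: do not update
```

Returning "pending" in every uncertain case means a weak or ambiguous assessment never overwrites an existing status.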
Success per iteration: >= 50 predictions evaluated. Total target: >= 500.
Read first: docs/planning/specs/quest_agora_prediction_evaluation_pipeline.md