[Senate] Rebuild theme S8: belief evolution & convergence metrics as a continuous process
Rebuild spec — follow docs/planning/specs/rebuild_theme_template_spec.md first.
Theme anchor
- Theme: S8 — Belief evolution, convergence metrics, replication tracking
- Full description: docs/design/retired_scripts_patterns.md → S8
Why this matters now
Deleted scripts under this theme include convergence_monitor.py,
convergence_metrics.py, belief_tracker.py, and
backfill_convergence_50.py. Open recurring tasks: [Senate] Knowledge
growth metrics snapshot (every-12h,
05b6876b-61a9-4a49-8881-17e8db81746c) and the hourly task
98c6c423-6d32-4090-ae73-6d6ddae3d155. Without this theme, the
site's longitudinal "are we getting closer to scientific truth?"
dashboard breaks.
Template fills
{{THEME_ID}} = S8
{{THEME_NAME}} = belief evolution + convergence metrics
{{LAYER}} = Senate
{{LAYER_SLUG}} = senate
{{THEME_SLUG}} = convergence
{{CADENCE}} = hourly for snapshots, daily for aggregates, weekly
for trend reports.
{{CORE_JUDGMENT}} = "given the time-series of hypothesis states
and evidence accumulations, is the system converging on specific
mechanisms, diverging into contradictions, or stalling? Which
disease-areas show the strongest convergence signal?"
{{GAP_PREDICATE}} = "has a snapshot been taken for time-bucket T
yet?" → simple if-not-exists insert.
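The gap predicate can be sketched as an idempotent insert. This is a minimal sketch, not the rebuild's actual schema: the table and column names (convergence_snapshot, kpi_id, time_bucket) are illustrative stand-ins, and SQLite substitutes for Postgres so the example is self-contained.

```python
import sqlite3

# Stand-in snapshot table; names are illustrative, not from the spec.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE convergence_snapshot (
        kpi_id      TEXT NOT NULL,
        time_bucket TEXT NOT NULL,   -- e.g. '2025-01-01T00' for an hourly bucket
        value       REAL,
        PRIMARY KEY (kpi_id, time_bucket)
    )
""")

def take_snapshot(kpi_id: str, time_bucket: str, value: float) -> bool:
    """Insert a snapshot iff none exists for (kpi_id, time_bucket).

    Returns True if a row was written, False if the bucket was already filled.
    """
    cur = conn.execute(
        "INSERT INTO convergence_snapshot (kpi_id, time_bucket, value) "
        "VALUES (?, ?, ?) ON CONFLICT (kpi_id, time_bucket) DO NOTHING",
        (kpi_id, time_bucket, value),
    )
    conn.commit()
    return cur.rowcount == 1

first = take_snapshot("hypotheses_active", "2025-01-01T00", 412.0)
second = take_snapshot("hypotheses_active", "2025-01-01T00", 999.0)  # same bucket: skipped
```

The composite primary key is what makes the "same time-bucket cannot produce two rows" acceptance criterion hold even under concurrent workers; the `ON CONFLICT DO NOTHING` form carries over to Postgres unchanged.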
Where LLMs are load-bearing
The snapshot-taking is deterministic SQL aggregation, NOT an LLM job.
LLMs come in for the analysis layer:
- Narrative generation: given snapshot deltas over the last week,
  an LLM writes "what changed this week and why?" for the dashboard.
- Convergence-signal detection: given a cluster of hypotheses about
  the same target, an LLM judges whether recent evidence has
  converged them, diverged them, or left them stalled.
- Replication cluster keys: rather than hash(target+relation+
  direction), an LLM rubric decides "are these two hypotheses
  testing the same underlying claim?", because surface forms
  differ ("TREM2 activates microglia" vs "microglial activation is
  driven by TREM2").
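The replication-clustering item can be sketched as a pairwise judge feeding a union-find. In this sketch the `same_claim` stub is a crude token-prefix overlap heuristic purely so the example runs; in the rebuild it would be the LLM rubric call, and the function/threshold names are assumptions.

```python
from itertools import combinations

def same_claim(h1: str, h2: str) -> bool:
    """Stand-in for the LLM rubric ("are these two hypotheses testing
    the same underlying claim?"). Stubbed with a token-prefix overlap
    so the sketch is runnable; the real judge is an LLM call."""
    key = lambda s: {w.lower()[:6] for w in s.split()}  # "microglia"/"microglial" collide
    return len(key(h1) & key(h2)) >= 2

def cluster(hypotheses: list[str]) -> list[set[str]]:
    """Union-find over pairwise judge verdicts → replication clusters."""
    parent = {h: h for h in hypotheses}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in combinations(hypotheses, 2):
        if same_claim(a, b):
            parent[find(a)] = find(b)
    groups: dict[str, set[str]] = {}
    for h in hypotheses:
        groups.setdefault(find(h), set()).add(h)
    return list(groups.values())

h1 = "TREM2 activates microglia"
h2 = "microglial activation is driven by TREM2"
h3 = "APOE4 impairs lipid transport"
clusters = cluster([h1, h2, h3])
```

Pairwise judging is O(n²) LLM calls; a real worker would likely pre-block candidates (e.g. by shared target) before invoking the judge.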
No hardcoded KPI lists
Scripts like backfill_convergence_50.py hardcoded "here are 50
metrics to compute". The rebuild defines KPIs in a convergence_kpi
table: (kpi_id, description, sql_expression, llm_rubric_id,
enabled). Operators add new KPIs via SQL; the snapshot worker picks
them up next cycle.
Each KPI has either a sql_expression (for deterministic metrics) or
an llm_rubric_id (for semantic metrics like "disease coverage"). No
code change is needed to add a KPI.
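One way the snapshot worker could dispatch on the registry is sketched below. The convergence_kpi columns come from the spec; everything else (the stand-in hypothesis table, sample rows, and the `score_rubric` placeholder) is assumed for illustration, with SQLite again standing in for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE convergence_kpi (
        kpi_id         TEXT PRIMARY KEY,
        description    TEXT,
        sql_expression TEXT,   -- set for deterministic metrics
        llm_rubric_id  TEXT,   -- set for semantic metrics
        enabled        INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE hypothesis (status TEXT);  -- illustrative source table
    INSERT INTO hypothesis VALUES ('active'), ('active'), ('retired');
    INSERT INTO convergence_kpi VALUES
        ('active_hypotheses', 'count of live hypotheses',
         'SELECT COUNT(*) FROM hypothesis WHERE status = ''active''', NULL, 1),
        ('disease_coverage', 'semantic coverage score', NULL, 'rubric-cov-1', 1),
        ('old_metric', 'disabled example', 'SELECT 0', NULL, 0);
""")

def score_rubric(rubric_id: str) -> float:
    # Placeholder for the LLM-rubric scorer; a constant so the sketch runs.
    return 0.7

def compute_enabled_kpis(conn) -> dict[str, float]:
    """One snapshot cycle: run each enabled KPI via SQL or LLM rubric."""
    values = {}
    rows = conn.execute(
        "SELECT kpi_id, sql_expression, llm_rubric_id "
        "FROM convergence_kpi WHERE enabled = 1"
    )
    for kpi_id, sql_expr, rubric_id in rows:
        if sql_expr is not None:
            values[kpi_id] = float(conn.execute(sql_expr).fetchone()[0])
        else:
            values[kpi_id] = score_rubric(rubric_id)
    return values

kpi_values = compute_enabled_kpis(conn)
```

Because the worker reads the registry each cycle, an operator's `INSERT INTO convergence_kpi ...` is picked up automatically, which is exactly the "no code change to add a KPI" property.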
Outcome feedback
- KPI informativeness: if a KPI's value is flat for > 30d, it's not
tracking anything useful — propose disable.
- Narrative quality: LLM narratives are self-scored for grounding
(did each claim cite a snapshot-delta?); low-grounded narratives
downgrade the narrative rubric.
- Replication-cluster confusions: if an LLM-generated cluster is
later split by an operator, the clustering rubric downgrades
confidence for similar shapes.
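The KPI-informativeness check from the first bullet might look like this minimal sketch; the window size and tolerance are assumed knobs, not values from the spec.

```python
def is_flat(daily_values: list[float], min_points: int = 30,
            tol: float = 1e-9) -> bool:
    """Flag a KPI whose daily values haven't moved over the window
    (the "flat for > 30d → propose disable" heuristic). Window size
    and tolerance are illustrative defaults."""
    if len(daily_values) < min_points:
        return False  # not enough history to judge
    window = daily_values[-min_points:]
    return max(window) - min(window) <= tol

flat = is_flat([5.0] * 45)            # unchanged for 45 days
short = is_flat([5.0] * 10)           # too little history
moving = is_flat([float(i) for i in range(45)])
```

The feedback loop would emit a "propose disable" event rather than flipping `enabled` directly, leaving the final call to an operator.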
Acceptance
All template criteria, plus:
☐ KPI registry in PG, not hardcoded list.
☐ Snapshot idempotency: same time-bucket cannot produce two rows.
☐ Narrative generator grounds every claim in a specific
snapshot-delta, auditable.
☐ Replication clustering uses LLM judge, not string hash.
☐ Recurring tasks 05b6876b- and 98c6c423- reassigned.