SciDEX — Task: [Forge] Dedup scan every 6h

Run artifact_dedup_agent.run_full_scan() to generate new deduplication recommendations as content grows. Uses high thresholds to minimize false positives.

Completion Notes

Auto-release: recurring task had no work this cycle

Git Commits (6)

[Senate] Prioritization review run 72: 5 priority adjustments + 1 cancellation [task:dff08e77-5e49-4da5-a9ef-fbfe7de3dc17] (#1310)2026-04-28

[Forge] Fix artifact_dedup_agent PostgreSQL compatibility bugs; add pending_by_type to run_full_scan [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-20

[Forge] Add per-scan error handling to run_full_scan [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-17

[Forge] Optimize dedup scans: gene-grouped hypothesis + entity-grouped wiki [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-12

forge: dedup scan uses main DB + batch inserts + slug id fix [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-11

Spec File

Goal

Run artifact_dedup_agent.run_full_scan() on a recurring schedule to generate deduplication recommendations as content grows. Uses high thresholds to minimize false positives.

Approach

Run scan_hypothesis_duplicates (threshold=0.42, limit=200) — finds near-duplicate hypotheses across different analyses

Run scan_wiki_duplicates (threshold=0.60, limit=500) — finds similar wiki pages by title + entity overlap

Run scan_artifact_duplicates (threshold=0.85) — finds exact duplicates by content_hash

Log stats: created recommendations per type and total pending queue

Thresholds

Hypothesis similarity ≥ 0.42 (50% text + 30% target overlap + 20% concept overlap)
Wiki similarity ≥ 0.60 (50% title word overlap + 20% entity type match + 30% KG node match)
Artifact similarity = 1.0 (exact content_hash match only)

Acceptance Criteria

☐ All three scan functions complete without error

☐ Recommendations inserted into dedup_recommendations table

☐ Stats logged and returned (pending_by_type counts)

Dependencies

Requires postgresql://scidex (main database with full schema)

Work Log

2026-04-20 — Run (slot minimax:64)

Fixed three PostgreSQL compatibility bugs in artifact_dedup_agent.py:

- db.executemany() → cursor loop with individual cursor.execute() (psycopg has no executemany)
- GROUP_CONCAT → string_agg(id::text, ',') (PostgreSQL has no GROUP_CONCAT)
- json_set() → Python dict + json.dumps() (similarity_details is TEXT, not JSONB)

Fixed dedup_recommendations_id_seq sequence (was stuck at 1, table had rows up to id=2362)
All four scans now complete without error:

- hypotheses: 200 scanned (104 pairs checked), 0 new recommendations (all dupes already pending)
- wiki_pages: 209 candidate pairs examined, 0 new recommendations
- gaps: 500 gaps scanned, 0 new recommendations
- artifacts: 7 exact hash groups found, 0 new (all already pending)

Pending queue: 71 hypotheses + 202 wiki_pages + 12 gaps + 7 artifacts = 292 pending
Result: 0 new recommendations added this run (queue was already populated)

2026-04-11 — First run (slot 61)

Found DB_PATH was pointing to worktree-local stub PostgreSQL (empty)
Fixed DB_PATH to use postgresql://scidex
Benchmark: 336 hypothesis comparisons/sec, full scan ~59s
scan_artifact_duplicates: found 100 exact hash groups, created 0 recommendations (all already pending)
scan_wiki_duplicates: timed out at 30s — 17538 pages is too many for naive O(n²) scan
scan_hypothesis_duplicates: would take ~59s — acceptable
TODO: Optimize wiki dedup with indexed query or batch-by-prefix rather than O(n²) scan

Payload JSON

{
  "requirements": {
    "analysis": 3
  },
  "completion_shas": [
    "f6f2cd3b5f98bc3746ed0cf99bfd56862c7a282a"
  ],
  "completion_shas_checked_at": "2026-04-12T17:19:29.992468+00:00",
  "completion_shas_missing": [
    "30683bafd3f72119104eedcd61c83303cf6172c3",
    "af558f3aa94ddc4dec8fb60c3d6ffa6fc205445e"
  ],
  "_stall_skip_providers": [
    "glm"
  ]
}