[Forge] Dedup scan every 6h (open, analysis:3)

Run artifact_dedup_agent.run_full_scan() to generate new deduplication recommendations as content grows. The scan uses high thresholds to minimize false positives.

Completion Notes

Auto-release: recurring task had no work this cycle

Git Commits (6)

[Forge] Fix artifact_dedup_agent PostgreSQL compatibility bugs; add pending_by_type to run_full_scan [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-20
[Forge] Fix artifact_dedup_agent PostgreSQL compatibility bugs; add pending_by_type to run_full_scan [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-20
[Forge] Fix artifact_dedup_agent PostgreSQL compatibility bugs; add pending_by_type to run_full_scan [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-20
[Forge] Add per-scan error handling to run_full_scan [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-17
[Forge] Optimize dedup scans: gene-grouped hypothesis + entity-grouped wiki [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-12
forge: dedup scan uses main DB + batch inserts + slug id fix [task:7ffcac76-07ae-4f9b-a5ae-40a531d8da09]2026-04-11

Spec File

Goal

Run artifact_dedup_agent.run_full_scan() on a recurring schedule to generate deduplication recommendations as content grows. The scan uses high thresholds to minimize false positives.

Approach

  • Run scan_hypothesis_duplicates (threshold=0.42, limit=200) — finds near-duplicate hypotheses across different analyses
  • Run scan_wiki_duplicates (threshold=0.60, limit=500) — finds similar wiki pages by title + entity overlap
  • Run scan_artifact_duplicates (threshold=0.85) — finds exact duplicates by content_hash
  • Log stats: created recommendations per type and total pending queue

Thresholds

  • Hypothesis similarity ≥ 0.42 (50% text + 30% target overlap + 20% concept overlap)
  • Wiki similarity ≥ 0.60 (50% title word overlap + 20% entity type match + 30% KG node match)
  • Artifact similarity = 1.0 (exact content_hash match only)
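
The weighted thresholds above can be illustrated with a minimal sketch. The weights and cutoffs are taken from this spec; the function name, dictionary keys, and the assumption that every component score lies in [0, 1] are hypothetical and may not match how artifact_dedup_agent.py computes its scores.

```python
# Weights and thresholds copied from the spec; key names are illustrative only.
HYPOTHESIS_WEIGHTS = {"text": 0.5, "target": 0.3, "concept": 0.2}
WIKI_WEIGHTS = {"title": 0.5, "entity_type": 0.2, "kg_node": 0.3}
HYPOTHESIS_THRESHOLD = 0.42
WIKI_THRESHOLD = 0.60


def weighted_score(components: dict, weights: dict) -> float:
    """Combine per-component similarities (assumed to be in [0, 1]) into one score.

    Components missing from the input are treated as 0.0.
    """
    return sum(weight * components.get(name, 0.0) for name, weight in weights.items())


# A perfect title match alone scores 0.5, which stays below the wiki cutoff.
title_only = weighted_score({"title": 1.0}, WIKI_WEIGHTS)
print(title_only >= WIKI_THRESHOLD)

# A perfect text match alone scores 0.5, which clears the hypothesis cutoff.
text_only = weighted_score({"text": 1.0}, HYPOTHESIS_WEIGHTS)
print(text_only >= HYPOTHESIS_THRESHOLD)
```

Under these weights a wiki pair needs supporting evidence beyond the title (an entity type match or a KG node match) to reach 0.60, whereas a hypothesis pair can qualify on text similarity alone.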

    Acceptance Criteria

    ☐ All three scan functions complete without error
    ☐ Recommendations inserted into dedup_recommendations table
    ☐ Stats logged and returned (pending_by_type counts)

    Dependencies

    Requires postgresql://scidex (main database with full schema)

    Work Log

    2026-04-20 — Run (slot minimax:64)

    • Fixed three PostgreSQL compatibility bugs in artifact_dedup_agent.py:
    - db.executemany() → cursor loop with individual cursor.execute() calls (psycopg connections expose no executemany(); only cursors do)
    - GROUP_CONCAT → string_agg(id::text, ',') (PostgreSQL has no GROUP_CONCAT)
    - json_set() → Python dict + json.dumps() (similarity_details is TEXT, not JSONB)
    • Fixed dedup_recommendations_id_seq sequence (was stuck at 1, table had rows up to id=2362)
    • All four scans now complete without error:
    - hypotheses: 200 scanned (104 pairs checked), 0 new recommendations (all dupes already pending)
    - wiki_pages: 209 candidate pairs examined, 0 new recommendations
    - gaps: 500 gaps scanned, 0 new recommendations
    - artifacts: 7 exact hash groups found, 0 new (all already pending)
    • Pending queue: 71 hypotheses + 202 wiki_pages + 12 gaps + 7 artifacts = 292 pending
    • Result: 0 new recommendations added this run (queue was already populated)
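
The json_set() replacement noted in this entry can be sketched as follows: because similarity_details is a TEXT column, the value is parsed in Python, updated as a dict, and serialized back. This is only an illustration; the helper name update_details and the assumption that the column holds a JSON object (or is empty) are hypothetical, not taken from artifact_dedup_agent.py.

```python
import json
from typing import Optional


def update_details(details_text: Optional[str], updates: dict) -> str:
    """Merge `updates` into a JSON object stored as TEXT and return the new TEXT.

    Assumes the stored value is either empty/NULL or a JSON object.
    """
    details = json.loads(details_text) if details_text else {}
    details.update(updates)
    return json.dumps(details)


# Example: add a field to an existing details blob.
new_text = update_details('{"score": 0.9}', {"reviewed": True})
print(new_text)
```

The returned string would then be written back with an ordinary parameterized UPDATE, which avoids relying on any database-side JSON function.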

    2026-04-11 — First run (slot 61)

    • Found DB_PATH was pointing to worktree-local stub PostgreSQL (empty)
    • Fixed DB_PATH to use postgresql://scidex
    • Benchmark: 336 hypothesis comparisons/sec, full scan ~59s
    • scan_artifact_duplicates: found 100 exact hash groups, created 0 recommendations (all already pending)
    • scan_wiki_duplicates: timed out at 30s — 17538 pages is too many for naive O(n²) scan
    • scan_hypothesis_duplicates: would take ~59s — acceptable
    • TODO: Optimize wiki dedup with indexed query or batch-by-prefix rather than O(n²) scan
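
For scale, a naive pairwise scan over 17538 pages means 17538 × 17537 / 2 = 153,781,953 comparisons. One common way to cut this down is blocking: group pages by a shared key and compare only within each group. The sketch below illustrates that idea under stated assumptions; the function name candidate_pairs, the page dictionaries, and the "entity" key are hypothetical, and the optimization actually committed later may work differently.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(pages, key):
    """Yield page pairs that share the same blocking key.

    Pages whose keys differ are never compared, so the number of comparisons
    depends on group sizes rather than on the total page count squared.
    """
    groups = defaultdict(list)
    for page in pages:
        groups[key(page)].append(page)
    for members in groups.values():
        yield from combinations(members, 2)


# Illustrative data only: two pages share an entity, one does not.
pages = [
    {"slug": "a", "entity": "TP53"},
    {"slug": "b", "entity": "TP53"},
    {"slug": "c", "entity": "BRCA1"},
]
pairs = [(x["slug"], y["slug"]) for x, y in candidate_pairs(pages, key=lambda p: p["entity"])]
print(pairs)
```

The trade-off is recall: duplicates that land in different groups are missed, so the choice of blocking key matters as much as the similarity threshold.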

    Payload JSON
    {
      "requirements": {
        "analysis": 3
      },
      "completion_shas": [
        "f6f2cd3b5f98bc3746ed0cf99bfd56862c7a282a"
      ],
      "completion_shas_checked_at": "2026-04-12T17:19:29.992468+00:00",
      "completion_shas_missing": [
        "30683bafd3f72119104eedcd61c83303cf6172c3",
        "af558f3aa94ddc4dec8fb60c3d6ffa6fc205445e"
      ],
      "_stall_skip_providers": [
        "glm"
      ]
    }
