[Forge] Benchmark latency and output quality for 10 high-frequency scientific API tools (status: open)

Identify the 10 skills most frequently called in the last 30 days, using tool_calls. For each skill:
(1) run a representative test call with a standard neurodegeneration query;
(2) measure response latency in ms;
(3) score output completeness and accuracy (0–1) against a known reference;
(4) compare against the prior benchmark, if one exists in tool_health_log;
(5) INSERT a new row into tool_health_log with the results.
Flag tools with latency > 5 s (5000 ms) or quality < 0.7 for optimization. These results establish a baseline performance register.
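The five steps could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the task names only the tables tool_calls and tool_health_log, so every column name here (skill_id, called_at, latency_ms, quality, flagged), the use of SQLite, the helper create_demo_schema, and the keyword-overlap scorer are assumptions made for the example. A real run would substitute the actual schema and a proper reference-based scorer.

```python
import sqlite3
import time
from typing import Callable, Sequence

LATENCY_LIMIT_MS = 5000.0  # "latency > 5s" threshold from the task text
QUALITY_FLOOR = 0.7        # "quality < 0.7" threshold from the task text


def create_demo_schema(conn: sqlite3.Connection) -> None:
    """Hypothetical schema; the real tables may look different."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS tool_calls (
            id INTEGER PRIMARY KEY, skill_id TEXT, called_at TEXT);
        CREATE TABLE IF NOT EXISTS tool_health_log (
            id INTEGER PRIMARY KEY, skill_id TEXT,
            latency_ms REAL, quality REAL, flagged INTEGER);
        """
    )


def top_skills(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Most frequently called skills in the last 30 days."""
    rows = conn.execute(
        "SELECT skill_id, COUNT(*) AS n FROM tool_calls "
        "WHERE called_at >= datetime('now', '-30 days') "
        "GROUP BY skill_id ORDER BY n DESC, skill_id LIMIT ?",
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def keyword_score(output: str, reference_terms: Sequence[str]) -> float:
    """Placeholder scorer: fraction of reference terms found in the output."""
    if not reference_terms:
        return 0.0
    text = output.lower()
    hits = sum(1 for term in reference_terms if term.lower() in text)
    return hits / len(reference_terms)


def benchmark(conn: sqlite3.Connection, skill_id: str,
              call: Callable[[str], str], query: str,
              reference_terms: Sequence[str]) -> dict:
    """Steps (1)-(5) for one skill: call, time, score, compare, insert."""
    start = time.perf_counter()
    output = call(query)                                   # (1) test call
    latency_ms = (time.perf_counter() - start) * 1000.0    # (2) latency
    quality = keyword_score(output, reference_terms)       # (3) score
    prior = conn.execute(                                  # (4) prior row
        "SELECT latency_ms, quality FROM tool_health_log "
        "WHERE skill_id = ? ORDER BY id DESC LIMIT 1",
        (skill_id,),
    ).fetchone()
    flagged = latency_ms > LATENCY_LIMIT_MS or quality < QUALITY_FLOOR
    conn.execute(                                          # (5) new row
        "INSERT INTO tool_health_log (skill_id, latency_ms, quality, flagged) "
        "VALUES (?, ?, ?, ?)",
        (skill_id, latency_ms, quality, int(flagged)),
    )
    conn.commit()
    return {"skill_id": skill_id, "latency_ms": latency_ms,
            "quality": quality, "flagged": flagged, "prior": prior}
```

In use, one would loop over top_skills(conn) and pass each skill's real client function as call. A single timed call is noisy, so repeating each call several times and recording the median may give a more stable baseline.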
