## Root Cause
`orchestra/abandonment_watchdog.py` lines 535-541 add a provider to
`_stall_skip_providers` WITHOUT adding a `_stall_skip_at` timestamp.
`prune_expired_stall_skips()` (called with `treat_legacy_as_expired=True`)
removes any entry that has no timestamp, so the GLM ban is pruned
immediately and GLM reclaims the task on the next tick.
This caused task `80ffb77b-8391-493c-8644-37086c8e2e3c` (quest engine CI)
to be abandoned 29 times with `rate_limit_retries_exhausted:glm`.
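
The interaction can be reproduced with a small stand-in for the pruning logic. The sketch below is not the real `prune_expired_stall_skips()`: the function name `prune_sketch`, its parameters, and the one-hour TTL are assumptions chosen for illustration. It only shows why an entry without a timestamp is dropped when legacy entries are treated as expired.

```python
from datetime import datetime, timedelta, timezone


def prune_sketch(payload, now, ttl=timedelta(hours=1), treat_legacy_as_expired=True):
    """Hypothetical stand-in for prune_expired_stall_skips(); not the real code."""
    skip_at = payload.get("_stall_skip_at") or {}
    kept = []
    for provider in payload.get("_stall_skip_providers") or []:
        stamp = skip_at.get(provider)
        if stamp is None:
            # Legacy entry: no timestamp was recorded for this provider.
            if not treat_legacy_as_expired:
                kept.append(provider)
            continue
        # Timestamped entry: keep it only while it is younger than the TTL.
        if now - datetime.fromisoformat(stamp) < ttl:
            kept.append(provider)
    payload["_stall_skip_providers"] = kept
    payload["_stall_skip_at"] = {p: skip_at[p] for p in kept}
    return payload


now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary fixed time for the demo

# What the buggy code wrote: provider listed, no timestamp.
buggy = prune_sketch({"_stall_skip_providers": ["glm"]}, now)

# What the fixed code writes: provider listed with a fresh timestamp.
fixed = prune_sketch(
    {"_stall_skip_providers": ["glm"], "_stall_skip_at": {"glm": now.isoformat()}},
    now,
)
```

Under these assumptions the `buggy` payload loses its `glm` entry on the first prune, while the `fixed` payload keeps it until the TTL elapses.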
## Fix
In `/home/ubuntu/Orchestra/orchestra/abandonment_watchdog.py`, change the
`rebalance_stuck_tasks` function to also update `_stall_skip_at`:
```python
# BEFORE (lines 535-541):
skip_list = list(payload.get("_stall_skip_providers") or [])
added_providers = []
for p in providers:
    if p not in skip_list:
        skip_list.append(p)
        added_providers.append(p)
payload["_stall_skip_providers"] = skip_list

# AFTER:
skip_list = list(payload.get("_stall_skip_providers") or [])
skip_at = payload.get("_stall_skip_at") or {}
if not isinstance(skip_at, dict):
    skip_at = {}
added_providers = []
for p in providers:
    if p not in skip_list:
        skip_list.append(p)
        added_providers.append(p)
    # Refresh timestamp so prune_expired_stall_skips keeps this entry alive.
    skip_at[p] = now_iso
payload["_stall_skip_providers"] = skip_list
payload["_stall_skip_at"] = skip_at
```
Also update the `changed` check to include timestamp refreshes:
```python
changed = bool(added_providers) or bool(skip_at) or bool(row["slot_affinity"]) or bool(row["assigned_slot"])
```
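
To exercise the change in isolation, the AFTER logic can be pulled into a small helper. This is a sketch, not code from the repository: the helper name `add_stall_skips` and its signature are hypothetical, and it assumes `now_iso` is an ISO-8601 string supplied by the caller.

```python
from datetime import datetime, timezone


def add_stall_skips(payload, providers, now_iso):
    """Hypothetical helper mirroring the AFTER logic; returns newly added providers."""
    skip_list = list(payload.get("_stall_skip_providers") or [])
    skip_at = payload.get("_stall_skip_at") or {}
    if not isinstance(skip_at, dict):
        skip_at = {}
    added = []
    for p in providers:
        if p not in skip_list:
            skip_list.append(p)
            added.append(p)
        # Refresh the timestamp for every provider, new or already listed.
        skip_at[p] = now_iso
    payload["_stall_skip_providers"] = skip_list
    payload["_stall_skip_at"] = skip_at
    return added


# A legacy payload: "glm" is listed but has no timestamp yet.
payload = {"_stall_skip_providers": ["glm"]}
now_iso = datetime.now(timezone.utc).isoformat()
added = add_stall_skips(payload, ["glm", "minimax"], now_iso)
```

Here `added` contains only `minimax`, yet `glm` also receives a timestamp. That is the case the `changed` check has to cover: the payload was modified even though `glm` was already in the skip list.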
## After the code fix
1. Commit to the Orchestra repo with a clear message
2. Restart the Orchestra supervisor so the fix takes effect
3. Reset the quest engine task: `orchestra reset 80ffb77b-8391-493c-8644-37086c8e2e3c --project SciDEX`
4. Change the quest engine task's provider from `any` to one that excludes GLM: use the Python Orchestra services to set `provider = 'minimax'` on task 80ffb77b, so the task is never routed to GLM again
## Verification
After the fix:
- The next time the abandonment watchdog runs for any rate-limited task, check that `_stall_skip_at` is populated in the payload
- Confirm the quest engine task (80ffb77b) is picked up by a non-GLM provider and completes successfully
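
The first check can be automated once a task payload is available as a dict. The sketch below assumes the payload has already been loaded (how it is fetched from Orchestra is out of scope here) and uses a hypothetical helper name, `providers_missing_timestamp`. The example payloads are hand-written for illustration, not captured from a real task.

```python
import json


def providers_missing_timestamp(payload):
    """Return skipped providers that have no entry in _stall_skip_at."""
    skip_at = payload.get("_stall_skip_at")
    if not isinstance(skip_at, dict):
        skip_at = {}
    providers = payload.get("_stall_skip_providers") or []
    return [p for p in providers if not skip_at.get(p)]


# Illustrative payloads only.
before_fix = json.loads('{"_stall_skip_providers": ["glm"]}')
after_fix = json.loads(
    '{"_stall_skip_providers": ["glm"],'
    ' "_stall_skip_at": {"glm": "2024-01-01T00:00:00+00:00"}}'
)

missing_before = providers_missing_timestamp(before_fix)
missing_after = providers_missing_timestamp(after_fix)
```

An empty result means every skipped provider carries a timestamp, which is the state the fix is expected to produce.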
## Completion Notes
The fix was pushed to branch `orchestra/task/fbe73f92-abandonment-watchdog-missing-stall-skip` (ec579d39b). It has not yet been merged into `origin/main`; the branch is ahead of it. The post-fix operational steps (supervisor restart, quest engine task reset, provider update to `'minimax'`) could not be executed because sudo and DB access were unavailable; those steps are operator responsibilities. The code fix was verified: `skip_at` is now populated with `now_iso` for every provider in the loop, and the `changed` check includes `skip_at` refreshes.