Usage-Refresh Deadlock & Auto-Swap Fixes

Status: complete
Date: 2026-06-10
Branch: master
Type: plan (implementation record)

Problem

Stuck server. A usage refresh could wedge the entire server until restart: account rows froze at "checking", the single-account refresh endpoints never returned, and the background token-refresh and heal sweeps silently died. The dashboard's refresh buttons locked out forever because the frontend fetch had no timeout either.

Auto-swap shortfalls. Independently, the swap algorithm was stranding capacity (an active T0 with a long runway blocked an earlier-expiring T0; the fixed T1 target of 90 stranded the buffer when expiry landed outside working hours; the burn-floor check stranded the final slice of every account) and churning (id-keyed emergence streaks reset each other; asymmetric hysteresis fired premature drains; failed swaps retried every tick; manual switches were silently reverted within minutes). The decision log recorded every blocked swap as a generic tick, making any of this invisible.

Root Cause & Evidence Chain

Verified against the code, reproduced in tests/unit/test_refresh_deadlock.py:

  1. _refresh_token_flow(PRIMARY) in jacked/web/auth.py ran the post-refresh fetch_profile while still holding the per-account primary refresh lock.
  2. fetch_profile's 401 recovery path re-enters _refresh_token_flow on the same non-reentrant asyncio.Lock — the coroutine awaited its own lock: self-deadlock, lock never released.
  3. Compounding: refresh_all_expiring_tokens and heal_invalid_accounts wrapped refresh_account_token in async with lock: — the flow acquires that same lock internally, so those paths self-deadlocked even without the profile-fetch trigger.
  4. The lock acquire was unbounded, so every later refresh / 401-recovery / validate path for that account suspended forever behind the wedged holder.
  5. main.py's _token_refresh_loop and _heal_sweep_loop awaited the wedged pass with no timeout — both loops dead until server restart.
  6. HTTP routes (/refresh-usage, /refresh-token, /validate) awaited the same wedged coroutines unbounded — endless "checking" spinner; the frontend fetch() had no timeout, so its in-flight flags (_singleRefreshInFlight, _usageRefreshInProgress) stuck until page reload.
flowchart TD
    A[refresh_account_token] --> B["_refresh_token_flow PRIMARY<br/>acquires per-account lock"]
    B --> C[token exchange OK]
    C --> D["fetch_profile<br/>WHILE HOLDING LOCK"]
    D -->|401| E["recovery: re-enter<br/>_refresh_token_flow"]
    E -->|"acquire same<br/>non-reentrant lock"| B
    E -. wedged forever .-> F["every later refresh /<br/>heal / validate for this<br/>account blocks unbounded"]
    F --> G["main.py sweep loops +<br/>HTTP routes await forever"]
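
The re-entry in step 2 can be reproduced in isolation. The sketch below is not the code in jacked/web/auth.py; refresh_flow is a hypothetical stand-in that holds an asyncio.Lock across a simulated 401 recovery. The acquire is bounded at 0.1s only so the demo terminates; with the original unbounded acquire the inner call would suspend forever.

```python
import asyncio


async def refresh_flow(lock: asyncio.Lock, depth: int = 0) -> str:
    """Hypothetical stand-in for the pre-fix flow: the lock is held across the profile fetch."""
    try:
        # Bounded only so this demo ends; the pre-fix acquire had no timeout.
        await asyncio.wait_for(lock.acquire(), timeout=0.1)
    except asyncio.TimeoutError:
        return "lock_timeout"
    try:
        if depth == 0:
            # Simulated 401 recovery: re-enter the flow while still holding the lock.
            return await refresh_flow(lock, depth=1)
        return "refreshed"
    finally:
        lock.release()


def demo() -> str:
    async def main() -> str:
        return await refresh_flow(asyncio.Lock())

    return asyncio.run(main())


print(demo())  # lock_timeout
```

asyncio.Lock is not reentrant, so the inner acquire can never succeed while the outer frame holds the lock; the timeout is the only thing that lets the outer frame reach its finally block.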

Fix Inventory

Bundle A — Deadlock / stuck server

Fix | Where
Lock ownership consolidated into _refresh_token_flow; bounded acquire wait_for(lock.acquire(), 60) → lock_timeout error instead of eternal suspension; explicit try/finally lock.release(). | jacked/web/auth.py
Post-refresh fetch_profile moved OUTSIDE the lock (step 4l); its 401 path re-enters the flow on the same lock (the root cause). | jacked/web/auth.py
refresh_all_expiring_tokens: lock.locked() peek for fast-skip only; per-account wait_for(60); exceptions counted as failed, never propagate. | jacked/web/auth.py
heal_invalid_accounts: calls _refresh_token_flow(PRIMARY) directly (flow owns locking; also fixes the phantom heal where should_refresh reported success without exchanging); heal validation bounded 60s. | jacked/web/auth.py
Sweep loops self-heal: asyncio.wait_for(..., SWEEP_PASS_TIMEOUT=600) around each refresh/heal pass; cancellation releases any held lock and the loop continues at the next interval. | jacked/api/main.py
Route-level bounds → deterministic 504 (REFRESH_TIMEOUT/VALIDATE_TIMEOUT): refresh-token 45s, refresh-usage 60s (also resets validation_status to "unknown" so the row leaves "checking"), validate 60s. Bulk-refresh 429 envelope standardized; _bulk_refresh_acquired_at reset in finally. | jacked/api/routes/auth.py
Poll-loop fetch bounds: active fetch_usage 50s (tick continues on cached data), candidate fetches 30s + catch-all, prime fetches 30s; single-account installs no longer re-prime forever (attempted == 0 counts as done). | jacked/api/usage_monitor.py
Frontend: AbortController default 60s API timeout (300s for bulk refresh); _singleRefreshInFlight is now a Map with a 240s stale backstop so an orphaned refresh can't lock out the button forever. | jacked/data/web/js/app.js, account-actions.js

Bundle B — Churn

Fix | Detail
Min-residency 900s | Proactive departures (emerged / intra-tier / burn-rate) blocked until the active account has held the slot 900s since the last committed swap or manual switch. Drained + 5h-critical are forced departures and are exempt. Trigger: residency_blocked.
Manual-switch protection | use_account calls note_external_swap() (arms cooldown + residency, clears emergence streak) and pauses the sweep 15 min, so auto-swap can no longer silently revert the user's choice.
RESET_SUPPRESS_MINUTES = 30 | Shared with selection's _FIVE_H_HEADROOM_RESET_MIN. Invariant: lookahead <= suppression, else an account admitted on an imminent 5h reset is immediately ejected by 5h-critical (deterministic ping-pong).
Symmetric tier hysteresis | Single-step less-urgent flips (T0→T1) now damped within 5 min of the boundary; previously, refetch noise fired a premature drained (target 100→90) plus pull-back swap. Multi-step jumps flip immediately (genuine window reset).
Tier-keyed emergence persistence | _emerged_tier_streak keyed by TIER, not account id; two near-tied same-tier candidates alternating as best used to reset each other's id-keyed streak forever. Fast path: a >=2-tier gap (T0 best vs T2+ active) skips persistence.
Swap-failure backoff | Failed credential writes retry on exponential backoff (60s × 2^(n−1), cap 600s) instead of every tick; burn-rate state preserved on failure; committed swap resets the count.
Burn-rate decay ungated | Decays after 5+ unchanged ticks at ANY usage level; the below-warning gate had frozen the rate exactly when the user went idle at >=80%, firing spurious burn-rate swaps.
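
The swap-failure backoff schedule (60s × 2^(n−1), cap 600s) works out as below. The function name and the choice to return 0 when no failures are recorded are assumptions for illustration; only the base, the doubling, and the cap come from the table.

```python
def swap_retry_delay(failures: int, base: float = 60.0, cap: float = 600.0) -> float:
    """Seconds to wait before the next swap attempt after `failures` consecutive failures."""
    if failures <= 0:
        # No failures recorded, e.g. right after a committed swap reset the count.
        return 0.0
    return min(cap, base * 2 ** (failures - 1))


print([swap_retry_delay(n) for n in range(6)])
# [0.0, 60.0, 120.0, 240.0, 480.0, 600.0]
```

The fifth consecutive failure would compute 960s, so the cap takes effect from that point on.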

Bundle C — Utilization

Fix | Detail
Quiet-hours executor guard REMOVED | Swaps execute 24/7. A credential swap costs nothing while idle and pre-positions the fleet for morning: 5h windows open on the first API call, not on credential placement. Dead active_start/active_end params dropped from _execute_swap.
Intra-T0 preemption (rule 1b) | An earlier-expiring T0 candidate with deficit >= 5% and loss-rate >= 1.5× the active T0's preempts it. The largest stranding mechanism observed; margin gates prevent ping-pong. Trigger: intra_tier_preempted.
Deadline-aware T1 target | max(90, 100 − min(10, achievable_final_day_burn)); a fixed 90 strands the buffer when expiry lands outside working hours (e.g. 03:00 local). Floor 90.
T0 headroom-floor bypass | T0 candidates with deficit >= 1 skip has_viable_headroom: drain-to-100 must not strand the final slice below the ~4.2% burn-per-window floor.
Stale-saturation recovery | _has_5h_headroom treats a PAST 5h reset as headroom; _candidate_staleness_override forces a refetch on past-reset-with-saturated-cache or a 7d window reset, which removes the 5-10 min blind spot right after a reset.
T0 poll clamp | Interval <=90s whenever an eligible non-active T0 with deficit exists; <=60s when a saturated T0's 5h reset lands within the next interval (be on time for the reset).
Damped-tier deficit + numeric sort | Selection computes deficit against the hysteresis-damped tier via target_for_tier (a held-at-T1 account is admitted with T1's target, never T0's); _SortKey uses UTC-epoch reset times (lex string sort diverges chronologically for mixed ISO formats).
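
Two of the rows above are easy to check numerically: the deadline-aware T1 target and the lexical-versus-chronological sort divergence. The helper names t1_target and reset_epoch are hypothetical, and the sketch assumes every reset timestamp carries an explicit UTC offset or a trailing Z.

```python
from datetime import datetime


def t1_target(achievable_final_day_burn: float) -> float:
    """Deadline-aware T1 target: max(90, 100 - min(10, achievable_final_day_burn))."""
    return max(90.0, 100.0 - min(10.0, achievable_final_day_burn))


def reset_epoch(iso: str) -> float:
    """UTC epoch seconds for an offset-aware ISO-8601 timestamp."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11, so normalize it.
    return datetime.fromisoformat(iso.replace("Z", "+00:00")).timestamp()


print([t1_target(b) for b in (0, 4, 25)])  # [100.0, 96.0, 90.0]

resets = ["2026-06-10T09:00:00Z", "2026-06-10T10:00:00+02:00"]
print(sorted(resets))                   # lexical: the 09:00Z entry sorts first
print(sorted(resets, key=reset_epoch))  # chronological: 10:00+02:00 (08:00 UTC) sorts first
```

With little achievable burn on the final day the target rises toward 100, so the account is drained earlier rather than leaving a buffer that cannot be used before expiry.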

Bundle D — Observability

Fix | Detail
Trigger taxonomy extended | Stay causes are now first-class: tick, emergence_pending, residency_blocked, cooldown_blocked, no_target, swap_aborted; swaps add intra_tier_preempted. Trigger set explicitly by the branch taken (_decision_trigger), not re-derived by parsing the reason string.
Swap audit trail | swap_log gains status (pending → committed/failed; written pending BEFORE the credential write) and residency_seconds; auto-migrated. /swap-log returns {"swaps": [...], "swaps_last_24h": N} (committed-only count).
Drain advisor | expiring_with_stranded_capacity WS event for T0 accounts projected to strand >2% (stranding_estimate = max(0, deficit − achievable_burn)); per-account 30-min cooldown. Advisor only, explicitly NO auto-burn worker: routing real work toward the expiring account beats burning quota on no-ops.
Stall pattern (d) demoted | Same-tier-stay-with-deficit is intended behavior per spec; paging it at ERROR trains operators to ignore the watchdog. Now same_tier_deficit_advisory (INFO + WS, 30-min cooldown, T0/T1 candidates >=15% behind target only).
Candidate summaries | Decision-log candidates include stranding and compute target/deficit against the damped tier, so the log matches what selection actually evaluated.
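
The drain advisor's gating can be sketched as a threshold check plus a per-account cooldown. Only the stranding formula, the 2% threshold, and the 30-minute cooldown come from the table; the DrainAdvisor class, its method names, and the injectable clock are assumptions made so the example can run on its own.

```python
import time

STRANDING_THRESHOLD_PCT = 2.0
ADVISORY_COOLDOWN_S = 30 * 60


def stranding_estimate(deficit: float, achievable_burn: float) -> float:
    """Capacity (in percent) projected to expire unused."""
    return max(0.0, deficit - achievable_burn)


class DrainAdvisor:
    """Hypothetical gate: notify at most once per account per cooldown window."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_sent = {}

    def should_notify(self, account_id: str, deficit: float, achievable_burn: float) -> bool:
        if stranding_estimate(deficit, achievable_burn) <= STRANDING_THRESHOLD_PCT:
            return False
        now = self._clock()
        last = self._last_sent.get(account_id)
        if last is not None and now - last < ADVISORY_COOLDOWN_S:
            return False  # still inside this account's cooldown
        self._last_sent[account_id] = now
        return True


fake_now = [0.0]
advisor = DrainAdvisor(clock=lambda: fake_now[0])
print(advisor.should_notify("acct-a", 10.0, 3.0))  # True: 7% projected to strand
print(advisor.should_notify("acct-a", 10.0, 3.0))  # False: cooldown active
fake_now[0] = 1800.0
print(advisor.should_notify("acct-a", 10.0, 3.0))  # True: cooldown elapsed
```

Injecting the clock keeps the cooldown testable without sleeping; a monotonic clock avoids spurious notifications if the wall clock is adjusted.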

Test Strategy

New test files:

Extended: test_auto_swap.py (target_for_tier, deadline-aware T1, damped-deficit admission, chronological sort, intra-T0 preemption, reset-suppress invariant, achievable_burn/stranding_estimate), test_usage_monitor.py (residency gate, failure backoff, taxonomy branches, fast-path emergence, tier-keyed persistence, same-tier advisory, execute-swap atomicity/residency, priming, active fetch timeout), test_use_account.py (manual switch arms pause + residency), test_token_refresh.py.

uv run python -m pytest tests/unit/test_refresh_deadlock.py tests/unit/test_swap_log.py \
  tests/unit/api/test_routes_auth_timeouts.py tests/unit/test_auto_swap.py \
  tests/unit/test_usage_monitor.py tests/unit/test_token_refresh.py \
  tests/unit/test_use_account.py -q

Rollout Notes

