Stuck server. A usage refresh could wedge the entire server until restart: account rows froze at "checking", the single-account refresh endpoints never returned, and the background token-refresh and heal sweeps silently died. The dashboard's refresh buttons locked out forever because the frontend fetch had no timeout either.
Auto-swap shortfalls. Independently, the swap algorithm was
stranding capacity (an active T0 with a long runway blocked an earlier-expiring
T0; the fixed T1 target of 90 stranded the buffer when expiry landed outside
working hours; the burn-floor check stranded the final slice of every account)
and churning (id-keyed emergence streaks reset each other; asymmetric hysteresis
fired premature drains; failed swaps retried every tick; manual switches were
silently reverted within minutes). The decision log recorded every blocked swap
as a generic tick, making any of this invisible.
Verified against the code, reproduced in tests/unit/test_refresh_deadlock.py:
- _refresh_token_flow(PRIMARY) in jacked/web/auth.py ran the post-refresh fetch_profile while still holding the per-account primary refresh lock.
- fetch_profile's 401 recovery path re-enters _refresh_token_flow on the same non-reentrant asyncio.Lock, so the coroutine awaited its own lock: a self-deadlock, with the lock never released.
- refresh_all_expiring_tokens and heal_invalid_accounts wrapped refresh_account_token in async with lock:, but the flow acquires that same lock internally, so those paths self-deadlocked even without the profile-fetch trigger.
- main.py's _token_refresh_loop and _heal_sweep_loop awaited the wedged pass with no timeout, so both loops stayed dead until server restart.
- The HTTP routes (/refresh-usage, /refresh-token, /validate) awaited the same wedged coroutines unbounded, producing an endless "checking" spinner; the frontend fetch() had no timeout, so its in-flight flags (_singleRefreshInFlight, _usageRefreshInProgress) stuck until page reload.
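The re-entry problem described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `reenter_same_lock` and `demo` are hypothetical names, and the inner acquire is wrapped in a short timeout only so the example terminates instead of hanging.

```python
import asyncio


async def reenter_same_lock(lock: asyncio.Lock) -> str:
    """Hold `lock`, then try to acquire it again from the same coroutine."""
    async with lock:
        try:
            # asyncio.Lock is not reentrant: this inner acquire cannot succeed
            # while the enclosing `async with` still holds the lock. Without
            # the timeout, the coroutine would wait on itself forever.
            await asyncio.wait_for(lock.acquire(), timeout=0.05)
        except asyncio.TimeoutError:
            return "would_deadlock"
        lock.release()
        return "reentered"


def demo() -> str:
    async def main() -> str:
        return await reenter_same_lock(asyncio.Lock())

    return asyncio.run(main())


if __name__ == "__main__":
    print(demo())
```

The timeout stands in for the missing bound in the original code path; with an unbounded `await lock.acquire()` the same structure never returns.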
flowchart TD
A[refresh_account_token] --> B[_refresh_token_flow PRIMARY
acquires per-account lock]
B --> C[token exchange OK]
C --> D[fetch_profile
WHILE HOLDING LOCK]
D -->|401| E[recovery: re-enter
_refresh_token_flow]
E -->|acquire same
non-reentrant lock| B
E -. wedged forever .-> F[every later refresh /
heal / validate for this
account blocks unbounded]
F --> G[main.py sweep loops +
HTTP routes await forever]
| Fix | Where |
|---|---|
| Lock ownership consolidated into _refresh_token_flow; bounded acquire wait_for(lock.acquire(), 60) → lock_timeout error instead of eternal suspension; explicit try/finally lock.release(). | jacked/web/auth.py |
| Post-refresh fetch_profile moved OUTSIDE the lock (step 4l); its 401 path re-enters the flow on the same lock (the root cause). | jacked/web/auth.py |
| refresh_all_expiring_tokens: lock.locked() peek for fast-skip only; per-account wait_for(60); exceptions counted as failed, never propagate. | jacked/web/auth.py |
| heal_invalid_accounts: calls _refresh_token_flow(PRIMARY) directly (flow owns locking; also fixes the phantom heal where should_refresh reported success without exchanging); heal validation bounded 60s. | jacked/web/auth.py |
| Sweep loops self-heal: asyncio.wait_for(..., SWEEP_PASS_TIMEOUT=600) around each refresh/heal pass; cancellation releases any held lock and the loop continues next interval. | jacked/api/main.py |
| Route-level bounds → deterministic 504 (REFRESH_TIMEOUT/VALIDATE_TIMEOUT): refresh-token 45s, refresh-usage 60s (also resets validation_status to "unknown" so the row leaves "checking"), validate 60s. Bulk-refresh 429 envelope standardized; _bulk_refresh_acquired_at reset in finally. | jacked/api/routes/auth.py |
| Poll-loop fetch bounds: active fetch_usage 50s (tick continues on cached data), candidate fetches 30s + catch-all, prime fetches 30s; single-account installs no longer re-prime forever (attempted == 0 counts as done). | jacked/api/usage_monitor.py |
| Frontend: AbortController default 60s API timeout (300s for bulk refresh); _singleRefreshInFlight is now a Map with a 240s stale backstop so an orphaned refresh can't lock out the button forever. | jacked/data/web/js/app.js, account-actions.js |
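The first two rows of the table (bounded acquire, explicit release, profile fetch outside the lock) can be sketched roughly as follows. This is an illustration under assumptions rather than the code in jacked/web/auth.py: `refresh_flow`, `exchange_token`, the callable parameters, and the returned dict shape are all hypothetical; only the 60s bound and the `lock_timeout` error name come from the table.

```python
import asyncio

LOCK_ACQUIRE_TIMEOUT = 60.0  # seconds, matching the bound quoted in the table


async def refresh_flow(lock, exchange_token, fetch_profile,
                       acquire_timeout=LOCK_ACQUIRE_TIMEOUT):
    """Hypothetical flow: the flow alone owns the lock, and only for the exchange."""
    try:
        # Bounded acquire: give up with an error instead of suspending forever.
        await asyncio.wait_for(lock.acquire(), timeout=acquire_timeout)
    except asyncio.TimeoutError:
        return {"ok": False, "error": "lock_timeout"}
    try:
        token = await exchange_token()
    finally:
        lock.release()  # released even if the exchange raises
    # The profile fetch runs after the release, so a recovery path that
    # re-enters this flow can take the lock again without deadlocking.
    profile = await fetch_profile(token)
    return {"ok": True, "token": token, "profile": profile}


def demo():
    async def main():
        lock = asyncio.Lock()

        async def exchange():
            return "tok"

        async def plain_profile(token):
            return {"token": token}

        async def profile_with_reentry(token):
            # Simulates a 401 recovery path that re-enters the flow.
            inner = await refresh_flow(lock, exchange, plain_profile, 0.05)
            return inner["profile"]

        ok = await refresh_flow(lock, exchange, profile_with_reentry, 0.05)
        await lock.acquire()  # simulate a lock held by another caller
        blocked = await refresh_flow(lock, exchange, plain_profile, 0.05)
        lock.release()
        return ok, blocked

    return asyncio.run(main())
```

In this sketch callers never wrap the flow in their own `async with lock:`, which is the other half of the fix: a caller-held lock would reintroduce the same self-deadlock.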
| Fix | Detail |
|---|---|
| Min-residency 900s | Proactive departures (emerged / intra-tier / burn-rate) blocked until the active account has held the slot 900s since the last committed swap or manual switch. Drained + 5h-critical are forced departures — exempt. Trigger: residency_blocked. |
| Manual-switch protection | use_account calls note_external_swap() (arms cooldown + residency, clears emergence streak) and pauses the sweep 15 min — auto-swap can no longer silently revert the user's choice. |
| RESET_SUPPRESS_MINUTES = 30 | Shared with selection's _FIVE_H_HEADROOM_RESET_MIN. Invariant: lookahead <= suppression, else an account admitted on an imminent 5h reset is immediately ejected by 5h-critical (deterministic ping-pong). |
| Symmetric tier hysteresis | Single-step less-urgent flips (T0→T1) now damped within 5 min of the boundary — refetch noise fired a premature drained (target 100→90) plus pull-back swap. Multi-step jumps flip immediately (genuine window reset). |
| Tier-keyed emergence persistence | _emerged_tier_streak keyed by TIER, not account id — two near-tied same-tier candidates alternating as best reset each other's id-keyed streak forever. Fast path: a >=2-tier gap (T0 best vs T2+ active) skips persistence. |
| Swap-failure backoff | Failed credential writes retry on exponential backoff (60s × 2^(n−1), cap 600s) instead of every tick; burn-rate state preserved on failure; committed swap resets the count. |
| Burn-rate decay ungated | Decays after 5+ unchanged ticks at ANY usage level — the below-warning gate froze the rate exactly when the user went idle at >=80%, firing spurious burn-rate swaps. |
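The swap-failure backoff schedule in the table above (60s × 2^(n−1), capped at 600s) works out as in this small sketch. The function name `swap_retry_delay` and its parameters are hypothetical; only the base, the doubling, and the cap are taken from the table.

```python
def swap_retry_delay(failures: int, base: float = 60.0, cap: float = 600.0) -> float:
    """Seconds to wait before retrying after `failures` consecutive failed swaps."""
    if failures < 1:
        return 0.0  # no failures recorded: no backoff
    # Clamp the exponent so very large failure counts cannot overflow a float;
    # the cap is reached long before the clamp matters.
    exponent = min(failures - 1, 10)
    return min(cap, base * 2 ** exponent)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, swap_retry_delay(n))
```

Under these assumptions the delays run 60, 120, 240, 480 seconds and then stay at the 600-second cap; a committed swap would reset the failure count to zero.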
| Fix | Detail |
|---|---|
| Quiet-hours executor guard REMOVED | Swaps execute 24/7. A credential swap costs nothing while idle and pre-positions the fleet for morning — 5h windows open on the first API call, not on credential placement. Dead active_start/active_end params dropped from _execute_swap. |
| Intra-T0 preemption (rule 1b) | An earlier-expiring T0 candidate with deficit >= 5% and loss-rate >= 1.5× the active T0's preempts it. The largest stranding mechanism observed; margin gates prevent ping-pong. Trigger: intra_tier_preempted. |
| Deadline-aware T1 target | max(90, 100 − min(10, achievable_final_day_burn)) — a fixed 90 strands the buffer when expiry lands outside working hours (e.g. 03:00 local). Floor 90. |
| T0 headroom-floor bypass | T0 candidates with deficit >= 1 skip has_viable_headroom — drain-to-100 must not strand the final slice below the ~4.2% burn-per-window floor. |
| Stale-saturation recovery | _has_5h_headroom treats a PAST 5h reset as headroom; _candidate_staleness_override forces a refetch on past-reset-with-saturated-cache or a 7d window reset — removes the 5-10 min blind spot right after a reset. |
| T0 poll clamp | Interval <=90s whenever an eligible non-active T0 with deficit exists; <=60s when a saturated T0's 5h reset lands within the next interval (be on time for the reset). |
| Damped-tier deficit + numeric sort | Selection computes deficit against the hysteresis-damped tier via target_for_tier (a held-at-T1 account is admitted with T1's target, never T0's); _SortKey uses UTC-epoch reset times (lex string sort diverges chronologically for mixed ISO formats). |
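Two of the rows above are easy to check by hand: the deadline-aware T1 target formula and the reason a lexical sort of reset timestamps can disagree with chronological order. The sketch below uses hypothetical helper names (`t1_target`, `reset_epoch`) and assumes naive timestamps are UTC; it is not the project's `_SortKey`.

```python
from datetime import datetime, timezone


def t1_target(achievable_final_day_burn: float) -> float:
    """max(90, 100 - min(10, achievable_final_day_burn)), as quoted in the table."""
    return max(90.0, 100.0 - min(10.0, achievable_final_day_burn))


def reset_epoch(ts: str) -> float:
    """Convert an ISO-8601 reset time to UTC epoch seconds for sorting."""
    # Replace a trailing "Z" so fromisoformat also works on older Pythons.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.timestamp()


if __name__ == "__main__":
    offset_form = "2025-01-01T10:00:00+02:00"  # 08:00 UTC
    zulu_form = "2025-01-01T09:00:00Z"         # 09:00 UTC
    print(sorted([offset_form, zulu_form]))                   # lexical order
    print(sorted([offset_form, zulu_form], key=reset_epoch))  # chronological order
    print(t1_target(4.0), t1_target(25.0))
```

The two timestamps sort in opposite orders: the string comparison puts the `Z` form first because "09" precedes "10", while the offset form is actually an hour earlier in UTC.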
| Fix | Detail |
|---|---|
| Trigger taxonomy extended | Stay causes are now first-class: tick, emergence_pending, residency_blocked, cooldown_blocked, no_target, swap_aborted; swaps add intra_tier_preempted. Trigger set explicitly by the branch taken (_decision_trigger), not re-derived by parsing the reason string. |
| Swap audit trail | swap_log gains status (pending → committed/failed; written pending BEFORE the credential write) and residency_seconds; auto-migrated. /swap-log returns {"swaps": [...], "swaps_last_24h": N} (committed-only count). |
| Drain advisor | expiring_with_stranded_capacity WS event for T0 accounts projected to strand >2% (stranding_estimate = max(0, deficit − achievable_burn)); per-account 30-min cooldown. Advisor only — explicitly NO auto-burn worker: routing real work toward the expiring account beats burning quota on no-ops. |
| Stall pattern (d) demoted | Same-tier-stay-with-deficit is intended behavior per spec — paging it at ERROR trains operators to ignore the watchdog. Now same_tier_deficit_advisory (INFO + WS, 30-min cooldown, T0/T1 candidates >=15% behind target only). |
| Candidate summaries | Decision-log candidates include stranding and compute target/deficit against the damped tier — the log matches what selection actually evaluated. |
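The drain advisor row above combines a stranding estimate, a 2% threshold, and a per-account 30-minute cooldown. A rough sketch of that gating, with a hypothetical `DrainAdvisor` class and caller-supplied timestamps in seconds (the real event plumbing is not shown in the source), might look like this:

```python
def stranding_estimate(deficit: float, achievable_burn: float) -> float:
    """max(0, deficit - achievable_burn), as quoted in the table."""
    return max(0.0, deficit - achievable_burn)


class DrainAdvisor:
    """Hypothetical gate: notify only above the threshold and outside the cooldown."""

    def __init__(self, threshold: float = 2.0, cooldown_s: float = 1800.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._last_notified = {}  # account id -> time of last notification

    def should_notify(self, account_id: str, deficit: float,
                      achievable_burn: float, now: float) -> bool:
        if stranding_estimate(deficit, achievable_burn) <= self.threshold:
            return False  # projected stranding is not above 2%
        last = self._last_notified.get(account_id)
        if last is not None and now - last < self.cooldown_s:
            return False  # still inside this account's 30-minute cooldown
        self._last_notified[account_id] = now
        return True
```

The cooldown is tracked per account, so one noisy account does not suppress advisories for another.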
New test files:
- tests/unit/test_refresh_deadlock.py: sweep and heal complete without deadlock (timeout-bounded), phantom-heal regression, bounded lock acquire returns lock_timeout, profile-fetch 401 re-entry does not deadlock, lock released after flow.
- tests/unit/test_swap_log.py: record/update status, residency, legacy-table migration, trailing-24h committed count, endpoint payload shape.
- tests/unit/api/test_routes_auth_timeouts.py: 504s with codes, status reset on usage timeout, 429 envelope, acquired_at reset.

Extended: test_auto_swap.py (target_for_tier, deadline-aware T1, damped-deficit admission, chronological sort, intra-T0 preemption, reset-suppress invariant, achievable_burn/stranding_estimate), test_usage_monitor.py (residency gate, failure backoff, taxonomy branches, fast-path emergence, tier-keyed persistence, same-tier advisory, execute-swap atomicity/residency, priming, active fetch timeout), test_use_account.py (manual switch arms pause + residency), test_token_refresh.py.

uv run python -m pytest tests/unit/test_refresh_deadlock.py tests/unit/test_swap_log.py \
tests/unit/api/test_routes_auth_timeouts.py tests/unit/test_auto_swap.py \
tests/unit/test_usage_monitor.py tests/unit/test_token_refresh.py \
tests/unit/test_use_account.py -q

- swap_log column migration runs automatically at startup.
- swaps_last_24h counter (in GET /api/settings/swap-log): expect a modest increase from 24/7 execution + intra-T0 preemption; min-residency, cooldown, and failure backoff should keep it in the single digits. A sustained high count means ping-pong; read the decision log triggers.
- Log lines "could not acquire ... refresh lock" or "pass exceeded ... cancelled": these now mean a new wedge is being contained rather than freezing the server.