fix(vault): sticky pool failover (no main account, stops failover flap and notice spam) #48
Merged
…urrentMembers prune
…e lock before degrade logging
…oncurrent degrade no-double-unlock test
What
Live on the server, the bot sent 20+ identical `pool openai_pool failed over openai_oauth -> openai_oauth_2 (429)` notices over ~1.5h while hermes kept working. The root cause is a failover flap. `openai_oauth` is pool position 0 (the de-facto "main"). Its OpenAI quota is exhausted, so it 429s on every request. On a 429, sluice fails over to `openai_oauth_2` and parks `openai_oauth` for only `vault.RateLimitCooldown` (60s). `PoolResolver.ResolveActive` returned the first member in position order that was healthy or whose cooldown had expired, so 60s later `openai_oauth` was re-selected even though `openai_oauth_2` was serving fine. It 429s again, fails over again, and emits another identical notice. The real OpenAI Codex quota window is hours, so re-probing and snapping back to the exhausted account every 60s is wrong.
Change
Sticky failover. There is no "main" account. `ResolveActive` now stays on whichever member is currently active until that member itself cools, then advances forward to the next eligible member (wrapping), and never snaps back to a lower-position member just because its cooldown lapsed. A new failover, and therefore exactly one `cred_failover` audit event plus one Telegram notice, happens only on a genuine exhaustion transition. `ResolveActive` is the single source of truth (the failover path already calls it to compute the new active member), so the fix lands in one place and the flap, the failover events, and the notice spam all stop together.
The per-pool current-active selection lives on the swap-surviving shared `PoolHealth` struct under the same mutex as the cooldown map, so it survives resolver regeneration and atomic pointer swaps, and a stale generation cannot clobber it (this mirrors the existing CRITICAL-1 cooldown handling, pruned in `MergeLiveCooldowns`). The all-cooling degrade path (operator-parked-but-healthy first, else soonest-recovering) and the audit Reason / `pool_exhausted` formats are unchanged. `sluice pool rotate` still parks the active member and now advances-and-stays rather than snapping back.
A selectable position-vs-sticky strategy mode is noted as a possible follow-up and is out of scope here.
Testing
vault unit tests: sticky hold, flap regression (cool A returns B, expire A still returns B, cool B advances with wrap), all-cooling degrade unchanged, operator-parked degrade target preserved, sticky pointer survives `NewPoolResolverShared` regeneration plus atomic swap and a stale generation cannot clobber it. proxy: exactly one `cred_failover` plus one notice per real transition and zero events when a non-active member cooldown merely lapses. The flap and spam tests were confirmed to fail against the old position-priority logic and pass after.
Full `go test ./...` green, `-race` on `internal/vault` and `internal/proxy` clean, `go vet ./...` and `go vet -tags=e2e ./e2e/` clean, gofumpt clean, golangci-lint 0 issues.
Plan: `docs/plans/20260518-sticky-failover.md`.
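The proxy-side property tested above (one event per real transition, zero events otherwise) can be sketched in isolation. The types and names below (`failoverEvent`, `transitionTracker`, `Observe`) are hypothetical and are not taken from sluice's proxy; the sketch only assumes that an event should fire when the resolved active member differs from the previously observed one.

```go
package main

import "fmt"

// failoverEvent is a hypothetical record of one genuine pool transition.
type failoverEvent struct {
	Pool, From, To string
}

// transitionTracker remembers the last active member seen per pool.
// It is not safe for concurrent use; a real implementation would need locking.
type transitionTracker struct {
	last map[string]string
}

func newTransitionTracker() *transitionTracker {
	return &transitionTracker{last: make(map[string]string)}
}

// Observe records the currently resolved active member for a pool. It returns
// an event and true only when the member differs from the previous
// observation; the first observation and repeated observations of the same
// member produce no event.
func (t *transitionTracker) Observe(pool, active string) (failoverEvent, bool) {
	prev, seen := t.last[pool]
	t.last[pool] = active
	if !seen || prev == active {
		return failoverEvent{}, false
	}
	return failoverEvent{Pool: pool, From: prev, To: active}, true
}

func main() {
	tr := newTransitionTracker()
	// With sticky selection the resolver keeps answering openai_oauth_2 after
	// the single real transition, so only one event is produced.
	resolved := []string{"openai_oauth", "openai_oauth_2", "openai_oauth_2", "openai_oauth_2"}
	events := 0
	for _, member := range resolved {
		if ev, ok := tr.Observe("openai_pool", member); ok {
			events++
			fmt.Printf("pool %s failed over %s -> %s\n", ev.Pool, ev.From, ev.To)
		}
	}
	fmt.Println("events:", events)
}
```

Fed the flapping sequence that position-priority selection would produce (alternating between the two members), the same tracker would emit an event on every alternation, which matches the notice spam described under What.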