
fix(vault): sticky pool failover (no main account, stops failover flap and notice spam)#48

Merged
nnemirovsky merged 8 commits into main from sticky-failover
May 18, 2026

Conversation

@nnemirovsky
Owner

What

Live on the server, the bot sent 20+ identical "pool openai_pool failed over openai_oauth -> openai_oauth_2 (429)" notices over ~1.5h while hermes kept working. The root cause is a failover flap.

openai_oauth is pool position 0 (the de facto "main"). Its OpenAI quota is exhausted, so it returns a 429 on every request. On a 429, sluice fails over to openai_oauth_2 and parks openai_oauth for only vault.RateLimitCooldown (60s). PoolResolver.ResolveActive returned the first member in position order that was healthy or whose cooldown had expired, so 60s later openai_oauth was re-selected even though openai_oauth_2 was serving fine. It then 429s again, fails over again, and emits another identical notice. The real OpenAI Codex quota window is hours long, so re-probing and snapping back to the exhausted account every 60s is wrong.

Change

Sticky failover. There is no "main" account. ResolveActive now stays on whichever member is currently active until that member itself cools, then advances forward to the next eligible member (wrapping), and never snaps back to a lower-position member just because its cooldown lapsed. A new failover, and therefore exactly one cred_failover audit event plus one Telegram notice, happens only on a genuine exhaustion transition.

ResolveActive is the single source of truth (the failover path already calls it to compute the new active member), so the fix lands in one place and the flap, the failover events, and the notice spam all stop together.

The per-pool current-active selection lives on the swap-surviving shared PoolHealth struct under the same mutex as the cooldown map, so it survives resolver regeneration and atomic pointer swaps and a stale generation cannot clobber it (mirrors the existing CRITICAL-1 cooldown handling, pruned in MergeLiveCooldowns). The all-cooling degrade path (operator-parked-but-healthy first, else soonest-recovering) and the audit Reason / pool_exhausted formats are unchanged. sluice pool rotate still parks the active member and now advances-and-stays rather than snapping back.

A selectable position-vs-sticky strategy mode is noted as a possible follow-up and is out of scope here.

Testing

vault unit tests: sticky hold, flap regression (cool A returns B, expire A still returns B, cool B advances with wrap), all-cooling degrade unchanged, operator-parked degrade target preserved, sticky pointer survives NewPoolResolverShared regeneration plus atomic swap and a stale generation cannot clobber it. proxy: exactly one cred_failover plus one notice per real transition and zero events when a non-active member cooldown merely lapses. The flap and spam tests were confirmed to fail against the old position-priority logic and pass after.

Full go test ./... green, -race on internal/vault and internal/proxy clean, go vet ./... and go vet -tags=e2e ./e2e/ clean, gofumpt clean, golangci-lint 0 issues.

Plan: docs/plans/20260518-sticky-failover.md.


Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@nnemirovsky nnemirovsky merged commit e5223ce into main May 18, 2026
10 checks passed
@nnemirovsky nnemirovsky deleted the sticky-failover branch May 18, 2026 13:29

2 participants