fix(vault): sticky pool failover (no main account, stops failover flap and notice spam) by nnemirovsky · Pull Request #48 · nnemirovsky/sluice

nnemirovsky · 2026-05-18T12:57:39Z

What

Live on the server, the bot sent 20+ identical "pool openai_pool failed over openai_oauth -> openai_oauth_2 (429)" notices over ~1.5h while hermes kept working. The root cause is a failover flap.

openai_oauth is pool position 0 (the de-facto "main"). Its OpenAI quota is exhausted, so it 429s on every request. On a 429, sluice fails over to openai_oauth_2 and parks openai_oauth for only vault.RateLimitCooldown (60s). PoolResolver.ResolveActive returned the first member in position order that was healthy or whose cooldown had expired, so 60s later openai_oauth was re-selected even though openai_oauth_2 was serving fine. It then 429s again, fails over again, and emits another identical notice. The real OpenAI Codex quota window is hours long, so re-probing and snapping back to the exhausted account every 60s is wrong.

Change

Sticky failover. There is no "main" account. ResolveActive now stays on whichever member is currently active until that member itself cools, then advances forward to the next eligible member (wrapping), and never snaps back to a lower-position member just because its cooldown lapsed. A new failover, and therefore exactly one cred_failover audit event plus one Telegram notice, happens only on a genuine exhaustion transition.

ResolveActive is the single source of truth (the failover path already calls it to compute the new active member), so the fix lands in one place and the flap, the failover events, and the notice spam all stop together.

The per-pool current-active selection lives on the swap-surviving shared PoolHealth struct under the same mutex as the cooldown map, so it survives resolver regeneration and atomic pointer swaps and a stale generation cannot clobber it (mirrors the existing CRITICAL-1 cooldown handling, pruned in MergeLiveCooldowns). The all-cooling degrade path (operator-parked-but-healthy first, else soonest-recovering) and the audit Reason / pool_exhausted formats are unchanged. sluice pool rotate still parks the active member and now advances-and-stays rather than snapping back.
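One way to picture the swap-survival property is the sketch below. It is an assumption about the general shape, not the PoolHealth code: sharedHealth, stickyEntry, setCurrent, and the resolver type are hypothetical, and the epoch comparison is one plausible way to keep a stale generation from overwriting newer state. It uses atomic.Pointer, which requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stickyEntry records the active member and the generation that wrote it.
type stickyEntry struct {
	member string
	epoch  uint64
}

// sharedHealth is a hypothetical stand-in for state shared by every
// resolver generation. The sticky pointer lives under one mutex.
type sharedHealth struct {
	mu      sync.Mutex
	current map[string]stickyEntry // pool name -> active member
}

// setCurrent records the active member unless a newer generation has
// already written it. It reports whether the write was accepted.
func (h *sharedHealth) setCurrent(pool, member string, epoch uint64) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if cur, ok := h.current[pool]; ok && cur.epoch > epoch {
		return false // stale generation: refuse to clobber
	}
	h.current[pool] = stickyEntry{member: member, epoch: epoch}
	return true
}

func (h *sharedHealth) currentMember(pool string) (string, bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	cur, ok := h.current[pool]
	return cur.member, ok
}

// resolver is a hypothetical per-generation view over the shared state.
type resolver struct {
	epoch  uint64
	health *sharedHealth
}

func main() {
	health := &sharedHealth{current: map[string]stickyEntry{}}
	var live atomic.Pointer[resolver]

	gen1 := &resolver{epoch: 1, health: health}
	live.Store(gen1)
	gen1.health.setCurrent("openai_pool", "openai_oauth_2", gen1.epoch)

	// Regenerate the resolver and swap it in; the shared state is reused.
	gen2 := &resolver{epoch: 2, health: health}
	live.Store(gen2)
	survived, _ := live.Load().health.currentMember("openai_pool")
	gen2.health.setCurrent("openai_pool", survived, gen2.epoch)

	// A straggler still holding gen1 tries to snap back to position 0.
	accepted := gen1.health.setCurrent("openai_pool", "openai_oauth", gen1.epoch)
	final, _ := live.Load().health.currentMember("openai_pool")

	fmt.Println(survived, accepted, final) // openai_oauth_2 false openai_oauth_2
}
```

Because the selection is stored on the shared struct rather than on the resolver, swapping the resolver pointer does not reset which member is active.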

A selectable position-vs-sticky strategy mode is noted as a possible follow-up and is out of scope here.

Testing

Vault unit tests cover: sticky hold; flap regression (cool A returns B, expire A still returns B, cool B advances with wrap); all-cooling degrade unchanged; operator-parked degrade target preserved; and the sticky pointer surviving NewPoolResolverShared regeneration plus an atomic swap without a stale generation clobbering it. Proxy tests assert exactly one cred_failover plus one notice per real transition, and zero events when a non-active member's cooldown merely lapses. The flap and spam tests were confirmed to fail against the old position-priority logic and pass after the change.
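The proxy-level expectation (one event per real transition, none otherwise) can be illustrated with a small recorder. This is a hypothetical test helper, not code from the PR: failoverRecorder and observe are invented names, and only the notice text format is taken from the description above.

```go
package main

import "fmt"

// failoverRecorder is a hypothetical helper that emits a notice only
// when the active member of a pool actually changes.
type failoverRecorder struct {
	lastActive map[string]string
	notices    []string
}

// observe records the active member seen for a pool and reports
// whether this observation was a genuine transition.
func (r *failoverRecorder) observe(pool, active string, status int) bool {
	prev, seen := r.lastActive[pool]
	r.lastActive[pool] = active
	if !seen || prev == active {
		return false
	}
	r.notices = append(r.notices,
		fmt.Sprintf("pool %s failed over %s -> %s (%d)", pool, prev, active, status))
	return true
}

func main() {
	r := &failoverRecorder{lastActive: map[string]string{}}

	r.observe("openai_pool", "openai_oauth", 200)   // initial selection
	r.observe("openai_pool", "openai_oauth_2", 429) // real exhaustion transition
	r.observe("openai_pool", "openai_oauth_2", 200) // another cooldown lapsed; active unchanged

	fmt.Println(len(r.notices)) // 1
	fmt.Println(r.notices[0])   // pool openai_pool failed over openai_oauth -> openai_oauth_2 (429)
}
```

With sticky selection the active member stays put when a non-active cooldown lapses, so a check of this shape sees no transition and records nothing.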

Full go test ./... green, -race on internal/vault and internal/proxy clean, go vet ./... and go vet -tags=e2e ./e2e/ clean, gofumpt clean, golangci-lint 0 issues.

Plan: docs/plans/20260518-sticky-failover.md.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

nnemirovsky added 4 commits May 18, 2026 20:44

docs(plans): sticky pool failover plan

40d069e

fix(vault): sticky pool failover selection (no main account)

8f43994

test(pool): sticky-hold, flap, advance-wrap, swap-survival and spam regressions

e1b7ad1

docs(pools): describe sticky active-member selection

22243f4

nnemirovsky requested a review from Copilot May 18, 2026 12:58

Copilot started reviewing on behalf of nnemirovsky May 18, 2026 12:58

nnemirovsky added 2 commits May 18, 2026 21:09

perf(vault): read-mostly fast path and epoch-scoped sticky pointer in ResolveActive

eb07a8a

test(vault): cover sticky fast/advance paths, epoch clobber, and SetCurrentMembers prune

a34000e

nnemirovsky requested a review from Copilot May 18, 2026 13:12

Copilot started reviewing on behalf of nnemirovsky May 18, 2026 13:13

nnemirovsky added 2 commits May 18, 2026 21:21

perf(vault): single time.Now snapshot per ResolveActive and drop write lock before degrade logging

ee609ab

test(vault): use realistic epoch>=1 in sticky-pointer tests and add concurrent degrade no-double-unlock test

57e72e6

nnemirovsky requested a review from Copilot May 18, 2026 13:24

Copilot started reviewing on behalf of nnemirovsky May 18, 2026 13:25

Copilot AI reviewed May 18, 2026

nnemirovsky merged commit e5223ce into main May 18, 2026
10 checks passed

nnemirovsky deleted the sticky-failover branch May 18, 2026 13:29

nnemirovsky merged 8 commits into main from sticky-failover
