Skip to content
Merged
2 changes: 1 addition & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -191,7 +191,7 @@ Auto-failover on 429/401 is primary; `pool rotate` is an override.

- **Single chokepoint (I2):** every `binding.Credential` / `OAuthIndex.Has` / `extractInjectableSecret` / persist consumer on the HTTP/HTTPS OAuth path routes through `PoolResolver.ResolveActive` (`resolveInjectionTarget` for pass-1 header + pass-2 swap; `resolveOAuthResponseAttribution` for response/persist). `idx.Has` is always called with the resolved member, never the pool. Plain credentials pass through unchanged; SSH/mail are non-OAuth, out of scope.
- **QUIC scope:** the HTTP/1.x/HTTP/2 MITM addon implements the full feature set (R1, R3, Phase 2). The HTTP/3/QUIC path (`QUICProxy.buildPhantomPairs`, binding-header injection in `quic.go`) is a request-side buffered swap with **no response-side OAuth interception**, but IS pool-aware on the request side: `QUICProxy.resolvePoolTarget` (via `NewQUICProxy`'s `poolResolver`) selects the active member's real secret and routes through `buildPooledOAuthPhantomPairs` so the access phantom is the same pool-stable JWT (R3 holds over QUIC). QUIC does **not** do R1 attribution or Phase 2 failover — the injected member is whatever the HTTP path / `pool rotate` last made active, and a QUIC-only 429/401 or refresh is not acted on. Deployments needing R1/auto-failover must route the pooled upstream over HTTP/HTTPS.
- **Active-member selection:** healthy or expired-cooldown members first, by configured position; if all are in cooldown, the soonest-recovering is returned with a WARNING (degrade, never hard-fail). Recovery is lazy (evaluated in `ResolveActive`, no scheduler).
- **Active-member selection (sticky, no "main" account):** `ResolveActive` is sticky. There is no position-0 "main": once selection settles on a member sluice keeps returning it while it is healthy, even if a lower-position member recovers from cooldown. A new active member is chosen only when the current one cools. The chosen member is the next eligible one by position **starting after the current member and wrapping** (advance forward, never snap back to position 0). The sticky current-active pointer lives on the shared swap-surviving `PoolHealth` under the same mutex as the cooldown map, so it survives `NewPoolResolverShared` regeneration and atomic pointer swaps (CRITICAL-1) and a stale resolver generation cannot clobber it, exactly like cooldowns. This kills the live flap where a 60s `RateLimitCooldown` lapse re-selected an upstream-exhausted account and respammed `cred_failover` plus a Telegram notice every cooldown window. If every member is cooling, the existing degrade applies (operator-parked-but-healthy `ManualRotateReason` first, else soonest-recovering, with a WARNING, never hard-fail) and the sticky pointer is left untouched so a recovery advances forward rather than snapping to position 0. `sluice pool rotate` still works: it parks the active member, so the next `ResolveActive` advances to the next member and then stays there (no snap-back). Recovery is lazy (evaluated in `ResolveActive`, no scheduler). A selectable position-priority-vs-sticky strategy mode is a possible follow-up and is out of scope here.
- **R1 refresh-token attribution / fail-closed:** when pass-2 swaps `SLUICE_PHANTOM:<pool>.refresh`, sluice records `realRefreshToken -> member` in a short-TTL map; on the token-endpoint response it recovers the member by that real refresh token and persists to it (`persistAddonOAuthTokens(member, ...)`, singleflight `"persist:"+member`). The join key is the real **refresh** token — never the access token, connection, or `OAuthIndex.Match` (two pooled members share `auth.openai.com`'s token URL and collide). Unrecoverable -> WARNING + skip the write (rotating refresh tokens are single-use; a mis-attributed write bricks both accounts). **Plain-credential disambiguation:** a plain OAuth credential sharing a pool's token URL also tags its injected refresh token `realRefreshToken -> <plain name>` (plain path in `buildPhantomPairs`/`buildOAuthPhantomPairs`'s `onRefreshInject`, incl. split-host expansion); on response a recovered non-member (`PoolForMember == ""`) is attributed 1:1 to that plain credential, NOT fail-closed as pooled. The pooled fail-closed path applies only when recovery fails or resolves to an actual member; `poolForResponse` gates the same on an independent `flowInjected` tag (set post-swap only if a pool phantom was present) before cooling a member.
- **R3 pool-stable phantom JWT:** Codex access tokens are JWTs; the per-real-token `resignJWT` would emit a different phantom after each cross-member refresh, breaking "agent never notices". `poolStablePhantomAccess` (`internal/proxy/oauth_response.go`) builds the phantom JWT from a deterministic synthetic payload keyed on the **pool name** (`sub: sluice-pool:<pool>`, `iss: sluice-phantom`, fixed far-future `exp`, no `iat`), HMAC-SHA256 with the fixed key — byte-identical across switches, structurally valid. Pool name is JSON-marshaled (never concatenated) so quotes/control chars can't inject claims. Static-form fallback (`SLUICE_PHANTOM:<pool>.access`) only on the unreachable `json.Marshal` failure. Refresh phantom stays static `SLUICE_PHANTOM:<pool>.refresh`.

Expand Down
103 changes: 103 additions & 0 deletions docs/plans/20260518-sticky-failover.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
# Sticky Pool Failover (no "main" account)

## Problem

Live on knuth: `pool openai_pool failed over openai_oauth -> openai_oauth_2 (429)` repeats
every few minutes (20+ messages). Hermes keeps working. Root cause is a flap:

- `openai_oauth` is position 0 (the de-facto "main"). Its OpenAI quota is exhausted, so it
429s on every request.
- On 429 sluice fails over to `openai_oauth_2` and parks `openai_oauth` for only
`vault.RateLimitCooldown` = 60s.
- `PoolResolver.ResolveActive` picks the **first member in position order** that is healthy
or whose cooldown expired. 60s later `openai_oauth`'s cooldown lapses, so it is re-selected
even though `openai_oauth_2` was serving fine.
- `openai_oauth` is still quota-exhausted upstream -> 429 -> failover again -> another
identical Telegram notice. Repeats on every request gap > 60s.

The real OpenAI Codex quota window is hours/weekly, so re-probing the exhausted account
every 60s and snapping back to it is wrong.

## Desired behavior (user decision)

Sticky failover: there is no "main" account. Stay on whichever member is currently active
until **that** member exhausts, then advance to the next member. A lower-position member
recovering from cooldown must NOT cause a switch back to it. A new failover (and its single
Telegram notice) happens only on a genuine exhaustion transition, not on cooldown lapse.

Mode toggle (position-priority vs sticky as selectable strategies) is explicitly out of
scope for this change; sticky becomes the behavior. Note it as a possible follow-up.

## Design

`ResolveActive` is the single source of truth: `internal/proxy/pool_failover.go` calls
`pr.ResolveActive(pool)` to compute the new active ("to") after cooling the old member, so
making selection sticky fixes the flap, the failover events, and the notification spam in
one place.

1. Add a per-pool current-active pointer to the swap-surviving shared `PoolHealth`
(`internal/vault/pool.go`), guarded by the same mutex as the cooldown map, so it
survives `NewPoolResolverShared` regeneration and atomic swaps (same lifecycle as
cooldowns; carry it the way cooldowns are carried).
2. Sticky `ResolveActive(pool)`:
- Let `cur` = shared current-active for the pool, if set and still a member of this
generation.
- If `cur` is set and healthy (no active cooldown) -> return `cur` (sticky hold).
- Else pick the next eligible member by position **starting after `cur`'s position and
wrapping** (so we advance forward, never snap back to position 0); skip members in
active cooldown; treat `ManualRotateReason` parks with the existing
better-degrade-target semantics. Set shared current-active to the picked member and
return it.
- If every member is cooling: keep the existing degrade behavior (operator-parked-but-
healthy first, else soonest-recovering) and do NOT move the sticky pointer, so when a
member recovers the next call advances to a healthy one rather than the absolute
position-0 member.
3. Operator `sluice pool rotate` semantics unchanged at the surface: it still parks the
current active so the next `ResolveActive` advances; with sticky that advance lands on
the next member and stays there (no snap-back), which is the intended rotate behavior.
4. Keep all existing concurrency invariants: sticky pointer reads/writes under the same
`PoolHealth.mu`; never lost across atomic pointer swap; a stale resolver generation must
not clobber it (mirror the cooldown CRITICAL-1 handling).

## Out of scope

- Strategy/mode toggle across CLI/REST/Telegram + schema (`credential_pools.strategy`) —
follow-up only; note the synergy, do not build.
- Honoring upstream `Retry-After` / changing `RateLimitCooldown` constant — sticky makes
the short cooldown harmless (we no longer re-probe the cooled member until forced), so a
cooldown-duration change is unnecessary for this fix.

## Testing strategy

- Unit (vault): sticky hold — active member stays selected across many `ResolveActive`
calls while a lower-position member is healthy.
- Unit (vault): flap regression — fail member A (cooldown), `ResolveActive` returns B; A's
cooldown expires; `ResolveActive` still returns B (no snap-back). B then cools ->
advance to next, wrapping.
- Unit (vault): all-cooling degrade unchanged; operator-parked degrade-target preserved.
- Unit (vault): sticky pointer survives `NewPoolResolverShared` regeneration + swap, and a
stale generation cannot clobber it (extend the existing CRITICAL-1 style test).
- Unit (proxy): the failover path emits exactly ONE `cred_failover` + one notice per real
exhaustion transition, and emits NOTHING when a non-active member's cooldown merely
lapses (the spam regression, fail-before/pass-after).
- Full `go test ./...`, `-race` on `internal/vault` and `internal/proxy`, gofumpt,
golangci-lint, `go vet ./...` and `-tags=e2e ./e2e/`.

## Steps

### Task 1: Sticky selection in vault.PoolResolver
- [x] Add swap-surviving per-pool current-active to shared `PoolHealth` (same mutex)
- [x] Rewrite `ResolveActive` to the sticky algorithm above; preserve degrade + parked semantics
- [x] Preserve CRITICAL-1 invariants (no loss/clobber across swap; stale generation safe)
- [x] Unit tests: sticky hold, flap regression (no snap-back), advance+wrap, degrade unchanged, swap-survival
- [x] `go test ./internal/vault/ -race`, gofumpt, vet

### Task 2: Failover path + notification spam regression
- [x] Confirm `pool_failover.go` from->to now changes only on real exhaustion (sticky source of truth); adjust only if it bypasses `ResolveActive`
- [x] Test: one `cred_failover`+notice per real transition; zero events when a non-active member's cooldown lapses (fail-before/pass-after)
- [x] `go test ./internal/proxy/ -race`, gofumpt, vet

### Task 3: Docs + final validation
- [x] Update CLAUDE.md credential-pools section to describe sticky selection (replace the position-priority wording) and note the mode-toggle follow-up
- [x] `gofumpt -l` clean; `golangci-lint run ./...` 0 issues; full `go test ./...`; `go vet ./...`; `go vet -tags=e2e ./e2e/`
- [x] Independently verify committed HEAD builds and tests pass (do not trust subagent green)
107 changes: 104 additions & 3 deletions internal/proxy/pool_failover_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -193,11 +193,23 @@ func TestFailoverCooldownTTLAndLazyRecovery(t *testing.T) {
t.Fatalf("auth-fail cooldown TTL = %v, want ~%v", gotTTL, vault.AuthFailCooldown)
}

// Lazy recovery: force the cooldown to the past; ResolveActive must
// treat memA as eligible again with no background scheduler involved.
// Lazy recovery: force the cooldown to the past. memA becomes ELIGIBLE
// again with no background scheduler involved (CooldownUntil reports it
// is no longer cooling). Selection is STICKY, so memB - which we failed
// over to and which is healthy - remains active: a recovered lower-
// position member must NOT snap back (the flap fix). memA's eligibility
// only matters as a future advance target if memB later exhausts.
pr.MarkCooldown("memA", time.Now().Add(-time.Second), "expired")
if _, cooling := pr.CooldownUntil("memA"); cooling {
t.Fatal("after expiry memA must no longer be cooling (lazy recovery, no scheduler)")
}
if active, _ := pr.ResolveActive("codex_pool"); active != "memB" {
t.Fatalf("after expiry active = %q, want memB (sticky: recovered memA must NOT snap back)", active)
}
// Confirm memA IS the advance target once memB exhausts (wrap forward).
pr.MarkCooldown("memB", time.Now().Add(vault.RateLimitCooldown), "429")
if active, _ := pr.ResolveActive("codex_pool"); active != "memA" {
t.Fatalf("after expiry active = %q, want memA (lazy recovery)", active)
t.Fatalf("after memB exhausts active = %q, want memA (advance forward to recovered member)", active)
}
}

Expand Down Expand Up @@ -346,6 +358,95 @@ func TestFailoverAuditEvent(t *testing.T) {
}
}

// TestStickyFailoverNoSpamOnNonActiveCooldownLapse is the notification-spam
// regression for the live knuth flap. memA (position 0) is upstream-
// exhausted. Sequence:
//
// 1. memA 429 -> exactly ONE cred_failover (memA->memB) + one notice.
// 2. memA's short cooldown lapses. Under the OLD position-priority
// ResolveActive, the next request's injection would re-select memA
// (lower position, no longer cooling), hit the still-exhausted upstream,
// 429 again, and emit ANOTHER identical failover + Telegram notice -
// repeating every cooldown window (the 20+ message spam).
// 3. With sticky selection memB stays active after memA recovers, so the
// subsequent request resolves to memB, succeeds (200), and produces NO
// additional failover event or notice.
//
// Fail-before: step 3 would record a 2nd cred_failover (the flap).
// Pass-after: exactly one cred_failover total, one notice total.
func TestStickyFailoverNoSpamOnNonActiveCooldownLapse(t *testing.T) {
dir := t.TempDir()
logPath := filepath.Join(dir, "audit.log")
logger, err := audit.NewFileLogger(logPath)
if err != nil {
t.Fatalf("NewFileLogger: %v", err)
}
t.Cleanup(func() { _ = logger.Close() })

addon, _, prPtr := setupPoolAddon(t, "memA", "memB")
addon.auditLog = logger
client := setupAddonConn(addon, "auth.example.com:443")
pr := prPtr.Load()

var notices int
addon.SetOnFailover(func(FailoverEvent) { notices++ })

if got, _ := pr.ResolveActive("codex_pool"); got != "memA" {
t.Fatalf("pre = %q, want memA", got)
}

// 1. memA exhausts -> one real failover to memB.
f1 := newPoolRespFlow(client, 429, []byte(`{"error":"rate_limited"}`))
addon.flowInjected.Tag(f1.Id, "memA")
addon.Response(f1)
if got, _ := pr.ResolveActive("codex_pool"); got != "memB" {
t.Fatalf("after 429 active = %q, want memB", got)
}

// 2. memA's short cooldown lapses (still upstream-exhausted in reality).
pr.MarkCooldown("memA", time.Now().Add(-time.Second), "expired")

// The next request's injection resolves the pool. STICKY: it must stay
// on memB, NOT snap back to the recovered-but-still-exhausted memA. This
// is the single source of truth that kills the flap.
if got, _ := pr.ResolveActive("codex_pool"); got != "memB" {
t.Fatalf("FLAP: after memA cooldown lapse, injection resolves %q, want memB", got)
}

// 3. The subsequent legitimate request therefore hits memB and succeeds;
// a 200 is a documented no-op (no failover, no notice).
f2 := newPoolRespFlow(client, 200, []byte(`{"ok":true}`))
addon.flowInjected.Tag(f2.Id, "memB")
addon.Response(f2)

if err := logger.Close(); err != nil {
t.Fatalf("logger close: %v", err)
}
data, err := os.ReadFile(logPath)
if err != nil {
t.Fatalf("read audit log: %v", err)
}
failovers := 0
for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
if line == "" {
continue
}
var evt audit.Event
if uerr := json.Unmarshal([]byte(line), &evt); uerr != nil {
t.Fatalf("unmarshal %q: %v", line, uerr)
}
if evt.Action == "cred_failover" || evt.Action == "pool_exhausted" {
failovers++
}
}
if failovers != 1 {
t.Fatalf("got %d failover audit events, want exactly 1 (sticky must not flap on non-active cooldown lapse)\n%s", failovers, data)
}
if notices != 1 {
t.Fatalf("got %d failover notices, want exactly 1 (no Telegram spam on cooldown lapse)", notices)
}
}

// TestPoolForResponseResolvesActiveMember sanity-checks the destination ->
// pool reverse mapping used by handlePoolFailover.
func TestPoolForResponseResolvesActiveMember(t *testing.T) {
Expand Down
Loading
Loading