fix(auth): don't eject users to invite screen on transient access-check failure by pauldambra · Pull Request #2862 · PostHog/code

pauldambra · 2026-06-23T14:34:09Z

Problem

People actively using PostHog Code — already past the paywall, mid-session — were suddenly bounced to the invite-code screen during brief API outages, then recovered on their own. A quit/reopen also fixed it, which pointed at a transient failure rather than real loss of access.

Root cause: code access (hasCodeAccess) is never persisted. It's re-fetched from GET /api/code/invites/check-access/ on every token refresh (≈hourly, piggybacking on the next API call after the token enters its 60s expiry skew) via updateCodeAccessFromSession. That function treated any non-success — a 403, a 5xx, a timeout, a network error — as a definitive "no access" and set hasCodeAccess = false, which is exactly the condition that renders the invite screen. A single bad response during an outage ejected authenticated users. Confirmed against the backend (products/tasks/backend/.../api.py CodeInviteViewSet.check_access): genuine denial is always 200 {has_access: false}; non-2xx is only auth failure / throttle / outage, never a "no access" signal.

Changes

updateCodeAccessFromSession now treats the check as resilient with a bounded fail-closed:

Healthy path stays synchronous — one awaited attempt resolves the gate before the app renders.
Transient failures don't eject the user — on a non-2xx / network error / timeout we keep the current value and retry in the background with prime-number backoff (2, 3, 5, 7, 11, 13, 17, 19, 23s) up to a ~90s budget. Primes keep many clients' retries from lining up after an outage. The background loop is decoupled from the token-refresh path so it can't stall other API calls.
Fail closed after the budget — if we still can't get a definitive 200 within ~90s, hasCodeAccess is set to false. A sustained inability to call home revokes access, so the gate can't be bypassed by keeping the client offline.
Per-attempt fetch timeout for this check tightened from 30s → 10s.
Genuine "no access" (200 {has_access:false}) still gates immediately on the first attempt.

Net effect: a blip no longer kicks real users, but you can't game the invite by going offline.

How did you test this?

Unit tests in auth.test.ts (39 passing, pnpm --filter @posthog/core exec vitest run src/auth/auth.test.ts):

transient failures then 200 → recovers to access in the background, never flips to false
persistent failure → fails closed (false) after exhausting the prime-backoff budget (1 awaited + 8 retries = 9 attempts)
200 {has_access:false} → still gated to false

Biome lint clean on the changed files; no new type errors in auth.ts.

Automatic notifications

Publish to changelog?
Alert Sales and Marketing teams?

Created with PostHog Code from a Slack thread

github-actions · 2026-06-23T14:34:40Z

React Doctor found no issues in the changed files. 🎉

_{Reviewed by React Doctor for commit 5ee6da1.}

greptile-apps · 2026-06-23T14:48:40Z

_{Reviews (1): Last reviewed commit: "fix(auth): don't eject users to invite s..." | Re-trigger Greptile}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 171b133c82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

greptile-apps · 2026-06-23T15:29:05Z

_{Reviews (2): Last reviewed commit: "fix(auth): retry code-access check with ..." | Re-trigger Greptile}

pauldambra

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review: paul-reviewer, xp-reviewer, security-audit (qa-team skipped — not available in this environment). See inline comments.

pauldambra · 2026-06-23T15:31:05Z

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review: paul-reviewer, xp-reviewer, security-audit (qa-team skipped — not available in this environment)

Verdict: ⚠️ REQUEST CHANGES

One real correctness bug (a stale background retry loop can clobber a freshly-granted true back to false and eject an active user — the exact failure this PR exists to prevent), plus a cluster of expressiveness/test nits around the budget+prime loop. Security found no gate-bypass.

Key findings

🟠 HIGH — runCodeAccessRetry fail-closed race: the single-flight guard stops a new loop, not an in-flight one. If a token-refresh checkCodeAccessOnce resolves true while an older loop is still spinning, the old loop's budget-exhaustion still writes hasCodeAccess: false. Re-check or cancel-on-resolve before failing closed.
🟡 MEDIUM — Redundant stop conditions: the 90s budget + the prime array overlap, making the Math.min(attempt, len-1) clamp dead code (loop always stops at the array end). Pick one policy.
🟡 MEDIUM — sleepWithBackoff(0, {initialDelayMs}) misused as a plain sleep; reads as backoff but the growth lives in the prime array.
🟡 MEDIUM (convergent: paul + xp) — toBe(9) is a magic number coupled to the prime list, and the mocked sleepWithBackoff means the budget/timing is never actually exercised.
🟢 LOW — "~90s" is really ~77s of backoff; 401/403 ride the same retry path as outages, with no metric on fail-closed denials.

Convergence

paul + xp independently flagged the test's toBe(9) magic number / untested timing — highest-confidence nit.
paul + xp both circled the budget-vs-prime-array design (paul on the comment accuracy, xp on the redundant guard / dead clamp).

Reviewer summaries

Reviewer	Assessment
👤 paul	right shape (sync happy-path + background retry + fail-closed), but close the stale-loop race before merge; rest are nits.
📐 xp	correctness holds; main weakness is expressiveness — collapse the two stop conditions to one and give the timing a real test.
🛡 security-audit	no bypass: every success reflects the live server answer, fail-open is a bounded ~90s retention of a genuine prior grant, and it deterministically fails closed.

Automated by QA Swarm — not a human review

pauldambra · 2026-06-23T15:36:52Z

Note

🤖 Automated comment by PR Shepherd — not written by a human

Triaged the QA Swarm review and pushed fixes in 1c984415:

🟠 HIGH race — fixed. Each authoritative check bumps a sequence; the background retry loop bails if superseded (or the session ended) before it can write state, so a stale loop can no longer clobber a fresh true → false.
🟡 Redundant budget + dead clamp — fixed. Dropped the 90s budget; the loop now just walks the prime schedule (~77s), removing the unreachable Math.min clamp.
🟡 sleepWithBackoff(0,…) misuse — fixed. Added a real sleep() to @posthog/shared and use it.
🟡 Magic 9 / untested timing — fixed. The count is now derived/commented, and a new assertion checks the actual prime-backoff schedule is exercised; added a regression test for the race.
🟢 "~90s" comment — fixed (now states ~77s).
🟢 401/403 share the outage path + no fail-closed metric — deferred (judgement calls): short-circuiting auth-class statuses and adding a denial counter are worth doing but belong in a follow-up, not this fix.

security-audit found no gate-bypass, so nothing to action there.

stamphog

This PR touches core auth logic and was denied by automated gates for matching the auth deny-list. While the sequence-guard approach to prevent stale retries looks reasonable and tests were added, auth changes require a human reviewer to validate the retry/fail-closed behavior and the unresolved concern about 401/403 being treated identically to transient outages.

…ck failure updateCodeAccessFromSession ran on every token refresh and treated any non-success response (403, 5xx, timeout, network error) as a definitive "no access", flipping hasCodeAccess to false and bouncing authenticated users to the invite screen mid-session — e.g. during a brief API outage. Now only a definitive 200 response changes the gate. Transient failures are retried with backoff and, if they persist, the last known value is preserved (a granted user stays in the app; a never-determined user stays at null, showing the "Checking access" spinner) instead of collapsing to false. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542

…ter ~90s Extends the access-check resilience: one awaited attempt keeps the healthy path synchronous, and on a transient failure we retry in the background with prime-number backoff (2,3,5,7,11,13,17,19,23s) up to a ~90s budget instead of ejecting the user. The current value is preserved across the retry window so a brief outage no longer bounces active users to the invite screen. Once the budget is exhausted without a definitive 200, we now fail closed (hasCodeAccess=false) rather than preserving access indefinitely — so the gate can't be bypassed by keeping the client offline. The per-attempt fetch timeout for this check is tightened to 10s, and the background loop is decoupled from the token-refresh path so it can't stall other API calls. Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542

Addresses qa-swarm review on the PR: - Race (HIGH): a background retry loop from an earlier check could fail closed and clobber a fresher `hasCodeAccess: true` set by a later token-refresh check, ejecting an active user. Each authoritative check now bumps a sequence and the retry loop bails if it's been superseded (or the session ended) before mutating state. - Collapse the redundant 90s budget + prime-array into a single stop condition: the loop now just walks the prime schedule (~77s total), which removes the unreachable Math.min clamp and the ~90s-vs-~77s comment drift. - Replace the sleepWithBackoff(0, …) plain-sleep hack with a real sleep() helper added to @posthog/shared. - Tests: de-magic the call-count assertion, assert the actual prime-backoff schedule is exercised, and add a regression test for the stale-loop race. Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542

Addresses a greptile review note: the existing resilience tests drove the non-happy path with non-2xx responses only, leaving the checkCodeAccessOnce catch branch (thrown network errors — timeouts, DNS failures, fetch rejections) unexercised. Adds a test where the check throws, asserting it's treated as a transient failure (retried, then failed closed). Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542

Per review follow-up: a 401 from the code-access check means the session token is rejected, so retrying with it is pointless — hard-stop and log the user out so they re-authenticate, rather than retrying for ~77s and dropping them on the fail-closed invite screen. A 403 stays in the transient/retry path: during an outage it isn't a reliable "your access is gone" signal. Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542

pauldambra · 2026-06-25T18:59:10Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

pauldambra

QA Swarm review complete. See inline comments.

pauldambra · 2026-06-25T19:07:56Z

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review: qa-team (specialists + generalists), paul-reviewer, xp-reviewer, security-audit

Verdict: 💬 APPROVE WITH NITS

A careful, well-tested resilience fix to a hot-path auth gate. The synchronous-healthy / background-retry split is the right shape, the monotonic codeAccessCheckSeq is a clean lock-free supersession primitive, the security posture is correct, and the 7 new tests assert the why (no clobber, backoff actually exercised) not just the what. No CRITICAL or HIGH findings. The follow-ups are observability- and polish-shaped, none blocking.

Key findings

🟡 MEDIUM — Fail-closed eviction only logs at warn, with no recovery signal — under a sustained partial outage this degrades quietly into the very user-ejection it prevents, ~100s later. Wants an alertable log/event (and ideally a flag/% rollout + a "reverifying" UI state so the flip isn't silent). auth.ts:998
🟡 MEDIUM — await this.logout() on a 401 fires inside the shared refreshPromise; a concurrent getValidAccessToken() caller can get a token for a session that was just torn down. auth.ts:960
🟡 MEDIUM — Duplicated staleness guard worth extracting, and it hides a final-iteration ordering window: a fresh grant landing during the last checkCodeAccessOnce isn't covered by the existing supersession test. auth.ts:991
🟢 LOW — Backoff comments say "~77s" but the primes sum to 100s (commit says "~90s") — three numbers for one budget. auth.ts:984
🟢 LOW — 403-is-transient rests entirely on the backend contract "revocation is always a 200, never a 403" — worth confirming/pinning. auth.ts:957
🟢 LOW — boolean return wants a named type ("settled" | "retry"), which deletes the paraphrasing comment. auth.ts:937
🟢 LOW — A malformed/empty 200 body fails the user closed for that cycle instead of being treated as transient. auth.ts:971
⚪ NIT — sleep duplicates sleepWithBackoff's setTimeout body; let sleepWithBackoff delegate. backoff.ts:36

Convergence

Fail-closed observability / silent eviction — flagged independently by paul and qa-team/data-integrity.
403-is-transient backend-contract coupling — paul and qa-team/security.
Budget number wrong (~77s vs 100s) — paul and qa-team/performance (arithmetic verified: 2+3+5+7+11+13+17+19+23 = 100).
await logout() mid-refresh — qa-team/reliability and xp.
sleep duplicates sleepWithBackoff — xp, qa-team, and paul.

Reviewer summaries

Reviewer	Assessment
🔍 qa-team	Careful, well-tested fix; no CRITICAL/HIGH; main gaps are observability on the fail-closed path and a mid-refresh `logout()` race.
👤 paul	Really nice change and strong tests; ship it once the silent invite-screen flip has some UI feedback + observability, and the 403 contract is confirmed.
📐 xp	Passes all four rules with only minor once-and-only-once / primitive-obsession nicks; most of the added comments are genuine why and should stay.
🛡 security-audit	0 findings — the gate can only open on a genuine backend allow; no failure path sets it open, the 401-logout invalidates the session, and seq/`!session` guards close the race.

Automated by QA Swarm — not a human review

Correct retry-budget comment from ~77s to ~100s (2+3+5+7+11+13+17+19+23=100) in two locations in auth.ts. Make sleepWithBackoff delegate to sleep so the Promise/setTimeout body is expressed OnceAndOnlyOnce. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc

stamphog

Gates denied this PR because it touches auth logic, which requires human review. Several unresolved inline comments on the current head raise substantive concerns: a potential race where a 401 mid-refresh calls logout() inside the shared refreshPromise (which could leave auth state inconsistent), the assumption that 403 is always transient rather than definitive access revocation, and a staleness-check ordering issue inside the retry loop. These need a human auth reviewer to validate before merge.

Resolves the deferred qa-swarm findings on the code-access check: - guard the shared refresh against a 401 logout landing mid-sync, so an awaited refresh rejects instead of handing back a torn-down session's token - model the check result as a named CodeAccessCheckOutcome instead of a bare boolean, removing the paraphrasing comment - treat an unreadable 200 body as a transient failure (retry) rather than gating the user on garbage - extract codeAccessCheckIsStale for the duplicated supersession guard Adds tests for the mid-sync-logout rejection, the unreadable-200 path, and honouring a success on the final retry. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc

stamphog

This PR modifies core authentication retry logic and session management, which triggered the auth deny-list gate. While all inline review comments appear resolved and the author iterated thoughtfully through multiple rounds of feedback, auth changes require human review before auto-approval — the deny-list exists precisely for this category of change.

…t a counter Replace the monotonic codeAccessCheckSeq counter (and the codeAccessCheckIsStale helper) with a plain reference check against the session the retry loop was started for. A token refresh or logout already swaps this.session for a new object (or null), so `this.session !== session` is a complete supersession signal -- the same expression the mid-sync-logout guard already uses. No extra mutable bookkeeping state. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc

… branches) - Replace the bespoke 9-prime delay array with the existing exponential sleepWithBackoff helper + an attempt count, matching the REFRESH_BACKOFF pattern already in this class. The prime "anti-thundering-herd" rationale didn't hold (identical client schedules still align; only jitter de-syncs), and desktop clients already refresh on staggered wall-clocks. - Drop the now-unused fixed-delay `sleep` primitive this PR had added to @posthog/shared; sleepWithBackoff goes back to its inline timer. - Parse the check-access response with a zod schema (schemas.ts) instead of an unchecked cast, per the runtime-boundary convention. - Drop the speculative unreadable-200 retry branch (YAGNI). - Trim comments that restated the code; keep the why. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc

greptile-apps Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread packages/core/src/auth/auth.test.ts Outdated

Comment thread packages/core/src/auth/auth.test.ts

pauldambra marked this pull request as ready for review June 23, 2026 15:21

chatgpt-codex-connector Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread packages/core/src/auth/auth.ts Outdated

pauldambra commented Jun 23, 2026

View reviewed changes

pauldambra added the Stamphog This will request an autostamp by stamphog on small changes label Jun 23, 2026 — with PostHog

stamphog Bot reviewed Jun 23, 2026

View reviewed changes

stamphog Bot removed the Stamphog This will request an autostamp by stamphog on small changes label Jun 23, 2026

pauldambra requested a review from a team June 23, 2026 15:58

pauldambra added 5 commits June 25, 2026 19:58

pauldambra force-pushed the posthog-code/fix-invite-screen-transient-access-check branch from fe45fb0 to c0fa360 Compare June 25, 2026 18:58

pauldambra commented Jun 25, 2026

View reviewed changes

pauldambra mentioned this pull request Jun 25, 2026

fix: address qa-swarm backoff comment accuracy and sleep delegation #2944

Merged

2 tasks

pauldambra added the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026

stamphog Bot reviewed Jun 25, 2026

View reviewed changes

stamphog Bot removed the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026

pauldambra added the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026

stamphog Bot reviewed Jun 25, 2026

View reviewed changes

stamphog Bot removed the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026

pauldambra added 2 commits June 25, 2026 22:59

Uh oh!

Conversation

pauldambra commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

How did you test this?

Automatic notifications

Uh oh!

github-actions Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 23, 2026

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

greptile-apps Bot commented Jun 23, 2026

Uh oh!

pauldambra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pauldambra commented Jun 23, 2026

Verdict: ⚠️ REQUEST CHANGES

Key findings

Convergence

Reviewer summaries

Uh oh!

pauldambra commented Jun 23, 2026

Uh oh!

stamphog Bot left a comment

Choose a reason for hiding this comment

Uh oh!

pauldambra commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pauldambra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pauldambra commented Jun 25, 2026

Verdict: 💬 APPROVE WITH NITS

Key findings

Convergence

Reviewer summaries

Uh oh!

stamphog Bot left a comment

Choose a reason for hiding this comment

Uh oh!

stamphog Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pauldambra commented Jun 23, 2026 •

edited

Loading

github-actions Bot commented Jun 23, 2026 •

edited

Loading

pauldambra commented Jun 25, 2026 •

edited

Loading