Skip to content

fix(auth): don't eject users to invite screen on transient access-check failure#2862

Open
pauldambra wants to merge 9 commits into
mainfrom
posthog-code/fix-invite-screen-transient-access-check
Open

fix(auth): don't eject users to invite screen on transient access-check failure#2862
pauldambra wants to merge 9 commits into
mainfrom
posthog-code/fix-invite-screen-transient-access-check

Conversation

@pauldambra

@pauldambra pauldambra commented Jun 23, 2026

Copy link
Copy Markdown
Member

Problem

People actively using PostHog Code — already past the paywall, mid-session — were suddenly bounced to the invite-code screen during brief API outages, then recovered on their own. A quit/reopen also fixed it, which pointed at a transient failure rather than real loss of access.

Root cause: code access (hasCodeAccess) is never persisted. It's re-fetched from GET /api/code/invites/check-access/ on every token refresh (≈hourly, piggybacking on the next API call after the token enters its 60s expiry skew) via updateCodeAccessFromSession. That function treated any non-success — a 403, a 5xx, a timeout, a network error — as a definitive "no access" and set hasCodeAccess = false, which is exactly the condition that renders the invite screen. A single bad response during an outage ejected authenticated users. Confirmed against the backend (products/tasks/backend/.../api.py CodeInviteViewSet.check_access): genuine denial is always 200 {has_access: false}; non-2xx is only auth failure / throttle / outage, never a "no access" signal.

Changes

updateCodeAccessFromSession now treats the check as resilient with a bounded fail-closed:

  • Healthy path stays synchronous — one awaited attempt resolves the gate before the app renders.
  • Transient failures don't eject the user — on a non-2xx / network error / timeout we keep the current value and retry in the background with prime-number backoff (2, 3, 5, 7, 11, 13, 17, 19, 23s) up to a ~90s budget. Primes keep many clients' retries from lining up after an outage. The background loop is decoupled from the token-refresh path so it can't stall other API calls.
  • Fail closed after the budget — if we still can't get a definitive 200 within ~90s, hasCodeAccess is set to false. A sustained inability to call home revokes access, so the gate can't be bypassed by keeping the client offline.
  • Per-attempt fetch timeout for this check tightened from 30s → 10s.
  • Genuine "no access" (200 {has_access:false}) still gates immediately on the first attempt.

Net effect: a blip no longer kicks real users, but you can't game the invite by going offline.

How did you test this?

Unit tests in auth.test.ts (39 passing, pnpm --filter @posthog/core exec vitest run src/auth/auth.test.ts):

  • transient failures then 200 → recovers to access in the background, never flips to false
  • persistent failure → fails closed (false) after exhausting the prime-backoff budget (1 awaited + 8 retries = 9 attempts)
  • 200 {has_access:false} → still gated to false

Biome lint clean on the changed files; no new type errors in auth.ts.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Created with PostHog Code from a Slack thread

@github-actions

github-actions Bot commented Jun 23, 2026

Copy link
Copy Markdown

React Doctor found no issues in the changed files. 🎉

Reviewed by React Doctor for commit 5ee6da1.

@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Reviews (1): Last reviewed commit: "fix(auth): don't eject users to invite s..." | Re-trigger Greptile

Comment thread packages/core/src/auth/auth.test.ts Outdated
Comment thread packages/core/src/auth/auth.test.ts
@pauldambra pauldambra marked this pull request as ready for review June 23, 2026 15:21

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 171b133c82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/core/src/auth/auth.ts Outdated
@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Reviews (2): Last reviewed commit: "fix(auth): retry code-access check with ..." | Re-trigger Greptile

@pauldambra pauldambra left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review: paul-reviewer, xp-reviewer, security-audit (qa-team skipped — not available in this environment). See inline comments.

Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.test.ts Outdated
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts

Copy link
Copy Markdown
Member Author

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review: paul-reviewer, xp-reviewer, security-audit (qa-team skipped — not available in this environment)

Verdict: ⚠️ REQUEST CHANGES

One real correctness bug (a stale background retry loop can clobber a freshly-granted true back to false and eject an active user — the exact failure this PR exists to prevent), plus a cluster of expressiveness/test nits around the budget+prime loop. Security found no gate-bypass.

Key findings

  • 🟠 HIGHrunCodeAccessRetry fail-closed race: the single-flight guard stops a new loop, not an in-flight one. If a token-refresh checkCodeAccessOnce resolves true while an older loop is still spinning, the old loop's budget-exhaustion still writes hasCodeAccess: false. Re-check or cancel-on-resolve before failing closed.
  • 🟡 MEDIUM — Redundant stop conditions: the 90s budget + the prime array overlap, making the Math.min(attempt, len-1) clamp dead code (loop always stops at the array end). Pick one policy.
  • 🟡 MEDIUMsleepWithBackoff(0, {initialDelayMs}) misused as a plain sleep; reads as backoff but the growth lives in the prime array.
  • 🟡 MEDIUM (convergent: paul + xp)toBe(9) is a magic number coupled to the prime list, and the mocked sleepWithBackoff means the budget/timing is never actually exercised.
  • 🟢 LOW — "~90s" is really ~77s of backoff; 401/403 ride the same retry path as outages, with no metric on fail-closed denials.

Convergence

  • paul + xp independently flagged the test's toBe(9) magic number / untested timing — highest-confidence nit.
  • paul + xp both circled the budget-vs-prime-array design (paul on the comment accuracy, xp on the redundant guard / dead clamp).

Reviewer summaries

Reviewer Assessment
👤 paul right shape (sync happy-path + background retry + fail-closed), but close the stale-loop race before merge; rest are nits.
📐 xp correctness holds; main weakness is expressiveness — collapse the two stop conditions to one and give the timing a real test.
🛡 security-audit no bypass: every success reflects the live server answer, fail-open is a bounded ~90s retention of a genuine prior grant, and it deterministically fails closed.

Automated by QA Swarm — not a human review

Copy link
Copy Markdown
Member Author

Note

🤖 Automated comment by PR Shepherd — not written by a human

Triaged the QA Swarm review and pushed fixes in 1c984415:

  • 🟠 HIGH racefixed. Each authoritative check bumps a sequence; the background retry loop bails if superseded (or the session ended) before it can write state, so a stale loop can no longer clobber a fresh truefalse.
  • 🟡 Redundant budget + dead clampfixed. Dropped the 90s budget; the loop now just walks the prime schedule (~77s), removing the unreachable Math.min clamp.
  • 🟡 sleepWithBackoff(0,…) misusefixed. Added a real sleep() to @posthog/shared and use it.
  • 🟡 Magic 9 / untested timingfixed. The count is now derived/commented, and a new assertion checks the actual prime-backoff schedule is exercised; added a regression test for the race.
  • 🟢 "~90s" commentfixed (now states ~77s).
  • 🟢 401/403 share the outage path + no fail-closed metricdeferred (judgement calls): short-circuiting auth-class statuses and adding a denial counter are worth doing but belong in a follow-up, not this fix.

security-audit found no gate-bypass, so nothing to action there.

@pauldambra pauldambra added the Stamphog This will request an autostamp by stamphog on small changes label Jun 23, 2026 — with PostHog

@stamphog stamphog Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR touches core auth logic and was denied by automated gates for matching the auth deny-list. While the sequence-guard approach to prevent stale retries looks reasonable and tests were added, auth changes require a human reviewer to validate the retry/fail-closed behavior and the unresolved concern about 401/403 being treated identically to transient outages.

@stamphog stamphog Bot removed the Stamphog This will request an autostamp by stamphog on small changes label Jun 23, 2026
@pauldambra pauldambra requested a review from a team June 23, 2026 15:58
…ck failure

updateCodeAccessFromSession ran on every token refresh and treated any
non-success response (403, 5xx, timeout, network error) as a definitive
"no access", flipping hasCodeAccess to false and bouncing authenticated
users to the invite screen mid-session — e.g. during a brief API outage.

Now only a definitive 200 response changes the gate. Transient failures
are retried with backoff and, if they persist, the last known value is
preserved (a granted user stays in the app; a never-determined user
stays at null, showing the "Checking access" spinner) instead of
collapsing to false.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Generated-By: PostHog Code
Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
…ter ~90s

Extends the access-check resilience: one awaited attempt keeps the healthy
path synchronous, and on a transient failure we retry in the background with
prime-number backoff (2,3,5,7,11,13,17,19,23s) up to a ~90s budget instead of
ejecting the user. The current value is preserved across the retry window so a
brief outage no longer bounces active users to the invite screen.

Once the budget is exhausted without a definitive 200, we now fail closed
(hasCodeAccess=false) rather than preserving access indefinitely — so the gate
can't be bypassed by keeping the client offline. The per-attempt fetch timeout
for this check is tightened to 10s, and the background loop is decoupled from
the token-refresh path so it can't stall other API calls.

Generated-By: PostHog Code
Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
Addresses qa-swarm review on the PR:

- Race (HIGH): a background retry loop from an earlier check could fail closed
  and clobber a fresher `hasCodeAccess: true` set by a later token-refresh
  check, ejecting an active user. Each authoritative check now bumps a
  sequence and the retry loop bails if it's been superseded (or the session
  ended) before mutating state.
- Collapse the redundant 90s budget + prime-array into a single stop
  condition: the loop now just walks the prime schedule (~77s total), which
  removes the unreachable Math.min clamp and the ~90s-vs-~77s comment drift.
- Replace the sleepWithBackoff(0, …) plain-sleep hack with a real sleep()
  helper added to @posthog/shared.
- Tests: de-magic the call-count assertion, assert the actual prime-backoff
  schedule is exercised, and add a regression test for the stale-loop race.

Generated-By: PostHog Code
Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
Addresses a greptile review note: the existing resilience tests drove the
non-happy path with non-2xx responses only, leaving the checkCodeAccessOnce
catch branch (thrown network errors — timeouts, DNS failures, fetch
rejections) unexercised. Adds a test where the check throws, asserting it's
treated as a transient failure (retried, then failed closed).

Generated-By: PostHog Code
Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
Per review follow-up: a 401 from the code-access check means the session
token is rejected, so retrying with it is pointless — hard-stop and log the
user out so they re-authenticate, rather than retrying for ~77s and dropping
them on the fail-closed invite screen. A 403 stays in the transient/retry
path: during an outage it isn't a reliable "your access is gone" signal.

Generated-By: PostHog Code
Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
@pauldambra pauldambra force-pushed the posthog-code/fix-invite-screen-transient-access-check branch from fe45fb0 to c0fa360 Compare June 25, 2026 18:58

pauldambra commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

@pauldambra pauldambra left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Swarm review complete. See inline comments.

Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/core/src/auth/auth.ts Outdated
Comment thread packages/shared/src/backoff.ts Outdated
@pauldambra

Copy link
Copy Markdown
Member Author

Note

🤖 Automated comment by QA Swarm — not written by a human

Multi-perspective review: qa-team (specialists + generalists), paul-reviewer, xp-reviewer, security-audit

Verdict: 💬 APPROVE WITH NITS

A careful, well-tested resilience fix to a hot-path auth gate. The synchronous-healthy / background-retry split is the right shape, the monotonic codeAccessCheckSeq is a clean lock-free supersession primitive, the security posture is correct, and the 7 new tests assert the why (no clobber, backoff actually exercised) not just the what. No CRITICAL or HIGH findings. The follow-ups are observability- and polish-shaped, none blocking.

Key findings

  • 🟡 MEDIUM — Fail-closed eviction only logs at warn, with no recovery signal — under a sustained partial outage this degrades quietly into the very user-ejection it prevents, ~100s later. Wants an alertable log/event (and ideally a flag/% rollout + a "reverifying" UI state so the flip isn't silent). auth.ts:998
  • 🟡 MEDIUMawait this.logout() on a 401 fires inside the shared refreshPromise; a concurrent getValidAccessToken() caller can get a token for a session that was just torn down. auth.ts:960
  • 🟡 MEDIUM — Duplicated staleness guard worth extracting, and it hides a final-iteration ordering window: a fresh grant landing during the last checkCodeAccessOnce isn't covered by the existing supersession test. auth.ts:991
  • 🟢 LOW — Backoff comments say "~77s" but the primes sum to 100s (commit says "~90s") — three numbers for one budget. auth.ts:984
  • 🟢 LOW — 403-is-transient rests entirely on the backend contract "revocation is always a 200, never a 403" — worth confirming/pinning. auth.ts:957
  • 🟢 LOWboolean return wants a named type ("settled" | "retry"), which deletes the paraphrasing comment. auth.ts:937
  • 🟢 LOW — A malformed/empty 200 body fails the user closed for that cycle instead of being treated as transient. auth.ts:971
  • NITsleep duplicates sleepWithBackoff's setTimeout body; let sleepWithBackoff delegate. backoff.ts:36

Convergence

  • Fail-closed observability / silent eviction — flagged independently by paul and qa-team/data-integrity.
  • 403-is-transient backend-contract coupling — paul and qa-team/security.
  • Budget number wrong (~77s vs 100s) — paul and qa-team/performance (arithmetic verified: 2+3+5+7+11+13+17+19+23 = 100).
  • await logout() mid-refresh — qa-team/reliability and xp.
  • sleep duplicates sleepWithBackoff — xp, qa-team, and paul.

Reviewer summaries

Reviewer Assessment
🔍 qa-team Careful, well-tested fix; no CRITICAL/HIGH; main gaps are observability on the fail-closed path and a mid-refresh logout() race.
👤 paul Really nice change and strong tests; ship it once the silent invite-screen flip has some UI feedback + observability, and the 403 contract is confirmed.
📐 xp Passes all four rules with only minor once-and-only-once / primitive-obsession nicks; most of the added comments are genuine why and should stay.
🛡 security-audit 0 findings — the gate can only open on a genuine backend allow; no failure path sets it open, the 401-logout invalidates the session, and seq/!session guards close the race.

Automated by QA Swarm — not a human review

Correct retry-budget comment from ~77s to ~100s (2+3+5+7+11+13+17+19+23=100)
in two locations in auth.ts. Make sleepWithBackoff delegate to sleep so the
Promise/setTimeout body is expressed OnceAndOnlyOnce.

Generated-By: PostHog Code
Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc
@pauldambra pauldambra added the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026

@stamphog stamphog Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gates denied this PR because it touches auth logic, which requires human review. Several unresolved inline comments on the current head raise substantive concerns: a potential race where a 401 mid-refresh calls logout() inside the shared refreshPromise (which could leave auth state inconsistent), the assumption that 403 is always transient rather than definitive access revocation, and a staleness-check ordering issue inside the retry loop. These need a human auth reviewer to validate before merge.

@stamphog stamphog Bot removed the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026
Resolves the deferred qa-swarm findings on the code-access check:
- guard the shared refresh against a 401 logout landing mid-sync, so an
  awaited refresh rejects instead of handing back a torn-down session's token
- model the check result as a named CodeAccessCheckOutcome instead of a bare
  boolean, removing the paraphrasing comment
- treat an unreadable 200 body as a transient failure (retry) rather than
  gating the user on garbage
- extract codeAccessCheckIsStale for the duplicated supersession guard

Adds tests for the mid-sync-logout rejection, the unreadable-200 path, and
honouring a success on the final retry.

Generated-By: PostHog Code
Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc
@pauldambra pauldambra added the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026

@stamphog stamphog Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR modifies core authentication retry logic and session management, which triggered the auth deny-list gate. While all inline review comments appear resolved and the author iterated thoughtfully through multiple rounds of feedback, auth changes require human review before auto-approval — the deny-list exists precisely for this category of change.

@stamphog stamphog Bot removed the Stamphog This will request an autostamp by stamphog on small changes label Jun 25, 2026
…t a counter

Replace the monotonic codeAccessCheckSeq counter (and the codeAccessCheckIsStale
helper) with a plain reference check against the session the retry loop was
started for. A token refresh or logout already swaps this.session for a new
object (or null), so `this.session !== session` is a complete supersession
signal -- the same expression the mid-sync-logout guard already uses. No extra
mutable bookkeeping state.

Generated-By: PostHog Code
Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc
… branches)

- Replace the bespoke 9-prime delay array with the existing exponential
  sleepWithBackoff helper + an attempt count, matching the REFRESH_BACKOFF
  pattern already in this class. The prime "anti-thundering-herd" rationale
  didn't hold (identical client schedules still align; only jitter de-syncs),
  and desktop clients already refresh on staggered wall-clocks.
- Drop the now-unused fixed-delay `sleep` primitive this PR had added to
  @posthog/shared; sleepWithBackoff goes back to its inline timer.
- Parse the check-access response with a zod schema (schemas.ts) instead of an
  unchecked cast, per the runtime-boundary convention.
- Drop the speculative unreadable-200 retry branch (YAGNI).
- Trim comments that restated the code; keep the why.

Generated-By: PostHog Code
Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant