fix(auth): don't eject users to invite screen on transient access-check failure#2862
fix(auth): don't eject users to invite screen on transient access-check failure#2862pauldambra wants to merge 9 commits into
Conversation
|
React Doctor found no issues in the changed files. 🎉 Reviewed by React Doctor for commit |
|
Reviews (1): Last reviewed commit: "fix(auth): don't eject users to invite s..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 171b133c82
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Reviews (2): Last reviewed commit: "fix(auth): retry code-access check with ..." | Re-trigger Greptile |
pauldambra
left a comment
There was a problem hiding this comment.
Note
🤖 Automated comment by QA Swarm — not written by a human
Multi-perspective review: paul-reviewer, xp-reviewer, security-audit (qa-team skipped — not available in this environment). See inline comments.
|
Note 🤖 Automated comment by QA Swarm — not written by a human Multi-perspective review: paul-reviewer, xp-reviewer, security-audit (qa-team skipped — not available in this environment) Verdict:
|
| Reviewer | Assessment |
|---|---|
| 👤 paul | right shape (sync happy-path + background retry + fail-closed), but close the stale-loop race before merge; rest are nits. |
| 📐 xp | correctness holds; main weakness is expressiveness — collapse the two stop conditions to one and give the timing a real test. |
| 🛡 security-audit | no bypass: every success reflects the live server answer, fail-open is a bounded ~90s retention of a genuine prior grant, and it deterministically fails closed. |
Automated by QA Swarm — not a human review
|
Note 🤖 Automated comment by PR Shepherd — not written by a human Triaged the QA Swarm review and pushed fixes in
security-audit found no gate-bypass, so nothing to action there. |
There was a problem hiding this comment.
This PR touches core auth logic and was denied by automated gates for matching the auth deny-list. While the sequence-guard approach to prevent stale retries looks reasonable and tests were added, auth changes require a human reviewer to validate the retry/fail-closed behavior and the unresolved concern about 401/403 being treated identically to transient outages.
…ck failure updateCodeAccessFromSession ran on every token refresh and treated any non-success response (403, 5xx, timeout, network error) as a definitive "no access", flipping hasCodeAccess to false and bouncing authenticated users to the invite screen mid-session — e.g. during a brief API outage. Now only a definitive 200 response changes the gate. Transient failures are retried with backoff and, if they persist, the last known value is preserved (a granted user stays in the app; a never-determined user stays at null, showing the "Checking access" spinner) instead of collapsing to false. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
…ter ~90s Extends the access-check resilience: one awaited attempt keeps the healthy path synchronous, and on a transient failure we retry in the background with prime-number backoff (2,3,5,7,11,13,17,19,23s) up to a ~90s budget instead of ejecting the user. The current value is preserved across the retry window so a brief outage no longer bounces active users to the invite screen. Once the budget is exhausted without a definitive 200, we now fail closed (hasCodeAccess=false) rather than preserving access indefinitely — so the gate can't be bypassed by keeping the client offline. The per-attempt fetch timeout for this check is tightened to 10s, and the background loop is decoupled from the token-refresh path so it can't stall other API calls. Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
Addresses qa-swarm review on the PR: - Race (HIGH): a background retry loop from an earlier check could fail closed and clobber a fresher `hasCodeAccess: true` set by a later token-refresh check, ejecting an active user. Each authoritative check now bumps a sequence and the retry loop bails if it's been superseded (or the session ended) before mutating state. - Collapse the redundant 90s budget + prime-array into a single stop condition: the loop now just walks the prime schedule (~77s total), which removes the unreachable Math.min clamp and the ~90s-vs-~77s comment drift. - Replace the sleepWithBackoff(0, …) plain-sleep hack with a real sleep() helper added to @posthog/shared. - Tests: de-magic the call-count assertion, assert the actual prime-backoff schedule is exercised, and add a regression test for the stale-loop race. Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
Addresses a greptile review note: the existing resilience tests drove the non-happy path with non-2xx responses only, leaving the checkCodeAccessOnce catch branch (thrown network errors — timeouts, DNS failures, fetch rejections) unexercised. Adds a test where the check throws, asserting it's treated as a transient failure (retried, then failed closed). Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
Per review follow-up: a 401 from the code-access check means the session token is rejected, so retrying with it is pointless — hard-stop and log the user out so they re-authenticate, rather than retrying for ~77s and dropping them on the fail-closed invite screen. A 403 stays in the transient/retry path: during an outage it isn't a reliable "your access is gone" signal. Generated-By: PostHog Code Task-Id: 8d2fbac8-0fbd-493a-b267-eeb5ba2ba542
fe45fb0 to
c0fa360
Compare
This stack of pull requests is managed by Graphite. Learn more about stacking. |
pauldambra
left a comment
There was a problem hiding this comment.
QA Swarm review complete. See inline comments.
|
Note 🤖 Automated comment by QA Swarm — not written by a human Multi-perspective review: qa-team (specialists + generalists), paul-reviewer, xp-reviewer, security-audit Verdict: 💬 APPROVE WITH NITSA careful, well-tested resilience fix to a hot-path auth gate. The synchronous-healthy / background-retry split is the right shape, the monotonic Key findings
Convergence
Reviewer summaries
Automated by QA Swarm — not a human review |
Correct retry-budget comment from ~77s to ~100s (2+3+5+7+11+13+17+19+23=100) in two locations in auth.ts. Make sleepWithBackoff delegate to sleep so the Promise/setTimeout body is expressed OnceAndOnlyOnce. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc
There was a problem hiding this comment.
Gates denied this PR because it touches auth logic, which requires human review. Several unresolved inline comments on the current head raise substantive concerns: a potential race where a 401 mid-refresh calls logout() inside the shared refreshPromise (which could leave auth state inconsistent), the assumption that 403 is always transient rather than definitive access revocation, and a staleness-check ordering issue inside the retry loop. These need a human auth reviewer to validate before merge.
Resolves the deferred qa-swarm findings on the code-access check: - guard the shared refresh against a 401 logout landing mid-sync, so an awaited refresh rejects instead of handing back a torn-down session's token - model the check result as a named CodeAccessCheckOutcome instead of a bare boolean, removing the paraphrasing comment - treat an unreadable 200 body as a transient failure (retry) rather than gating the user on garbage - extract codeAccessCheckIsStale for the duplicated supersession guard Adds tests for the mid-sync-logout rejection, the unreadable-200 path, and honouring a success on the final retry. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc
There was a problem hiding this comment.
This PR modifies core authentication retry logic and session management, which triggered the auth deny-list gate. While all inline review comments appear resolved and the author iterated thoughtfully through multiple rounds of feedback, auth changes require human review before auto-approval — the deny-list exists precisely for this category of change.
…t a counter Replace the monotonic codeAccessCheckSeq counter (and the codeAccessCheckIsStale helper) with a plain reference check against the session the retry loop was started for. A token refresh or logout already swaps this.session for a new object (or null), so `this.session !== session` is a complete supersession signal -- the same expression the mid-sync-logout guard already uses. No extra mutable bookkeeping state. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc
… branches) - Replace the bespoke 9-prime delay array with the existing exponential sleepWithBackoff helper + an attempt count, matching the REFRESH_BACKOFF pattern already in this class. The prime "anti-thundering-herd" rationale didn't hold (identical client schedules still align; only jitter de-syncs), and desktop clients already refresh on staggered wall-clocks. - Drop the now-unused fixed-delay `sleep` primitive this PR had added to @posthog/shared; sleepWithBackoff goes back to its inline timer. - Parse the check-access response with a zod schema (schemas.ts) instead of an unchecked cast, per the runtime-boundary convention. - Drop the speculative unreadable-200 retry branch (YAGNI). - Trim comments that restated the code; keep the why. Generated-By: PostHog Code Task-Id: 47d056fa-5014-40c5-b330-d5c34d9d49bc

Problem
People actively using PostHog Code — already past the paywall, mid-session — were suddenly bounced to the invite-code screen during brief API outages, then recovered on their own. A quit/reopen also fixed it, which pointed at a transient failure rather than real loss of access.
Root cause: code access (
hasCodeAccess) is never persisted. It's re-fetched fromGET /api/code/invites/check-access/on every token refresh (≈hourly, piggybacking on the next API call after the token enters its 60s expiry skew) viaupdateCodeAccessFromSession. That function treated any non-success — a 403, a 5xx, a timeout, a network error — as a definitive "no access" and sethasCodeAccess = false, which is exactly the condition that renders the invite screen. A single bad response during an outage ejected authenticated users. Confirmed against the backend (products/tasks/backend/.../api.pyCodeInviteViewSet.check_access): genuine denial is always200 {has_access: false}; non-2xx is only auth failure / throttle / outage, never a "no access" signal.Changes
updateCodeAccessFromSessionnow treats the check as resilient with a bounded fail-closed:200within ~90s,hasCodeAccessis set tofalse. A sustained inability to call home revokes access, so the gate can't be bypassed by keeping the client offline.200 {has_access:false}) still gates immediately on the first attempt.Net effect: a blip no longer kicks real users, but you can't game the invite by going offline.
How did you test this?
Unit tests in
auth.test.ts(39 passing,pnpm --filter @posthog/core exec vitest run src/auth/auth.test.ts):falsefalse) after exhausting the prime-backoff budget (1 awaited + 8 retries = 9 attempts)200 {has_access:false}→ still gated tofalseBiome lint clean on the changed files; no new type errors in
auth.ts.Automatic notifications
Created with PostHog Code from a Slack thread