Skip to content

Fix/2462 fetch proxy deadlock#2489

Closed
krichprollsch wants to merge 2 commits into
mainfrom
fix/2462-fetch-proxy-deadlock
Closed

Fix/2462 fetch proxy deadlock#2489
krichprollsch wants to merge 2 commits into
mainfrom
fix/2462-fetch-proxy-deadlock

Conversation

@krichprollsch
Copy link
Copy Markdown
Member

rebased version of #2468

staylor and others added 2 commits May 18, 2026 11:30
WsConnection.send used to switch the socket to blocking mode on
EWOULDBLOCK, with a comment claiming this 'should virtually never
happen'. Under CDP Fetch.enable + --http-proxy it is routine: the
CONNECT+TLS proxy round-trip means many subresources are in flight
simultaneously, lightpanda emits a flood of Fetch.requestPaused events,
the kernel send buffer fills, lightpanda blocks on write, and
puppeteer's matching Fetch.continueRequest replies pile up in
lightpanda's TCP recv buffer (which lightpanda can't drain because it
is blocked on the write). Both peers wedge until the client times out.

Other contributing problems all collapse paused-intercept sites where
an outer loop polls without ever draining the CDP socket, OR where
disconnect-time cleanup re-enters JS through paths the runtime can no
longer satisfy:

* HttpClient.perform skipped the CDP socket poll entirely whenever
  processMessages() returned non-empty, so a steady stream of HTTP
  completions could starve CDP reads.
* ScriptManagerBase.waitForImport spun on `client.tick(200)` and
  discarded the .cdp_socket return, so a script `import()` whose
  request was paused at the InterceptionLayer hung forever.
* BrowserContext.deinit aborted pending intercepts via `transfer.abort`,
  which fired XHR/script error_callback chains into a half-torn-down
  V8 context (the inspector had already been stopped two lines above).
* Headers.deinit was non-idempotent, so a value-copied Headers (the
  hazard RobotsLayer documents) double-freed its curl_slist on the
  second deinit; the symptom was an "incorrect alignment" panic
  inside ZigToCurlAllocator.free.
* Transfer.deinit was non-idempotent, so a cascade out of
  error_callback (e.g. Script.errorCallback -> manager.evaluate() ->
  JS execution -> Frame.deinit -> abortOwner -> Transfer.kill ->
  Transfer.deinit) reached `arena_pool.release` twice on the same
  arena.

Coordinated changes:

* src/network/WsConnection.zig: On EWOULDBLOCK, instead of switching
  to a blocking write, poll for both POLLOUT and POLLIN. While waiting
  for write space, drain any incoming bytes into the reader buffer
  (without dispatching - that would re-enter send and recurse). Adds
  tryRead/bufferedBytes accessors.

* src/browser/HttpClient.zig:
  - Add has_buffered_input to CDPClient. In perform(), return
    .cdp_socket when buffered input exists, and always do at least a
    non-blocking poll on the CDP socket so HTTP completions can no
    longer starve CDP reads.
  - Make Transfer.deinit idempotent by claiming ownership through
    `client.transfers.remove(self.id)`. Second deinits (cascades out
    of error_callback) early-return.
  - Make `Transfer.kill` public (was `fn`) so BrowserContext.deinit
    can use it.
  - Tighten RequestParams.deinit / Request.deinit to take `*` instead
    of `*const` so they can call into `Headers.deinit` (now mutating).

* src/network/http.zig: Headers.deinit now nulls out `self.headers`
  after `curl_slist_free_all`, so a second deinit is a no-op. Without
  this guard a value-copied Headers double-frees the curl_slist (the
  hazard RobotsLayer's call site already documents).

* src/browser/ScriptManagerBase.zig: waitForImport now drains pending
  CDP messages on every iteration (matching the syncRequest pattern)
  and re-fetches the imported_modules entry per iteration. The cached
  entry was a use-after-free risk because the CDP-drain step above
  re-enters JS, and a transitively-imported module's preloadImport()
  -> getOrPut() can rehash the map and invalidate the prior entry
  pointer.

* src/cdp/CDP.zig:
  - Wire hasBufferedInput.
  - Replace read() with tryRead() in readSocket and tolerate the
    no_new_data case so we still process messages drained during a
    backpressured send.
  - BrowserContext.deinit aborts pending intercepts via
    `transfer.kill` instead of `transfer.abort`. `kill` fires
    shutdown_callback (or no-op for transfers without one), avoiding
    error_callback's re-entry into JS through XHR/script error
    handlers - those crash because the V8 context and inspector this
    BC owns have either been torn down already or are about to be.

Verified end-to-end against a puppeteer + http-proxy reproducer:

| URL                        | before               | after            |
|----------------------------|----------------------|------------------|
| example.com                | OK 124ms             | OK 117ms         |
| github.com                 | HANG 20s (12/82)     | OK 1410ms (82/82)|
| shopify.com                | HANG 20s (1/4)       | OK 1973ms (66/66)|
| allbirds.com               | HANG 20s (12/53)     | OK 3944ms (372)  |
| allbirds wool-runners PDP  | HANG 20s (12/53)     | OK 6286ms (459)  |

(nike.com still doesn't reach `load` event but all 68 continueRequests
process cleanly - the remaining stall is third-party widgets keeping
the page in `loading` state, not the CDP/HTTP deadlock this PR fixes.)

Lightpanda survives 18 back-to-back navigation runs across the matrix
above (3 per URL) without crashing.

Fixes #2462.
@krichprollsch
Copy link
Copy Markdown
Member Author

Integration test: 176

@krichprollsch
Copy link
Copy Markdown
Member Author

closed in favor of #2495

@github-actions github-actions Bot locked and limited conversation to collaborators May 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants