Fix/2462 fetch proxy deadlock#2489
Closed
krichprollsch wants to merge 2 commits into
Closed
Conversation
WsConnection.send used to switch the socket to blocking mode on
EWOULDBLOCK, with a comment claiming this 'should virtually never
happen'. Under CDP Fetch.enable + --http-proxy it is routine: the
CONNECT+TLS proxy round-trip means many subresources are in flight
simultaneously, lightpanda emits a flood of Fetch.requestPaused events,
the kernel send buffer fills, lightpanda blocks on write, and
puppeteer's matching Fetch.continueRequest replies pile up in
lightpanda's TCP recv buffer (which lightpanda can't drain because it
is blocked on the write). Both peers wedge until the client times out.
Other contributing problems all collapse paused-intercept sites where
an outer loop polls without ever draining the CDP socket, OR where
disconnect-time cleanup re-enters JS through paths the runtime can no
longer satisfy:
* HttpClient.perform skipped the CDP socket poll entirely whenever
processMessages() returned non-empty, so a steady stream of HTTP
completions could starve CDP reads.
* ScriptManagerBase.waitForImport spun on `client.tick(200)` and
discarded the .cdp_socket return, so a script `import()` whose
request was paused at the InterceptionLayer hung forever.
* BrowserContext.deinit aborted pending intercepts via `transfer.abort`,
which fired XHR/script error_callback chains into a half-torn-down
V8 context (the inspector had already been stopped two lines above).
* Headers.deinit was non-idempotent, so a value-copied Headers (the
hazard RobotsLayer documents) double-freed its curl_slist on the
second deinit; the symptom was an "incorrect alignment" panic
inside ZigToCurlAllocator.free.
* Transfer.deinit was non-idempotent, so a cascade out of
error_callback (e.g. Script.errorCallback -> manager.evaluate() ->
JS execution -> Frame.deinit -> abortOwner -> Transfer.kill ->
Transfer.deinit) reached `arena_pool.release` twice on the same
arena.
Coordinated changes:
* src/network/WsConnection.zig: On EWOULDBLOCK, instead of switching
to a blocking write, poll for both POLLOUT and POLLIN. While waiting
for write space, drain any incoming bytes into the reader buffer
(without dispatching - that would re-enter send and recurse). Adds
tryRead/bufferedBytes accessors.
* src/browser/HttpClient.zig:
- Add has_buffered_input to CDPClient. In perform(), return
.cdp_socket when buffered input exists, and always do at least a
non-blocking poll on the CDP socket so HTTP completions can no
longer starve CDP reads.
- Make Transfer.deinit idempotent by claiming ownership through
`client.transfers.remove(self.id)`. Second deinits (cascades out
of error_callback) early-return.
- Make `Transfer.kill` public (was `fn`) so BrowserContext.deinit
can use it.
- Tighten RequestParams.deinit / Request.deinit to take `*` instead
of `*const` so they can call into `Headers.deinit` (now mutating).
* src/network/http.zig: Headers.deinit now nulls out `self.headers`
after `curl_slist_free_all`, so a second deinit is a no-op. Without
this guard a value-copied Headers double-frees the curl_slist (the
hazard RobotsLayer's call site already documents).
* src/browser/ScriptManagerBase.zig: waitForImport now drains pending
CDP messages on every iteration (matching the syncRequest pattern)
and re-fetches the imported_modules entry per iteration. The cached
entry was a use-after-free risk because the CDP-drain step above
re-enters JS, and a transitively-imported module's preloadImport()
-> getOrPut() can rehash the map and invalidate the prior entry
pointer.
* src/cdp/CDP.zig:
- Wire hasBufferedInput.
- Replace read() with tryRead() in readSocket and tolerate the
no_new_data case so we still process messages drained during a
backpressured send.
- BrowserContext.deinit aborts pending intercepts via
`transfer.kill` instead of `transfer.abort`. `kill` fires
shutdown_callback (or no-op for transfers without one), avoiding
error_callback's re-entry into JS through XHR/script error
handlers - those crash because the V8 context and inspector this
BC owns have either been torn down already or are about to be.
Verified end-to-end against a puppeteer + http-proxy reproducer:
| URL | before | after |
|----------------------------|----------------------|------------------|
| example.com | OK 124ms | OK 117ms |
| github.com | HANG 20s (12/82) | OK 1410ms (82/82)|
| shopify.com | HANG 20s (1/4) | OK 1973ms (66/66)|
| allbirds.com | HANG 20s (12/53) | OK 3944ms (372) |
| allbirds wool-runners PDP | HANG 20s (12/53) | OK 6286ms (459) |
(nike.com still doesn't reach `load` event but all 68 continueRequests
process cleanly - the remaining stall is third-party widgets keeping
the page in `loading` state, not the CDP/HTTP deadlock this PR fixes.)
Lightpanda survives 18 back-to-back navigation runs across the matrix
above (3 per URL) without crashing.
Fixes #2462.
Member
Author
|
Integration test: 176 |
Member
Author
|
closed in favor of #2495 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
rebased version of #2468