Skip to content

[py][bidi] make the WebSocket transport thread-safe and event-driven#17682

Open
AutomatedTester wants to merge 2 commits into
trunkfrom
worktree-py-bidi-transport-p0
Open

[py][bidi] make the WebSocket transport thread-safe and event-driven#17682
AutomatedTester wants to merge 2 commits into
trunkfrom
worktree-py-bidi-transport-p0

Conversation

@AutomatedTester

Copy link
Copy Markdown
Member

Description

Foundation for safe parallel use of the Python BiDi API (driving multiple browsing contexts/tabs concurrently over one connection). This is the P0 "transport correctness & efficiency" step: make the single WebSocket safe and efficient under concurrent use without changing wire behaviour.

What changed in websocket_connection.py

  1. Busy-wait eliminated. execute() registers a per-request threading.Event before sending and blocks on event.wait(timeout); the receive thread sets it when the matching response arrives. Removes the CPU burn and the interval-sized latency floor on every command. The connection-open handshake uses the same pattern instead of polling.
  2. Shared state locked. _messages + per-request waiters under one lock; callbacks under another. No longer relying on the GIL for correctness.
  3. Single dispatcher thread. Events are queued and drained by one long-lived dispatcher thread instead of spawning a thread per event. Bounds thread usage to one regardless of event volume (no thread-per-event exhaustion), preserves event ordering, and keeps dispatch off the receive thread so a slow handler can't stall response routing.
  4. Callback exceptions surfaced. The dispatcher catches per callback, logs at ERROR with traceback, and keeps delivering. Command failures still raise synchronously in execute() (fail-fast preserved).
  5. Clean teardown. close() stops the dispatcher, clears handlers, and wakes any caller blocked on a response so it fails fast instead of waiting out the full timeout.

The (url, timeout, interval) constructor signature is unchanged (ABI); interval is now a no-op but retained.

Tests

New test/unit/selenium/webdriver/common/websocket_connection_tests.py (7 tests, small): single-response routing, 25-way concurrent execute() with responses fed back in reverse order (proves per-caller routing under contention), timeout, single-thread dispatch, callback-exception isolation + logging, handler clearing on close, and close waking blocked callers. The tests override only the network boundary (_start_ws); all transport logic runs for real.

Motivation and Context

Driving N tabs concurrently today means N threads making blocking busy-wait calls against unlocked dicts, with an unbounded thread spawned per event. This PR fixes the mechanical/transport side so in-process threading over one driver's contexts is safe and efficient. It's the enabling work for the per-context handle object and event-scoping work that follow.

Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

⚠️ This touches the BiDi wire transport (high-risk per AGENTS.md). Wire behaviour is unchanged — only correctness/efficiency — and is now covered by concurrency tests. Other bindings have analogous transports; parity is suggested as follow-up.

Replace the busy-wait response loop with per-request threading.Event waits,
lock the shared message/callback state, and dispatch events on a single
long-lived thread instead of spawning one thread per event.

This removes the CPU burn and latency floor on every command, bounds thread
usage so high event volume can no longer exhaust threads, preserves event
ordering, and surfaces callback exceptions instead of losing them on an
orphaned thread. close() now stops the dispatcher, clears handlers, and wakes
any caller still blocked on a response so it fails fast.

Wire behaviour is unchanged. Adds unit tests covering concurrent execute()
response routing, single-threaded dispatch, callback error isolation, and
teardown.
@selenium-ci selenium-ci added the C-py Python Bindings label Jun 15, 2026
The thread-safe transport no longer busy-waits, so command responses return
the instant they arrive instead of up to one poll interval later, and events
are delivered through a single dispatcher thread. Events and rendering effects
are therefore no longer guaranteed to have landed by the time the triggering
command returns -- which is the correct, documented contract.

Update the event and wheel-scroll tests to await their outcome instead of
asserting immediately:

- browsing_context event tests wait for the event to arrive; context_created
  cases assert on the specific context rather than event counts, since that
  event may report more than one context per creation.
- removal tests assert the post-removal context is never observed instead of
  comparing counts.
- wheel tests wait for the scroll position to settle before asserting.
@AutomatedTester

Copy link
Copy Markdown
Member Author

Follow-up (e92f13c): fix the BiDi tests that the faster transport exposed as racy.

Why they failed: the old transport busy-waited (so every command returned up to one poll interval after its response actually arrived) and dispatched each event on its own thread immediately. Tests relied on that incidental slack to assert event/DOM state on the line right after the triggering command. With the new transport, responses return instantly and events are delivered via the dispatcher thread, so they are no longer guaranteed to have landed by the time the command returns.

Contract going forward (documented here intentionally): BiDi events are delivered asynchronously — do not assume a handler has run by the time the triggering command returns; await the event. This matches the direction of the expect_* / wait_for_* helpers in ADRs 0001/0006.

Test changes (no production behavior change):

  • Event tests now WebDriverWait for the event to arrive. context_created cases assert on the specific context rather than event counts, since that event can report more than one context per creation (the prior positional [0]/[1] indexing was the actual source of the IndexError).
  • Handler-removal tests assert the post-removal context is never observed, instead of comparing counts.
  • Wheel-scroll tests wait for the scroll position to settle before asserting (scroll is applied asynchronously by the browser; no transport ever guaranteed it was rendered when performActions returned).

Verified locally on chrome-bidi at --runs_per_test=10 (previously flaked within a couple of runs; now green). bidi_network (blocking-interceptor pattern) and bidi_emulation (synchronous override reads) don't use the immediate-assert-after-event pattern and were left unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-py Python Bindings

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants