Fix intermittent listener test hangs (LISTEN/UNLISTEN deadlock, panic, flaky tests)#174
Merged
Conversation
…lock, end the message-forwarding task cleanly instead of panicking, and bound the listener tests with deterministic waits plus a pytest-timeout backstop.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
python/tests/test_listener.pyintermittently hung the GitHub runners (sometimes, not always) — never reproducible locally.Root causes
update_listen_queryheldchannel_callbacks.read+listen_query.writewhile takingis_listened.write, whileexecute_listenheldis_listened.writewhile takinglisten_query.read. Aclear_*/add_callbackrunning concurrently with the listen loop deadlocked both tasks — exactly the pattern thecleartests exercise.test_listener_asynciteratorawaited anasync forfed by a fire-and-forgetasyncio.create_task(notify(...))whose exception was swallowed — any failure left it waiting forever.listen()tests sentNOTIFYimmediately afterlisten(), racing the background task that issuesLISTENlazily.Changes
Rust (
src/driver/listener/core.rs)listen_querystring withapplied_channels: HashSet<String>.execute_listennow reconciles desired vs applied channels and issues realUNLISTEN+LISTEN(previously cleared channels were never unlistened on the backend).mark_subscriptions_dirtytakes onlyis_listened, so the lock order is nowclient → is_listened → channel_callbacks → applied_channelswith no cycle.panic!s on a broken connection — itbreaks out of a manualpoll_messageloop (also stops cleanly when the receiver is dropped on shutdown).Tests (
python/tests/test_listener.py)wait_until_listening()pollspg_listening_channels()on the listener's own connection to confirm the subscription before notifying (removes the race deterministically).anyio.fail_after; fixedasyncio.sleep(0.5)+ assert replaced with poll-until-condition helpers; the iterator's notify runs in ananyiotask group so failures propagate instead of being swallowed.CI backstop (
tox.ini,pyproject.toml)pytest-timeoutwithtimeout = 120, so any future hang fails fast with a traceback instead of blocking the runner.Verification
cargo clippy --all-features -- -W clippy::pedantic -D warningsandcargo fmt --checkclean.TOKIO_WORKER_THREADS=1+ CPU load with no hangs.test_ssl_mode, which needs CI certs): 261 passed.UNLISTENremoves channels frompg_listening_channels(), and that an abnormal backend termination no longer panics the worker thread.