
Commit 72ccbe7

aryanputta and claude committed
tests: kill zombie IPC child processes after join timeout
If a spawned process gets stuck during CUDA context teardown (the Python 3.12 + CUDA 12.9.1 hang pattern from issue #2004), the existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That zombie holds open IPC handles, causing the ipc_memory_resource fixture's mr.close() to block indefinitely and tie up the runner for hours.

Kill any process that is still alive after its join timeout so the IPC handle is released before fixture teardown runs. The test still fails (exit code != 0 or completed == False), just quickly instead of hanging.

Fixes #2004

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 326d522 commit 72ccbe7

1 file changed

Lines changed: 18 additions & 1 deletion


cuda_core/tests/memory_ipc/test_send_buffers.py

@@ -39,6 +39,13 @@ def test_main(self, ipc_device, nmrs):
 
         # Wait for the child process.
         process.join(timeout=CHILD_TIMEOUT_SEC)
+        if process.is_alive():
+            # Child is stuck (CUDA context teardown can hang under certain
+            # driver/Python combos — see issue #2004). Kill it so the
+            # IPC handle is released and the fixture teardown doesn't
+            # block the runner for hours.
+            process.kill()
+            process.join()
         assert process.exitcode == 0
 
         # Verify that the buffers were modified.
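
The join-then-kill step in this hunk can be tried in isolation. The following is a minimal sketch, not code from the commit: `join_or_kill`, `_stuck_child`, and the 0.5 second timeout are hypothetical names and values, and a sleeping child stands in for one hung in CUDA context teardown. The "fork" start method is an assumption made only so the sketch runs as a plain script on POSIX; the start method used by the actual tests is not shown in the diff.

```python
import multiprocessing as mp
import time

CHILD_TIMEOUT_SEC = 0.5  # hypothetical value; the real constant is not shown in the diff


def _stuck_child():
    # Stand-in for a child that hangs during teardown (no CUDA involved here).
    time.sleep(60)


def join_or_kill(process, timeout):
    """Join with a timeout; if the child is still alive, kill and reap it.

    Returns True if the child exited on its own, False if it had to be killed.
    """
    process.join(timeout=timeout)
    if process.is_alive():
        process.kill()   # SIGKILL on POSIX
        process.join()   # reap the child so exitcode is populated
        return False
    return True


def run_demo():
    # Assumption: "fork" keeps the sketch runnable without a __main__ guard.
    ctx = mp.get_context("fork")
    proc = ctx.Process(target=_stuck_child)
    proc.start()
    exited_on_its_own = join_or_kill(proc, CHILD_TIMEOUT_SEC)
    return exited_on_its_own, proc.is_alive(), proc.exitcode


if __name__ == "__main__":
    print(run_demo())
```

After the kill, `exitcode` is non-zero, so an assertion such as `assert process.exitcode == 0` still fails, but only after the timeout has elapsed rather than after an indefinite hang.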
@@ -96,10 +103,20 @@ def test_main(self, ipc_device, ipc_memory_resource):
         proc_c.start()
 
         # Wait for C to signal completion then clean up.
-        event_c.wait(timeout=CHILD_TIMEOUT_SEC)
+        completed = event_c.wait(timeout=CHILD_TIMEOUT_SEC)
         event_b.set()  # b can finish now
         proc_b.join(timeout=CHILD_TIMEOUT_SEC)
         proc_c.join(timeout=CHILD_TIMEOUT_SEC)
+
+        # Kill any processes that are still alive. Without this, a child stuck
+        # in CUDA context teardown (issue #2004: Python 3.12 + CUDA 12.9.1)
+        # holds IPC handles and blocks fixture teardown indefinitely.
+        for p in (proc_b, proc_c):
+            if p.is_alive():
+                p.kill()
+                p.join()
+
+        assert completed, "process C did not complete within timeout"
         assert proc_b.exitcode == 0
         assert proc_c.exitcode == 0
 
