Commit df61e49

tests: kill zombie IPC child processes after join timeout
When Python 3.12 CI runs, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer's analysis of IPC buffer teardown gets stuck, so the child processes never exit.

The existing join(timeout=CHILD_TIMEOUT_SEC) returns, but it leaves the child alive, and that stuck child keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), the call blocks waiting for the handle to be released, tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue #2004 (always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process that is still alive so the IPC handle is released before fixture teardown. The tests still fail (the exit code is non-zero or completed is False), but in seconds rather than hours.

Fixes #2004
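The join-then-kill pattern described in the commit message can be sketched in isolation, without CUDA or IPC buffers. This is a minimal illustration, not the code from the commit: `join_or_kill`, `_hang`, and `TIMEOUT_SEC` are hypothetical names, and the sleeping child only stands in for a process stuck in teardown.

```python
import multiprocessing as mp
import time

TIMEOUT_SEC = 0.5  # hypothetical value; stands in for CHILD_TIMEOUT_SEC


def _hang():
    # Simulates a child that does not exit within the timeout.
    time.sleep(60)


def join_or_kill(process, timeout):
    """Join with a timeout, then kill the process if it is still alive.

    Returns True if the process exited on its own, False if it was killed.
    """
    process.join(timeout=timeout)
    if process.is_alive():
        process.kill()
        process.join()  # reap the killed child so that exitcode is set
        return False
    return True


if __name__ == "__main__":
    child = mp.Process(target=_hang)
    child.start()
    exited_on_its_own = join_or_kill(child, TIMEOUT_SEC)
    # The killed child reports a non-zero exit code, so an
    # `assert exitcode == 0` check still fails, just quickly.
    print(exited_on_its_own, child.exitcode)
```

On POSIX systems a child terminated by a signal reports a negative exit code (the negated signal number), which is why the existing `exitcode == 0` assertions keep failing after the kill.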
1 parent 326d522 commit df61e49

1 file changed

Lines changed: 18 additions & 1 deletion

cuda_core/tests/memory_ipc/test_send_buffers.py

@@ -39,6 +39,13 @@ def test_main(self, ipc_device, nmrs):
 
         # Wait for the child process.
         process.join(timeout=CHILD_TIMEOUT_SEC)
+        if process.is_alive():
+            # Child is stuck (CUDA context teardown can hang under certain
+            # driver/Python combos — see issue #2004). Kill it so the
+            # IPC handle is released and the fixture teardown doesn't
+            # block the runner for hours.
+            process.kill()
+            process.join()
         assert process.exitcode == 0
 
         # Verify that the buffers were modified.
@@ -96,10 +103,20 @@ def test_main(self, ipc_device, ipc_memory_resource):
         proc_c.start()
 
         # Wait for C to signal completion then clean up.
-        event_c.wait(timeout=CHILD_TIMEOUT_SEC)
+        completed = event_c.wait(timeout=CHILD_TIMEOUT_SEC)
         event_b.set()  # b can finish now
         proc_b.join(timeout=CHILD_TIMEOUT_SEC)
         proc_c.join(timeout=CHILD_TIMEOUT_SEC)
+
+        # Kill any processes that are still alive. Without this, a child stuck
+        # in CUDA context teardown (issue #2004: Python 3.12 + CUDA 12.9.1)
+        # holds IPC handles and blocks fixture teardown indefinitely.
+        for p in (proc_b, proc_c):
+            if p.is_alive():
+                p.kill()
+                p.join()
+
+        assert completed, "process C did not complete within timeout"
         assert proc_b.exitcode == 0
         assert proc_c.exitcode == 0
 
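The second hunk relies on `Event.wait(timeout=...)` returning a boolean, and on deferring the assertion until cleanup has run. A small sketch of that ordering follows. It uses `threading.Event` for simplicity, since `multiprocessing.Event` is documented as a clone of it; `wait_then_cleanup` and the `cleanup` callback are hypothetical names, not part of the test suite.

```python
import threading


def wait_then_cleanup(event, cleanup, timeout):
    """Wait for the event, always run cleanup, then report the outcome.

    Returns True if the event was set before the timeout, False otherwise.
    """
    completed = event.wait(timeout=timeout)  # False means the wait timed out
    cleanup()  # runs on both paths, before any assertion can abort the test
    return completed


if __name__ == "__main__":
    log = []
    never_set = threading.Event()
    completed = wait_then_cleanup(never_set, lambda: log.append("cleaned"), 0.05)
    # Cleanup has already happened by the time the caller checks the result.
    print(completed, log)
```

Asserting on `completed` only after the cleanup step means a timeout is still reported as a failure, but no stuck child is left behind to block fixture teardown.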