Commit 5b802a0

tests: kill zombie IPC child processes after join timeout
In the Python 3.12 CI job, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer's analysis of IPC buffer teardown gets stuck, so child processes never exit.

The existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That hung child keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), it blocks waiting for the handle to be released -- tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue #2004 (always Python 3.12 + CUDA 12.9.1 with the local CTK).

Fix: after join(timeout=...), kill any process that is still alive so the IPC handle is released before fixture teardown. A warning is printed to stderr when kill() fires, so the failure is clearly attributable to the sanitizer/IPC deadlock rather than appearing as a generic exitcode != 0.

Tests still fail (exit code is non-zero or completed is False), just in seconds rather than hours.

Fixes #2004
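The join-then-kill behavior described above can be reproduced without CUDA or compute-sanitizer. The following is a minimal sketch, not the test suite's code: `stuck_child`, `join_or_kill`, `demo`, and the 0.5 s timeout are hypothetical names and values chosen for illustration. It assumes a POSIX system, because it uses the `fork` start method to stay self-contained.

```python
import multiprocessing as mp
import sys
import time

# Illustrative value only; the real test suite defines its own CHILD_TIMEOUT_SEC.
CHILD_TIMEOUT_SEC = 0.5


def stuck_child():
    # Stand-in for a child that never exits on its own
    # (for example, one blocked during teardown).
    time.sleep(3600)


def join_or_kill(process, timeout):
    """Join with a timeout and kill the process if it is still alive.

    Returns True if the process had to be killed.
    """
    process.join(timeout=timeout)  # returns after `timeout` even if the child lives on
    if not process.is_alive():
        return False
    print(
        f"[WARN] child process {process.pid} still alive after {timeout}s; killing it",
        file=sys.stderr,
    )
    process.kill()  # SIGKILL on POSIX
    process.join()  # reap the child so that exitcode is populated
    return True


def demo():
    # Assumption: "fork" keeps the sketch self-contained; it is POSIX-only.
    ctx = mp.get_context("fork")
    proc = ctx.Process(target=stuck_child)
    proc.start()
    killed = join_or_kill(proc, CHILD_TIMEOUT_SEC)
    return killed, proc.exitcode
```

On POSIX, a child terminated by a signal reports a negative `exitcode`, so an assertion such as `assert process.exitcode == 0` still fails after the kill. That matches the commit's statement that the tests keep failing, only faster.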
1 parent df61e49 commit 5b802a0

1 file changed

Lines changed: 25 additions & 7 deletions

cuda_core/tests/memory_ipc/test_send_buffers.py

@@ -2,6 +2,7 @@
 # SPDX-License-Identifier: Apache-2.0
 
 import multiprocessing as mp
+import sys
 from itertools import cycle
 
 import pytest
@@ -40,10 +41,19 @@ def test_main(self, ipc_device, nmrs):
         # Wait for the child process.
         process.join(timeout=CHILD_TIMEOUT_SEC)
         if process.is_alive():
-            # Child is stuck (CUDA context teardown can hang under certain
-            # driver/Python combos — see issue #2004). Kill it so the
-            # IPC handle is released and the fixture teardown doesn't
-            # block the runner for hours.
+            # Child did not exit within the timeout. Under compute-sanitizer
+            # with --target-processes=all (active for Python 3.12 + local
+            # CTK on Linux), IPC memory teardown inside the sanitizer can
+            # deadlock on CUDA 12.9.1, leaving the child alive indefinitely.
+            # SIGKILL forces the kernel to reclaim all IPC handles so that
+            # fixture teardown (mr.close()) does not block the runner for
+            # hours. See issue #2004.
+            print(
+                f"[WARN] child process {process.pid} still alive after "
+                f"{CHILD_TIMEOUT_SEC}s — sending SIGKILL "
+                f"(likely compute-sanitizer IPC deadlock, see issue #2004)",
+                file=sys.stderr,
+            )
             process.kill()
             process.join()
         assert process.exitcode == 0
@@ -108,11 +118,19 @@ def test_main(self, ipc_device, ipc_memory_resource):
         proc_b.join(timeout=CHILD_TIMEOUT_SEC)
         proc_c.join(timeout=CHILD_TIMEOUT_SEC)
 
-        # Kill any processes that are still alive. Without this, a child stuck
-        # in CUDA context teardown (issue #2004: Python 3.12 + CUDA 12.9.1)
-        # holds IPC handles and blocks fixture teardown indefinitely.
+        # Kill any processes that are still alive. Under compute-sanitizer with
+        # --target-processes=all, IPC teardown deadlocks on CUDA 12.9.1 so
+        # children never exit. SIGKILL forces kernel IPC handle release so
+        # fixture teardown (mr.close()) does not block the runner for hours.
+        # See issue #2004.
         for p in (proc_b, proc_c):
             if p.is_alive():
+                print(
+                    f"[WARN] child process {p.pid} still alive after "
+                    f"{CHILD_TIMEOUT_SEC}s — sending SIGKILL "
+                    f"(likely compute-sanitizer IPC deadlock, see issue #2004)",
+                    file=sys.stderr,
+                )
                 p.kill()
                 p.join()
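The two tests in this diff repeat the same kill-if-alive block. One way to share it is a helper that joins several children against a single deadline and then kills the stragglers. This is a hedged sketch of a possible variation, not what the commit does: the commit joins each process with its own `CHILD_TIMEOUT_SEC`. The names `join_all_or_kill`, `sleeper`, and `demo` and the 2 s timeout are hypothetical, and the `fork` start method assumes a POSIX system.

```python
import multiprocessing as mp
import sys
import time


def sleeper(seconds):
    # Stand-in child: exits on its own after `seconds`.
    time.sleep(seconds)


def join_all_or_kill(processes, timeout):
    """Join every process against one shared deadline, then kill stragglers.

    Returns the list of processes that had to be killed.
    """
    deadline = time.monotonic() + timeout
    for p in processes:
        # Each join only gets whatever time is left before the shared deadline.
        remaining = max(0.0, deadline - time.monotonic())
        p.join(timeout=remaining)

    killed = []
    for p in processes:
        if p.is_alive():
            print(
                f"[WARN] child process {p.pid} still alive after {timeout}s; killing it",
                file=sys.stderr,
            )
            p.kill()
            p.join()  # reap the child so that exitcode is populated
            killed.append(p)
    return killed


def demo():
    # Assumption: "fork" keeps the sketch self-contained; it is POSIX-only.
    ctx = mp.get_context("fork")
    fast = ctx.Process(target=sleeper, args=(0.1,))
    slow = ctx.Process(target=sleeper, args=(3600,))
    for p in (fast, slow):
        p.start()
    killed = join_all_or_kill([fast, slow], timeout=2.0)
    return fast.exitcode, slow.exitcode, [p is slow for p in killed]
```

With a shared deadline the worst-case wait is about one timeout for the whole group rather than one per child. A caller can assert on the returned list to fail the test with a clearer message.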
