Commit 5b802a0
tests: kill zombie IPC child processes after join timeout

When Python 3.12 CI runs, env-vars enables compute-sanitizer with
--target-processes=all, which attaches to every mp.Process child the
tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown
gets stuck, so child processes never exit. The existing
join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive.

That zombie keeps its IPC handle open. When pytest teardown runs
ipc_memory_resource's mr.close(), it blocks waiting for the handle to
be released -- tying up the runner for hours until GitHub Actions
force-cancels the job. This is the exact pattern in issue #2004
(always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process still alive so the IPC
handle is released before fixture teardown. A stderr warning is printed
when kill() fires so the failure is clearly attributable to the
sanitizer/IPC deadlock rather than appearing as a generic exitcode != 0.
Tests still fail (exit code is non-zero or completed is False), just in
seconds rather than hours.
Fixes #2004
1 parent df61e49 commit 5b802a0
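The join-then-kill pattern described in the commit message can be sketched roughly as follows. The diff body is not reproduced here, so this is an illustration under assumptions and not the committed code. The helper name `join_or_kill`, the wording of the warning, and the `CHILD_TIMEOUT_SEC` value are hypothetical. Only `join(timeout=...)`, `is_alive()`, and `kill()` on `multiprocessing.Process` come from the standard library.

```python
import multiprocessing as mp
import sys
import time

# Hypothetical value; the real CHILD_TIMEOUT_SEC is defined in the test module.
CHILD_TIMEOUT_SEC = 0.5


def join_or_kill(process, timeout=CHILD_TIMEOUT_SEC):
    """Join `process`; if it is still alive after `timeout`, kill and reap it.

    Returns True when the child had to be killed, False when it exited
    on its own. (Hypothetical helper, not part of the commit.)
    """
    process.join(timeout=timeout)
    if not process.is_alive():
        return False

    # Warn on stderr so a later failure can be traced back to the hung child.
    print(
        f"WARNING: child pid={process.pid} still alive after {timeout}s; "
        "killing it",
        file=sys.stderr,
    )
    process.kill()
    # Join again without a timeout to reap the child and populate exitcode.
    process.join()
    return True


if __name__ == "__main__":
    # A child that sleeps far longer than the timeout stands in for the
    # hung sanitizer-attached process.
    child = mp.Process(target=time.sleep, args=(60,))
    child.start()
    killed = join_or_kill(child)
    # The test should still fail: the killed child has a non-zero exit code.
    assert killed and child.exitcode != 0
```

The second `join()` matters in this sketch: `kill()` only sends the signal, and the exit code stays `None` until the child has been reaped, so an assertion on `exitcode` made straight after `kill()` could misreport the result.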
1 file changed
Lines changed: 25 additions & 7 deletions
Diff line content is not preserved; the hunk positions are:

| Original file lines | Diff lines | Change |
|---|---|---|
| 2 to 7 | 2 to 8 | 1 line added (new line 5) |
| 40 to 49 | 41 to 59 | 4 lines removed (43 to 46), 13 lines added (new lines 44 to 56) |
| 108 to 118 | 118 to 136 | 3 lines removed (111 to 113), 11 lines added (new lines 121 to 125 and 128 to 133) |