Commit d65186e

tests: reframe IPC hang docs as test-side bug rather than CS bug
Reword the inline comment in ci/tools/setup-sanitizer and the docstrings in helpers/child_processes.py and memory_ipc/conftest.py to make clear that the underlying problem is insufficient guards in the IPC tests when child processes spawn slowly (>30s under compute-sanitizer). The sanitizer change and the new helpers are mitigations / safety nets; the durable fix is making the tests handle slow children correctly.
1 parent a026a05 commit d65186e

3 files changed

Lines changed: 12 additions & 12 deletions

ci/tools/setup-sanitizer

Lines changed: 6 additions & 5 deletions
@@ -14,11 +14,12 @@ if [[ "${SETUP_SANITIZER}" == 1 ]]; then
 COMPUTE_SANITIZER_VERSION=$(${COMPUTE_SANITIZER} --version | grep -Eo "[0-9]{4}\.[0-9]\.[0-9]" | sed -e 's/\.//g')
 # --target-processes=application-only: attach the sanitizer to the parent
 # pytest process only. Spawned multiprocessing.Process children run without
-# the sanitizer. This avoids a class of CI hangs where compute-sanitizer's
-# IPC teardown analysis wedges a child on certain CUDA driver / toolkit
-# combinations (see issue #2004). The parent process is still fully
-# sanitized, which is where most of the interesting host-side IPC plumbing
-# runs anyway.
+# the sanitizer. This aims to mitigate a class of CI hangs where child
+# processes take an extreme amount of time to spawn (>30 seconds). Test bugs
+# triggered by that specific condition are typically uncovered only in CI,
+# where they become emergencies and are difficult to debug. The parent
+# process is still fully sanitized, which is where most of the interesting
+# host-side IPC plumbing runs anyway.
 SANITIZER_CMD="${COMPUTE_SANITIZER} --target-processes=application-only --launch-timeout=0 --tool=memcheck --error-exitcode=1 --report-api-errors=no"
 if [[ "$COMPUTE_SANITIZER_VERSION" -ge 202111 ]]; then
 SANITIZER_CMD="${SANITIZER_CMD} --padding=32"
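
The version gate in this hunk works by stripping the dots from a version such as `2021.1.1`, which yields the integer `202111`, and comparing it numerically. The arithmetic can be sketched in Python as follows. The function names and the sample version strings are illustrative assumptions, not part of the repository, and this sketch keeps only the first match, whereas `grep -o` prints every match.

```python
import re

# Same pattern as the shell script: four digits, then two single-digit fields.
_VERSION_RE = re.compile(r"[0-9]{4}\.[0-9]\.[0-9]")

# Threshold used by the script before adding --padding=32.
PADDING_MIN_VERSION = 202111


def sanitizer_version_number(version_output):
    """Return the dotted version with its dots removed, as an int.

    Returns None when no version of the expected shape is found.
    """
    match = _VERSION_RE.search(version_output)
    if match is None:
        return None
    return int(match.group(0).replace(".", ""))


def supports_padding(version_output):
    """Mirror the `-ge 202111` comparison from the shell script."""
    number = sanitizer_version_number(version_output)
    return number is not None and number >= PADDING_MIN_VERSION


if __name__ == "__main__":
    # Hypothetical sample text; the real tool output may be formatted differently.
    sample = "Version 2022.4.1"
    print(sanitizer_version_number(sample), supports_padding(sample))
```

The pattern assumes single-digit minor and patch fields, so a string such as `2023.10.1` does not match it at all. The sketch returns `None` in that case and treats the padding option as unsupported.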

cuda_core/tests/helpers/child_processes.py

Lines changed: 4 additions & 5 deletions
@@ -4,11 +4,10 @@
 """Helpers for tests that spawn ``multiprocessing.Process`` children.

 These exist primarily to defend IPC tests against a class of CI hang where a
-child process gets stuck during teardown (e.g., compute-sanitizer's IPC
-teardown analysis on certain CUDA driver / toolkit combinations -- see issue
-#2004). Without intervention, a zombie child holds an IPC memory handle and
-blocks the parent's ``mr.close()`` in fixture teardown, wedging the GHA runner
-for hours.
+child process spawns too slowly and the parent does not implement proper guards
+for that (see issue #2004). Without intervention, a zombie child holds an IPC
+memory handle and blocks the parent's ``mr.close()`` in fixture teardown,
+leading to deadlock and wedging the test runner for hours.
 """

 import contextlib
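
The docstring above describes a child that outlives the test and keeps the parent's teardown blocked. One common guard against that is to bound the wait on the child and forcibly stop it if it overruns, so that it is reaped before any fixture teardown runs. The sketch below illustrates that idea with the standard `multiprocessing.Process` interface only. `join_or_kill` and its parameters are hypothetical names chosen for illustration; this is not the helper module from the commit, whose contents are not shown here.

```python
import multiprocessing as mp
import time


def join_or_kill(proc, timeout_sec, grace_sec=5.0):
    """Wait for a child process; stop it if it does not exit in time.

    ``proc`` is any object exposing the multiprocessing.Process methods
    join(), is_alive(), terminate(), and kill().
    Returns True if the child exited on its own within ``timeout_sec``,
    False if it had to be terminated or killed.
    """
    proc.join(timeout_sec)
    if not proc.is_alive():
        return True
    # Ask politely first (SIGTERM on POSIX), then wait a short grace period.
    proc.terminate()
    proc.join(grace_sec)
    if proc.is_alive():
        # Last resort (SIGKILL on POSIX); join again so the child is reaped.
        proc.kill()
        proc.join()
    return False


if __name__ == "__main__":
    # A deliberately slow child stands in for one that never finishes.
    ctx = mp.get_context("spawn")
    child = ctx.Process(target=time.sleep, args=(60,))
    child.start()
    exited_on_its_own = join_or_kill(child, timeout_sec=0.5)
    print("child exited on its own:", exited_on_its_own)
```

A test would then assert on the return value, so a slow child shows up as an ordinary test failure instead of a hung runner.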

cuda_core/tests/memory_ipc/conftest.py

Lines changed: 2 additions & 2 deletions
@@ -6,8 +6,8 @@
 Applies an outer-guard ``pytest.mark.timeout`` to every test in this directory.
 Individual tests still drive their own per-process waits using
 ``child_timeout_sec()`` from ``helpers.child_processes``; this marker is the
-final fallback so that no IPC test can wedge the CI runner for hours if some
-new driver / sanitizer / IPC interaction defeats every other layer.
+final fallback so that no IPC test can wedge the CI runner for hours if
+deadlock occurs.
 """

 import pathlib
