@@ -14,11 +14,12 @@ if [[ "${SETUP_SANITIZER}" == 1 ]]; then
  COMPUTE_SANITIZER_VERSION=$(${COMPUTE_SANITIZER} --version | grep -Eo "[0-9]{4}\.[0-9]\.[0-9]" | sed -e 's/\.//g')
  # --target-processes=application-only: attach the sanitizer to the parent
  # pytest process only. Spawned multiprocessing.Process children run without
- # the sanitizer. This avoids a class of CI hangs where compute-sanitizer's
- # IPC teardown analysis wedges a child on certain CUDA driver / toolkit
- # combinations (see issue #2004). The parent process is still fully
- # sanitized, which is where most of the interesting host-side IPC plumbing
- # runs anyway.
+ # the sanitizer. This aims to mitigate a class of CI hangs where child
+ # processes take an extreme amount of time to spawn (>30 seconds). Test bugs
+ # triggered by that specific condition are typically uncovered only in CI,
+ # where they become emergencies and are difficult to debug. The parent
+ # process is still fully sanitized, which is where most of the interesting
+ # host-side IPC plumbing runs anyway.
  SANITIZER_CMD="${COMPUTE_SANITIZER} --target-processes=application-only --launch-timeout=0 --tool=memcheck --error-exitcode=1 --report-api-errors=no"
  if [[ "$COMPUTE_SANITIZER_VERSION" -ge 202111 ]]; then
      SANITIZER_CMD="${SANITIZER_CMD} --padding=32"
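
The version gate in this hunk can be hard to read in shell form, so here is a minimal Python sketch of the same logic: pull a `YYYY.N.N` version out of the `--version` output, drop the dots, and add `--padding=32` only when the resulting number is at least 202111. The function names and the sample version string are assumptions made for illustration; they do not appear in the script.

```python
import re

# Flags copied from SANITIZER_CMD in the hunk above.
BASE_ARGS = [
    "--target-processes=application-only",
    "--launch-timeout=0",
    "--tool=memcheck",
    "--error-exitcode=1",
    "--report-api-errors=no",
]


def sanitizer_version_key(version_output: str) -> int:
    """Turn e.g. '2023.2.0' into 202320, like the grep | sed pipeline.

    Hypothetical helper: unlike `grep -Eo`, it uses only the first match.
    """
    match = re.search(r"[0-9]{4}\.[0-9]\.[0-9]", version_output)
    if match is None:
        raise ValueError("no YYYY.N.N version found in sanitizer output")
    return int(match.group(0).replace(".", ""))


def build_sanitizer_args(version_key: int) -> list:
    """Return the sanitizer flags, adding --padding=32 from 2021.1.1 on."""
    args = list(BASE_ARGS)
    if version_key >= 202111:
        args.append("--padding=32")
    return args


if __name__ == "__main__":
    # Assumed sample text; real `compute-sanitizer --version` output may differ.
    sample = "Version 2023.2.0 (build 12345678)"
    print(build_sanitizer_args(sanitizer_version_key(sample)))
```

Comparing the dot-stripped number as an integer works only because the pattern fixes every component to a set width (four digits, one digit, one digit).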
@@ -4,11 +4,10 @@
  """Helpers for tests that spawn ``multiprocessing.Process`` children.

  These exist primarily to defend IPC tests against a class of CI hang where a
- child process gets stuck during teardown (e.g., compute-sanitizer's IPC
- teardown analysis on certain CUDA driver / toolkit combinations -- see issue
- #2004). Without intervention, a zombie child holds an IPC memory handle and
- blocks the parent's ``mr.close()`` in fixture teardown, wedging the GHA runner
- for hours.
+ child process spawns too slowly and the parent does not implement proper guards
+ for that (see issue #2004). Without intervention, a zombie child holds an IPC
+ memory handle and blocks the parent's ``mr.close()`` in fixture teardown,
+ leading to deadlock and wedging the test runner for hours.
  """

  import contextlib
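
The docstring above describes the failure mode but the diff does not show the helpers themselves. As a rough illustration of the kind of guard it refers to, the sketch below waits for a child with a bounded timeout and forces it down if it is still alive, so that it cannot keep a resource open while the parent tears down. `join_or_kill`, `grace_sec`, and `_sleep_forever` are hypothetical names, not the helpers in this module.

```python
import multiprocessing
import time


def join_or_kill(process, timeout_sec, grace_sec=5.0):
    """Wait for a child process; stop it if it outlives the timeout.

    Returns True if the child exited on its own, False if it had to be
    terminated or killed. Hypothetical helper, for illustration only.
    """
    process.join(timeout_sec)
    if not process.is_alive():
        return True
    # Ask politely first (SIGTERM on POSIX), then give it a short grace period.
    process.terminate()
    process.join(grace_sec)
    if process.is_alive():
        # Last resort (SIGKILL on POSIX); join again to reap the child.
        process.kill()
        process.join()
    return False


def _sleep_forever():
    """Stand-in for a child that never finishes."""
    while True:
        time.sleep(1)


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    child = ctx.Process(target=_sleep_forever)
    child.start()
    exited_cleanly = join_or_kill(child, timeout_sec=1.0)
    print("exited on its own:", exited_cleanly)
```

Returning a boolean lets the calling test fail with a clear message instead of hanging in fixture teardown.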
@@ -6,8 +6,8 @@
  Applies an outer-guard ``pytest.mark.timeout`` to every test in this directory.
  Individual tests still drive their own per-process waits using
  ``child_timeout_sec()`` from ``helpers.child_processes``; this marker is the
- final fallback so that no IPC test can wedge the CI runner for hours if some
- new driver / sanitizer / IPC interaction defeats every other layer.
+ final fallback so that no IPC test can wedge the CI runner for hours if
+ deadlock occurs.
  """

  import pathlib