cuda.core: keep kernel-argument objects alive in graph kernel nodes #2041
Conversation
`GraphDefinition.launch()` did not extend the lifetime of the Python kernel-argument objects to the lifetime of the graph. The `ParamHolder` built in `GN_launch` held the only references to those objects and was destroyed when `GN_launch` returned. The driver only stores the raw pointer values in the kernel node, so a `Buffer` reachable only through the call could be GC'd before the graph ran, leaving the graph with a stale device pointer. Attach the `kernel_args` tuple to the graph as a CUDA user object, mirroring the existing handling of `KernelHandle` and `EventHandle`. This reuses the `_py_host_destructor` path already used by the host callback machinery. Closes NVIDIA#2039 Co-authored-by: Cursor <cursoragent@cursor.com>
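To make the hazard concrete, here is a pure-Python analogy (no CUDA involved; `Buffer` and `KernelNode` below are hypothetical stand-ins, not the cuda.core classes): a node that records only raw pointer values loses its arguments to GC, while one that also holds the argument tuple keeps them alive.

```python
import gc
import weakref

class Buffer:
    """Stand-in for a device allocation; only the pointer matters here."""
    def __init__(self):
        self.ptr = 0xDEADBEEF  # pretend device pointer

class KernelNode:
    """Stand-in for a graph kernel node."""
    def __init__(self, args, keep_alive=False):
        # The driver analogue: only raw pointer values are recorded...
        self.arg_ptrs = [a.ptr for a in args]
        # ...so the fix additionally ties the argument objects to the
        # graph's lifetime (here: to the node's lifetime).
        self._kernel_args = tuple(args) if keep_alive else None

# Without the fix: the Buffer is collectable once the caller drops it.
buf = Buffer()
ref = weakref.ref(buf)
node = KernelNode([buf], keep_alive=False)
del buf
gc.collect()
assert ref() is None  # stale pointer left in the node

# With the fix: the node keeps the argument tuple alive.
buf = Buffer()
ref = weakref.ref(buf)
node = KernelNode([buf], keep_alive=True)
del buf
gc.collect()
assert ref() is not None  # argument survives as long as the node
```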
Force-pushed from e50c99e to 6654645
```python
del buf
gc.collect()
assert buf_weak() is not None  # graph kept the Buffer alive
```
The test proves the buffer is kept alive, but it doesn't validate that it's cleaned up after the graph is released.
I added a test for this. If it is flaky, we might need to adjust the `CU_USER_OBJECT_NO_DESTRUCTOR_SYNC` flag so that graph destructors cannot be invoked asynchronously.
Update: I confirmed this is not a concern for source graphs. Asynchronous destruction only comes into play for exec graphs.
This test creates an exec graph, so there is a race. CI for free-threaded Python seems more likely to trigger it. 9f2c8f2 adds polling, but removing the test would also be defensible.
rparolin left a comment:
The tests should validate that the buffer is eventually freed once the graph's refcount is decremented.
Addresses review feedback (PR NVIDIA#2041): the existing test only proved the graph kept the Buffer alive, not that the user-object machinery actually releases it once the graph is destroyed. Without the symmetric check, a working attachment is indistinguishable from a permanent leak. Co-authored-by: Cursor <cursoragent@cursor.com>
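A minimal sketch of that symmetric check, continuing the pure-Python analogy from earlier (`Buffer` and `KernelNode` are the hypothetical stand-ins defined above, with the node playing the graph's role):

```python
import gc
import weakref

buf = Buffer()
buf_weak = weakref.ref(buf)
node = KernelNode([buf], keep_alive=True)
del buf
gc.collect()
assert buf_weak() is not None  # attachment keeps the Buffer alive...

del node  # analogue of destroying the graph
gc.collect()
assert buf_weak() is None  # ...and releases it afterwards; not a leak
```

In the real test the second assertion turned out to be racy (see below): `cuGraphExecDestroy` releases its user objects asynchronously, so the eager assert was later replaced with a bounded poll.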
Below are the Cursor GPT-5.4 Extra High Fast findings. It was thinking far longer than I'd have expected for a PR this size. I'm not sure which of these are actually actionable. Re 1: do we care about stream-captured graphs?
Thanks @rwgk
I'll look into this. It might need to be deferred because AFAIK the stream capture path does not create any user objects.
We have a huge class of possible errors of this type, unfortunately. A better approach than storing the
I have to rework the whole user object design for #1330 (step 4) and I plan to address this.
This is a good catch, and it is not fixed by this PR (which is why kernel arg update is so messy, as noted during the team sync today). I am fine with this PR only fixing the explicit graph-construction path.
The freeing assertion at the end of test_kernel_args_buffer_lifetime failed on free-threaded Python (py3.14t) because cuGraphExecDestroy releases its user-object references via an asynchronous DPC, and free-threaded CPython's deferred ref counting can need an extra GC pass to settle. Poll the weakref with a bounded timeout and per-iteration GC instead of asserting eagerly. Co-authored-by: Cursor <cursoragent@cursor.com>
```python
def _wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate() until True or timeout, driving gc each iteration.

    Used for assertions about resource cleanup that may be delayed by CUDA's
    asynchronous user-object destructor pump (DPC) or, on free-threaded
    Python, by deferred reference-count processing. A bounded poll keeps the
    test correct without depending on undocumented driver timing guarantees.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()
        if predicate():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not satisfied within {timeout}s")
```
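With the helper in place, the eager release assertion becomes a bounded poll, for example (assuming a `buf_weak` weakref as in the earlier snippet):

```python
# Instead of: assert buf_weak() is None
_wait_until(lambda: buf_weak() is None)
```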
Wouldn't this still invite flakiness? I am concerned about this being tested in SWQA's hands.
Agreed, it's not perfect, though this is much better than before. Realistically, I think it's either this or we don't test the release condition Rob pointed out. There will be much more work on the graph ownership model, so I expect to revisit all these tests.
```python
    free-threaded Python the resulting Py_DECREF chain may need an extra
    GC pass to settle.
    """
    from cuda.core._utils.cuda_utils import driver, handle_return
```
nit: move imports to the top; no need to defer the import here.
```python
from cuda.core._utils.cuda_utils import driver, handle_return

_skip_if_no_mempool()
dev = Device()
```
Suggested change:

```diff
- dev = Device()
+ dev = init_cuda
```
I merged due to the release deadline, but I will follow up on the open comments.
…#2047

- New feature: persistent program cache for Program.compile (InMemoryProgramCache, FileStreamProgramCache, make_program_cache_key).
- Fix: graph kernel nodes now prevent kernel-argument GC.
- Fix: DeviceEvents.__dealloc__ crash on uninitialized handle.
…anup (#2032)

* Document cuda.core support policy

  Add support.rst covering versioning (SemVer), CUDA version support (dual major versions), Python version support (CPython EOL schedule), free-threading (experimental), and release cadence (bimonthly). Closes #2030

* Fix broken CCCL URLs and add missing cuda.bindings interfaces

  - Update cuda.coop and cuda.compute URLs from the old nvidia.github.io/cccl/python/{coop,compute} paths (now 404) to the current unstable doc paths.
  - Add nvFatbin and NVML to the cuda.bindings interface list.
  - Update all three synced files: README.md, cuda_python/DESCRIPTION.rst, and cuda_python/docs/source/index.rst.

* Add missing entries to cuda.core 1.0.0 release notes

  Add new features (green contexts, system.Device NVML APIs, system.typing module, NVML enum re-wrapping), breaking changes (tensor bridge behavior, system.Device renames, privatized helper classes, UUID format change, removed enums), and bug fixes (is_managed for pool alloc, nvJitLink log error handling, NVML event set init, Device.arch unknown, empty field values, runtime error messages, wheel size reduction).

* Update cuda.core docs for 1.0.0 GA

  - api.rst: replace pre-1.0 warning with stable-API statement and link to support policy.
  - install.rst: update free-threading version reference from 0.4.0 to 1.0.0.
  - nv-versions.json: add 1.0.0 entry for the version switcher dropdown.

* Split cuda.core.system API reference into separate page

  Move the CUDA system information / NVML section from api.rst into a dedicated api_nvml.rst. The new page uses its own `.. module:: cuda.core.system` directive so autosummary entries no longer need the `system.` prefix. Added to index.rst toctree after api.

* Remove algorithm and size details from make_program_cache_key docstring

  The Returns section exposed the hash algorithm and digest size, which are implementation details. Replace with "opaque bytes digest" so the public API contract does not pin these. See #2043

* Remove deprecated cuda.core.experimental namespace

  The cuda.core.experimental namespace was deprecated in v0.5.0 when all public APIs moved to the top-level cuda.core namespace. Remove the backward-compatibility shim and its test as promised for v1.0.0.

* Add missing release note entries for #1912, #2041, #2047

  - New feature: persistent program cache for Program.compile (InMemoryProgramCache, FileStreamProgramCache, make_program_cache_key).
  - Fix: graph kernel nodes now prevent kernel-argument GC.
  - Fix: DeviceEvents.__dealloc__ crash on uninitialized handle.

* Update 1.0.0-notes.rst

* expand support policy

* wordsmith
Summary
Closes #2039.
`GraphDefinition.launch()` did not extend the lifetime of Python kernel-argument objects (e.g. `Buffer`) to the lifetime of the graph. The ownership represented by a `ParamHolder` constructed in `GN_launch` needs to be attached to the graph to avoid the possibility of stale arguments producing memory corruption or a crash on launch.

Changes
- `cuda_core/cuda/core/graph/_graph_node.pyx`: in `GN_launch`, attach the `kernel_args` tuple to the graph as a CUDA user object, mirroring the existing handling of `KernelHandle` and `EventHandle`. This reuses the `_py_host_destructor` path already used by the host-callback machinery.
- `cuda_core/cuda/core/graph/_utils.pxd`: expose `_py_host_destructor` so the new caller can use it.

The new attachment runs only on the graph-construction path and is paid once per kernel node at build time, not at execution time. It does not affect the regular (non-graph) launch path in `_launcher.pyx`.

Test Coverage
Two tests added in
`cuda_core/tests/graph/test_graph_definition_lifetime.py`:

- `test_kernel_args_buffer_kept_alive_through_execution`: a `Buffer` passed as a kernel arg survives `del buf` + `gc.collect()` (weakref check), and the graph executes correctly against its memory after instantiation (value check).
- `test_kernel_args_survive_graph_clone`: same scenario but via `cuGraphClone`, which doesn't carry Python-level references; only CUDA user objects can keep the args alive across the clone.

Related Work

This PR does not guard `_py_host_destructor` against being invoked after `Py_Finalize`. That is a pre-existing risk (also present on the host-callback path) that this PR inherits but does not introduce or widen.