cuda.core: keep kernel-argument objects alive in graph kernel nodes#2041

Merged
Andy-Jost merged 3 commits into NVIDIA:main from Andy-Jost:ajost/graph-kernel-args-lifetime
May 7, 2026

Conversation

Contributor

@Andy-Jost Andy-Jost commented May 6, 2026

Summary

Closes #2039.

GraphDefinition.launch() did not extend the lifetime of Python kernel-argument objects (e.g. Buffer) to the lifetime of the graph. The ownership represented by a ParamHolder constructed in GN_launch needs to be attached to the graph to avoid the possibility of stale arguments producing memory corruption or a crash on launch.

Changes

  • cuda_core/cuda/core/graph/_graph_node.pyx: in GN_launch, attach the kernel_args tuple to the graph as a CUDA user object, mirroring the existing handling of KernelHandle and EventHandle. Reuses the _py_host_destructor path already used by the host-callback machinery.
  • cuda_core/cuda/core/graph/_utils.pxd: expose _py_host_destructor so the new caller can use it.

The new attachment runs only on the graph-construction path and is paid once per kernel node at build time, not at execution time. It does not affect the regular (non-graph) launch path in _launcher.pyx.
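The ownership transfer can be sketched at the Python level (a minimal analogy, not the Cython implementation; `GraphLifetime` and `attach_user_object` are hypothetical names mimicking `cuGraphRetainUserObject` semantics):

```python
import gc
import weakref

class GraphLifetime:
    """Stand-in for a CUDA graph that owns user objects."""
    def __init__(self):
        self._user_objects = []

    def attach_user_object(self, obj):
        # A strong reference held for the lifetime of the graph.
        self._user_objects.append(obj)

def add_kernel_node(graph, kernel_args):
    # Without this attachment, kernel_args could be collected while the
    # graph still holds raw pointers derived from them.
    graph.attach_user_object(kernel_args)

class FakeBuffer:  # weakref-able stand-in for a kernel argument
    pass

graph = GraphLifetime()
args = (FakeBuffer(),)
ref = weakref.ref(args[0])
add_kernel_node(graph, args)

del args
gc.collect()
assert ref() is not None    # the graph keeps the argument alive

del graph
gc.collect()
assert ref() is None        # released together with the graph
```

The second half of the sketch is the symmetric check the review asked for: attachment must not turn into a permanent leak once the graph is destroyed.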

Test Coverage

Two tests added in cuda_core/tests/graph/test_graph_definition_lifetime.py:

  • test_kernel_args_buffer_kept_alive_through_execution: a Buffer passed as a kernel arg survives del buf + gc.collect() (weakref check) and the graph executes correctly against its memory after instantiation (value check).
  • test_kernel_args_survive_graph_clone: same scenario but via cuGraphClone, which doesn't carry Python-level references — only CUDA user objects can keep the args alive across the clone.

Related Work

@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 6, 2026
@Andy-Jost Andy-Jost added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels May 6, 2026
@Andy-Jost Andy-Jost self-assigned this May 6, 2026
`GraphDefinition.launch()` did not extend the lifetime of the Python
kernel-argument objects to the lifetime of the graph. The `ParamHolder`
built in `GN_launch` held the only references to those objects and was
destroyed when `GN_launch` returned. The driver only stores the raw
pointer values in the kernel node, so a `Buffer` reachable only through
the call could be GC'd before the graph ran, leaving the graph with a
stale device pointer.

Attach the `kernel_args` tuple to the graph as a CUDA user object,
mirroring the existing handling of `KernelHandle` and `EventHandle`.
This reuses the `_py_host_destructor` path already used by the host
callback machinery.

Closes NVIDIA#2039

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost force-pushed the ajost/graph-kernel-args-lifetime branch from e50c99e to 6654645 Compare May 6, 2026 21:29

del buf
gc.collect()
assert buf_weak() is not None # graph kept the Buffer alive
Collaborator

The test proves the buffer is kept alive, but it doesn't validate that it's cleaned up after the graph is released.

Contributor Author

@Andy-Jost Andy-Jost May 6, 2026

I added a test for this. If it is flaky, we might need to adjust the CU_USER_OBJECT_NO_DESTRUCTOR_SYNC flag so that graph destructors cannot be invoked asynchronously.

Update: I confirmed this is not a concern for source graphs. Asynchronous destruction only comes into play for exec graphs.

Contributor Author

This test creates an exec graph, so there is a race. CI for free-threaded Python seems more likely to trigger it. 9f2c8f2 adds polling, but removing the test would also be defensible.

Collaborator

@rparolin rparolin left a comment

The tests should validate that the buffer is eventually freed once the graph's refcount is decremented.

Addresses review feedback (PR NVIDIA#2041): the existing test only proved the
graph kept the Buffer alive, not that the user-object machinery actually
releases it once the graph is destroyed. Without the symmetric check, a
working attachment is indistinguishable from a permanent leak.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rwgk
Contributor

rwgk commented May 6, 2026

Below are the Cursor GPT-5.4 Extra High Fast findings. It was thinking far longer than I'd have expected for a PR this size.

I'm not sure which of these are actually actionable:

Re 1. Do we care about stream-captured graphs?
Re 2. Could we simply document that we don't protect against explicit release?
Re 3. This seems OK to me? (I.e. I mean we can ignore this finding, or document the behavior?)

@Andy-Jost


  1. High: Stream-captured graphs still use the unfixed launcher path

    launch() still accepts a GraphBuilder and creates a stack-local ParamHolder in cuda_core/cuda/core/_launcher.pyx:23 and cuda_core/cuda/core/_launcher.pyx:47, but the new ownership transfer exists only in GN_launch() at cuda_core/cuda/core/graph/_graph_node.pyx:623. That means launch(gb, ..., buf) can still capture a kernel node whose Buffer or other Python owner is freed after capture, leaving the graph with a stale raw pointer. This is a real public path, not just a hypothetical one; existing tests exercise captured kernel launches with Buffer args in cuda_core/tests/graph/test_graph_memory_resource.py:171. The new regression tests only cover the explicit GraphDefinition path in cuda_core/tests/graph/test_graph_definition_lifetime.py:497 and cuda_core/tests/graph/test_graph_definition_lifetime.py:533.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

    The PR now keeps ker_args.kernel_args alive at cuda_core/cuda/core/graph/_graph_node.pyx:623, but ParamHolder snapshots a Buffer argument as a raw device pointer value in cuda_core/cuda/core/_kernel_arg_handler.pyx:283 and cuda_core/cuda/core/_kernel_arg_handler.pyx:287. If user code explicitly calls buf.close() or exits a context manager, Buffer_close() resets _h_ptr and can free the allocation immediately at cuda_core/cuda/core/_memory/_buffer.pyx:596. At that point the graph still owns only the stale integer pointer copied during node creation; keeping the Python Buffer object alive does not keep the underlying allocation handle alive.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

    _attach_user_object() creates a CUDA user object and moves it into the graph at cuda_core/cuda/core/graph/_utils.pyx:41, but it does not return any handle that could later be released per node. GraphNode.destroy() only calls cuGraphDestroyNode() at cuda_core/cuda/core/graph/_graph_node.pyx:159. Because CUDA user objects are retained and released at graph scope (cuGraphRetainUserObject / cuGraphReleaseUserObject), deleting or rewiring a kernel node can now leave large kernel-argument owners, including Buffers, pinned until the entire graph is destroyed. Node mutation is already a supported workflow in cuda_core/tests/graph/test_graph_definition_mutation.py:159.

@Andy-Jost
Contributor Author

Andy-Jost commented May 6, 2026

Thanks @rwgk

  1. High: Stream-captured graphs still use the unfixed launcher path

I'll look into this. It might need to be deferred because AFAIK the stream capture path does not create any user objects.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

We have a huge class of possible errors of this type, unfortunately. A better approach than storing the Buffer would be to store the DevicePtrHandle (a std::shared_ptr owning the buffer). Definitely out of scope for this change.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

I have to rework the whole user object design for #1330 (step 4) and I plan to address this.

@leofang
Member

leofang commented May 6, 2026

High: Stream-captured graphs still use the unfixed launcher path

This is a good catch, and it is not fixed with this PR. (Which is why kernel arg update is so messy, as noted during the team sync today). I am fine with this PR only fixing the explicit graph construction path.

The freeing assertion at the end of test_kernel_args_buffer_lifetime
failed on free-threaded Python (py3.14t) because cuGraphExecDestroy
releases its user-object references via an asynchronous DPC, and free-
threaded CPython's deferred ref counting can need an extra GC pass to
settle. Poll the weakref with a bounded timeout and per-iteration GC
instead of asserting eagerly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment on lines +16 to +30
def _wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate() until True or timeout, driving gc each iteration.

    Used for assertions about resource cleanup that may be delayed by CUDA's
    asynchronous user-object destructor pump (DPC) or, on free-threaded
    Python, by deferred reference-count processing. A bounded poll keeps the
    test correct without depending on undocumented driver timing guarantees.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()
        if predicate():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not satisfied within {timeout}s")
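In use, the helper replaces an eager assertion on the weakref with a bounded poll. A self-contained sketch (using a plain Python object in place of a Buffer, and a list in place of the exec graph):

```python
import gc
import time
import weakref

def _wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate() until True or timeout, driving gc each iteration."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()
        if predicate():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not satisfied within {timeout}s")

class Obj:  # weakref-able stand-in for a Buffer
    pass

holder = [Obj()]            # stand-in for the exec graph's user objects
ref = weakref.ref(holder[0])
holder.clear()              # analogous to destroying the exec graph

# Poll until the weakref clears rather than asserting immediately,
# tolerating delayed cleanup without a fixed sleep.
_wait_until(lambda: ref() is None)
assert ref() is None
```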
Member

Wouldn't this still welcome flakiness? I am concerned about this being tested in SWQA hands

Contributor Author

Agreed, it's not perfect, though this is much better than before. Realistically, I think it's either this or we don't test the release condition Rob pointed out. There will be much more work on the graph ownership model, so I expect to revisit all these tests.

Contributor Author

free-threaded Python the resulting Py_DECREF chain may need an extra
GC pass to settle.
"""
from cuda.core._utils.cuda_utils import driver, handle_return
Member

nit: move imports to the top, no need to defer import to here

from cuda.core._utils.cuda_utils import driver, handle_return

_skip_if_no_mempool()
dev = Device()
Member

Suggested change:
- dev = Device()
+ dev = init_cuda

@Andy-Jost Andy-Jost merged commit 35d1722 into NVIDIA:main May 7, 2026
94 checks passed
@Andy-Jost Andy-Jost deleted the ajost/graph-kernel-args-lifetime branch May 7, 2026 14:53
@Andy-Jost
Contributor Author

I merged due to the release deadline, but I will follow up on the open comments.

@github-actions

github-actions Bot commented May 7, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.

leofang added a commit to leofang/cuda-python that referenced this pull request May 7, 2026
…#2047

- New feature: persistent program cache for Program.compile (InMemoryProgramCache,
  FileStreamProgramCache, make_program_cache_key).
- Fix: graph kernel nodes now prevent kernel-argument GC.
- Fix: DeviceEvents.__dealloc__ crash on uninitialized handle.
leofang added a commit that referenced this pull request May 7, 2026
…anup (#2032)

* Document cuda.core support policy

Add support.rst covering versioning (SemVer), CUDA version support
(dual major versions), Python version support (CPython EOL schedule),
free-threading (experimental), and release cadence (bimonthly).

Closes #2030

* Fix broken CCCL URLs and add missing cuda.bindings interfaces

- Update cuda.coop and cuda.compute URLs from the old
  nvidia.github.io/cccl/python/{coop,compute} paths (now 404)
  to the current unstable doc paths.
- Add nvFatbin and NVML to the cuda.bindings interface list.
- Update all three synced files: README.md, cuda_python/DESCRIPTION.rst,
  and cuda_python/docs/source/index.rst.

* Add missing entries to cuda.core 1.0.0 release notes

Add new features (green contexts, system.Device NVML APIs, system.typing
module, NVML enum re-wrapping), breaking changes (tensor bridge behavior,
system.Device renames, privatized helper classes, UUID format change,
removed enums), and bug fixes (is_managed for pool alloc, nvJitLink log
error handling, NVML event set init, Device.arch unknown, empty field
values, runtime error messages, wheel size reduction).

* Update cuda.core docs for 1.0.0 GA

- api.rst: replace pre-1.0 warning with stable-API statement and link
  to support policy.
- install.rst: update free-threading version reference from 0.4.0 to
  1.0.0.
- nv-versions.json: add 1.0.0 entry for the version switcher dropdown.

* Split cuda.core.system API reference into separate page

Move the CUDA system information / NVML section from api.rst into a
dedicated api_nvml.rst. The new page uses its own `.. module::
cuda.core.system` directive so autosummary entries no longer need the
`system.` prefix. Added to index.rst toctree after api.

* Remove algorithm and size details from make_program_cache_key docstring

The Returns section exposed the hash algorithm and digest size, which
are implementation details. Replace with "opaque bytes digest" so the
public API contract does not pin these.

See #2043

* Remove deprecated cuda.core.experimental namespace

The cuda.core.experimental namespace was deprecated in v0.5.0 when all
public APIs moved to the top-level cuda.core namespace. Remove the
backward-compatibility shim and its test as promised for v1.0.0.

* Add missing release note entries for #1912, #2041, #2047

- New feature: persistent program cache for Program.compile (InMemoryProgramCache,
  FileStreamProgramCache, make_program_cache_key).
- Fix: graph kernel nodes now prevent kernel-argument GC.
- Fix: DeviceEvents.__dealloc__ crash on uninitialized handle.

* Update 1.0.0-notes.rst

* expand support policy

* wordsmith

Development

Successfully merging this pull request may close these issues.

Graph kernel nodes don't keep kernel argument objects alive

4 participants