cuda.core: keep kernel-argument objects alive in graph kernel nodes#2041

Merged
Andy-Jost merged 3 commits into NVIDIA:main from Andy-Jost:ajost/graph-kernel-args-lifetime
May 7, 2026

Conversation

Contributor

@Andy-Jost Andy-Jost commented May 6, 2026

Summary

Closes #2039.

GraphDefinition.launch() did not extend the lifetime of Python kernel-argument objects (e.g. Buffer) to the lifetime of the graph. The ownership represented by a ParamHolder constructed in GN_launch needs to be attached to the graph to avoid the possibility of stale arguments producing memory corruption or a crash on launch.

Changes

  • cuda_core/cuda/core/graph/_graph_node.pyx: in GN_launch, attach the kernel_args tuple to the graph as a CUDA user object, mirroring the existing handling of KernelHandle and EventHandle. Reuses the _py_host_destructor path already used by the host-callback machinery.
  • cuda_core/cuda/core/graph/_utils.pxd: expose _py_host_destructor so the new caller can use it.

The new attachment runs only on the graph-construction path and is paid once per kernel node at build time, not at execution time. It does not affect the regular (non-graph) launch path in _launcher.pyx.
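The ownership transfer can be sketched at the Python level (a minimal analogy, not the Cython implementation; `GraphLifetime` and `attach_user_object` are hypothetical names mimicking `cuGraphRetainUserObject` semantics):

```python
import gc
import weakref

class GraphLifetime:
    """Stand-in for a CUDA graph that owns user objects."""
    def __init__(self):
        self._user_objects = []

    def attach_user_object(self, obj):
        # A strong reference held for the lifetime of the graph.
        self._user_objects.append(obj)

def add_kernel_node(graph, kernel_args):
    # Without this attachment, kernel_args could be collected while the
    # graph still holds raw pointers derived from them.
    graph.attach_user_object(kernel_args)

class FakeBuffer:  # weakref-able stand-in for a kernel argument
    pass

graph = GraphLifetime()
args = (FakeBuffer(),)
ref = weakref.ref(args[0])
add_kernel_node(graph, args)

del args
gc.collect()
assert ref() is not None    # the graph keeps the argument alive

del graph
gc.collect()
assert ref() is None        # released together with the graph
```

The second half of the sketch is the symmetric check the review asked for: attachment must not turn into a permanent leak once the graph is destroyed.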

Test Coverage

Two tests added in cuda_core/tests/graph/test_graph_definition_lifetime.py:

  • test_kernel_args_buffer_kept_alive_through_execution: a Buffer passed as a kernel arg survives del buf + gc.collect() (weakref check) and the graph executes correctly against its memory after instantiation (value check).
  • test_kernel_args_survive_graph_clone: same scenario but via cuGraphClone, which doesn't carry Python-level references — only CUDA user objects can keep the args alive across the clone.

Related Work

@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 6, 2026
@Andy-Jost Andy-Jost added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels May 6, 2026
@Andy-Jost Andy-Jost self-assigned this May 6, 2026
`GraphDefinition.launch()` did not extend the lifetime of the Python
kernel-argument objects to the lifetime of the graph. The `ParamHolder`
built in `GN_launch` held the only references to those objects and was
destroyed when `GN_launch` returned. The driver only stores the raw
pointer values in the kernel node, so a `Buffer` reachable only through
the call could be GC'd before the graph ran, leaving the graph with a
stale device pointer.

Attach the `kernel_args` tuple to the graph as a CUDA user object,
mirroring the existing handling of `KernelHandle` and `EventHandle`.
This reuses the `_py_host_destructor` path already used by the host
callback machinery.

Closes NVIDIA#2039

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost force-pushed the ajost/graph-kernel-args-lifetime branch from e50c99e to 6654645 Compare May 6, 2026 21:29

del buf
gc.collect()
assert buf_weak() is not None # graph kept the Buffer alive
Collaborator

The test proves the buffer is kept alive, but it doesn't validate that it's cleaned up after the graph is released.

Contributor Author

@Andy-Jost Andy-Jost May 6, 2026

I added a test for this. If it is flaky, we might need to adjust the CU_USER_OBJECT_NO_DESTRUCTOR_SYNC flag so that graph destructors cannot be invoked asynchronously.

Update: I confirmed this is not a concern for source graphs. Asynchronous destruction only comes into play for exec graphs.

Contributor Author

This test creates an exec graph, so there is a race. CI for free-threaded Python seems more likely to trigger it. 9f2c8f2 adds polling, but removing the test would also be defensible.

Collaborator

@rparolin rparolin left a comment

The tests should validate that the buffer is eventually freed once the graph's refcount is decremented.

Addresses review feedback (PR NVIDIA#2041): the existing test only proved the
graph kept the Buffer alive, not that the user-object machinery actually
releases it once the graph is destroyed. Without the symmetric check, a
working attachment is indistinguishable from a permanent leak.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rwgk
Contributor

rwgk commented May 6, 2026

Below are the Cursor GPT-5.4 Extra High Fast findings. It was thinking far longer than I'd have expected for a PR this size.

I'm not sure which of these are actually actionable:

Re 1. Do we care about stream-captured graphs?
Re 2. Could we simply document that we don't protect against explicit release?
Re 3. This seems OK to me? (I.e. I mean we can ignore this finding, or document the behavior?)

@Andy-Jost


  1. High: Stream-captured graphs still use the unfixed launcher path

    launch() still accepts a GraphBuilder and creates a stack-local ParamHolder in cuda_core/cuda/core/_launcher.pyx:23 and cuda_core/cuda/core/_launcher.pyx:47, but the new ownership transfer exists only in GN_launch() at cuda_core/cuda/core/graph/_graph_node.pyx:623. That means launch(gb, ..., buf) can still capture a kernel node whose Buffer or other Python owner is freed after capture, leaving the graph with a stale raw pointer. This is a real public path, not just a hypothetical one; existing tests exercise captured kernel launches with Buffer args in cuda_core/tests/graph/test_graph_memory_resource.py:171. The new regression tests only cover the explicit GraphDefinition path in cuda_core/tests/graph/test_graph_definition_lifetime.py:497 and cuda_core/tests/graph/test_graph_definition_lifetime.py:533.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

    The PR now keeps ker_args.kernel_args alive at cuda_core/cuda/core/graph/_graph_node.pyx:623, but ParamHolder snapshots a Buffer argument as a raw device pointer value in cuda_core/cuda/core/_kernel_arg_handler.pyx:283 and cuda_core/cuda/core/_kernel_arg_handler.pyx:287. If user code explicitly calls buf.close() or exits a context manager, Buffer_close() resets _h_ptr and can free the allocation immediately at cuda_core/cuda/core/_memory/_buffer.pyx:596. At that point the graph still owns only the stale integer pointer copied during node creation; keeping the Python Buffer object alive does not keep the underlying allocation handle alive.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

    _attach_user_object() creates a CUDA user object and moves it into the graph at cuda_core/cuda/core/graph/_utils.pyx:41, but it does not return any handle that could later be released per node. GraphNode.destroy() only calls cuGraphDestroyNode() at cuda_core/cuda/core/graph/_graph_node.pyx:159. Because CUDA user objects are retained and released at graph scope (cuGraphRetainUserObject / cuGraphReleaseUserObject), deleting or rewiring a kernel node can now leave large kernel-argument owners, including Buffers, pinned until the entire graph is destroyed. Node mutation is already a supported workflow in cuda_core/tests/graph/test_graph_definition_mutation.py:159.

@Andy-Jost
Contributor Author

Andy-Jost commented May 6, 2026

Thanks @rwgk

  1. High: Stream-captured graphs still use the unfixed launcher path

I'll look into this. It might need to be deferred because AFAIK the stream capture path does not create any user objects.

  2. High: Retaining the Buffer wrapper does not protect against explicit release

We have a huge class of possible errors of this type, unfortunately. A better approach than storing the Buffer would be to store the DevicePtrHandle (a std::shared_ptr owning the buffer). Definitely out of scope for this change.

  3. Medium: The new graph-scoped attachment can pin kernel-argument owners after node deletion

I have to rework the whole user object design for #1330 (step 4) and I plan to address this.

@leofang
Member

leofang commented May 6, 2026

High: Stream-captured graphs still use the unfixed launcher path

This is a good catch, and it is not fixed with this PR. (Which is why kernel arg update is so messy, as noted during the team sync today). I am fine with this PR only fixing the explicit graph construction path.

The freeing assertion at the end of test_kernel_args_buffer_lifetime
failed on free-threaded Python (py3.14t) because cuGraphExecDestroy
releases its user-object references via an asynchronous DPC, and free-
threaded CPython's deferred ref counting can need an extra GC pass to
settle. Poll the weakref with a bounded timeout and per-iteration GC
instead of asserting eagerly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment on lines +16 to +30
def _wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate() until True or timeout, driving gc each iteration.

    Used for assertions about resource cleanup that may be delayed by CUDA's
    asynchronous user-object destructor pump (DPC) or, on free-threaded
    Python, by deferred reference-count processing. A bounded poll keeps the
    test correct without depending on undocumented driver timing guarantees.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()
        if predicate():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not satisfied within {timeout}s")
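In use, the helper replaces an eager assertion on the weakref with a bounded poll. A self-contained sketch (using a plain Python object in place of a Buffer, and a list in place of the exec graph):

```python
import gc
import time
import weakref

def _wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate() until True or timeout, driving gc each iteration."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        gc.collect()
        if predicate():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not satisfied within {timeout}s")

class Obj:  # weakref-able stand-in for a Buffer
    pass

holder = [Obj()]            # stand-in for the exec graph's user objects
ref = weakref.ref(holder[0])
holder.clear()              # analogous to destroying the exec graph

# Poll until the weakref clears rather than asserting immediately,
# tolerating delayed cleanup without a fixed sleep.
_wait_until(lambda: ref() is None)
assert ref() is None
```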
Member

Wouldn't this still welcome flakiness? I am concerned about this being tested in SWQA hands

Contributor Author

Agreed, it's not perfect, though this is much better than before. Realistically, I think it's either this or we don't test the release condition Rob pointed out. There will be much more work on the graph ownership model, so I expect to revisit all these tests.

Contributor Author

free-threaded Python the resulting Py_DECREF chain may need an extra
GC pass to settle.
"""
from cuda.core._utils.cuda_utils import driver, handle_return
Member

nit: move imports to the top, no need to defer import to here

from cuda.core._utils.cuda_utils import driver, handle_return

_skip_if_no_mempool()
dev = Device()
Member

Suggested change:
- dev = Device()
+ dev = init_cuda

@Andy-Jost Andy-Jost merged commit 35d1722 into NVIDIA:main May 7, 2026
94 checks passed
@Andy-Jost Andy-Jost deleted the ajost/graph-kernel-args-lifetime branch May 7, 2026 14:53
@Andy-Jost
Contributor Author

I merged due to the release deadline, but I will follow up on the open comments.

@github-actions

github-actions Bot commented May 7, 2026

Doc Preview CI
Preview removed because the pull request was closed or merged.

leofang added a commit to leofang/cuda-python that referenced this pull request May 7, 2026
…#2047

- New feature: persistent program cache for Program.compile (InMemoryProgramCache,
  FileStreamProgramCache, make_program_cache_key).
- Fix: graph kernel nodes now prevent kernel-argument GC.
- Fix: DeviceEvents.__dealloc__ crash on uninitialized handle.
leofang added a commit that referenced this pull request May 7, 2026
…anup (#2032)

* Document cuda.core support policy

Add support.rst covering versioning (SemVer), CUDA version support
(dual major versions), Python version support (CPython EOL schedule),
free-threading (experimental), and release cadence (bimonthly).

Closes #2030

* Fix broken CCCL URLs and add missing cuda.bindings interfaces

- Update cuda.coop and cuda.compute URLs from the old
  nvidia.github.io/cccl/python/{coop,compute} paths (now 404)
  to the current unstable doc paths.
- Add nvFatbin and NVML to the cuda.bindings interface list.
- Update all three synced files: README.md, cuda_python/DESCRIPTION.rst,
  and cuda_python/docs/source/index.rst.

* Add missing entries to cuda.core 1.0.0 release notes

Add new features (green contexts, system.Device NVML APIs, system.typing
module, NVML enum re-wrapping), breaking changes (tensor bridge behavior,
system.Device renames, privatized helper classes, UUID format change,
removed enums), and bug fixes (is_managed for pool alloc, nvJitLink log
error handling, NVML event set init, Device.arch unknown, empty field
values, runtime error messages, wheel size reduction).

* Update cuda.core docs for 1.0.0 GA

- api.rst: replace pre-1.0 warning with stable-API statement and link
  to support policy.
- install.rst: update free-threading version reference from 0.4.0 to
  1.0.0.
- nv-versions.json: add 1.0.0 entry for the version switcher dropdown.

* Split cuda.core.system API reference into separate page

Move the CUDA system information / NVML section from api.rst into a
dedicated api_nvml.rst. The new page uses its own `.. module::
cuda.core.system` directive so autosummary entries no longer need the
`system.` prefix. Added to index.rst toctree after api.

* Remove algorithm and size details from make_program_cache_key docstring

The Returns section exposed the hash algorithm and digest size, which
are implementation details. Replace with "opaque bytes digest" so the
public API contract does not pin these.

See #2043

* Remove deprecated cuda.core.experimental namespace

The cuda.core.experimental namespace was deprecated in v0.5.0 when all
public APIs moved to the top-level cuda.core namespace. Remove the
backward-compatibility shim and its test as promised for v1.0.0.

* Add missing release note entries for #1912, #2041, #2047

- New feature: persistent program cache for Program.compile (InMemoryProgramCache,
  FileStreamProgramCache, make_program_cache_key).
- Fix: graph kernel nodes now prevent kernel-argument GC.
- Fix: DeviceEvents.__dealloc__ crash on uninitialized handle.

* Update 1.0.0-notes.rst

* expand support policy

* wordsmith

Development

Successfully merging this pull request may close these issues.

Graph kernel nodes don't keep kernel argument objects alive

4 participants