Summary
_py_host_destructor (cuda_core/cuda/core/graph/_utils.pyx) is the destroy callback we attach to CUDA user objects that hold Python references for graph-resource lifetime. It is declared noexcept with gil and unconditionally calls Py_DECREF:
    cdef void _py_host_destructor(void* data) noexcept with gil:
        _py_decref(data)
It is attached via _attach_user_object with the CU_USER_OBJECT_NO_DESTRUCTOR_SYNC flag, which explicitly allows CUDA to invoke the destructor asynchronously on a worker thread, decoupled from cuGraphDestroy.
This creates a window during interpreter shutdown:
- Python starts shutdown; GraphDefinition.__dealloc__ runs and calls cuGraphDestroy.
- CUDA queues the destructor for the user object on a worker thread. Py_Finalize completes.
- The CUDA worker later runs _py_host_destructor -> with gil calls PyGILState_Ensure after the runtime is gone -> undefined behavior, typically a crash.
In practice this is usually masked because CUDA tends to run the destructor synchronously inside cuGraphDestroy, but the contract permits the bad ordering and the codebase should not depend on the lucky timing.
Affected callers
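All current users of _py_host_destructor:
- _attach_host_callback_to_graph (Python callable and ctypes CFuncPtr paths, plus bytes-backed user_data); pre-existing.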
Proposed fix
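Guard the decref with Py_IsInitialized(). The check has to run before the GIL is acquired: a function declared with gil calls PyGILState_Ensure on entry, so a guard inside its body would run too late. Declare the callback nogil instead and take the GIL only after the check passes:

    cdef extern from "Python.h":
        int Py_IsInitialized() nogil

    cdef void _py_host_destructor(void* data) noexcept nogil:
        # Test before touching the GIL; PyGILState_Ensure is unsafe once the runtime is gone.
        if not Py_IsInitialized():
            return  # Process is exiting; the OS will reclaim everything.
        with gil:
            _py_decref(data)

This narrows the window rather than closing it completely, since finalization can still begin between the check and the GIL acquisition.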
An alternative is to drop CU_USER_OBJECT_NO_DESTRUCTOR_SYNC for Python-typed user objects so destructors always run synchronously inside cuGraphDestroy (where we are guaranteed to hold the GIL). That is more robust but may have performance implications and changes existing semantics for the host-callback path; the Py_IsInitialized() guard is the smaller, less invasive change.
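The shutdown window and the two mitigations above can be illustrated with a toy model in plain Python. This is only a sketch: it involves no CUDA or Cython, the simulate function and its event names are hypothetical, and a deque stands in for whatever mechanism CUDA uses to defer destructors. It shows that the unguarded asynchronous path is the only one that touches the runtime after finalization.

```python
from collections import deque


def simulate(guarded: bool, synchronous: bool) -> list[str]:
    """Toy model of the shutdown ordering; returns the events that occurred."""
    state = {"initialized": True}  # stand-in for Py_IsInitialized()
    events: list[str] = []
    pending = deque()  # stand-in for CUDA's deferred destructor queue

    def destructor() -> None:
        # Stand-in for _py_host_destructor.
        if state["initialized"]:
            events.append("decref")
        elif guarded:
            events.append("skipped")
        else:
            # Undefined behavior in the real code; recorded here as an event.
            events.append("use-after-finalize")

    # Step 1: graph destruction either runs the destructor or defers it.
    if synchronous:
        destructor()
    else:
        pending.append(destructor)

    # Step 2: interpreter finalization completes.
    state["initialized"] = False

    # Step 3: the worker drains whatever was deferred.
    while pending:
        pending.popleft()()
    return events
```

Calling simulate(guarded=False, synchronous=False) reproduces the bad ordering, while either setting guarded=True (the Py_IsInitialized() guard) or synchronous=True (dropping the flag) avoids it. The guarded case skips the decref and leaks the reference, which is harmless because the process is exiting.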
Context
This should be fixed in the context of #1330. The broader graph-lifetime/update work tracked there is the natural place to review the user-object lifetime model end to end.