Feature Request
Add Python bindings and a `cuda.core` abstraction for CUDA Multicast Objects
(the `cuMulticastCreate` / `cuMulticastAddDevice` / `cuMulticastBindMem` /
`cuMulticastBindAddr` / `cuMulticastUnbind` family of driver APIs).
Motivation
Multicast Objects enable a single virtual address to multicast memory accesses
across multiple devices, which is foundational for efficient multi-GPU
collectives and SHARP-style reductions on NVLink-connected systems (Hopper
and later). They are already exposed via the low-level `cuda.bindings` driver
API, but there is no high-level `cuda.core` object wrapping them.
This is one of the last major CUDA driver features without a `cuda.core`
counterpart. Internal teams building multi-GPU libraries (RAPIDS, Triton,
cuBLASMp, etc.) currently have to drop to raw driver calls.
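For context, a rough sketch of what that raw driver-level flow looks like today via `cuda.bindings`. This is illustrative only: it assumes CUDA 12.x bindings and multicast-capable hardware (e.g. NVLink-connected Hopper GPUs), and elides error checking:

```python
# Sketch of the current low-level multicast setup via cuda.bindings
# (illustrative; field and enum names per the CUDA driver API docs).
from cuda.bindings import driver

def create_multicast(device_ids, size):
    prop = driver.CUmulticastObjectProp()
    prop.numDevices = len(device_ids)
    prop.size = size
    prop.handleTypes = (
        driver.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR)

    # Round the requested size up to the recommended granularity.
    err, gran = driver.cuMulticastGetGranularity(
        prop,
        driver.CUmulticastGranularity_flags.CU_MULTICAST_GRANULARITY_RECOMMENDED)
    prop.size = ((size + gran - 1) // gran) * gran

    # Create the multicast object and register each participating device.
    err, mc_handle = driver.cuMulticastCreate(prop)
    for dev in device_ids:
        (err,) = driver.cuMulticastAddDevice(mc_handle, dev)

    # Caller still has to cuMulticastBindMem per device and map the handle
    # into each device's VA space via the virtual memory management APIs.
    return mc_handle
```

Every consumer of this feature currently re-implements some variant of this boilerplate, which is the gap a `cuda.core` wrapper would close.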
Proposed Scope
- A `MulticastObject` (or similarly named) class with:
  - Constructor / factory taking granularity + participating-device list
  - `add_device(device)` / `bind_memory(buffer)` / `bind_address(...)` /
    `unbind(...)` methods
  - Context-manager lifecycle (auto-release on exit)
- Integration with existing `Device`, `Buffer`, and `VirtualMemoryResource`
- Query helpers for multicast granularity (`cuMulticastGetGranularity`)
- Example(s) under `cuda_core/examples/` demonstrating a simple 2-GPU
  multicast allreduce-style pattern
- API reference entry in `cuda_core/docs/source/api.rst`
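To make the scope concrete, here is a hypothetical usage sketch of the proposed class. All names below (`MulticastObject`, `create`, `bind_memory`, `unbind`, `mc.size`) are placeholders for discussion, not an existing `cuda.core` API, and the allocation calls are likewise only indicative:

```python
# Hypothetical usage of the proposed cuda.core wrapper (all names are
# placeholders; this API does not exist yet).
from cuda.core.experimental import Device

dev0, dev1 = Device(0), Device(1)

# Proposed factory: takes a size and the participating devices, and rounds
# the size up to the multicast granularity internally.
with MulticastObject.create(size=1 << 20, devices=[dev0, dev1]) as mc:
    # Bind per-device physical backing into the multicast object.
    buf0 = dev0.allocate(mc.size)        # placeholder allocation call
    mc.bind_memory(buf0, offset=0)       # proposed bind_memory(buffer)
    ...                                  # map mc into each device's VA space,
                                         # launch kernels against it
    mc.unbind(dev0)                      # proposed unbind(...)
# Context exit releases remaining bindings and the multicast handle.
```

The context-manager lifecycle mirrors how other `cuda.core` resources are scoped, so teardown ordering (unbind before handle release) is handled for the user.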
Related
- Tracking gaps identified in the cuda.core feature audit (Nov 2025) — other
  untracked gaps include `host_launch` and `Library` (`cuLibrary`) APIs; those
  can be filed separately.
- Relevant driver APIs: `cuMulticastCreate`, `cuMulticastAddDevice`,
  `cuMulticastBindMem`, `cuMulticastBindAddr`, `cuMulticastUnbind`,
  `cuMulticastGetGranularity`.