cuda.core: convert peer_accessible_by to a live MutableSet view#2018
Andy-Jost wants to merge 13 commits into NVIDIA:main from
Conversation
DeviceMemoryResource.peer_accessible_by previously returned a sorted
tuple[int, ...] backed by a Python-level cache, which was prone to
divergence from driver state across multiple wrappers around the same
memory pool. The setter accepted Device | int and emitted a single
batched cuMemPoolSetAccess covering the diff against the cache.
This commit replaces the property with a live driver-backed view:
- Adds PeerAccessibleBySetProxy in _memory/_peer_access_utils.py, a
collections.abc.MutableSet whose reads call cuMemPoolGetAccess and
whose writes call cuMemPoolSetAccess. Iteration yields Device
objects; add, discard, and __contains__ accept either a Device or a
device-ordinal int. The proxy is constructed fresh on every property
access, so there is nothing to cache or pickle.
- Drops the _peer_accessible_by cache field (and its initializations
in __cinit__, _DMR_init, and from_allocation_handle), eliminating
the owned/non-owned read split. All pools now share the same code
path and always query the driver.
- All bulk operations on the proxy (update, |=, &=, -=, ^=, clear,
pop) issue exactly one cuMemPoolSetAccess call. Peer-access
transitions can take seconds per pool because every existing memory
mapping is updated, so coalescing into a single driver call lets the
toolkit handle the mappings in parallel. The property setter
(mr.peer_accessible_by = [...]) preserves its original single-call
behavior via the same shared planner path.
- Single-element add validates can_access_peer through
plan_peer_access_update, matching the existing setter contract.
This is a breaking change captured in the v1.0.0 release notes.
Callers comparing against tuples must update to set comparisons
(mr.peer_accessible_by == {Device(0)}). Existing tests are migrated;
new tests for set-interface conformance are intentionally deferred to
a follow-up.
Co-authored-by: Cursor <cursoragent@cursor.com>
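As a rough illustration of the "live view" idea the commit message describes — and nothing beyond it — a driver-backed set can be reduced to a `MutableSet` whose every read and write goes straight to the backing store, with nothing cached on the proxy. The class and store below are illustrative stand-ins, not cuda.core code:

```python
from collections.abc import MutableSet

class LiveSetView(MutableSet):
    """Minimal sketch: reads and writes hit the backing store directly."""

    def __init__(self, store):
        self._store = store          # stands in for the driver-owned pool state

    def _snapshot(self):
        return set(self._store)      # stands in for cuMemPoolGetAccess queries

    def __contains__(self, value):
        return value in self._snapshot()

    def __iter__(self):
        return iter(sorted(self._snapshot()))

    def __len__(self):
        return len(self._snapshot())

    def add(self, value):
        self._store.add(value)       # stands in for cuMemPoolSetAccess

    def discard(self, value):
        self._store.discard(value)
```

Because nothing is held on the proxy, a change made through any other wrapper of the same store is visible immediately — the divergence problem the cache had simply cannot occur.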
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
FYI: A breaking change must be labeled with P0.
…s.pyx

The previous commit left DeviceMemoryResource carrying three pass-through def methods (_query_peer_access_ids, _peer_access_includes, _apply_peer_access_diff) whose only purpose was to give the pure-Python proxy in _peer_access_utils.py a way to call cdef helpers in _device_memory_resource.pyx. These methods served no public role and cluttered the class API.

Promote _peer_access_utils.py to a Cython module so the proxy and the driver-touching helpers can live together:

- Convert _peer_access_utils.py to _peer_access_utils.pyx. cimports cydriver and DeviceMemoryResource from the .pxd; uses nogil and direct CUmemAccessDesc packing identically to before.
- Move _DMR_query_peer_access_ids, _DMR_peer_access_includes, _DMR_apply_peer_access_diff, and _DMR_replace_peer_accessible_by from _device_memory_resource.pyx into the new module as cdef helpers (and a cpdef replace_peer_accessible_by entry point used by the property setter).
- Drop the three pass-through def methods from DeviceMemoryResource. The class is left with the property getter and setter only; everything else is module-level in _peer_access_utils.
- The proxy now calls the module-level cdef helpers directly instead of routing through methods on mr.

No behavior change. The public surface (PeerAccessibleBySetProxy, plan_peer_access_update, normalize_peer_access_targets, PeerAccessPlan) is preserved at the same import paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
Refactor _query_peer_access_ids so the entire driver loop runs inside a single nogil block instead of acquiring/releasing the GIL once per device. The flag query now uses a cached as_cu(mr._h_pool) handle and fills a libcpp.vector[int]; because range(total) ascends, the result is already sorted and the trailing sorted() call is dropped.

Also tighten the peer_accessible_by entry in 1.0.0-notes.rst: the breaking-change blurb only needs to state the type/element change, so remove the implementation-flavored details about input acceptance and batched cuMemPoolSetAccess calls.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ge cases

Existing peer-access tests covered the integration path well (real copies across peers, the full transition matrix, shared-pool consistency) but only touched ``in``, ``==``, and the property setter on the new set proxy. After the v1.0.0 break that surfaced ~25 ``MutableSet`` methods, nothing was pinning the type-coercion contract, the owner-filtering behavior, the ``KeyError``/value-error paths, or the "one ``cuMemPoolSetAccess`` per bulk op" performance invariant.

Add the following coverage in ``test_memory_peer_access.py``:

- A ``MutableSet`` conformance test using a relaxed ``assert_mutable_set_interface`` mode that admits subjects holding at most one insertable element. CI maxes at two GPUs (one peer), so the multi-element protocol pass cannot run there. The new ``support_multi_insert=False`` path takes one insertable item plus two non-member sentinels and exercises every ``MutableSet`` method (``add``/``discard``/``remove``/``pop``/``clear``/``update``, comparisons, ``isdisjoint``, subset/superset, binary and in-place operators, ``__iter__``/``__len__``/``__repr__``).
- ``Device``/``int`` interchangeability on ``add``/``discard``/``__contains__``.
- The owner-device filtering contract on every write (silent no-op).
- Error paths: ``add(out_of_range)`` and ``add(non_coercible)`` raise while the lenient ``discard``/``__contains__`` paths swallow the same inputs; ``remove(non_member)`` raises ``KeyError``.
- "Live driver view" semantics: a proxy obtained before another wrapper modifies the pool reflects the change with no refresh step.
- ``__iter__`` ordering is ascending by ``device_id`` and elements are ``Device`` instances; ``__repr__`` includes the class name and tracks live contents; the getter returns the documented proxy type.
- A batching spy that monkeypatches the module-level ``_apply_peer_access_diff`` and asserts that every bulk op (``|=``/``&=``/``-=``/``^=``/``update``/``difference_update``/``clear``) and the property setter issues at most one driver call, zero when the diff is empty.

To make the spy possible, ``_apply_peer_access_diff`` is now a Python-visible ``def`` wrapper around a renamed ``_apply_peer_access_diff_cython`` ``cdef inline``. The proxy and the property setter still call ``_apply_peer_access_diff`` by bare name, which Cython resolves through the module's globals at runtime, so a ``monkeypatch.setattr(_peer_access_utils, "_apply_peer_access_diff", ...)`` intercepts them. The extra Python-level dispatch is negligible next to ``cuMemPoolSetAccess`` itself.

Co-authored-by: Cursor <cursoragent@cursor.com>
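The globals-resolution trick the commit relies on can be shown in plain Python (illustrative names; the `try`/`finally` restore is what `monkeypatch` does automatically): a caller that looks a helper up by bare name resolves it through module globals at call time, so rebinding the module attribute intercepts every subsequent call.

```python
def _apply_diff(delta):
    # stands in for the real Python-visible driver wrapper
    return "real driver call"

def bulk_op(delta):
    # resolves _apply_diff through module globals at call time,
    # exactly the property Cython preserves for bare-name calls
    return _apply_diff(delta)

calls = []

def spy(delta):
    calls.append(delta)
    return "spied"

original = _apply_diff
_apply_diff = spy            # what monkeypatch.setattr(module, name, spy) does
try:
    result = bulk_op({1})
finally:
    _apply_diff = original   # monkeypatch undoes this automatically

assert result == "spied"
assert calls == [{1}]
```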
Augmented assignment on the ``peer_accessible_by`` property
(``dmr.peer_accessible_by |= {...}``) is two trips through the
proxy/setter pair, not one: Python fetches the proxy, the proxy mutates
itself in place via ``__ior__``, and Python then assigns the
(already-mutated) proxy back through the setter. That trailing setter
call computes the diff against current driver state, finds it empty,
and short-circuits inside the ``cdef inline`` before issuing any
``cuMemPoolSetAccess`` work — so the *driver-level* contract ("one
batched call per bulk op") still holds, but the wrapper is invoked
twice, which the spy was counting.
Also, the fixture's ``dmr.peer_accessible_by = []`` reset on an already
empty pool is itself an empty-delta wrapper call.
Filter the recorded calls down to those with non-empty deltas (the ones
that translate to real driver work) and switch the bulk-ops test to use
a locally bound proxy so augmented assignment goes through ``__ior__``
once with no extra setter invocation. The setter test stays on
``dmr.peer_accessible_by = ...`` because that is the public API
contract under test there.
Co-authored-by: Cursor <cursoragent@cursor.com>
…hing
Move the actual ``cuMemPoolSetAccess`` invocation (descriptor-array
build + driver call) into a thin Python-visible ``def _set_pool_access``
in ``_peer_access_utils.pyx``. ``_apply_peer_access_diff`` now does only
the empty-diff short-circuit and delegates the work to
``_set_pool_access``, which Cython resolves through the module globals
at runtime so tests can intercept it via ``monkeypatch.setattr``.
Replace the previous internal-wrapper spy with a driver-call spy that
counts every real ``cuMemPoolSetAccess`` invocation. Earlier no-op
layers (e.g. the augmented-assignment-on-property pattern that writes
an already-mutated proxy back through the setter) short-circuit before
reaching ``_set_pool_access``, so the recorded count is exactly the
number of driver calls. The empty-delta filter and the local-binding
workaround in the bulk-ops test are gone; we now also assert that
``dmr.peer_accessible_by |= {...}`` directly on the property is still
exactly one driver call.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the ``support_multi_insert`` flag and ``non_members`` keyword with two purpose-built helpers:

- ``assert_mutable_set_interface(subject, items)`` keeps the original signature and contract: at least five distinct insertable items, exercised against a reference set in the standard way. The graph ``AdjacencySetProxy`` test continues to use this unchanged.
- ``assert_single_member_mutable_set_interface(subject, member, non_member)`` is a focused pass for proxies whose backing store admits at most one insertable element at a time (here, the peer-access view on a 2-GPU box). It threads a single member and one non-member sentinel through every ``MutableSet`` method.

The two helpers share small private utilities (empty-state checks, ``__repr__`` shape) but keep their public surfaces small and linear. A capacity-one proxy is a meaningfully different contract from a general mutable set; naming that explicitly in the API reads better than a flag and avoids forcing call sites to plumb sentinels through.

Co-authored-by: Cursor <cursoragent@cursor.com>
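A hedged sketch of what the capacity-one conformance pass might look like, under the contract described above. The function name matches the commit message; the body here is illustrative, not the real test utility:

```python
def assert_single_member_mutable_set_interface(subject, member, non_member):
    """Thread one member and one non-member sentinel through MutableSet methods."""
    subject.clear()
    assert len(subject) == 0 and member not in subject
    subject.add(member)                       # grow to the single member
    assert member in subject and len(subject) == 1
    assert subject.isdisjoint({non_member})   # comparisons against the sentinel
    assert subject <= {member, non_member}
    assert subject >= set()
    subject.discard(non_member)               # lenient no-op on a non-member
    assert member in subject
    popped = subject.pop()                    # shrink back to empty
    assert popped == member and len(subject) == 0
    subject |= {member}                       # in-place union restores it
    assert member in subject
```

A plain `set` satisfies the same contract, which makes the helper easy to sanity-check: `assert_single_member_mutable_set_interface(set(), 1, 2)` runs without raising.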
/ok to test
def assert_single_member_mutable_set_interface(subject, member, non_member):
    """Exercise every MutableSet method on a subject with capacity one.

    Use this for proxies whose backing store admits at most one insertable
    element at a time (typically because the underlying resource is bounded
    by hardware, e.g. a peer-access view on a system with a single valid
    peer device). The subject only ever holds ``set()`` or ``{member}``;
    *non_member* supplies the right-hand side of comparisons, ``isdisjoint``,
    subset/superset, and binary/in-place operators so every ``MutableSet``
    method is exercised at least once.
Let's get this PR merged and apply it in #1775 (assuming we still have time).
Q: It is a big diff with a very niche use case. Are we sure this is P0?
The peer access list is currently a tuple. This change aligns it with the set proxies in the graph module. This change is not functionally as big as it might appear. (1) Most of the bulk is just code movement: I moved the peer access implementation from
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/docs/source/api_private.rst
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/docs/source/api_private.rst
#	cuda_core/docs/source/release/1.0.0-notes.rst
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/cuda/core/_memory/_device_memory_resource.pyx
leofang left a comment
Overall well-structured PR. Cache elimination is the right call, proxy pattern is consistent with AdjacencySetProxy / AccessedBySetProxy, and test coverage is thorough. Comments tagged Critical / Consideration / Nitpick.
def __contains__(self, value) -> bool:
    try:
        dev_id = _resolve_peer_device_id(value)
    except (TypeError, ValueError):
Critical — __contains__ can raise on out-of-range int, violating Python convention.
_resolve_peer_device_id calls Device(value). For out-of-range ints, Device() may raise CUDAError (not ValueError or TypeError), which isn't caught here. So 999 in proxy would propagate a CUDAError instead of returning False.
Python's __contains__ contract is that it never raises — callers rely on in being a safe, boolean predicate. The same gap exists in discard (line 261) and the bulk-op resolution loops (difference_update, intersection_update, symmetric_difference_update).
Suggest either:
- catching CUDAError here too (and in the other resolution try/except blocks), or
- having Device() normalize out-of-range ordinals to ValueError upstream.
The existing test (test_peer_accessible_by_rejects_invalid_inputs) covers "not-a-device" not in proxy (TypeError, caught) but not bad_id in proxy (CUDAError, uncaught).
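The convention the review invokes can be shown with a minimal stand-in (hypothetical class; `int()` plays the role of `_resolve_peer_device_id`): membership tests swallow resolution errors and return `False` rather than propagating.

```python
class SafeMembership:
    """Sketch: __contains__ never raises; non-coercible input is just 'not in'."""

    def __init__(self, ids):
        self._ids = set(ids)

    def __contains__(self, value):
        try:
            dev_id = int(value)      # stands in for _resolve_peer_device_id
        except (TypeError, ValueError):
            return False             # swallow, per the __contains__ contract
        return dev_id in self._ids

m = SafeMembership([0, 1])
assert 0 in m
assert "not-a-device" not in m       # ValueError caught, no exception escapes
```

The gap the review identifies is an exception type outside the `except` tuple: if resolution raised, say, a driver error instead of `TypeError`/`ValueError`, it would escape `in` unchanged.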
Device(bad_id) raises ValueError at cuda/core/_device.pyx:984. I will add a test to confirm in the next commit.
    dev_id = _resolve_peer_device_id(value)
except (TypeError, ValueError):
    return
self._apply((), [dev_id])
Consideration — double resolution in discard.
discard resolves value to dev_id (int) here, then passes [dev_id] into _apply, which iterates removals and calls _resolve_peer_device_id(value) again on each element (line 381). That second call does Device(int).device_id — an extra Device() construction per element.
For a single-element discard this is one redundant Device() allocation. Not wrong, but since cuda.core cares about O(100 ns)-level overhead, you could either:
- have _apply accept already-resolved ints for removals (a separate parameter or a sentinel), or
- pass the raw value through and let _apply do the only resolution.
My 2c is this level of optimization is not needed these APIs. Peer access changes iterate through all memory mappings and can take on the order of seconds to complete. I'd favor simpler code here.
cdef DeviceMemoryResource mr = <DeviceMemoryResource>self._mr
owner_id = mr._dev_id
owner = Device(owner_id)
current = _query_peer_access_ids(mr)
Consideration — every add/discard does a full device scan.
_apply unconditionally calls _query_peer_access_ids(mr) here, which loops over all devices with one cuMemPoolGetAccess per device. For add(dev1), a cheaper path would be:
1. _peer_access_includes(mr, dev_id) — already granted? return.
2. Validate can_access_peer for the single device.
3. _set_pool_access(mr, (dev_id,), ()).
The current approach is correct and consistent (reuses the planner, validates can_access_peer), so this is about whether the per-device scan matters in practice. On a system with many GPUs (DGX-H100: 8 GPUs = 7 cuMemPoolGetAccess calls per add), it could add up.
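The suggested three-step path can be sketched with plain-Python stand-ins for the Cython/driver helpers (all names here are hypothetical; the return value counts `cuMemPoolSetAccess`-equivalent calls):

```python
def add_fast(pool_access, dev_id, can_access_peer):
    """pool_access: set of granted ids; can_access_peer: validation predicate.

    Returns the number of driver set-access calls issued (0 or 1).
    """
    if dev_id in pool_access:        # _peer_access_includes: already granted
        return 0                     # common case: no driver write at all
    if not can_access_peer(dev_id):  # validate just the one target device
        raise ValueError(f"device {dev_id} is not peer-accessible")
    pool_access.add(dev_id)          # stands in for _set_pool_access
    return 1

granted = {1}
assert add_fast(granted, 1, lambda d: True) == 0   # already granted
assert add_fast(granted, 2, lambda d: True) == 1   # one batched set-access
```

The trade-off named in the review is visible here: the fast path touches only the target device, where the planner-based path first enumerates access state for every device on the system.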
# These tests exercise the ``PeerAccessibleBySetProxy`` surface added in
# v1.0.0. They run against ``mempool_device_x2`` because every CI machine has
# at most 2 GPUs, which means at most one valid peer device. The relaxed
# ``support_multi_insert=False`` path on ``assert_mutable_set_interface``
Nitpick — stale comment. This mentions support_multi_insert=False path on assert_mutable_set_interface, but the actual code below calls assert_single_member_mutable_set_interface — a separate function, not a flag. Looks like a leftover from an earlier iteration.
``CUgraphConditionalHandle`` value. Previously, ``.handle`` had to be
extracted explicitly.

- :attr:`DeviceMemoryResource.peer_accessible_by` now returns a
Nitpick — other breaking-change entries link to their PR/issue (e.g. #1950, #2016). This one could link to #2018 for consistency.
@Andy-Jost there are merge conflicts that need to be resolved
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>
- Optimize add/discard to check the single target device (1 driver call) instead of scanning all devices via _query_peer_access_ids
- Add test for out-of-range int in __contains__ (returns False)
- Fix stale comment referencing removed support_multi_insert flag
- Add NVIDIA#2018 link to release notes entry

Co-authored-by: Cursor <cursoragent@cursor.com>
Summary

`DeviceMemoryResource.peer_accessible_by` previously returned a sorted `tuple[int, ...]` backed by a Python-level cache. This is fragile in two ways: the cache can diverge from driver state across multiple wrappers around the same memory pool (#1720), and tuple semantics force callers into bespoke splice/replace patterns instead of standard set operations. This PR replaces the property with a live driver-backed `collections.abc.MutableSet` view, matching the proxy patterns established by `AdjacencySetProxy` (graphs) and the in-flight `AccessedBySet` (managed-memory advice, #1775).

This is a breaking change, captured in the v1.0.0 release notes.

Changes

New: `PeerAccessibleBySetProxy`

A `MutableSet` subclass living in `cuda_core/cuda/core/_memory/_peer_access_utils.pyx`:

- Reads (`__contains__`, `__iter__`, `__len__`) call `cuMemPoolGetAccess`.
- Writes (`add`, `discard`, and bulk ops) call `cuMemPoolSetAccess`.
- Iteration yields `Device` objects; `add`, `discard`, `__contains__` accept `Device | int`.
- All bulk ops (`update`, `|=`, `&=`, `-=`, `^=`, `clear`) issue exactly one `cuMemPoolSetAccess` call. This matters: peer-access transitions can take seconds per pool because every existing memory mapping is updated, so coalescing into a single driver call lets the toolkit handle the mappings in parallel.

The proxy is constructed fresh on every property access. There is nothing to cache or pickle.

Cache removal

The `_peer_accessible_by` field on `DeviceMemoryResource` is dropped, along with its initializations in `__cinit__`, `_DMR_init`, and `from_allocation_handle`. The owned/non-owned read split is gone — every read path now queries the driver directly. This eliminates the bug class fixed in #1720 and simplifies the implementation.

Module consolidation

All peer-access internals now live in `_peer_access_utils.pyx` (promoted from a `.py`). The `cdef inline` driver helpers (`_query_peer_access_ids`, `_peer_access_includes`, `_set_pool_access`) and the `cpdef replace_peer_accessible_by` setter helper moved out of `_device_memory_resource.pyx`. `DeviceMemoryResource` now exposes only the public `peer_accessible_by` property: the getter returns `PeerAccessibleBySetProxy(self)` and the setter delegates to `replace_peer_accessible_by(self, devices)`. This keeps the DMR surface focused on its memory-management role.

Property setter

`mr.peer_accessible_by = [...]` is preserved and unchanged in behavior. It still does a single batched `cuMemPoolSetAccess` via the same shared `plan_peer_access_update` path the proxy uses for bulk ops.

`_query_peer_access_ids` GIL optimization

The driver loop that enumerates peer device IDs now runs inside a single `nogil` block instead of acquiring/releasing the GIL once per device. Per-call data is collected into a `libcpp.vector[int]`, and because `range(total)` ascends the result is already sorted (so the trailing `sorted()` is gone). This is a local performance tweak with no behavior change.

Test coverage

The existing integration tests (`test_memory_peer_access.py`, `memory_ipc/test_peer_access.py`) migrated from tuple equality to set-of-`Device` equality. Eight new tests pin the proxy's contract end-to-end, all using the existing `mempool_device_x2` fixture so they run on CI:

- An `assert_single_member_mutable_set_interface(subject, member, non_member)` helper exercises every `MutableSet` method against subjects whose backing store admits at most one insertable element (the peer-access proxy can hold at most `{dev1}` on a 2-GPU box). The original `assert_mutable_set_interface(subject, items)` is unchanged for normal multi-element subjects (`AdjacencySetProxy`).
- `Device`/`int` interchangeability on `add`/`discard`/`__contains__`.
- Error paths: `add(out_of_range)` and `add(non_coercible)` raise; `discard`/`__contains__` swallow the same inputs; `remove(non_member)` raises `KeyError`.
- `__iter__` ordering is ascending by `device_id`; elements are `Device` instances; `__repr__` shape; getter return type.
- A `monkeypatch` spy on `_set_pool_access` (the thin Python-visible helper extracted from `_apply_peer_access_diff` purely so tests can spy on the actual driver call). Every bulk op (`|=`/`&=`/`-=`/`^=`/`update`/`difference_update`/`clear`) and the property setter is asserted to issue exactly one `cuMemPoolSetAccess`, zero on no-op. This includes the `dmr.peer_accessible_by |= {...}` augmented-assignment-on-property pattern, where the empty-diff short-circuit prevents a second driver call from the trailing setter write.

Breaking change

`DeviceMemoryResource.peer_accessible_by` no longer returns a `tuple[int, ...]`. Callers must update:

- Tuple comparisons (`mr.peer_accessible_by == (0,)`) -> set comparisons (`mr.peer_accessible_by == {Device(0)}`).
- Elements are `Device` objects (use `[d.device_id for d in mr.peer_accessible_by]` if int IDs are needed).

Setter usage (`mr.peer_accessible_by = [...]`) is unchanged.

Test plan

- `test_memory_peer_access.py` tests pass after migration to set semantics.
- `test_memory_peer_access_utils.py` tests still pass (`plan_peer_access_update` and `normalize_peer_access_targets` are unchanged).
- `memory_ipc/test_peer_access.py` tests pass (IPC-enabled pools, parent-child IPC scenarios).
- The `cuda_core` test suite passes on a single-GPU box (2931 passed, 212 skipped — peer-access tests gated on `mempool_device_x2`/`x3`).
- `pre-commit run --all-files` passes (ruff, format, SPDX, cython-lint, RST checks).

Related work

- `AccessedBySet`, the second driver-backed set proxy in `cuda.core` (managed-memory advice, #1775).
- `cuda_core/cuda/core/graph/_graph_node.pyx` (`AdjacencySetProxy`) — reference pattern for driver-backed set proxies. Same `MutableSet` shape, same `_from_iterable` plain-set fallback, same single-driver-call bulk-op contract.
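The tuple-to-set migration for callers can be sketched as follows (the `Device` class below is a stand-in defined for illustration, not the real cuda.core `Device`):

```python
class Device:
    """Stand-in with value semantics on device_id, for illustration only."""

    def __init__(self, device_id):
        self.device_id = device_id

    def __eq__(self, other):
        return isinstance(other, Device) and self.device_id == other.device_id

    def __hash__(self):
        return hash(self.device_id)

# New return shape: a set-like view of Device objects.
peer_accessible_by = {Device(0)}

# old: mr.peer_accessible_by == (0,)        # tuple of ints
assert peer_accessible_by == {Device(0)}    # new: set comparison

# Recovering int IDs where callers still need them:
ids = sorted(d.device_id for d in peer_accessible_by)
assert ids == [0]
```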