Here are some instructions to test the multi-GPU batched KMeans API with RAFT comms (to be used with Ray/Dask):

- RAFT comms (Ray/Dask) demo code
- Compilation command
- Launch command
```cpp
if (has_data) {
  auto& data_batches = *data_batches_opt;
  data_batches.reset();
  data_batches.prefetch_next_batch();
```
Prefetch needs to be evaluated. If we have to reduce the streaming batch size to fit two batches on the device and enable prefetch, it might actually be slower: the fewer the batches, the faster the iteration.
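To make the trade-off concrete, here is a toy cost model (plain C++, all names and numbers hypothetical, not the actual cuVS code): prefetching overlaps transfer and compute, but halving the batch size to fit two buffers doubles the batch count, so any fixed per-batch overhead is paid twice as often.

```cpp
#include <algorithm>

// Without prefetch: each of n batches pays transfer + compute serially,
// plus a fixed per-batch overhead (kernel launches, bookkeeping).
double time_no_prefetch(int n_batches, double transfer_ms, double compute_ms,
                        double overhead_ms) {
  return n_batches * (transfer_ms + compute_ms + overhead_ms);
}

// With prefetch: batch size is halved to fit two buffers, so there are
// 2*n batches; transfer and compute overlap, but each batch still pays
// the same fixed overhead.
double time_with_prefetch(int n_batches, double transfer_ms, double compute_ms,
                          double overhead_ms) {
  int n = 2 * n_batches;
  double per_batch = std::max(transfer_ms / 2.0, compute_ms / 2.0) + overhead_ms;
  return n * per_batch;
}
```

With negligible per-batch overhead the overlap wins; once overhead dominates, the doubled batch count makes prefetch slower, which is exactly the concern above.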
```cpp
auto& data_batches = *data_batches_opt;
data_batches.reset();
data_batches.prefetch_next_batch();
for (const auto& data_batch : data_batches) {
```
I have separated out all the per-batch computation into a new function, process_batch. We should be able to reuse it entirely here.
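The shape of that refactor, as a minimal self-contained sketch (the state fields and helper names are stand-ins, not the real cuVS ones): all work for one batch lives in a single function, so the fit loop and any other call site share one code path.

```cpp
#include <vector>
#include <numeric>

// Stand-in for the accumulators updated per batch (centroid partial
// sums and per-cluster counts in the real code).
struct BatchState {
  double sum   = 0.0;
  long   count = 0;
};

// All per-batch computation in one place, reusable from every caller.
void process_batch(const std::vector<double>& batch, BatchState& state) {
  state.sum   += std::accumulate(batch.begin(), batch.end(), 0.0);
  state.count += static_cast<long>(batch.size());
}

// The fit loop then reduces to iterating batches and calling the helper.
BatchState run_all_batches(const std::vector<std::vector<double>>& batches) {
  BatchState state;
  for (const auto& batch : batches) process_batch(batch, state);
  return state;
}
```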
```cpp
auto weight_per_cluster_const =
  raft::make_device_vector_view<const T, IdxT>(weight_per_cluster.data_handle(), n_clusters);

cuvs::cluster::kmeans::detail::finalize_centroids<T, IdxT>(dev_res,
```
Does this mean that every rank is performing the finalize_centroids call and computing the exact same centroids?
Yes, exactly. We do the same in the non-batched MG version of KMeans. The new centroids can only be computed after every worker has completed processing its batches and shared the local results via the allreduce operations. After the allreduce, every rank holds identical centroid_sums and weight_per_cluster, so each rank independently computes the same new centroids. Computing them on a single GPU and broadcasting the (possibly large) centroid array would not be any faster. The redundant local division is cheap compared to an extra communication round.
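The scheme can be modeled in a few lines of plain C++ (no NCCL; the allreduce is simulated across in-process "ranks", and the function names are illustrative): after the sum-allreduce every rank holds identical inputs, so the local divisions yield identical centroids on every rank.

```cpp
#include <vector>

// Simulated sum-allreduce: every rank contributes a local vector and
// receives the element-wise total, exactly as ncclAllReduce with
// ncclSum would behave.
std::vector<double> allreduce_sum(const std::vector<std::vector<double>>& per_rank) {
  std::vector<double> out(per_rank[0].size(), 0.0);
  for (const auto& local : per_rank)
    for (std::size_t i = 0; i < local.size(); ++i) out[i] += local[i];
  return out;
}

// Each rank runs this locally on identical post-allreduce inputs, so
// all ranks compute the same centroids; no broadcast is needed.
std::vector<double> finalize_centroids(const std::vector<double>& sums,
                                       const std::vector<double>& weights) {
  std::vector<double> c(sums.size());
  for (std::size_t i = 0; i < sums.size(); ++i)
    c[i] = weights[i] > 0.0 ? sums[i] / weights[i] : 0.0;
  return c;
}
```

The redundant division is O(k·dim) local work, versus an extra collective round-trip for a broadcast-based design.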
```cpp
raft::host_scalar_view<T> inertia,
raft::host_scalar_view<int64_t> n_iter)
{
  using IdxT = int64_t;
```
Nit: why don't we template on IdxT and simply instantiate only the main fit function with the correct types?
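The suggested pattern, sketched with placeholder names (not the actual cuVS signatures): the internals take IdxT as a template parameter, and only the public entry point pins it to a concrete type, instead of hard-coding `using IdxT = int64_t;` inside the body.

```cpp
#include <cstdint>

// Internal helper templated on both the value and index types.
template <typename T, typename IdxT>
IdxT count_assigned(const T* /*data*/, IdxT n_rows) {
  // ... real per-row work would go here; this stub just returns n_rows.
  return n_rows;
}

// Public entry point: the only place the index type is fixed, so
// switching to int32_t later is a one-line change.
std::int64_t fit_entry(const float* data, std::int64_t n_rows) {
  return count_assigned<float, std::int64_t>(data, n_rows);
}
```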
```cpp
/*
 * SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION.
 * SPDX-License-Identifier: Apache-2.0
 */
```
Can we get rid of this file entirely and combine with regular mg kmeans (just as we are doing in PR #2015)? Is that possible?
Also, MNMG should be able to reuse the snmg_fit function (for a single worker) as is, right? Except that the NCCL reduce macro will be replaced by something like comms.allreduce().
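One way to get that reuse, as a hedged sketch (snmg_fit and the callable are hypothetical stand-ins): parameterize the fit body on the reduction, so SNMG injects the NCCL-macro path and MNMG injects a comms.allreduce wrapper without touching the loop itself.

```cpp
#include <vector>
#include <functional>

// The reduction is injected as a callable that reduces a local buffer
// in place: SNMG would wrap the NCCL reduce macro, MNMG would wrap
// comms.allreduce().
using Reduce = std::function<void(std::vector<double>&)>;

// Simplified stand-in for snmg_fit: reduce the local partial sums,
// then finish the (identical) local computation.
double snmg_fit(std::vector<double> local_sums, const Reduce& allreduce) {
  allreduce(local_sums);
  double total = 0.0;
  for (double v : local_sums) total += v;
  return total;
}
```

With the single-rank (identity) reduction the function behaves exactly as before, which is why a lone worker can call it unchanged.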
Closes #1989.
Adds multi-GPU support to KMeans fit for host-resident data, with two modes:

- single-node multi-GPU, via raft::device_resources_snmg
- multi-node multi-GPU, via RAFT comms

Both modes share the same core Lloyd's loop, batched streaming of host data, NCCL/comms allreduce of centroid sums and counts, and synchronized convergence. Supports sample weights, n_init best-of-N restarts, KMeansPlusPlus initialization, and float/double. Falls back to single-GPU when neither multi-GPU resources nor comms are present.
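For orientation, the shared loop can be sketched as a minimal single-process Lloyd's iteration over batched host data (plain C++, 1-D points, hypothetical names; in the real multi-GPU code an allreduce of the sums and counts sits between accumulation and the centroid update):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Minimal Lloyd's loop over batched data: per iteration, stream the
// batches, accumulate per-cluster sums and counts, then recompute
// centroids. With one "rank" the allreduce step is a no-op.
std::vector<double> lloyd_fit(const std::vector<std::vector<double>>& batches,
                              std::vector<double> centroids, int max_iter) {
  const std::size_t k = centroids.size();
  for (int it = 0; it < max_iter; ++it) {
    std::vector<double> sums(k, 0.0);
    std::vector<double> counts(k, 0.0);
    for (const auto& batch : batches) {          // batched streaming
      for (double x : batch) {
        std::size_t best = 0;                    // nearest centroid
        for (std::size_t c = 1; c < k; ++c)
          if (std::fabs(x - centroids[c]) < std::fabs(x - centroids[best]))
            best = c;
        sums[best]   += x;
        counts[best] += 1.0;
      }
    }
    // In the multi-GPU modes: allreduce(sums) and allreduce(counts) here.
    for (std::size_t c = 0; c < k; ++c)
      if (counts[c] > 0.0) centroids[c] = sums[c] / counts[c];
  }
  return centroids;
}
```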