Use the new tuning API internally for detail::select|three_way_partition::dispatch and DevicePartition #8925

Open
bernhardmgruber wants to merge 1 commit into NVIDIA:main from bernhardmgruber:use_tuning_api_partition

Conversation

@bernhardmgruber bernhardmgruber commented May 12, 2026

  • No SASS changes for cub.bench.partition.three_way.base on SM75;80;86;90;100
  • Use signed offset type for DevicePartition #8971 (required to avoid SASS changes)
  • No SASS changes for cub.bench.partition.if.base on SM75;80;86;90;100
  • No SASS changes for cub.bench.partition.flagged.base on SM75;80;86;90;100

Fixes: #8879
Fixes: #8380

@bernhardmgruber bernhardmgruber requested review from a team as code owners May 12, 2026 14:43
@bernhardmgruber bernhardmgruber requested a review from shwina May 12, 2026 14:43
@bernhardmgruber bernhardmgruber requested a review from pauleonix May 12, 2026 14:43
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 12, 2026
@bernhardmgruber bernhardmgruber requested a review from elstehle May 12, 2026 14:43
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 12, 2026
Comment thread cub/test/catch2_test_device_partition_env.cu Outdated
@bernhardmgruber bernhardmgruber force-pushed the use_tuning_api_partition branch from 2202fb7 to f978ca6 Compare May 13, 2026 16:26
coderabbitai Bot commented May 13, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4e509214-cb2e-4a86-96bf-61f80365fafd

📥 Commits

Reviewing files that changed from the base of the PR and between 5d6ec8e and 2e36a8b.

📒 Files selected for processing (6)
  • cub/benchmarks/bench/partition/flagged.cu
  • cub/benchmarks/bench/partition/if.cu
  • cub/benchmarks/bench/partition/three_way.cu
  • cub/cub/device/device_partition.cuh
  • cub/test/catch2_test_device_partition_env.cu
  • thrust/thrust/system/cuda/detail/partition.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • cub/benchmarks/bench/partition/flagged.cu
  • thrust/thrust/system/cuda/detail/partition.h
  • cub/benchmarks/bench/partition/if.cu
  • cub/test/catch2_test_device_partition_env.cu
  • cub/cub/device/device_partition.cuh

📝 Walkthrough

Summary by CodeRabbit

  • Tests

    • Added tuning verification tests to ensure partition operations (If, Flagged, three-way If) run correctly under tuned execution environments.
  • Refactor

    • Streamlined partition benchmark and execution wiring to use environment-based invocation and direct partition calls, enabling execution tuning and simplifying launch/temporary-storage flow. No public API changes.

Walkthrough

This PR rewires DevicePartition/select dispatch to the new tuning API, replaces manual temp-size/allocation with env-based dispatch and policy_selector functors in benchmarks, updates Thrust partition dispatch, and adds tuning tests validating the tuned execution paths.
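For readers less familiar with the flow being removed from the benchmarks: CUB-style algorithms are conventionally called twice, once with a null temporary-storage pointer to query the required size and once more with an allocation of that size. The host-only sketch below mimics that convention and then hides it behind a wrapper that owns the scratch memory. `mock_partition` and `partition_owning_scratch` are hypothetical stand-ins written for this illustration, not CUB functions, and the wrapper is only an analogy for letting an environment supply the allocation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for a CUB-style algorithm; NOT a real CUB API.
// Phase 1: temp_storage == nullptr -> only report the required scratch size.
// Phase 2: temp_storage != nullptr -> do the work using that scratch space.
// Returns 0 on success and non-zero on error.
template <typename Pred>
int mock_partition(void* temp_storage, std::size_t& temp_storage_bytes,
                   const int* in, int* out, int* num_selected, int n, Pred pred)
{
  if (temp_storage == nullptr)
  {
    temp_storage_bytes = static_cast<std::size_t>(n); // one flag byte per item
    return 0;
  }
  if (temp_storage_bytes < static_cast<std::size_t>(n))
  {
    return 1; // scratch buffer too small
  }
  char* flags = static_cast<char*>(temp_storage);
  for (int i = 0; i < n; ++i)
  {
    flags[i] = pred(in[i]) ? 1 : 0;
  }
  // Selected items fill the front in order; the rest fill the back.
  int front = 0;
  int back  = n - 1;
  for (int i = 0; i < n; ++i)
  {
    if (flags[i]) { out[front++] = in[i]; }
    else          { out[back--]  = in[i]; }
  }
  *num_selected = front;
  return 0;
}

// Hypothetical wrapper that performs both phases so the caller never sees them.
template <typename Pred>
int partition_owning_scratch(const std::vector<int>& in, std::vector<int>& out,
                             int& num_selected, Pred pred)
{
  const int n  = static_cast<int>(in.size());
  num_selected = 0;
  out.resize(in.size());

  std::size_t bytes = 0;
  if (int err = mock_partition(nullptr, bytes, in.data(), out.data(), &num_selected, n, pred))
  {
    return err;
  }
  // Allocate at least one byte so the pointer is never null in phase 2.
  std::vector<char> scratch(bytes > 0 ? bytes : 1);
  bytes = scratch.size();
  return mock_partition(scratch.data(), bytes, in.data(), out.data(), &num_selected, n, pred);
}
```

A caller of the wrapper passes only inputs, outputs, and the predicate, which is roughly the simplification the walkthrough describes for the benchmark code.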

Changes

Tuning API Integration for DevicePartition and Select Dispatch

| Layer / File(s) | Summary |
|---|---|
| DevicePartition environment-based overload refactoring (cub/cub/device/device_partition.cuh) | Removed private partition_impl; refactored env-based Flagged/If and three-way overloads to compute signed offsets inline, build default policy selectors, and call detail::select::dispatch / detail::three_way_partition::dispatch via dispatch_with_env_and_tuning. |
| Benchmark adoption of policy selector and tuned environment (cub/benchmarks/bench/partition/flagged.cu, cub/benchmarks/bench/partition/if.cu, cub/benchmarks/bench/partition/three_way.cu) | Introduce policy_selector functors; simplify type aliases; use raw device pointers; remove two-step temp-size/alloc/dispatch flow and instead use caching_allocator_t + cub_bench_env (with cuda::execution::tune(policy_selector<T>{}) when enabled) and direct cub::DevicePartition::* calls. |
| Thrust partition integration using CUB dispatch (thrust/thrust/system/cuda/detail/partition.h) | Added dispatch_partition helper performing query+execute via cub::detail::select::dispatch; unified partition helper to use single THRUST_INDEX_TYPE_DISPATCH that calls dispatch_partition. |
| Tuning test coverage for DevicePartition operations (cub/test/catch2_test_device_partition_env.cu) | Added tuning includes, defined capability-driven partition and three-way policy selectors, and added C2H_TEST cases validating tuning for DevicePartition::If, DevicePartition::Flagged, and three-way If with assertions on selected counts and tuned block sizes. |
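As a rough illustration of what a capability-driven policy selector looks like in shape, the sketch below maps a compute capability to a small policy struct and checks the selected block size at compile time. Everything in it is hypothetical: the type names, the integer encoding of the compute capability, and the numbers are assumptions made for this example and are not taken from the PR or from CUB's actual tunings.

```cpp
// Hypothetical policy description; field names are illustrative only.
struct partition_policy
{
  int block_threads;
  int items_per_thread;
};

// Hypothetical selector: maps a compute capability, encoded here as
// major * 10 + minor (e.g. 90 for SM90), to a policy.
// The values are arbitrary and NOT real CUB tunings.
struct example_policy_selector
{
  constexpr partition_policy operator()(int compute_capability) const
  {
    return compute_capability >= 90 ? partition_policy{256, 12}
         : compute_capability >= 80 ? partition_policy{128, 10}
                                    : partition_policy{128, 8};
  }
};

// Number of items one thread block would process under a given policy.
constexpr int tile_size(partition_policy p)
{
  return p.block_threads * p.items_per_thread;
}

// A test can assert on the selected block size, similar in spirit to the
// "tuned block sizes" assertions mentioned in the summary above.
static_assert(example_policy_selector{}(90).block_threads == 256,
              "SM90 should pick the 256-thread policy in this sketch");
```

Keeping the selector a stateless `constexpr` functor means the chosen policy can be inspected at compile time, which is what makes assertions on block sizes cheap to write.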

Assessment against linked issues

Objective Addressed Explanation
Use new tuning API for detail::select::dispatch + DevicePartition [#8879]
Use new tuning API for detail::three_way_partition::dispatch [#8380]

Possibly related PRs

  • NVIDIA/cccl#8971: Overlaps on DevicePartition env-overload signed-offset handling and related selector construction.

Suggested reviewers

  • elstehle

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2e262785-9664-4f8f-b738-a61ffdd14e4e

📥 Commits

Reviewing files that changed from the base of the PR and between 3f440b9 and f978ca6.

📒 Files selected for processing (6)
  • cub/benchmarks/bench/partition/flagged.cu
  • cub/benchmarks/bench/partition/if.cu
  • cub/benchmarks/bench/partition/three_way.cu
  • cub/cub/device/device_partition.cuh
  • cub/test/catch2_test_device_partition_env.cu
  • thrust/thrust/system/cuda/detail/partition.h

Comment thread cub/benchmarks/bench/partition/three_way.cu Outdated
Comment thread cub/cub/device/device_partition.cuh Outdated
@github-actions

🥳 CI Workflow Results

🟩 Finished in 1h 52m: Pass: 100%/340 | Total: 7d 14h | Max: 1h 51m | Hits: 62%/627542

See results here.

@bernhardmgruber bernhardmgruber force-pushed the use_tuning_api_partition branch from 5d6ec8e to 2e36a8b Compare May 14, 2026 19:30
{
template <typename Derived, typename InputIt, typename StencilIt, typename OutputIt, typename Predicate, typename OffsetT>
struct DispatchPartitionIf
cudaError_t THRUST_RUNTIME_FUNCTION dispatch_partition(

Suggested change
- cudaError_t THRUST_RUNTIME_FUNCTION dispatch_partition(
+ [[nodiscard]] cudaError_t THRUST_RUNTIME_FUNCTION dispatch_partition(
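To spell out the motivation for the suggestion: marking an error-returning function `[[nodiscard]]` (C++17) encourages compilers to warn when a caller drops the result, so a failed dispatch is less likely to go unnoticed. The sketch below uses a hypothetical `status` enum and `dispatch_partition_mock` function in place of `cudaError_t` and the real helper, so it builds without CUDA.

```cpp
// Hypothetical stand-in for cudaError_t so the sketch builds without CUDA.
enum class status
{
  success,
  failure
};

// Hypothetical stand-in for the helper under review.
[[nodiscard]] status dispatch_partition_mock(bool ok)
{
  return ok ? status::success : status::failure;
}

status run(bool ok)
{
  // dispatch_partition_mock(ok); // discarding the result here would typically draw a warning

  // Intended usage: capture the result and propagate any error.
  const status s = dispatch_partition_mock(ok);
  if (s != status::success)
  {
    return s;
  }
  return status::success;
}
```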

Labels

None yet

Projects

Status: In Review

3 participants