cuda::std::simd Optimize Min/Max#8949

Open
fbusato wants to merge 6 commits into NVIDIA:main from fbusato:simd-optimize-min-max
Conversation

@fbusato
Contributor

@fbusato fbusato commented May 12, 2026

Description

This PR introduces the following optimizations for SIMD min/max over two vectors:

  • VIMNMX for packed signed/unsigned 16-bit data: SM90+
  • VIMNMX for packed signed/unsigned 8-bit data: SM120f
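As a point of reference, the lane-wise semantics of a packed 16-bit min can be sketched in portable host C++. This is only an illustration of what a packed min over two 16-bit lanes computes, not the PR's implementation; the names `packed_min_u16x2` and `packed_min_s16x2` are hypothetical, and the lane order (lane 0 in the low half-word) is an assumption made for the example.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical reference helpers (not part of libcu++): each 32-bit word
// holds two independent 16-bit lanes, lane 0 in the low half-word.

// Unsigned 2x16-bit min, computed lane by lane.
constexpr std::uint32_t packed_min_u16x2(std::uint32_t a, std::uint32_t b)
{
  const auto a_lo = static_cast<std::uint16_t>(a & 0xFFFFu);
  const auto a_hi = static_cast<std::uint16_t>(a >> 16);
  const auto b_lo = static_cast<std::uint16_t>(b & 0xFFFFu);
  const auto b_hi = static_cast<std::uint16_t>(b >> 16);
  const std::uint32_t lo = std::min(a_lo, b_lo);
  const std::uint32_t hi = std::min(a_hi, b_hi);
  return (hi << 16) | lo;
}

// Signed 2x16-bit min: same layout, but each lane is reinterpreted as a
// two's-complement int16_t before comparing.
constexpr std::uint32_t packed_min_s16x2(std::uint32_t a, std::uint32_t b)
{
  const auto a_lo = static_cast<std::int16_t>(static_cast<std::uint16_t>(a & 0xFFFFu));
  const auto a_hi = static_cast<std::int16_t>(static_cast<std::uint16_t>(a >> 16));
  const auto b_lo = static_cast<std::int16_t>(static_cast<std::uint16_t>(b & 0xFFFFu));
  const auto b_hi = static_cast<std::int16_t>(static_cast<std::uint16_t>(b >> 16));
  const std::uint32_t lo = static_cast<std::uint16_t>(std::min(a_lo, b_lo));
  const std::uint32_t hi = static_cast<std::uint16_t>(std::min(a_hi, b_hi));
  return (hi << 16) | lo;
}
```

The signed and unsigned variants differ only in how a lane such as `0xFFFF` is ordered (65535 versus -1), which is why the two cases need distinct instructions.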

@fbusato fbusato self-assigned this May 12, 2026
@fbusato fbusato added this to CCCL May 12, 2026
@fbusato fbusato added the libcu++ For all items related to libcu++ label May 12, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 12, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 12, 2026
@fbusato fbusato moved this from In Progress to In Review in CCCL May 12, 2026
@fbusato fbusato marked this pull request as ready for review May 12, 2026 22:53
@fbusato fbusato requested review from a team as code owners May 12, 2026 22:53
@fbusato fbusato requested a review from bernhardmgruber May 12, 2026 22:53
#define _CCCL_HAS_SIMD_F32X2() (_CCCL_HAS_SIMD_F32X2_INTRINSICS() || _CCCL_HAS_SIMD_F32X2_PTX())
#define _CCCL_HAS_SIMD_8BIT_INTRINSICS() 0 // TODO(fbusato): CTK 13.2 produces non-optimal code for 8-bit SIMD instrs.
Contributor

Can you please check whether a newer compiler generates better code, and file an nvbug for the compiler team?

Contributor Author

Now that you point it out, I realize the situation is even more complex:

  • CTK < 12.8: no optimization
  • CTK >= 12.8: the 16-bit x2 case is only partially optimized. We see two VIMNMX.U16x2 instructions plus a PRMT
  • ToT CTK/nvcc: no optimization for the 8-bit x4 case
  • Manual optimization: works as expected

See https://godbolt.org/z/5j5c3sv3Y
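For readers without access to the link, the 8-bit x4 case can be pictured as the lane loop below. This is a host-side sketch written for illustration and is not the code from the Compiler Explorer link; `packed_min_s8x4` is a hypothetical name, and lane 0 is assumed to be the lowest byte.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical reference helper: signed min over four 8-bit lanes packed
// into one 32-bit word, computed one lane at a time.
inline std::uint32_t packed_min_s8x4(std::uint32_t a, std::uint32_t b)
{
  std::uint32_t result = 0;
  for (int lane = 0; lane < 4; ++lane)
  {
    const int shift = 8 * lane;
    // Extract the lane and reinterpret it as a two's-complement int8_t.
    const auto x = static_cast<std::int8_t>(static_cast<std::uint8_t>((a >> shift) & 0xFFu));
    const auto y = static_cast<std::int8_t>(static_cast<std::uint8_t>((b >> shift) & 0xFFu));
    // Convert back to an unsigned byte before re-packing to avoid sign extension.
    const auto m = static_cast<std::uint8_t>(std::min(x, y));
    result |= static_cast<std::uint32_t>(m) << shift;
  }
  return result;
}
```

A loop of this shape needs four extract, compare, and insert sequences unless the compiler recognizes the pattern, which is the gap the manual optimization is meant to close.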

Contributor Author

Added the bug number in the code.

_CCCL_TEMPLATE(typename _Tp, typename _Abi, typename _Vec = basic_vec<_Tp, _Abi>)
_CCCL_REQUIRES(totally_ordered<_Tp>)
[[nodiscard]]
_CCCL_API constexpr _Vec min(const basic_vec<_Tp, _Abi>& __lhs, const basic_vec<_Tp, _Abi>& __rhs) noexcept
Contributor

Important: This will break in tile mode. I believe we need to either mark all SIMD optimizations as _CCCL_HOST_DEVICE or disable them with !_CCCL_TILE_COMPILATION().
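The second option, disabling the optimized path under tile compilation, could look roughly like the sketch below. The macros `EXAMPLE_TILE_COMPILATION()` and `EXAMPLE_HAS_SIMD_MINMAX()` and all function names are hypothetical stand-ins defined only for this example; they are not the real CCCL macros, and the "fast" path here simply reuses the scalar code so that the sketch stays portable.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical configuration macros (assumptions for this example only).
#define EXAMPLE_TILE_COMPILATION() 0
#define EXAMPLE_HAS_SIMD_MINMAX() 1

// Scalar fallback: unsigned min over two 16-bit lanes of a 32-bit word.
inline std::uint32_t example_scalar_min_u16x2(std::uint32_t a, std::uint32_t b)
{
  const std::uint32_t lo = std::min(a & 0xFFFFu, b & 0xFFFFu);
  const std::uint32_t hi = std::min(a >> 16, b >> 16);
  return (hi << 16) | lo;
}

// Reports at compile time which path the guard selects.
constexpr bool example_uses_fast_path()
{
#if EXAMPLE_HAS_SIMD_MINMAX() && !EXAMPLE_TILE_COMPILATION()
  return true;
#else
  return false;
#endif
}

inline std::uint32_t example_min_u16x2(std::uint32_t a, std::uint32_t b)
{
#if EXAMPLE_HAS_SIMD_MINMAX() && !EXAMPLE_TILE_COMPILATION()
  // A real implementation would call the hardware-specific path here;
  // this sketch reuses the scalar code so it compiles anywhere.
  return example_scalar_min_u16x2(a, b);
#else
  return example_scalar_min_u16x2(a, b);
#endif
}
```

Keeping the condition in one place makes it straightforward to switch every optimized overload off together when tile compilation is active.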

Contributor Author

I have to update several PRs for tile compatibility...

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 48m: Pass: 100%/113 | Total: 1d 17h | Max: 54m 01s | Hits: 99%/327429

See results here.

Labels

libcu++ For all items related to libcu++

Projects

Status: In Review

2 participants