cuda::std::simd Optimize Min/Max #8949
#define _CCCL_HAS_SIMD_F32X2() (_CCCL_HAS_SIMD_F32X2_INTRINSICS() || _CCCL_HAS_SIMD_F32X2_PTX())

#define _CCCL_HAS_SIMD_8BIT_INTRINSICS() 0 // TODO(fbusato): CTK 13.2 produces non-optimal code for 8-bit SIMD instrs.
Can you please check whether newer compilers generate better code, and create an nvbug for the compiler team?
Now that you point it out, I found that the situation is even more complex:
- CTK < 12.8: no optimization
- CTK >= 12.8: the 16-bit x2 case is partially optimized. We see two VIMNMX.U16x2 instructions + PRMT
- ToT CTK/nvcc: no optimization for 8-bit x4
- Manual optimization: works as expected
Added the bug number in the code.
_CCCL_TEMPLATE(typename _Tp, typename _Abi, typename _Vec = basic_vec<_Tp, _Abi>)
_CCCL_REQUIRES(totally_ordered<_Tp>)
[[nodiscard]]
_CCCL_API constexpr _Vec min(const basic_vec<_Tp, _Abi>& __lhs, const basic_vec<_Tp, _Abi>& __rhs) noexcept
Important: This will break in tile mode. I believe we need to either mark all SIMD optimizations as _CCCL_HOST_DEVICE or disable them with !_CCCL_TILE_COMPILATION().
I have to update several PRs for Tile compatibility...
🥳 CI Workflow Results
🟩 Finished in 1h 48m: Pass: 100%/113 | Total: 1d 17h | Max: 54m 01s | Hits: 99%/327429
Description
This PR introduces the following optimizations for SIMD min/max over two vectors:
- VIMNMX for packed signed/unsigned 16-bit data: SM90+
- VIMNMX for packed signed/unsigned 8-bit data: SM120f
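
For readers unfamiliar with packed min/max, the following is a minimal host-side sketch of the lane-wise semantics that a packed 16-bit x2 minimum has to preserve. It is not the code from this PR: the helper names (`lane16`, `pack16x2`, `min_u16x2`, `min_s16x2`) are hypothetical, and the sketch assumes two 16-bit lanes stored in one 32-bit word with lane 0 in the low half. It only illustrates why signed and unsigned variants are distinct operations.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: extract 16-bit lane `i` (0 = low half, 1 = high half).
constexpr std::uint16_t lane16(std::uint32_t v, int i) noexcept
{
  return static_cast<std::uint16_t>(v >> (16 * i));
}

// Hypothetical helper: pack two 16-bit lanes back into one 32-bit word.
constexpr std::uint32_t pack16x2(std::uint16_t lo, std::uint16_t hi) noexcept
{
  return (static_cast<std::uint32_t>(hi) << 16) | lo;
}

// Signed comparison of two lanes; returns the bit pattern of the smaller one.
// Assumes two's-complement conversion to int16_t (guaranteed since C++20).
constexpr std::uint16_t smin16(std::uint16_t x, std::uint16_t y) noexcept
{
  return static_cast<std::int16_t>(x) < static_cast<std::int16_t>(y) ? x : y;
}

// Lane-wise unsigned minimum of two packed 16-bit x2 words.
constexpr std::uint32_t min_u16x2(std::uint32_t a, std::uint32_t b) noexcept
{
  return pack16x2(std::min(lane16(a, 0), lane16(b, 0)),
                  std::min(lane16(a, 1), lane16(b, 1)));
}

// Lane-wise signed minimum of two packed 16-bit x2 words.
constexpr std::uint32_t min_s16x2(std::uint32_t a, std::uint32_t b) noexcept
{
  return pack16x2(smin16(lane16(a, 0), lane16(b, 0)),
                  smin16(lane16(a, 1), lane16(b, 1)));
}
```

With inputs `0x0001FFFF` and `0xFFFF0002`, the unsigned variant picks `0x0002` and `0x0001` per lane, while the signed variant treats `0xFFFF` as -1 and picks it in both lanes. The 8-bit x4 case follows the same pattern with four lanes per word.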