`cuda::std::simd` F32x2 cleanup/refactoring by fbusato · Pull Request #8951 · NVIDIA/cccl

fbusato · 2026-05-12T22:27:27Z

Description

This PR only focuses on cleanup/refactoring. No functional changes.

- Split the F32x2 file in two: intrinsics and array level. This will help keep the SIMD optimizations organized in upcoming PRs.
- Do the same for the codegen test.
- Add a couple of missing headers.
- Fully qualify one function.
- Add NV_IF_TARGET for the asm code, plus _CCCL_VERIFY.

github-actions · 2026-05-13T03:12:35Z

🥳 CI Workflow Results

🟩 Finished in 4h 43m: Pass: 100%/113 | Total: 21h 46m | Max: 44m 54s | Hits: 94%/358077

See results here.

davebayer · 2026-05-13T04:43:06Z

+  NV_IF_TARGET(NV_IS_EXACTLY_SM_100,
+               (__result = ::__fadd2_rn(__lhs, __rhs);),
+               (_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))


Shouldn't this be:

Suggested change

-  NV_IF_TARGET(NV_IS_EXACTLY_SM_100,
-               (__result = ::__fadd2_rn(__lhs, __rhs);),
-               (_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))
+  NV_IF_TARGET(NV_PROVIDES_SM_100,
+               (__result = ::__fadd2_rn(__lhs, __rhs);),
+               (_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))

fbusato · 2026-05-13

No, FP32x2 is not supported on SM120.

davebayer · 2026-05-14

It is. These are sm100 intrinsics, so they are usable with any compute capability >= 100. See https://godbolt.org/z/7exGf9n37

fbusato · 2026-05-14

SM100 is the only GPU architecture that supports it in hardware. It is emulated on the others.

davebayer · 2026-05-14

But do we care that it's emulated? What if a newer architecture also supports fp32x2 operations natively? If it doesn't hurt performance, I would rather have sm100+ always use the intrinsic.
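
The disagreement above comes down to what the two target selectors match. As a rough host-side model (not the real `<nv/target>` machinery, which is resolved at compile time), `NV_IS_EXACTLY_SM_100` behaves like an equality test on the target architecture and `NV_PROVIDES_SM_100` like a greater-or-equal test. The function names `is_exactly_sm` and `provides_sm` below are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Hypothetical host-side model of the two selector kinds discussed above.
// The real selectors are macros evaluated by <nv/target>, not functions.

// Models NV_IS_EXACTLY_SM_<required>: matches only that one architecture.
constexpr bool is_exactly_sm(int target_sm, int required_sm) {
  return target_sm == required_sm;
}

// Models NV_PROVIDES_SM_<required>: matches that architecture and newer ones.
constexpr bool provides_sm(int target_sm, int required_sm) {
  return target_sm >= required_sm;
}
```

Under this model, a build for SM120 takes the `_CCCL_VERIFY` branch with the "exactly" selector and the intrinsic branch with the "provides" selector, which is the behavior difference being debated.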

davebayer · 2026-05-13T04:44:41Z

+    (asm("{"
+         ".reg .b64 __lhs, __rhs, __result;"
+         "mov.b64 __lhs, {%2, %3};"
+         "mov.b64 __rhs, {%4, %5};"
+         "add.f32x2 __result, __lhs, __rhs;"
+         "mov.b64 {%0, %1}, __result;"
+         "}" //


Can't we just do

Suggested change

-    (asm("{"
-         ".reg .b64 __lhs, __rhs, __result;"
-         "mov.b64 __lhs, {%2, %3};"
-         "mov.b64 __rhs, {%4, %5};"
-         "add.f32x2 __result, __lhs, __rhs;"
-         "mov.b64 {%0, %1}, __result;"
-         "}" //
+    (asm("add.f32x2 {%0, %1}, {%2, %3}, {%4, %5};"

fbusato · 2026-05-13

Without NV_IF_TARGET, the code fails to compile for SM != 100. See https://godbolt.org/z/TqWEszv3a

davebayer · 2026-05-14

I meant keeping the NV_IF_TARGET but making the asm a one-liner instead of 6 lines :D

fbusato · 2026-05-14

I tried, but ptxas or clang complained about it. I will try again.

fbusato · 2026-05-14

It doesn't work: add.f32x2 requires .b64 operands, not pairs of .f32 registers.
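
To illustrate why the six-line form needs the `mov.b64` steps, here is a minimal host-side C++ sketch of the packing they perform: two 32-bit floats are combined into one 64-bit value, with the first lane in the low 32 bits, and the result is split back afterwards. The helpers `pack_f32x2`, `unpack_f32x2`, and `add_f32x2_emulated` are hypothetical names for this illustration and are not part of CCCL. The lane-wise addition is plain host arithmetic, not the hardware instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: pack two floats into one 64-bit value, first lane in
// the low 32 bits, mirroring what `mov.b64 d, {a, b}` does in PTX.
inline std::uint64_t pack_f32x2(float lo, float hi) {
  std::uint32_t lo_bits = 0, hi_bits = 0;
  std::memcpy(&lo_bits, &lo, sizeof lo_bits);
  std::memcpy(&hi_bits, &hi, sizeof hi_bits);
  return static_cast<std::uint64_t>(lo_bits) |
         (static_cast<std::uint64_t>(hi_bits) << 32);
}

// Hypothetical helper: split a packed value back into its two float lanes,
// mirroring `mov.b64 {a, b}, d`.
inline void unpack_f32x2(std::uint64_t packed, float& lo, float& hi) {
  const std::uint32_t lo_bits = static_cast<std::uint32_t>(packed);
  const std::uint32_t hi_bits = static_cast<std::uint32_t>(packed >> 32);
  std::memcpy(&lo, &lo_bits, sizeof lo);
  std::memcpy(&hi, &hi_bits, sizeof hi);
}

// Host emulation of a lane-wise packed add: unpack, add each lane, repack.
inline std::uint64_t add_f32x2_emulated(std::uint64_t lhs, std::uint64_t rhs) {
  float a0, a1, b0, b1;
  unpack_f32x2(lhs, a0, a1);
  unpack_f32x2(rhs, b0, b1);
  return pack_f32x2(a0 + b0, a1 + b1);
}
```

Packing `(1.0f, 2.0f)` and `(3.0f, 4.0f)`, adding them with `add_f32x2_emulated`, and unpacking yields the lanes `4.0f` and `6.0f`. The packed operand is the single `.b64` register that the instruction expects in place of a brace-enclosed pair.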

refactoring

f785b74

fbusato self-assigned this May 12, 2026

fbusato requested a review from a team as a code owner May 12, 2026 22:27

fbusato added this to CCCL May 12, 2026

fbusato requested a review from a team as a code owner May 12, 2026 22:27

fbusato requested a review from miscco May 12, 2026 22:27

fbusato added the libcu++ For all items related to libcu++ label May 12, 2026

github-project-automation Bot moved this to Todo in CCCL May 12, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 12, 2026

davebayer reviewed May 13, 2026


fbusato requested a review from davebayer May 13, 2026 21:46

disable FP32x2 operations with tile

a785e0c

fbusato wants to merge 2 commits into NVIDIA:main from fbusato:simd-cleanup-f32x2
