Skip to content

cuda::std::simd F32x2 cleanup/refactoring#8951

Open
fbusato wants to merge 2 commits into
NVIDIA:mainfrom
fbusato:simd-cleanup-f32x2
Open

cuda::std::simd F32x2 cleanup/refactoring#8951
fbusato wants to merge 2 commits into
NVIDIA:mainfrom
fbusato:simd-cleanup-f32x2

Conversation

@fbusato
Copy link
Copy Markdown
Contributor

@fbusato fbusato commented May 12, 2026

Description

This PR only focuses on cleanup/refactoring. No functional changes.

  • Split F32x2 file in two: intrinsics, array level. This will help to keep the simd optimizations organized with next PRs.
  • Same for codegen test.
  • Add a couple of missing headers
  • Full qualifications for one function
  • Add NV_IF_TARGET for asm code + _CCCL_VERIFY

@fbusato fbusato self-assigned this May 12, 2026
@fbusato fbusato requested a review from a team as a code owner May 12, 2026 22:27
@fbusato fbusato added this to CCCL May 12, 2026
@fbusato fbusato requested a review from a team as a code owner May 12, 2026 22:27
@fbusato fbusato requested a review from miscco May 12, 2026 22:27
@fbusato fbusato added the libcu++ For all items related to libcu++ label May 12, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 12, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 12, 2026
@github-actions
Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 4h 43m: Pass: 100%/113 | Total: 21h 46m | Max: 44m 54s | Hits: 94%/358077

See results here.

Comment on lines +40 to +42
NV_IF_TARGET(NV_IS_EXACTLY_SM_100,
(__result = ::__fadd2_rn(__lhs, __rhs);),
(_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be:

Suggested change
NV_IF_TARGET(NV_IS_EXACTLY_SM_100,
(__result = ::__fadd2_rn(__lhs, __rhs);),
(_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))
NV_IF_TARGET(NV_PROVIDES_SM_100,
(__result = ::__fadd2_rn(__lhs, __rhs);),
(_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, FP32x2 is not supported in SM120

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, it's sm100 intrinsics, thus usable with any cc >= 100. See https://godbolt.org/z/7exGf9n37

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only GPU architecture where it is supported by HW is SM100. It is emulated on the other ones

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we care that it's emulated? I mean what if a newer architecture supports fp32x2 operations natively as well? If it doesn't hurt performance, I would rather have sm100+ always using the intrinsic

Comment on lines +46 to +52
(asm("{"
".reg .b64 __lhs, __rhs, __result;"
"mov.b64 __lhs, {%2, %3};"
"mov.b64 __rhs, {%4, %5};"
"add.f32x2 __result, __lhs, __rhs;"
"mov.b64 {%0, %1}, __result;"
"}" //
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we do just

Suggested change
(asm("{"
".reg .b64 __lhs, __rhs, __result;"
"mov.b64 __lhs, {%2, %3};"
"mov.b64 __rhs, {%4, %5};"
"add.f32x2 __result, __lhs, __rhs;"
"mov.b64 {%0, %1}, __result;"
"}" //
(asm("add.f32x2 {%0, %1}, {%2, %3}, {%4, %5};"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without NV_IF_TARGET the code fails to compile with SM != 100, see https://godbolt.org/z/TqWEszv3a

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant keeping the NV_IF_TARGET but making it a one liner instead of 6 :D

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried but ptxas or clang complained about it. I will try again

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't work. add.f32x2 requires .b64 not two f inputs

@fbusato fbusato requested a review from davebayer May 13, 2026 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libcu++ For all items related to libcu++

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants