cuda::std::simd F32x2 cleanup/refactoring #8951
Conversation
🥳 CI Workflow Results
🟩 Finished in 4h 43m: Pass: 100%/113 | Total: 21h 46m | Max: 44m 54s | Hits: 94%/358077
NV_IF_TARGET(NV_IS_EXACTLY_SM_100,
             (__result = ::__fadd2_rn(__lhs, __rhs);),
             (_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))
Shouldn't this be:
- NV_IF_TARGET(NV_IS_EXACTLY_SM_100,
-              (__result = ::__fadd2_rn(__lhs, __rhs);),
-              (_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))
+ NV_IF_TARGET(NV_PROVIDES_SM_100,
+              (__result = ::__fadd2_rn(__lhs, __rhs);),
+              (_CCCL_VERIFY(false, "cuda::std::simd::__add_f32x2: Unsupported architecture");))
No, FP32x2 is not supported on SM120.
It is. These are SM100 intrinsics, so they are usable with any compute capability >= 100. See https://godbolt.org/z/7exGf9n37
The only GPU architecture that supports it in hardware is SM100. It is emulated on the others.
But do we care that it's emulated? What if a newer architecture supports fp32x2 operations natively as well? If it doesn't hurt performance, I would rather have SM100+ always use the intrinsic.
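The disagreement in this thread comes down to which targets each selector accepts. The sketch below is a host-side model only, assuming that `NV_IS_EXACTLY_SM_100` matches a single target and `NV_PROVIDES_SM_100` matches that target and anything newer. The functions `is_exactly_sm`, `provides_sm`, and `check_selectors` are hypothetical names for illustration and are not part of `<nv/target>` or of this PR.

```cpp
#include <cassert>

// Hypothetical host-side model; these are NOT the <nv/target> macros.
// `arch` is the compute capability times 10 (100 for SM100, 120 for SM120).
constexpr bool is_exactly_sm(int arch, int wanted) { return arch == wanted; }
constexpr bool provides_sm(int arch, int wanted) { return arch >= wanted; }

inline void check_selectors() {
  // Both selectors pick the intrinsic branch on SM100 ...
  assert(is_exactly_sm(100, 100) && provides_sm(100, 100));
  // ... and both pick the fallback branch on an older target such as SM90.
  assert(!is_exactly_sm(90, 100) && !provides_sm(90, 100));
  // They differ only on newer targets: on SM120 the "exactly" form falls
  // back, while the "provides" form still selects the intrinsic.
  assert(!is_exactly_sm(120, 100) && provides_sm(120, 100));
}
```

Under this model, the choice only matters for targets newer than SM100, which is exactly where the reviewers disagree about whether an emulated intrinsic is acceptable.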
(asm("{"
     ".reg .b64 __lhs, __rhs, __result;"
     "mov.b64 __lhs, {%2, %3};"
     "mov.b64 __rhs, {%4, %5};"
     "add.f32x2 __result, __lhs, __rhs;"
     "mov.b64 {%0, %1}, __result;"
     "}" //
Can't we just do:
- (asm("{"
-      ".reg .b64 __lhs, __rhs, __result;"
-      "mov.b64 __lhs, {%2, %3};"
-      "mov.b64 __rhs, {%4, %5};"
-      "add.f32x2 __result, __lhs, __rhs;"
-      "mov.b64 {%0, %1}, __result;"
-      "}" //
+ (asm("add.f32x2 {%0, %1}, {%2, %3}, {%4, %5};"
Without NV_IF_TARGET, the code fails to compile for SM != 100; see https://godbolt.org/z/TqWEszv3a
I meant keeping the NV_IF_TARGET but making the asm a one-liner instead of six lines :D
I tried, but ptxas or clang complained about it. I will try again.
It doesn't work: add.f32x2 requires .b64 operands, not two separate f32 inputs.
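The packing that the `mov.b64` lines perform can be modeled on the host. This is a minimal sketch, assuming that `mov.b64 d, {a, b};` places `a` in the low 32 bits and `b` in the high 32 bits of `d`. The helpers `pack_f32x2`, `unpack_f32x2`, `add_f32x2_emulated`, and `check_add_f32x2` are hypothetical names for illustration and do not appear in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

// Models `mov.b64 packed, {lo, hi};`: lo in bits 0..31, hi in bits 32..63.
inline std::uint64_t pack_f32x2(float lo, float hi) {
  std::uint32_t lo_bits = 0, hi_bits = 0;
  std::memcpy(&lo_bits, &lo, sizeof(lo_bits));
  std::memcpy(&hi_bits, &hi, sizeof(hi_bits));
  return static_cast<std::uint64_t>(lo_bits) |
         (static_cast<std::uint64_t>(hi_bits) << 32);
}

// Models `mov.b64 {lo, hi}, packed;`.
inline void unpack_f32x2(std::uint64_t packed, float& lo, float& hi) {
  const std::uint32_t lo_bits = static_cast<std::uint32_t>(packed);
  const std::uint32_t hi_bits = static_cast<std::uint32_t>(packed >> 32);
  std::memcpy(&lo, &lo_bits, sizeof(lo));
  std::memcpy(&hi, &hi_bits, sizeof(hi));
}

// Host stand-in for `add.f32x2 result, lhs, rhs;`: the instruction takes
// packed 64-bit operands, so both lanes travel in one value.
inline std::uint64_t add_f32x2_emulated(std::uint64_t lhs, std::uint64_t rhs) {
  float l0, l1, r0, r1;
  unpack_f32x2(lhs, l0, l1);
  unpack_f32x2(rhs, r0, r1);
  return pack_f32x2(l0 + r0, l1 + r1);
}

inline void check_add_f32x2() {
  float lo = 0.0f, hi = 0.0f;
  unpack_f32x2(add_f32x2_emulated(pack_f32x2(1.0f, 2.0f),
                                  pack_f32x2(3.0f, 0.5f)), lo, hi);
  assert(lo == 4.0f && hi == 2.5f);
}
```

The three helper calls correspond to the pack, add, and unpack steps of the six-line asm block, which is why the one-line form with brace-enclosed float pairs cannot replace it.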
Description
This PR only focuses on cleanup/refactoring. No functional changes.
NV_IF_TARGET for asm code + _CCCL_VERIFY