Skip to content

KISS access to intrinsics#212

Closed
Shnatsel wants to merge 13 commits into
linebender:mainfrom
Shnatsel:kiss-intrinsics
Closed

KISS access to intrinsics#212
Shnatsel wants to merge 13 commits into
linebender:mainfrom
Shnatsel:kiss-intrinsics

Conversation

@Shnatsel
Copy link
Copy Markdown
Contributor

@Shnatsel Shnatsel commented May 16, 2026

Alternative to #108 and #196

The idea is to give the user access to a context with the appropriate #[target_feature] annotations based on a runtime token, and let them make use of the existing safe intrinsics in std without us having to wrap every single intrinsic.

The updated examples/srgb.rs shows this API in action.

The main draw is simplicity: the diff is 10x smaller than the alternatives, and half of that is generated code. This way we don't need to maintain another complex, safety-critical code generator that emits unsafe, unlike the archmage crate or #196.

After this is merged, everything that is gated on the safe_wrappers feature can be deleted.

Future work: better documentation, this feature is poorly documented both before and after this PR

@Shnatsel Shnatsel requested a review from LaurenzV May 16, 2026 16:36
Copy link
Copy Markdown
Member

@DJMcNab DJMcNab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this can provide a practical path to dropping the safe_wrappers code, I'm all for it.

I still slightly prefer the design of #108, even for use with the current choice of Levels. That's due to it being more expression oriented. But that's by-the-by presently, as no-one is motivated to review #108

macro_rules! neon_kernel {
(
$(#[$meta:meta])*
$vis:vis fn $name:ident(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only holdup with this PR is that you aren't required to state the token argument "upfront".
That is, I'd like there to be a $token: ident: $token_ty: Ty here, where we validate that $token_ty is Neon (fwiw, we should still use $crate:: when defining the actual outer function, to make it easier to reason about).

I think that not having that makes it much harder to understand the functions defined by calls to this macro.

Copy link
Copy Markdown
Contributor Author

@Shnatsel Shnatsel May 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that instead of

#[cfg(target_arch = "aarch64")]
fearless_simd::neon_kernel! {
    fn copy_alpha_neon(a: float32x4_t, b: float32x4_t) -> float32x4_t {
        vcopyq_laneq_f32::<3, 3>(a, b)
    }
}

you would like to see

#[cfg(target_arch = "aarch64")]
fearless_simd::neon_kernel! {
    fn copy_alpha_neon(_token: fearless_simd::Neon, a: float32x4_t, b: float32x4_t) -> float32x4_t {
        vcopyq_laneq_f32::<3, 3>(a, b)
    }
}

Is that right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two reasons I don't want to expose the token to the inner function.

One is the boilerplate. With all the #[inline(always)] fn foo<S: Simd>(simd: S, ... boilerplate going around we already have kind of too much of it. I agree turning that into #[simd] fn foo (... is too magical, but the boilerplate really adds up, and between all the #[cfg]s and the macros I'll take a less magical boilerplate reduction if I can.

The other reason is more subtle. It would be a bad idea to actually use the token inside the function, because the caller certifies that the code is running with SSE 4.2 or higher, while all the methods on the token only have access to features from SSE 4.2 and not the actual CPU we're running on. If we want to allow using high-level methods inside SIMD kernels, we need to allow them to be generic over SIMD and do the whole #[inline(always)] fn foo<S: Simd>(simd: S, ... dance again, which I actually have implemented in a branch but didn't want to include in this PR due to the added complexity.

And circling back to the boilerplate topic, I don't want to that to turn into #[inline(always)] fn foo<S: Simd>(_token: fearless_simd::Sse4_2, simd: S, ... because my eyes are already glazing over and I didn't even get to the actual function arguments yet.

Copy link
Copy Markdown
Member

@DJMcNab DJMcNab May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. I hadn't considered that it would look like the token is accessible to the inner function. I agree that this makes it less obvious what the path forward was. I never wanted it to be actually available to the inner function. In that case, perhaps fn_name($token_ty: ty, would be a better syntax.
  2. my argument was entirely that if I read:
neon_function! {
fn some_function(arg1: A, arg2: B);
}

I expect it to be called as:
some_function(arg1, arg2).

However, your proposal instead will lead to someone attempting that to get a cryptic error message about the type mismatching and there being the wrong number of arguments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't considered that it would look like the token is accessible to the inner function. I agree that this makes it less obvious what the path forward was.

I've experimented some more and I think this is actually fine.

In the srgb example, the explicit SSE4.2 path gets inlined into an AVX2-enabled generic context if AVX2 is actually availble on the system, and we get AVX2 vblendps emitted rather than SSE4.2 blendps. So you still get instruction encoding and register space benefits from the higher level if it's available most of the time.

And the token is actually useful inside the function, e.g. for safe loads/stores from slices: token.load_array_f32x4(input).into() is a safe load into the native SSE4.2 vector type.

$($kernel_body:tt)*
}
) => {
#[cfg(target_arch = "aarch64")]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should comment that this is here for docs when doc_auto_cfg is enabled.

#[doc = "}"]
#[doc = "```"]
#[macro_export]
macro_rules! wasm_simd128_kernel {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd lean slightly towards just deleting this entirely, but not strongly enough to block this pr on it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on this, but I think it's nice to provide a uniform API surface.

I already carved out a separate doc string in the generator just for this macro, I'm happy to add an edit of your choosing to it. We can totally highlight other ways to do it there.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we should remove the entire rest of the docs and just say "this macro is the identity macro, and only exists to be uniform with the other macros".

Obviously in the scenario where we remove this macro, we'd mention and explain it not existing in the doc comment.
But it's also the case that anyone who would be using this macro is already aware that they need to compile two binaries.

Having thought about this more makes me even more in favour of removing it!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, NEON is purely compile-time-gated as well. Aarch64 includes NEON in the baseline instruction set, and LLVM doesn't support #[target_feature(disable = "foo")]. 32-bit ARM doesn't really have a way to detect target features at runtime (the std macro exists bu it's nightly-only and incomplete with no path to stabilization) and in practice anything targeting 32-bit ARM just enables NEON at compile time anyway.

Does that mean we should remove the NEON macro too and teach users two entirely different patterns for different architectures? Or even for the same architecture, if we add another ARM level later?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This being the case about neon is entirely news to me...

I feel like that fact should have had some wider implications to this library, but I've not thought about it fully.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've double-checked and I was wrong about this. While NEON is an always-enabled feature on Aarch64, you still need to wrap intrinsic calls into #[target_feature(enable = "neon")], so this macro is still necessary.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about not even auto-generating these, and just having them written manually?

OTOH, I guess I'd probably still want a test which double checks the impls were sound, so it's probably fine as is.
I think that this is reasonable as is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't like the idea of unsafe macro code copy-pasted four times (and possibly more later on), and my experiments with having a single kernel! macro that accepts a Level were somewhat awkward to use.

I'm not thrilled about the current implementation but it's the least awkward option that I can see.

@Shnatsel
Copy link
Copy Markdown
Contributor Author

I've implemented the API shape you suggested in #214 and while I'm not a fan of the API, the simplicity of it is hard to pass up. I'm closing this in favor of #214.

@Shnatsel Shnatsel closed this May 18, 2026
Shnatsel added a commit to Shnatsel/fearless_simd that referenced this pull request May 19, 2026
An even simpler implementation of the concept from linebender#212, thanks to the
review feedback by @DJMcNab

Alternative to linebender#108 and
linebender#196

Part of linebender#166
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants