
Add a WebAssembly SIMD backend for reusable intrinsics kernels #5685

Merged
martin-frbg merged 3 commits into OpenMathLib:develop from teddygood:wasm-intrin-backend-exp
Mar 18, 2026

Conversation

@teddygood
Contributor

Follow-up to #5676 and #4023.

This PR adds a WebAssembly SIMD backend for the shared SIMD intrinsics layer. The goal is to let existing reusable kernels that already use the common intrinsics API execute through WebAssembly SIMD on ARCH_WASM instead of falling back to scalar code. As part of that, it adds kernel/simd/intrin_wasm.h, wires it up from kernel/simd/intrin.h for __wasm_simd128__, and fills in the missing reduction helpers needed for backend completeness.
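To illustrate what a backend of this kind involves, here is a minimal sketch of a compile-time dispatch on `__wasm_simd128__` plus a horizontal-sum reduction helper. It is not the contents of kernel/simd/intrin_wasm.h: the `sk_` names and `sketch_sum` are hypothetical and chosen only for this example. The `wasm_*` calls come from the `wasm_simd128.h` header shipped with Clang/Emscripten, and the code falls back to a plain scalar loop when WebAssembly SIMD is not enabled.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#define SK_SIMD 128            /* hypothetical "SIMD available" flag */
#define SK_NLANES_F32 4        /* four floats per 128-bit vector */
typedef v128_t sk_f32;

static inline sk_f32 sk_zero_f32(void)             { return wasm_f32x4_splat(0.0f); }
static inline sk_f32 sk_loadu_f32(const float *p)  { return wasm_v128_load(p); }
static inline sk_f32 sk_add_f32(sk_f32 a, sk_f32 b) { return wasm_f32x4_add(a, b); }

/* Reduction helper: add the four lanes of one vector into a scalar. */
static inline float sk_sum_f32(sk_f32 v)
{
    return wasm_f32x4_extract_lane(v, 0) + wasm_f32x4_extract_lane(v, 1)
         + wasm_f32x4_extract_lane(v, 2) + wasm_f32x4_extract_lane(v, 3);
}
#else
#define SK_SIMD 0              /* no WASM SIMD: only the scalar loop runs */
#endif

/* Hypothetical kernel written against the small API above. */
float sketch_sum(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float total = 0.0f;
    size_t i = 0;

#if SK_SIMD
    sk_f32 acc = sk_zero_f32();
    for (; i + SK_NLANES_F32 <= n; i += SK_NLANES_F32)
        acc = sk_add_f32(acc, sk_loadu_f32(x + i));
    total = sk_sum_f32(acc);
#endif

    /* Scalar tail, and the whole loop when SK_SIMD is 0. */
    for (; i < n; i++)
        total += x[i];
    return total;
}
```

The kernel body only refers to the `sk_` names, so the same source can be compiled for another target by supplying a different set of definitions behind the same guard.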

I also enabled the generic vector path for srot on ARCH_WASM. In local Pyodide/Emscripten testing, contiguous daxpy improved by about 1.23x–1.40x over the current baseline, and contiguous srot improved by about 1.41x–2.13x. Stride-2 cases were approximately flat. drot was evaluated too, but local testing did not show a clear or consistent benefit on WASM SIMD, so I left it out of this PR.
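For readers unfamiliar with the operation, the following is a minimal sketch of a vectorized plane rotation of the kind srot performs, assuming the vector path is taken only for unit strides and everything else uses a scalar loop. The function `sketch_srot` is hypothetical and calls `wasm_simd128.h` directly instead of going through the shared intrinsics layer; it is not the kernel enabled by this PR.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

/* Hypothetical helper: applies
 *   x[i] <- c*x[i] + s*y[i]
 *   y[i] <- c*y[i] - s*x[i]
 * to n elements with positive strides incx and incy. */
void sketch_srot(size_t n, float *x, size_t incx, float *y, size_t incy,
                 float c, float s)
{
    assert(incx > 0 && incy > 0);
    size_t i = 0;

#if defined(__wasm_simd128__)
    /* Assumption: only contiguous data goes through the vector path. */
    if (incx == 1 && incy == 1) {
        const v128_t vc = wasm_f32x4_splat(c);
        const v128_t vs = wasm_f32x4_splat(s);
        for (; i + 4 <= n; i += 4) {
            v128_t vx = wasm_v128_load(x + i);
            v128_t vy = wasm_v128_load(y + i);
            v128_t nx = wasm_f32x4_add(wasm_f32x4_mul(vc, vx),
                                       wasm_f32x4_mul(vs, vy));
            v128_t ny = wasm_f32x4_sub(wasm_f32x4_mul(vc, vy),
                                       wasm_f32x4_mul(vs, vx));
            wasm_v128_store(x + i, nx);
            wasm_v128_store(y + i, ny);
        }
    }
#endif

    /* Scalar loop: handles the tail, strided input, and non-SIMD builds. */
    for (; i < n; i++) {
        float xi = x[i * incx];
        float yi = y[i * incy];
        x[i * incx] = c * xi + s * yi;
        y[i * incy] = c * yi - s * xi;
    }
}
```

Under that assumption, strided calls never reach the vector loop, which would be consistent with stride-2 timings staying roughly flat.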

This PR intentionally keeps the scope small. It does not add new WASM-specific kernels, and it does not try to cover every reusable kernel that could potentially use the shared intrinsics layer at once.

If this is not the direction you would prefer upstream, I would be happy to adjust it.

@martin-frbg
Collaborator

martin-frbg commented Mar 18, 2026

Thanks - having the universal simd header should also allow leveraging the simd-based optimizations from #2867 in the generic dot.c (SDOT/DDOT/DSDOT) - the kernel file copied out of the riscv64_generic support currently uses its own copy of the more trivial generic code from arm/dot.c for some reason.

@teddygood
Contributor Author

> Thanks - having the universal simd header should also allow leveraging the simd-based optimizations from #2867 in the generic dot.c (SDOT/DDOT/DSDOT) - the kernel file copied out of the riscv64_generic support currently uses its own copy of the more trivial generic code from arm/dot.c for some reason.

Thanks, that is very helpful. Would you prefer to include that in this PR as well, or keep it as a follow-up once this backend is in place?

@martin-frbg
Collaborator

As you prefer - come to think of it, it should also be possible to trivially copy the "ifndef DOUBLE" part of the SIMD fallback kernel from x86_64 daxpy.c to saxpy.c (which currently has only a trivial C loop for when no assembly microkernel is available). But I guess that is all existing SIMD kernels then. :)

@teddygood
Contributor Author

> Thanks - having the universal simd header should also allow leveraging the simd-based optimizations from #2867 in the generic dot.c (SDOT/DDOT/DSDOT) - the kernel file copied out of the riscv64_generic support currently uses its own copy of the more trivial generic code from arm/dot.c for some reason.

Thanks, that is very helpful. In that case I’ll keep this PR to the intrinsics backend plus SAXPY, and leave the dot-related changes for a follow-up PR.

martin-frbg added this to the 0.3.32 milestone Mar 18, 2026
@teddygood
Contributor Author

I went ahead and added SAXPY in this PR by switching SAXPYKERNEL to x86_64/saxpy.c and adding the same kind of SIMD fallback used in daxpy.c. In local direct WASM benchmarking, contiguous saxpy improved by about 2.14x, 1.46x, and 1.10x for the sizes I tested, while a stride-2 case was roughly flat at about 1.04x.
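As a rough illustration of a SIMD fallback for SAXPY (y ← alpha·x + y), here is a minimal sketch with a contiguous vector loop and a scalar loop for everything else. The function `sketch_saxpy` is hypothetical and is not the code in x86_64/saxpy.c; it calls `wasm_simd128.h` directly instead of the shared intrinsics layer so that it stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

/* Hypothetical helper: y[i*incy] += alpha * x[i*incx] for i in [0, n). */
void sketch_saxpy(size_t n, float alpha, const float *x, size_t incx,
                  float *y, size_t incy)
{
    assert(incx > 0 && incy > 0);
    size_t i = 0;

#if defined(__wasm_simd128__)
    /* Assumption: only unit-stride data uses the vector loop. */
    if (incx == 1 && incy == 1) {
        const v128_t va = wasm_f32x4_splat(alpha);
        for (; i + 4 <= n; i += 4) {
            v128_t vx = wasm_v128_load(x + i);
            v128_t vy = wasm_v128_load(y + i);
            /* Baseline WASM SIMD has no fused multiply-add,
             * so the multiply and the add are separate operations. */
            vy = wasm_f32x4_add(vy, wasm_f32x4_mul(va, vx));
            wasm_v128_store(y + i, vy);
        }
    }
#endif

    /* Scalar loop: tail elements, strided input, and non-SIMD builds. */
    for (; i < n; i++)
        y[i * incy] += alpha * x[i * incx];
}
```

With four floats per 128-bit vector, any gain from the vector loop depends on how much of the call is spent in loop overhead and memory traffic, so speedups are expected to vary with vector length.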

martin-frbg merged commit adba2c3 into OpenMathLib:develop Mar 18, 2026
79 of 80 checks passed