Add a WebAssembly SIMD backend for reusable intrinsics kernels #5685
martin-frbg merged 3 commits into OpenMathLib:develop
Thanks - having the universal SIMD header should also allow leveraging the SIMD-based optimizations from #2867 in the generic dot.c (SDOT/DDOT/DSDOT) - the kernel file copied out of the riscv64_generic support currently uses its own copy of the more trivial generic code from arm/dot.c for some reason.
Thanks, that is very helpful. Would you prefer to include that in this PR as well, or keep it as a follow-up once this backend is in place?
As you prefer - come to think of it, it should also be possible to trivially copy the "ifndef DOUBLE" part of the SIMD fallback kernel from x86_64 daxpy.c to saxpy.c (which currently has only a trivial C loop for when no assembly microkernel is available). But I guess that is all existing SIMD kernels then. :)
Thanks, that is very helpful. In that case I’ll keep this PR to the intrinsics backend plus SAXPY, and leave the dot-related changes for a follow-up PR.
I went ahead and added SAXPY in this PR by switching SAXPYKERNEL to x86_64/saxpy.c and adding the same kind of SIMD fallback used in daxpy.c. In local direct WASM benchmarking, contiguous saxpy improved by about 2.14x, 1.46x, and 1.10x for the sizes I tested, while a stride-2 case was roughly flat at about 1.04x.
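For readers who have not seen this kind of kernel, the following is a minimal sketch of a SAXPY with a SIMD fast path for the contiguous case and a plain scalar loop for everything else. It is not the code from x86_64/saxpy.c: the function names are hypothetical, it calls the Emscripten/clang wasm_simd128.h intrinsics directly rather than going through the shared intrinsics layer, and it assumes positive strides. Keeping the strided case scalar is an assumption of this sketch, though it would be consistent with the stride-2 numbers above staying roughly flat.

```c
#include <stddef.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

/* Hypothetical name: y[i] += alpha * x[i] for unit-stride x and y. */
static void saxpy_contiguous_sketch(size_t n, float alpha,
                                    const float *x, float *y)
{
    size_t i = 0;
#if defined(__wasm_simd128__)
    /* Four floats per 128-bit vector; loads and stores are unaligned-safe. */
    const v128_t valpha = wasm_f32x4_splat(alpha);
    for (; i + 4 <= n; i += 4) {
        v128_t vx = wasm_v128_load(x + i);
        v128_t vy = wasm_v128_load(y + i);
        vy = wasm_f32x4_add(vy, wasm_f32x4_mul(valpha, vx));
        wasm_v128_store(y + i, vy);
    }
#endif
    /* Scalar tail; also the whole loop when SIMD is not compiled in. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}

/* Hypothetical driver: dispatches on stride. Assumes incx > 0 and incy > 0. */
static void saxpy_sketch(size_t n, float alpha,
                         const float *x, ptrdiff_t incx,
                         float *y, ptrdiff_t incy)
{
    if (incx == 1 && incy == 1) {
        saxpy_contiguous_sketch(n, alpha, x, y);
        return;
    }
    /* Strided access stays scalar in this sketch. */
    ptrdiff_t ix = 0, iy = 0;
    for (size_t i = 0; i < n; i++) {
        y[iy] += alpha * x[ix];
        ix += incx;
        iy += incy;
    }
}
```

Built natively, the block compiles to the scalar loop only; with Emscripten and -msimd128 the vector loop is enabled and the scalar loop handles the last n mod 4 elements.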
Follow-up to #5676 and #4023.
This PR adds a WebAssembly SIMD backend for the shared SIMD intrinsics layer. The goal is to let existing reusable kernels that already use the common intrinsics API execute through WebAssembly SIMD on ARCH_WASM instead of falling back to scalar code. As part of that, it adds kernel/simd/intrin_wasm.h, wires it up from kernel/simd/intrin.h for __wasm_simd128__, and fills in the missing reduction helpers needed for backend completeness.
I also enabled the generic vector path for srot on ARCH_WASM. In local Pyodide/Emscripten testing, contiguous daxpy improved by about 1.23x–1.40x over the current baseline, and contiguous srot improved by about 1.41x–2.13x. Stride-2 cases were approximately flat. drot was evaluated too, but local testing did not show a clear or consistent benefit on WASM SIMD, so I left it out of this PR.
This PR intentionally keeps the scope small. It does not add new WASM-specific kernels, and it does not try to cover every reusable kernel that could potentially use the shared intrinsics layer at once.
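To illustrate the shape of such a backend, here is a rough sketch of a thin wrapper layer selected on __wasm_simd128__, with a horizontal-sum reduction helper and two kernels written only against the wrappers. Everything prefixed ex_ or EX_ is hypothetical and is not the set of identifiers used in kernel/simd/intrin_wasm.h; the one-lane scalar stand-in exists only so the sketch also builds without WASM SIMD.

```c
#include <stddef.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
/* Hypothetical wrappers over the 128-bit WASM SIMD intrinsics. */
typedef v128_t ex_f32;
#define EX_NLANES_F32 4
static inline ex_f32 ex_loadu_f32(const float *p) { return wasm_v128_load(p); }
static inline void ex_storeu_f32(float *p, ex_f32 v) { wasm_v128_store(p, v); }
static inline ex_f32 ex_setall_f32(float a) { return wasm_f32x4_splat(a); }
static inline ex_f32 ex_add_f32(ex_f32 a, ex_f32 b) { return wasm_f32x4_add(a, b); }
static inline ex_f32 ex_sub_f32(ex_f32 a, ex_f32 b) { return wasm_f32x4_sub(a, b); }
static inline ex_f32 ex_mul_f32(ex_f32 a, ex_f32 b) { return wasm_f32x4_mul(a, b); }
/* Reduction helper: horizontal sum of the four lanes. */
static inline float ex_sum_f32(ex_f32 v)
{
    return (wasm_f32x4_extract_lane(v, 0) + wasm_f32x4_extract_lane(v, 1)) +
           (wasm_f32x4_extract_lane(v, 2) + wasm_f32x4_extract_lane(v, 3));
}
#else
/* One-lane scalar stand-in so the kernels below compile everywhere. */
typedef float ex_f32;
#define EX_NLANES_F32 1
static inline ex_f32 ex_loadu_f32(const float *p) { return *p; }
static inline void ex_storeu_f32(float *p, ex_f32 v) { *p = v; }
static inline ex_f32 ex_setall_f32(float a) { return a; }
static inline ex_f32 ex_add_f32(ex_f32 a, ex_f32 b) { return a + b; }
static inline ex_f32 ex_sub_f32(ex_f32 a, ex_f32 b) { return a - b; }
static inline ex_f32 ex_mul_f32(ex_f32 a, ex_f32 b) { return a * b; }
static inline float ex_sum_f32(ex_f32 v) { return v; }
#endif

/* Plane rotation on unit-stride data: x' = c*x + s*y, y' = c*y - s*x. */
static void srot_contiguous_sketch(size_t n, float *x, float *y, float c, float s)
{
    size_t i = 0;
    const ex_f32 vc = ex_setall_f32(c);
    const ex_f32 vs = ex_setall_f32(s);
    for (; i + EX_NLANES_F32 <= n; i += EX_NLANES_F32) {
        ex_f32 vx = ex_loadu_f32(x + i);
        ex_f32 vy = ex_loadu_f32(y + i);
        ex_storeu_f32(x + i, ex_add_f32(ex_mul_f32(vc, vx), ex_mul_f32(vs, vy)));
        ex_storeu_f32(y + i, ex_sub_f32(ex_mul_f32(vc, vy), ex_mul_f32(vs, vx)));
    }
    for (; i < n; i++) { /* scalar tail */
        float t = c * x[i] + s * y[i];
        y[i] = c * y[i] - s * x[i];
        x[i] = t;
    }
}

/* Dot product on unit-stride data, showing where the reduction helper is needed. */
static float sdot_contiguous_sketch(size_t n, const float *x, const float *y)
{
    size_t i = 0;
    ex_f32 vacc = ex_setall_f32(0.0f);
    for (; i + EX_NLANES_F32 <= n; i += EX_NLANES_F32)
        vacc = ex_add_f32(vacc, ex_mul_f32(ex_loadu_f32(x + i), ex_loadu_f32(y + i)));
    float acc = ex_sum_f32(vacc); /* collapse the lanes to one scalar */
    for (; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}
```

The kernels never mention WASM directly, which is the point of a shared layer: adding a backend means supplying the wrapper set, including reductions such as the horizontal sum, and the existing kernels pick it up unchanged.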
If this is not the direction you would prefer upstream, I would be happy to adjust it.