Port Panama SIMD kernels to C++ using Google Highway #668
Conversation
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
eb12f8e to
73efe07
Compare
MarkWolters
left a comment
There was a problem hiding this comment.
I think it looks good. I did also use Bob to do a review and double check myself and it had this comment (I understand what it is saying but I will leave it to your discretion if it is correct):
jvector_simd_kernels.cpp lines 571 and 943
Both calculate_partial_sums_f32 and calculate_partial_sums_self_magnitude_f32 have a size==2 fast path that uses hn::Shuffle2301 for the horizontal add. This is wrong.
For size==2, centroids are interleaved as [c0[0], c0[1], c1[0], c1[1], ...]. The goal is to sum adjacent pairs within each centroid. Shuffle2301 on [a, b, c, d] produces [c, d, a, b] — swaps 64-bit halves — so
score + Shuffle2301(score) mixes elements from different centroids. The correct shuffle is Shuffle1032, which swaps adjacent 32-bit elements: [b, a, d, c], so score + Shuffle1032(score) gives [s0+s1, s0+s1,
s2+s3, s2+s3], which is the correct per-centroid sum.
Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index
past all centroids.
Fix:
// Line 571 and 943: change
hn::Shuffle2301(score) → hn::Shuffle1032(score)
hn::Shuffle2301(sum) → hn::Shuffle1032(sum)
Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing. |
I believe the confusion came from incorrect comments in the C++ file—Bob likely relied on those rather than the official Google Highway documentation to interpret the intrinsic. I’ve since corrected the comments here: 9a52b4d |
69f3d38 to
0fae13f
Compare
- Replace jvector_simd.c + jvector_simd_check.c with C++ using Highway - Add jvector_simd.cpp (JNI dispatch layer) and jvector_simd_kernels.cpp/h (all SIMD kernel implementations: FP32, PQ, NVQ) - Add meson.build for building with Highway targets - Add Google Highway as git submodule (third_party/highway) - Add supporting headers: jvector_cpuFeatures.h, assertHwyTargets.h - Regenerate NativeSimdOps.java JNI bindings from new jvector_simd.h - Add __fsid_t.java and max_align_t.java (jextract-generated stubs) - Remove AVX-512 check from NativeVectorizationProvider; replace with x86_64 architecture guard (Highway selects best ISA at runtime) - Remove AVX-512 test from NativeSimdOpsTest - Update jextract_vector_simd.sh for new header layout - Update README with Highway build instructions
Wire up FP32 SIMD kernels in NativeVectorUtilSupport: - dotProduct(v1, v2) and dotProduct(v1, offset, v2, offset, len) via dot_product_f32 (dispatches to best ISA via Highway) - cosine(v1, v2) and cosine(v1, offset, v2, offset, len) via cosine_f32 - squareDistance(v1, v2) and squareDistance(v1, offset, v2, offset, len) via euclidean_f32 - addInPlace(v1, v2) and addInPlace(v1, scalar) via add_in_place_f32 / add_scalar_in_place_f32 - subInPlace(v1, v2) and subInPlace(v1, scalar) via sub_in_place_f32 / sub_scalar_in_place_f32 - max(v) via max_f32 - minInPlace(v1, v2) via min_in_place_f32 FP32 distance kernels are gated on length >= 128 (below that threshold the Panama vector fallback is used).
Wire up PQ SIMD kernels in NativeVectorUtilSupport: - assembleAndSum: switch to assemble_and_sum_f32 (was _512 variant) - assembleAndSumPQ: replace Java fallback with assemble_and_sum_pq_f32 native call; validates ordinal offsets are 0 via assertions - pqDecodedCosineSimilarity: switch to pq_decoded_cosine_similarity_f32 (was _512 variant); passes length as long - calculatePartialSums (new): dispatches to calculate_partial_sums_euclidean_f32 or calculate_partial_sums_dot_f32 based on VectorSimilarityFunction
Wire up NVQ (Non-uniform Vector Quantization) SIMD kernels in NativeVectorUtilSupport: - nvqShuffleQueryInPlace8bit: pre-shuffle query vector for fast-lane dequantization in scoring kernels - nvqQuantize8bit: quantize float vector to 8-bit NVQ representation - nvqLoss / nvqUniformLoss: compute quantization loss for parameter tuning - nvqSquareL2Distance8bit: L2 distance between float query and 8-bit quantized vector - nvqDotProduct8bit: dot product between float query and 8-bit quantized vector - nvqCosine8bit: cosine similarity; native returns packed int64 (low 32 bits = dot sum, high 32 bits = quantized magnitude), unpacked to float[]
0fae13f to
18b1b5d
Compare
2c8e07d to
1edff51
Compare
|
@akash-shankaran Thanks for the review! I resolved all the conversations for now. Feel free to reopen any if you have more questions/concerns. |
This reverts commit 1edff51.
…lerplate Add jvector_simd_kernel_list.h with a single JVECTOR_SIMD_KERNEL_LIST X-Macro table that serves as the single source of truth for all SIMD kernel signatures. The macro auto-generates: - Namespace declarations in jvector_simd_kernels.h - KernelVTable struct members in jvector_simd.cpp - Vtable initializers for AVX3, AVX2, and SSE42 - Public API wrapper functions and their C declarations in jvector_simd.h Net result: ~390 lines removed. Adding a new kernel now requires only a single KERNEL_ENTRY line in the list header.
|
Updated PR with these commits:
|
|
|
||
| HWY_FLATTEN float my_new_op_f32(const float* HWY_RESTRICT a, size_t length) | ||
| { | ||
| #if HWY_STATIC_TARGET == HWY_AVX3 |
There was a problem hiding this comment.
| #if HWY_STATIC_TARGET == HWY_AVX3 | |
| #if HWY_STATIC_TARGET >= HWY_AVX3 |
| float result = _mm512_reduce_add_ps(acc); | ||
| for (; i < length; ++i) result += a[i]; | ||
| return result; | ||
| #else |
There was a problem hiding this comment.
I wonder if there's a better way to do this than using #if... #else within a function body. In particular, specialisations which use completely different algorithms probably deserve their own function.
IDE tools also don't play that well with #if... #else blocks, and things like this are possible.
| } // namespace MY_NEW_TIER | ||
| ``` | ||
|
|
||
| ### Step 6 — Add a compile-time assertion (`assert_hwy_targets.h`) |
There was a problem hiding this comment.
Adding a new SIMD architecture seems like a fairly involved process. Does it make sense to use an X-macro style approach here as well, or is that just adding unnecessary complexity?
| within an existing tier. A tier is appropriate when a new ISA extension (e.g. | ||
| native FP16, BF16, or AMX arithmetic) requires a different compiler target than | ||
| any existing tier, so the kernels cannot share a compilation unit with | ||
| `jvector_simd_kernels.cpp`. |
There was a problem hiding this comment.
Is this process any different if adding support for a completely different architecture (say, NEON) as opposed to adding an x86_64 extension? Would be useful to update this section accordingly.
| `jvector-native/src/main/native/jextract_vector_simd.sh`. To build and auto-install `g++` on Ubuntu: | ||
|
|
||
| ```bash | ||
| ./jvector-native/src/main/native/jextract_vector_simd.sh --auto-install-g++ |
There was a problem hiding this comment.
This and the paragraph above it needs to be updated to --auto-install-deps
Add two new Highway ISA tiers above the existing AVX3 baseline:
- AVX3_DL (Ice Lake): compiled with -march=icelake-server, covering the
full ICX extension set (VNNI, VBMI, VBMI2, IFMA, BITALG, VPOPCNTDQ,
GFNI, VAES, VPCLMULQDQ). Source file: jvector_avx3_dl_kernels.cpp.
- AVX3_SPR (Sapphire Rapids): compiled with -march=sapphirerapids, adding
AVX512FP16 and AVX512BF16 on top of the ICX set. Source file:
jvector_avx3_spr_kernels.cpp.
Both tiers currently inherit all vtable slots from AVX3 unchanged;
their dedicated source files are ready for ISA-specific kernel overrides.
Infrastructure changes:
- jvector_cpu_features.h: add composite CpuFeature::AVX3 (100),
AVX3_DL (101), AVX3_SPR (102) flags computed from raw CPUID bits in
populate_cpu_features(), simplifying dispatch to a single has() test
per tier. Values start at 100 to leave room for future raw features.
- jvector_simd.cpp: add MaxIsa::AVX3_DL / AVX3_SPR enum values,
read_max_isa() mappings ("avx3_dl", "avx3_spr"), vtables, and
CPUID dispatch gates.
- jvector_simd_kernels.h: add DECLARE_SIMD_KERNELS for both new namespaces.
- assert_hwy_targets.h: add JV_REQUIRE_HWY_AVX3_DL / AVX3_SPR guards.
- meson.build: register the two new static_library() compilation units.
- README.md: update architecture diagram, ISA cap env-var docs, and both
the "Adding a new kernel" and "Adding a new ISA tier" how-to sections.
c4ae96e to
5812dc4
Compare
|
Added two commits related to unit tests and CI coverage:
|
5eccdbe to
242280a
Compare
This PR rewrites JVector's Panama Vector API-based SIMD kernels with native C++ implementations using Google Highway, a portable SIMD library that compiles a single kernel source into multiple ISA targets (SSE42, AVX2 and AVX-512) and dispatches at runtime.
What changed
jvector_simd.cis replaced byjvector_simd.cpp(ISA dispatch shim) andjvector_simd_kernels.cpp(all Highway kernel implementations), with a newmeson.builddriving multi-target compilation (adds meson as a build dependency).calculatePartialSelfSumkernel.NativeSimdOps.javais regenerated viajextractto match the updated C APINativeVectorUtilSupportis updated and the native kernels are now unconditionally preferred over any Panama fallback for dot product, L2, and cosine distanceThe README file
jvector-native/src/main/c/README.mdis a good start before reviewing the code.