Port Panama SIMD kernels to C++ using Google Highway by r-devulap · Pull Request #668 · datastax/jvector

r-devulap · 2026-05-26T15:15:21Z

This PR replaces JVector's Panama Vector API-based SIMD kernels with native C++ implementations built on Google Highway, a portable SIMD library that compiles a single kernel source into multiple ISA targets (SSE42, AVX2, and AVX-512) and dispatches between them at runtime.

What changed

Introduce Google Highway as a git submodule.
jvector_simd.c is replaced by jvector_simd.cpp (ISA dispatch shim) and jvector_simd_kernels.cpp (all Highway kernel implementations), with a new meson.build driving multi-target compilation (adds meson as a build dependency).
All FP32, PQ, and NVQ kernels are ported to Highway, with a new calculatePartialSelfSum kernel.
The optimizations from #651 (Wire calculatePartialSums to native SIMD via Panama FFI downcall) are ported to Google Highway.
NativeSimdOps.java is regenerated via jextract to match the updated C API.
NativeVectorUtilSupport is updated, and the native kernels are now unconditionally preferred over any Panama fallback for dot product, L2, and cosine distance.

The README file jvector-native/src/main/c/README.md is a good starting point before reviewing the code.

github-actions · 2026-05-26T15:15:38Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

jshook

Looks ok to me.

MarkWolters

I think it looks good. I did also use Bob to do a review and double check myself and it had this comment (I understand what it is saying but I will leave it to your discretion if it is correct):

jvector_simd_kernels.cpp lines 571 and 943

Both calculate_partial_sums_f32 and calculate_partial_sums_self_magnitude_f32 have a size==2 fast path that uses hn::Shuffle2301 for the horizontal add. This is wrong.

For size==2, centroids are interleaved as [c0[0], c0[1], c1[0], c1[1], ...]. The goal is to sum adjacent pairs within each centroid. Shuffle2301 on [a, b, c, d] produces [c, d, a, b] — swaps 64-bit halves — so
score + Shuffle2301(score) mixes elements from different centroids. The correct shuffle is Shuffle1032, which swaps adjacent 32-bit elements: [b, a, d, c], so score + Shuffle1032(score) gives [s0+s1, s0+s1,
s2+s3, s2+s3], which is the correct per-centroid sum.

Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index
past all centroids.

Fix:
// Line 571 and 943: change
hn::Shuffle2301(score) → hn::Shuffle1032(score)
hn::Shuffle2301(sum) → hn::Shuffle1032(sum)

r-devulap · 2026-05-27T01:34:52Z

> Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index past all centroids.
>
> Fix:
> // Line 571 and 943: change
> hn::Shuffle2301(score) → hn::Shuffle1032(score)
> hn::Shuffle2301(sum) → hn::Shuffle1032(sum)

Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing.

V: {u,i,f}{32}
V Shuffle2301(V): returns blocks with 32-bit halves swapped inside 64-bit halves.
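To make the difference concrete, here is a minimal scalar model of the two shuffles on one 128-bit block of four 32-bit lanes. It does not call Highway; `shuffle2301_model`, `shuffle1032_model`, and `pair_sums` are hypothetical names used only for illustration, and the model assumes the Highway convention that the digits name source lanes from the highest lane down to lane 0.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One 128-bit block of four 32-bit lanes; index 0 is the lowest lane.
using Block = std::array<float, 4>;

// Scalar model of Shuffle2301: swaps the two 32-bit lanes inside each
// 64-bit half, so [a, b, c, d] becomes [b, a, d, c].
inline Block shuffle2301_model(const Block& v) { return {v[1], v[0], v[3], v[2]}; }

// Scalar model of Shuffle1032: swaps the two 64-bit halves,
// so [a, b, c, d] becomes [c, d, a, b].
inline Block shuffle1032_model(const Block& v) { return {v[2], v[3], v[0], v[1]}; }

// Lane-wise addition of two blocks.
inline Block add(const Block& a, const Block& b) {
    Block out{};
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Horizontal add for size == 2 subspaces: each adjacent pair of lanes
// belongs to one centroid, so the pair sum must stay inside a 64-bit half.
inline Block pair_sums(const Block& score) { return add(score, shuffle2301_model(score)); }

// Usage: lanes {1, 2} belong to centroid 0 and lanes {10, 20} to centroid 1.
inline void example_usage() {
    const Block score = {1.0f, 2.0f, 10.0f, 20.0f};
    // Swapping within 64-bit halves gives the per-centroid sums.
    assert((pair_sums(score) == Block{3.0f, 3.0f, 30.0f, 30.0f}));
    // Swapping the 64-bit halves instead mixes the two centroids.
    assert((add(score, shuffle1032_model(score)) == Block{11.0f, 22.0f, 11.0f, 22.0f}));
}
```

In this model, adding a block to its `Shuffle2301` result yields the per-centroid sums, which matches the documentation quoted above.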

r-devulap · 2026-05-27T02:54:33Z

> Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing.

I believe the confusion came from incorrect comments in the C++ file—Bob likely relied on those rather than the official Google Highway documentation to interpret the intrinsic. I’ve since corrected the comments here: 9a52b4d

- Replace jvector_simd.c + jvector_simd_check.c with C++ using Highway
- Add jvector_simd.cpp (JNI dispatch layer) and jvector_simd_kernels.cpp/h (all SIMD kernel implementations: FP32, PQ, NVQ)
- Add meson.build for building with Highway targets
- Add Google Highway as git submodule (third_party/highway)
- Add supporting headers: jvector_cpuFeatures.h, assertHwyTargets.h
- Regenerate NativeSimdOps.java JNI bindings from new jvector_simd.h
- Add __fsid_t.java and max_align_t.java (jextract-generated stubs)
- Remove AVX-512 check from NativeVectorizationProvider; replace with x86_64 architecture guard (Highway selects best ISA at runtime)
- Remove AVX-512 test from NativeSimdOpsTest
- Update jextract_vector_simd.sh for new header layout
- Update README with Highway build instructions

Wire up FP32 SIMD kernels in NativeVectorUtilSupport:
- dotProduct(v1, v2) and dotProduct(v1, offset, v2, offset, len) via dot_product_f32 (dispatches to best ISA via Highway)
- cosine(v1, v2) and cosine(v1, offset, v2, offset, len) via cosine_f32
- squareDistance(v1, v2) and squareDistance(v1, offset, v2, offset, len) via euclidean_f32
- addInPlace(v1, v2) and addInPlace(v1, scalar) via add_in_place_f32 / add_scalar_in_place_f32
- subInPlace(v1, v2) and subInPlace(v1, scalar) via sub_in_place_f32 / sub_scalar_in_place_f32
- max(v) via max_f32
- minInPlace(v1, v2) via min_in_place_f32

FP32 distance kernels are gated on length >= 128 (below that threshold the Panama vector fallback is used).

Wire up PQ SIMD kernels in NativeVectorUtilSupport:
- assembleAndSum: switch to assemble_and_sum_f32 (was _512 variant)
- assembleAndSumPQ: replace Java fallback with assemble_and_sum_pq_f32 native call; validates ordinal offsets are 0 via assertions
- pqDecodedCosineSimilarity: switch to pq_decoded_cosine_similarity_f32 (was _512 variant); passes length as long
- calculatePartialSums (new): dispatches to calculate_partial_sums_euclidean_f32 or calculate_partial_sums_dot_f32 based on VectorSimilarityFunction

Wire up NVQ (Non-uniform Vector Quantization) SIMD kernels in NativeVectorUtilSupport:
- nvqShuffleQueryInPlace8bit: pre-shuffle query vector for fast-lane dequantization in scoring kernels
- nvqQuantize8bit: quantize float vector to 8-bit NVQ representation
- nvqLoss / nvqUniformLoss: compute quantization loss for parameter tuning
- nvqSquareL2Distance8bit: L2 distance between float query and 8-bit quantized vector
- nvqDotProduct8bit: dot product between float query and 8-bit quantized vector
- nvqCosine8bit: cosine similarity; native returns packed int64 (low 32 bits = dot sum, high 32 bits = quantized magnitude), unpacked to float[]
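The packed int64 return of nvqCosine8bit can be pictured with a small sketch. It assumes each 32-bit half carries the raw IEEE-754 bits of a float, which the commit message implies but does not state; `pack_two_floats` and `unpack_two_floats` are hypothetical helpers, not the PR's actual functions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Pack two floats into one 64-bit value: `low` goes in bits 0..31,
// `high` in bits 32..63. memcpy copies the bit pattern without aliasing issues.
inline std::int64_t pack_two_floats(float low, float high) {
    std::uint32_t lo_bits = 0;
    std::uint32_t hi_bits = 0;
    std::memcpy(&lo_bits, &low, sizeof lo_bits);
    std::memcpy(&hi_bits, &high, sizeof hi_bits);
    const std::uint64_t packed = (static_cast<std::uint64_t>(hi_bits) << 32) | lo_bits;
    std::int64_t out = 0;
    std::memcpy(&out, &packed, sizeof out);
    return out;
}

// Inverse of pack_two_floats: returns {low, high}.
inline std::pair<float, float> unpack_two_floats(std::int64_t packed) {
    std::uint64_t bits = 0;
    std::memcpy(&bits, &packed, sizeof bits);
    const std::uint32_t lo_bits = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    const std::uint32_t hi_bits = static_cast<std::uint32_t>(bits >> 32);
    float low = 0.0f;
    float high = 0.0f;
    std::memcpy(&low, &lo_bits, sizeof low);
    std::memcpy(&high, &hi_bits, sizeof high);
    return {low, high};
}

// Usage: a (dot sum, magnitude) pair survives the round trip unchanged.
inline void example_usage() {
    const std::pair<float, float> r = unpack_two_floats(pack_two_floats(1.5f, -2.25f));
    assert(r.first == 1.5f);
    assert(r.second == -2.25f);
}
```

Returning a single 64-bit value lets a native call hand back two results without an out-parameter; whether the PR chose the encoding for that reason is not stated here.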

r-devulap · 2026-06-25T04:40:52Z

@akash-shankaran Thanks for the review! I resolved all the conversations for now. Feel free to reopen any if you have more questions/concerns.

…lerplate

Add jvector_simd_kernel_list.h with a single JVECTOR_SIMD_KERNEL_LIST X-Macro table that serves as the single source of truth for all SIMD kernel signatures. The macro auto-generates:
- Namespace declarations in jvector_simd_kernels.h
- KernelVTable struct members in jvector_simd.cpp
- Vtable initializers for AVX3, AVX2, and SSE42
- Public API wrapper functions and their C declarations in jvector_simd.h

Net result: ~390 lines removed. Adding a new kernel now requires only a single KERNEL_ENTRY line in the list header.
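For readers unfamiliar with the X-Macro technique, a minimal sketch follows. It is not the PR's code: `DEMO_KERNEL_LIST`, `DemoVTable`, `demo_avx2`, and the four-argument entry shape are all hypothetical, chosen only to show how one list can generate struct members, an initializer, and wrapper functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel table. Each entry is
// X(return type, name, parenthesized parameter list, parenthesized argument list).
#define DEMO_KERNEL_LIST(X)                                        \
    X(float, sum_f32, (const float* a, std::size_t n), (a, n))     \
    X(float, max_f32, (const float* a, std::size_t n), (a, n))

// Stand-in scalar "per-ISA" implementations; a real build would have one
// namespace per compiled target.
namespace demo_avx2 {
inline float sum_f32(const float* a, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i];
    return acc;
}
inline float max_f32(const float* a, std::size_t n) {
    assert(n > 0);
    float best = a[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (a[i] > best) best = a[i];
    }
    return best;
}
}  // namespace demo_avx2

// 1. Generate one function-pointer member per kernel.
struct DemoVTable {
#define X(ret, name, params, args) ret (*name) params;
    DEMO_KERNEL_LIST(X)
#undef X
};

// 2. Generate the initializer for one target, in list order.
static const DemoVTable kDemoAvx2Table = {
#define X(ret, name, params, args) &demo_avx2::name,
    DEMO_KERNEL_LIST(X)
#undef X
};

// 3. Generate public wrappers that forward through the table.
#define X(ret, name, params, args) \
    inline ret demo_##name params { return kDemoAvx2Table.name args; }
DEMO_KERNEL_LIST(X)
#undef X

// Usage: the generated wrappers behave like hand-written functions.
inline void example_usage() {
    const float v[] = {1.0f, 4.0f, 2.0f};
    assert(demo_sum_f32(v, 3) == 7.0f);
    assert(demo_max_f32(v, 3) == 4.0f);
}
```

With this layout, adding a kernel means adding one list entry plus its per-target implementation; the struct member, initializer slot, and wrapper follow from the list.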

r-devulap · 2026-06-30T16:29:44Z

Updated PR with these commits:

| SHA | Description |
| --- | --- |
| `b0136cb` | Adds Maven build profiles (debug, debugoptimized, release) for the native module. These are required to debug the native module using gdb or CLion. |
| `2075f2f` | Refactors native code to use an X-Macro kernel list, eliminating repetitive SIMD boilerplate. Adding new kernels for the AVX3/AVX2/SSE42 targets is now a simple 3-step process. |
| `e22aedd` | Updates release notes and the native module README to document recent changes, including a new step-by-step checklist in the README for adding SIMD kernels to the native module. |
| `8cf3162` | Adds AVX3_DL (Ice Lake) and AVX3_SPR (Sapphire Rapids) ISA tiers to the native module for finer-grained CPU dispatch, bringing x86 ISA support up to date through Intel GNR. This also makes it easier for developers to add specialized SIMD kernels going forward. |

ashkrisk · 2026-07-01T06:21:42Z

+
+HWY_FLATTEN float my_new_op_f32(const float* HWY_RESTRICT a, size_t length)
+{
+#if HWY_STATIC_TARGET == HWY_AVX3


Suggested change

#if HWY_STATIC_TARGET == HWY_AVX3

#if HWY_STATIC_TARGET >= HWY_AVX3

ashkrisk · 2026-07-01T06:35:24Z

+    float result = _mm512_reduce_add_ps(acc);
+    for (; i < length; ++i) result += a[i];
+    return result;
+#else


I wonder if there's a better way to do this than using #if... #else within a function body. In particular, specialisations which use completely different algorithms probably deserve their own function.

IDE tools also don't play that well with #if... #else blocks, and things like this are possible.

ashkrisk · 2026-07-01T06:37:19Z

+} // namespace MY_NEW_TIER
+```
+
+### Step 6 — Add a compile-time assertion (`assert_hwy_targets.h`)


Adding a new SIMD architecture seems like a fairly involved process. Does it make sense to use an X-macro style approach here as well, or is that just adding unnecessary complexity?

ashkrisk · 2026-07-01T06:41:08Z

+within an existing tier. A tier is appropriate when a new ISA extension (e.g.
+native FP16, BF16, or AMX arithmetic) requires a different compiler target than
+any existing tier, so the kernels cannot share a compilation unit with
+`jvector_simd_kernels.cpp`.


Is this process any different if adding support for a completely different architecture (say, NEON) as opposed to adding an x86_64 extension? Would be useful to update this section accordingly.

ashkrisk · 2026-07-01T06:42:44Z

+`jvector-native/src/main/native/jextract_vector_simd.sh`. To build and auto-install `g++` on Ubuntu:
+
+```bash
+./jvector-native/src/main/native/jextract_vector_simd.sh --auto-install-g++


This and the paragraph above it need to be updated to --auto-install-deps.

Add two new Highway ISA tiers above the existing AVX3 baseline:
- AVX3_DL (Ice Lake): compiled with -march=icelake-server, covering the full ICX extension set (VNNI, VBMI, VBMI2, IFMA, BITALG, VPOPCNTDQ, GFNI, VAES, VPCLMULQDQ). Source file: jvector_avx3_dl_kernels.cpp.
- AVX3_SPR (Sapphire Rapids): compiled with -march=sapphirerapids, adding AVX512FP16 and AVX512BF16 on top of the ICX set. Source file: jvector_avx3_spr_kernels.cpp.

Both tiers currently inherit all vtable slots from AVX3 unchanged; their dedicated source files are ready for ISA-specific kernel overrides.

Infrastructure changes:
- jvector_cpu_features.h: add composite CpuFeature::AVX3 (100), AVX3_DL (101), AVX3_SPR (102) flags computed from raw CPUID bits in populate_cpu_features(), simplifying dispatch to a single has() test per tier. Values start at 100 to leave room for future raw features.
- jvector_simd.cpp: add MaxIsa::AVX3_DL / AVX3_SPR enum values, read_max_isa() mappings ("avx3_dl", "avx3_spr"), vtables, and CPUID dispatch gates.
- jvector_simd_kernels.h: add DECLARE_SIMD_KERNELS for both new namespaces.
- assert_hwy_targets.h: add JV_REQUIRE_HWY_AVX3_DL / AVX3_SPR guards.
- meson.build: register the two new static_library() compilation units.
- README.md: update architecture diagram, ISA cap env-var docs, and both the "Adding a new kernel" and "Adding a new ISA tier" how-to sections.
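The composite-flag idea can be sketched as follows. This is an illustration under assumptions, not the contents of jvector_cpu_features.h: the `Feature` enum, `FeatureSet` class, the abridged list of raw flags, and the exact prerequisites per tier are hypothetical; only the values 100, 101, and 102 and the "single has() test per tier" goal come from the commit message.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>

// Raw flags would come from CPUID; only a hypothetical, abridged subset is shown.
enum class Feature : int {
    AVX512F = 0, AVX512BW, AVX512CD, AVX512DQ, AVX512VL,  // assumed AVX3 baseline
    AVX512VNNI, AVX512VBMI, AVX512VBMI2,                  // part of the Ice Lake set
    AVX512FP16, AVX512BF16,                               // Sapphire Rapids additions
    // Composite tiers start at 100 to leave room for more raw flags.
    AVX3 = 100, AVX3_DL = 101, AVX3_SPR = 102
};

class FeatureSet {
public:
    void add(Feature f) { flags_.insert(f); }
    bool has(Feature f) const { return flags_.count(f) != 0; }

    // Derive each composite tier from the raw flags and the tier below it.
    void derive_composites() {
        if (has_all({Feature::AVX512F, Feature::AVX512BW, Feature::AVX512CD,
                     Feature::AVX512DQ, Feature::AVX512VL})) {
            add(Feature::AVX3);
        }
        if (has(Feature::AVX3) &&
            has_all({Feature::AVX512VNNI, Feature::AVX512VBMI, Feature::AVX512VBMI2})) {
            add(Feature::AVX3_DL);
        }
        if (has(Feature::AVX3_DL) &&
            has_all({Feature::AVX512FP16, Feature::AVX512BF16})) {
            add(Feature::AVX3_SPR);
        }
    }

private:
    bool has_all(std::initializer_list<Feature> fs) const {
        for (Feature f : fs) {
            if (!has(f)) return false;
        }
        return true;
    }
    std::set<Feature> flags_;
};

// Usage: a CPU with only the baseline AVX-512 flags qualifies for AVX3 alone.
inline void example_usage() {
    FeatureSet cpu;
    for (Feature f : {Feature::AVX512F, Feature::AVX512BW, Feature::AVX512CD,
                      Feature::AVX512DQ, Feature::AVX512VL}) {
        cpu.add(f);
    }
    cpu.derive_composites();
    assert(cpu.has(Feature::AVX3));
    assert(!cpu.has(Feature::AVX3_DL));
}
```

Once the composites are derived, a dispatcher only needs one `has()` call per tier, checked from the most capable tier downward.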

r-devulap · 2026-07-03T07:39:43Z

Added two commits related to unit tests and CI coverage:

242280a: Expanded the build-avx512 CI matrix with avx2 and sse42 JVECTOR_MAX_ISA scenarios on JDK 24 only. Also updated it to run all tests, not just the jvector-tests module.
63eecd9: Added the native jvector_simd_get_active_isa() API and Java bindings, along with a new DispatcherCpuFlagsTest to verify JVECTOR_MAX_ISA dispatch behavior.
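As a rough picture of what a JVECTOR_MAX_ISA cap does, the sketch below parses the variable and clamps a detected tier to it. The `IsaCap` enum, the function names, the "avx3" spelling, and the treatment of unset or unrecognized values as "no cap" are assumptions for illustration; the PR's own read_max_isa() may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Tiers ordered from least to most capable; None means "no cap requested".
enum class IsaCap : int { SSE42 = 0, AVX2, AVX3, AVX3_DL, AVX3_SPR, None };

// Map the environment value to a cap. Unset or unrecognized values fall back
// to None in this sketch (an assumption, not confirmed behavior).
inline IsaCap parse_isa_cap(const char* value) {
    if (value == nullptr) return IsaCap::None;
    const std::string v(value);
    if (v == "sse42") return IsaCap::SSE42;
    if (v == "avx2") return IsaCap::AVX2;
    if (v == "avx3") return IsaCap::AVX3;  // spelling assumed
    if (v == "avx3_dl") return IsaCap::AVX3_DL;
    if (v == "avx3_spr") return IsaCap::AVX3_SPR;
    return IsaCap::None;
}

// Read the cap from the process environment.
inline IsaCap read_isa_cap_from_env() {
    return parse_isa_cap(std::getenv("JVECTOR_MAX_ISA"));
}

// The tier actually used is the lower of what the CPU supports and the cap.
inline IsaCap effective_isa(IsaCap detected, IsaCap cap) {
    return (static_cast<int>(cap) < static_cast<int>(detected)) ? cap : detected;
}

// Usage: an Ice Lake class CPU capped to avx2 dispatches AVX2 kernels.
inline void example_usage() {
    assert(effective_isa(IsaCap::AVX3_DL, parse_isa_cap("avx2")) == IsaCap::AVX2);
    // A cap above the detected tier cannot raise it.
    assert(effective_isa(IsaCap::AVX2, parse_isa_cap("avx3_spr")) == IsaCap::AVX2);
}
```

A test in the spirit of DispatcherCpuFlagsTest could set the variable, query the active ISA, and check that it never exceeds the cap.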

r-devulap requested review from MarkWolters, ashkrisk, jshook and tlwillke as code owners May 26, 2026 15:15

r-devulap force-pushed the hwy-native-funcs branch from eb12f8e to 73efe07 Compare May 26, 2026 15:34

jshook reviewed May 26, 2026

Comment thread .github/workflows/run-bench.yml

jshook reviewed May 26, 2026

Comment thread jvector-native/src/main/native/third_party/highway

jshook approved these changes May 26, 2026

MarkWolters approved these changes May 26, 2026

Comment thread .github/workflows/run-bench.yml

Comment thread .github/workflows/run-bench.yml

Comment thread jvector-native/src/main/native/jextract_vector_simd.sh

Comment thread jvector-native/src/test/java/io/github/jbellis/jvector/vector/cnative/NativeSimdOpsTest.java

ashkrisk reviewed Jun 8, 2026

Comment thread jvector-native/src/main/native/jvector_cpu_features.h

Comment thread jvector-native/src/main/c/jextract_vector_simd.sh Outdated

Comment thread README.md Outdated

Comment thread jvector-native/src/main/native/jvector_simd.cpp Outdated

r-devulap force-pushed the hwy-native-funcs branch from 69f3d38 to 0fae13f Compare June 10, 2026 05:11

akash-shankaran reviewed Jun 12, 2026

Comment thread jvector-native/src/main/native/jvector_simd.h

This was referenced Jun 15, 2026

Wire calculatePartialSums to native SIMD via Panama FFI downcall #651

Closed

Use bulk writes instead of per-element writes in vector serialization #681

Merged

Release docs #677

Merged

Usage of imprecise fp-model=fast. #360

Closed

akash-shankaran reviewed Jun 18, 2026

r-devulap added 10 commits June 22, 2026 15:20

Always prefer native dp, l2 and cosine

397f712

Add new calculatePartialSelfSum

ced7519

Add Meson and Ninja installation to GitHub Actions workflows

6510a11

Add git submodule initialization to GitHub Actions workflows

4190499

Exclude .gitmodules from RAT checks

e381500

Remove recursive submodule init from unit tests workflow

18a90ce

r-devulap added 4 commits June 22, 2026 15:21

Fix Shuffle2301/Shuffle1032 comments: diagrams and semantics were swapped

e0106f5

Fix RAT excludes: add module-relative paths for build/ and third_party/

c2e5068

Use snake case consistently

00b5eef

Use --auto-install-deps instead of --auto-install-gcc

18b1b5d

r-devulap force-pushed the hwy-native-funcs branch from 0fae13f to 18b1b5d Compare June 22, 2026 15:23

tlwillke assigned r-devulap Jun 24, 2026

tlwillke added the "performance improvement" label Jun 24, 2026

Configure maven-clean-plugin to remove meson build artifacts

1edff51

r-devulap force-pushed the hwy-native-funcs branch from 2c8e07d to 1edff51 Compare June 25, 2026 04:15

r-devulap added 5 commits June 25, 2026 08:49

Mode meson build directory to jvector-native/target

80f3229

Revert "Configure maven-clean-plugin to remove meson build artifacts"

32a0565

This reverts commit 1edff51.

Rename jvector-native/src/main/c -> jvector-native/src/main/native

d940c55

ci: add meson/ninja install and submodule init to run-compaction workflow

398caeb

Adding release notes for PR #668

2053dcb

r-devulap requested review from ashkrisk and jshook June 26, 2026 15:19

Add Maven profiles for native builds: debug, debugoptimized, and release (default)

b0136cb

reta reviewed Jun 29, 2026

Comment thread jvector-native/src/main/native/README.md Outdated

r-devulap added 2 commits June 30, 2026 15:56

Update release notes and native module README to reflect new changes

e22aedd

ashkrisk reviewed Jul 1, 2026

r-devulap added 2 commits July 3, 2026 07:28

add jvector_simd_get_active_isa() API and dispatcher tests

63eecd9

r-devulap force-pushed the hwy-native-funcs branch from c4ae96e to 5812dc4 Compare July 3, 2026 07:38

ci: add avx2 and sse42 ISA cap scenarios to build-avx512 job (jdk24 only)

242280a

r-devulap force-pushed the hwy-native-funcs branch from 5eccdbe to 242280a Compare July 3, 2026 08:02

Run only jvector-tests on jdk 20

f1277d3

7 participants
