
Port Panama SIMD kernels to C++ using Google Highway #668

Open
r-devulap wants to merge 29 commits into main from hwy-native-funcs

Conversation

@r-devulap

Contributor

This PR replaces JVector's Panama Vector API-based SIMD kernels with native C++ implementations built on Google Highway, a portable SIMD library that compiles a single kernel source for multiple ISA targets (SSE42, AVX2, and AVX-512) and dispatches among them at runtime.

What changed

  • Introduce Google Highway as a git submodule.
  • jvector_simd.c is replaced by jvector_simd.cpp (ISA dispatch shim) and jvector_simd_kernels.cpp (all Highway kernel implementations), with a new meson.build driving multi-target compilation (adds meson as a build dependency).
  • All FP32, PQ, and NVQ kernels are ported to Highway, with a new calculatePartialSelfSum kernel.
  • Port the optimizations from #651 (Wire calculatePartialSums to native SIMD via Panama FFI downcall) to Google Highway.
  • NativeSimdOps.java is regenerated via jextract to match the updated C API.
  • NativeVectorUtilSupport is updated, and the native kernels are now unconditionally preferred over any Panama fallback for dot product, L2, and cosine distance.

The README file jvector-native/src/main/c/README.md is a good starting point before reviewing the code.

@github-actions

github-actions Bot commented May 26, 2026

Contributor

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

Comment thread .github/workflows/run-bench.yml
Comment thread jvector-native/src/main/native/third_party/highway

@jshook jshook left a comment

Contributor

Looks ok to me.

@MarkWolters MarkWolters left a comment

Contributor

I think it looks good. I did also use Bob to do a review and double check myself and it had this comment (I understand what it is saying but I will leave it to your discretion if it is correct):

jvector_simd_kernels.cpp lines 571 and 943

Both calculate_partial_sums_f32 and calculate_partial_sums_self_magnitude_f32 have a size==2 fast path that uses hn::Shuffle2301 for the horizontal add. This is wrong.

For size==2, centroids are interleaved as [c0[0], c0[1], c1[0], c1[1], ...]. The goal is to sum adjacent pairs within each centroid. Shuffle2301 on [a, b, c, d] produces [c, d, a, b] (swaps 64-bit halves), so score + Shuffle2301(score) mixes elements from different centroids. The correct shuffle is Shuffle1032, which swaps adjacent 32-bit elements: [b, a, d, c], so score + Shuffle1032(score) gives [s0+s1, s0+s1, s2+s3, s2+s3], which is the correct per-centroid sum.

Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index past all centroids.

Fix:
// Line 571 and 943: change
hn::Shuffle2301(score) → hn::Shuffle1032(score)
hn::Shuffle2301(sum) → hn::Shuffle1032(sum)

Comment thread .github/workflows/run-bench.yml
Comment thread .github/workflows/run-bench.yml
Comment thread jvector-native/src/main/native/jextract_vector_simd.sh
@r-devulap

Contributor Author

Impact: Any PQ index with size==2 subspaces (e.g., 128-dim vectors with 64 subspaces) silently produces wrong search scores. The scalar fallback is never reached because the fast path advances the loop index past all centroids.

Fix: // Line 571 and 943: change hn::Shuffle2301(score) → hn::Shuffle1032(score) hn::Shuffle2301(sum) → hn::Shuffle1032(sum)

Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing.

V: {u,i,f}{32}
V Shuffle2301(V): returns blocks with 32-bit halves swapped inside 64-bit halves.
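
For readers following the Shuffle2301 versus Shuffle1032 question, the lane semantics quoted above can be modelled in plain scalar C++ without Highway. This is only a sketch: `Block`, `shuffle2301_model`, `shuffle1032_model`, `add`, and `pairwise_sum` are hypothetical helpers invented for illustration, not Highway APIs and not the PR's code. It assumes a single 128-bit block of four float lanes, with index 0 as the lowest lane.

```cpp
#include <array>

// Hypothetical scalar model of one 128-bit block holding four 32-bit lanes.
// Index 0 is the lowest lane. None of these helpers are Highway functions.
using Block = std::array<float, 4>;

// Models hn::Shuffle2301: 32-bit halves swapped inside each 64-bit half,
// so adjacent lanes trade places: [a, b, c, d] -> [b, a, d, c].
inline Block shuffle2301_model(const Block& v) {
    return {v[1], v[0], v[3], v[2]};
}

// Models hn::Shuffle1032: the two 64-bit halves of the block swapped:
// [a, b, c, d] -> [c, d, a, b].
inline Block shuffle1032_model(const Block& v) {
    return {v[2], v[3], v[0], v[1]};
}

// Lane-wise addition of two blocks.
inline Block add(const Block& x, const Block& y) {
    return {x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]};
}

// Horizontal add for size == 2 subspaces: every lane ends up holding the
// sum of its own adjacent pair, so the two centroids never mix.
inline Block pairwise_sum(const Block& score) {
    return add(score, shuffle2301_model(score));
}
```

With the input `[1, 2, 10, 20]`, `pairwise_sum` yields `[3, 3, 30, 30]`, one sum per adjacent pair. Adding the Shuffle1032 model instead yields `[11, 22, 11, 22]`, which combines lanes belonging to different pairs.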

@r-devulap

Contributor Author

Not sure why Bob thinks that, but from Google Highway documentation, Shuffle2301 is the right instruction. Modifying it to Shuffle1032 results in our unit tests failing.

I believe the confusion came from incorrect comments in the C++ file. Bob likely relied on those rather than the official Google Highway documentation to interpret the intrinsic. I've since corrected the comments here: 9a52b4d

Comment thread jvector-native/src/main/native/jvector_cpu_features.h
Comment thread jvector-native/src/main/c/jextract_vector_simd.sh Outdated
Comment thread README.md Outdated
Comment thread jvector-native/src/main/native/jvector_simd.cpp Outdated
Comment thread jvector-native/src/main/native/jvector_simd.h
Comment thread jvector-native/src/main/native/jvector_simd_kernels.cpp
Comment thread jvector-native/src/main/native/jvector_simd_kernels.cpp
Comment thread .gitmodules
Comment thread README.md
Comment thread jvector-native/src/main/native/README.md
Comment thread jvector-native/src/main/native/meson.build
Comment thread jvector-native/src/main/native/meson.build
Comment thread jvector-native/src/main/native/README.md
r-devulap added 10 commits June 22, 2026 15:20
- Replace jvector_simd.c + jvector_simd_check.c with C++ using Highway
- Add jvector_simd.cpp (JNI dispatch layer) and jvector_simd_kernels.cpp/h
  (all SIMD kernel implementations: FP32, PQ, NVQ)
- Add meson.build for building with Highway targets
- Add Google Highway as git submodule (third_party/highway)
- Add supporting headers: jvector_cpuFeatures.h, assertHwyTargets.h
- Regenerate NativeSimdOps.java JNI bindings from new jvector_simd.h
- Add __fsid_t.java and max_align_t.java (jextract-generated stubs)
- Remove AVX-512 check from NativeVectorizationProvider; replace with
  x86_64 architecture guard (Highway selects best ISA at runtime)
- Remove AVX-512 test from NativeSimdOpsTest
- Update jextract_vector_simd.sh for new header layout
- Update README with Highway build instructions
Wire up FP32 SIMD kernels in NativeVectorUtilSupport:
- dotProduct(v1, v2) and dotProduct(v1, offset, v2, offset, len)
  via dot_product_f32 (dispatches to best ISA via Highway)
- cosine(v1, v2) and cosine(v1, offset, v2, offset, len)
  via cosine_f32
- squareDistance(v1, v2) and squareDistance(v1, offset, v2, offset, len)
  via euclidean_f32
- addInPlace(v1, v2) and addInPlace(v1, scalar) via add_in_place_f32 /
  add_scalar_in_place_f32
- subInPlace(v1, v2) and subInPlace(v1, scalar) via sub_in_place_f32 /
  sub_scalar_in_place_f32
- max(v) via max_f32
- minInPlace(v1, v2) via min_in_place_f32

FP32 distance kernels are gated on length >= 128 (below that threshold
the Panama vector fallback is used).
Wire up PQ SIMD kernels in NativeVectorUtilSupport:
- assembleAndSum: switch to assemble_and_sum_f32 (was _512 variant)
- assembleAndSumPQ: replace Java fallback with assemble_and_sum_pq_f32
  native call; validates ordinal offsets are 0 via assertions
- pqDecodedCosineSimilarity: switch to pq_decoded_cosine_similarity_f32
  (was _512 variant); passes length as long
- calculatePartialSums (new): dispatches to calculate_partial_sums_euclidean_f32
  or calculate_partial_sums_dot_f32 based on VectorSimilarityFunction
Wire up NVQ (Non-uniform Vector Quantization) SIMD kernels in
NativeVectorUtilSupport:
- nvqShuffleQueryInPlace8bit: pre-shuffle query vector for fast-lane
  dequantization in scoring kernels
- nvqQuantize8bit: quantize float vector to 8-bit NVQ representation
- nvqLoss / nvqUniformLoss: compute quantization loss for parameter tuning
- nvqSquareL2Distance8bit: L2 distance between float query and 8-bit
  quantized vector
- nvqDotProduct8bit: dot product between float query and 8-bit quantized
  vector
- nvqCosine8bit: cosine similarity; native returns packed int64 (low 32
  bits = dot sum, high 32 bits = quantized magnitude), unpacked to float[]
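
The packed return value described for nvqCosine8bit can be illustrated with a small round-trip sketch. It assumes that each 32-bit half carries the raw IEEE-754 bit pattern of a float, which the commit message does not state explicitly. `pack_two_floats` and `unpack_two_floats` are hypothetical names, not functions from the PR.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: each 32-bit half of the int64 is the raw bit pattern of a float.
// Packs `low` into bits 0..31 and `high` into bits 32..63.
inline std::int64_t pack_two_floats(float low, float high) {
    std::uint32_t lo_bits = 0;
    std::uint32_t hi_bits = 0;
    std::memcpy(&lo_bits, &low, sizeof lo_bits);    // bit-copy, no value conversion
    std::memcpy(&hi_bits, &high, sizeof hi_bits);
    const std::uint64_t packed =
        (static_cast<std::uint64_t>(hi_bits) << 32) | lo_bits;
    std::int64_t result = 0;
    std::memcpy(&result, &packed, sizeof result);   // reinterpret as signed
    return result;
}

// Splits the packed value back into two floats: out[0] = low half,
// out[1] = high half.
inline void unpack_two_floats(std::int64_t packed, float out[2]) {
    const std::uint64_t bits = static_cast<std::uint64_t>(packed);
    const std::uint32_t lo_bits = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
    const std::uint32_t hi_bits = static_cast<std::uint32_t>(bits >> 32);
    std::memcpy(&out[0], &lo_bits, sizeof lo_bits);
    std::memcpy(&out[1], &hi_bits, sizeof hi_bits);
}
```

Returning both values in one 64-bit integer lets a native call hand back two results without an output buffer. On the Java side the equivalent unpacking would presumably use `Float.intBitsToFloat` on each half.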
@tlwillke tlwillke added the performance improvement label Jun 24, 2026
@r-devulap

Contributor Author

@akash-shankaran Thanks for the review! I resolved all the conversations for now. Feel free to reopen any if you have more questions/concerns.

@r-devulap r-devulap requested review from ashkrisk and jshook June 26, 2026 15:19
Comment thread jvector-native/src/main/native/README.md Outdated
…lerplate

Add jvector_simd_kernel_list.h with a single JVECTOR_SIMD_KERNEL_LIST
X-Macro table that serves as the single source of truth for all SIMD
kernel signatures. The macro auto-generates:
  - Namespace declarations in jvector_simd_kernels.h
  - KernelVTable struct members in jvector_simd.cpp
  - Vtable initializers for AVX3, AVX2, and SSE42
  - Public API wrapper functions and their C declarations in jvector_simd.h

Net result: ~390 lines removed. Adding a new kernel now requires only a
single KERNEL_ENTRY line in the list header.
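
The X-Macro approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the table name `DEMO_KERNEL_LIST`, its four-field row layout, the `demo_scalar` namespace, and the `demo_` wrapper prefix are all hypothetical. Plain scalar loops stand in for the real SIMD kernels.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical table: X(name, return type, parameter list, call arguments).
#define DEMO_KERNEL_LIST(X)                                      \
    X(sum_f32, float, (const float* a, std::size_t n), (a, n))   \
    X(max_f32, float, (const float* a, std::size_t n), (a, n))

// One "tier" namespace; scalar bodies stand in for SIMD kernels.
namespace demo_scalar {
inline float sum_f32(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}
inline float max_f32(const float* a, std::size_t n) {
    assert(n > 0);  // max of an empty range is undefined in this sketch
    float m = a[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (a[i] > m) m = a[i];
    }
    return m;
}
}  // namespace demo_scalar

// Expansion 1: one function-pointer member per table row.
struct DemoKernelVTable {
#define X(name, ret, params, args) ret (*name) params;
    DEMO_KERNEL_LIST(X)
#undef X
};

// Expansion 2: a vtable initializer for the tier, in table order.
static const DemoKernelVTable kDemoScalarTable = {
#define X(name, ret, params, args) &demo_scalar::name,
    DEMO_KERNEL_LIST(X)
#undef X
};

// Expansion 3: public wrappers that forward through the active vtable.
static const DemoKernelVTable* g_demo_active = &kDemoScalarTable;
#define X(name, ret, params, args) \
    inline ret demo_##name params { return g_demo_active->name args; }
DEMO_KERNEL_LIST(X)
#undef X
```

Adding a third row to `DEMO_KERNEL_LIST` (plus its implementation in the tier namespace) would extend the struct, the initializer, and the wrappers together, which is the maintenance benefit the commit message describes.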
@r-devulap

r-devulap commented Jun 30, 2026

Contributor Author

Updated PR with these commits:

| SHA | Description |
| --- | --- |
| b0136cb | Adds Maven build profiles (debug, debugoptimized, release) for the native module. These are required to debug the native module using gdb or CLion. |
| 2075f2f | Refactors native code to use an X-Macro kernel list, eliminating repetitive SIMD boilerplate. Makes adding new kernels for AVX3/AVX2/SSE42 targets a simple three-step process. |
| e22aedd | Updates release notes and the native module README to document recent changes, including a new step-by-step checklist in the README for adding SIMD kernels to the native module. |
| 8cf3162 | Adds AVX3_DL (Ice Lake) and AVX3_SPR (Sapphire Rapids) ISA tiers to the native module for finer-grained CPU dispatch, bringing x86 ISA support up to date through Intel GNR. This also makes it easier for developers to add specialized SIMD kernels going forward. |


HWY_FLATTEN float my_new_op_f32(const float* HWY_RESTRICT a, size_t length)
{
#if HWY_STATIC_TARGET == HWY_AVX3

Contributor

Suggested change
- #if HWY_STATIC_TARGET == HWY_AVX3
+ #if HWY_STATIC_TARGET >= HWY_AVX3

float result = _mm512_reduce_add_ps(acc);
for (; i < length; ++i) result += a[i];
return result;
#else

@ashkrisk ashkrisk Jul 1, 2026

Contributor

I wonder if there's a better way to do this than using #if... #else within a function body. In particular, specialisations which use completely different algorithms probably deserve their own function.

IDE tools also don't play that well with #if... #else blocks, and things like this are possible.

```cpp
} // namespace MY_NEW_TIER
```

### Step 6 — Add a compile-time assertion (`assert_hwy_targets.h`)

Contributor

Adding a new SIMD architecture seems like a fairly involved process. Does it make sense to use an X-macro style approach here as well, or is that just adding unnecessary complexity?

Comment on lines +361 to +364
within an existing tier. A tier is appropriate when a new ISA extension (e.g.
native FP16, BF16, or AMX arithmetic) requires a different compiler target than
any existing tier, so the kernels cannot share a compilation unit with
`jvector_simd_kernels.cpp`.

Contributor

Is this process any different if adding support for a completely different architecture (say, NEON) as opposed to adding an x86_64 extension? Would be useful to update this section accordingly.

Comment thread README.md
`jvector-native/src/main/native/jextract_vector_simd.sh`. To build and auto-install `g++` on Ubuntu:

```bash
./jvector-native/src/main/native/jextract_vector_simd.sh --auto-install-g++
```

Contributor

This and the paragraph above it need to be updated to use --auto-install-deps.

r-devulap added 2 commits July 3, 2026 07:28
Add two new Highway ISA tiers above the existing AVX3 baseline:

- AVX3_DL (Ice Lake): compiled with -march=icelake-server, covering the
  full ICX extension set (VNNI, VBMI, VBMI2, IFMA, BITALG, VPOPCNTDQ,
  GFNI, VAES, VPCLMULQDQ). Source file: jvector_avx3_dl_kernels.cpp.

- AVX3_SPR (Sapphire Rapids): compiled with -march=sapphirerapids, adding
  AVX512FP16 and AVX512BF16 on top of the ICX set. Source file:
  jvector_avx3_spr_kernels.cpp.

Both tiers currently inherit all vtable slots from AVX3 unchanged;
their dedicated source files are ready for ISA-specific kernel overrides.

Infrastructure changes:
- jvector_cpu_features.h: add composite CpuFeature::AVX3 (100),
  AVX3_DL (101), AVX3_SPR (102) flags computed from raw CPUID bits in
  populate_cpu_features(), simplifying dispatch to a single has() test
  per tier. Values start at 100 to leave room for future raw features.
- jvector_simd.cpp: add MaxIsa::AVX3_DL / AVX3_SPR enum values,
  read_max_isa() mappings ("avx3_dl", "avx3_spr"), vtables, and
  CPUID dispatch gates.
- jvector_simd_kernels.h: add DECLARE_SIMD_KERNELS for both new namespaces.
- assert_hwy_targets.h: add JV_REQUIRE_HWY_AVX3_DL / AVX3_SPR guards.
- meson.build: register the two new static_library() compilation units.
- README.md: update architecture diagram, ISA cap env-var docs, and both
  the "Adding a new kernel" and "Adding a new ISA tier" how-to sections.
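
The ISA cap described in this commit can be pictured with a rough sketch. Everything here is hypothetical: `DemoIsa`, `parse_isa_cap`, and `select_isa` are illustrative names rather than the PR's `MaxIsa` or `read_max_isa()`. Only the `"avx3_dl"` and `"avx3_spr"` strings come from the commit message; the `"avx3"`, `"avx2"`, and `"sse42"` spellings and the "unknown means no cap" behaviour are assumptions.

```cpp
#include <algorithm>
#include <string>

// Hypothetical tier ordering, lowest capability first.
enum class DemoIsa { SSE42 = 0, AVX2, AVX3, AVX3_DL, AVX3_SPR };

// Maps a cap string (for example the value of an environment variable)
// to a tier. Assumption: an empty or unrecognized string means "no cap",
// which this sketch represents as the highest tier.
inline DemoIsa parse_isa_cap(const std::string& value) {
    if (value == "sse42") return DemoIsa::SSE42;
    if (value == "avx2") return DemoIsa::AVX2;
    if (value == "avx3") return DemoIsa::AVX3;
    if (value == "avx3_dl") return DemoIsa::AVX3_DL;
    return DemoIsa::AVX3_SPR;
}

// The tier actually used is the lower of what the CPU supports and what
// the cap allows, so a cap can only ever downgrade the selection.
inline DemoIsa select_isa(DemoIsa detected, DemoIsa cap) {
    return std::min(detected, cap);
}
```

For example, a CPU detected as `AVX3_SPR` with a cap of `"avx2"` would select `AVX2`, while a cap of `"avx3_dl"` on an `AVX2`-only CPU leaves the selection at `AVX2`.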
@r-devulap

r-devulap commented Jul 3, 2026

Contributor Author

Added two commits related to unit tests and CI coverage:

  • 242280a: Expanded the build-avx512 CI matrix with avx2 and sse42 JVECTOR_MAX_ISA scenarios on JDK 24 only. Also updated it to run all tests, not just the jvector-tests module.

  • 63eecd9: Added the native jvector_simd_get_active_isa() API and Java bindings, along with a new DispatcherCpuFlagsTest to verify JVECTOR_MAX_ISA dispatch behavior.


Labels

performance improvement A contribution that aims to improve library performance, possibly along with functionality.


7 participants