Type of issue
Potential Bug in SVE2 MATCH Learning Path
I was reviewing the learning path and think I found a bug when reading the source code for the search_generic and search_sve2match functions when the number of keys exceeds the VL variable.
Root cause
The key vector is built like this.
VL is the number of elements that fit in a scalable vector, e.g. 16 bytes for a 128-bit SVE2 implementation. The tmp buffer is filled with keys[i % nkeys] for i in [0, VL), then loaded into keyvec. If nkeys exceeds VL (e.g. 17+ keys on a 128-bit implementation), the extra keys are omitted from the tmp buffer and never compared in the vectorized section. Only the scalar tail loop correctly checks all keys.
uint8_t tmp[256];
for (size_t i = 0; i < VL; ++i) tmp[i] = keys[i % nkeys];
svuint8_t keyvec = svld1(pg, tmp);
The vectorized loop then uses this single keyvec for all comparisons:
for (; i + VL <= n; i += VL) {
svuint8_t block = svld1(pg, &hay[i]);
if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
}
Only the trailing scalar fallback correctly scans all keys:
for (; i < n; ++i)
for (size_t k = 0; k < nkeys; ++k)
if (hay[i] == keys[k]) return 1;
This may be intentional if the assumption is that nkeys will always be ≤ 16 (minimum SVL is 128 bits, giving 16 bytes), but it isn't documented. This likely wasn't caught because the microbenchmark has hardcoded the number of keys to be less that 16.
...
const uint8_t keys8[] = {0x13, 0x7F, 0xA5, 0xEE, 0x4C, 0x42, 0x01, 0x9B};
const uint16_t keys16[] = {0x1234, 0x7F7F, 0xA5A5, 0xEEEE, 0x4C4C, 0x4242};
Steps to reproduce
Build and run the following file (bug_demo.c). It uses 20 keys and a haystack containing only the 20th key — the generic scalar search finds it, but the SVE2 version misses it.
gcc -O2 -march=armv9-a+sve2 -o bug_demo bug_demo.c
./bug_demo
/*
* bug_demo.c — svmatch bug when nkeys > 16 (u8 segment size).
* Build: gcc -O2 -march=armv9-a+sve2 -o bug_demo bug_demo.c
*/
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
int search_generic_u8(const uint8_t *hay, size_t n,
const uint8_t *keys, size_t nkeys) {
for (size_t i = 0; i < n; ++i)
for (size_t k = 0; k < nkeys; ++k)
if (hay[i] == keys[k]) return 1;
return 0;
}
int search_sve2_match_u8(const uint8_t *hay, size_t n,
const uint8_t *keys, size_t nkeys) {
if (nkeys == 0) return 0;
const size_t VL = svcntb();
svbool_t pg = svptrue_b8();
uint8_t tmp[256];
for (size_t i = 0; i < VL; ++i) tmp[i] = keys[i % nkeys];
svuint8_t keyvec = svld1(pg, tmp);
size_t i = 0;
for (; i + VL <= n; i += VL) {
svuint8_t block = svld1(pg, &hay[i]);
if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
}
for (; i < n; ++i)
for (size_t k = 0; k < nkeys; ++k)
if (hay[i] == keys[k]) return 1;
return 0;
}
int main(void) {
/* 20 keys — exceeds the 16-byte segment size */
const uint8_t keys[20] = {
0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
0x20, 0x21, 0x22, 0x23
};
/* Haystack: all 0xFF except position 100 = 0x23 (key index 19) */
uint8_t hay[256];
memset(hay, 0xFF, sizeof(hay));
hay[100] = 0x23;
int r_generic = search_generic_u8(hay, 256, keys, 20);
int r_sve2 = search_sve2_match_u8(hay, 256, keys, 20);
printf("generic: %s\n", r_generic ? "found" : "not found");
printf("sve2: %s\n", r_sve2 ? "found" : "not found");
return (r_generic == r_sve2) ? 0 : 1;
}
Expected output:
generic: found
sve2: found
Actual output:
generic: found
sve2: not found
Potential Fix
If my assessment is correct, I can submit a PR to fix this?
I believe a simple solution could check if number of keys exceeds VL and create chunks. But there may be a more optimal way. Something like the function below.
int search_sve2_match_u8(const uint8_t *hay, size_t n, const uint8_t *keys,
size_t nkeys) {
if (nkeys == 0) return 0;
const size_t VL = svcntb();
svbool_t pg = svptrue_b8();
// Segment size for svmatch u8 is 16 (128-bit / 8-bit)
const size_t SEG = 16;
// Number of key vectors needed to cover all keys
size_t nchunks = (nkeys + SEG - 1) / SEG;
uint8_t tmp[256] __attribute__((aligned(64)));
size_t i = 0;
for (; i + VL <= n; i += VL) {
svuint8_t block = svld1(pg, &hay[i]);
for (size_t c = 0; c < nchunks; ++c) {
size_t base = c * SEG;
size_t chunk_len = (base + SEG <= nkeys) ? SEG : (nkeys - base);
for (size_t j = 0; j < VL; ++j)
tmp[j] = keys[base + (j % SEG) % chunk_len];
svuint8_t keyvec = svld1(pg, tmp);
if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
}
}
for (; i < n; ++i) {
uint8_t v = hay[i];
for (size_t k = 0; k < nkeys; ++k)
if (v == keys[k]) return 1;
}
return 0;
}
Benchmarking the above I was able to achieve the following on a AWS Graviton 4 instance (Neoverse V2) with 128-bit SVL. note: added scenario where we have 18 keys so nkeys > VL.
aystack length : 67108864 elements
Iterations : 5
Hit probability : 0.001000 (0.1000 % )
Average latency over 5 iterations (ns):
generic_u8 : 4229.40
sve2_u8 : 926.80
sve2_u8_unrolled : 310.60
speed-up (orig) : 4.56x
speed-up (unroll): 13.62x
generic_u8 (18 keys) : 1516.60
sve2_u8 (18 keys) : 1599.20
sve2_u8_unr (18 keys): 491.20
speed-up (orig) : 0.95x
speed-up (unroll): 3.09x
generic_u16 : 337.80
sve2_u16 : 1158.20
sve2_u16_unrolled: 374.00
speed-up (orig) : 0.29x
speed-up (unroll): 0.90x
Throughput (million items/second):
generic_u8 : 15867230.34 Mi/s
sve2_u8 : 72409218.82 Mi/s
sve2_u8_unrolled : 216062021.89 Mi/s
speed-up (orig) : 4.56x
speed-up (unroll): 13.62x
generic_u8 (18 keys) : 44249547.67 Mi/s
sve2_u8 (18 keys) : 41964022.01 Mi/s
sve2_u8_unr (18 keys): 136622280.13 Mi/s
speed-up (orig) : 0.95x
speed-up (unroll): 3.09x
generic_u16 : 198664487.86 Mi/s
sve2_u16 : 57942379.55 Mi/s
sve2_u16_unrolled: 179435465.24 Mi/s
speed-up (orig) : 0.29x
speed-up (unroll): 0.90x
Type of issue
Potential Bug in
SVE2 MATCHLearning PathI was reviewing the learning path and think I found a bug when reading the source code for the
search_genericandsearch_sve2matchfunctions when the number of keys exceeds theVLvariable.Root cause
The key vector is built like this.
VLis the number of elements that fit in a scalable vector, e.g. 16 bytes for a 128-bit SVE2 implementation. Thetmpbuffer is filled withkeys[i % nkeys]foriin[0, VL), then loaded intokeyvec. IfnkeysexceedsVL(e.g. 17+ keys on a 128-bit implementation), the extra keys are omitted from thetmpbuffer and never compared in the vectorized section. Only the scalar tail loop correctly checks all keys.The vectorized loop then uses this single
keyvecfor all comparisons:Only the trailing scalar fallback correctly scans all keys:
This may be intentional if the assumption is that
nkeyswill always be ≤ 16 (minimum SVL is 128 bits, giving 16 bytes), but it isn't documented. This likely wasn't caught because the microbenchmark has hardcoded the number of keys to be less that 16.Steps to reproduce
Build and run the following file (
bug_demo.c). It uses 20 keys and a haystack containing only the 20th key — the generic scalar search finds it, but the SVE2 version misses it.Expected output:
Actual output:
Potential Fix
If my assessment is correct, I can submit a PR to fix this?
I believe a simple solution could check if number of keys exceeds VL and create chunks. But there may be a more optimal way. Something like the function below.
Benchmarking the above I was able to achieve the following on a AWS Graviton 4 instance (Neoverse V2) with 128-bit SVL. note: added scenario where we have 18 keys so
nkeys>VL.