Potential Bug in SVE_MATCH learning path

**Type of issue**
- [ ] Instructions in the learning path do not work.
- [ ] Spelling error in the learning path.
- [ ] Instructions in the learning path are hard to follow.
- [ ] Instructions to contribute a new learning path are not clear.
- [x] Other

# Potential Bug in `SVE2 MATCH` Learning Path

I was reviewing the learning path and think I found a bug when reading the source code for the `search_generic` and `search_sve2match` functions when the number of keys exceeds the `VL` variable.

## Root cause

The key vector is built like this.

`VL` is the number of elements that fit in a scalable vector, e.g. 16 bytes for a 128-bit SVE2 implementation. The `tmp` buffer is filled with `keys[i % nkeys]` for `i` in `[0, VL)`, then loaded into `keyvec`. If `nkeys` exceeds `VL` (e.g. 17+ keys on a 128-bit implementation), the extra keys are omitted from the `tmp` buffer and never compared in the vectorized section. Only the scalar tail loop correctly checks all keys.
```c
uint8_t tmp[256];
for (size_t i = 0; i < VL; ++i) tmp[i] = keys[i % nkeys];
svuint8_t keyvec = svld1(pg, tmp);
```

The vectorized loop then uses this single `keyvec` for all comparisons:

```c
for (; i + VL <= n; i += VL) {
    svuint8_t block = svld1(pg, &hay[i]);
    if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
}
```

Only the trailing scalar fallback correctly scans all keys:

```c
for (; i < n; ++i)
    for (size_t k = 0; k < nkeys; ++k)
        if (hay[i] == keys[k]) return 1;
```

This may be intentional if the assumption is that `nkeys` will always be ≤ 16 (minimum SVL is 128 bits, giving 16 bytes), but it isn't documented. This likely wasn't caught because the microbenchmark has hardcoded the number of keys to be less that 16. 

```c
...
  const uint8_t keys8[]   = {0x13, 0x7F, 0xA5, 0xEE, 0x4C, 0x42, 0x01, 0x9B};
  const uint16_t keys16[] = {0x1234, 0x7F7F, 0xA5A5, 0xEEEE, 0x4C4C, 0x4242};
```

## Steps to reproduce

Build and run the following file (`bug_demo.c`). It uses 20 keys and a haystack containing only the 20th key — the generic scalar search finds it, but the SVE2 version misses it.

```bash
gcc -O2 -march=armv9-a+sve2 -o bug_demo bug_demo.c
./bug_demo
```

```c
/*
 * bug_demo.c — svmatch bug when nkeys > 16 (u8 segment size).
 * Build: gcc -O2 -march=armv9-a+sve2 -o bug_demo bug_demo.c
 */

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int search_generic_u8(const uint8_t *hay, size_t n,
                      const uint8_t *keys, size_t nkeys) {
  for (size_t i = 0; i < n; ++i)
    for (size_t k = 0; k < nkeys; ++k)
      if (hay[i] == keys[k]) return 1;
  return 0;
}

int search_sve2_match_u8(const uint8_t *hay, size_t n,
                         const uint8_t *keys, size_t nkeys) {
  if (nkeys == 0) return 0;
  const size_t VL = svcntb();
  svbool_t pg = svptrue_b8();
  uint8_t tmp[256];
  for (size_t i = 0; i < VL; ++i) tmp[i] = keys[i % nkeys];
  svuint8_t keyvec = svld1(pg, tmp);
  size_t i = 0;
  for (; i + VL <= n; i += VL) {
    svuint8_t block = svld1(pg, &hay[i]);
    if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
  }
  for (; i < n; ++i)
    for (size_t k = 0; k < nkeys; ++k)
      if (hay[i] == keys[k]) return 1;
  return 0;
}

int main(void) {
  /* 20 keys — exceeds the 16-byte segment size */
  const uint8_t keys[20] = {
    0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
    0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
    0x20, 0x21, 0x22, 0x23
  };

  /* Haystack: all 0xFF except position 100 = 0x23 (key index 19) */
  uint8_t hay[256];
  memset(hay, 0xFF, sizeof(hay));
  hay[100] = 0x23;

  int r_generic = search_generic_u8(hay, 256, keys, 20);
  int r_sve2    = search_sve2_match_u8(hay, 256, keys, 20);

  printf("generic: %s\n", r_generic ? "found" : "not found");
  printf("sve2:    %s\n", r_sve2    ? "found" : "not found");

  return (r_generic == r_sve2) ? 0 : 1;
}
```

Expected output:
```
generic: found
sve2:    found
```

Actual output:
```
generic: found
sve2:    not found
```

## Potential Fix

If my assessment is correct, I can submit a PR to fix this?

I believe a simple solution could check if number of keys exceeds VL and create chunks. But there may be a more optimal way. Something like the function below. 

```c
int search_sve2_match_u8(const uint8_t *hay, size_t n, const uint8_t *keys,
                         size_t nkeys) {
  if (nkeys == 0) return 0;
  const size_t VL = svcntb();
  svbool_t pg = svptrue_b8();

  // Segment size for svmatch u8 is 16 (128-bit / 8-bit)
  const size_t SEG = 16;

  // Number of key vectors needed to cover all keys
  size_t nchunks = (nkeys + SEG - 1) / SEG;
  uint8_t tmp[256] __attribute__((aligned(64)));

  size_t i = 0;
  for (; i + VL <= n; i += VL) {
    svuint8_t block = svld1(pg, &hay[i]);
    for (size_t c = 0; c < nchunks; ++c) {
      size_t base = c * SEG;
      size_t chunk_len = (base + SEG <= nkeys) ? SEG : (nkeys - base);
      for (size_t j = 0; j < VL; ++j)
        tmp[j] = keys[base + (j % SEG) % chunk_len];
      svuint8_t keyvec = svld1(pg, tmp);
      if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
    }
  }
  for (; i < n; ++i) {
    uint8_t v = hay[i];
    for (size_t k = 0; k < nkeys; ++k)
      if (v == keys[k]) return 1;
  }
  return 0;
}

```
Benchmarking the above I was able to achieve the following on a AWS Graviton 4 instance (Neoverse V2) with 128-bit SVL. **note:** added scenario where we have 18 keys so `nkeys` > `VL`.

```output
aystack length : 67108864 elements
Iterations      : 5
Hit probability : 0.001000 (0.1000 % )

Average latency over 5 iterations (ns):
  generic_u8       : 4229.40
  sve2_u8          : 926.80
  sve2_u8_unrolled : 310.60
  speed-up (orig)  : 4.56x
  speed-up (unroll): 13.62x

  generic_u8 (18 keys) : 1516.60
  sve2_u8 (18 keys)    : 1599.20
  sve2_u8_unr (18 keys): 491.20
  speed-up (orig)  : 0.95x
  speed-up (unroll): 3.09x

  generic_u16      : 337.80
  sve2_u16         : 1158.20
  sve2_u16_unrolled: 374.00
  speed-up (orig)  : 0.29x
  speed-up (unroll): 0.90x

Throughput (million items/second):
  generic_u8       : 15867230.34 Mi/s
  sve2_u8          : 72409218.82 Mi/s
  sve2_u8_unrolled : 216062021.89 Mi/s
  speed-up (orig)  : 4.56x
  speed-up (unroll): 13.62x

  generic_u8 (18 keys) : 44249547.67 Mi/s
  sve2_u8 (18 keys)    : 41964022.01 Mi/s
  sve2_u8_unr (18 keys): 136622280.13 Mi/s
  speed-up (orig)  : 0.95x
  speed-up (unroll): 3.09x

  generic_u16      : 198664487.86 Mi/s
  sve2_u16         : 57942379.55 Mi/s
  sve2_u16_unrolled: 179435465.24 Mi/s
  speed-up (orig)  : 0.29x
  speed-up (unroll): 0.90x
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential Bug in SVE_MATCH learning path #3302

Potential Bug in `SVE2 MATCH` Learning Path

Root cause

Steps to reproduce

Potential Fix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Potential Bug in SVE_MATCH learning path #3302

Description

Potential Bug in SVE2 MATCH Learning Path

Root cause

Steps to reproduce

Potential Fix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Potential Bug in `SVE2 MATCH` Learning Path