Skip to content

Potential Bug in SVE_MATCH learning path #3302

@kieranhejmadi01

Description

@kieranhejmadi01

Type of issue

  • Instructions in the learning path do not work.
  • Spelling error in the learning path.
  • Instructions in the learning path are hard to follow.
  • Instructions to contribute a new learning path are not clear.
  • Other

Potential Bug in SVE2 MATCH Learning Path

I was reviewing the learning path and think I found a bug when reading the source code for the search_generic and search_sve2match functions when the number of keys exceeds the VL variable.

Root cause

The key vector is built like this.

VL is the number of elements that fit in a scalable vector, e.g. 16 bytes for a 128-bit SVE2 implementation. The tmp buffer is filled with keys[i % nkeys] for i in [0, VL), then loaded into keyvec. If nkeys exceeds VL (e.g. 17+ keys on a 128-bit implementation), the extra keys are omitted from the tmp buffer and never compared in the vectorized section. Only the scalar tail loop correctly checks all keys.

uint8_t tmp[256];
for (size_t i = 0; i < VL; ++i) tmp[i] = keys[i % nkeys];
svuint8_t keyvec = svld1(pg, tmp);

The vectorized loop then uses this single keyvec for all comparisons:

for (; i + VL <= n; i += VL) {
    svuint8_t block = svld1(pg, &hay[i]);
    if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
}

Only the trailing scalar fallback correctly scans all keys:

for (; i < n; ++i)
    for (size_t k = 0; k < nkeys; ++k)
        if (hay[i] == keys[k]) return 1;

This may be intentional if the assumption is that nkeys will always be ≤ 16 (minimum SVL is 128 bits, giving 16 bytes), but it isn't documented. This likely wasn't caught because the microbenchmark has hardcoded the number of keys to be less that 16.

...
  const uint8_t keys8[]   = {0x13, 0x7F, 0xA5, 0xEE, 0x4C, 0x42, 0x01, 0x9B};
  const uint16_t keys16[] = {0x1234, 0x7F7F, 0xA5A5, 0xEEEE, 0x4C4C, 0x4242};

Steps to reproduce

Build and run the following file (bug_demo.c). It uses 20 keys and a haystack containing only the 20th key — the generic scalar search finds it, but the SVE2 version misses it.

gcc -O2 -march=armv9-a+sve2 -o bug_demo bug_demo.c
./bug_demo
/*
 * bug_demo.c — svmatch bug when nkeys > 16 (u8 segment size).
 * Build: gcc -O2 -march=armv9-a+sve2 -o bug_demo bug_demo.c
 */

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int search_generic_u8(const uint8_t *hay, size_t n,
                      const uint8_t *keys, size_t nkeys) {
  for (size_t i = 0; i < n; ++i)
    for (size_t k = 0; k < nkeys; ++k)
      if (hay[i] == keys[k]) return 1;
  return 0;
}

int search_sve2_match_u8(const uint8_t *hay, size_t n,
                         const uint8_t *keys, size_t nkeys) {
  if (nkeys == 0) return 0;
  const size_t VL = svcntb();
  svbool_t pg = svptrue_b8();
  uint8_t tmp[256];
  for (size_t i = 0; i < VL; ++i) tmp[i] = keys[i % nkeys];
  svuint8_t keyvec = svld1(pg, tmp);
  size_t i = 0;
  for (; i + VL <= n; i += VL) {
    svuint8_t block = svld1(pg, &hay[i]);
    if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
  }
  for (; i < n; ++i)
    for (size_t k = 0; k < nkeys; ++k)
      if (hay[i] == keys[k]) return 1;
  return 0;
}

int main(void) {
  /* 20 keys — exceeds the 16-byte segment size */
  const uint8_t keys[20] = {
    0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
    0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
    0x20, 0x21, 0x22, 0x23
  };

  /* Haystack: all 0xFF except position 100 = 0x23 (key index 19) */
  uint8_t hay[256];
  memset(hay, 0xFF, sizeof(hay));
  hay[100] = 0x23;

  int r_generic = search_generic_u8(hay, 256, keys, 20);
  int r_sve2    = search_sve2_match_u8(hay, 256, keys, 20);

  printf("generic: %s\n", r_generic ? "found" : "not found");
  printf("sve2:    %s\n", r_sve2    ? "found" : "not found");

  return (r_generic == r_sve2) ? 0 : 1;
}

Expected output:

generic: found
sve2:    found

Actual output:

generic: found
sve2:    not found

Potential Fix

If my assessment is correct, I can submit a PR to fix this?

I believe a simple solution could check if number of keys exceeds VL and create chunks. But there may be a more optimal way. Something like the function below.

int search_sve2_match_u8(const uint8_t *hay, size_t n, const uint8_t *keys,
                         size_t nkeys) {
  if (nkeys == 0) return 0;
  const size_t VL = svcntb();
  svbool_t pg = svptrue_b8();

  // Segment size for svmatch u8 is 16 (128-bit / 8-bit)
  const size_t SEG = 16;

  // Number of key vectors needed to cover all keys
  size_t nchunks = (nkeys + SEG - 1) / SEG;
  uint8_t tmp[256] __attribute__((aligned(64)));

  size_t i = 0;
  for (; i + VL <= n; i += VL) {
    svuint8_t block = svld1(pg, &hay[i]);
    for (size_t c = 0; c < nchunks; ++c) {
      size_t base = c * SEG;
      size_t chunk_len = (base + SEG <= nkeys) ? SEG : (nkeys - base);
      for (size_t j = 0; j < VL; ++j)
        tmp[j] = keys[base + (j % SEG) % chunk_len];
      svuint8_t keyvec = svld1(pg, tmp);
      if (svptest_any(pg, svmatch_u8(pg, block, keyvec))) return 1;
    }
  }
  for (; i < n; ++i) {
    uint8_t v = hay[i];
    for (size_t k = 0; k < nkeys; ++k)
      if (v == keys[k]) return 1;
  }
  return 0;
}

Benchmarking the above I was able to achieve the following on a AWS Graviton 4 instance (Neoverse V2) with 128-bit SVL. note: added scenario where we have 18 keys so nkeys > VL.

aystack length : 67108864 elements
Iterations      : 5
Hit probability : 0.001000 (0.1000 % )

Average latency over 5 iterations (ns):
  generic_u8       : 4229.40
  sve2_u8          : 926.80
  sve2_u8_unrolled : 310.60
  speed-up (orig)  : 4.56x
  speed-up (unroll): 13.62x

  generic_u8 (18 keys) : 1516.60
  sve2_u8 (18 keys)    : 1599.20
  sve2_u8_unr (18 keys): 491.20
  speed-up (orig)  : 0.95x
  speed-up (unroll): 3.09x

  generic_u16      : 337.80
  sve2_u16         : 1158.20
  sve2_u16_unrolled: 374.00
  speed-up (orig)  : 0.29x
  speed-up (unroll): 0.90x

Throughput (million items/second):
  generic_u8       : 15867230.34 Mi/s
  sve2_u8          : 72409218.82 Mi/s
  sve2_u8_unrolled : 216062021.89 Mi/s
  speed-up (orig)  : 4.56x
  speed-up (unroll): 13.62x

  generic_u8 (18 keys) : 44249547.67 Mi/s
  sve2_u8 (18 keys)    : 41964022.01 Mi/s
  sve2_u8_unr (18 keys): 136622280.13 Mi/s
  speed-up (orig)  : 0.95x
  speed-up (unroll): 3.09x

  generic_u16      : 198664487.86 Mi/s
  sve2_u16         : 57942379.55 Mi/s
  sve2_u16_unrolled: 179435465.24 Mi/s
  speed-up (orig)  : 0.29x
  speed-up (unroll): 0.90x

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions