
[bug fix] adler32: reduce pair mod BASE after handle_tail on aarch64 NEON #504

Merged
folkertdev merged 1 commit into trifectatechfoundation:main from shenxiul:fix/aarch64-neon-adler32-overflow
May 4, 2026

Conversation

@shenxiul
Contributor

aarch64 NEON accum32 integer overflow with non-aligned before and non-zero seed

Crate / repo: zlib-rs (trifectatechfoundation/zlib-rs)
File: zlib-rs/src/adler32/neon.rs
Architecture: aarch64 (NEON) only. Other backends (AVX2, AVX-512, scalar
generic, WASM) are not affected.
Confirmed reproducing on: 0.5.2, 0.5.5, 0.6.3, and current trunk
(SHA e51e62b2d1131e30576080966c36b5b5a2abcc56, dated 2026-04-20).
The code path is unchanged across those releases, so earlier versions that
shipped the same adler32::neon::adler32_neon_internal (any version that has
handle_tail(pair, before) followed directly by the SIMD chunk loop) are
expected to be affected as well.

Reproducer (60 lines, cargo run)

Cargo.toml:

[package]
name = "zlib_rs_neon_repro"
version = "0.0.0"
edition = "2021"

[dependencies]
# Pin to whichever published version you want to test.
zlib-rs = "0.6.3"

src/main.rs:

//! Minimal reproducer for a bug in zlib-rs's aarch64 NEON Adler-32
//! implementation. Build and run on an aarch64 machine with NEON
//! (any modern ARM server CPU); the bug is in `accum32` at
//! `zlib-rs/src/adler32/neon.rs`.
//!
//! On x86_64, zlib-rs dispatches to its AVX2 implementation and this
//! reproducer would not exercise the NEON code path; this specific bug
//! has only been confirmed on aarch64.
//!
//! Tested versions: zlib-rs 0.5.2, 0.5.5, 0.6.3 (all reproduce).

const BASE: u32 = 65521;

/// Reference scalar Adler-32 (textbook). Matches stock C zlib and
/// `zlib_rs::adler32::generic::adler32_rust` exactly.
fn adler32_scalar(start: u32, data: &[u8]) -> u32 {
    let mut s1 = start & 0xffff;
    let mut s2 = (start >> 16) & 0xffff;
    for &b in data {
        s1 = (s1 + b as u32) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    (s2 << 16) | s1
}

fn main() {
    // Trigger conditions, found by bisection:
    //  - input: at least 5567 bytes of 0xFF
    //  - sliced from a non-16-aligned offset (so `align_to` leaves
    //    `before` bytes and `middle.len()` reaches NMAX/16 = 347
    //    uint8x16_t elements — the maximum chunk size accum32 sees)
    //  - non-zero initial s1 AND non-zero initial s2
    let backing = vec![0xffu8; 5568];
    let buf: &[u8] = &backing[1..1 + 5567]; // off=1
    let start: u32 = 0xa4c1_fb51;           // sum2=0xa4c1, sum1=0xfb51

    let scalar = adler32_scalar(start, buf);
    let zlibrs = zlib_rs::adler32::adler32(start, buf);

    println!("input  : vec![0xff; 5568][1..5568]   ({} bytes)", buf.len());
    println!("start  : 0x{start:08x}");
    println!("scalar : 0x{scalar:08x}");
    println!("zlib-rs: 0x{zlibrs:08x}");
    if scalar == zlibrs {
        println!("OK — values match (bug not triggered on this target)");
    } else {
        let s1_diff = (scalar & 0xffff) as i32 - (zlibrs & 0xffff) as i32;
        let s2_diff = (scalar >> 16) as i32 - (zlibrs >> 16) as i32;
        println!("BUG    : s1 diff = {s1_diff}, s2 diff = {s2_diff} (expected 0)");
        std::process::exit(1);
    }
}

Expected output (against patched zlib-rs)

input  : vec![0xff; 5568][1..5568]   (5567 bytes)
start  : 0xa4c1fb51
scalar : 0x7ca7a5dc
zlib-rs: 0x7ca7a5dc
OK — values match (bug not triggered on this target)

Actual output (against unpatched zlib-rs 0.5.2 / 0.5.5 / 0.6.3 / trunk)

input  : vec![0xff; 5568][1..5568]   (5567 bytes)
start  : 0xa4c1fb51
scalar : 0x7ca7a5dc
zlib-rs: 0x7bc6a5dc
BUG    : s1 diff = 0, s2 diff = 225 (expected 0)

(s1 is correct; s2 is short by exactly 225 = 14_745_600 / 65_536 because
the four-lane horizontal sum wraps modulo 2³² before the final % BASE. See
Root cause below.)

This was originally observed in production while decoding zlib streams whose
trailer Adler-32 had been computed by flate2 + miniz_oxide: decoding via
zlib-rs's NEON path produced Z_DATA_ERROR ("incorrect data check") even
though the data itself was sound.

Trigger conditions (verbatim)

aarch64 NEON Adler-32 implementation, function accum32 in
src/adler32/neon.rs. Triggers when ALL THREE of:

  1. Input contains ≥ 5552 contiguous bytes of 0xFF
    (one full NMAX-sized SIMD chunk fed to accum32).
    The minimal reduced repro uses 5567 bytes; the threshold is NMAX = 5552.
  2. Slice starts at a non-16-aligned offset (so
    slice::align_to::<uint8x16_t> produces a non-empty before,
    which handle_tail accumulates into pair without a % BASE
    reduction).
  3. Initial s1 AND s2 are both non-zero (carry-in from prior bytes,
    e.g. from chunked Adler-32 over a multi-buffer stream, or from a PNG
    IDAT Adler-32 carry).

If any one condition is missing, the result is correct. In particular,
for the spec-default seed 1 (s1 = 1, s2 = 0) the four lanes never
accumulate enough for the final horizontal sum to overflow.
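The alignment mechanics behind condition 2 can be sketched without any NEON code. The `Lane` and `Backing` types below are hypothetical stand-ins (16 bytes, 16-byte alignment, like `uint8x16_t`), chosen so the sketch runs on any architecture; note that `align_to`'s docs only promise a maximal `middle` on a best-effort basis, though in practice (and in the crate) it is maximal:

```rust
// Illustrative stand-in for NEON's uint8x16_t: 16 bytes, 16-byte aligned.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Lane([u8; 16]);

// Backing buffer pinned to a 16-byte boundary so the offsets are deterministic.
#[repr(C, align(16))]
struct Backing([u8; 80]);

fn main() {
    let storage = Backing([0xff; 80]);
    // Slice from offset 1, as in the repro: the start is no longer 16-aligned.
    let buf: &[u8] = &storage.0[1..];
    // Safety: `Lane` is plain bytes with no invalid bit patterns.
    let (before, middle, after) = unsafe { buf.align_to::<Lane>() };
    assert_eq!(before.len(), 15); // 15 bytes until the next 16-byte boundary
    assert_eq!(middle.len(), 4);  // (79 - 15) / 16 full lanes
    assert_eq!(after.len(), 0);
    println!("before={} middle={} after={}", before.len(), middle.len(), after.len());
}
```

Those 15 `before` bytes are exactly what `handle_tail` accumulates without reduction.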

Root cause

adler32_neon_internal (zlib-rs/src/adler32/neon.rs:33) does:

// zlib-rs/src/adler32/neon.rs:61-66 (upstream, unpatched)
let (before, middle, after) = unsafe { buf.align_to::<uint8x16_t>() };

pair = handle_tail(pair, before);          // ← no `% BASE` here

for chunk in middle.chunks(NMAX as usize / core::mem::size_of::<uint8x16_t>()) {
    pair = unsafe { accum32(pair, chunk) };
    pair.0 %= BASE;
    pair.1 %= BASE;
}

handle_tail (neon.rs:81) is the trivial scalar loop:

fn handle_tail(mut pair: (u32, u32), buf: &[u8]) -> (u32, u32) {
    for x in buf {
        pair.0 += *x as u32;
        pair.1 += pair.0;
    }
    pair
}

Note handle_tail does not apply % BASE. With the trigger seed
(s1 = 0xfb51, s2 = 0xa4c1) and before.len() == 15 of 0xffs, the
post-tail pair is

pair.0 = 0xfb51 + 15 * 0xff = 68_162   (> BASE = 65_521)
pair.1 = 1_037_832                       (>> BASE)

This out-of-range pair is then handed straight to accum32
(neon.rs:91). accum32 plants s.0 and s.1 into lane 0 of the
adacc / s2acc u32×4 vectors, accumulates m SIMD groups of 64 bytes
into s3acc and s2acc, and finally horizontally sums all four lanes
with two vpadd_u32 instructions (neon.rs:196-198):

let adacc2 = vpadd_u32(vget_low_u32(adacc), vget_high_u32(adacc));
let s2acc2 = vpadd_u32(vget_low_u32(s2acc), vget_high_u32(s2acc));
let as_   = vpadd_u32(adacc2, s2acc2);     // ← lane 1 holds s2acc lanes summed

For the trigger input the four lanes of s2acc + s3acc (instrumented
dump) are:

[1_362_718_576, 982_891_380, 982_537_440, 982_183_500]   (sum = 4_310_330_896)

u32::MAX = 4_294_967_295. The horizontal sum computed by vpadd_u32
is performed in u32 (NEON UADDP), so it wraps:

4_310_330_896 mod 2³² = 15_363_600

which after % BASE = 65_521 is 31_686. The correct value is
31_911 — exactly 225 greater, matching the observed s2_diff.
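The wrap-around can be checked in plain integer arithmetic; u32 wrapping addition is exactly what the UADDP horizontal sum performs:

```rust
fn main() {
    // Instrumented lane values of s2acc + s3acc for the trigger input.
    let lanes: [u32; 4] = [1_362_718_576, 982_891_380, 982_537_440, 982_183_500];

    // UADDP (vpadd_u32) sums the lanes in u32, i.e. with wrap-around:
    let wrapped = lanes.iter().fold(0u32, |acc, &x| acc.wrapping_add(x));
    // The mathematically exact sum needs more than 32 bits:
    let exact: u64 = lanes.iter().map(|&x| x as u64).sum();

    assert_eq!(exact, 4_310_330_896);            // > u32::MAX = 4_294_967_295
    assert_eq!(wrapped, 15_363_600);             // exact mod 2^32
    assert_eq!(wrapped % 65_521, 31_686);        // buggy s2 after % BASE
    assert_eq!((exact % 65_521) as u32, 31_911); // correct s2
    println!("s2_diff = {}", 31_911 - 31_686);   // 225, the observed difference
}
```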

Why this overflows: with n = 5552 bytes of 0xff and s.0 ≈ BASE,
the quantity accum32 materialises into the four lanes (before
horizontal-summing) is the unreduced

s.1 + n * s.0 + sum_{p=0..n-1} (n - p) * b[p]
≈ 65_521 + 5552 * 65_521 + 255 * 5552 * 5553 / 2
≈ 4.30 × 10⁹

right at the u32 ceiling. Any extra contribution from a before tail
(pair.0 > BASE) or a similarly oversized seed pushes the lane sums
over u32::MAX. With s.0 < BASE and s.1 < BASE (the precondition the
algorithm implicitly assumes, and that the accum32 chunk loop maintains
via its own % BASE after each call), the same sum peaks at
4_294_690_200, staying below u32::MAX with only ≈ 277_000 of headroom.
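That worst-case bound follows from plugging s.0 = s.1 = BASE − 1 and n = NMAX bytes of 0xff into the unreduced sum above; a quick u64 check:

```rust
fn main() {
    const BASE: u64 = 65_521;
    const NMAX: u64 = 5_552;

    // Worst case *with* the precondition intact: s.0 = s.1 = BASE - 1 and an
    // NMAX-byte chunk of 0xff, plugged into
    //   s.1 + n * s.0 + sum_{p=0..n-1} (n - p) * 255
    let worst = (BASE - 1) + NMAX * (BASE - 1) + 255 * NMAX * (NMAX + 1) / 2;

    assert_eq!(worst, 4_294_690_200);
    assert!(worst <= u32::MAX as u64);            // still fits in u32 ...
    assert_eq!(u32::MAX as u64 - worst, 277_095); // ... with ~2.8e5 of headroom
    println!("worst-case unreduced sum = {worst}");
}
```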

That is why the bug is invisible whenever before.len() == 0 or the
caller's seed fits comfortably in 16 bits.

Why scalar / AVX2 / AVX-512 are not affected

  • adler32::generic::adler32_rust reduces adler and sum2 mod BASE inside
    the chunk loop, and the entry path (adler32_len_*) operates on values
    < BASE.
  • avx2::adler32_avx2 reduces pair mod BASE inside its own SIMD chunk
    loop and its scalar tail accumulator stays small enough not to
    overflow.
  • The NEON path is unique in that handle_tail (used for the alignment
    prefix) is unbounded and the result is fed directly into accum32
    without reduction.

Patch

Reduce pair mod BASE after handle_tail and before the SIMD chunk
loop. This restores the precondition that accum32's lane arithmetic
relies on, without touching any of the SIMD code paths.

diff --git a/zlib-rs/src/adler32/neon.rs b/zlib-rs/src/adler32/neon.rs
index 37816f8..bc4f70d 100644
--- a/zlib-rs/src/adler32/neon.rs
+++ b/zlib-rs/src/adler32/neon.rs
@@ -61,6 +61,13 @@ unsafe fn adler32_neon_internal(mut adler: u32, buf: &[u8]) -> u32 {
     let (before, middle, after) = unsafe { buf.align_to::<uint8x16_t>() };
 
     pair = handle_tail(pair, before);
+    // `accum32` accumulates in u32 lanes and assumes both components fit in 16
+    // bits on entry; without this reduction, a `before` tail (or large caller
+    // seed) can leave `pair.0`/`pair.1` above BASE and overflow the final
+    // horizontal sum on inputs near NMAX. See bug repro: 5567 bytes of 0xff
+    // sliced from offset 1 with seed 0xa4c1_fb51.
+    pair.0 %= BASE;
+    pair.1 %= BASE;
 
     for chunk in middle.chunks(NMAX as usize / core::mem::size_of::<uint8x16_t>()) {
         pair = unsafe { accum32(pair, chunk) };
@@ -246,4 +253,24 @@ mod tests {
 
         assert_eq!(neon, rust);
     }
+
+    /// Regression test for u32 overflow in `accum32`'s final horizontal sum.
+    ///
+    /// Triggered when:
+    /// 1. Input has a non-zero `before` tail (non-16-aligned slice) so that
+    ///    `handle_tail` runs without reducing `pair` mod BASE, and
+    /// 2. The post-`handle_tail` `pair` exceeds BASE, and
+    /// 3. The SIMD chunk approaches NMAX bytes of 0xff.
+    ///
+    /// Under those conditions, `s.0 * n + sum_p (n-p) * b[p] + s.1` can exceed
+    /// `u32::MAX`, wrapping the lane-summed result and corrupting `s2`.
+    #[test]
+    fn carry_in_with_unaligned_before_no_overflow() {
+        let backing = vec![0xffu8; 5568];
+        let buf: &[u8] = &backing[1..1 + 5567];
+        let start: u32 = 0xa4c1_fb51;
+        let neon = adler32_neon(start, buf);
+        let rust = crate::adler32::generic::adler32_rust(start, buf);
+        assert_eq!(neon, rust);
+    }
 }

A slightly more defensive alternative is to reduce inside accum32
itself (e.g. immediately after vsetq_lane_u32(s.0, …, 0)) so the
function is robust against any caller. The patch above is the minimum
change that fixes the bug without altering accum32.

Validation

All performed on aarch64 (Neoverse-class server, uname -m = aarch64,
rustc 1.88.0).

  1. Repro: with the patch applied, cargo run --release of the
    standalone reproducer above against the patched trunk crate prints
    OK — values match. Without the patch it prints
    BUG: s1 diff = 0, s2 diff = 225.

  2. cargo test --release (full workspace, patched):
    13 + 248 + 83 + 14 + 0 + 4 = 362 tests pass, 0 failures, 2
    ignored. Includes the existing adler32_neon_is_adler32_rust
    quickcheck/Miri test and the new
    carry_in_with_unaligned_before_no_overflow regression test.

  3. Regression test demonstrably exercises the bug: temporarily
    reverting only the pair.0 %= BASE; pair.1 %= BASE; lines (keeping
    the new test) makes
    cargo test --release -p zlib-rs --lib carry_in_with_unaligned_before_no_overflow
    fail with

    assertion `left == right` failed
      left:  2076616156
      right: 2091361756
    

    The 14 745 600 difference is exactly 225 << 16, matching the
    s2_diff = 225 the standalone repro shows. Restoring the patch
    makes the test pass.

  4. Fuzz (cargo +nightly fuzz run checksum): 60 seconds,
    1 453 754 iterations, no crashes or assertion failures.

How the bug got missed

  • The existing start_alignment test (neon.rs test module) only uses
    initial seeds 1 (the Adler-32 spec's default) and 42. Both have
    s2 = 0 and s1 < BASE, so neither component can be out of range
    after handle_tail, regardless of alignment.
  • The existing adler32_neon_is_adler32_rust quickcheck uses default
    quickcheck Vec<u8> shrinking, which produces small random inputs;
    reaching ≥ 5552 contiguous identical bytes randomly is essentially
    impossible.
  • The large_input test uses a real PDF (no long 0xff runs).
  • Real-world traffic that exercises this path needs a ≥ NMAX run of
    0xff (rare in compressed data, common in raw zero-padded /
    uninitialised buffers and certain image formats), a non-16-aligned
    slice (very common via &buf[i..]), and a non-default carry-in seed
    (common in chunked Adler-32 — e.g. PNG IDAT, multi-buffer
    compress2 callers, or anything that resumes a prior Adler-32
    across read boundaries).

A useful follow-up would be a fuzz target that explicitly varies the
seed start: u32 and feeds long runs of constant-value bytes; the
regression test added here is only a single point check.

@shenxiul shenxiul changed the title adler32: reduce pair mod BASE after handle_tail on aarch64 NEON [bug fix] adler32: reduce pair mod BASE after handle_tail on aarch64 NEON Apr 28, 2026
@folkertdev
Member

This makes sense, I'm trying to

  1. understand why zlib-ng doesn't run into this problem. They use a slightly different approach to make the alignment work, but from what I can see they don't actually apply the modulo
  2. get our fuzzer to actually hit this on aarch64. I'm having some issues with the aarch64 machine we have though, so this may take a bit

I can reproduce it with qemu though, thanks for the test case.

`accum32` accumulates in u32 lanes and assumes both components of `pair`
fit in 16 bits on entry. Without a `% BASE` reduction after the
alignment-prefix `handle_tail` call, a non-empty `before` tail (or a
large caller-supplied seed) can leave `pair.0` / `pair.1` above BASE,
which lets the four-lane horizontal sum at the end of `accum32`
overflow `u32::MAX` for inputs near `NMAX`.

Concretely: with 5567 bytes of `0xff` sliced from offset 1 (so
`before.len() == 15`) and seed `0xa4c1_fb51`, the post-`handle_tail`
pair is `(68_162, 1_037_832)` and the four lanes of `s2acc + s3acc` sum
to `4_310_330_896 > 2^32`. The wrap shows up as `s2_diff = 225` between
the NEON result and the scalar reference (`adler32::generic::adler32_rust`).

Fix: reduce `pair` mod BASE after `handle_tail(pair, before)` and
before the SIMD chunk loop. This restores the precondition `accum32`
relies on without altering any of the SIMD code paths. The chunk loop
already does `% BASE` after each `accum32` call, so subsequent
iterations were already safe — only the entry into the loop was missing
the reduction.

Also adds a regression test that fails on the unpatched code with the
exact bug signature.
@folkertdev folkertdev force-pushed the fix/aarch64-neon-adler32-overflow branch from cfcde02 to e3c711c on May 4, 2026 10:28
@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
fuzz-compress ?
fuzz-decompress ?
test-aarch64-apple-darwin 89.59% <100.00%> (+0.01%) ⬆️
test-aarch64-unknown-linux-gnu 85.39% <100.00%> (+0.03%) ⬆️
test-i686-unknown-linux-gnu 85.12% <ø> (+0.06%) ⬆️
test-x86_64-apple-darwin 88.97% <ø> (+0.05%) ⬆️
test-x86_64-unknown-linux-gnu 91.00% <ø> (-2.10%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
zlib-rs/src/adler32/neon.rs 100.00% <100.00%> (ø)

... and 6 files with indirect coverage changes


@folkertdev folkertdev merged commit 0c67788 into trifectatechfoundation:main May 4, 2026
35 checks passed