
[bug fix] adler32: reduce pair mod BASE after handle_tail on aarch64 NEON #504

Merged
folkertdev merged 1 commit into trifectatechfoundation:main from shenxiul:fix/aarch64-neon-adler32-overflow
May 4, 2026

Conversation

@shenxiul
Contributor

aarch64 NEON accum32 integer overflow with non-aligned before and non-zero seed

Crate / repo: zlib-rs (trifectatechfoundation/zlib-rs)
File: zlib-rs/src/adler32/neon.rs
Architecture: aarch64 (NEON) only. Other backends (AVX2, AVX-512, scalar
generic, WASM) are not affected.
Confirmed reproducing on: 0.5.2, 0.5.5, 0.6.3, and current trunk
(SHA e51e62b2d1131e30576080966c36b5b5a2abcc56, dated 2026-04-20).
The code path is unchanged across those releases, so earlier versions that
shipped the same adler32::neon::adler32_neon_internal (any version that has
handle_tail(pair, before) followed directly by the SIMD chunk loop) are
expected to be affected as well.

Reproducer (60 lines, cargo run)

Cargo.toml:

[package]
name = "zlib_rs_neon_repro"
version = "0.0.0"
edition = "2021"

[dependencies]
# Pin to whichever published version you want to test.
zlib-rs = "0.6.3"

src/main.rs:

//! Minimal reproducer for a bug in zlib-rs's aarch64 NEON Adler-32
//! implementation. Build and run on an aarch64 machine with NEON
//! (any modern ARM server CPU); the bug is in `accum32` at
//! `zlib-rs/src/adler32/neon.rs`.
//!
//! On x86_64, zlib-rs dispatches to its AVX2 implementation and this
//! reproducer would not exercise the NEON code path; this specific bug
//! has only been confirmed on aarch64.
//!
//! Tested versions: zlib-rs 0.5.2, 0.5.5, 0.6.3 (all reproduce).

const BASE: u32 = 65521;

/// Reference scalar Adler-32 (textbook). Matches stock C zlib and
/// `zlib_rs::adler32::generic::adler32_rust` exactly.
fn adler32_scalar(start: u32, data: &[u8]) -> u32 {
    let mut s1 = start & 0xffff;
    let mut s2 = (start >> 16) & 0xffff;
    for &b in data {
        s1 = (s1 + b as u32) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    (s2 << 16) | s1
}

fn main() {
    // Trigger conditions, found by bisection:
    //  - input: at least 5567 bytes of 0xFF
    //  - sliced from a non-16-aligned offset (so `align_to` leaves
    //    `before` bytes and `middle.len()` reaches NMAX/16 = 347
    //    uint8x16_t elements — the maximum chunk size accum32 sees)
    //  - non-zero initial s1 AND non-zero initial s2
    let backing = vec![0xffu8; 5568];
    let buf: &[u8] = &backing[1..1 + 5567]; // off=1
    let start: u32 = 0xa4c1_fb51;           // sum2=0xa4c1, sum1=0xfb51

    let scalar = adler32_scalar(start, buf);
    let zlibrs = zlib_rs::adler32::adler32(start, buf);

    println!("input  : vec![0xff; 5568][1..5568]   ({} bytes)", buf.len());
    println!("start  : 0x{start:08x}");
    println!("scalar : 0x{scalar:08x}");
    println!("zlib-rs: 0x{zlibrs:08x}");
    if scalar == zlibrs {
        println!("OK — values match (bug not triggered on this target)");
    } else {
        let s1_diff = (scalar & 0xffff) as i32 - (zlibrs & 0xffff) as i32;
        let s2_diff = (scalar >> 16) as i32 - (zlibrs >> 16) as i32;
        println!("BUG    : s1 diff = {s1_diff}, s2 diff = {s2_diff} (expected 0)");
        std::process::exit(1);
    }
}

Expected output (against patched zlib-rs)

input  : vec![0xff; 5568][1..5568]   (5567 bytes)
start  : 0xa4c1fb51
scalar : 0x7ca7a5dc
zlib-rs: 0x7ca7a5dc
OK — values match (bug not triggered on this target)

Actual output (against unpatched zlib-rs 0.5.2 / 0.5.5 / 0.6.3 / trunk)

input  : vec![0xff; 5568][1..5568]   (5567 bytes)
start  : 0xa4c1fb51
scalar : 0x7ca7a5dc
zlib-rs: 0x7bc6a5dc
BUG    : s1 diff = 0, s2 diff = 225 (expected 0)

(s1 is correct; s2 is short by exactly 225 = 14_745_600 / 65_536 because
the four-lane horizontal sum wraps modulo 2³² before the final % BASE. See
Root cause below.)

This was originally observed in production while decoding zlib streams whose
trailer Adler-32 had been computed by flate2 + miniz_oxide: decoding via
zlib-rs's NEON path produced Z_DATA_ERROR ("incorrect data check") even
though the data itself was sound.

Trigger conditions (verbatim)

aarch64 NEON Adler-32 implementation, function accum32 in
src/adler32/neon.rs. Triggers when ALL THREE of:

  1. Input contains ≥ 5552 contiguous bytes of 0xFF
    (one full NMAX-sized SIMD chunk fed to accum32).
    The minimal reduced repro uses 5567 bytes; the threshold is NMAX = 5552.
  2. Slice starts at a non-16-aligned offset (so
    slice::align_to::<uint8x16_t> produces a non-empty before,
    which handle_tail accumulates into pair without a % BASE
    reduction).
  3. Initial s1 AND s2 are both non-zero (carry-in from prior bytes,
    e.g. from chunked Adler-32 over a multi-buffer stream, or from a PNG
    IDAT Adler-32 carry).

If any one condition is missing, the result is correct. In particular,
for the spec-default seed 1 (s1 = 1, s2 = 0) the four lanes never
accumulate enough for the final horizontal sum to overflow.
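The alignment mechanics behind condition 2 can be sketched without any NEON code. The `Lane` and `Backing` types below are hypothetical stand-ins (16 bytes, 16-byte alignment, like `uint8x16_t`), chosen so the sketch runs on any architecture; note that `align_to`'s docs only promise a maximal `middle` on a best-effort basis, though in practice (and in the crate) it is maximal:

```rust
// Illustrative stand-in for NEON's uint8x16_t: 16 bytes, 16-byte aligned.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Lane([u8; 16]);

// Backing buffer pinned to a 16-byte boundary so the offsets are deterministic.
#[repr(C, align(16))]
struct Backing([u8; 80]);

fn main() {
    let storage = Backing([0xff; 80]);
    // Slice from offset 1, as in the repro: the start is no longer 16-aligned.
    let buf: &[u8] = &storage.0[1..];
    // Safety: `Lane` is plain bytes with no invalid bit patterns.
    let (before, middle, after) = unsafe { buf.align_to::<Lane>() };
    assert_eq!(before.len(), 15); // 15 bytes until the next 16-byte boundary
    assert_eq!(middle.len(), 4);  // (79 - 15) / 16 full lanes
    assert_eq!(after.len(), 0);
    println!("before={} middle={} after={}", before.len(), middle.len(), after.len());
}
```

Those 15 `before` bytes are exactly what `handle_tail` accumulates without reduction.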

Root cause

adler32_neon_internal (zlib-rs/src/adler32/neon.rs:33) does:

// zlib-rs/src/adler32/neon.rs:61-66 (upstream, unpatched)
let (before, middle, after) = unsafe { buf.align_to::<uint8x16_t>() };

pair = handle_tail(pair, before);          // ← no `% BASE` here

for chunk in middle.chunks(NMAX as usize / core::mem::size_of::<uint8x16_t>()) {
    pair = unsafe { accum32(pair, chunk) };
    pair.0 %= BASE;
    pair.1 %= BASE;
}

handle_tail (neon.rs:81) is the trivial scalar loop:

fn handle_tail(mut pair: (u32, u32), buf: &[u8]) -> (u32, u32) {
    for x in buf {
        pair.0 += *x as u32;
        pair.1 += pair.0;
    }
    pair
}

Note handle_tail does not apply % BASE. With the trigger seed
(s1 = 0xfb51, s2 = 0xa4c1) and before.len() == 15 of 0xffs, the
post-tail pair is

pair.0 = 0xfb51 + 15 * 0xff = 68_162   (> BASE = 65_521)
pair.1 = 1_037_832                       (>> BASE)

This out-of-range pair is then handed straight to accum32
(neon.rs:91). accum32 plants s.0 and s.1 into lane 0 of the
adacc / s2acc u32×4 vectors, accumulates m SIMD groups of 64 bytes
into s3acc and s2acc, and finally horizontally sums all four lanes
with two vpadd_u32 instructions (neon.rs:196-198):

let adacc2 = vpadd_u32(vget_low_u32(adacc), vget_high_u32(adacc));
let s2acc2 = vpadd_u32(vget_low_u32(s2acc), vget_high_u32(s2acc));
let as_   = vpadd_u32(adacc2, s2acc2);     // ← lane 1 holds s2acc lanes summed

For the trigger input the four lanes of s2acc + s3acc (instrumented
dump) are:

[1_362_718_576, 982_891_380, 982_537_440, 982_183_500]   (sum = 4_310_330_896)

u32::MAX = 4_294_967_295. The horizontal sum computed by vpadd_u32
is performed in u32 (NEON UADDP), so it wraps:

4_310_330_896 mod 2³² = 15_363_600

which after % BASE = 65_521 is 31_686. The correct value is
31_911 — exactly 225 greater, matching the observed s2_diff.
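The wrap-around can be checked in plain integer arithmetic; u32 wrapping addition is exactly what the UADDP horizontal sum performs:

```rust
fn main() {
    // Instrumented lane values of s2acc + s3acc for the trigger input.
    let lanes: [u32; 4] = [1_362_718_576, 982_891_380, 982_537_440, 982_183_500];

    // UADDP (vpadd_u32) sums the lanes in u32, i.e. with wrap-around:
    let wrapped = lanes.iter().fold(0u32, |acc, &x| acc.wrapping_add(x));
    // The mathematically exact sum needs more than 32 bits:
    let exact: u64 = lanes.iter().map(|&x| x as u64).sum();

    assert_eq!(exact, 4_310_330_896);            // > u32::MAX = 4_294_967_295
    assert_eq!(wrapped, 15_363_600);             // exact mod 2^32
    assert_eq!(wrapped % 65_521, 31_686);        // buggy s2 after % BASE
    assert_eq!((exact % 65_521) as u32, 31_911); // correct s2
    println!("s2_diff = {}", 31_911 - 31_686);   // 225, the observed difference
}
```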

Why this overflows: with n = 5552 bytes of 0xff and s.0 ≈ BASE,
the quantity accum32 materialises into the four lanes (before
horizontal-summing) is the unreduced

s.1 + n * s.0 + sum_{p=0..n-1} (n - p) * b[p]
≈ 65_521 + 5552 * 65_521 + 255 * 5552 * 5553 / 2
≈ 4.30 × 10⁹

right at the u32 ceiling. Any extra contribution from a before tail
(pair.0 > BASE) or a similarly oversized seed pushes the lane sums
over u32::MAX. With s.0 < BASE and s.1 < BASE (the precondition the
algorithm implicitly assumes, and that the accum32 chunk loop maintains
via its own % BASE after each call), the same sum peaks at
4_294_690_200, staying below u32::MAX with only ≈ 277_000 of headroom.
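That worst-case bound follows from plugging s.0 = s.1 = BASE − 1 and n = NMAX bytes of 0xff into the unreduced sum above; a quick u64 check:

```rust
fn main() {
    const BASE: u64 = 65_521;
    const NMAX: u64 = 5_552;

    // Worst case *with* the precondition intact: s.0 = s.1 = BASE - 1 and an
    // NMAX-byte chunk of 0xff, plugged into
    //   s.1 + n * s.0 + sum_{p=0..n-1} (n - p) * 255
    let worst = (BASE - 1) + NMAX * (BASE - 1) + 255 * NMAX * (NMAX + 1) / 2;

    assert_eq!(worst, 4_294_690_200);
    assert!(worst <= u32::MAX as u64);            // still fits in u32 ...
    assert_eq!(u32::MAX as u64 - worst, 277_095); // ... with ~2.8e5 of headroom
    println!("worst-case unreduced sum = {worst}");
}
```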

That is why the bug is invisible whenever before.len() == 0 or the
caller's seed fits comfortably in 16 bits.

Why scalar / AVX2 / AVX-512 are not affected

  • adler32::generic::adler32_rust reduces adler and sum2 mod BASE inside
    the chunk loop, and the entry path (adler32_len_*) operates on values
    < BASE.
  • avx2::adler32_avx2 reduces pair mod BASE inside its own SIMD chunk
    loop and its scalar tail accumulator stays small enough not to
    overflow.
  • The NEON path is unique in that handle_tail (used for the alignment
    prefix) is unbounded and the result is fed directly into accum32
    without reduction.

Patch

Reduce pair mod BASE after handle_tail and before the SIMD chunk
loop. This restores the precondition that accum32's lane arithmetic
relies on, without touching any of the SIMD code paths.

diff --git a/zlib-rs/src/adler32/neon.rs b/zlib-rs/src/adler32/neon.rs
index 37816f8..bc4f70d 100644
--- a/zlib-rs/src/adler32/neon.rs
+++ b/zlib-rs/src/adler32/neon.rs
@@ -61,6 +61,13 @@ unsafe fn adler32_neon_internal(mut adler: u32, buf: &[u8]) -> u32 {
     let (before, middle, after) = unsafe { buf.align_to::<uint8x16_t>() };
 
     pair = handle_tail(pair, before);
+    // `accum32` accumulates in u32 lanes and assumes both components fit in 16
+    // bits on entry; without this reduction, a `before` tail (or large caller
+    // seed) can leave `pair.0`/`pair.1` above BASE and overflow the final
+    // horizontal sum on inputs near NMAX. See bug repro: 5567 bytes of 0xff
+    // sliced from offset 1 with seed 0xa4c1_fb51.
+    pair.0 %= BASE;
+    pair.1 %= BASE;
 
     for chunk in middle.chunks(NMAX as usize / core::mem::size_of::<uint8x16_t>()) {
         pair = unsafe { accum32(pair, chunk) };
@@ -246,4 +253,24 @@ mod tests {
 
         assert_eq!(neon, rust);
     }
+
+    /// Regression test for u32 overflow in `accum32`'s final horizontal sum.
+    ///
+    /// Triggered when:
+    /// 1. Input has a non-zero `before` tail (non-16-aligned slice) so that
+    ///    `handle_tail` runs without reducing `pair` mod BASE, and
+    /// 2. The post-`handle_tail` `pair` exceeds BASE, and
+    /// 3. The SIMD chunk approaches NMAX bytes of 0xff.
+    ///
+    /// Under those conditions, `s.0 * n + sum_p (n-p) * b[p] + s.1` can exceed
+    /// `u32::MAX`, wrapping the lane-summed result and corrupting `s2`.
+    #[test]
+    fn carry_in_with_unaligned_before_no_overflow() {
+        let backing = vec![0xffu8; 5568];
+        let buf: &[u8] = &backing[1..1 + 5567];
+        let start: u32 = 0xa4c1_fb51;
+        let neon = adler32_neon(start, buf);
+        let rust = crate::adler32::generic::adler32_rust(start, buf);
+        assert_eq!(neon, rust);
+    }
 }

A slightly more defensive alternative is to reduce inside accum32
itself (e.g. immediately after vsetq_lane_u32(s.0, …, 0)) so the
function is robust against any caller. The patch above is the minimum
change that fixes the bug without altering accum32.

Validation

All performed on aarch64 (Neoverse-class server, uname -m = aarch64,
rustc 1.88.0).

  1. Repro: with the patch applied, cargo run --release of the
    standalone reproducer above against the patched trunk crate prints
    OK — values match. Without the patch it prints
    BUG: s1 diff = 0, s2 diff = 225.

  2. cargo test --release (full workspace, patched):
    13 + 248 + 83 + 14 + 0 + 4 = 362 tests pass, 0 failures, 2
    ignored. Includes the existing adler32_neon_is_adler32_rust
    quickcheck/Miri test and the new
    carry_in_with_unaligned_before_no_overflow regression test.

  3. Regression test demonstrably exercises the bug: temporarily
    reverting only the pair.0 %= BASE; pair.1 %= BASE; lines (keeping
    the new test) makes
    cargo test --release -p zlib-rs --lib carry_in_with_unaligned_before_no_overflow
    fail with

    assertion `left == right` failed
      left:  2076616156
      right: 2091361756
    

    The 14 745 600 difference is exactly 225 << 16, matching the
    s2_diff = 225 the standalone repro shows. Restoring the patch
    makes the test pass.

  4. Fuzz (cargo +nightly fuzz run checksum): 60 seconds,
    1 453 754 iterations, no crashes or assertion failures.

How the bug got missed

  • The existing start_alignment test (neon.rs test module) only uses
    initial seeds 1 (the Adler-32 spec's default) and 42. Both have
    s2 = 0 and s1 < BASE, so neither component can be out of range
    after handle_tail, regardless of alignment.
  • The existing adler32_neon_is_adler32_rust quickcheck uses default
    quickcheck Vec<u8> shrinking, which produces small random inputs;
    reaching ≥ 5552 contiguous identical bytes randomly is essentially
    impossible.
  • The large_input test uses a real PDF (no long 0xff runs).
  • Real-world traffic that exercises this path needs a ≥ NMAX run of
    0xff (rare in compressed data, common in raw zero-padded /
    uninitialised buffers and certain image formats), a non-16-aligned
    slice (very common via &buf[i..]), and a non-default carry-in seed
    (common in chunked Adler-32 — e.g. PNG IDAT, multi-buffer
    compress2 callers, or anything that resumes a prior Adler-32
    across read boundaries).

A useful follow-up would be a fuzz target that explicitly varies the
seed start: u32 and feeds long runs of constant-value bytes; the
regression test added here is only a single point check.

@shenxiul shenxiul changed the title adler32: reduce pair mod BASE after handle_tail on aarch64 NEON [bug fix] adler32: reduce pair mod BASE after handle_tail on aarch64 NEON Apr 28, 2026
@folkertdev
Member

This makes sense, I'm trying to

  1. understand why zlib-ng doesn't run into this problem. They use a slightly different approach to make the alignment work, but from what I can see they don't actually apply the modulo
  2. get our fuzzer to actually hit this on aarch64. I'm having some issues with the aarch64 machine we have though, so this may take a bit

I can reproduce it with qemu though, thanks for the test case.

`accum32` accumulates in u32 lanes and assumes both components of `pair`
fit in 16 bits on entry. Without a `% BASE` reduction after the
alignment-prefix `handle_tail` call, a non-empty `before` tail (or a
large caller-supplied seed) can leave `pair.0` / `pair.1` above BASE,
which lets the four-lane horizontal sum at the end of `accum32`
overflow `u32::MAX` for inputs near `NMAX`.

Concretely: with 5567 bytes of `0xff` sliced from offset 1 (so
`before.len() == 15`) and seed `0xa4c1_fb51`, the post-`handle_tail`
pair is `(68_162, 1_037_832)` and the four lanes of `s2acc + s3acc` sum
to `4_310_330_896 > 2^32`. The wrap shows up as `s2_diff = 225` between
the NEON result and the scalar reference (`adler32::generic::adler32_rust`).

Fix: reduce `pair` mod BASE after `handle_tail(pair, before)` and
before the SIMD chunk loop. This restores the precondition `accum32`
relies on without altering any of the SIMD code paths. The chunk loop
already does `% BASE` after each `accum32` call, so subsequent
iterations were already safe — only the entry into the loop was missing
the reduction.

Also adds a regression test that fails on the unpatched code with the
exact bug signature.
@folkertdev folkertdev force-pushed the fix/aarch64-neon-adler32-overflow branch from cfcde02 to e3c711c on May 4, 2026 10:28
@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
fuzz-compress ?
fuzz-decompress ?
test-aarch64-apple-darwin 89.59% <100.00%> (+0.01%) ⬆️
test-aarch64-unknown-linux-gnu 85.39% <100.00%> (+0.03%) ⬆️
test-i686-unknown-linux-gnu 85.12% <ø> (+0.06%) ⬆️
test-x86_64-apple-darwin 88.97% <ø> (+0.05%) ⬆️
test-x86_64-unknown-linux-gnu 91.00% <ø> (-2.10%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
zlib-rs/src/adler32/neon.rs 100.00% <100.00%> (ø)

... and 6 files with indirect coverage changes


@folkertdev folkertdev merged commit 0c67788 into trifectatechfoundation:main May 4, 2026
35 checks passed