[bug fix] adler32: reduce pair mod BASE after handle_tail on aarch64 NEON#504
Merged
folkertdev merged 1 commit intotrifectatechfoundation:mainfrom May 4, 2026
Conversation
pair mod BASE after handle_tail on aarch64 NEONpair mod BASE after handle_tail on aarch64 NEON
Member
|
This makes sense, I'm trying to
I can reproduce it with qemu though, thanks for the test case. |
`accum32` accumulates in u32 lanes and assumes both components of `pair` fit in 16 bits on entry. Without a `% BASE` reduction after the alignment-prefix `handle_tail` call, a non-empty `before` tail (or a large caller-supplied seed) can leave `pair.0` / `pair.1` above BASE, which lets the four-lane horizontal sum at the end of `accum32` overflow `u32::MAX` for inputs near `NMAX`. Concretely: with 5567 bytes of `0xff` sliced from offset 1 (so `before.len() == 15`) and seed `0xa4c1_fb51`, the post-`handle_tail` pair is `(68_162, 1_037_832)` and the four lanes of `s2acc + s3acc` sum to `4_310_330_896 > 2^32`. The wrap shows up as `s2_diff = 225` between the NEON result and the scalar reference (`adler32::generic::adler32_rust`). Fix: reduce `pair` mod BASE after `handle_tail(pair, before)` and before the SIMD chunk loop. This restores the precondition `accum32` relies on without altering any of the SIMD code paths. The chunk loop already does `% BASE` after each `accum32` call, so subsequent iterations were already safe — only the entry into the loop was missing the reduction. Also adds a regression test that fails on the unpatched code with the exact bug signature.
cfcde02 to
e3c711c
Compare
Codecov Report✅ All modified and coverable lines are covered by tests.
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 6 files with indirect coverage changes 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
aarch64 NEON
accum32integer overflow with non-alignedbeforeand non-zero seedCrate / repo:
zlib-rs(trifectatechfoundation/zlib-rs)File:
zlib-rs/src/adler32/neon.rsArchitecture: aarch64 (NEON) only. Other backends (AVX2, AVX-512, scalar
generic, WASM) are not affected.
Confirmed reproducing on: 0.5.2, 0.5.5, 0.6.3, and current trunk
(SHA
e51e62b2d1131e30576080966c36b5b5a2abcc56, dated 2026-04-20).The code path is unchanged across those releases, so earlier versions that
shipped the same
adler32::neon::adler32_neon_internal(any version that hashandle_tail(pair, before)followed directly by the SIMD chunk loop) areexpected to be affected as well.
Reproducer (60 lines,
cargo run)Cargo.toml:src/main.rs:Expected output (against patched zlib-rs)
Actual output (against unpatched zlib-rs 0.5.2 / 0.5.5 / 0.6.3 / trunk)
(
s1is correct;s2is short by exactly225 = 14_745_600 / 65_536becausethe four-lane horizontal sum wraps modulo 2³² before the final
% BASE. SeeRoot cause below.)
This was originally observed in production while decoding zlib streams whose
trailer Adler-32 had been computed by
flate2+miniz_oxide: decoding viazlib-rs's NEON path produced
Z_DATA_ERROR ("incorrect data check")eventhough the data itself was sound.
Trigger conditions (verbatim)
If any one condition is missing, the result is correct. In particular,
for the spec-default seed
1(s1 = 1, s2 = 0) the four lanes neveraccumulate enough for the final horizontal sum to overflow.
Root cause
adler32_neon_internal(zlib-rs/src/adler32/neon.rs:33) does:handle_tail(neon.rs:81) is the trivial scalar loop:Note
handle_taildoes not apply% BASE. With the trigger seed(s1 = 0xfb51, s2 = 0xa4c1)andbefore.len() == 15of0xffs, thepost-tail pair is
This out-of-range pair is then handed straight to
accum32(
neon.rs:91).accum32plantss.0ands.1into lane 0 of theadacc/s2accu32×4 vectors, accumulatesmSIMD groups of 64 bytesinto
s3accands2acc, and finally horizontally sums all four laneswith two
vpadd_u32instructions (neon.rs:196-198):For the trigger input the four lanes of
s2acc + s3acc(instrumenteddump) are:
u32::MAX = 4_294_967_295. The horizontal sum computed byvpadd_u32is performed in u32 (NEON
UADDP), so it wraps:which after
% BASE = 65_521is31_686. The correct value is31_911— exactly 225 greater, matching the observeds2_diff.Why this overflows: with
n = 5552bytes of0xffands.0 ≈ BASE,the quantity
accum32materialises into the four lanes (beforehorizontal-summing) is the unreduced
right at the u32 ceiling. Any extra contribution from a
beforetail(
pair.0 > BASE) or a similarly oversized seed pushes the lane sumsover
u32::MAX. Withs.0 < BASEands.1 < BASE(the precondition thealgorithm implicitly assumes — and that the
accum32chunk loop maintainsvia its own
% BASEafter each call), the same sum stays belowu32::MAXwith ≈ 32 MB of margin.That is why the bug is invisible whenever
before.len() == 0or thecaller's seed fits comfortably in 16 bits.
Why scalar / AVX2 / AVX-512 are not affected
adler32::generic::adler32_rustreducesadlerandsum2mod BASE insidethe chunk loop, and the entry path (
adler32_len_*) operates on values< BASE.avx2::adler32_avx2reducespairmod BASE inside its own SIMD chunkloop and its scalar tail accumulator stays small enough not to
overflow.
handle_tail(used for the alignmentprefix) is unbounded and the result is fed directly into
accum32without reduction.
Patch
Reduce
pairmod BASE afterhandle_tailand before the SIMD chunkloop. This restores the precondition that
accum32's lane arithmeticrelies on, without touching any of the SIMD code paths.
A slightly more defensive alternative is to reduce inside
accum32itself (e.g. immediately after
vsetq_lane_u32(s.0, …, 0)) so thefunction is robust against any caller. The patch above is the minimum
change that fixes the bug without altering
accum32.Validation
All performed on aarch64 (Neoverse-class server,
uname -m = aarch64,rustc 1.88.0).Repro: with the patch applied,
cargo run --releaseof thestandalone reproducer above against the patched trunk crate prints
OK — values match. Without the patch it printsBUG: s1 diff = 0, s2 diff = 225.cargo test --release(full workspace, patched):13 + 248 + 83 + 14 + 0 + 4 = 362tests pass, 0 failures, 2ignored. Includes the existing
adler32_neon_is_adler32_rustquickcheck/Miri test and the new
carry_in_with_unaligned_before_no_overflowregression test.Regression test demonstrably exercises the bug: temporarily
reverting only the
pair.0 %= BASE; pair.1 %= BASE;lines (keepingthe new test) makes
cargo test --release -p zlib-rs --lib carry_in_with_unaligned_before_no_overflowfail with
The 14 745 600 difference is exactly
225 << 16, matching thes2_diff = 225the standalone repro shows. Restoring the patchmakes the test pass.
Fuzz (
cargo +nightly fuzz run checksum): 60 seconds,1 453 754 iterations, no crashes or assertion failures.
How the bug got missed
start_alignmenttest (neon.rstest module) only usesinitial seeds
1(the Adler-32 spec's default) and42. Both haves2 = 0ands1 < BASE, so neither component can be out of rangeafter
handle_tail, regardless of alignment.adler32_neon_is_adler32_rustquickcheck uses defaultquickcheck
Vec<u8>shrinking, which produces small random inputs;reaching ≥ 5552 contiguous identical bytes randomly is essentially
impossible.
large_inputtest uses a real PDF (no long0xffruns).0xff(rare in compressed data, common in raw zero-padded /uninitialised buffers and certain image formats), a non-16-aligned
slice (very common via
&buf[i..]), and a non-default carry-in seed(common in chunked Adler-32 — e.g. PNG
IDAT, multi-buffercompress2callers, or anything that resumes a prior Adler-32across read boundaries).
A useful follow-up would be a fuzz target that explicitly varies the
seed
start: u32and feeds long runs of constant-value bytes; theregression test added here is a sufficient point check.