HDDS-15356. Make multi-buffer chunk checksum allocation-free by smengcl · Pull Request #10350 · apache/ozone

smengcl · 2026-05-23T23:26:16Z

Generated-by: Claude Code (Opus 4.7)

What changes were proposed in this pull request?

A multi-buffer ChunkBuffer is one whose data is stored as more than one underlying ByteBuffer instead of a single contiguous block. In production this shape occurs when the data was assembled from a list of Netty ByteBufs (Ratis state-machine-data read via ChunkedNioFile), when the verifier received a List<ByteString> spanning more than one checksum window (client read-verify), or when an operator opted into IncrementalChunkBuffer via ozone.client.bytebuffer.increment > 0.

Checksum.computeChecksum(ChunkBuffer) previously delegated to ChunkBuffer.iterate(bytesPerChecksum), which allocates a fresh byte[bytesPerChecksum] (1 MB by default) and memcpys into it whenever a checksum window straddles two of those underlying ByteBuffers.
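To make the "window straddles two buffers" condition concrete, here is a minimal sketch that counts how many checksum windows cross a boundary between adjacent buffers for a given layout. It only models the condition as described above and is not the Ozone implementation; `StraddleCountSketch` and `straddlingWindows` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;

final class StraddleCountSketch {

  /**
   * Counts checksum windows whose byte range crosses at least one boundary
   * between adjacent buffers. Hypothetical helper, not part of Ozone.
   */
  static int straddlingWindows(int[] bufferSizes, int bytesPerChecksum) {
    long total = 0;
    for (int size : bufferSizes) {
      total += size;
    }
    int count = 0;
    for (long start = 0; start < total; start += bytesPerChecksum) {
      long end = Math.min(start + bytesPerChecksum, total);
      long boundary = 0;
      // Only boundaries between buffers matter, so the last buffer's end is skipped.
      for (int i = 0; i < bufferSizes.length - 1; i++) {
        boundary += bufferSizes[i]; // end offset of buffer i
        if (boundary > start && boundary < end) {
          count++;
          break;
        }
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int kb = 1024;
    int[] sixteenPieces = new int[16];
    Arrays.fill(sixteenPieces, 256 * kb);
    // 16 x 256 KB pieces with 1 MB windows
    System.out.println(straddlingWindows(sixteenPieces, 1024 * kb));
    // one contiguous 4 MB buffer with 1 MB windows
    System.out.println(straddlingWindows(new int[] {4096 * kb}, 1024 * kb));
  }
}
```

Under this model, a layout whose buffer boundaries all fall on window boundaries has no straddling windows, so it would never hit the copying path.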

This change rewrites both the no-cache and cache compute paths to walk data.asByteBufferList() directly, slicing each window via a non-copying ByteBuffer.duplicate() helper (BufferUtils.slice) and feeding slices to a new Checksum.StreamingChecksum strategy that wraps ChecksumByteBuffer (CRC32, CRC32C) or MessageDigest (SHA-256, MD5). The update(ByteBuffer) contracts of both define incremental update as byte-equivalent to a single update over the concatenation, so output bytes are bit-identical.
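The walk described above can be sketched with JDK classes only. This is an illustration under stated assumptions, not the PR's code: `WindowedCrcSketch`, its `slice`, `checksums`, `split`, and `reference` methods are hypothetical stand-ins (the real helpers are `BufferUtils.slice` and `Checksum.StreamingChecksum`, whose signatures are not shown here), and `java.util.zip.CRC32` stands in for `ChecksumByteBuffer`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.zip.CRC32;

final class WindowedCrcSketch {

  /** Non-copying view of buf[pos, pos + len); the source's position and limit are untouched. */
  static ByteBuffer slice(ByteBuffer buf, int pos, int len) {
    ByteBuffer dup = buf.duplicate();
    dup.limit(pos + len);
    dup.position(pos);
    return dup;
  }

  /** One CRC32 value per bytesPerChecksum window, computed without linearizing the buffers. */
  static List<Long> checksums(List<ByteBuffer> buffers, int bytesPerChecksum) {
    List<Long> out = new ArrayList<>();
    CRC32 crc = new CRC32();
    int filled = 0; // bytes already fed into the current window
    for (ByteBuffer b : buffers) {
      int pos = b.position();
      while (pos < b.limit()) {
        int n = Math.min(b.limit() - pos, bytesPerChecksum - filled);
        crc.update(slice(b, pos, n)); // incremental update over a view, no byte[] copy
        pos += n;
        filled += n;
        if (filled == bytesPerChecksum) {
          out.add(crc.getValue());
          crc.reset();
          filled = 0;
        }
      }
    }
    if (filled > 0) {
      out.add(crc.getValue()); // trailing partial window
    }
    return out;
  }

  /** Splits data into views of at most pieceSize bytes to mimic a multi-buffer input. */
  static List<ByteBuffer> split(byte[] data, int pieceSize) {
    List<ByteBuffer> parts = new ArrayList<>();
    for (int off = 0; off < data.length; off += pieceSize) {
      parts.add(ByteBuffer.wrap(data, off, Math.min(pieceSize, data.length - off)));
    }
    return parts;
  }

  /** Reference result: one update per window over the contiguous array. */
  static List<Long> reference(byte[] all, int bytesPerChecksum) {
    List<Long> out = new ArrayList<>();
    for (int off = 0; off < all.length; off += bytesPerChecksum) {
      CRC32 crc = new CRC32();
      crc.update(all, off, Math.min(bytesPerChecksum, all.length - off));
      out.add(crc.getValue());
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] data = new byte[10_000];
    new Random(42).nextBytes(data);
    // Compares the multi-buffer walk against the contiguous reference.
    System.out.println(checksums(split(data, 333), 1024).equals(reference(data, 1024)));
  }
}
```

The sketch keeps a single running CRC per window and resets it at each window boundary, so a window that spans several buffers is simply fed several slices in order.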

KeyValueHandler.validateChunkChecksumData is also switched to the multi-buffer overload of Checksum.verifyChecksum, which now benefits from the same walk.

When this helps

The fix only matters when the input ChunkBuffer is multi-buffer.

Ratis state-machine-data read-replay validate (always). ChunkUtils.readData wraps a List<ByteBuf> from ChunkedNioFile.
Client multi-window read-verify (always, when one read spans more than one checksum window). Goes through Checksum.verifyChecksum(List<ByteString>, ...).
Operator-tuned writes with ozone.client.bytebuffer.increment > 0 (off by default), which uses IncrementalChunkBuffer.

When this does nothing

Default-config BOS client write. Each chunk is one contiguous 4 MB direct ByteBuffer (bufferIncrement = 0), so iterate() was already taking the no-copy slice path.
Default-config DN write-side validate. Proto3 parse produces a contiguous LiteralByteString, so asReadOnlyByteBufferList() returns a single buffer.
DN background scanner. Single direct buffer.

Microbenchmark

Checksum.computeChecksum(ChunkBuffer) over 4 MB with 1 MB bytesPerChecksum, measured in three configurations:

16 buf, current: master, where every 1 MB window straddles 4 of the 16 × 256 KB pieces.
16 buf, this PR: same shape, no linearization.
1 buffer, floor: one contiguous 4 MB ByteBuffer.

Each run used 30 warmup and 30 measurement iterations, well past the point where HotSpot C2 compiles the hot paths.

Xeon Silver 4416+, Ubuntu 22.04, OpenJDK 17.0.16

| Algorithm | 16 buf, current | 16 buf, this PR | 1 buffer, floor |
| --- | --- | --- | --- |
| CRC32 | 704 µs | 177 µs (75% faster) | 158 µs |
| CRC32C | 763 µs | 165 µs (78% faster) | 158 µs |
| SHA-256 | 3727 µs | 2997 µs (20% faster) | 3019 µs |
| MD5 | 7591 µs | 7037 µs (7% faster) | 7025 µs |

Apple M5 Max, OpenJDK Zulu 25.28+85

| Algorithm | 16 buf, current | 16 buf, this PR | 1 buffer, floor |
| --- | --- | --- | --- |
| CRC32 | 441 µs | 368 µs (17% faster) | 356 µs |
| CRC32C | 439 µs | 362 µs (18% faster) | 350 µs |
| SHA-256 | 1339 µs | 1239 µs (7% faster) | 1250 µs |
| MD5 | 4355 µs | 4256 µs (2% faster) | 4270 µs |

The wall-clock win is larger for CRC32/CRC32C (the per-window byte[1 MB] alloc plus memcpy dominates there) and larger on Xeon (x86 CRC32 ISA and PCLMULQDQ make the CRC itself ~40 µs/MB). On SHA-256 / MD5 the digest dominates so wall-clock barely moves, but the allocation is still gone.
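The "% faster" figures in the tables appear to be consistent with reduction in wall-clock time relative to the "current" column, rather than a throughput ratio. That reading is an inference from the numbers, not something the PR states. A small sketch of both calculations, with hypothetical names (`SpeedupSketch`, `timeReductionPercent`, `speedupFactor`):

```java
final class SpeedupSketch {

  /** Percentage of wall-clock time removed when going from baseline to improved. */
  static double timeReductionPercent(double baselineMicros, double improvedMicros) {
    return 100.0 * (baselineMicros - improvedMicros) / baselineMicros;
  }

  /** Throughput ratio: how many times more work per unit time. */
  static double speedupFactor(double baselineMicros, double improvedMicros) {
    return baselineMicros / improvedMicros;
  }

  public static void main(String[] args) {
    // Xeon CRC32 row: 704 µs on master, 177 µs with this PR.
    System.out.printf("time reduction: %.1f%%%n", timeReductionPercent(704, 177));
    System.out.printf("speedup factor: %.2fx%n", speedupFactor(704, 177));
  }
}
```

Under this reading, the Xeon CRC32 row corresponds to roughly a 4x throughput ratio, which is worth keeping in mind when comparing against benchmarks that report speedup factors.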

Compatibility

Wire format, on-disk format, and computed checksum bytes are bit-identical. Streaming update is the standard contract for both CRC and message-digest algorithms, so any mix of old and new clients and DNs interoperates. When the DN scanner re-checks previously persisted checksums, the recomputed values are byte-identical to the stored ones.
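The streaming contract that this compatibility argument relies on can be checked directly against the JDK's `MessageDigest`. The following is a minimal sketch, assuming only standard JDK APIs; `StreamingDigestSketch` and its methods are hypothetical names and are not part of the patch.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Random;

final class StreamingDigestSketch {

  private static MessageDigest newDigest(String algorithm) {
    try {
      return MessageDigest.getInstance(algorithm);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("Algorithm not available: " + algorithm, e);
    }
  }

  /** Digest computed with a single update over the whole array. */
  static byte[] oneShot(String algorithm, byte[] data) {
    return newDigest(algorithm).digest(data);
  }

  /** Digest computed by feeding consecutive ByteBuffer views of at most pieceSize bytes. */
  static byte[] incremental(String algorithm, byte[] data, int pieceSize) {
    MessageDigest md = newDigest(algorithm);
    for (int off = 0; off < data.length; off += pieceSize) {
      int len = Math.min(pieceSize, data.length - off);
      md.update(ByteBuffer.wrap(data, off, len)); // consumes the view's remaining bytes
    }
    return md.digest();
  }

  /** True when both ways of feeding the data yield the same digest bytes. */
  static boolean sameDigest(String algorithm, byte[] data, int pieceSize) {
    return Arrays.equals(oneShot(algorithm, data), incremental(algorithm, data, pieceSize));
  }

  public static void main(String[] args) {
    byte[] data = new byte[100_000];
    new Random(7).nextBytes(data);
    System.out.println(sameDigest("SHA-256", data, 333));
    System.out.println(sameDigest("MD5", data, 4096));
  }
}
```

Because the piece size does not affect the result, a reader or writer that splits the same bytes differently still arrives at the same stored checksum.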

Net summary

Default-config OM/DN throughput: unchanged.
Default-config heap pressure: unchanged.
Multi-buffer paths above: per-window byte[bytesPerChecksum] alloc and memcpy eliminated; CRC32/CRC32C wall-clock 75–78% faster on Xeon.
Codebase: removes a perf cliff so future config or gRPC framing changes that produce multi-buffer ChunkBuffers cannot silently regress.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15356

How was this patch tested?

Added TestChecksumMultiBuffer (8 cases across CRC32, CRC32C, SHA-256, and MD5), which verifies that single-buffer and split-buffer inputs produce bit-identical checksums for both aligned shapes (16 × 256 KB pieces, with each 1 MB window straddling 4 buffers) and unaligned shapes (333 KB pieces, totaling 3 × 1 MB plus 12,345 trailing bytes).

HDDS-15356. Make multi-buffer chunk checksum allocation-free

6b5406d

smengcl requested a review from jojochuang May 23, 2026 23:26

smengcl added performance AI-gen labels May 23, 2026

peterxcli self-requested a review May 24, 2026 00:49

smengcl wants to merge 1 commit into apache:master from smengcl:streaming-checksum
