
HDDS-15356. Make multi-buffer chunk checksum allocation-free#10350

Draft
smengcl wants to merge 1 commit into apache:master from smengcl:streaming-checksum


@smengcl smengcl commented May 23, 2026

Generated-by: Claude Code (Opus 4.7)

What changes were proposed in this pull request?

A multi-buffer ChunkBuffer is one whose data is stored as more than one underlying ByteBuffer instead of a single contiguous block. In production this shape occurs when the data was assembled from a list of Netty ByteBufs (Ratis state-machine-data read via ChunkedNioFile), when the verifier received a List<ByteString> spanning more than one checksum window (client read-verify), or when an operator opted into IncrementalChunkBuffer via ozone.client.bytebuffer.increment > 0.

Checksum.computeChecksum(ChunkBuffer) previously delegated to ChunkBuffer.iterate(bytesPerChecksum), which allocates a fresh byte[bytesPerChecksum] (1 MB by default) and memcpys into it whenever a checksum window straddles two of those underlying ByteBuffers.

This change rewrites both the no-cache and cache compute paths to walk data.asByteBufferList() directly, slicing each window via a non-copying ByteBuffer.duplicate() helper (BufferUtils.slice) and feeding slices to a new Checksum.StreamingChecksum strategy that wraps ChecksumByteBuffer (CRC32, CRC32C) or MessageDigest (SHA-256, MD5). The update(ByteBuffer) contracts of both define incremental update as byte-equivalent to a single update over the concatenation, so output bytes are bit-identical.
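The walk described above can be sketched with JDK classes only. This is a minimal illustration, not the PR's code: `WindowedCrcSketch`, `takeSlice`, and `windowedCrc32` are hypothetical names, `java.util.zip.CRC32` stands in for Ozone's `ChecksumByteBuffer`, and a plain `List<ByteBuffer>` stands in for `data.asByteBufferList()`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.CRC32;

final class WindowedCrcSketch {
  /** Non-copying view of the next len bytes of src; advances src's position. */
  static ByteBuffer takeSlice(ByteBuffer src, int len) {
    ByteBuffer view = src.duplicate();      // shares content, independent position/limit
    view.limit(view.position() + len);
    src.position(src.position() + len);
    return view;
  }

  /** One CRC32 value per window of bytesPerChecksum bytes, walking the buffers in order. */
  static List<Long> windowedCrc32(List<ByteBuffer> buffers, int bytesPerChecksum) {
    List<Long> out = new ArrayList<>();
    CRC32 crc = new CRC32();
    int filled = 0;                          // bytes fed into the current window so far
    for (ByteBuffer b : buffers) {
      ByteBuffer cursor = b.duplicate();     // leave the caller's position untouched
      while (cursor.hasRemaining()) {
        int n = Math.min(cursor.remaining(), bytesPerChecksum - filled);
        crc.update(takeSlice(cursor, n));    // incremental update, no byte[] staging copy
        filled += n;
        if (filled == bytesPerChecksum) {    // window complete: emit and start the next one
          out.add(crc.getValue());
          crc.reset();
          filled = 0;
        }
      }
    }
    if (filled > 0) {
      out.add(crc.getValue());               // trailing partial window
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] data = new byte[4096];
    new java.util.Random(1).nextBytes(data);
    List<ByteBuffer> pieces = new ArrayList<>();
    for (int off = 0; off < data.length; off += 256) {
      pieces.add(ByteBuffer.wrap(data, off, 256));
    }
    List<Long> split = windowedCrc32(pieces, 1024);
    List<Long> whole = windowedCrc32(Collections.singletonList(ByteBuffer.wrap(data)), 1024);
    System.out.println("split equals whole: " + split.equals(whole));
  }
}
```

The only state carried across buffer boundaries is the running CRC and the count of bytes already fed into the current window, which is why no staging array is needed when a window straddles two buffers.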

KeyValueHandler.validateChunkChecksumData is also switched to the multi-buffer overload of Checksum.verifyChecksum, which now benefits from the same walk.

When this helps

The fix only matters when the input ChunkBuffer is multi-buffer.

  • Ratis state-machine-data read-replay validate (always). ChunkUtils.readData wraps a List<ByteBuf> from ChunkedNioFile.
  • Client multi-window read-verify (always, when one read spans more than one checksum window). Goes through Checksum.verifyChecksum(List<ByteString>, ...).
  • Operator-tuned writes with ozone.client.bytebuffer.increment > 0 (off by default), which uses IncrementalChunkBuffer.
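Whether a given layout hits the multi-buffer path comes down to how many underlying pieces each checksum window touches. The sketch below works that out for equal-sized pieces; `StraddleSketch` and `piecesPerWindow` are hypothetical names used only for illustration and are not part of Ozone.

```java
import java.util.Arrays;

final class StraddleSketch {
  /** For equal-sized pieces, the number of pieces each checksum window touches. */
  static int[] piecesPerWindow(long totalBytes, int pieceSize, int windowSize) {
    int windows = (int) ((totalBytes + windowSize - 1) / windowSize);
    int[] touched = new int[windows];
    for (int w = 0; w < windows; w++) {
      long start = (long) w * windowSize;
      long end = Math.min(start + windowSize, totalBytes);  // exclusive
      long firstPiece = start / pieceSize;
      long lastPiece = (end - 1) / pieceSize;
      touched[w] = (int) (lastPiece - firstPiece + 1);
    }
    return touched;
  }

  public static void main(String[] args) {
    int mb = 1024 * 1024;
    // 4 MB split into 256 KB pieces, 1 MB checksum windows
    System.out.println(Arrays.toString(piecesPerWindow(4L * mb, 256 * 1024, mb)));
    // 4 MB as one contiguous piece
    System.out.println(Arrays.toString(piecesPerWindow(4L * mb, 4 * mb, mb)));
  }
}
```

Any window whose count is greater than 1 straddles a buffer boundary and is a window where the old path allocated and copied; a count of 1 means the window could already be served by a plain slice.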

When this does nothing

  • Default-config BOS client write. Each chunk is one contiguous 4 MB direct ByteBuffer (bufferIncrement = 0), so iterate() was already taking the no-copy slice path.
  • Default-config DN write-side validate. Proto3 parse produces a contiguous LiteralByteString, so asReadOnlyByteBufferList() returns a single buffer.
  • DN background scanner. Single direct buffer.

Microbenchmark

Checksum.computeChecksum(ChunkBuffer) over 4 MB of data with bytesPerChecksum = 1 MB, using 30 warmup and 30 measurement iterations (well past the HotSpot C2 compilation threshold). The columns are:

  • 16 buf, current: master; every 1 MB window straddles 4 of the 16 × 256 KB pieces.
  • 16 buf, this PR: same shape, no linearization.
  • 1 buffer, floor: one contiguous 4 MB ByteBuffer.

Xeon Silver 4416+, Ubuntu 22.04, OpenJDK 17.0.16

| Algorithm | 16 buf, current | 16 buf, this PR | 1 buffer, floor |
|---|---|---|---|
| CRC32 | 704 µs | 177 µs (75% faster) | 158 µs |
| CRC32C | 763 µs | 165 µs (78% faster) | 158 µs |
| SHA-256 | 3727 µs | 2997 µs (20% faster) | 3019 µs |
| MD5 | 7591 µs | 7037 µs (7% faster) | 7025 µs |

Apple M5 Max, OpenJDK Zulu 25.28+85

| Algorithm | 16 buf, current | 16 buf, this PR | 1 buffer, floor |
|---|---|---|---|
| CRC32 | 441 µs | 368 µs (17% faster) | 356 µs |
| CRC32C | 439 µs | 362 µs (18% faster) | 350 µs |
| SHA-256 | 1339 µs | 1239 µs (7% faster) | 1250 µs |
| MD5 | 4355 µs | 4256 µs (2% faster) | 4270 µs |

The wall-clock win is larger for CRC32/CRC32C, where the per-window byte[1 MB] allocation plus memcpy dominates, and larger on Xeon, where the x86 CRC32 instruction and PCLMULQDQ bring the CRC itself down to ~40 µs/MB. On SHA-256 and MD5 the digest computation dominates, so wall-clock barely moves, but the allocation is still gone.
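The percentages and the per-megabyte figure follow from the reported timings by simple arithmetic, where "N% faster" is read as an N% reduction in elapsed time. A small sketch of that calculation, with hypothetical helper names and the Xeon CRC32 row as input:

```java
final class BenchArithmeticSketch {
  /** Percentage of elapsed time removed when going from before to after. */
  static double timeReductionPercent(double beforeMicros, double afterMicros) {
    return 100.0 * (beforeMicros - afterMicros) / beforeMicros;
  }

  /** Average cost per megabyte for a run over the given amount of data. */
  static double microsPerMb(double totalMicros, double megabytes) {
    return totalMicros / megabytes;
  }

  public static void main(String[] args) {
    // Xeon CRC32 row: 704 us on master, 177 us with this PR, 158 us single-buffer floor over 4 MB
    System.out.printf("time reduction: %.1f%%%n", timeReductionPercent(704, 177));
    System.out.printf("floor cost: %.1f us/MB%n", microsPerMb(158, 4));
  }
}
```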

Compatibility

Wire format, on-disk format, and computed checksum bytes are bit-identical. Streaming update is the standard contract for both CRC and message-digest algorithms, so any mix of old and new clients and DNs interoperates. When the DN scanner re-checks previously persisted checksums, the recomputed values are byte-identical.
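The streaming contract can be checked directly against the JDK's `MessageDigest`: feeding the pieces one by one yields the same digest as a single update over the concatenated bytes. The class and method names below are hypothetical and the piece size is arbitrary; this is a sketch of the property, not the PR's test.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class DigestStreamingSketch {
  /** Digest of the logical concatenation of pieces, fed incrementally without copying. */
  static byte[] digestOf(String algorithm, List<ByteBuffer> pieces) {
    try {
      MessageDigest md = MessageDigest.getInstance(algorithm);
      for (ByteBuffer p : pieces) {
        md.update(p.duplicate());            // duplicate so the caller's position is preserved
      }
      return md.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Splits data into non-copying views of at most pieceSize bytes each. */
  static List<ByteBuffer> split(byte[] data, int pieceSize) {
    List<ByteBuffer> pieces = new ArrayList<>();
    for (int off = 0; off < data.length; off += pieceSize) {
      pieces.add(ByteBuffer.wrap(data, off, Math.min(pieceSize, data.length - off)));
    }
    return pieces;
  }

  public static void main(String[] args) {
    byte[] data = new byte[10_000];
    new java.util.Random(42).nextBytes(data);
    for (String alg : new String[] {"SHA-256", "MD5"}) {
      byte[] whole = digestOf(alg, Collections.singletonList(ByteBuffer.wrap(data)));
      byte[] pieces = digestOf(alg, split(data, 333));
      System.out.println(alg + " identical: " + Arrays.equals(whole, pieces));
    }
  }
}
```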

Net summary

  • Default-config OM/DN throughput: unchanged.
  • Default-config heap pressure: unchanged.
  • Multi-buffer paths above: per-chunk byte[bytesPerChecksum] allocation and memcpy eliminated; CRC32/CRC32C wall-clock 75–78% faster on Xeon.
  • Codebase: removes a perf cliff so future config or gRPC framing changes that produce multi-buffer ChunkBuffers cannot silently regress.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15356

How was this patch tested?

  • Added TestChecksumMultiBuffer (8 cases across CRC32, CRC32C, SHA-256, MD5), which verifies that single-buffer and split-buffer inputs produce bit-identical checksums for both aligned shapes (16 × 256 KB, with each 1 MB window straddling 4 buffers) and unaligned shapes (333 KB pieces, total 3 × 1 MB plus 12,345 trailing bytes).
