HDDS-15356. Make multi-buffer chunk checksum allocation-free#10350
Draft
smengcl wants to merge 1 commit into
Draft
HDDS-15356. Make multi-buffer chunk checksum allocation-free#10350smengcl wants to merge 1 commit into
smengcl wants to merge 1 commit into
Conversation
Generated-by: Claude Code (Opus 4.7)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Generated-by: Claude Code (Opus 4.7)
What changes were proposed in this pull request?
A multi-buffer
ChunkBufferis one whose data is stored as more than one underlyingByteBufferinstead of a single contiguous block. In production this shape occurs when the data was assembled from a list of NettyByteBufs (Ratis state-machine-data read viaChunkedNioFile), when the verifier received aList<ByteString>spanning more than one checksum window (client read-verify), or when an operator opted intoIncrementalChunkBufferviaozone.client.bytebuffer.increment > 0.Checksum.computeChecksum(ChunkBuffer)previously delegated toChunkBuffer.iterate(bytesPerChecksum), which allocates a freshbyte[bytesPerChecksum](1 MB by default) and memcpys into it whenever a checksum window straddles two of those underlying ByteBuffers.This change rewrites both the no-cache and cache compute paths to walk
data.asByteBufferList()directly, slicing each window via a non-copyingByteBuffer.duplicate()helper (BufferUtils.slice) and feeding slices to a newChecksum.StreamingChecksumstrategy that wrapsChecksumByteBuffer(CRC32, CRC32C) orMessageDigest(SHA-256, MD5). Theupdate(ByteBuffer)contracts of both define incremental update as byte-equivalent to a single update over the concatenation, so output bytes are bit-identical.KeyValueHandler.validateChunkChecksumDatais also switched to the multi-buffer overload ofChecksum.verifyChecksum, which now benefits from the same walk.When this helps
The fix only matters when the input
ChunkBufferis multi-buffer.ChunkUtils.readDatawraps aList<ByteBuf>fromChunkedNioFile.Checksum.verifyChecksum(List<ByteString>, ...).ozone.client.bytebuffer.increment > 0(off by default), which usesIncrementalChunkBuffer.When this does nothing
ByteBuffer(bufferIncrement = 0), soiterate()was already taking the no-copy slice path.LiteralByteString, soasReadOnlyByteBufferList()returns a single buffer.Microbenchmark
Checksum.computeChecksum(ChunkBuffer)over 4 MB with 1 MBbytesPerChecksum. 16 buf, current = master, every 1 MB window straddles 4 of 16 × 256 KB pieces. 16 buf, this PR = same shape, no linearization. 1 buffer, floor = one contiguous 4 MBByteBuffer. (30 warmup, 30 measurement iterations, well above HotSpot C2.)Xeon Silver 4416+, Ubuntu 22.04, OpenJDK 17.0.16
Apple M5 Max, OpenJDK Zulu 25.28+85
The wall-clock win is larger for CRC32/CRC32C (the per-window
byte[1 MB]alloc plus memcpy dominates there) and larger on Xeon (x86CRC32ISA andPCLMULQDQmake the CRC itself ~40 µs/MB). On SHA-256 / MD5 the digest dominates so wall-clock barely moves, but the allocation is still gone.Compatibility
Wire format, on-disk format, and computed checksum bytes are bit-identical. Streaming update is the standard contract for both CRC and message-digest algorithms, so any mix of old and new clients and DNs interoperate. The DN scanner re-checking previously persisted checksums produces byte-identical recomputed values.
Net summary
byte[bytesPerChecksum]alloc and memcpy eliminated; CRC32/CRC32C wall-clock 75–78 % faster on Xeon.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15356
How was this patch tested?