pre-size chunk buffer to avoid realloc fragmentation; fix nbit/scaleoffset no-op return values#6389
Open
brtnfld wants to merge 10 commits into
Open
pre-size chunk buffer to avoid realloc fragmentation; fix nbit/scaleoffset no-op return values#6389brtnfld wants to merge 10 commits into
brtnfld wants to merge 10 commits into
Conversation
Collaborator
Author
|
I will need to do a PR to netcdf for a fix. |
fortnern
reviewed
May 1, 2026
| * This avoids repeated realloc() inside the filter pipeline (e.g., deflate's | ||
| * doubling strategy when the output buffer fills), which on Windows fragments | ||
| * the heap and causes read time to increase steadily across iterations. */ | ||
| if ((old_pline && old_pline->nused) && buf_alloc < chunk_size) |
Member
There was a problem hiding this comment.
I'm not sure you need the first condition - you can probably assert that chunk_nbytes == chunk_size if there are no filters (I think it was checked earlier):
if (buf_alloc < chunk_size) {
assert(old_pline && old_pline->nused);
buf_alloc = chunk_size;
}
fb096a0 to
6ab8428
Compare
fortnern
previously approved these changes
May 1, 2026
f1f1b72 to
bc4511b
Compare
…Windows When reading a filtered (compressed) chunk, filters like deflate initialize their output buffer using *buf_size (the pipeline's buffer capacity hint). With *buf_size set to chunk_disk_size (the compressed size), deflate grows its output buffer via repeated realloc() doubling. On Windows this fragments the heap and causes read times to increase steadily across iterations. Set buf_alloc to MAX(chunk_disk_size, chunk_size) immediately before calling H5Z_pipeline so filters pre-allocate their output at the full uncompressed size and avoid realloc. The inbuf is still allocated at chunk_disk_size — only the hint passed to the pipeline is enlarged — so there is no double-large-allocation overhead and peak memory stays at chunk_disk_size + chunk_size rather than 2 * chunk_size. For incompressible chunks where chunk_disk_size > chunk_size the hint remains at chunk_disk_size, satisfying H5Z_pipeline's post-filter size validation. Fixes HDFGroup#4481 and HDFGroup#4513.
… paths Per the HDF5 filter API, nbytes is the count of valid data bytes in the buffer, while *buf_size is the allocated buffer capacity. These two values were historically always equal on the read path, so either worked in practice. The pre-sizing change (which sets buf_alloc = chunk_size before the filter pipeline) makes *buf_size > nbytes for compressed reads, exposing the latent bug. In H5Znbit.c: the no-compress pass-through (cd_values[1] == 1) was returning *buf_size, causing the next filter (e.g. fletcher32) to treat the full pre-sized buffer as valid data and compute a checksum over uninitialised bytes. In H5Zscaleoffset.c: the no-process path had the same pattern. No current test exercises this path through a multi-filter pipeline where the size mismatch would be observable, but the fix is correct by the same API reasoning.
Instead of allocating a fresh output buffer (H5MM_malloc), copy the compressed input to a small temp buffer and resize *buf in-place via H5resize_memory. On Windows, HeapReAlloc can often extend an existing heap block without moving it, avoiding the cost of finding and committing a fresh large allocation for every chunk read. This is combined with the pre-sizing hint in H5Dchunk.c that sets buf_alloc = chunk_size before the pipeline call, so the resize goes directly to chunk_size in one step with no realloc loop.
…pacity Replace the hint-only buf_alloc enlargement with an actual H5D__chunk_mem_realloc() before calling H5Z_pipeline. Every filter now sees *buf_size equal to the real buffer capacity, not just an advisory hint. This fixes a bounds-check regression in H5Zscaleoffset where the read path uses *buf_size as the end-of-buffer sentinel in H5_IS_BUFFER_OVERFLOW and as the input-size argument to H5Z__scaleoffset_decompress; an inflated hint caused false-negative overflow checks and potential over-reads when scaleoffset is the on-disk filter. H5Zdeflate is simplified accordingly: the up-front H5resize_memory(*buf, nalloc) is removed since the buffer is already at nalloc on entry. The inbuf copy is retained (still needed to read compressed input while writing uncompressed output into the same buffer), as is the realloc-doubling loop for the uncommon case where output exceeds chunk_size. Also add test/chunk_deflate_perf.c, a standalone benchmark that times per-chunk deflate reads over multiple passes to detect the steady read-time increase caused by heap fragmentation on Windows (issues HDFGroup#4481 / HDFGroup#4513).
Add optional 4th argument to specify the output HDF5 file path (default: chunk_deflate_perf.h5 in CWD). Allows running develop and fix builds against separate files so pass 1 is cold-cache for both and the two runs don't share page-cache state.
Not suitable for the test suite; intended for manual Windows validation only. Keep locally if needed.
…pacity Allocate the chunk read buffer at MAX(chunk_disk_size, chunk_size) from the start rather than allocating at chunk_disk_size and immediately reallocating. The read only fills chunk_disk_size bytes regardless of buffer size, so there is no cost to the larger initial allocation and the separate realloc step is eliminated. Every filter now receives *buf_size equal to the actual buffer capacity with no additional allocation needed. On Windows a single HeapAlloc at the correct size avoids the repeated realloc-doubling in the deflate filter that fragments the heap and causes read times to increase over successive iterations (issues HDFGroup#4481 / HDFGroup#4513). Also fixes the scaleoffset bounds-check regression where *buf_size was used as an end-of-buffer sentinel.
Reverts the deflate in-place decompression experiment and the H5Dchunk.c pre-sizing changes back to the state at 9ab5e19, keeping the nbit and scaleoffset no-op path fixes.
…ipeline The previous approach allocated the buffer at chunk_disk_size and then bumped buf_alloc to chunk_size as a hint to the pipeline, causing *buf_size to misrepresent the actual allocation. Any filter that writes up to *buf_size bytes into *buf would overflow. Allocate at MAX(chunk_disk_size, chunk_size) upfront so the buffer and the hint given to filters are always consistent. For incompressible chunks where chunk_disk_size >= chunk_size the allocation is unchanged.
fortnern
approved these changes
May 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When reading a filtered chunk, the decompression buffer was initialized at the compressed on-disk size. Filters like deflate grow this buffer via repeated realloc() calls, doubling its size when the output exceeds capacity. On Windows, this fragments the heap and causes read times to increase steadily across successive iterations.
The fix pre-sizes the buffer to the full uncompressed chunk_size when a filter pipeline is active, so the output buffer is large enough from the start and no realloc is needed.
Fixes #4481 and Fixes #4513.
Per the HDF5 filter API, nbytes is the count of valid data bytes in the buffer; *buf_size is the allocated buffer capacity. These two values were historically always equal on the read path, so either worked in practice. The pre-sizing change above makes *buf_size > nbytes on compressed reads for the first time, exposing two latent bugs:
H5Znbit.c: the no-compress pass-through path (cd_values[1] == 1) returned *buf_size instead of nbytes. The downstream filter (e.g., Fletcher32) then treated the full pre-sized buffer as valid input and computed a checksum over uninitialized bytes, causing filter pipeline failures.
H5Zscaleoffset.c: the no-process path (when the data already spans the full bit range and no scaling is needed) had the same *buf_size vs nbytes mistake. No current test exercises this path through a multi-filter pipeline where the size mismatch would be observable, but the fix is correct by the same API reasoning.