Skip to content

pre-size chunk buffer to avoid realloc fragmentation; fix nbit/scaleoffset no-op return values#6389

Open
brtnfld wants to merge 10 commits into
HDFGroup:developfrom
brtnfld:4481
Open

pre-size chunk buffer to avoid realloc fragmentation; fix nbit/scaleoffset no-op return values#6389
brtnfld wants to merge 10 commits into
HDFGroup:developfrom
brtnfld:4481

Conversation

@brtnfld
Copy link
Copy Markdown
Collaborator

@brtnfld brtnfld commented May 1, 2026

  1. Pre-size chunk buffer before the filter pipeline (H5Dchunk.c)

When reading a filtered chunk, the decompression buffer was initialized at the compressed on-disk size. Filters like deflate grow this buffer via repeated realloc() calls, doubling its size when the output exceeds capacity. On Windows, this fragments the heap and causes read times to increase steadily across successive iterations.

The fix pre-sizes the buffer to the full uncompressed chunk_size when a filter pipeline is active, so the output buffer is large enough from the start and no realloc is needed.

Fixes #4481 and Fixes #4513.

  1. Return nbytes instead of *buf_size in nbit and scaleoffset no-op paths (H5Znbit.c, H5Zscaleoffset.c)

Per the HDF5 filter API, nbytes is the count of valid data bytes in the buffer; *buf_size is the allocated buffer capacity. These two values were historically always equal on the read path, so either worked in practice. The pre-sizing change above makes *buf_size > nbytes on compressed reads for the first time, exposing two latent bugs:

H5Znbit.c: the no-compress pass-through path (cd_values[1] == 1) returned *buf_size instead of nbytes. The downstream filter (e.g., Fletcher32) then treated the full pre-sized buffer as valid input and computed a checksum over uninitialized bytes, causing filter pipeline failures.
H5Zscaleoffset.c: the no-process path (when the data already spans the full bit range and no scaling is needed) had the same *buf_size vs nbytes mistake. No current test exercises this path through a multi-filter pipeline where the size mismatch would be observable, but the fix is correct by the same API reasoning.

@github-project-automation github-project-automation Bot moved this to To be triaged in HDF5 - TRIAGE & TRACK May 1, 2026
@brtnfld brtnfld marked this pull request as draft May 1, 2026 17:24
@brtnfld brtnfld added the Component - C Library Core C library issues (usually in the src directory) label May 1, 2026
@brtnfld
Copy link
Copy Markdown
Collaborator Author

brtnfld commented May 1, 2026

I will need to do a PR to netcdf for a fix.

Comment thread src/H5Dchunk.c Outdated
* This avoids repeated realloc() inside the filter pipeline (e.g., deflate's
* doubling strategy when the output buffer fills), which on Windows fragments
* the heap and causes read time to increase steadily across iterations. */
if ((old_pline && old_pline->nused) && buf_alloc < chunk_size)
Copy link
Copy Markdown
Member

@fortnern fortnern May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you need the first condition - you can probably assert that chunk_nbytes == chunk_size if there are no filters (I think it was checked earlier):

if (buf_alloc < chunk_size) {
    assert(old_pline && old_pline->nused);
    buf_alloc = chunk_size;
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised it.

@brtnfld brtnfld force-pushed the 4481 branch 2 times, most recently from fb096a0 to 6ab8428 Compare May 1, 2026 19:26
@brtnfld brtnfld marked this pull request as ready for review May 1, 2026 21:37
fortnern
fortnern previously approved these changes May 1, 2026
brtnfld added 2 commits May 8, 2026 08:56
…Windows

When reading a filtered (compressed) chunk, filters like deflate initialize
their output buffer using *buf_size (the pipeline's buffer capacity hint).
With *buf_size set to chunk_disk_size (the compressed size), deflate grows
its output buffer via repeated realloc() doubling. On Windows this fragments
the heap and causes read times to increase steadily across iterations.

Set buf_alloc to MAX(chunk_disk_size, chunk_size) immediately before calling
H5Z_pipeline so filters pre-allocate their output at the full uncompressed
size and avoid realloc. The inbuf is still allocated at chunk_disk_size —
only the hint passed to the pipeline is enlarged — so there is no
double-large-allocation overhead and peak memory stays at
chunk_disk_size + chunk_size rather than 2 * chunk_size.
For incompressible chunks where chunk_disk_size > chunk_size the hint
remains at chunk_disk_size, satisfying H5Z_pipeline's post-filter size
validation.

Fixes HDFGroup#4481 and HDFGroup#4513.
… paths

Per the HDF5 filter API, nbytes is the count of valid data bytes in the
buffer, while *buf_size is the allocated buffer capacity. These two values
were historically always equal on the read path, so either worked in
practice. The pre-sizing change (which sets buf_alloc = chunk_size before
the filter pipeline) makes *buf_size > nbytes for compressed reads,
exposing the latent bug.

In H5Znbit.c: the no-compress pass-through (cd_values[1] == 1) was
returning *buf_size, causing the next filter (e.g. fletcher32) to treat
the full pre-sized buffer as valid data and compute a checksum over
uninitialised bytes.

In H5Zscaleoffset.c: the no-process path had the same pattern. No current
test exercises this path through a multi-filter pipeline where the size
mismatch would be observable, but the fix is correct by the same API
reasoning.
brtnfld and others added 3 commits May 8, 2026 10:33
Instead of allocating a fresh output buffer (H5MM_malloc), copy the
compressed input to a small temp buffer and resize *buf in-place via
H5resize_memory. On Windows, HeapReAlloc can often extend an existing
heap block without moving it, avoiding the cost of finding and committing
a fresh large allocation for every chunk read.

This is combined with the pre-sizing hint in H5Dchunk.c that sets
buf_alloc = chunk_size before the pipeline call, so the resize goes
directly to chunk_size in one step with no realloc loop.
…pacity

Replace the hint-only buf_alloc enlargement with an actual
H5D__chunk_mem_realloc() before calling H5Z_pipeline.  Every filter now
sees *buf_size equal to the real buffer capacity, not just an advisory
hint.  This fixes a bounds-check regression in H5Zscaleoffset where the
read path uses *buf_size as the end-of-buffer sentinel in
H5_IS_BUFFER_OVERFLOW and as the input-size argument to
H5Z__scaleoffset_decompress; an inflated hint caused false-negative
overflow checks and potential over-reads when scaleoffset is the on-disk
filter.

H5Zdeflate is simplified accordingly: the up-front H5resize_memory(*buf,
nalloc) is removed since the buffer is already at nalloc on entry.  The
inbuf copy is retained (still needed to read compressed input while
writing uncompressed output into the same buffer), as is the
realloc-doubling loop for the uncommon case where output exceeds
chunk_size.

Also add test/chunk_deflate_perf.c, a standalone benchmark that times
per-chunk deflate reads over multiple passes to detect the steady read-time
increase caused by heap fragmentation on Windows (issues HDFGroup#4481 / HDFGroup#4513).
brtnfld added 5 commits May 8, 2026 11:08
Add optional 4th argument to specify the output HDF5 file path
(default: chunk_deflate_perf.h5 in CWD).  Allows running develop
and fix builds against separate files so pass 1 is cold-cache for
both and the two runs don't share page-cache state.
Not suitable for the test suite; intended for manual Windows validation
only.  Keep locally if needed.
…pacity

Allocate the chunk read buffer at MAX(chunk_disk_size, chunk_size) from
the start rather than allocating at chunk_disk_size and immediately
reallocating.  The read only fills chunk_disk_size bytes regardless of
buffer size, so there is no cost to the larger initial allocation and the
separate realloc step is eliminated.

Every filter now receives *buf_size equal to the actual buffer capacity
with no additional allocation needed.  On Windows a single HeapAlloc at
the correct size avoids the repeated realloc-doubling in the deflate
filter that fragments the heap and causes read times to increase over
successive iterations (issues HDFGroup#4481 / HDFGroup#4513).  Also fixes the
scaleoffset bounds-check regression where *buf_size was used as an
end-of-buffer sentinel.
Reverts the deflate in-place decompression experiment and the H5Dchunk.c
pre-sizing changes back to the state at 9ab5e19, keeping the nbit
and scaleoffset no-op path fixes.
…ipeline

The previous approach allocated the buffer at chunk_disk_size and then
bumped buf_alloc to chunk_size as a hint to the pipeline, causing *buf_size
to misrepresent the actual allocation.  Any filter that writes up to
*buf_size bytes into *buf would overflow.

Allocate at MAX(chunk_disk_size, chunk_size) upfront so the buffer and
the hint given to filters are always consistent.  For incompressible chunks
where chunk_disk_size >= chunk_size the allocation is unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component - C Library Core C library issues (usually in the src directory)

Projects

Status: To be triaged

3 participants