Shut tile-compress pool down on streaming-write error path (#2276)#2279
Conversation
The streaming tiled-write path in ``_write_streaming`` built a ``ThreadPoolExecutor`` for parallel per-tile compression and called ``shutdown(wait=True)`` only after the tile-row loop completed. Any mid-stream raise (compression error, dask compute error, file write error) bypassed the shutdown and left worker threads alive. - Wrap the tile-row loop in ``try/finally`` so the pool is always shut down before the exception propagates. - Cancel still-pending futures before the final ``shutdown(wait=True)`` so the error path does not block on work we no longer need. - Tag the pool's worker threads with a distinctive ``thread_name_prefix`` so leak detection can tell them apart from dask's own offload / scheduler pools. Tests cover three mid-stream failure shapes plus a happy-path regression guard: compression failure, file-write failure, and the clean success path. Each test asserts both the pool's ``_shutdown`` flag and the absence of leaked worker threads named with the writer's prefix.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Shut tile-compress pool down on streaming-write error path (#2276)
Blockers (must fix before merge)
- None.
Suggestions (should fix, not blocking)
-
xrspatial/geotiff/_writer.py:1133-1143: the innertry/finallyclears_inflight_futures = []as soon as the list comprehension exits, including the exception path. Iffut.result()raises for fut[3] of 8 submitted, the outerfinallyat line 1161 sees an empty list and cannot cancel the still-pending fut[4..7].tile_pool.shutdown(wait=True)then blocks until those pending tasks drain. Either drop the inner clear, or calltile_pool.shutdown(wait=True, cancel_futures=True)(Python 3.9+) so pending work is dropped on the error path. The second option is simpler and removes the need for the_inflight_futuresbookkeeping entirely. -
xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py:21-23: the module docstring still describes the defaultThreadPoolExecutor-thread name, but the implementation filters on_WRITER_POOL_PREFIX = 'xrspatial-geotiff-tile-compress'. Update the docstring to match. -
CHANGELOG.md: no CHANGELOG entry. Recent bug-fix PRs add a bullet under "Unreleased / Bug fixes and improvements" referencing the issue number. Add one.
Nits (optional improvements)
-
xrspatial/geotiff/_writer.py:1026-1033: the'xrspatial-geotiff-tile-compress'literal is duplicated between the writer and the test. Consider a module-level constant in_writer.py(e.g._TILE_POOL_THREAD_PREFIX) so the test imports it. -
xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py:189-207:monkeypatch.setattr(os, 'fdopen', ...)patches the global. It is reverted at teardown, but anything in the same test session that callsos.fdopenduring the patched window gets intercepted. Scope is narrow enough here that it is fine to leave.
What looks good
try/finallyplacement is correct: pool construction happens before the try block, so the finally always has a validtile_poolto act on.- The
thread_name_prefixis an effective way to tell writer pools apart from dask's offload and scheduler pools when checking for leaks. - Three failure shapes plus a happy-path regression guard is solid coverage. The tests fail without the fix.
- Temp filenames include the issue number per project convention.
- Existing streaming-write suite (43 tests) still passes.
Checklist
- Algorithm matches reference/paper (N/A)
- All implemented backends produce consistent results (writer-only)
- NaN handling is correct (not touched)
- Edge cases covered by tests
- Dask chunk boundaries handled correctly (not touched)
- No premature materialization or unnecessary copies
- Benchmark exists or is not needed
- README feature matrix updated (N/A)
- Docstrings present and accurate (one minor drift noted)
) Review follow-up: - Replace the inner ``try/finally`` + ``_inflight_futures`` bookkeeping with ``tile_pool.shutdown(wait=True, cancel_futures=True)`` (Python 3.9+). On the error path, queued-but-not-started compress jobs are dropped instead of blocking the unwind on work we no longer need. - Hoist the ``xrspatial-geotiff-tile-compress`` thread-name prefix into a module-level ``_TILE_POOL_THREAD_PREFIX`` constant. The test now imports it from ``_writer`` so the two cannot drift. - Update the test module docstring to describe the actual filtering on the writer's prefix instead of the default ``ThreadPoolExecutor-`` name. - Add a CHANGELOG entry under "Unreleased / Bug fixes and improvements" referencing #2276. The ``monkeypatch.setattr(os, 'fdopen', ...)`` nit is dismissed -- pytest's monkeypatch reverts at teardown and the patched window is narrow.
brendancol
left a comment
There was a problem hiding this comment.
Follow-up review (after commit 1dcdb56)
All three Suggestions addressed. The first Nit applied; the second Nit dismissed with reason. No new findings.
Suggestion dispositions
xrspatial/geotiff/_writer.py:1133-1143(cancel_futures): fixed. Replaced the innertry/finally+_inflight_futuresbookkeeping withtile_pool.shutdown(wait=True, cancel_futures=True). The error path now drops queued work instead of blocking the unwind. Comment block updated to reflect the new approach.xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py:21-23(docstring drift): fixed. Module docstring rewritten to describe the writer-prefix filtering and the dask-pool false-positive rationale instead of the old default-prefix wording.CHANGELOG.md: fixed. Added bullet under "Unreleased / Bug fixes and improvements" referencing #2276.
Nit dispositions
- Shared
_TILE_POOL_THREAD_PREFIXconstant: fixed. Hoisted into_writer.pyas a module-level constant; the test now imports it fromwriter_modso a future rename of the prefix on the writer side updates the test in lockstep. os.fdopenmonkeypatch global scope: dismissed. Pytest'smonkeypatchfixture reverts the attribute at test teardown, the patched window covers only the body ofto_geotifffor one call, and the failure injection is deterministic on call count. No realistic risk of cross-test contamination.
Verification
pytest xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py xrspatial/geotiff/tests/test_streaming_write_parallel.py xrspatial/geotiff/tests/test_streaming_write.py-- 37 pass, 1 skipped (perf gate).
No remaining findings. Pool shutdown is now guaranteed on every exit path, queued work is dropped on the error path, and the test/code constant stays in sync.
Closes #2276
What
try/finallyso the per-tile compressionThreadPoolExecutoris shut down on every exit path, including mid-stream exceptions.shutdown(wait=True)so the error path does not block on work we no longer need.thread_name_prefix='xrspatial-geotiff-tile-compress'so leak detection can tell them apart from dask's offload / scheduler pools.Why
Before this change,
tile_pool.shutdown(wait=True)lived past the tile-row loop. A raise from_compress_block, from a dask compute, or from the sequential file write bypassed the shutdown call. The outertry/except BaseExceptionone frame up cleans up the temp file but never touches the pool, so worker threads outlive the failed write.Backend coverage
Writer-only path (numpy / dask). No GPU code touched.
Test plan
xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.pycovers three mid-stream failure shapes and the happy path:test_pool_shutdown_on_compress_failureraises from inside_compress_blockafter a few successful callstest_pool_shutdown_on_file_write_failureraises from the sequential file-write step after the parallel compress already rantest_pool_shutdown_on_happy_pathis a regression guard that the rewrite did not break clean-exit shutdown_shutdownflag and the absence of leaked worker threads named with the writer's prefix.pytest xrspatial/geotiff/tests/test_streaming_write_parallel.py xrspatial/geotiff/tests/test_streaming_write.py xrspatial/geotiff/tests/test_streaming_codecs_2026_05_11.py-- 43 pass, 1 skipped (perf gate).