Shut tile-compress pool down on streaming-write error path (#2276) #2279

Merged: brendancol merged 2 commits into main from issue-2276 on May 22, 2026

@brendancol
Contributor

Closes #2276

What

  • Wrap the streaming-write tile loop in try/finally so the per-tile compression ThreadPoolExecutor is shut down on every exit path, including mid-stream exceptions.
  • Cancel still-pending futures before shutdown(wait=True) so the error path does not block on work we no longer need.
  • Tag the pool's worker threads with thread_name_prefix='xrspatial-geotiff-tile-compress' so leak detection can tell them apart from dask's offload / scheduler pools.
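
The three changes above can be sketched in a few lines. This is a minimal illustration, not the code from `_writer.py`: `compress_tile`, `write_rows`, and `live_pool_threads` are hypothetical stand-ins, and only the thread-name prefix value is taken from the description.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Prefix value quoted in the PR description; every other name is hypothetical.
THREAD_PREFIX = "xrspatial-geotiff-tile-compress"


def compress_tile(tile: bytes) -> bytes:
    """Stand-in for the per-tile compressor; fails on a sentinel tile."""
    if tile == b"bad":
        raise ValueError("compression failed")
    return tile[::-1]


def write_rows(rows, sink, max_workers=2):
    """Compress each row's tiles in parallel, then append them to sink in order."""
    # The pool is built before the try block, so the finally always has it.
    pool = ThreadPoolExecutor(max_workers=max_workers,
                              thread_name_prefix=THREAD_PREFIX)
    pending = []
    try:
        for row in rows:
            pending = [pool.submit(compress_tile, tile) for tile in row]
            for fut in pending:
                sink.append(fut.result())  # worker exceptions re-raise here
    finally:
        # Runs on the clean path and on every exception path.
        for fut in pending:
            fut.cancel()  # no effect on futures that are running or done
        pool.shutdown(wait=True)


def live_pool_threads():
    """Names of live threads that carry the writer's prefix."""
    return [t.name for t in threading.enumerate()
            if t.name.startswith(THREAD_PREFIX)]
```

Calling `write_rows([[b"ab", b"cd"], [b"bad", b"ef"]], sink)` raises `ValueError` on the second row, yet `live_pool_threads()` is empty afterwards because the `finally` joined the workers before the exception propagated.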

Why

Before this change, tile_pool.shutdown(wait=True) was called only after the tile-row loop completed. A raise from _compress_block, from a dask compute, or from the sequential file write bypassed the shutdown call. The outer try/except BaseException one frame up cleans up the temp file but never touches the pool, so worker threads outlived the failed write.

Backend coverage

Writer-only path (numpy / dask). No GPU code touched.

Test plan

  • xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py covers three mid-stream failure shapes and the happy path:
    • test_pool_shutdown_on_compress_failure raises from inside _compress_block after a few successful calls
    • test_pool_shutdown_on_file_write_failure raises from the sequential file-write step after the parallel compress already ran
    • test_pool_shutdown_on_happy_path is a regression guard that the rewrite did not break clean-exit shutdown
  • Each test asserts both the pool's _shutdown flag and the absence of leaked worker threads named with the writer's prefix.
  • Confirmed tests fail without the fix.
  • pytest xrspatial/geotiff/tests/test_streaming_write_parallel.py xrspatial/geotiff/tests/test_streaming_write.py xrspatial/geotiff/tests/test_streaming_codecs_2026_05_11.py -- 43 pass, 1 skipped (perf gate).
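
The two assertions each test makes can be illustrated on a bare pool. This is a hedged sketch, not the test module itself: `leaked_workers` and `demo` are hypothetical helpers, and `_shutdown` is a private CPython attribute of `ThreadPoolExecutor` that could change between versions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Prefix value quoted in the PR description.
PREFIX = "xrspatial-geotiff-tile-compress"


def leaked_workers(prefix=PREFIX):
    """Names of live threads whose name starts with the given prefix."""
    return [t.name for t in threading.enumerate() if t.name.startswith(prefix)]


def demo():
    writer_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix=PREFIX)
    # Default-named pool, standing in for an unrelated pool such as dask's.
    other_pool = ThreadPoolExecutor(max_workers=1)
    writer_pool.submit(int, "1").result()
    other_pool.submit(int, "2").result()

    before = leaked_workers()        # idle writer workers are still alive
    writer_pool.shutdown(wait=True)  # joins the writer's workers
    after = leaked_workers()         # other_pool is still running here
    flag = writer_pool._shutdown     # private attribute, CPython-specific

    other_pool.shutdown(wait=True)
    return before, after, flag
```

Filtering on the prefix is what keeps the check from flagging `other_pool`, whose worker is still alive when `after` is computed.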

The streaming tiled-write path in ``_write_streaming`` built a
``ThreadPoolExecutor`` for parallel per-tile compression and called
``shutdown(wait=True)`` only after the tile-row loop completed. Any
mid-stream raise (compression error, dask compute error, file write
error) bypassed the shutdown and left worker threads alive.

- Wrap the tile-row loop in ``try/finally`` so the pool is always
  shut down before the exception propagates.
- Cancel still-pending futures before the final
  ``shutdown(wait=True)`` so the error path does not block on work
  we no longer need.
- Tag the pool's worker threads with a distinctive
  ``thread_name_prefix`` so leak detection can tell them apart from
  dask's own offload / scheduler pools.

Tests cover three mid-stream failure shapes plus a happy-path
regression guard: compression failure, file-write failure, and the
clean success path. Each test asserts both the pool's
``_shutdown`` flag and the absence of leaked worker threads named
with the writer's prefix.
github-actions bot added the performance label (PR touches performance-sensitive code) on May 22, 2026
Contributor Author

@brendancol brendancol left a comment

PR Review: Shut tile-compress pool down on streaming-write error path (#2276)

Blockers (must fix before merge)

  • None.

Suggestions (should fix, not blocking)

  • xrspatial/geotiff/_writer.py:1133-1143: the inner try/finally clears _inflight_futures = [] as soon as the list comprehension exits, including the exception path. If fut.result() raises for fut[3] of 8 submitted, the outer finally at line 1161 sees an empty list and cannot cancel the still-pending fut[4..7]. tile_pool.shutdown(wait=True) then blocks until those pending tasks drain. Either drop the inner clear, or call tile_pool.shutdown(wait=True, cancel_futures=True) (Python 3.9+) so pending work is dropped on the error path. The second option is simpler and removes the need for the _inflight_futures bookkeeping entirely.

  • xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py:21-23: the module docstring still describes the default ThreadPoolExecutor- thread-name prefix, but the implementation filters on _WRITER_POOL_PREFIX = 'xrspatial-geotiff-tile-compress'. Update the docstring to match.

  • CHANGELOG.md: no CHANGELOG entry. Recent bug-fix PRs add a bullet under "Unreleased / Bug fixes and improvements" referencing the issue number. Add one.
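
The effect of the second option in the first suggestion, `shutdown(wait=True, cancel_futures=True)`, can be seen on a single-worker pool. This is a sketch under stated assumptions: `job` and `run_demo` are hypothetical, and the gate and timer exist only to hold the first job in flight while the rest sit in the queue.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_demo():
    gate = threading.Event()     # holds the running job until released
    running = threading.Event()  # signals that the first job has started

    def job(i):
        running.set()
        gate.wait(timeout=5)
        return i

    pool = ThreadPoolExecutor(max_workers=1)
    futures = [pool.submit(job, i) for i in range(4)]
    running.wait(timeout=5)  # job 0 now occupies the only worker

    # Release job 0 shortly after shutdown begins so wait=True can return.
    threading.Timer(0.1, gate.set).start()

    # cancel_futures (Python 3.9+) cancels everything still queued, then
    # wait=True joins the worker once the in-flight job finishes.
    pool.shutdown(wait=True, cancel_futures=True)
    return futures
```

After `run_demo()` returns, the first future holds its result and the other three report `cancelled()`, with no list of in-flight futures kept by the caller.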

Nits (optional improvements)

  • xrspatial/geotiff/_writer.py:1026-1033: the 'xrspatial-geotiff-tile-compress' literal is duplicated between the writer and the test. Consider a module-level constant in _writer.py (e.g. _TILE_POOL_THREAD_PREFIX) so the test imports it.

  • xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py:189-207: monkeypatch.setattr(os, 'fdopen', ...) patches the global. It is reverted at teardown, but anything in the same test session that calls os.fdopen during the patched window gets intercepted. Scope is narrow enough here that it is fine to leave.

What looks good

  • try/finally placement is correct: pool construction happens before the try block, so the finally always has a valid tile_pool to act on.
  • The thread_name_prefix is an effective way to tell writer pools apart from dask's offload and scheduler pools when checking for leaks.
  • Three failure shapes plus a happy-path regression guard is solid coverage. The tests fail without the fix.
  • Temp filenames include the issue number per project convention.
  • Existing streaming-write suite (43 tests) still passes.

Checklist

  • Algorithm matches reference/paper (N/A)
  • All implemented backends produce consistent results (writer-only)
  • NaN handling is correct (not touched)
  • Edge cases covered by tests
  • Dask chunk boundaries handled correctly (not touched)
  • No premature materialization or unnecessary copies
  • Benchmark exists or is not needed
  • README feature matrix updated (N/A)
  • Docstrings present and accurate (one minor drift noted)

Review follow-up:

- Replace the inner ``try/finally`` + ``_inflight_futures`` bookkeeping
  with ``tile_pool.shutdown(wait=True, cancel_futures=True)`` (Python
  3.9+). On the error path, queued-but-not-started compress jobs are
  dropped instead of blocking the unwind on work we no longer need.
- Hoist the ``xrspatial-geotiff-tile-compress`` thread-name prefix
  into a module-level ``_TILE_POOL_THREAD_PREFIX`` constant. The test
  now imports it from ``_writer`` so the two cannot drift.
- Update the test module docstring to describe the actual filtering
  on the writer's prefix instead of the default ``ThreadPoolExecutor-``
  name.
- Add a CHANGELOG entry under "Unreleased / Bug fixes and improvements"
  referencing #2276.

The ``monkeypatch.setattr(os, 'fdopen', ...)`` nit is dismissed --
pytest's monkeypatch reverts at teardown and the patched window is
narrow.
Contributor Author

@brendancol brendancol left a comment

Follow-up review (after commit 1dcdb56)

All three Suggestions addressed. The first Nit applied; the second Nit dismissed with reason. No new findings.

Suggestion dispositions

  • xrspatial/geotiff/_writer.py:1133-1143 (cancel_futures): fixed. Replaced the inner try/finally + _inflight_futures bookkeeping with tile_pool.shutdown(wait=True, cancel_futures=True). The error path now drops queued work instead of blocking the unwind. Comment block updated to reflect the new approach.
  • xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py:21-23 (docstring drift): fixed. Module docstring rewritten to describe the writer-prefix filtering and the dask-pool false-positive rationale instead of the old default-prefix wording.
  • CHANGELOG.md: fixed. Added bullet under "Unreleased / Bug fixes and improvements" referencing #2276.

Nit dispositions

  • Shared _TILE_POOL_THREAD_PREFIX constant: fixed. Hoisted into _writer.py as a module-level constant; the test now imports it from writer_mod so a future rename of the prefix on the writer side updates the test in lockstep.
  • os.fdopen monkeypatch global scope: dismissed. Pytest's monkeypatch fixture reverts the attribute at test teardown, the patched window covers only the body of to_geotiff for one call, and the failure injection is deterministic on call count. No realistic risk of cross-test contamination.
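
Failure injection keyed on call count, with the patch reverted when its scope ends, can be sketched as follows. This is an illustration only: it uses `unittest.mock.patch.object` as a stand-in for pytest's `monkeypatch` fixture, and `failing_fdopen` and `demo` are hypothetical names, not the helpers in the test module.

```python
import os
import tempfile
from unittest import mock


def failing_fdopen(fail_on_call):
    """Wrap os.fdopen so that exactly the N-th call raises OSError."""
    real_fdopen = os.fdopen
    calls = {"n": 0}

    def wrapper(*args, **kwargs):
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            raise OSError("injected failure")
        return real_fdopen(*args, **kwargs)

    return wrapper


def demo():
    original = os.fdopen
    outcomes = []
    # The patch is active only inside this with-block.
    with mock.patch.object(os, "fdopen", failing_fdopen(fail_on_call=2)):
        for _ in range(3):
            fd, path = tempfile.mkstemp()
            try:
                with os.fdopen(fd, "wb") as fh:
                    fh.write(b"x")
                outcomes.append("ok")
            except OSError:
                os.close(fd)  # fdopen never wrapped the descriptor
                outcomes.append("failed")
            finally:
                os.remove(path)
    restored = os.fdopen is original
    return outcomes, restored
```

Because the failure depends only on the call counter, the same call fails on every run, and `os.fdopen` is the original function again once the block exits.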

Verification

  • pytest xrspatial/geotiff/tests/test_streaming_write_pool_leak_2276.py xrspatial/geotiff/tests/test_streaming_write_parallel.py xrspatial/geotiff/tests/test_streaming_write.py -- 37 pass, 1 skipped (perf gate).

No remaining findings. Pool shutdown is now guaranteed on every exit path, queued work is dropped on the error path, and the test/code constant stays in sync.

@brendancol brendancol merged commit 6261ca4 into main May 22, 2026
5 checks passed

Labels

performance PR touches performance-sensitive code

Development

Successfully merging this pull request may close these issues.

ThreadPoolExecutor leak on mid-stream failure in tiled writes

1 participant