geotiff: share parsed VRTDataset across chunk tasks via dask.delayed (#1923)#1927
Merged
brendancol merged 3 commits intoMay 15, 2026
Conversation
…array-contrib#1923) _read_vrt_chunked in xrspatial/geotiff/_backends/vrt.py passed parsed_vrt=vrt as a plain kwarg to each per-chunk dask.delayed call, so dask embedded the full VRTDataset (filenames, src/dst rects, per-source nodata) into every task's kwargs. For an N-source VRT split into M chunks the graph held N*M copies of the source list; under distributed/process schedulers each task pickle shipped the full payload independently. Structurally verified pre-fix: 64 of 64 _vrt_chunk_read tasks held an inline VRTDataset in kwargs['parsed_vrt']. A 1000-source VRT split into 1000 chunks built a ~57 MB driver graph (60 KB pickled VRTDataset x 1000 task copies). Fix wraps the dataset in dask.delayed(vrt, pure=True) once before the per-chunk loop and threads that single shared reference through parsed_vrt=, matching the http_meta_key pattern in _backends/dask.py:147 and the meta_key pattern in _backends/gpu.py:1035. 3 new tests in test_vrt_chunked_shared_dataset_1923.py: a structural check that no task's kwargs['parsed_vrt'] is an inline VRTDataset, a decode-equivalence check vs the eager read, and a band-kwarg validation check to confirm the wrap does not change error semantics. State CSV: Pass 9 entry added; SAFE/IO-bound verdict holds.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reduces dask graph size and serialization overhead for chunked VRT reads by ensuring the parsed VRTDataset is shared once across all per-chunk tasks rather than being embedded into every task’s kwargs.
Changes:
- Wrap the parsed
VRTDatasetin a singledask.delayed(..., pure=True)and pass that shared reference into each_vrt_chunk_readtask. - Add regression tests to verify the dask graph structure no longer embeds inline
VRTDatasetobjects and that decode/band-validation behavior is unchanged. - Update the internal performance sweep state log entry for the geotiff audit.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
xrspatial/geotiff/_backends/vrt.py |
Wraps parsed VRT metadata in a shared delayed node to avoid per-task kwargs duplication in the chunked VRT dask graph. |
xrspatial/geotiff/tests/test_vrt_chunked_shared_dataset_1923.py |
Adds structural + behavioral regression tests for the shared-VRTDataset chunked path. |
.claude/sweep-performance-state.csv |
Records the performance audit finding/fix for issue #1923. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+23
to
+24
| import pickle | ||
| import tempfile |
Comment on lines
+104
to
+109
| Under the in-process scheduler an embedded copy still works | ||
| correctly because Python's pickle memo deduplicates references to | ||
| the same in-memory object. The bug surfaces under the distributed | ||
| / multi-process scheduler where each task pickle is serialised | ||
| independently and the full dataset is shipped once per task -- so | ||
| the structural shape, not the in-process pickle size, is what |
Remove unused pickle/tempfile imports flagged by flake8 F401 and reword the in-process scheduler note in the docstring so it no longer implies pickling happens locally.
Contributor
Author
|
@copilot resolve the merge conflicts in this pull request |
…e-geotiff-2026-05-15-1778854348
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_read_vrt_chunkedinxrspatial/geotiff/_backends/vrt.pypassed the parsedVRTDatasetas a plain kwarg to each per-chunkdask.delayedcall, embedding the full source list (filenames, src/dst rects, per-source nodata) into every task's kwargs.dask.delayed(vrt, pure=True)once and thread that single shared reference throughparsed_vrt=, matching thehttp_meta_keypattern in_backends/dask.py:147and themeta_keypattern in_backends/gpu.py:1035.Numbers
Structural pre-fix check on a 4x4 source VRT split into 64 chunks: 64 of 64
_vrt_chunk_readtasks held an inlineVRTDatasetinkwargs['parsed_vrt'].At the documented
_MAX_VRT_DASK_CHUNKScap a 1000-source VRT split into 1000 chunks builds a graph that embeds ~57 MB of redundant source metadata (60 KB pickledVRTDatasetx 1000 task copies). Post-fix the graph holds one shared copy.Closes #1923.
Test plan
test_vrt_chunked_shared_dataset_1923.pyadds three tests:test_vrt_chunked_dataset_is_shared_graph_input: walks every_vrt_chunk_readtask in the dask graph and asserts none of them embed an inlineVRTDatasetinkwargs['parsed_vrt'].test_vrt_chunked_decode_unchanged_after_shared_wrap: confirms decoded pixels match the eagerread_vrtbaseline.test_vrt_chunked_band_kwarg_still_validates: confirms the wrap does not change band validation error semantics.xrspatial/geotiff/tests/pass; the 1 unrelatedtile_size=4validator failure (test_size_param_validation_gpu_vrt_1776.py) predates this change.