
geotiff: share parsed VRTDataset across chunk tasks via dask.delayed (#1923) #1927

Merged
brendancol merged 3 commits into xarray-contrib:main from brendancol:deep-sweep-performance-geotiff-2026-05-15-1778854348
May 15, 2026

@brendancol
Contributor

Summary

  • _read_vrt_chunked in xrspatial/geotiff/_backends/vrt.py passed the parsed VRTDataset as a plain kwarg to each per-chunk dask.delayed call, embedding the full source list (filenames, src/dst rects, per-source nodata) into every task's kwargs.
  • For an N-source VRT split into M chunks the graph carried M copies of the N-entry source list (N*M source records in total); under distributed/process schedulers each task pickle shipped that payload independently.
  • Wrap the dataset in dask.delayed(vrt, pure=True) once and thread that single shared reference through parsed_vrt=, matching the http_meta_key pattern in _backends/dask.py:147 and the meta_key pattern in _backends/gpu.py:1035.
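
The wrap-once pattern described above can be sketched as follows. This is a minimal illustration, not the code in vrt.py: `FakeVRT` and `read_chunk` are hypothetical stand-ins for the real `VRTDataset` and `_vrt_chunk_read`, and only `dask.delayed` and `dask.compute` are real dask calls.

```python
import dask
from dask import delayed


class FakeVRT:
    """Hypothetical stand-in for the parsed VRTDataset (not the real class)."""

    def __init__(self, sources):
        self.sources = sources


def read_chunk(index, parsed_vrt=None):
    # Hypothetical stand-in for _vrt_chunk_read: report how many sources it saw.
    return index, len(parsed_vrt.sources)


vrt = FakeVRT([f"tile_{i}.tif" for i in range(16)])

# Wrap once, before the per-chunk loop, so the graph holds a single node.
shared_vrt = delayed(vrt, pure=True)

# Every chunk task receives the same Delayed reference instead of its own copy.
tasks = [delayed(read_chunk)(i, parsed_vrt=shared_vrt) for i in range(4)]

results = dask.compute(*tasks)

# Each task's graph depends on the one shared key rather than an inline object.
shared_in_every_graph = all(
    shared_vrt.key in dict(t.__dask_graph__()) for t in tasks
)
```

Because the kwarg is a `Delayed`, dask records it as a dependency and resolves it to the underlying object before calling the function, so the per-chunk function body does not need to change.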

Numbers

Structural pre-fix check on a 4x4 source VRT split into 64 chunks: 64 of 64 _vrt_chunk_read tasks held an inline VRTDataset in kwargs['parsed_vrt'].

At the documented _MAX_VRT_DASK_CHUNKS cap, a 1000-source VRT split into 1000 chunks builds a graph that embeds ~57 MB of redundant source metadata (60 KB pickled VRTDataset x 1000 task copies). Post-fix the graph holds one shared copy.
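
As a quick check of the arithmetic, the quoted ~57 MB is reproduced if "60 KB" is read as 60,000 bytes and "MB" as MiB. Both unit readings are assumptions made here for illustration; the PR does not state them.

```python
# Figures quoted in the PR description.
per_copy_bytes = 60_000  # assumption: "60 KB" read as 60,000 bytes
task_copies = 1000       # one inline copy per chunk task

redundant_bytes = per_copy_bytes * task_copies
redundant_mib = redundant_bytes / 2**20  # assumption: "MB" read as MiB

print(f"{redundant_bytes:,} bytes ~= {redundant_mib:.1f} MiB")
# prints: 60,000,000 bytes ~= 57.2 MiB
```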

Closes #1923.

Test plan

  • New test_vrt_chunked_shared_dataset_1923.py adds three tests:
    • test_vrt_chunked_dataset_is_shared_graph_input: walks every _vrt_chunk_read task in the dask graph and asserts none of them embed an inline VRTDataset in kwargs['parsed_vrt'].
    • test_vrt_chunked_decode_unchanged_after_shared_wrap: confirms decoded pixels match the eager read_vrt baseline.
    • test_vrt_chunked_band_kwarg_still_validates: confirms the wrap does not change band validation error semantics.
  • 438 existing VRT tests under xrspatial/geotiff/tests/ pass; the one unrelated tile_size=4 validator failure (test_size_param_validation_gpu_vrt_1776.py) predates this change.

…array-contrib#1923)

_read_vrt_chunked in xrspatial/geotiff/_backends/vrt.py passed
parsed_vrt=vrt as a plain kwarg to each per-chunk dask.delayed call,
so dask embedded the full VRTDataset (filenames, src/dst rects,
per-source nodata) into every task's kwargs. For an N-source VRT
split into M chunks the graph held M copies of the N-entry source
list (N*M source records); under distributed/process schedulers each
task pickle shipped the full payload independently.

Structurally verified pre-fix: 64 of 64 _vrt_chunk_read tasks held
an inline VRTDataset in kwargs['parsed_vrt']. A 1000-source VRT
split into 1000 chunks built a ~57 MB driver graph (60 KB pickled
VRTDataset x 1000 task copies).

Fix wraps the dataset in dask.delayed(vrt, pure=True) once before
the per-chunk loop and threads that single shared reference through
parsed_vrt=, matching the http_meta_key pattern in
_backends/dask.py:147 and the meta_key pattern in
_backends/gpu.py:1035.

3 new tests in test_vrt_chunked_shared_dataset_1923.py: a structural
check that no task's kwargs['parsed_vrt'] is an inline VRTDataset,
a decode-equivalence check vs the eager read, and a band-kwarg
validation check to confirm the wrap does not change error
semantics.

State CSV: Pass 9 entry added; SAFE/IO-bound verdict holds.
Copilot AI review requested due to automatic review settings May 15, 2026 14:25
@github-actions (bot) added the performance label (PR touches performance-sensitive code) May 15, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR reduces dask graph size and serialization overhead for chunked VRT reads by ensuring the parsed VRTDataset is shared once across all per-chunk tasks rather than being embedded into every task’s kwargs.

Changes:

  • Wrap the parsed VRTDataset in a single dask.delayed(..., pure=True) and pass that shared reference into each _vrt_chunk_read task.
  • Add regression tests to verify the dask graph structure no longer embeds inline VRTDataset objects and that decode/band-validation behavior is unchanged.
  • Update the internal performance sweep state log entry for the geotiff audit.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| xrspatial/geotiff/_backends/vrt.py | Wraps parsed VRT metadata in a shared delayed node to avoid per-task kwargs duplication in the chunked VRT dask graph. |
| xrspatial/geotiff/tests/test_vrt_chunked_shared_dataset_1923.py | Adds structural + behavioral regression tests for the shared-VRTDataset chunked path. |
| .claude/sweep-performance-state.csv | Records the performance audit finding/fix for issue #1923. |

Comment on lines +23 to +24

import pickle
import tempfile

Comment on lines +104 to +109

Under the in-process scheduler an embedded copy still works
correctly because Python's pickle memo deduplicates references to
the same in-memory object. The bug surfaces under the distributed
/ multi-process scheduler where each task pickle is serialised
independently and the full dataset is shipped once per task -- so
the structural shape, not the in-process pickle size, is what

Remove unused pickle/tempfile imports flagged by flake8 F401 and
reword the in-process scheduler note in the docstring so it no
longer implies pickling happens locally.
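
The pickle behaviour the reviewed docstring alludes to can be shown with the standard library alone. This is a sketch under stated assumptions: the `vrt` dict and the `tasks` payloads are hypothetical stand-ins, not dask's real task objects. It shows why serialising each task independently multiplies the payload, whereas a single `pickle.dumps` over a container writes a shared object only once. In-process dask schedulers generally do not pickle tasks at all, which is consistent with the rewording requested in the follow-up commit.

```python
import pickle

# Hypothetical stand-in for a parsed VRT: a dict holding a long source list.
vrt = {"sources": [f"/data/tiles/source_{i:04d}.tif" for i in range(1000)]}

# Hypothetical task payloads that all reference the same in-memory object.
tasks = [{"chunk": i, "parsed_vrt": vrt} for i in range(64)]

single = len(pickle.dumps(vrt))

# One dumps() call for the whole list: the memo writes vrt once and emits
# short back-references for the other 63 occurrences.
together = len(pickle.dumps(tasks))

# One dumps() call per task: every call starts with an empty memo, so the
# full vrt payload is written 64 times.
separately = sum(len(pickle.dumps(t)) for t in tasks)

print(single, together, separately)
```

Comparing `together` with `separately` makes the gap concrete: the first stays close to the size of one copy, while the second grows with the number of tasks.
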
@brendancol
Contributor Author

@copilot resolve the merge conflicts in this pull request

@brendancol brendancol merged commit b502930 into xarray-contrib:main May 15, 2026
1 of 11 checks passed
Labels

performance (PR touches performance-sensitive code)

Development

Successfully merging this pull request may close these issues.

geotiff: VRT chunked path embeds full VRTDataset in every chunk task

2 participants