Problem
The GeoTIFF module has accumulated a large number of local fixes across very large files:
xrspatial/geotiff/_reader.py is roughly 3.4k lines
xrspatial/geotiff/_writer.py is roughly 2.8k lines
xrspatial/geotiff/_gpu_decode.py is roughly 3.3k lines
xrspatial/geotiff/_compression.py is roughly 1.9k lines
The test suite is extensive, but the implementation shape is now high-risk: behavior that should be one contract is often repeated or partially mirrored across eager NumPy, Dask, GPU, Dask+GPU, VRT, HTTP/fsspec, and writer paths. That makes regressions likely whenever a bug fix lands in one backend but misses another.
The highest-risk duplicated contracts are:
- transform inference and georef/no-georef/rotated-state handling
- nodata masking,
masked_nodata, nodata_pixels_present, and dtype promotion
- metadata finalization and attrs contract population
- dispatcher/backend kwarg validation parity
- read/write round-trip invariants across eager, lazy, GPU, VRT, and COG paths
This is not about rewriting the module wholesale. The goal is to reduce regression surface by extracting shared contract boundaries and making each backend call those contracts instead of reimplementing them locally.
Implementation plan
Phase 1: Inventory and contract map
- Build a short internal map of current backend entry points and the contract steps each performs:
- source/kwarg validation
- metadata parse
- transform/georef classification
- pixel decode
- orientation/photometric handling
- nodata mask and dtype cast
- attrs finalization
- DataArray construction
- Identify which steps already have shared helpers and which are still duplicated.
- Add this map as developer documentation under
xrspatial/geotiff/ or docs/source/reference/ so future patches have an intended flow to follow.
Phase 2: Centralize transform/georef contract
- Extract one shared transform/georef resolver that owns:
- coord-to-transform inference
- no-georef marker behavior
- degenerate axis policy
- rotated-affine read-only/drop policy
georef_status derivation
- Make eager writer, VRT tiled writer, GPU writer, and any low-level write path call the same resolver.
- Make the resolver return a typed result rather than loosely coupled attrs/tuples.
- Add parity tests for
y/x, lat/lon, row/col, no-georef, transform-only, CRS-only, and rotated-dropped cases.
Phase 3: Centralize nodata lifecycle
- Extract one nodata lifecycle object/helper that owns:
- raw declared sentinel
- effective mask sentinel after photometric transforms
- whether pixels were present
- whether masking occurred
- whether dtype promotion/cast occurred
- writer restore-sentinel decision
- Replace per-backend nodata branches with calls into this helper where feasible.
- Keep backend-specific execution kernels only for the actual array operation; the decision logic should be shared.
- Add backend parity tests for integer sentinels, float sentinels, NaN sentinel, out-of-range sentinels, MinIsWhite,
mask_nodata=False, and explicit dtype=.
Phase 4: Centralize metadata finalization
- Continue moving attrs population into
GeoTIFFMetadata / finalization helpers until every read backend produces attrs through the same path.
- Ensure VRT, eager NumPy, Dask, GPU eager, and Dask+GPU all stamp the same canonical keys for equivalent inputs.
- Add a small table-driven parity test that compares attrs across backends, excluding explicitly backend-specific attrs.
Phase 5: Reduce file size by extracting cohesive modules
Split only after contracts are centralized, so extraction is mechanical rather than another behavioral change. Candidate modules:
_sources.py: local mmap, HTTP, fsspec, BytesIO source classes and SSRF/range helpers
_decode.py: strip/tile decode orchestration independent of transport
_layout.py: strip/tile layout validation and byte-count/offset planning
_write_layout.py: IFD assembly and BigTIFF/COG layout decisions
_nodata.py: nodata lifecycle contract and masking decisions
_georef.py: transform/georef resolver and status handling
Each extraction should preserve public API and be backed by parity tests before and after.
Phase 6: Backend parity gates
Add or strengthen tests that verify the same fixture through available backends:
- eager NumPy
- Dask NumPy
- GPU when available, otherwise xfail/skip with strict fallback checks
- Dask+GPU when available
- VRT
- HTTP/fsspec where practical
For each fixture, assert:
- pixel values
- dims and coords
- transform/georef attrs
- CRS attrs
- nodata attrs and dtype
- selected rich metadata attrs
Acceptance criteria
- No public API changes unless explicitly documented and tested.
- All existing GeoTIFF tests pass.
- At least one new contract/parity test file covers transform/georef behavior across backends.
- At least one new contract/parity test file covers nodata lifecycle behavior across backends.
- New helpers are used by at least eager NumPy, Dask, and one writer path before closing this issue.
- File extraction is done in small PRs where each PR is either behavior-neutral or has narrowly scoped behavior changes with regression tests.
Non-goals
- Replacing the native GeoTIFF reader/writer with GDAL/rasterio.
- Rewriting GPU decode/compression internals in one pass.
- Changing documented public behavior without a migration note.
- Chasing performance improvements before contract parity is locked down.
Why this matters
GeoTIFF I/O is upstream of many other xrspatial functions. A subtle mismatch in transform, nodata, dtype, or attrs can silently corrupt downstream slope, aspect, zonal, proximity, hydrology, and raster/vector workflows. The module already has many targeted fixes; this issue is about making those fixes durable by moving the shared rules into shared code.
Problem
The GeoTIFF module has accumulated a large number of local fixes across very large files:
xrspatial/geotiff/_reader.pyis roughly 3.4k linesxrspatial/geotiff/_writer.pyis roughly 2.8k linesxrspatial/geotiff/_gpu_decode.pyis roughly 3.3k linesxrspatial/geotiff/_compression.pyis roughly 1.9k linesThe test suite is extensive, but the implementation shape is now high-risk: behavior that should be one contract is often repeated or partially mirrored across eager NumPy, Dask, GPU, Dask+GPU, VRT, HTTP/fsspec, and writer paths. That makes regressions likely whenever a bug fix lands in one backend but misses another.
The highest-risk duplicated contracts are:
masked_nodata,nodata_pixels_present, and dtype promotionThis is not about rewriting the module wholesale. The goal is to reduce regression surface by extracting shared contract boundaries and making each backend call those contracts instead of reimplementing them locally.
Implementation plan
Phase 1: Inventory and contract map
xrspatial/geotiff/ordocs/source/reference/so future patches have an intended flow to follow.Phase 2: Centralize transform/georef contract
georef_statusderivationy/x,lat/lon,row/col, no-georef, transform-only, CRS-only, and rotated-dropped cases.Phase 3: Centralize nodata lifecycle
mask_nodata=False, and explicitdtype=.Phase 4: Centralize metadata finalization
GeoTIFFMetadata/ finalization helpers until every read backend produces attrs through the same path.Phase 5: Reduce file size by extracting cohesive modules
Split only after contracts are centralized, so extraction is mechanical rather than another behavioral change. Candidate modules:
_sources.py: local mmap, HTTP, fsspec, BytesIO source classes and SSRF/range helpers_decode.py: strip/tile decode orchestration independent of transport_layout.py: strip/tile layout validation and byte-count/offset planning_write_layout.py: IFD assembly and BigTIFF/COG layout decisions_nodata.py: nodata lifecycle contract and masking decisions_georef.py: transform/georef resolver and status handlingEach extraction should preserve public API and be backed by parity tests before and after.
Phase 6: Backend parity gates
Add or strengthen tests that verify the same fixture through available backends:
For each fixture, assert:
Acceptance criteria
Non-goals
Why this matters
GeoTIFF I/O is upstream of many other xrspatial functions. A subtle mismatch in transform, nodata, dtype, or attrs can silently corrupt downstream slope, aspect, zonal, proximity, hydrology, and raster/vector workflows. The module already has many targeted fixes; this issue is about making those fixes durable by moving the shared rules into shared code.