Skip to content

Commit e73bc12

Browse files
authored
Require explicit opt-in on experimental and internal-only GeoTIFF paths (PR 4 of epic #2340) (#2356)
* Require explicit opt-in on experimental and internal-only GeoTIFF paths (#2352) PR 4 of 6 under epic #2340 (GeoTIFF release contract). Adds the matching read-side gates for the Tier 3 / Tier 4 codecs the writer already gated in #2137 and #1845, and extends the ``allow_experimental_codecs`` opt-in onto the writer rich-tag attrs (``gdal_metadata_xml`` / ``extra_tags``). Read side: - ``open_geotiff`` / ``read_geotiff_dask`` / ``read_geotiff_gpu`` / ``read_vrt`` now reject sources compressed with LERC, JPEG2000 / J2K, or LZ4 unless ``allow_experimental_codecs=True``. - The same readers reject JPEG-in-TIFF unless ``allow_internal_only_jpeg=True``. The flags do not collapse. - Gates fire at the public entry point (eager) or at graph build (dask), before any decode work, so the caller learns the opt-in name from the rejection rather than a deeper decode failure. Write side: - ``to_geotiff`` and ``write_geotiff_gpu`` reject writes that include ``attrs['gdal_metadata_xml']`` or ``attrs['extra_tags']`` unless ``allow_experimental_codecs=True``. Round-tripped attrs (carrying ``_xrspatial_geotiff_contract``) are exempt so the canonical attrs contract from #1984 still round-trips without a flag. Tests: - New ``test_experimental_internal_optin_2352.py`` pins the rejections, the exemption shape, and the public signatures. - Existing codec / rich-tag tests pass the new opt-ins through; the ``test_reader_kwarg_order_1935.py`` canonical kwarg order is updated to slot the new flags next to the other ``allow_*`` policy gates. Out of scope (sibling PRs under epic #2340): - Docstring rewrites for public read/write entry points (#2346). - Docs release-contract page (#2347). - ``SUPPORTED_FEATURES`` tier audit (#2348). - Unsupported-combination errors (#2349). Closes #2352. * Address self-review: explain Adobe Deflate collapse, contract-marker soft gate, and dask graph-build skip (#2352) Documents three design choices flagged during the self-review pass on #2356: * _COMPRESSION_TAG_TO_NAME collapses TIFF tag 32946 (Adobe Deflate) onto the stable codec.deflate entry on purpose. The comment makes the deliberate collapse vs accidental passthrough obvious to a future reviewer. * The rich-tag gate's _xrspatial_geotiff_contract exemption is a soft gate by design: forging the marker by hand would bypass it, but the alternative (gating every read-then-write call) would break the canonical attrs round-trip from #1984. * The dask read-side gate uses getattr(..., None) so a synthesised geo_info (non-TIFF source) skips the check rather than rejects. The comment documents the lockstep invariant (_ifd_compression stashed alongside _ifd_photometric / _ifd_samples_per_pixel on every TIFF source path) so the skip never silently bypasses a real TIFF read.
1 parent 9c40df4 commit e73bc12

47 files changed

Lines changed: 1127 additions & 129 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

xrspatial/geotiff/__init__.py

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -241,6 +241,7 @@ def _read_geo_info(source, *, overview_level: int | None = None,
241241
# binding it into the chunk closure (#1809).
242242
geo_info._ifd_photometric = _ifd.photometric
243243
geo_info._ifd_samples_per_pixel = _ifd.samples_per_pixel
244+
geo_info._ifd_compression = _ifd.compression
244245
return geo_info, _ifd.height, _ifd.width, file_dtype, n_bands
245246
if _is_file_like(source):
246247
# File-like: read its full bytes; we don't try to mmap arbitrary
@@ -316,6 +317,11 @@ def _read_geo_info(source, *, overview_level: int | None = None,
316317
# binding it into the chunk closure (#1809).
317318
geo_info._ifd_photometric = ifd.photometric
318319
geo_info._ifd_samples_per_pixel = ifd.samples_per_pixel
320+
# Stash compression so the dask graph builder can fire the
321+
# experimental / internal-only codec opt-in gate at graph build
322+
# rather than waiting for the per-chunk task to fail (PR 4 of
323+
# epic #2340).
324+
geo_info._ifd_compression = ifd.compression
319325
return geo_info, ifd.height, ifd.width, file_dtype, n_bands
320326
finally:
321327
if close_data:
@@ -339,6 +345,8 @@ def open_geotiff(source: str | BinaryIO, *,
339345
missing_sources: str = _MISSING_SOURCES_SENTINEL,
340346
allow_rotated: bool = False,
341347
allow_unparseable_crs: bool = False,
348+
allow_experimental_codecs: bool = False,
349+
allow_internal_only_jpeg: bool = False,
342350
band_nodata: str | None = None,
343351
mask_nodata: bool = True,
344352
) -> xr.DataArray:
@@ -535,6 +543,23 @@ def open_geotiff(source: str | BinaryIO, *,
535543
behaviour where the citation field passes through unchanged.
536544
Matches the same kwarg on ``to_geotiff`` / ``write_geotiff_gpu``
537545
so a value the reader accepted can survive a round-trip.
546+
allow_experimental_codecs : bool, default False
547+
Read-side opt-in for sources compressed with the Tier 3
548+
experimental codecs (``lerc``, ``jpeg2000`` / ``j2k``, ``lz4``).
549+
Default ``False`` rejects the read with ``ValueError`` naming
550+
the flag; cross-backend numerical parity is not claimed and
551+
reader support across GDAL versions is uneven. Matches the
552+
same kwarg on the writers so a round-trip through a Tier 3
553+
codec stays opt-in on both sides. See SUPPORTED_FEATURES tier
554+
``'experimental'`` (epic #2340 PR 4).
555+
allow_internal_only_jpeg : bool, default False
556+
Read-side opt-in for JPEG-in-TIFF sources. The encoder writes
557+
self-contained JFIF tiles without the TIFF JPEGTables tag
558+
(347), so the read path is not interoperable with libtiff /
559+
GDAL / rasterio. ``allow_experimental_codecs=True`` does NOT
560+
cover this codec; the dedicated flag is its only gate. See
561+
SUPPORTED_FEATURES tier ``'internal_only'`` for ``codec.jpeg``
562+
(epic #2340 PR 4, original writer gate #1845).
538563
539564
Returns
540565
-------
@@ -665,6 +690,8 @@ def open_geotiff(source: str | BinaryIO, *,
665690
max_pixels=max_pixels,
666691
allow_rotated=allow_rotated,
667692
allow_unparseable_crs=allow_unparseable_crs,
693+
allow_experimental_codecs=allow_experimental_codecs,
694+
allow_internal_only_jpeg=allow_internal_only_jpeg,
668695
band_nodata=band_nodata,
669696
mask_nodata=mask_nodata,
670697
**vrt_kwargs)
@@ -685,6 +712,10 @@ def open_geotiff(source: str | BinaryIO, *,
685712
max_pixels=max_pixels,
686713
allow_rotated=allow_rotated,
687714
allow_unparseable_crs=allow_unparseable_crs,
715+
allow_experimental_codecs=(
716+
allow_experimental_codecs),
717+
allow_internal_only_jpeg=(
718+
allow_internal_only_jpeg),
688719
mask_nodata=mask_nodata,
689720
**gpu_kwargs)
690721

@@ -696,6 +727,10 @@ def open_geotiff(source: str | BinaryIO, *,
696727
max_pixels=max_pixels, name=name,
697728
allow_rotated=allow_rotated,
698729
allow_unparseable_crs=allow_unparseable_crs,
730+
allow_experimental_codecs=(
731+
allow_experimental_codecs),
732+
allow_internal_only_jpeg=(
733+
allow_internal_only_jpeg),
699734
mask_nodata=mask_nodata)
700735

701736
kwargs = {}
@@ -714,6 +749,8 @@ def open_geotiff(source: str | BinaryIO, *,
714749
source, window=window,
715750
overview_level=overview_level, band=band,
716751
allow_rotated=allow_rotated,
752+
allow_experimental_codecs=allow_experimental_codecs,
753+
allow_internal_only_jpeg=allow_internal_only_jpeg,
717754
**kwargs,
718755
)
719756

xrspatial/geotiff/_attrs.py

Lines changed: 160 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -354,6 +354,166 @@
354354
)
355355

356356

357+
# Map TIFF compression tag values to codec names so the read-side opt-in
358+
# gate (PR 4 of epic #2340) can name the codec in the rejection message
359+
# without each call site repeating the integer-to-name table. The keys
360+
# are the TIFF 6 Compression tag values (tag 259) used inside
361+
# ``_compression.py``; the values match the codec names that appear on
362+
# the ``SUPPORTED_FEATURES`` keys (``codec.<name>``).
363+
_COMPRESSION_TAG_TO_NAME = {
364+
1: 'none',
365+
5: 'lzw',
366+
7: 'jpeg',
367+
8: 'deflate',
368+
32773: 'packbits',
369+
# Adobe Deflate (32946) decodes through the same zlib path as
370+
# plain Deflate (8) and is collapsed onto the same codec name on
371+
# purpose: both tags share the stable-tier classification in
372+
# ``SUPPORTED_FEATURES`` (``codec.deflate``). A future Adobe-
373+
# Deflate-specific tier would need its own ``codec.<name>`` entry
374+
# AND its own mapping line here; the collapse is deliberate, not
375+
# a passthrough.
376+
32946: 'deflate',
377+
34712: 'jpeg2000',
378+
34887: 'lerc',
379+
50000: 'zstd',
380+
50004: 'lz4',
381+
}
382+
383+
384+
def _validate_read_codec_optin(
385+
compression: int,
386+
*,
387+
allow_experimental_codecs: bool,
388+
allow_internal_only_jpeg: bool,
389+
entry_point: str = "open_geotiff",
390+
) -> None:
391+
"""Reject experimental / internal-only codecs on the read side.
392+
393+
Mirrors the writer-side gate in ``_writers/eager.py`` /
394+
``_writers/gpu.py`` so a caller cannot decode a file produced with
395+
an experimental or internal-only codec without naming the matching
396+
opt-in flag at the call site. The flag and feature both appear in
397+
the rejection message so the caller learns the name from the error
398+
rather than the docs.
399+
400+
Part of PR 4 of epic #2340 (the GeoTIFF release contract). The
401+
writer side has carried these gates since #2137 / #1845; this
402+
helper extends the same shape to the read entry points.
403+
404+
Parameters
405+
----------
406+
compression : int
407+
TIFF Compression tag value (tag 259) from the parsed IFD.
408+
allow_experimental_codecs : bool
409+
Opt-in for Tier 3 read paths (LERC, JPEG2000 / J2K, LZ4).
410+
allow_internal_only_jpeg : bool
411+
Opt-in for Tier 4 read path (JPEG-in-TIFF). Does not collapse
412+
into ``allow_experimental_codecs`` for the same reason as on
413+
the writer: internal-only is a stricter tier and keeps its own
414+
dedicated flag.
415+
entry_point : str
416+
Name of the public read function for the rejection message.
417+
"""
418+
codec_name = _COMPRESSION_TAG_TO_NAME.get(int(compression))
419+
if codec_name is None:
420+
# Unknown compression tags are validated separately by the
421+
# decoder; the opt-in gate only fires for codecs the reader
422+
# otherwise accepts.
423+
return
424+
if codec_name == 'jpeg' and not allow_internal_only_jpeg:
425+
raise ValueError(
426+
f"{entry_point}: source uses compression='jpeg' (TIFF "
427+
"tag 259 = 7), which is internal-only: the encoder writes "
428+
"self-contained JFIF tiles without the TIFF JPEGTables tag "
429+
"(347), so the read path is not interoperable with libtiff "
430+
"/ GDAL / rasterio. Pass allow_internal_only_jpeg=True to "
431+
"opt in to the internal-reader-only decode path. See "
432+
"SUPPORTED_FEATURES tier 'internal_only' for codec.jpeg "
433+
"(epic #2340, original gate #1845).")
434+
if codec_name in _EXPERIMENTAL_CODECS and not allow_experimental_codecs:
435+
raise ValueError(
436+
f"{entry_point}: source uses compression={codec_name!r} "
437+
"which is experimental on the read side: cross-backend "
438+
"numerical parity is not claimed and reader support across "
439+
"GDAL versions is uneven. Pass allow_experimental_codecs="
440+
"True to opt in, or re-encode the source with a stable "
441+
"lossless codec ('deflate', 'zstd', or 'lzw'). See "
442+
f"SUPPORTED_FEATURES tier 'experimental' for codec.{codec_name} "
443+
"(epic #2340, original writer gate #2137).")
444+
445+
446+
# Writer rich-tag attrs that ride the Experimental tier (PR 4 of epic
447+
# #2340). ``writer.gdal_metadata_xml`` and ``writer.extra_tags`` carry
448+
# free-form payloads through to the on-disk TIFF; their interop with
449+
# other readers (rasterio, libtiff, GDAL) depends on the payload and is
450+
# not part of the release promise. The opt-in keeps the surface narrow
451+
# without removing the capability.
452+
#
453+
# Round-trip exemption: when the attrs carry the
454+
# ``_xrspatial_geotiff_contract`` marker, they came from a previous
455+
# xrspatial read. The reader populated ``gdal_metadata_xml`` /
456+
# ``extra_tags`` from the source file; gating the write would force
457+
# every read-then-write caller to opt in. Skip the gate on
458+
# round-tripped attrs so the canonical contract from #1984 stays a
459+
# no-flag operation. The gate still fires when a caller adds those
460+
# attrs to a fresh DataArray that did not come from a read.
461+
def _validate_write_rich_tag_optin(
462+
attrs: dict,
463+
*,
464+
gdal_metadata_xml_kwarg: object = None,
465+
extra_tags_kwarg: object = None,
466+
allow_experimental_codecs: bool,
467+
entry_point: str = "to_geotiff",
468+
) -> None:
469+
"""Reject writes that include ``gdal_metadata_xml`` or ``extra_tags``
470+
unless the caller opted in via ``allow_experimental_codecs=True``.
471+
472+
Part of PR 4 of epic #2340. Mirrors the existing codec-flag shape
473+
so the rejection names the same opt-in the caller already learned
474+
from the LERC / J2K / LZ4 paths. Round-tripped attrs (carrying
475+
the ``_xrspatial_geotiff_contract`` marker) are exempt so the
476+
canonical attrs round-trip (#1984) stays a no-flag operation; the
477+
gate fires only when a caller constructs a fresh DataArray with
478+
one of the rich-tag attrs set.
479+
"""
480+
if allow_experimental_codecs:
481+
return
482+
# Round-trip exemption: a DataArray that came from
483+
# ``open_geotiff`` / ``read_geotiff_dask`` / ``read_geotiff_gpu``
484+
# carries the contract marker. Writing it back is the canonical
485+
# round-trip and should not require a new flag (issue #1984).
486+
#
487+
# This is a soft gate by design: a caller who hand-builds an
488+
# attrs dict with the contract marker could bypass it. Forging
489+
# the marker is a deliberate act, and the alternative (gating
490+
# every read-then-write call) would break the canonical attrs
491+
# round-trip that downstream code already depends on. The hard
492+
# guarantee is "fresh DataArrays carrying these attrs need the
493+
# opt-in"; the soft exemption keeps round-trips frictionless.
494+
if '_xrspatial_geotiff_contract' in attrs:
495+
return
496+
triggered: list[str] = []
497+
if attrs.get('gdal_metadata_xml') is not None:
498+
triggered.append("attrs['gdal_metadata_xml']")
499+
if attrs.get('extra_tags') is not None:
500+
triggered.append("attrs['extra_tags']")
501+
if gdal_metadata_xml_kwarg is not None:
502+
triggered.append('gdal_metadata_xml kwarg')
503+
if extra_tags_kwarg is not None:
504+
triggered.append('extra_tags kwarg')
505+
if not triggered:
506+
return
507+
raise ValueError(
508+
f"{entry_point}: {', '.join(triggered)} pass-through is "
509+
"experimental: the on-disk bytes are written verbatim and "
510+
"interop with other readers (rasterio, libtiff, GDAL) depends "
511+
"on the payload. Pass allow_experimental_codecs=True to opt "
512+
"in to the rich-tag write path, or drop the attr before the "
513+
"write. See SUPPORTED_FEATURES tier 'experimental' for "
514+
"writer.gdal_metadata_xml / writer.extra_tags (epic #2340).")
515+
516+
357517
# TIFF type ids needed when synthesizing extra_tags entries from attrs.
358518
_TIFF_BYTE = 1
359519
_TIFF_ASCII = 2

xrspatial/geotiff/_backends/dask.py

Lines changed: 51 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,8 @@ def read_geotiff_dask(source: str, *,
4242
missing_sources: str = _MISSING_SOURCES_SENTINEL,
4343
allow_rotated: bool = False,
4444
allow_unparseable_crs: bool = False,
45+
allow_experimental_codecs: bool = False,
46+
allow_internal_only_jpeg: bool = False,
4547
band_nodata: str | None = None,
4648
mask_nodata: bool = True) -> xr.DataArray:
4749
"""Read a GeoTIFF as a dask-backed DataArray for out-of-core processing.
@@ -124,6 +126,16 @@ def read_geotiff_dask(source: str, *,
124126
instead of carrying the unrecognised payload through
125127
``attrs['crs_wkt']``. See ``open_geotiff`` for the full
126128
description.
129+
allow_experimental_codecs : bool, default False
130+
[advanced] Read-side opt-in for Tier 3 experimental codecs
131+
(``lerc``, ``jpeg2000`` / ``j2k``, ``lz4``). Fires at graph
132+
build, before any chunk task is scheduled. See ``open_geotiff``
133+
for the full description (epic #2340 PR 4).
134+
allow_internal_only_jpeg : bool, default False
135+
[advanced] Read-side opt-in for JPEG-in-TIFF sources. Not
136+
covered by ``allow_experimental_codecs``. See ``open_geotiff``
137+
for the full description (epic #2340 PR 4, original writer gate
138+
#1845).
127139
on_gpu_failure : str, optional
128140
[internal-only] Accepted for cross-backend signature symmetry
129141
only. The dask path runs CPU decoders, so passing this kwarg
@@ -315,6 +327,7 @@ def read_geotiff_dask(source: str, *,
315327
# Stash IFD photometric for the MinIsWhite nodata-inversion check below.
316328
geo_info._ifd_photometric = http_ifd.photometric
317329
geo_info._ifd_samples_per_pixel = http_ifd.samples_per_pixel
330+
geo_info._ifd_compression = http_ifd.compression
318331
else:
319332
# Metadata-only read: O(1) memory via mmap, no pixel decompression.
320333
# Lazy import for the same circular-import reason as ``read_vrt``
@@ -323,6 +336,28 @@ def read_geotiff_dask(source: str, *,
323336
geo_info, full_h, full_w, file_dtype, n_bands = _read_geo_info(
324337
source, overview_level=overview_level,
325338
allow_rotated=allow_rotated)
339+
340+
# Reject experimental / internal-only codecs at graph build, before
341+
# any chunk task is scheduled. The compression tag is stashed on
342+
# ``geo_info`` by ``_read_geo_info`` (local / fsspec) and by the
343+
# HTTP / fsspec branch above. PR 4 of epic #2340.
344+
#
345+
# ``getattr(..., None)`` is intentional: a synthesised geo_info
346+
# (non-TIFF source) carries no compression tag, so the gate must
347+
# skip rather than reject. Every TIFF source path stashes
348+
# ``_ifd_compression`` in lockstep with ``_ifd_photometric`` and
349+
# ``_ifd_samples_per_pixel`` so the skip never silently bypasses
350+
# a real TIFF read.
351+
_compression_tag = getattr(geo_info, '_ifd_compression', None)
352+
if _compression_tag is not None:
353+
from .._attrs import _validate_read_codec_optin
354+
_validate_read_codec_optin(
355+
_compression_tag,
356+
allow_experimental_codecs=allow_experimental_codecs,
357+
allow_internal_only_jpeg=allow_internal_only_jpeg,
358+
entry_point="read_geotiff_dask",
359+
)
360+
326361
# PR-C #2226: centralize the nodata lifecycle in one value object.
327362
# ``raw_sentinel`` carries the pre-inversion sentinel that
328363
# ``attrs['nodata']`` must preserve; ``effective_sentinel`` is what
@@ -553,7 +588,11 @@ def read_geotiff_dask(source: str, *,
553588
target_dtype=target_dtype,
554589
http_meta_key=http_meta_key,
555590
max_pixels=max_pixels,
556-
allow_rotated=allow_rotated),
591+
allow_rotated=allow_rotated,
592+
allow_experimental_codecs=(
593+
allow_experimental_codecs),
594+
allow_internal_only_jpeg=(
595+
allow_internal_only_jpeg)),
557596
shape=block_shape,
558597
dtype=target_dtype,
559598
)
@@ -575,7 +614,9 @@ def read_geotiff_dask(source: str, *,
575614

576615
def _delayed_read_window(source, r0, c0, r1, c1, overview_level, nodata,
577616
band, *, target_dtype=None, http_meta_key=None,
578-
max_pixels=None, allow_rotated=False):
617+
max_pixels=None, allow_rotated=False,
618+
allow_experimental_codecs=False,
619+
allow_internal_only_jpeg=False):
579620
"""Dask-delayed function to read a single window.
580621
581622
*http_meta_key* is an optional ``Delayed[(TIFFHeader, IFD)]`` parsed
@@ -631,11 +672,14 @@ def _read(http_meta):
631672
_r2a_kwargs = {}
632673
if max_pixels is not None:
633674
_r2a_kwargs['max_pixels'] = max_pixels
634-
arr, _ = _read_to_array(source, window=(r0, c0, r1, c1),
635-
overview_level=overview_level,
636-
band=band,
637-
allow_rotated=allow_rotated,
638-
**_r2a_kwargs)
675+
arr, _ = _read_to_array(
676+
source, window=(r0, c0, r1, c1),
677+
overview_level=overview_level,
678+
band=band,
679+
allow_rotated=allow_rotated,
680+
allow_experimental_codecs=allow_experimental_codecs,
681+
allow_internal_only_jpeg=allow_internal_only_jpeg,
682+
**_r2a_kwargs)
639683
if nodata is not None:
640684
# ``arr`` was just decoded by ``_fetch_decode_cog_http_tiles``
641685
# or ``read_to_array``; both return freshly-allocated buffers

0 commit comments

Comments
 (0)