Refactor GeoTIFF Phase 5d: extract _cog_http.py from _reader.py by brendancol · Pull Request #2262 · xarray-contrib/xarray-spatial

brendancol · 2026-05-21T18:00:25Z

Summary

Phase 5d of the GeoTIFF refactor (#2211). Moves the COG-over-HTTP transport layer into a new _cog_http.py.

Moves _parse_cog_http_meta, _read_cog_http, _fetch_decode_cog_http_strips, _fetch_decode_cog_http_tiles plus INITIAL_HTTP_HEADER_BYTES / MAX_HTTP_HEADER_BYTES out of _reader.py.
_reader.py shrinks from 1258 to 429 lines.
Public import paths stay intact: _reader.py re-exports the moved names. _backends/dask.py and geotiff/__init__.py now pull collaborators (_decode, _layout, _cog_http, _sources) directly where the monkeypatch contract allows.

Monkeypatch contract: option (a), keep the `_reader` namespace re-export

The existing tests patch attributes on _reader and then call _reader._read_cog_http. Rewriting every patch site (option b) would have touched ~30 call sites across 7 test files for a one-line saving inside _cog_http. I went with option (a): _read_cog_http and _parse_cog_http_meta in _cog_http.py resolve patchable collaborators through the _reader module namespace at call time (from . import _reader; _reader._HTTPSource(url), etc.).

The names looked up via _reader at call time:

_HTTPSource
_parse_cog_http_meta
_fetch_decode_cog_http_tiles
_decode_strip_or_tile
_apply_photometric_miniswhite
MAX_HTTP_HEADER_BYTES
INITIAL_HTTP_HEADER_BYTES

_backends/dask.py still goes through _reader for _HTTPSource / _CloudSource construction because the dask graph builder creates the source before handing it to the metadata parser, and the dask-HTTP coalesce tests patch _reader._HTTPSource.

Test plan

pytest xrspatial/geotiff/tests/test_http_cog_coalesce.py -v, 11/11, including test_read_cog_http_perf_with_mock_rtt (rtts_saved >= 5).
pytest xrspatial/geotiff/tests/, 5037 passed, 68 skipped. The lone failure (test_lowlevel_write_pushdown_2138::test_write_vs_to_geotiff_byte_parity_uint8[lz4]) reproduces on main and is unrelated (lz4 experimental-codec gate).
Monkeypatch-sensitive suite: test_security.py, test_cog_http_close_on_error_1816.py, test_http_meta_buffer_1718.py, test_http_stripped_window_max_pixels_issue_A_1842.py, test_cog_http_parallel_decode_2026_05_15.py, test_parallel_strip_decode_2100.py, all green.

Closes #2258
Part of #2211

Move the COG-over-HTTP transport layer (`_parse_cog_http_meta`, `_read_cog_http`, `_fetch_decode_cog_http_strips`, `_fetch_decode_cog_http_tiles`) plus the `INITIAL_HTTP_HEADER_BYTES` / `MAX_HTTP_HEADER_BYTES` constants into a new `_cog_http.py`. `_reader.py` drops from 1258 to 429 lines. The public import surface stays intact: `_reader.py` re-exports the moved names so `from xrspatial.geotiff._reader import _read_cog_http` and friends keep working unchanged. `_read_cog_http` resolves `_HTTPSource`, `_parse_cog_http_meta`, `_fetch_decode_cog_http_tiles`, `_decode_strip_or_tile`, and `_apply_photometric_miniswhite` through the `_reader` module namespace at call time so existing tests that `monkeypatch.setattr(_reader, '_HTTPSource', ...)` (and friends) continue to intercept the collaborators. `_parse_cog_http_meta` reads `MAX_HTTP_HEADER_BYTES` and `INITIAL_HTTP_HEADER_BYTES` the same way so `test_http_meta_buffer_1718`'s cap-shrinking patch keeps working. Part of #2211.

brendancol

PR Review: Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py

Mechanical extraction. I diffed every moved function body byte-for-byte against HEAD~1:xrspatial/geotiff/_reader.py and the only deltas are the deliberate _reader.X indirection inserts called out in the PR body. Logic is unchanged.

Blockers

None.

Suggestions

xrspatial/geotiff/__init__.py:216: _parse_cog_http_meta is now imported from _cog_http directly, while _backends/dask.py keeps _HTTPSource going through _reader for the monkeypatch contract. Both choices are correct (no test patches _reader._parse_cog_http_meta on the fsspec open_geotiff(..., chunks=...) path), but the asymmetry is worth a one-line comment so a future reader does not "fix" the inconsistency.
xrspatial/geotiff/_cog_http.py:64: the module-level from ._sources import _HTTPSource binding is only used as a type annotation now. Every runtime use goes through _reader._HTTPSource. Move it under if TYPE_CHECKING: or add a comment, otherwise a reader wonders why the imported name and the live binding diverge.

Nits

xrspatial/geotiff/_cog_http.py:268: _parse_cog_http_meta captures _reader.INITIAL_HTTP_HEADER_BYTES and _reader.MAX_HTTP_HEADER_BYTES once at the top of the call. The original referenced them as module globals every loop iteration. Behavior is identical in practice (no test patches the constants mid-call), but a sentence in the comment saying the one-shot capture is intentional and equivalent would close the loop.
_cog_http.py does from . import _reader lazily in four spots (_parse_cog_http_meta, _read_cog_http, _decode_http_strip, _decode_one). Same reason every time: avoid circular import at module load. A single sentence at the top of the file ("see _read_cog_http for why the lazy import is here, and which collaborators get the _reader.X treatment") would let future readers skip the per-call comments after the first.
No new tests added. The existing fleet (test_http_cog_coalesce, test_security, test_cog_http_close_on_error_1816, test_http_meta_buffer_1718, test_http_stripped_window_max_pixels_issue_A_1842, test_cog_http_parallel_decode_2026_05_15, test_parallel_strip_decode_2100) already exercises every patch site, so the indirection contract is covered end-to-end.

What looks good

_reader.py: 1258 -> 429 lines.
Full geotiff suite: 5037 passed, 68 skipped. The lone failure (test_lowlevel_write_pushdown_2138::test_write_vs_to_geotiff_byte_parity_uint8[lz4]) reproduces on main and is unrelated.
test_read_cog_http_perf_with_mock_rtt (rtts_saved >= 5) still passes, so the per-call _reader.X indirection does not measurably slow the coalesced HTTP path.
The lazy from . import _reader is the right call for a circular-import setup. Doing it at module top would break load order.
Docstrings preserved verbatim.
Public import surface unchanged.

Checklist

Algorithm matches reference: N/A, no logic changes
Backend parity: N/A, no backend changes
NaN handling: unchanged
Edge cases: covered by existing tests
Dask chunk boundaries: unchanged
No premature materialization or copies: diff is move-only
Benchmark: N/A
README feature matrix: N/A
Docstrings: preserved verbatim

* Move `_HTTPSource` to `TYPE_CHECKING` in `_cog_http.py` so the only module-level binding of the name is a type annotation; the runtime collaborator is always `_reader._HTTPSource`. * Drop direct `_apply_photometric_miniswhite` / `_decode_strip_or_tile` imports from `_cog_http.py` since both are looked up through `_reader` at call time. Comment in the import block explains which names from `_decode` are still used directly. * Add a module-level note about why `from . import _reader` is lazy (circular load order) and an index of every name looked up through the `_reader` namespace for monkeypatch reasons. * Add a comment in `_parse_cog_http_meta` explaining the one-shot capture of `INITIAL_HTTP_HEADER_BYTES` / `MAX_HTTP_HEADER_BYTES` is behaviourally equivalent to the original per-iteration global read (the constants are never mutated mid-call). * Add a comment in `__init__.py` explaining why this site imports `_parse_cog_http_meta` from `_cog_http` directly while `_backends/dask.py` keeps `_HTTPSource` going through `_reader`.

…2258) The previous nit-fix commit moved `_HTTPSource` under TYPE_CHECKING, which broke `typing.get_type_hints()` against `_parse_cog_http_meta`, `_fetch_decode_cog_http_strips`, and `_fetch_decode_cog_http_tiles`. The function objects are also re-exported through `_reader`, so the breakage propagated to `_reader._parse_cog_http_meta` etc. as well. Nothing in the codebase calls `get_type_hints` on these helpers today, but the original `_reader.py` carried a real `_HTTPSource` import, so removing it was a subtle behavioural regression rather than a cleanup. Restore the runtime import and replace the TYPE_CHECKING block with a comment that explains why the name lives in the imports but the live runtime collaborator is always `_reader._HTTPSource`.

brendancol

PR Review (pass 2): Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py

Follow-up after 3cfa47d0 (TYPE_CHECKING + comments) and 3ead2dcc (revert of the TYPE_CHECKING move).

Blockers

None.

Suggestions

None.

Nits

Honest correction to pass 1: the TYPE_CHECKING suggestion I made was wrong. With _HTTPSource under TYPE_CHECKING, typing.get_type_hints() against _parse_cog_http_meta / _fetch_decode_cog_http_strips / _fetch_decode_cog_http_tiles raises NameError: name '_HTTPSource' is not defined, and because the function objects are shared with the _reader re-exports the breakage propagates there too. 3ead2dcc restores the runtime import and uses an inline comment instead. Nothing in the codebase calls get_type_hints on these helpers today, but the original _reader.py carried a real import and a refactor should not silently degrade introspection. Noting for the audit trail.
Optional polish: now that the module docstring carries the patchable-names index, the four per-function "Resolve through the _reader module..." comment blocks could shrink to a one-liner pointing back at the docstring. The current per-call comments do name the specific test that drives each indirection, which is useful, so this is genuinely optional.

What looks good

Type-hint introspection now matches pre-refactor behavior.
Module docstring has a single-source-of-truth list of names that route through _reader. No grep required.
__init__.py:216 carries an explicit comment about why this site imports _parse_cog_http_meta from _cog_http directly while the dask backend keeps _HTTPSource going through _reader.
Full geotiff suite still passes (5037 passed, 68 skipped, lz4 failure pre-existing on main).
All monkeypatch-sensitive tests still pass.

Checklist

Logic preserved (diff-verified function-by-function in pass 1)
Monkeypatch contract preserved
Type-hint introspection preserved
Public import surface preserved
Tests cover every patch site
Docstrings preserved verbatim

github-actions Bot added the performance PR touches performance-sensitive code label May 21, 2026

brendancol commented May 21, 2026

View reviewed changes

brendancol added 2 commits May 21, 2026 11:06

brendancol commented May 21, 2026

View reviewed changes

brendancol merged commit 488f58a into main May 21, 2026
5 checks passed

brendancol mentioned this pull request May 21, 2026

Refactor GeoTIFF internals around shared read/write contracts #2211

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor GeoTIFF Phase 5d: extract _cog_http.py from _reader.py#2262

Refactor GeoTIFF Phase 5d: extract _cog_http.py from _reader.py#2262
brendancol merged 3 commits into
mainfrom
issue-2258-cog-http-extraction

brendancol commented May 21, 2026

Uh oh!

brendancol left a comment

Uh oh!

brendancol left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brendancol commented May 21, 2026

Summary

Monkeypatch contract: option (a), keep the _reader namespace re-export

Test plan

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review: Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py

Blockers

Suggestions

Nits

What looks good

Checklist

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review (pass 2): Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py

Blockers

Suggestions

Nits

What looks good

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Monkeypatch contract: option (a), keep the `_reader` namespace re-export