Skip to content

Refactor GeoTIFF Phase 5d: extract _cog_http.py from _reader.py#2262

Merged
brendancol merged 3 commits into
mainfrom
issue-2258-cog-http-extraction
May 21, 2026
Merged

Refactor GeoTIFF Phase 5d: extract _cog_http.py from _reader.py#2262
brendancol merged 3 commits into
mainfrom
issue-2258-cog-http-extraction

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

Phase 5d of the GeoTIFF refactor (#2211). Moves the COG-over-HTTP transport layer into a new _cog_http.py.

  • Moves _parse_cog_http_meta, _read_cog_http, _fetch_decode_cog_http_strips, _fetch_decode_cog_http_tiles plus INITIAL_HTTP_HEADER_BYTES / MAX_HTTP_HEADER_BYTES out of _reader.py.
  • _reader.py shrinks from 1258 to 429 lines.
  • Public import paths stay intact: _reader.py re-exports the moved names. _backends/dask.py and geotiff/__init__.py now pull collaborators (_decode, _layout, _cog_http, _sources) directly where the monkeypatch contract allows.

Monkeypatch contract: option (a), keep the _reader namespace re-export

The existing tests patch attributes on _reader and then call _reader._read_cog_http. Rewriting every patch site (option b) would have touched ~30 call sites across 7 test files for a one-line saving inside _cog_http. I went with option (a): _read_cog_http and _parse_cog_http_meta in _cog_http.py resolve patchable collaborators through the _reader module namespace at call time (from . import _reader; _reader._HTTPSource(url), etc.).

The names looked up via _reader at call time:

  • _HTTPSource
  • _parse_cog_http_meta
  • _fetch_decode_cog_http_tiles
  • _decode_strip_or_tile
  • _apply_photometric_miniswhite
  • MAX_HTTP_HEADER_BYTES
  • INITIAL_HTTP_HEADER_BYTES

_backends/dask.py still goes through _reader for _HTTPSource / _CloudSource construction because the dask graph builder creates the source before handing it to the metadata parser, and the dask-HTTP coalesce tests patch _reader._HTTPSource.

Test plan

  • pytest xrspatial/geotiff/tests/test_http_cog_coalesce.py -v, 11/11, including test_read_cog_http_perf_with_mock_rtt (rtts_saved >= 5).
  • pytest xrspatial/geotiff/tests/, 5037 passed, 68 skipped. The lone failure (test_lowlevel_write_pushdown_2138::test_write_vs_to_geotiff_byte_parity_uint8[lz4]) reproduces on main and is unrelated (lz4 experimental-codec gate).
  • Monkeypatch-sensitive suite: test_security.py, test_cog_http_close_on_error_1816.py, test_http_meta_buffer_1718.py, test_http_stripped_window_max_pixels_issue_A_1842.py, test_cog_http_parallel_decode_2026_05_15.py, test_parallel_strip_decode_2100.py, all green.

Closes #2258
Part of #2211

Move the COG-over-HTTP transport layer (`_parse_cog_http_meta`,
`_read_cog_http`, `_fetch_decode_cog_http_strips`,
`_fetch_decode_cog_http_tiles`) plus the `INITIAL_HTTP_HEADER_BYTES` /
`MAX_HTTP_HEADER_BYTES` constants into a new `_cog_http.py`. `_reader.py`
drops from 1258 to 429 lines.

The public import surface stays intact: `_reader.py` re-exports the
moved names so `from xrspatial.geotiff._reader import _read_cog_http`
and friends keep working unchanged.

`_read_cog_http` resolves `_HTTPSource`, `_parse_cog_http_meta`,
`_fetch_decode_cog_http_tiles`, `_decode_strip_or_tile`, and
`_apply_photometric_miniswhite` through the `_reader` module namespace
at call time so existing tests that
`monkeypatch.setattr(_reader, '_HTTPSource', ...)` (and friends)
continue to intercept the collaborators. `_parse_cog_http_meta` reads
`MAX_HTTP_HEADER_BYTES` and `INITIAL_HTTP_HEADER_BYTES` the same way so
`test_http_meta_buffer_1718`'s cap-shrinking patch keeps working.

Part of #2211.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 21, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py

Mechanical extraction. I diffed every moved function body byte-for-byte against HEAD~1:xrspatial/geotiff/_reader.py and the only deltas are the deliberate _reader.X indirection inserts called out in the PR body. Logic is unchanged.

Blockers

None.

Suggestions

  • xrspatial/geotiff/__init__.py:216: _parse_cog_http_meta is now imported from _cog_http directly, while _backends/dask.py keeps _HTTPSource going through _reader for the monkeypatch contract. Both choices are correct (no test patches _reader._parse_cog_http_meta on the fsspec open_geotiff(..., chunks=...) path), but the asymmetry is worth a one-line comment so a future reader does not "fix" the inconsistency.

  • xrspatial/geotiff/_cog_http.py:64: the module-level from ._sources import _HTTPSource binding is only used as a type annotation now. Every runtime use goes through _reader._HTTPSource. Move it under if TYPE_CHECKING: or add a comment, otherwise a reader wonders why the imported name and the live binding diverge.

Nits

  • xrspatial/geotiff/_cog_http.py:268: _parse_cog_http_meta captures _reader.INITIAL_HTTP_HEADER_BYTES and _reader.MAX_HTTP_HEADER_BYTES once at the top of the call. The original referenced them as module globals every loop iteration. Behavior is identical in practice (no test patches the constants mid-call), but a sentence in the comment saying the one-shot capture is intentional and equivalent would close the loop.

  • _cog_http.py does from . import _reader lazily in four spots (_parse_cog_http_meta, _read_cog_http, _decode_http_strip, _decode_one). Same reason every time: avoid circular import at module load. A single sentence at the top of the file ("see _read_cog_http for why the lazy import is here, and which collaborators get the _reader.X treatment") would let future readers skip the per-call comments after the first.

  • No new tests added. The existing fleet (test_http_cog_coalesce, test_security, test_cog_http_close_on_error_1816, test_http_meta_buffer_1718, test_http_stripped_window_max_pixels_issue_A_1842, test_cog_http_parallel_decode_2026_05_15, test_parallel_strip_decode_2100) already exercises every patch site, so the indirection contract is covered end-to-end.

What looks good

  • _reader.py: 1258 -> 429 lines.
  • Full geotiff suite: 5037 passed, 68 skipped. The lone failure (test_lowlevel_write_pushdown_2138::test_write_vs_to_geotiff_byte_parity_uint8[lz4]) reproduces on main and is unrelated.
  • test_read_cog_http_perf_with_mock_rtt (rtts_saved >= 5) still passes, so the per-call _reader.X indirection does not measurably slow the coalesced HTTP path.
  • The lazy from . import _reader is the right call for a circular-import setup. Doing it at module top would break load order.
  • Docstrings preserved verbatim.
  • Public import surface unchanged.

Checklist

  • Algorithm matches reference: N/A, no logic changes
  • Backend parity: N/A, no backend changes
  • NaN handling: unchanged
  • Edge cases: covered by existing tests
  • Dask chunk boundaries: unchanged
  • No premature materialization or copies: diff is move-only
  • Benchmark: N/A
  • README feature matrix: N/A
  • Docstrings: preserved verbatim

* Move `_HTTPSource` to `TYPE_CHECKING` in `_cog_http.py` so the only
  module-level binding of the name is a type annotation; the runtime
  collaborator is always `_reader._HTTPSource`.
* Drop direct `_apply_photometric_miniswhite` / `_decode_strip_or_tile`
  imports from `_cog_http.py` since both are looked up through `_reader`
  at call time. Comment in the import block explains which names from
  `_decode` are still used directly.
* Add a module-level note about why `from . import _reader` is lazy
  (circular load order) and an index of every name looked up through
  the `_reader` namespace for monkeypatch reasons.
* Add a comment in `_parse_cog_http_meta` explaining the one-shot
  capture of `INITIAL_HTTP_HEADER_BYTES` / `MAX_HTTP_HEADER_BYTES` is
  behaviourally equivalent to the original per-iteration global read
  (the constants are never mutated mid-call).
* Add a comment in `__init__.py` explaining why this site imports
  `_parse_cog_http_meta` from `_cog_http` directly while
  `_backends/dask.py` keeps `_HTTPSource` going through `_reader`.
…2258)

The previous nit-fix commit moved `_HTTPSource` under TYPE_CHECKING,
which broke `typing.get_type_hints()` against `_parse_cog_http_meta`,
`_fetch_decode_cog_http_strips`, and `_fetch_decode_cog_http_tiles`.
The function objects are also re-exported through `_reader`, so the
breakage propagated to `_reader._parse_cog_http_meta` etc. as well.

Nothing in the codebase calls `get_type_hints` on these helpers today,
but the original `_reader.py` carried a real `_HTTPSource` import, so
removing it was a subtle behavioural regression rather than a cleanup.

Restore the runtime import and replace the TYPE_CHECKING block with a
comment that explains why the name lives in the imports but the live
runtime collaborator is always `_reader._HTTPSource`.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review (pass 2): Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py

Follow-up after 3cfa47d0 (TYPE_CHECKING + comments) and 3ead2dcc (revert of the TYPE_CHECKING move).

Blockers

None.

Suggestions

None.

Nits

  • Honest correction to pass 1: the TYPE_CHECKING suggestion I made was wrong. With _HTTPSource under TYPE_CHECKING, typing.get_type_hints() against _parse_cog_http_meta / _fetch_decode_cog_http_strips / _fetch_decode_cog_http_tiles raises NameError: name '_HTTPSource' is not defined, and because the function objects are shared with the _reader re-exports the breakage propagates there too. 3ead2dcc restores the runtime import and uses an inline comment instead. Nothing in the codebase calls get_type_hints on these helpers today, but the original _reader.py carried a real import and a refactor should not silently degrade introspection. Noting for the audit trail.

  • Optional polish: now that the module docstring carries the patchable-names index, the four per-function "Resolve through the _reader module..." comment blocks could shrink to a one-liner pointing back at the docstring. The current per-call comments do name the specific test that drives each indirection, which is useful, so this is genuinely optional.

What looks good

  • Type-hint introspection now matches pre-refactor behavior.
  • Module docstring has a single-source-of-truth list of names that route through _reader. No grep required.
  • __init__.py:216 carries an explicit comment about why this site imports _parse_cog_http_meta from _cog_http directly while the dask backend keeps _HTTPSource going through _reader.
  • Full geotiff suite still passes (5037 passed, 68 skipped, lz4 failure pre-existing on main).
  • All monkeypatch-sensitive tests still pass.

Checklist

  • Logic preserved (diff-verified function-by-function in pass 1)
  • Monkeypatch contract preserved
  • Type-hint introspection preserved
  • Public import surface preserved
  • Tests cover every patch site
  • Docstrings preserved verbatim

@brendancol brendancol merged commit 488f58a into main May 21, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor GeoTIFF Phase 5d: extract _cog_http.py from _reader.py (PR-J of #2211)

1 participant