Refactor GeoTIFF Phase 5d: extract _cog_http.py from _reader.py#2262
Conversation
Move the COG-over-HTTP transport layer (`_parse_cog_http_meta`, `_read_cog_http`, `_fetch_decode_cog_http_strips`, `_fetch_decode_cog_http_tiles`) plus the `INITIAL_HTTP_HEADER_BYTES` / `MAX_HTTP_HEADER_BYTES` constants into a new `_cog_http.py`. `_reader.py` drops from 1258 to 429 lines. The public import surface stays intact: `_reader.py` re-exports the moved names so `from xrspatial.geotiff._reader import _read_cog_http` and friends keep working unchanged. `_read_cog_http` resolves `_HTTPSource`, `_parse_cog_http_meta`, `_fetch_decode_cog_http_tiles`, `_decode_strip_or_tile`, and `_apply_photometric_miniswhite` through the `_reader` module namespace at call time so existing tests that `monkeypatch.setattr(_reader, '_HTTPSource', ...)` (and friends) continue to intercept the collaborators. `_parse_cog_http_meta` reads `MAX_HTTP_HEADER_BYTES` and `INITIAL_HTTP_HEADER_BYTES` the same way so `test_http_meta_buffer_1718`'s cap-shrinking patch keeps working. Part of #2211.
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py
Mechanical extraction. I diffed every moved function body byte-for-byte against HEAD~1:xrspatial/geotiff/_reader.py and the only deltas are the deliberate _reader.X indirection inserts called out in the PR body. Logic is unchanged.
Blockers
None.
Suggestions
-
xrspatial/geotiff/__init__.py:216:_parse_cog_http_metais now imported from_cog_httpdirectly, while_backends/dask.pykeeps_HTTPSourcegoing through_readerfor the monkeypatch contract. Both choices are correct (no test patches_reader._parse_cog_http_metaon the fsspecopen_geotiff(..., chunks=...)path), but the asymmetry is worth a one-line comment so a future reader does not "fix" the inconsistency. -
xrspatial/geotiff/_cog_http.py:64: the module-levelfrom ._sources import _HTTPSourcebinding is only used as a type annotation now. Every runtime use goes through_reader._HTTPSource. Move it underif TYPE_CHECKING:or add a comment, otherwise a reader wonders why the imported name and the live binding diverge.
Nits
-
xrspatial/geotiff/_cog_http.py:268:_parse_cog_http_metacaptures_reader.INITIAL_HTTP_HEADER_BYTESand_reader.MAX_HTTP_HEADER_BYTESonce at the top of the call. The original referenced them as module globals every loop iteration. Behavior is identical in practice (no test patches the constants mid-call), but a sentence in the comment saying the one-shot capture is intentional and equivalent would close the loop. -
_cog_http.pydoesfrom . import _readerlazily in four spots (_parse_cog_http_meta,_read_cog_http,_decode_http_strip,_decode_one). Same reason every time: avoid circular import at module load. A single sentence at the top of the file ("see_read_cog_httpfor why the lazy import is here, and which collaborators get the_reader.Xtreatment") would let future readers skip the per-call comments after the first. -
No new tests added. The existing fleet (
test_http_cog_coalesce,test_security,test_cog_http_close_on_error_1816,test_http_meta_buffer_1718,test_http_stripped_window_max_pixels_issue_A_1842,test_cog_http_parallel_decode_2026_05_15,test_parallel_strip_decode_2100) already exercises every patch site, so the indirection contract is covered end-to-end.
What looks good
_reader.py: 1258 -> 429 lines.- Full geotiff suite: 5037 passed, 68 skipped. The lone failure (
test_lowlevel_write_pushdown_2138::test_write_vs_to_geotiff_byte_parity_uint8[lz4]) reproduces onmainand is unrelated. test_read_cog_http_perf_with_mock_rtt(rtts_saved >= 5) still passes, so the per-call_reader.Xindirection does not measurably slow the coalesced HTTP path.- The lazy
from . import _readeris the right call for a circular-import setup. Doing it at module top would break load order. - Docstrings preserved verbatim.
- Public import surface unchanged.
Checklist
- Algorithm matches reference: N/A, no logic changes
- Backend parity: N/A, no backend changes
- NaN handling: unchanged
- Edge cases: covered by existing tests
- Dask chunk boundaries: unchanged
- No premature materialization or copies: diff is move-only
- Benchmark: N/A
- README feature matrix: N/A
- Docstrings: preserved verbatim
* Move `_HTTPSource` to `TYPE_CHECKING` in `_cog_http.py` so the only module-level binding of the name is a type annotation; the runtime collaborator is always `_reader._HTTPSource`. * Drop direct `_apply_photometric_miniswhite` / `_decode_strip_or_tile` imports from `_cog_http.py` since both are looked up through `_reader` at call time. Comment in the import block explains which names from `_decode` are still used directly. * Add a module-level note about why `from . import _reader` is lazy (circular load order) and an index of every name looked up through the `_reader` namespace for monkeypatch reasons. * Add a comment in `_parse_cog_http_meta` explaining the one-shot capture of `INITIAL_HTTP_HEADER_BYTES` / `MAX_HTTP_HEADER_BYTES` is behaviourally equivalent to the original per-iteration global read (the constants are never mutated mid-call). * Add a comment in `__init__.py` explaining why this site imports `_parse_cog_http_meta` from `_cog_http` directly while `_backends/dask.py` keeps `_HTTPSource` going through `_reader`.
…2258) The previous nit-fix commit moved `_HTTPSource` under TYPE_CHECKING, which broke `typing.get_type_hints()` against `_parse_cog_http_meta`, `_fetch_decode_cog_http_strips`, and `_fetch_decode_cog_http_tiles`. The function objects are also re-exported through `_reader`, so the breakage propagated to `_reader._parse_cog_http_meta` etc. as well. Nothing in the codebase calls `get_type_hints` on these helpers today, but the original `_reader.py` carried a real `_HTTPSource` import, so removing it was a subtle behavioural regression rather than a cleanup. Restore the runtime import and replace the TYPE_CHECKING block with a comment that explains why the name lives in the imports but the live runtime collaborator is always `_reader._HTTPSource`.
brendancol
left a comment
There was a problem hiding this comment.
PR Review (pass 2): Refactor GeoTIFF Phase 5d, extract _cog_http.py from _reader.py
Follow-up after 3cfa47d0 (TYPE_CHECKING + comments) and 3ead2dcc (revert of the TYPE_CHECKING move).
Blockers
None.
Suggestions
None.
Nits
-
Honest correction to pass 1: the TYPE_CHECKING suggestion I made was wrong. With
_HTTPSourceunderTYPE_CHECKING,typing.get_type_hints()against_parse_cog_http_meta/_fetch_decode_cog_http_strips/_fetch_decode_cog_http_tilesraisesNameError: name '_HTTPSource' is not defined, and because the function objects are shared with the_readerre-exports the breakage propagates there too.3ead2dccrestores the runtime import and uses an inline comment instead. Nothing in the codebase callsget_type_hintson these helpers today, but the original_reader.pycarried a real import and a refactor should not silently degrade introspection. Noting for the audit trail. -
Optional polish: now that the module docstring carries the patchable-names index, the four per-function "Resolve through the _reader module..." comment blocks could shrink to a one-liner pointing back at the docstring. The current per-call comments do name the specific test that drives each indirection, which is useful, so this is genuinely optional.
What looks good
- Type-hint introspection now matches pre-refactor behavior.
- Module docstring has a single-source-of-truth list of names that route through
_reader. No grep required. __init__.py:216carries an explicit comment about why this site imports_parse_cog_http_metafrom_cog_httpdirectly while the dask backend keeps_HTTPSourcegoing through_reader.- Full geotiff suite still passes (5037 passed, 68 skipped, lz4 failure pre-existing on main).
- All monkeypatch-sensitive tests still pass.
Checklist
- Logic preserved (diff-verified function-by-function in pass 1)
- Monkeypatch contract preserved
- Type-hint introspection preserved
- Public import surface preserved
- Tests cover every patch site
- Docstrings preserved verbatim
Summary
Phase 5d of the GeoTIFF refactor (#2211). Moves the COG-over-HTTP transport layer into a new
_cog_http.py._parse_cog_http_meta,_read_cog_http,_fetch_decode_cog_http_strips,_fetch_decode_cog_http_tilesplusINITIAL_HTTP_HEADER_BYTES/MAX_HTTP_HEADER_BYTESout of_reader.py._reader.pyshrinks from 1258 to 429 lines._reader.pyre-exports the moved names._backends/dask.pyandgeotiff/__init__.pynow pull collaborators (_decode,_layout,_cog_http,_sources) directly where the monkeypatch contract allows.Monkeypatch contract: option (a), keep the
_readernamespace re-exportThe existing tests patch attributes on
_readerand then call_reader._read_cog_http. Rewriting every patch site (option b) would have touched ~30 call sites across 7 test files for a one-line saving inside_cog_http. I went with option (a):_read_cog_httpand_parse_cog_http_metain_cog_http.pyresolve patchable collaborators through the_readermodule namespace at call time (from . import _reader; _reader._HTTPSource(url), etc.).The names looked up via
_readerat call time:_HTTPSource_parse_cog_http_meta_fetch_decode_cog_http_tiles_decode_strip_or_tile_apply_photometric_miniswhiteMAX_HTTP_HEADER_BYTESINITIAL_HTTP_HEADER_BYTES_backends/dask.pystill goes through_readerfor_HTTPSource/_CloudSourceconstruction because the dask graph builder creates the source before handing it to the metadata parser, and the dask-HTTP coalesce tests patch_reader._HTTPSource.Test plan
pytest xrspatial/geotiff/tests/test_http_cog_coalesce.py -v, 11/11, includingtest_read_cog_http_perf_with_mock_rtt(rtts_saved >= 5).pytest xrspatial/geotiff/tests/, 5037 passed, 68 skipped. The lone failure (test_lowlevel_write_pushdown_2138::test_write_vs_to_geotiff_byte_parity_uint8[lz4]) reproduces onmainand is unrelated (lz4 experimental-codec gate).test_security.py,test_cog_http_close_on_error_1816.py,test_http_meta_buffer_1718.py,test_http_stripped_window_max_pixels_issue_A_1842.py,test_cog_http_parallel_decode_2026_05_15.py,test_parallel_strip_decode_2100.py, all green.Closes #2258
Part of #2211