Stream HTTP read_range body so misbehaving servers cannot OOM (#2264) by brendancol · Pull Request #2273 · xarray-contrib/xarray-spatial

brendancol · 2026-05-21T20:49:30Z

Summary

Switch _HTTPSource.read_range() to streaming (preload_content=False) and cap the body on the wire with _read_capped, matching the read_all pattern from [Security] _HTTPSource.read_all() is unbounded — small declared raster can pull huge HTTP body #2051. Before this change urllib3 buffered the whole response into resp.data before the data[:length] slice in _validate_range_response ran, so a server that ignored Range and returned a 200 with a multi-GiB body still OOM'd the process. MAX_HTTP_HEADER_BYTES was advisory, not enforcing.
For the 200-without-Content-Range fallback (server ignored Range on a start=0 prefetch), the legitimate "small file served whole" case still works, but the slack is capped at a tight absolute (_RANGE_IGNORED_FULL_OBJECT_CAP = 16 MiB). The validator still slices that fallback body down to length for the caller.
Two enforcement paths: Content-Length past the cap is rejected on the preflight before any body bytes are read; chunked or missing Content-Length is aborted mid-stream by _read_capped. The 206-with-Content-Range path keeps its existing behaviour but now also goes through the streaming path so the cap is uniform.
HTTP-only path. No backend dispatch involved (numpy / cupy / dask+numpy / dask+cupy all share this code through the geotiff reader).

Test plan

pytest xrspatial/geotiff/tests/test_http_range_validation_1735.py: 10 passing. Covers honest-oversize-rejected, slack-fallback-sliced, short-body-returned, dishonest-oversize-streamed-and-capped, the preload_content=False wire contract, and the pre-existing 1735 regression cases.
pytest xrspatial/geotiff/tests/test_sidecar_ovr_2112.py: 28 passing. Includes the SimpleHTTPRequestHandler-ignores-Range scenario that drove the slack-cap design.
pytest xrspatial/geotiff/tests/test_ssrf_hardening_1664.py test_dns_rebinding_pin_issue_1846.py test_http_no_stdlib_fallback_2050.py test_http_meta_buffer_1718.py: 68 passing. Mock-fixture updates absorbed cleanly.
Full pytest xrspatial/geotiff/tests/: 5041 passing. One pre-existing lz4 failure on main unrelated to this change.

Closes #2264.

`_HTTPSource.read_range()` used `preload_content=True` (the urllib3 default), so the whole response body landed in `resp.data` before the `data[:length]` slice in `_validate_range_response` ever ran. A server that ignored `Range` and returned a 200 with a multi-GiB body therefore defeated the OOM guard the slice comment claimed: a 16 KiB metadata prefetch against a 2 GiB body still pulled 2 GiB into memory before the slice fired. `MAX_HTTP_HEADER_BYTES` was advisory, not enforcing. Change `read_range` to mirror the streaming pattern `read_all` adopted in #2051: request the body with `preload_content=False`, preflight the advertised `Content-Length`, then drain with `_read_capped` so the cap is enforced on the wire. The 206-with-Content-Range path keeps its existing validator. The 200-without-Content-Range fallback (server ignored `Range`) is allowed up to a tight absolute slack (`_RANGE_IGNORED_FULL_OBJECT_CAP = 16 MiB`); larger bodies are rejected on the preflight or aborted mid-stream. The validator still slices the fallback body down to `length` for callers. Updates the mock `_MockResp` / `_MockPoolResponse` test fixtures in test_http_no_stdlib_fallback_2050.py, test_ssrf_hardening_1664.py, and test_dns_rebinding_pin_issue_1846.py to expose `stream()`, `release_conn()`, and `Content-Length` so they remain compatible with the streaming code path. Updates the existing read_range_validation suite to cover both halves of the new contract -- honest oversize via Content-Length preflight, dishonest oversize via the streaming cap -- and pins the new wire-level requirement (`preload_content=False`). Closes #2264.

brendancol

PR Review: Stream HTTP read_range body so misbehaving servers cannot OOM (#2264)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

xrspatial/geotiff/_sources.py:1005 - read_range has no docstring on _HTTPSource. The base Source.read_range is documented at the module level (line 9), but the HTTP override now carries non-trivial extra contract (Content-Length preflight, streaming cap, 16 MiB slack for start=0 fallback, OSError on overshoot). A short docstring summarising "raises OSError when the server's body is past the byte budget for this call" would save the next reader from chasing comments.
xrspatial/geotiff/_sources.py:1137-1142 - The slice comment in _validate_range_response still says "a 16 KiB prefetch against a 2 GiB object doesn't drag the whole thing into memory". The whole point of this PR is that the slice never actually achieved that. With the new cap the worst case is _RANGE_IGNORED_FULL_OBJECT_CAP (16 MiB) of transient buffer, not 2 GiB. Update the comment so it does not contradict the new behaviour.
xrspatial/geotiff/_sources.py:1310-1312 - _read_capped's OSError message refers only to "Issue #2051". When reached from read_range (the new caller) the trail leads readers to the wrong issue. Either reference both issues or drop the specific number and rely on the call site's message for context.
xrspatial/geotiff/_sources.py:1003 - _RANGE_IGNORED_FULL_OBJECT_CAP = 16 MiB is the maximum transient buffer per call. With read_ranges running up to 8 workers concurrently, the worst-case transient is ~128 MiB if every worker hits a Range-blind server. The current code is still vastly better than the pre-#2264 behaviour, but it's worth noting this in the constant's docstring or in read_ranges so future tuning is informed.

Nits (optional improvements)

xrspatial/geotiff/_sources.py:995-1003 - _RANGE_IGNORED_FULL_OBJECT_CAP is a class attribute, which is fine, but it sits next to other module-level HTTP knobs (_HTTP_MAX_REDIRECTS, INITIAL_HTTP_HEADER_BYTES in _cog_http). Promoting it to module-level would help the next person tweaking HTTP behaviour find it. Class-level does make the monkeypatch.setattr(_HTTPSource, ...) pattern in the tests cleaner, so it's a judgement call.
xrspatial/geotiff/_sources.py:1024-1045 - The try/finally around _check_range_content_length and _read_capped runs resp.release_conn() in finally, then _validate_range_response reads resp.status and resp.headers after release_conn. This works (urllib3 keeps status/headers populated after release), but the flow is slightly non-obvious. Either copy status and content_range into locals before the finally, or move _validate_range_response inside the try block. The read_all method has the same shape, so consistency is also fine.

What looks good

The streaming-with-preflight pattern matches the read_all design from #2051. Symmetry across the two HTTP read paths is easier to audit than two different fixes.
The test plan covers four interesting cases: honest oversize via Content-Length, dishonest oversize via chunked, legitimate small-file fallback, and the preload_content=False wire contract. The wire-contract test in particular is the right kind of regression bait - a future refactor that flipped the default would fail it immediately.
The mock-fixture updates in test_ssrf_hardening_1664.py and test_dns_rebinding_pin_issue_1846.py are scoped and minimal: just stream(), release_conn(), and Content-Length. No drive-by changes to the SSRF or DNS-rebinding test logic.
The existing test test_range_ignored_200_with_full_body_is_sliced_to_length was correctly renamed and updated to reflect the new behaviour (now test_range_ignored_200_full_object_sliced_within_cap). The old test name was the giveaway: it asserted slicing happened, not that buffering was bounded.

Checklist

Algorithm matches reference/paper - HTTP-only path, no algorithm involved.
All implemented backends produce consistent results - HTTP-only, no backend dispatch.
NaN handling is correct - N/A, byte-level code.
Edge cases are covered by tests - empty body, short body, oversize body, chunked oversize, zero-length, nonzero start.
Dask chunk boundaries handled correctly - N/A.
No premature materialization or unnecessary copies - _read_capped joins chunks once at end.
Benchmark exists or is not needed - not needed for a defensive HTTP fix.
README feature matrix updated (if applicable) - N/A.
Docstrings present and accurate - partially; see suggestion on read_range docstring.

…2264) - Add a docstring to ``_HTTPSource.read_range`` covering the post-#2264 contract (Content-Length preflight, streaming cap, full-object slack for ``start=0`` fallback, ``OSError`` on overshoot). The base ``Source.read_range`` is documented at module level but the HTTP override now carries non-trivial extra contract that deserves its own signpost. - Expand the ``_RANGE_IGNORED_FULL_OBJECT_CAP`` docstring to note the worst-case concurrent transient under ``read_ranges`` (~128 MiB with the default 8-worker pool), so future tuning of ``max_workers`` is informed. - Update the stale slice comment in ``_validate_range_response`` so it no longer claims to protect against a 2 GiB body; the new cap puts the worst case at ``_RANGE_IGNORED_FULL_OBJECT_CAP`` (16 MiB). - Update ``_read_capped``'s OSError message to reference both #2051 and #2264, since ``read_range`` is now a caller too. - Move ``_validate_range_response`` inside the ``try`` block so the status/headers reads happen before ``release_conn`` runs. The previous ordering worked (urllib3 keeps those populated post-release) but was easy to misread.

brendancol

Follow-up review after `edc3313`

Reviewed the second commit. Verdict: ready to merge once CI is green.

Disposition of original findings

Suggestion (no docstring on _HTTPSource.read_range) -- fixed at _sources.py:1012-1034. Docstring lists status / Content-Length / Content-Range axes and names the OSError contract.
Suggestion (stale 2 GiB comment in _validate_range_response) -- fixed. The comment now cites _RANGE_IGNORED_FULL_OBJECT_CAP (16 MiB) and references #2264.
Suggestion (_read_capped references only #2051) -- fixed. The OSError message now lists both #2051 and #2264.
Suggestion (document the 128 MiB worst-case under read_ranges) -- fixed in the constant's docstring.
Nit (release_conn / validator ordering) -- fixed. _validate_range_response now runs inside the try block; release_conn runs in finally after validation completes or raises.
Nit (class-level vs module-level constant) -- left as-is. Class-level keeps the monkeypatch.setattr(_HTTPSource, ...) test pattern clean, which the test suite relies on.

What looks good

Docstring is explicit about which budget applies in which branch, which is the part that took the longest to reason about from the original code.
The validator-inside-try refactor is a small readability win without changing behaviour. Easy to verify the response is alive when status/headers are read.

Outstanding

None blocking. CI is the last gate.

github-actions Bot added the performance PR touches performance-sensitive code label May 21, 2026

brendancol commented May 21, 2026

View reviewed changes

brendancol merged commit 6426536 into main May 21, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stream HTTP read_range body so misbehaving servers cannot OOM (#2264)#2273

Stream HTTP read_range body so misbehaving servers cannot OOM (#2264)#2273
brendancol merged 2 commits into
mainfrom
issue-2264

brendancol commented May 21, 2026

Uh oh!

brendancol left a comment

Uh oh!

brendancol left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

brendancol commented May 21, 2026

Summary

Test plan

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

PR Review: Stream HTTP read_range body so misbehaving servers cannot OOM (#2264)

Blockers (must fix before merge)

Suggestions (should fix, not blocking)

Nits (optional improvements)

What looks good

Checklist

Uh oh!

brendancol left a comment

Choose a reason for hiding this comment

Follow-up review after edc3313

Disposition of original findings

What looks good

Outstanding

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Follow-up review after `edc3313`