Skip to content

Stream HTTP read_range body so misbehaving servers cannot OOM (#2264)#2273

Merged
brendancol merged 2 commits into
mainfrom
issue-2264
May 21, 2026
Merged

Stream HTTP read_range body so misbehaving servers cannot OOM (#2264)#2273
brendancol merged 2 commits into
mainfrom
issue-2264

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

  • Switch _HTTPSource.read_range() to streaming (preload_content=False) and cap the body on the wire with _read_capped, matching the read_all pattern from [Security] _HTTPSource.read_all() is unbounded — small declared raster can pull huge HTTP body #2051. Before this change urllib3 buffered the whole response into resp.data before the data[:length] slice in _validate_range_response ran, so a server that ignored Range and returned a 200 with a multi-GiB body still OOM'd the process. MAX_HTTP_HEADER_BYTES was advisory, not enforcing.
  • For the 200-without-Content-Range fallback (server ignored Range on a start=0 prefetch), the legitimate "small file served whole" case still works, but the slack is capped at a tight absolute (_RANGE_IGNORED_FULL_OBJECT_CAP = 16 MiB). The validator still slices that fallback body down to length for the caller.
  • Two enforcement paths: Content-Length past the cap is rejected on the preflight before any body bytes are read; chunked or missing Content-Length is aborted mid-stream by _read_capped. The 206-with-Content-Range path keeps its existing behaviour but now also goes through the streaming path so the cap is uniform.
  • HTTP-only path. No backend dispatch involved (numpy / cupy / dask+numpy / dask+cupy all share this code through the geotiff reader).

Test plan

  • pytest xrspatial/geotiff/tests/test_http_range_validation_1735.py: 10 passing. Covers honest-oversize-rejected, slack-fallback-sliced, short-body-returned, dishonest-oversize-streamed-and-capped, the preload_content=False wire contract, and the pre-existing 1735 regression cases.
  • pytest xrspatial/geotiff/tests/test_sidecar_ovr_2112.py: 28 passing. Includes the SimpleHTTPRequestHandler-ignores-Range scenario that drove the slack-cap design.
  • pytest xrspatial/geotiff/tests/test_ssrf_hardening_1664.py test_dns_rebinding_pin_issue_1846.py test_http_no_stdlib_fallback_2050.py test_http_meta_buffer_1718.py: 68 passing. Mock-fixture updates absorbed cleanly.
  • Full pytest xrspatial/geotiff/tests/: 5041 passing. One pre-existing lz4 failure on main unrelated to this change.

Closes #2264.

`_HTTPSource.read_range()` used `preload_content=True` (the urllib3
default), so the whole response body landed in `resp.data` before the
`data[:length]` slice in `_validate_range_response` ever ran. A server
that ignored `Range` and returned a 200 with a multi-GiB body therefore
defeated the OOM guard the slice comment claimed: a 16 KiB metadata
prefetch against a 2 GiB body still pulled 2 GiB into memory before the
slice fired. `MAX_HTTP_HEADER_BYTES` was advisory, not enforcing.

Change `read_range` to mirror the streaming pattern `read_all` adopted
in #2051: request the body with `preload_content=False`, preflight the
advertised `Content-Length`, then drain with `_read_capped` so the cap
is enforced on the wire. The 206-with-Content-Range path keeps its
existing validator. The 200-without-Content-Range fallback (server
ignored `Range`) is allowed up to a tight absolute slack
(`_RANGE_IGNORED_FULL_OBJECT_CAP = 16 MiB`); larger bodies are rejected
on the preflight or aborted mid-stream. The validator still slices the
fallback body down to `length` for callers.

Updates the mock `_MockResp` / `_MockPoolResponse` test fixtures in
test_http_no_stdlib_fallback_2050.py, test_ssrf_hardening_1664.py, and
test_dns_rebinding_pin_issue_1846.py to expose `stream()`,
`release_conn()`, and `Content-Length` so they remain compatible with
the streaming code path. Updates the existing read_range_validation
suite to cover both halves of the new contract -- honest oversize via
Content-Length preflight, dishonest oversize via the streaming cap --
and pins the new wire-level requirement (`preload_content=False`).

Closes #2264.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 21, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Stream HTTP read_range body so misbehaving servers cannot OOM (#2264)

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

  • xrspatial/geotiff/_sources.py:1005 - read_range has no docstring on _HTTPSource. The base Source.read_range is documented at the module level (line 9), but the HTTP override now carries non-trivial extra contract (Content-Length preflight, streaming cap, 16 MiB slack for start=0 fallback, OSError on overshoot). A short docstring summarising "raises OSError when the server's body is past the byte budget for this call" would save the next reader from chasing comments.

  • xrspatial/geotiff/_sources.py:1137-1142 - The slice comment in _validate_range_response still says "a 16 KiB prefetch against a 2 GiB object doesn't drag the whole thing into memory". The whole point of this PR is that the slice never actually achieved that. With the new cap the worst case is _RANGE_IGNORED_FULL_OBJECT_CAP (16 MiB) of transient buffer, not 2 GiB. Update the comment so it does not contradict the new behaviour.

  • xrspatial/geotiff/_sources.py:1310-1312 - _read_capped's OSError message refers only to "Issue #2051". When reached from read_range (the new caller) the trail leads readers to the wrong issue. Either reference both issues or drop the specific number and rely on the call site's message for context.

  • xrspatial/geotiff/_sources.py:1003 - _RANGE_IGNORED_FULL_OBJECT_CAP = 16 MiB is the maximum transient buffer per call. With read_ranges running up to 8 workers concurrently, the worst-case transient is ~128 MiB if every worker hits a Range-blind server. The current code is still vastly better than the pre-#2264 behaviour, but it's worth noting this in the constant's docstring or in read_ranges so future tuning is informed.

Nits (optional improvements)

  • xrspatial/geotiff/_sources.py:995-1003 - _RANGE_IGNORED_FULL_OBJECT_CAP is a class attribute, which is fine, but it sits next to other module-level HTTP knobs (_HTTP_MAX_REDIRECTS, INITIAL_HTTP_HEADER_BYTES in _cog_http). Promoting it to module-level would help the next person tweaking HTTP behaviour find it. Class-level does make the monkeypatch.setattr(_HTTPSource, ...) pattern in the tests cleaner, so it's a judgement call.

  • xrspatial/geotiff/_sources.py:1024-1045 - The try/finally around _check_range_content_length and _read_capped runs resp.release_conn() in finally, then _validate_range_response reads resp.status and resp.headers after release_conn. This works (urllib3 keeps status/headers populated after release), but the flow is slightly non-obvious. Either copy status and content_range into locals before the finally, or move _validate_range_response inside the try block. The read_all method has the same shape, so consistency is also fine.

What looks good

  • The streaming-with-preflight pattern matches the read_all design from #2051. Symmetry across the two HTTP read paths is easier to audit than two different fixes.
  • The test plan covers four interesting cases: honest oversize via Content-Length, dishonest oversize via chunked, legitimate small-file fallback, and the preload_content=False wire contract. The wire-contract test in particular is the right kind of regression bait - a future refactor that flipped the default would fail it immediately.
  • The mock-fixture updates in test_ssrf_hardening_1664.py and test_dns_rebinding_pin_issue_1846.py are scoped and minimal: just stream(), release_conn(), and Content-Length. No drive-by changes to the SSRF or DNS-rebinding test logic.
  • The existing test test_range_ignored_200_with_full_body_is_sliced_to_length was correctly renamed and updated to reflect the new behaviour (now test_range_ignored_200_full_object_sliced_within_cap). The old test name was the giveaway: it asserted slicing happened, not that buffering was bounded.

Checklist

  • Algorithm matches reference/paper - HTTP-only path, no algorithm involved.
  • All implemented backends produce consistent results - HTTP-only, no backend dispatch.
  • NaN handling is correct - N/A, byte-level code.
  • Edge cases are covered by tests - empty body, short body, oversize body, chunked oversize, zero-length, nonzero start.
  • Dask chunk boundaries handled correctly - N/A.
  • No premature materialization or unnecessary copies - _read_capped joins chunks once at end.
  • Benchmark exists or is not needed - not needed for a defensive HTTP fix.
  • README feature matrix updated (if applicable) - N/A.
  • Docstrings present and accurate - partially; see suggestion on read_range docstring.

…2264)

- Add a docstring to ``_HTTPSource.read_range`` covering the post-#2264
  contract (Content-Length preflight, streaming cap, full-object slack
  for ``start=0`` fallback, ``OSError`` on overshoot). The base
  ``Source.read_range`` is documented at module level but the HTTP
  override now carries non-trivial extra contract that deserves its own
  signpost.
- Expand the ``_RANGE_IGNORED_FULL_OBJECT_CAP`` docstring to note the
  worst-case concurrent transient under ``read_ranges`` (~128 MiB with
  the default 8-worker pool), so future tuning of ``max_workers`` is
  informed.
- Update the stale slice comment in ``_validate_range_response`` so it
  no longer claims to protect against a 2 GiB body; the new cap puts
  the worst case at ``_RANGE_IGNORED_FULL_OBJECT_CAP`` (16 MiB).
- Update ``_read_capped``'s OSError message to reference both #2051 and
  #2264, since ``read_range`` is now a caller too.
- Move ``_validate_range_response`` inside the ``try`` block so the
  status/headers reads happen before ``release_conn`` runs. The
  previous ordering worked (urllib3 keeps those populated post-release)
  but was easy to misread.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review after edc3313

Reviewed the second commit. Verdict: ready to merge once CI is green.

Disposition of original findings

  • Suggestion (no docstring on _HTTPSource.read_range) -- fixed at _sources.py:1012-1034. Docstring lists status / Content-Length / Content-Range axes and names the OSError contract.
  • Suggestion (stale 2 GiB comment in _validate_range_response) -- fixed. The comment now cites _RANGE_IGNORED_FULL_OBJECT_CAP (16 MiB) and references #2264.
  • Suggestion (_read_capped references only #2051) -- fixed. The OSError message now lists both #2051 and #2264.
  • Suggestion (document the 128 MiB worst-case under read_ranges) -- fixed in the constant's docstring.
  • Nit (release_conn / validator ordering) -- fixed. _validate_range_response now runs inside the try block; release_conn runs in finally after validation completes or raises.
  • Nit (class-level vs module-level constant) -- left as-is. Class-level keeps the monkeypatch.setattr(_HTTPSource, ...) test pattern clean, which the test suite relies on.

What looks good

  • Docstring is explicit about which budget applies in which branch, which is the part that took the longest to reason about from the original code.
  • The validator-inside-try refactor is a small readability win without changing behaviour. Easy to verify the response is alive when status/headers are read.

Outstanding

None blocking. CI is the last gate.

@brendancol brendancol merged commit 6426536 into main May 21, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTTP read_range buffers full response body before slicing -- OOM on misbehaving servers

1 participant