bugfix: fix zstd module infinite loop when upstream return content-length abnormal#23
bugfix: fix zstd module infinite loop when upstream return content-length abnormal#23drawing wants to merge 1 commit into
Conversation
|
Ran into this bug and can confirm this fix worked! Would be nice to get it merged to save time for others :) |
|
Somebody merge it? |
|
this fix does not work here in combination with varnish. |
|
You have to disable buffering, then everything works as expected here - with varnish |
|
I encountered the same issue when I tried to use zstd buffering on a homeassistant proxy config. The homeassistant instance became unreachable and my nginx 1.29.3 completely froze up. After I disabled zstd for that specific site, everything seemed going back to normal. |
|
Maybe this should get merged in and a new release should happen. |
|
Is this repo still alive? |
|
we all hope so! |
@Stensel8, hello! I have same problem with my server and can't to find out what causing freezing with enabled ZSTD. |
My servers are mostly built with my own configs baked in: https://github.com/Stensel8/Scripts/tree/main/nginx See below: sten@sten-srv03:~$ cat /etc/nginx/conf.d/homeassistant.conf sten@sten-srv03:~$ @lowkeypriority, these are some earlier outputs from my nginx reverse proxy server which completely froze up. The issue was only happening when trying to compress proxied requests with the zstd module for nginx. The fix for me was disabling zstd on proxied hosts: The version and modules I had at that time, were the following: sten@sten-srv03:~$ nginx -V sten@sten-srv03:~$ The main nginx config, looked like this: I have migrated to Angie now, which I am testing out to see if I like it :) |
|
@tokers, is this repository still active? Or will it be merged into the mainline nginx at some point? Since the module gets older and it seems to be getting more conflicted on every new nginx release. I reallly, reaaallly like the work though. ZSTD is like the best encoding we can get :) |
Fixes GitHub issue #41 - worker processes stuck at 100% CPU usage Applied two critical upstream fixes: PR #23 (drawing, Jun 2023): Fix infinite loop when upstream returns abnormal content-length by allocating fresh buffer during flush cycles. Prevents memory recycling without re-buffering that caused CPU spinning. PR #49 (t0mtaylor, May 2026): Introduce action state machine (COMPRESS, FLUSH, END) to prevent premature END transitions and 131072-byte truncation. Gates END transition on: - Input buffer fully drained - No more chain links queued - Last flag set Also gates last_buf=1 on action==END to prevent early frame finalization. Impact: Resolves infinite loop where empty output buffers get re-processed without progress, causing 100% CPU burn. Prevents silent data truncation at 131072-byte boundaries on large streams. Verified: Both fixes integrate cleanly with existing codebase. Next: Build and test with nginx/angie packages.
Two new regression scripts targeted at the zstd body filter's action state machine and upstream flush handling — designed to survive the V2 sequencing (compressStream2 migration → retire action state machine → CCtx pool). compress-state-machine.sh — structural coverage of compress() entry points: empty body, single byte, 131072 boundary over HTTP/1.1, 200000 multi-redo, 3-request keep-alive (one TCP connection, num_connects=1 asserted), 10x parallel burst, disk-vs-proxy equivalence (same body served via static file alias and via proxy_pass with Accept-Encoding stripped upstream — decompressed outputs must match byte-for-byte). proxy-flush.sh — chunked / SSE / Connection: Upgrade upstream patterns from tokers#23 thread (mklooss, Stensel8, lowkeypriority). Python TCP fixture emits 6-chunk Transfer-Encoding chunked, 5-event SSE, and Upgrade stand-in payloads. Four sub-tests cover proxy_buffering on/off + SSE + Upgrade. Coverage scope is documented in each script header. Both pass on master baseline as regression nets; tightening to TDD-fail on the master flush- promotion gap requires a TTFB / latency assertion deferred to V2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate the two new bash regression scripts from the prior commit to
pytest, following the POC pattern in test_accept_encoding.py /
test_config_validation.py. The bash equivalents are removed in the same
commit — pytest is the strategic direction for complex network tests.
test_compress_state_machine.py — parametrized state machine coverage:
* basic[empty,single,h1-131072,h1-200000] — boundary inputs through
compress(), byte-equality round-trip
* disk_vs_proxy — same body via static alias (in_file buffer) and
proxy_pass (in-memory chain link) must decompress to identical bytes
* keep_alive_three — three requests on ONE http.client connection,
no state leak between them
* parallel_burst — 20 concurrent requests via ThreadPoolExecutor
test_proxy_flush.py — flush-promotion regression:
* Python TCP fixture server on :9000 (module-scoped pytest fixture),
threaded handler emits chunked / SSE / Upgrade upstream patterns
* Each handler records ground-truth bytes into a thread-safe dict so
the test byte-compares decompressed client response against the
upstream payload
* Sub-test chunked-on (proxy_buffering on) PASSES on master baseline.
Sub-tests chunked-off, sse, upgrade (proxy_buffering off) TIMEOUT
deterministically on master baseline — this reproduces Stensel8's
HomeAssistant freeze (PR tokers#23 thread) and is exactly the TDD signal
the 0.2.1 flush-promotion fix is meant to clear.
test_window_size.py — browser-compat regression for tokers#35:
* Verify response frame header Window Size <= 8 MiB (Chrome's hard
decompression-memory ceiling). Default module config (windowLog=19,
512 KiB) passes; regression net against any future change to a
browser-hostile default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pytest test_websocket.py exercises the full WebSocket protocol path through nginx with zstd on + proxy_buffering off — the production shape from Stensel8's HomeAssistant freeze report (PR tokers#23 thread). Two sub-tests: * test_websocket_through_zstd_filter — full RFC 6455 handshake + bidirectional echo (4 client→server, 4 echo replies + 1 server- pushed initial frame). Hard 5-second async timeout catches the freeze symptom. * test_websocket_handshake_speed — cheaper canary: 101 Switching Protocols must arrive within 1 second; freeze at the Upgrade response (body_filter on zero-body 101) would push this over. Uses python3-websockets (apt-installed 10.4) for both client and the upstream echo server fixture — avoids ~100 LoC of hand-rolled RFC 6455 framing. Dockerfile.{dynamic,brotli,asan,valgrind} updated to install python3-websockets alongside the existing pytest / requests / zstandard packages. **Status on test1 baseline**: both sub-tests PASS. Basic protocol shape works under master's compress filter — useful regression net for V2 compressStream2 migration + Option γ action-machine rewrite. Does NOT reproduce Stensel8's specific freeze. His setup adds HTTP/2 + HTTP/3 to the client side plus permessage-deflate / HomeAssistant's specific frame patterns — V2 hardening will need HTTP/2 axis + extension negotiation. Bug #4b coverage section in master-bugs.md documents this gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New Dockerfile.angie builds angie + zstd dynamic module from source on Ubuntu (22.04/24.04/26.04 via UBUNTU_VERSION build-arg). Angie 1.11.5 is built on nginx core 1.29.3 — the same base Stensel8 reported the WebSocket freeze on in PR tokers#23. Why build from source: the official angie docker image ships without `--with-compat`, so external modules built with --with-compat fail to load. Building both binary and module ourselves with matching flags sidesteps this. No public angie apt repo for noble was found at probe time. Why excluded from ALL_VARIANTS: verified empirically (2026-05-16) that test_proxy_flush + test_http2_proxy_flush reproduce bug #4b with identical 3/4 TIMEOUT signature on nginx 1.31.0 and angie 1.11.5 — the flush-promotion gap lives in nginx core 1.29.3 code, inherited unchanged by angie. Adding angie to the default test matrix would double CI cost for zero unique signal. The dispatch entries remain so `bash t/build.sh angie-24.04` works for on-demand verification of fixes against the production-shape stack some users have migrated to. RFC 8441 (extended CONNECT for WS-over-HTTP/2) is NOT supported in angie 1.11.5 — verified by source inspection of src/http/v2/ngx_http_v2.c (returns 405 NOT ALLOWED for any HTTP/2 CONNECT, no `:protocol` pseudo-header handling). nginx mainline doesn't have it either. So this image doesn't unlock the Stensel8 WS-over-H2 axis either — that remains blocked on upstream RFC 8441 support. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lass Adds chunked-off-tiny sub-test reproducing the production freeze pattern from tokers#23 thread (mklooss 2025-03-20, Stensel8 2026-01-11, lowkeypriority 2026-02-02). Marked xfail (strict=True) until commit 3 retires the action state machine and adopts ngx_brotli's per-call-op pattern. Sub-test schedule (Tier-1 production-shape repro): - 50 chunks x 200 bytes random tail, 50 ms inter-chunk gap. - Total upstream window ~2.5 s, total body ~10 KiB. - proxy_buffering off; chunks reach the body filter with b->flush=1. - TTFB budget 400 ms; max_time_s 10. Discovery short-circuit: the existing chunked-off / sse / upgrade and test_http2_proxy_flush[chunked-off|sse|upgrade] sub-cases already FAIL with ReadTimeoutError on stable HEAD (full 10 s hang), so the bug class is already exercised by the harness. The chunked-off-tiny case is a tighter, focused production-shape repro for the action-state- machine flush-promotion bug. Baseline evidence (bash t/run.sh ubuntu-24.04 against stable c24d188): - test_proxy_flush[chunked-off-tiny] XFAIL (action-machine flush bug). - pytest summary: 10 failed, 76 passed, 3 skipped, 1 xfailed in 201 s. - Pre-existing failures (sse, upgrade, http2_proxy_flush[*], test_infinite_loop, test_h1/h2_roundtrip[131073|200000]) remain; none introduced by this commit. Production bug shape: TTFB ~ upstream window because ctx->flush is never promoted to ZSTD_flushStream when libzstd has not accumulated enough to spill (small chunks, rc==0 after ZSTD_compressStream).
Three bundled changes that together fix the production flush-promotion bug and finish the libzstd modern-API migration: 1. Per-call-op pattern (ngx_brotli model). Replaces ctx->action (COMPRESS/FLUSH/END enum + redo flag) with per-call ZSTD_EndDirective derivation from ctx->last / ctx->flush set on entry from ctx->in_buf's last_buf / flush. Reference: tmp/src/ngx_brotli/filter/ngx_http_brotli_filter_module.c:478-524. Zero-progress guard added as defense in depth (brotli line 520-524). Empty non-control bufs are now dropped in add_data to mirror brotli. 2. Modernize ZSTD_CCtx setup. Drops the legacy ZSTD_initCStream / ZSTD_initCStream_usingCDict path under #if ZSTD_VERSION_NUMBER < 10500. Now uniformly uses ZSTD_CCtx_reset(session_only) + ZSTD_CCtx_refCDict OR ZSTD_CCtx_setParameter(compressionLevel). Eliminates the -Wdeprecated-declarations warning for usingCDict on libzstd builds that mark it deprecated. 3. Compile-time guard: #error if libzstd < 1.4.0. Was de facto required by ZSTD_compressStream2 from the previous commit; now explicit. Static module unaffected (does not include zstd.h). Fixes the production flush-promotion bug from tokers#23 thread (mklooss 2025-03-20, Stensel8 2026-01-11, lowkeypriority 2026-02-02). Root cause: ctx->action only promoted COMPRESS -> FLUSH when libzstd reported pending output (rc>0). For small flush'd chunks where libzstd swallowed input with rc==0, no ZSTD_flushStream() was issued -- bytes stayed buffered, workers stalled until upstream closed. Pre-fix on stable HEAD: test_proxy_flush[chunked-off|sse|upgrade] + test_http2_proxy_flush[chunked-off|sse|upgrade] + test_infinite_loop all timed out at 10 s ReadTimeoutError. Post-fix on v2/compress-stream2: all eight previously failing cases plus the new chunked-off-tiny tier-1 repro now PASS within the 400- 600 ms TTFB budgets. ubuntu-24.04: 87 passed, 3 skipped. ubuntu-22.04 (libzstd 1.4.8): 86 passed, 4 skipped. ubuntu-24.04-shared-only: 87 passed, 3 skipped. ASan/UBSan: no diagnostics. Valgrind: no leaks beyond suppressions. ctx->last / ctx->flush retained as field names (no rename) so a partial revert of this commit leaves commit 2 compiling. Sticky-flag semantics documented at the field site. Closes the chunked-off-tiny xfail from the previous commit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the backend business returns a content-length exception, which the content-length is greater than the actual body size(for example, restart when the business return data is not completed), nginx will send a flush empty buffer, which will cause an infinite loop of zstd filter module.
for example, upstream return content: