Skip to content

bugfix: fix zstd module infinite loop when upstream return content-length abnormal#23

Open
drawing wants to merge 1 commit into
tokers:masterfrom
drawing:branch_fix_endless_loop
Open

bugfix: fix zstd module infinite loop when upstream return content-length abnormal#23
drawing wants to merge 1 commit into
tokers:masterfrom
drawing:branch_fix_endless_loop

Conversation

@drawing
Copy link
Copy Markdown
Contributor

@drawing drawing commented Jun 12, 2023

When the backend business returns a content-length exception, which the content-length is greater than the actual body size(for example, restart when the business return data is not completed), nginx will send a flush empty buffer, which will cause an infinite loop of zstd filter module.

for example, upstream return content:

HTTP/1.1 200 OK
Content-Type: application/json;charset=UTF-8
Content-Length: 2

a

@ThinkChaos
Copy link
Copy Markdown

Ran into this bug and can confirm this fix worked!

Would be nice to get it merged to save time for others :)

@melroy89
Copy link
Copy Markdown

Somebody merge it?

@mklooss
Copy link
Copy Markdown

mklooss commented Mar 20, 2025

this fix does not work here in combination with varnish.
when varnish sends chunkes we get an error.
without this patch everything works as expected

@mklooss
Copy link
Copy Markdown

mklooss commented Apr 23, 2025

You have to disable buffering, then everything works as expected here - with varnish

@Stensel8
Copy link
Copy Markdown

I encountered the same issue when I tried to use zstd buffering on a homeassistant proxy config. The homeassistant instance became unreachable and my nginx 1.29.3 completely froze up.

After I disabled zstd for that specific site, everything seemed going back to normal.

@ncrosty58
Copy link
Copy Markdown

Maybe this should get merged in and a new release should happen.

@dolgovas
Copy link
Copy Markdown

dolgovas commented Feb 2, 2026

Is this repo still alive?

@zersh01
Copy link
Copy Markdown

zersh01 commented Feb 2, 2026

we all hope so!

@lowkeypriority
Copy link
Copy Markdown

I encountered the same issue when I tried to use zstd buffering on a homeassistant proxy config. The homeassistant instance became unreachable and my nginx 1.29.3 completely froze up.

After I disabled zstd for that specific site, everything seemed going back to normal.

@Stensel8, hello!
Can you provide the file/request which caused nginx to froze up?

I have same problem with my server and can't to find out what causing freezing with enabled ZSTD.

@Stensel8
Copy link
Copy Markdown

Stensel8 commented Feb 19, 2026

I encountered the same issue when I tried to use zstd buffering on a homeassistant proxy config. The homeassistant instance became unreachable and my nginx 1.29.3 completely froze up.
After I disabled zstd for that specific site, everything seemed going back to normal.

@Stensel8, hello! Can you provide the file/request which caused nginx to froze up?

I have same problem with my server and can't to find out what causing freezing with enabled ZSTD.

My servers are mostly built with my own configs baked in: https://github.com/Stensel8/Scripts/tree/main/nginx

See below:

sten@sten-srv03:~$ cat /etc/nginx/conf.d/homeassistant.conf

# /etc/nginx/conf.d/homeassistant.conf 

server {
    listen 80;
    listen [::]:80;
    server_name homeassistant-domain.nl;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    listen 443 quic;
    listen [::]:443 quic;

    http2 on;
    http3 on;

    server_name homeassistant-domain.nl;

    # CRITICAL FIX: Disable zstd for Home Assistant
    zstd off;
    gzip off;  # Disable compression entirely voor HA

    ssl_certificate     /etc/letsencrypt/live/homeassistant-domain.nl/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/homeassistant-domain.nl.nl/privkey.pem;

    ssl_stapling off;
    ssl_stapling_verify off;

    quic_retry on;
    quic_gso on;

    add_header Alt-Svc 'h3=":443"; ma=86400' always;
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header Referrer-Policy "no-referrer" always;

    client_max_body_size 50M;

    location / {
        proxy_pass http://192.168.0.54:8123;
        proxy_http_version 1.1;

        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';

        proxy_read_timeout 90m;   # Lange timeout terug
        proxy_connect_timeout 10s;
        proxy_send_timeout 90m;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_buffering off;
        proxy_redirect off;
    }
}

sten@sten-srv03:~$

@lowkeypriority, these are some earlier outputs from my nginx reverse proxy server which completely froze up. The issue was only happening when trying to compress proxied requests with the zstd module for nginx.

The fix for me was disabling zstd on proxied hosts:

    # CRITICAL FIX: Disable zstd for Home Assistant
    zstd off;
    gzip off;  # Disable compression entirely voor HA

The version and modules I had at that time, were the following:

sten@sten-srv03:~$ nginx -V

nginx version: nginx/1.29.3
built by gcc 15.2.1 20251111 (Red Hat 15.2.1-4) (GCC) 
built with OpenSSL 3.6.0 1 Oct 2025
TLS SNI support enabled
configure arguments: --prefix=/usr/local/nginx --sbin-path=/usr/sbin/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --pid-path=/run/nginx.pid --lock-path=/var/lock/nginx.lock --with-openssl=/root/nginx-build-20251116-160202/openssl --with-pcre=/root/nginx-build-20251116-160202/pcre2 --with-zlib=/root/nginx-build-20251116-160202/zlib --with-pcre-jit --with-http_ssl_module --with-http_v2_module --with-http_v3_module --with-http_gzip_static_module --with-http_stub_status_module --with-http_realip_module --with-http_sub_module --with-http_secure_link_module --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-stream_realip_module --with-file-aio --with-threads --modules-path=/etc/nginx/modules --add-dynamic-module=/root/nginx-build-20251116-160202/headers-more --add-dynamic-module=/root/nginx-build-20251116-160202/zstd-module

sten@sten-srv03:~$

The main nginx config, looked like this:
sten@sten-srv03:~$ cat /etc/nginx/nginx.conf


load_module /etc/nginx/modules/ngx_http_zstd_filter_module.so;
load_module /etc/nginx/modules/ngx_http_zstd_static_module.so;
load_module /etc/nginx/modules/ngx_http_headers_more_filter_module.so;

user nginx;
worker_processes 2;
worker_cpu_affinity auto;

error_log /var/log/nginx/error.log warn;
pid /run/nginx.pid;

worker_rlimit_nofile 8192;

events {
    worker_connections 1024;
    use epoll;
    multi_accept on;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    reset_timedout_connection on;

    # WebSocket upgrade map (nodig voor Jellyfin/Music/Status)
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server_tokens off;
    more_set_headers 'Server: nginx';

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    # Performance
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    # Compression
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level 6;
    gzip_types text/plain text/css text/xml text/javascript
               application/json application/javascript application/xml+rss
               application/rss+xml font/truetype font/opentype
               application/vnd.ms-fontobject image/svg+xml;


    # SSL - TLS 1.3 ONLY
    ssl_protocols TLSv1.3;
    ssl_prefer_server_ciphers off;
    ssl_session_cache shared:SSL:50m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;

    # HTTP/3 global
    quic_retry on;
    ssl_early_data off;

    include /etc/nginx/conf.d/*.conf;
}

I have migrated to Angie now, which I am testing out to see if I like it :)

@Stensel8
Copy link
Copy Markdown

Stensel8 commented May 6, 2026

@tokers, is this repository still active? Or will it be merged into the mainline nginx at some point?

Since the module gets older and it seems to be getting more conflicted on every new nginx release.

I reallly, reaaallly like the work though. ZSTD is like the best encoding we can get :)

eilandert referenced this pull request in eilandert/zstd-nginx-module May 10, 2026
Fixes GitHub issue #41 - worker processes stuck at 100% CPU usage

Applied two critical upstream fixes:

PR #23 (drawing, Jun 2023): Fix infinite loop when upstream returns
abnormal content-length by allocating fresh buffer during flush cycles.
Prevents memory recycling without re-buffering that caused CPU spinning.

PR #49 (t0mtaylor, May 2026): Introduce action state machine (COMPRESS,
FLUSH, END) to prevent premature END transitions and 131072-byte
truncation. Gates END transition on:
  - Input buffer fully drained
  - No more chain links queued
  - Last flag set
Also gates last_buf=1 on action==END to prevent early frame finalization.

Impact: Resolves infinite loop where empty output buffers get re-processed
without progress, causing 100% CPU burn. Prevents silent data truncation
at 131072-byte boundaries on large streams.

Verified: Both fixes integrate cleanly with existing codebase.
Next: Build and test with nginx/angie packages.
hsw added a commit to hsw/zstd-nginx-module that referenced this pull request May 16, 2026
Two new regression scripts targeted at the zstd body filter's action state
machine and upstream flush handling — designed to survive the V2 sequencing
(compressStream2 migration → retire action state machine → CCtx pool).

compress-state-machine.sh — structural coverage of compress() entry points:
  empty body, single byte, 131072 boundary over HTTP/1.1, 200000 multi-redo,
  3-request keep-alive (one TCP connection, num_connects=1 asserted), 10x
  parallel burst, disk-vs-proxy equivalence (same body served via static
  file alias and via proxy_pass with Accept-Encoding stripped upstream —
  decompressed outputs must match byte-for-byte).

proxy-flush.sh — chunked / SSE / Connection: Upgrade upstream patterns
  from tokers#23 thread (mklooss, Stensel8,
  lowkeypriority). Python TCP fixture emits 6-chunk Transfer-Encoding
  chunked, 5-event SSE, and Upgrade stand-in payloads. Four sub-tests
  cover proxy_buffering on/off + SSE + Upgrade.

Coverage scope is documented in each script header. Both pass on master
baseline as regression nets; tightening to TDD-fail on the master flush-
promotion gap requires a TTFB / latency assertion deferred to V2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hsw added a commit to hsw/zstd-nginx-module that referenced this pull request May 16, 2026
Migrate the two new bash regression scripts from the prior commit to
pytest, following the POC pattern in test_accept_encoding.py /
test_config_validation.py. The bash equivalents are removed in the same
commit — pytest is the strategic direction for complex network tests.

test_compress_state_machine.py — parametrized state machine coverage:
  * basic[empty,single,h1-131072,h1-200000] — boundary inputs through
    compress(), byte-equality round-trip
  * disk_vs_proxy — same body via static alias (in_file buffer) and
    proxy_pass (in-memory chain link) must decompress to identical bytes
  * keep_alive_three — three requests on ONE http.client connection,
    no state leak between them
  * parallel_burst — 20 concurrent requests via ThreadPoolExecutor

test_proxy_flush.py — flush-promotion regression:
  * Python TCP fixture server on :9000 (module-scoped pytest fixture),
    threaded handler emits chunked / SSE / Upgrade upstream patterns
  * Each handler records ground-truth bytes into a thread-safe dict so
    the test byte-compares decompressed client response against the
    upstream payload
  * Sub-test chunked-on (proxy_buffering on) PASSES on master baseline.
    Sub-tests chunked-off, sse, upgrade (proxy_buffering off) TIMEOUT
    deterministically on master baseline — this reproduces Stensel8's
    HomeAssistant freeze (PR tokers#23 thread) and is exactly the TDD signal
    the 0.2.1 flush-promotion fix is meant to clear.

test_window_size.py — browser-compat regression for tokers#35:
  * Verify response frame header Window Size <= 8 MiB (Chrome's hard
    decompression-memory ceiling). Default module config (windowLog=19,
    512 KiB) passes; regression net against any future change to a
    browser-hostile default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hsw added a commit to hsw/zstd-nginx-module that referenced this pull request May 16, 2026
Pytest test_websocket.py exercises the full WebSocket protocol path
through nginx with zstd on + proxy_buffering off — the production
shape from Stensel8's HomeAssistant freeze report (PR tokers#23 thread).

Two sub-tests:
  * test_websocket_through_zstd_filter — full RFC 6455 handshake +
    bidirectional echo (4 client→server, 4 echo replies + 1 server-
    pushed initial frame). Hard 5-second async timeout catches the
    freeze symptom.
  * test_websocket_handshake_speed — cheaper canary: 101 Switching
    Protocols must arrive within 1 second; freeze at the Upgrade
    response (body_filter on zero-body 101) would push this over.

Uses python3-websockets (apt-installed 10.4) for both client and the
upstream echo server fixture — avoids ~100 LoC of hand-rolled RFC 6455
framing.

Dockerfile.{dynamic,brotli,asan,valgrind} updated to install
python3-websockets alongside the existing pytest / requests /
zstandard packages.

**Status on test1 baseline**: both sub-tests PASS. Basic protocol shape
works under master's compress filter — useful regression net for V2
compressStream2 migration + Option γ action-machine rewrite.

Does NOT reproduce Stensel8's specific freeze. His setup adds HTTP/2 +
HTTP/3 to the client side plus permessage-deflate / HomeAssistant's
specific frame patterns — V2 hardening will need HTTP/2 axis +
extension negotiation. Bug #4b coverage section in master-bugs.md
documents this gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hsw added a commit to hsw/zstd-nginx-module that referenced this pull request May 16, 2026
New Dockerfile.angie builds angie + zstd dynamic module from source on
Ubuntu (22.04/24.04/26.04 via UBUNTU_VERSION build-arg). Angie 1.11.5
is built on nginx core 1.29.3 — the same base Stensel8 reported the
WebSocket freeze on in PR tokers#23.

Why build from source: the official angie docker image ships without
`--with-compat`, so external modules built with --with-compat fail to
load. Building both binary and module ourselves with matching flags
sidesteps this. No public angie apt repo for noble was found at probe
time.

Why excluded from ALL_VARIANTS: verified empirically (2026-05-16) that
test_proxy_flush + test_http2_proxy_flush reproduce bug #4b with
identical 3/4 TIMEOUT signature on nginx 1.31.0 and angie 1.11.5 — the
flush-promotion gap lives in nginx core 1.29.3 code, inherited
unchanged by angie. Adding angie to the default test matrix would
double CI cost for zero unique signal. The dispatch entries remain so
`bash t/build.sh angie-24.04` works for on-demand verification of
fixes against the production-shape stack some users have migrated to.

RFC 8441 (extended CONNECT for WS-over-HTTP/2) is NOT supported in
angie 1.11.5 — verified by source inspection of src/http/v2/ngx_http_v2.c
(returns 405 NOT ALLOWED for any HTTP/2 CONNECT, no `:protocol`
pseudo-header handling). nginx mainline doesn't have it either. So
this image doesn't unlock the Stensel8 WS-over-H2 axis either — that
remains blocked on upstream RFC 8441 support.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hsw added a commit to hsw/zstd-nginx-module that referenced this pull request May 18, 2026
…lass

Adds chunked-off-tiny sub-test reproducing the production freeze pattern
from tokers#23 thread (mklooss 2025-03-20, Stensel8
2026-01-11, lowkeypriority 2026-02-02). Marked xfail (strict=True)
until commit 3 retires the action state machine and adopts
ngx_brotli's per-call-op pattern.

Sub-test schedule (Tier-1 production-shape repro):
- 50 chunks x 200 bytes random tail, 50 ms inter-chunk gap.
- Total upstream window ~2.5 s, total body ~10 KiB.
- proxy_buffering off; chunks reach the body filter with b->flush=1.
- TTFB budget 400 ms; max_time_s 10.

Discovery short-circuit: the existing chunked-off / sse / upgrade and
test_http2_proxy_flush[chunked-off|sse|upgrade] sub-cases already FAIL
with ReadTimeoutError on stable HEAD (full 10 s hang), so the bug
class is already exercised by the harness. The chunked-off-tiny case
is a tighter, focused production-shape repro for the action-state-
machine flush-promotion bug.

Baseline evidence (bash t/run.sh ubuntu-24.04 against stable c24d188):
- test_proxy_flush[chunked-off-tiny] XFAIL (action-machine flush bug).
- pytest summary: 10 failed, 76 passed, 3 skipped, 1 xfailed in 201 s.
- Pre-existing failures (sse, upgrade, http2_proxy_flush[*],
  test_infinite_loop, test_h1/h2_roundtrip[131073|200000]) remain;
  none introduced by this commit.

Production bug shape: TTFB ~ upstream window because ctx->flush is
never promoted to ZSTD_flushStream when libzstd has not accumulated
enough to spill (small chunks, rc==0 after ZSTD_compressStream).
hsw added a commit to hsw/zstd-nginx-module that referenced this pull request May 18, 2026
Three bundled changes that together fix the production flush-promotion
bug and finish the libzstd modern-API migration:

1. Per-call-op pattern (ngx_brotli model). Replaces ctx->action
   (COMPRESS/FLUSH/END enum + redo flag) with per-call
   ZSTD_EndDirective derivation from ctx->last / ctx->flush set on
   entry from ctx->in_buf's last_buf / flush. Reference:
   tmp/src/ngx_brotli/filter/ngx_http_brotli_filter_module.c:478-524.
   Zero-progress guard added as defense in depth (brotli line 520-524).
   Empty non-control bufs are now dropped in add_data to mirror brotli.

2. Modernize ZSTD_CCtx setup. Drops the legacy ZSTD_initCStream /
   ZSTD_initCStream_usingCDict path under #if ZSTD_VERSION_NUMBER <
   10500. Now uniformly uses ZSTD_CCtx_reset(session_only) +
   ZSTD_CCtx_refCDict OR ZSTD_CCtx_setParameter(compressionLevel).
   Eliminates the -Wdeprecated-declarations warning for usingCDict on
   libzstd builds that mark it deprecated.

3. Compile-time guard: #error if libzstd < 1.4.0. Was de facto
   required by ZSTD_compressStream2 from the previous commit; now
   explicit. Static module unaffected (does not include zstd.h).

Fixes the production flush-promotion bug from
tokers#23 thread (mklooss 2025-03-20, Stensel8
2026-01-11, lowkeypriority 2026-02-02). Root cause: ctx->action only
promoted COMPRESS -> FLUSH when libzstd reported pending output (rc>0).
For small flush'd chunks where libzstd swallowed input with rc==0, no
ZSTD_flushStream() was issued -- bytes stayed buffered, workers
stalled until upstream closed.

Pre-fix on stable HEAD: test_proxy_flush[chunked-off|sse|upgrade] +
test_http2_proxy_flush[chunked-off|sse|upgrade] + test_infinite_loop
all timed out at 10 s ReadTimeoutError.

Post-fix on v2/compress-stream2: all eight previously failing cases
plus the new chunked-off-tiny tier-1 repro now PASS within the 400-
600 ms TTFB budgets. ubuntu-24.04: 87 passed, 3 skipped. ubuntu-22.04
(libzstd 1.4.8): 86 passed, 4 skipped. ubuntu-24.04-shared-only: 87
passed, 3 skipped. ASan/UBSan: no diagnostics. Valgrind: no leaks
beyond suppressions.

ctx->last / ctx->flush retained as field names (no rename) so a
partial revert of this commit leaves commit 2 compiling. Sticky-flag
semantics documented at the field site.

Closes the chunked-off-tiny xfail from the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants