From f30209a9a8dd676e83fabd03d03b22d79e759281 Mon Sep 17 00:00:00 2001
From: John McLear
Date: Mon, 25 May 2026 22:53:34 +0100
Subject: [PATCH] fix(test): enable HTTP keep-alive on the global agent for backend tests
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Windows backend-test silent-ELIFECYCLE flake (silent exit 255 mid
test phase, no JS-handler trace, no native abort report; documented by
PRs #7838, #7842) correlates with localhost server-side TIME_WAIT socket
accumulation. The OS-level netstat sidecar added in PR #7846 captured
the smoking gun in run 26419467467:

  netstat poll       server-side TIME_WAIT on [::1]:50398
  ----------------   -------------------------------------
  21:02:18.337       56
  21:02:20.765       103
  21:02:23.204       136
  21:02:25.662       169
  21:02:28.146       201
  21:02:30.638       228   ← silent kill 37 ms later

That's ~14 server-active-close TIME_WAITs per second growing linearly
toward the kill. All TIME_WAITs are on the Etherpad listening port: the
SERVER is the active closer, because supertest opens a fresh TCP
connection per request and the server sends FIN after the response.

The hard kernel TIME_WAIT cap (Windows default ~2000 free TCBs) was not
exceeded in this capture, but 228 half-dead handles force libuv's IOCP
socket-table scan to walk a much larger working set on every completion.
The kill cluster is concentrated on tests that perform rapid sequential
HTTP roundtrips (importexportGetPost.ts, pad.ts, import.ts DOCX
round-trips), exactly the pattern that grows TIME_WAIT fastest.

Setting `http.globalAgent.keepAlive = true` in the backend-test common
bootstrap makes supertest's underlying http.request reuse a single TCP
connection for sequential requests to the same origin. Connection reuse
collapses the TIME_WAIT churn from ~14/s to nearly zero: each test no
longer leaves a half-dead socket behind.

Linux's TCP recycling is fast enough that the same load does not show
symptoms there, so this keep-alive is a Windows-targeted mitigation
that's also a strict improvement on Linux (less socket churn = less work
overall).

This is a real behavior change scoped to the test process (tests share a
long-lived connection rather than opening fresh ones), so the shape of
any race that depended on per-request connect/disconnect cycles will
shift. None of the existing backend tests assert on that, but the change
is observable and is being landed deliberately.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 src/tests/backend/common.ts | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/tests/backend/common.ts b/src/tests/backend/common.ts
index 5e5cb2c1936..4bf028009a4 100644
--- a/src/tests/backend/common.ts
+++ b/src/tests/backend/common.ts
@@ -16,8 +16,33 @@ import TestAgent from "supertest/lib/agent";
 import {Http2Server} from "node:http2";
 import {SignJWT} from "jose";
 import {privateKeyExported} from "../../node/security/OAuth2Provider";
+import * as http from 'node:http';
 const webaccess = require('../../node/hooks/express/webaccess');
 
+// Enable HTTP keep-alive on the global agent for the test process. Without
+// this, every supertest request opens a fresh TCP connection and the server
+// closes it on response; the server side then enters TIME_WAIT for the
+// default Windows TcpTimedWaitDelay (~120 s) before the ephemeral port is
+// freed.
+//
+// The Windows backend-test job's OS-level netstat sidecar (PR #7846)
+// captured the smoking gun for the silent-ELIFECYCLE flake in run
+// 26419467467: localhost server-side TIME_WAIT counts on the Etherpad
+// listening port climbed linearly at ~14/s, reaching 228 active TIME_WAIT
+// entries on `[::1]:50398` 37 ms before the kill: all server-active-close
+// half-dead sockets, all from rapid sequential supertest requests with no
+// connection reuse. The kill cluster on Windows + Node 24 + plugins
+// correlates tightly with this TIME_WAIT accumulation: it gives libuv a
+// large pool of half-dead handles to walk on every IOCP completion.
+//
+// Setting keepAlive=true on http.globalAgent makes supertest's underlying
+// http.request reuse a single TCP connection for sequential requests to
+// the same origin, collapsing TIME_WAIT churn from ~14/s to nearly zero.
+// Linux is unaffected; the flake was Windows-only because Linux's
+// TIME_WAIT recycling is much faster and the kernel can sustain higher
+// half-dead-socket counts without symptom.
+http.globalAgent.keepAlive = true;
+
 const backups:MapArrayType = {};
 let agentPromise:Promise|null = null;
 
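
To illustrate the connection reuse this patch relies on, here is a minimal sketch that counts how many TCP connections a local server accepts for a run of sequential requests, with and without a keep-alive agent. It is not part of the patch and makes several assumptions: it uses a dedicated `http.Agent` rather than mutating `http.globalAgent`, it talks to a throwaway server on 127.0.0.1 rather than Etherpad or supertest, and the helper name `countConnections` is hypothetical.

```typescript
import * as http from 'node:http';
import type {AddressInfo} from 'node:net';

// Hypothetical helper: send `requests` sequential GETs through `agent` to a
// throwaway local server and resolve with the number of TCP connections the
// server accepted.
const countConnections = (agent: http.Agent, requests: number): Promise<number> =>
  new Promise((resolve, reject) => {
    let connections = 0;
    let sent = 0;
    const server = http.createServer((_req, res) => res.end('ok'));
    // 'connection' fires once per accepted TCP connection, not per request.
    server.on('connection', () => { connections++; });
    server.on('error', reject);

    server.listen(0, '127.0.0.1', () => {
      const {port} = server.address() as AddressInfo;
      const next = (): void => {
        if (sent === requests) {
          agent.destroy(); // close any pooled sockets so the server can shut down
          server.close(() => resolve(connections));
          return;
        }
        sent++;
        http.get({host: '127.0.0.1', port, agent}, (res) => {
          res.resume(); // drain the body so 'end' fires
          // Defer the next request: the agent only returns the socket to its
          // pool after the response's 'end' handlers have run.
          res.on('end', () => setImmediate(next));
        }).on('error', reject);
      };
      next();
    });
  });

// Usage: compare a keep-alive agent with a non-keep-alive one.
const demo = async (): Promise<void> => {
  const reused = await countConnections(new http.Agent({keepAlive: true}), 5);
  const fresh = await countConnections(new http.Agent({keepAlive: false}), 5);
  console.log({reused, fresh});
};
demo().catch((err) => console.error(err));
```

Under these assumptions the keep-alive agent should need far fewer connections than the non-keep-alive one for the same run. Each connection the server closes after a response is a candidate for the server-side TIME_WAIT entries shown in the netstat capture.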