From f30209a9a8dd676e83fabd03d03b22d79e759281 Mon Sep 17 00:00:00 2001
From: John McLear <john@mclear.co.uk>
Date: Mon, 25 May 2026 22:53:34 +0100
Subject: [PATCH] fix(test): enable HTTP keep-alive on the global agent for
 backend tests
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Windows backend-test silent-ELIFECYCLE flake (silent exit 255 in the
middle of the test phase, no JS-handler trace, no native abort report;
documented in PRs #7838 and #7842) correlates with localhost server-side
TIME_WAIT socket accumulation. The OS-level netstat sidecar added in
PR #7846 captured the smoking gun in run 26419467467:

  netstat poll      server-side TIME_WAIT on [::1]:50398
  ----------------  -------------------------------------
  21:02:18.337       56
  21:02:20.765      103
  21:02:23.204      136
  21:02:25.662      169
  21:02:28.146      201
  21:02:30.638      228   ← silent kill 37 ms later

That's an average of ~14 server-active-close TIME_WAITs per second
(172 new entries in 12.3 s), growing roughly linearly toward the kill.
All TIME_WAITs are on the Etherpad listening port, so the SERVER is the
active closer: supertest opens a fresh TCP connection per request and
the server sends FIN after the response.
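
The ~14/s figure can be re-derived from the first and last netstat polls
in the table above. A minimal sketch; the `samples` array and the
`timeWaitRate` helper are illustrative names, not part of this patch:

```typescript
// Samples copied from the netstat table: [seconds past 21:02:00, TIME_WAIT count].
const samples: Array<[number, number]> = [
  [18.337, 56],
  [20.765, 103],
  [23.204, 136],
  [25.662, 169],
  [28.146, 201],
  [30.638, 228],
];

// Average growth between the first and last poll, in sockets per second.
function timeWaitRate(points: Array<[number, number]>): number {
  const [t0, n0] = points[0];
  const [t1, n1] = points[points.length - 1];
  return (n1 - n0) / (t1 - t0);
}

console.log(timeWaitRate(samples).toFixed(1)); // 172 / 12.301 -> "14.0"
```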

The hard kernel TIME_WAIT cap (Windows default ~2000 free TCBs) was
not exceeded in this capture, but 228 half-dead handles force libuv's
IOCP socket-table scan to walk a much larger working set on every
completion. The kill cluster is concentrated on tests that perform
rapid sequential HTTP roundtrips (importexportGetPost.ts, pad.ts,
import.ts DOCX round-trips), which is exactly the pattern that grows
TIME_WAIT fastest.

Setting `http.globalAgent.keepAlive = true` in the backend-test common
bootstrap makes supertest's underlying http.request reuse a single
TCP connection for sequential requests to the same origin. Connection
reuse collapses the TIME_WAIT churn from ~14/s to nearly zero, because
requests no longer each leave a half-dead socket behind. Linux's TCP
recycling is fast enough that the same load shows no symptoms there, so
this keep-alive is a Windows-targeted mitigation that also reduces
socket churn (and therefore work) on Linux.
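
The connection-reuse effect can be observed in isolation with plain
node:http. The sketch below is not the test bootstrap itself: it uses a
dedicated `http.Agent` rather than the global agent, and
`countConnections` is a hypothetical helper. It counts how many TCP
connections a throwaway server accepts for a run of sequential requests;
with keep-alive the run is expected to share a connection, without it
each request should open its own.

```typescript
import * as http from "node:http";
import type { AddressInfo } from "node:net";

// Issue `n` sequential GETs through a dedicated agent and return how many
// TCP connections the server accepted.
async function countConnections(keepAlive: boolean, n: number): Promise<number> {
  let connections = 0;
  const server = http.createServer((_req, res) => res.end("ok"));
  server.on("connection", () => { connections += 1; });
  await new Promise<void>((resolve) => server.listen(0, "127.0.0.1", resolve));
  const { port } = server.address() as AddressInfo;

  const agent = new http.Agent({ keepAlive });
  for (let i = 0; i < n; i++) {
    await new Promise<void>((resolve, reject) => {
      http.get({ host: "127.0.0.1", port, agent }, (res) => {
        res.resume(); // drain the body so 'end' fires and the socket is released
        res.on("end", resolve);
      }).on("error", reject);
    });
  }

  agent.destroy(); // close any idle pooled sockets so the server can shut down
  await new Promise<void>((resolve) => server.close(() => resolve()));
  return connections;
}

countConnections(true, 5).then((c) => console.log("keepAlive on: ", c));
countConnections(false, 5).then((c) => console.log("keepAlive off:", c));
```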

This is a real behavior change scoped to the test process: tests share
a long-lived connection rather than opening fresh ones, so the shape of
any race that depended on per-request connect/disconnect cycles will
shift. None of the existing backend tests assert on that, but the
change is observable and is being landed deliberately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 src/tests/backend/common.ts | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/tests/backend/common.ts b/src/tests/backend/common.ts
index 5e5cb2c1936..4bf028009a4 100644
--- a/src/tests/backend/common.ts
+++ b/src/tests/backend/common.ts
@@ -16,8 +16,33 @@ import TestAgent from "supertest/lib/agent";
 import {Http2Server} from "node:http2";
 import {SignJWT} from "jose";
 import {privateKeyExported} from "../../node/security/OAuth2Provider";
+import * as http from 'node:http';
 const webaccess = require('../../node/hooks/express/webaccess');
 
+// Enable HTTP keep-alive on the global agent for the test process. Without
+// this, every supertest request opens a fresh TCP connection and the server
+// closes it after the response. The server side of each connection then
+// sits in TIME_WAIT for the default Windows TcpTimedWaitDelay (~120 s)
+// before the kernel releases it.
+//
+// The Windows backend-test job's OS-level netstat sidecar (PR #7846)
+// captured the smoking gun for the silent-ELIFECYCLE flake in run
+// 26419467467: localhost server-side TIME_WAIT counts on the Etherpad
+// listening port climbed linearly at ~14/s, reaching 228 active TIME_WAIT
+// entries on `[::1]:50398` 37 ms before the kill — all server-active-close
+// half-dead sockets, all from rapid sequential supertest requests with no
+// connection reuse. The kill cluster on Windows + Node 24 + plugins
+// correlates tightly with this TIME_WAIT accumulation: it gives libuv a
+// large pool of half-dead handles to walk on every IOCP completion.
+//
+// Setting keepAlive=true on http.globalAgent makes supertest's underlying
+// http.request reuse a single TCP connection for sequential requests to
+// the same origin, collapsing TIME_WAIT churn from ~14/s to nearly zero.
+// Linux does not show the flake: its TIME_WAIT recycling is much faster
+// and the kernel can sustain higher half-dead-socket counts without
+// symptoms.
+http.globalAgent.keepAlive = true;
+
 const backups:MapArrayType<any> = {};
 let agentPromise:Promise<any>|null = null;