SOLR-18174 AsyncTracker Semaphore leak reproduction#4236
janhoy wants to merge 9 commits into apache:main
Conversation
Add a metrics gauge for available semaphores
@dsmiley, @mlbiscoc, @kotman12, @HoustonPutman I'm not calling you out here to get a thorough review of this PR code or of every theory in the mostly LLM-generated analysis. But you are bright minds, and I fear that this issue is a series of lurking bugs that will crop up once heavier usage of newer Solr releases reaches prime time. Perhaps some of you will connect some dots when seeing some of the code paths being discussed here. @HoustonPutman, I ping you since you were bitten by the request cancellation/abort bug in 9.x, and I wonder if you could shed some light there. The actual bug fixed in this PR is quite serious, I believe: the `LBSolrClient`'s retry request is executed synchronously on the IO selector thread instead of in the background, thus enabling the deadlock when semaphore permits are depleted. My $10000 question is how those permits leak in the first place, so I added a metric gauge for it. I suspect there is some code path acquiring a permit that is never released, but so far I have some LLM theories and no evidence. I may build a custom Solr 9.10.2-SNAPSHOT with added logging and instrumentation around the `AsyncTracker` and deploy it to our test cluster hoping for a reproduction, although it took 14 days of run time before it manifested last time... Thankful for advice on strategies for catching the root cause. The tricky thing is that it may be related to complex service-mesh proxying and mass interruption of open connections.
Relates to https://issues.apache.org/jira/browse/SOLR-18174
Background
`HttpJettySolrClient` uses an internal `AsyncTracker` semaphore (default 1000 permits) to limit outstanding async HTTP requests. A permit is acquired in `onRequestQueued` and released in `onComplete`. Two separate bugs can cause permits to be consumed without being returned, eventually exhausting the semaphore. Once it reaches zero, every new distributed query blocks permanently on `semaphore.acquire()` — there is no timeout and no recovery short of a node restart.

Two-phase failure model
This PR aims to prove both phases through dedicated reproduction tests. So far we have only succeeded in proving the second phase, by simulating low available permits. What is shown as phase 1 in the diagram below is what we assume must be happening in real life, based on externally observed cluster behavior and clues in thread-dumps.
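The acquire/release pairing at the heart of all this can be sketched roughly as follows. This is an illustrative sketch only, not the actual `AsyncTracker` code; names and structure are assumptions:

```java
import java.util.concurrent.Semaphore;

// Rough, illustrative sketch of the AsyncTracker permit pattern described
// above -- not the actual Solr code. One permit per in-flight async request;
// acquire() has no timeout, so a depleted semaphore blocks callers forever.
public class AsyncTrackerSketch {
    static final int MAX_PERMITS = 1000; // Solr's default ceiling
    static final Semaphore available = new Semaphore(MAX_PERMITS);

    // Jetty onRequestQueued listener: take a permit (blocks if none left).
    static void onRequestQueued() throws InterruptedException {
        available.acquire();
    }

    // Jetty onComplete listener: return the permit. If this call is ever
    // skipped, or acquire runs twice for one request, a permit leaks.
    static void onComplete() {
        available.release();
    }

    public static void main(String[] args) throws Exception {
        onRequestQueued();
        System.out.println(available.availablePermits()); // 999
        onComplete();
        System.out.println(available.availablePermits()); // 1000
    }
}
```

Any code path where the two listeners do not fire exactly once each per request breaks this invariant, which is what both phases below exploit.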
Phase 1 — Gradual permit depletion (unproven hypothesis)
The thread dumps captured during the test-environment reproduction deadlock show only 13 threads blocked on `semaphore.acquire()` — far fewer than the 1000 needed to exhaust the semaphore through Phase 2 alone. This means the semaphore must already have been near zero before the final burst event, implying some background depletion process had been running for some time prior. We have no direct observation of this drain — no metrics were in place to track available permits over time — so this is an inference, not a measured fact.

We have no proof of what causes this slow depletion. Several hypotheses are on the table:
- **SOLR-17819 + SOLR-17792 interaction (9.9.0):** Analysis of git history between 9.7.0 and 9.10.1 identifies a potentially relevant combination. SOLR-17819 commented out `request.abort()` in the cancellation path, meaning cancelled `CompletableFuture`s no longer propagate to the underlying Jetty request — permits are only returned when the server responds or the idle timeout fires. SOLR-17792 changed `HttpShardHandler.take()` to actively call `cancelAll()` on shard errors (previously it just returned early). Together, a single shard error now ties up N−1 permits until idle timeouts fire. This is not a permanent leak per event — permits eventually return — but under frequent shard errors (which could happen in a k8s/linkerd environment) it could reduce effective headroom significantly. Whether our environments experience enough shard errors for this to matter is unknown. This analysis is detailed in the JIRA issue and this confluence page.
- **Jetty 10 idle-timeout race (HTTP/1.1):** code tracing through `branch_9_10` (Solr + Jetty 10) has identified a concrete sequence where one permit is permanently lost per occurrence. Two of the 13 stuck threads in the production dump show `HttpConnectionOverHTTP.onIdleExpired()` on the path into `LBAsyncSolrClient.doAsyncRequest()`, which is consistent with this mechanism. The `PERMIT_ACQUIRED_ATTR` idempotency guard in this PR is specifically designed to prevent this double-acquire by making `queuedListener` a no-op if a permit has already been acquired for the same request object. This race can fire silently under normal operation whenever a connection's idle timeout expires at the moment an exchange is being dispatched to it, independent of application traffic volume.
- **Jetty HTTP/2 GOAWAY race:** when a request is dispatched to a connection already marked closed, `HttpConnectionOverHTTP2.send()` may return `SendFailure(ClosedChannelException, retry=true)`, triggering the same double `onRequestQueued` pattern described above. This is the scenario the test name references, but since we force HTTP/1.1 it is unlikely to be the trigger in our cluster. The `PERMIT_ACQUIRED_ATTR` guard would prevent this pattern too, as it is structurally identical.
- **Unknown cause:** there may be other conditions in the Jetty client or Solr's use of it that produce unmatched acquires in the specific environment (Kubernetes, Linkerd, HTTP/1.1).
The test in this PR (`testPermitLeakOnHttp2GoAwayDoubleQueuedListener`) simulates the double `onRequestQueued` pattern directly via reflection — the same pattern identified in the idle-timeout race above — and asserts that the permit count is unchanged after a single `onComplete`. It fails without the `PERMIT_ACQUIRED_ATTR` idempotency guard. The guard and its TODO are included alongside the Phase 2 fix to invite review and confirm whether this path is reachable in practice.

Regardless of the true cause, the implication is the same: something reduces available permits over time until the semaphore is nearly exhausted, at which point Phase 2 delivers the final blow.
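The idempotency-guard idea can be sketched as below. This is a hypothetical model, not the PR's actual code: the attribute name is illustrative, and request attributes are modeled with a plain map rather than the Jetty request object:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the PERMIT_ACQUIRED_ATTR guard described above:
// the queuedListener acquires a permit only the first time it fires for a
// given request, so a double onRequestQueued (idle-timeout or GOAWAY retry
// re-dispatch) cannot consume two permits for one logical request.
public class PermitGuardSketch {
    static final String PERMIT_ACQUIRED_ATTR = "permitAcquired"; // illustrative name
    static final Semaphore permits = new Semaphore(1000);

    static void queuedListener(Map<String, Object> requestAttributes) throws InterruptedException {
        // No-op if this request already holds a permit (the guard).
        if (requestAttributes.putIfAbsent(PERMIT_ACQUIRED_ATTR, Boolean.TRUE) != null) {
            return;
        }
        permits.acquire();
    }

    static void completeListener(Map<String, Object> requestAttributes) {
        // Release only if this request actually holds a permit.
        if (requestAttributes.remove(PERMIT_ACQUIRED_ATTR) != null) {
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> attrs = new ConcurrentHashMap<>();
        queuedListener(attrs);
        queuedListener(attrs); // double-queue: guard makes this a no-op
        completeListener(attrs);
        System.out.println(permits.availablePermits()); // back to 1000, no leak
    }
}
```

Without the guard, the second `queuedListener` call would take a second permit that the single `completeListener` can never return.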
Phase 2 — Sudden full exhaustion (confirmed: re-entrant retry on the IO selector thread)
Test: `testSemaphoreLeakOnLBRetry`

When a connection-level failure occurs, Jetty fires `onFailure` on the IO selector thread. `HttpJettySolrClient.requestAsync` completes its `CompletableFuture` exceptionally from within that callback — still on the IO thread. `LBAsyncSolrClient.doAsyncRequest` registered a `whenComplete` on that future; because the future completes on the IO thread, `whenComplete` also fires synchronously on the same IO thread. The `whenComplete` action calls `doAsyncRequest` again as a retry to the next endpoint, which calls `semaphore.acquire()` — still on the IO thread — before the original request's `completeListener` has had a chance to call `semaphore.release()`.

If the semaphore is already at zero (from Phase 1 depletion), `acquire()` blocks the IO thread. The blocked IO thread can no longer execute the `completeListener` that would release the original permit. The permit is permanently stuck, and the IO thread is permanently blocked. With N concurrent failures, N permits are lost in a single burst and N IO threads deadlock simultaneously.

The test reproduces this with `MAX_PERMITS=40`: a fake TCP server accepts 40 connections (exhausting the semaphore), then RST-closes them all simultaneously. After 2 seconds, the test asserts that all 40 permits have been restored and all futures have completed. It fails with the current code. A companion test (`testNoPermitLeakOnLBRetryWithDefaultPermits`) confirms that with permits well above zero the IO thread never blocks and all permits are correctly returned.

Fix: complete the future off the IO thread by dispatching `completeExceptionally` to an executor, giving `onComplete` the opportunity to release the original permit before any `whenComplete` retry acquires a new one.

Also included
- `ASYNC_REQUESTS_MAX_SYSPROP` (`solr.http.client.async_requests.max`) makes the permit ceiling configurable. It is used by the test, but does it not also make sense to have this configurable in production?
- A metric (`solr.http.client.async_requests`) exposes `state=max` and `state=available` permit counts via `HttpShardHandlerFactory`, enabling monitoring of available permits. If we had this metric already, it would tell us a lot about when and why the failure happens.
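The shape of the Phase 2 fix can be sketched as follows. This is an illustration of the executor-dispatch idea, not the actual patch; the class and executor names are assumptions. It also demonstrates the underlying hazard: `whenComplete` callbacks run synchronously on whichever thread completes the future:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of the Phase 2 fix, not the actual patch. Completing
// the future directly inside Jetty's onFailure callback would run dependent
// whenComplete stages (the LB retry, including its semaphore.acquire()) on
// the IO selector thread. Dispatching the completion to an executor frees
// the IO thread to run the completeListener that releases the original
// permit before the retry acquires a new one.
public class OffIoThreadCompletion {
    static final ExecutorService completionExecutor = Executors.newCachedThreadPool();

    // Hypothetical failure path: hand the completion to a pool thread
    // instead of calling future.completeExceptionally(failure) in place.
    static void onFailure(CompletableFuture<?> future, Throwable failure) {
        completionExecutor.execute(() -> future.completeExceptionally(failure));
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> future = new CompletableFuture<>();
        // Stand-in for the LB retry: runs on whatever thread completes the future.
        future.whenComplete((v, t) ->
            System.out.println("retry logic runs on: " + Thread.currentThread().getName()));
        // Simulate Jetty's IO selector thread reporting a connection failure.
        Thread ioThread = new Thread(
            () -> onFailure(future, new RuntimeException("connection reset")),
            "fake-io-selector");
        ioThread.start();
        ioThread.join();
        try { future.join(); } catch (Exception expected) { /* failure propagated */ }
        completionExecutor.shutdown();
        // The printed thread name is a pool thread, not fake-io-selector.
    }
}
```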
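What the gauge reads can be sketched minimally like this. The actual PR wires the values through `HttpShardHandlerFactory` and Solr's metrics registry; the supplier form here is just to show that the reads are cheap and safe to poll:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative sketch of the permit gauge, not the actual PR wiring.
// Semaphore.availablePermits() is a non-blocking estimate, so a metrics
// reporter can poll it on every scrape without contending with requests.
public class PermitGaugeSketch {
    final int maxPermits;
    final Semaphore permits;

    PermitGaugeSketch(int maxPermits, Semaphore permits) {
        this.maxPermits = maxPermits;
        this.permits = permits;
    }

    Supplier<Integer> maxGauge() {        // reported as state=max
        return () -> maxPermits;
    }

    Supplier<Integer> availableGauge() {  // reported as state=available
        return permits::availablePermits;
    }

    public static void main(String[] args) throws Exception {
        Semaphore s = new Semaphore(1000);
        PermitGaugeSketch g = new PermitGaugeSketch(1000, s);
        s.acquire(3); // simulate three in-flight requests
        System.out.println("max=" + g.maxGauge().get()
            + " available=" + g.availableGauge().get()); // max=1000 available=997
    }
}
```

A steadily widening gap between the two series over days of runtime would be direct evidence of the Phase 1 drain hypothesized above.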