SOLR-18174 AsyncTracker Semaphore leak reproduction#4236
janhoy wants to merge 9 commits into apache:main
Conversation
Add a metrics gauge for available semaphores
@dsmiley, @mlbiscoc, @kotman12, @HoustonPutman I'm not calling you out here to get a thorough review of this PR code or of every theory in the mostly LLM-generated analysis. But you are bright minds, and I fear that this issue is a series of lurking bugs that will crop up once heavier usage of newer Solr releases reaches prime time. Perhaps some of you will connect some dots when seeing some of the code paths being discussed here. @HoustonPutman, I ping you since you were bitten by the request cancellation/abort bug in 9.x, and I wonder if you could shed some light there. The actual bug fixed in this PR is quite serious, I believe: the `LBSolrClient`'s retry request is executed synchronously on the IO selector thread instead of in the background, thus enabling the deadlock when semaphore permits are depleted. My $10000 question is how those permits leak in the first place, so I added a metric gauge for it. I suspect there is some code path acquiring a permit that is never released, but so far I have some LLM theories and no evidence. I may build a custom Solr 9.10.2-SNAPSHOT with added logging and instrumentation around the `AsyncTracker` and deploy it to our test cluster hoping for a reproduction, although it took 14 days of run time before it manifested last time... Thankful for advice on strategies for catching the root cause. The tricky thing is that it may be related to complex service-mesh proxying and mass interruption of open connections.
Relates to https://issues.apache.org/jira/browse/SOLR-18174
Background
`HttpJettySolrClient` uses an internal `AsyncTracker` semaphore (default 1000 permits) to limit outstanding async HTTP requests. A permit is acquired in `onRequestQueued` and released in `onComplete`. Two separate bugs can cause permits to be consumed without being returned, eventually exhausting the semaphore. Once it reaches zero, every new distributed query blocks permanently on `semaphore.acquire()` — there is no timeout and no recovery short of a node restart.

Two-phase failure model
This PR aims to prove both phases through dedicated reproduction tests. So far we have only succeeded in proving the second phase, by simulating low available permits. What is shown as phase 1 in the diagram below is what we assume must be happening in real life, based on externally observed cluster behavior and clues in thread-dumps.
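The acquire/release pairing at the heart of all this can be sketched roughly as follows. This is an illustrative sketch only, not the actual `AsyncTracker` code; names and structure are assumptions:

```java
import java.util.concurrent.Semaphore;

// Rough, illustrative sketch of the AsyncTracker permit pattern described
// above -- not the actual Solr code. One permit per in-flight async request;
// acquire() has no timeout, so a depleted semaphore blocks callers forever.
public class AsyncTrackerSketch {
    static final int MAX_PERMITS = 1000; // Solr's default ceiling
    static final Semaphore available = new Semaphore(MAX_PERMITS);

    // Jetty onRequestQueued listener: take a permit (blocks if none left).
    static void onRequestQueued() throws InterruptedException {
        available.acquire();
    }

    // Jetty onComplete listener: return the permit. If this call is ever
    // skipped, or acquire runs twice for one request, a permit leaks.
    static void onComplete() {
        available.release();
    }

    public static void main(String[] args) throws Exception {
        onRequestQueued();
        System.out.println(available.availablePermits()); // 999
        onComplete();
        System.out.println(available.availablePermits()); // 1000
    }
}
```

Any code path where the two listeners do not fire exactly once each per request breaks this invariant, which is what both phases below exploit.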
Phase 1 — Gradual permit depletion (unproven hypothesis)
The thread dumps captured during the test-environment reproduction deadlock show only 13 threads blocked on `semaphore.acquire()` — far fewer than the 1000 needed to exhaust the semaphore through Phase 2 alone. This means the semaphore must already have been near zero before the final burst event, implying some background depletion process had been running for some time prior. We have no direct observation of this drain — no metrics were in place to track available permits over time — so this is an inference, not a measured fact.

We have no proof of what causes this slow depletion. Several hypotheses are on the table:
- **SOLR-17819 + SOLR-17792 interaction (9.9.0):** Analysis of git history between 9.7.0 and 9.10.1 identifies a potentially relevant combination. SOLR-17819 commented out `request.abort()` in the cancellation path, meaning cancelled `CompletableFuture`s no longer propagate to the underlying Jetty request — permits are only returned when the server responds or the idle timeout fires. SOLR-17792 changed `HttpShardHandler.take()` to actively call `cancelAll()` on shard errors (previously it just returned early). Together, a single shard error now ties up N−1 permits until idle timeouts fire. This is not a permanent leak per event — permits eventually return — but under frequent shard errors (which could happen in a k8s/linkerd environment) it could reduce effective headroom significantly. Whether our environments experience enough shard errors for this to matter is unknown. This analysis is detailed in the JIRA issue and this confluence page.
- **Jetty 10 idle-timeout race (HTTP/1.1):** code tracing through `branch_9_10` (Solr + Jetty 10) has identified a concrete sequence where one permit is permanently lost per occurrence. Two of the 13 stuck threads in the production dump show `HttpConnectionOverHTTP.onIdleExpired()` on the path into `LBAsyncSolrClient.doAsyncRequest()`, which is consistent with this mechanism. The `PERMIT_ACQUIRED_ATTR` idempotency guard in this PR is specifically designed to prevent this double-acquire by making `queuedListener` a no-op if a permit has already been acquired for the same request object. This race can fire silently under normal operation whenever a connection's idle timeout expires at the moment an exchange is being dispatched to it, independent of application traffic volume.
- **Jetty HTTP/2 GOAWAY race:** when a request is dispatched to a connection already marked closed, `HttpConnectionOverHTTP2.send()` may return `SendFailure(ClosedChannelException, retry=true)`, triggering the same double `onRequestQueued` pattern described above. This is the scenario the test name references, but since we force HTTP/1.1 it is unlikely to be the trigger in our cluster. The `PERMIT_ACQUIRED_ATTR` guard would prevent this pattern too, as it is structurally identical.
- **Unknown cause:** there may be other conditions in the Jetty client or Solr's use of it that produce unmatched acquires in the specific environment (Kubernetes, Linkerd, HTTP/1.1).
The test in this PR (`testPermitLeakOnHttp2GoAwayDoubleQueuedListener`) simulates the double `onRequestQueued` pattern directly via reflection — the same pattern identified in the idle-timeout race above — and asserts that the permit count is unchanged after a single `onComplete`. It fails without the `PERMIT_ACQUIRED_ATTR` idempotency guard. The guard and its TODO are included alongside the Phase 2 fix to invite review and confirm whether this path is reachable in practice.

Regardless of the true cause, the implication is the same: something reduces available permits over time until the semaphore is nearly exhausted, at which point Phase 2 delivers the final blow.
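The idempotency-guard idea can be sketched as below. This is a hypothetical model, not the PR's actual code: the attribute name is illustrative, and request attributes are modeled with a plain map rather than the Jetty request object:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the PERMIT_ACQUIRED_ATTR guard described above:
// the queuedListener acquires a permit only the first time it fires for a
// given request, so a double onRequestQueued (idle-timeout or GOAWAY retry
// re-dispatch) cannot consume two permits for one logical request.
public class PermitGuardSketch {
    static final String PERMIT_ACQUIRED_ATTR = "permitAcquired"; // illustrative name
    static final Semaphore permits = new Semaphore(1000);

    static void queuedListener(Map<String, Object> requestAttributes) throws InterruptedException {
        // No-op if this request already holds a permit (the guard).
        if (requestAttributes.putIfAbsent(PERMIT_ACQUIRED_ATTR, Boolean.TRUE) != null) {
            return;
        }
        permits.acquire();
    }

    static void completeListener(Map<String, Object> requestAttributes) {
        // Release only if this request actually holds a permit.
        if (requestAttributes.remove(PERMIT_ACQUIRED_ATTR) != null) {
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> attrs = new ConcurrentHashMap<>();
        queuedListener(attrs);
        queuedListener(attrs); // double-queue: guard makes this a no-op
        completeListener(attrs);
        System.out.println(permits.availablePermits()); // back to 1000, no leak
    }
}
```

Without the guard, the second `queuedListener` call would take a second permit that the single `completeListener` can never return.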
Phase 2 — Sudden full exhaustion (confirmed: re-entrant retry on the IO selector thread)
Test: `testSemaphoreLeakOnLBRetry`

When a connection-level failure occurs, Jetty fires `onFailure` on the IO selector thread. `HttpJettySolrClient.requestAsync` completes its `CompletableFuture` exceptionally from within that callback — still on the IO thread. `LBAsyncSolrClient.doAsyncRequest` registered a `whenComplete` on that future; because the future completes on the IO thread, `whenComplete` also fires synchronously on the same IO thread. The `whenComplete` action calls `doAsyncRequest` again as a retry to the next endpoint, which calls `semaphore.acquire()` — still on the IO thread — before the original request's `completeListener` has had a chance to call `semaphore.release()`.

If the semaphore is already at zero (from Phase 1 depletion), `acquire()` blocks the IO thread. The blocked IO thread can no longer execute the `completeListener` that would release the original permit. The permit is permanently stuck, and the IO thread is permanently blocked. With N concurrent failures, N permits are lost in a single burst and N IO threads deadlock simultaneously.

The test reproduces this with `MAX_PERMITS=40`: a fake TCP server accepts 40 connections (exhausting the semaphore), then RST-closes them all simultaneously. After 2 seconds, the test asserts that all 40 permits have been restored and all futures have completed. It fails with the current code. A companion test (`testNoPermitLeakOnLBRetryWithDefaultPermits`) confirms that with permits well above zero the IO thread never blocks and all permits are correctly returned.

Fix: complete the future off the IO thread by dispatching `completeExceptionally` to an executor, giving `onComplete` the opportunity to release the original permit before any `whenComplete` retry acquires a new one.

Also included
- `ASYNC_REQUESTS_MAX_SYSPROP` (`solr.http.client.async_requests.max`) makes the permit ceiling configurable. It is used by the test, but does it not also make sense to have this configurable in production?
- A metric (`solr.http.client.async_requests`) exposes `state=max` and `state=available` permit counts via `HttpShardHandlerFactory`, enabling monitoring of available permits. If we had this metric already, it would tell us a lot about when and why the failure happens.
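The shape of the Phase 2 fix can be sketched as follows. This is an illustration of the executor-dispatch idea, not the actual patch; the class and executor names are assumptions. It also demonstrates the underlying hazard: `whenComplete` callbacks run synchronously on whichever thread completes the future:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of the Phase 2 fix, not the actual patch. Completing
// the future directly inside Jetty's onFailure callback would run dependent
// whenComplete stages (the LB retry, including its semaphore.acquire()) on
// the IO selector thread. Dispatching the completion to an executor frees
// the IO thread to run the completeListener that releases the original
// permit before the retry acquires a new one.
public class OffIoThreadCompletion {
    static final ExecutorService completionExecutor = Executors.newCachedThreadPool();

    // Hypothetical failure path: hand the completion to a pool thread
    // instead of calling future.completeExceptionally(failure) in place.
    static void onFailure(CompletableFuture<?> future, Throwable failure) {
        completionExecutor.execute(() -> future.completeExceptionally(failure));
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> future = new CompletableFuture<>();
        // Stand-in for the LB retry: runs on whatever thread completes the future.
        future.whenComplete((v, t) ->
            System.out.println("retry logic runs on: " + Thread.currentThread().getName()));
        // Simulate Jetty's IO selector thread reporting a connection failure.
        Thread ioThread = new Thread(
            () -> onFailure(future, new RuntimeException("connection reset")),
            "fake-io-selector");
        ioThread.start();
        ioThread.join();
        try { future.join(); } catch (Exception expected) { /* failure propagated */ }
        completionExecutor.shutdown();
        // The printed thread name is a pool thread, not fake-io-selector.
    }
}
```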
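What the gauge reads can be sketched minimally like this. The actual PR wires the values through `HttpShardHandlerFactory` and Solr's metrics registry; the supplier form here is just to show that the reads are cheap and safe to poll:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative sketch of the permit gauge, not the actual PR wiring.
// Semaphore.availablePermits() is a non-blocking estimate, so a metrics
// reporter can poll it on every scrape without contending with requests.
public class PermitGaugeSketch {
    final int maxPermits;
    final Semaphore permits;

    PermitGaugeSketch(int maxPermits, Semaphore permits) {
        this.maxPermits = maxPermits;
        this.permits = permits;
    }

    Supplier<Integer> maxGauge() {        // reported as state=max
        return () -> maxPermits;
    }

    Supplier<Integer> availableGauge() {  // reported as state=available
        return permits::availablePermits;
    }

    public static void main(String[] args) throws Exception {
        Semaphore s = new Semaphore(1000);
        PermitGaugeSketch g = new PermitGaugeSketch(1000, s);
        s.acquire(3); // simulate three in-flight requests
        System.out.println("max=" + g.maxGauge().get()
            + " available=" + g.availableGauge().get()); // max=1000 available=997
    }
}
```

A steadily widening gap between the two series over days of runtime would be direct evidence of the Phase 1 drain hypothesized above.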