
[Cosmos] Replace per-client schedulers with shared CosmosSchedulers to fix thread scaling #49062

Merged
xinlian12 merged 2 commits into Azure:main from xinlian12:fix/shared-schedulers-thread-scaling
May 7, 2026

Conversation


@xinlian12 xinlian12 commented May 5, 2026

Problem

PR testing revealed that global-ep-mgr and partition-availability-staleness-check thread counts increase linearly with tenant/client count because both GlobalEndpointManager and GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker create per-instance Schedulers.newSingle() schedulers.

With N clients -> N dedicated threads for each component -> 2N extra threads just for background location refresh and circuit breaker staleness checks.
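For illustration, a minimal sketch of the per-instance pattern in question (simplified, not the SDK's actual code): each manager instance creates its own single-threaded scheduler, so every client pins a dedicated thread.

```java
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

// Simplified illustration of the old per-instance pattern: every client builds its own
// manager instance, and every instance spawns its own dedicated scheduler thread.
class PerInstanceSchedulerSketch {
    private final Scheduler refreshScheduler =
        Schedulers.newSingle("global-ep-mgr", true); // one daemon thread per instance
}
```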

Solution

Replace per-instance Schedulers.newSingle() with shared static BoundedElastic schedulers in CosmosSchedulers, following the existing pattern used for COSMOS_PARALLEL, TRANSPORT_RESPONSE_BOUNDED_ELASTIC, etc.
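A minimal sketch of what the shared schedulers could look like (the scheduler names follow this PR, but the thread-cap, queue-size, and TTL values are illustrative assumptions rather than the exact implementation):

```java
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public final class CosmosSchedulersSketch {
    private static final int TTL_SECONDS = 60; // idle threads are reclaimed after 60s

    // Shared by all GlobalEndpointManager instances in the process.
    public static final Scheduler GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC =
        Schedulers.newBoundedElastic(
            Schedulers.DEFAULT_BOUNDED_ELASTIC_SIZE,      // thread cap shared across clients
            Schedulers.DEFAULT_BOUNDED_ELASTIC_QUEUESIZE, // queued-task cap
            "cosmos-global-endpoint-manager-bounded-elastic",
            TTL_SECONDS,
            true);                                        // daemon threads

    // Shared by all per-partition circuit breaker staleness checks.
    public static final Scheduler PARTITION_AVAILABILITY_CHECK_BOUNDED_ELASTIC =
        Schedulers.newBoundedElastic(
            Schedulers.DEFAULT_BOUNDED_ELASTIC_SIZE,
            Schedulers.DEFAULT_BOUNDED_ELASTIC_QUEUESIZE,
            "cosmos-partition-availability-check-bounded-elastic",
            TTL_SECONDS,
            true);

    private CosmosSchedulersSketch() { }
}
```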

Changes

CosmosSchedulers.java

  • Added GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC shared scheduler
  • Added PARTITION_AVAILABILITY_CHECK_BOUNDED_ELASTIC shared scheduler

GlobalEndpointManager.java

  • Replaced per-instance Schedulers.newSingle(CosmosDaemonThreadFactory) with CosmosSchedulers.GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC
  • Track background refresh Disposable via AtomicReference with getAndSet() to atomically clean up old subscriptions on concurrent calls
  • close() cancels the tracked subscription instead of disposing the scheduler

GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java

  • Replaced per-instance Schedulers.newSingle("partition-availability-staleness-check") with CosmosSchedulers.PARTITION_AVAILABILITY_CHECK_BOUNDED_ELASTIC
  • Track recovery Disposable via AtomicReference for consistent cleanup on close()

Design Decisions

  • BoundedElastic over Single -- supports concurrent background tasks from multiple clients; threads auto-reclaim with 60s TTL
  • Disposable tracking -- shared schedulers cannot be disposed per-client, so each client's background subscription is tracked and cancelled individually in close() (see the sketch after this list)
  • AtomicReference.getAndSet() -- prevents Disposable leaks when startRefreshLocationTimerAsync() is called concurrently
  • Existing isClosed guards in both classes provide additional protection against post-close work
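A minimal sketch of the Disposable-tracking approach described above, assuming the shared scheduler from the earlier sketch; the field and method bodies here are illustrative, not the SDK's exact code:

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicReference;

import reactor.core.Disposable;
import reactor.core.Disposables;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

class BackgroundRefreshSketch {
    // Currently active background subscription for this client instance.
    private final AtomicReference<Disposable> refreshDisposable =
        new AtomicReference<>(Disposables.disposed());
    private volatile boolean isClosed = false;

    void startRefreshLocationTimerAsync() {
        Disposable newSubscription = Flux
            .interval(Duration.ofMinutes(5),
                CosmosSchedulersSketch.GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC)
            .takeWhile(tick -> !isClosed)              // isClosed guard stops post-close work
            .concatMap(tick -> refreshLocationAsync()) // placeholder refresh step
            .subscribe();

        // getAndSet() atomically swaps in the new subscription and hands back the old one,
        // so concurrent callers cannot leak a previously scheduled refresh.
        Disposable previous = refreshDisposable.getAndSet(newSubscription);
        if (previous != null && !previous.isDisposed()) {
            previous.dispose();
        }
    }

    void close() {
        isClosed = true;
        // Cancel only this client's subscription; the shared scheduler itself is never disposed.
        refreshDisposable.getAndSet(Disposables.disposed()).dispose();
    }

    private Mono<Void> refreshLocationAsync() {
        return Mono.empty(); // stand-in for the actual location refresh call
    }
}
```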

Benchmark Results: Thread Scaling Fix Validation

Config: Gateway mode, ReadThroughput, concurrency=64, 10min per run, accounts lx1-lx28 (cycling modulo 28), host-pinned
Branches: upstream/main vs xinlian12/fix/shared-schedulers-thread-scaling

1. Throughput: main vs fix (H1 ReadThroughput, steady-state)

| Tenants | main (ops/s) | fix (ops/s) | Delta |
| ------- | ------------ | ----------- | ----- |
| 4 | 33,895 | 33,762 | -0% |
| 16 | 30,779 | 30,398 | -1% |
| 64 | 29,212 | 29,279 | +0% |
| 256 | 28,427 | 27,798 | -2% |
| 896 | FAIL | 27,009 | fix succeeds |

H2:

| Tenants | main (ops/s) | fix (ops/s) | Delta |
| ------- | ------------ | ----------- | ----- |
| 4 | 31,199 | 30,959 | -1% |
| 16 | 28,323 | 28,305 | -0% |
| 64 | 27,175 | 26,757 | -2% |
| 256 | 25,261 | 25,791 | +2% |
| 896 | 18,855 | 20,766 | +10% |

2. Thread Scaling (PEAK, H1 ReadThroughput)

| Tenants | main (threads) | fix (threads) | Reduction |
| ------- | -------------- | ------------- | --------- |
| 4 | 218 | 223 | -2% |
| 16 | 255 | 267 | -5% |
| 64 | 354 | 364 | -3% |
| 256 | 748 | 552 | 26% |
| 896 | 1,748 | 544 | 69% |

3. Thread Pool Breakdown (PEAK, H1 ReadThroughput)

| Branch | Tenants | part-avail | global-ep | glob-ep-bounded | bench-disp | transport | bulk-exec | reactor-ep | TOTAL |
| ------ | ------- | ---------- | --------- | --------------- | ---------- | --------- | --------- | ---------- | ----- |
| main | 4 | 4 | 4 | 0 | 68 | 48 | 38 | 16 | 232 |
| fix | 4 | 4 | 0 | 5 | 67 | 47 | 41 | 16 | 236 |
| main | 16 | 16 | 16 | 0 | 67 | 43 | 39 | 16 | 255 |
| fix | 16 | 16 | 0 | 30 | 67 | 52 | 39 | 16 | 267 |
| main | 64 | 64 | 64 | 0 | 67 | 45 | 40 | 16 | 354 |
| fix | 64 | 64 | 0 | 65 | 67 | 49 | 43 | 16 | 364 |
| main | 256 | 256 | 256 | 0 | 67 | 54 | 42 | 16 | 749 |
| fix | 256 | 160 | 0 | 160 | 67 | 48 | 42 | 16 | 553 |
| main | 896 | 814 | 815 | 0 | 0 | 15 | 42 | 16 | 1,760 |
| fix | 896 | 160 | 0 | 160 | 67 | 43 | 42 | 16 | 548 |

4. Key Findings

  • global-ep-mgr eliminated: 0 per-client threads across all tenant counts (was 1:1 on main)
  • partition-avail capped at ~160: shared BoundedElastic pool reuses threads with 60s TTL (was 1:1 on main)
  • global-endpoint-manager-bounded-elastic capped at ~160: replacement shared pool
  • Thread count flat at 256-896t: fix holds at ~548 threads regardless of tenant count (main grows from 748 to 1,760)
  • No throughput regression at 4-256t: -2% to +2% (within noise)
  • 896t-H1 now works: main failed with a client creation timeout, fix succeeds at 27K ops/s
  • 896t-H2 throughput improves +10% (18,855 to 20,766 ops/s) thanks to reduced thread overhead

@github-actions github-actions Bot added the Cosmos label May 5, 2026
@xinlian12 xinlian12 force-pushed the fix/shared-schedulers-thread-scaling branch from db98eb9 to c42cfe8 on May 5, 2026 22:36
@xinlian12 xinlian12 changed the title from "[Cosmos] Replace per-client schedulers with shared CosmosSchedulers to fix thread scaling" to "[Cosmos][NO Review] Replace per-client schedulers with shared CosmosSchedulers to fix thread scaling" on May 5, 2026
@xinlian12 xinlian12 marked this pull request as ready for review May 5, 2026 22:38
Copilot AI review requested due to automatic review settings May 5, 2026 22:38
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners May 5, 2026 22:38
@xinlian12
Member Author

@sdkReviewAgent

Copilot AI left a comment

Pull request overview

This PR addresses a thread-scaling issue in the Cosmos Java SDK where per-client Schedulers.newSingle() usage caused background thread counts to grow linearly with the number of client instances. It introduces shared schedulers in CosmosSchedulers and updates background work to run on those shared schedulers instead of allocating dedicated per-client threads.

Changes:

  • Added shared bounded-elastic schedulers to CosmosSchedulers for Global Endpoint Manager refresh and per-partition availability checks.
  • Updated GlobalEndpointManager background refresh to use the shared scheduler and removed per-instance scheduler disposal.
  • Updated GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker to use the shared scheduler and removed per-instance scheduler disposal.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| ---- | ----------- |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/CosmosSchedulers.java | Adds shared bounded-elastic schedulers for endpoint refresh and partition availability checks. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/GlobalEndpointManager.java | Switches background location refresh work from per-instance single scheduler to shared bounded-elastic scheduler. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/perPartitionCircuitBreaker/GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java | Switches staleness check work from per-instance single scheduler to shared bounded-elastic scheduler. |

@xinlian12
Member Author

Review complete (47:35)

No new comments — existing review coverage is sufficient.

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

…ead scaling

Thread count for 'global-ep-mgr' and 'partition-availability-staleness-check'
threads was scaling linearly with tenant/client count because both
GlobalEndpointManager and GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker
created per-instance Schedulers.newSingle() schedulers.

Changes:
- Add GLOBAL_ENDPOINT_MANAGER_BOUNDED_ELASTIC and
  PARTITION_AVAILABILITY_CHECK_BOUNDED_ELASTIC shared schedulers to CosmosSchedulers
- GlobalEndpointManager: Replace per-instance scheduler with shared scheduler,
  track background refresh Disposable via AtomicReference for immediate cleanup
  on close(). Use getAndSet() to atomically dispose old subscriptions on reschedule.
- GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker: Replace per-instance
  scheduler with shared scheduler, track recovery Disposable via AtomicReference
  for immediate cleanup on close(). Use compareAndSet on isPartitionRecoveryTaskRunning
  to prevent duplicate background tasks under concurrent init() calls.

Shared BoundedElastic schedulers reuse threads with 60s TTL, preventing thread
count from growing with client count while still supporting concurrent background
tasks from multiple clients.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 force-pushed the fix/shared-schedulers-thread-scaling branch from c42cfe8 to 02b230e on May 6, 2026 04:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12 xinlian12 changed the title from "[Cosmos][NO Review] Replace per-client schedulers with shared CosmosSchedulers to fix thread scaling" to "[Cosmos] Replace per-client schedulers with shared CosmosSchedulers to fix thread scaling" on May 6, 2026
@FabianMeiswinkel FabianMeiswinkel (Member) left a comment

LGTM

@kushagraThapar kushagraThapar (Member) left a comment

LGTM, thanks @xinlian12

@xinlian12 xinlian12 merged commit 9ca9e54 into Azure:main May 7, 2026
104 checks passed