
fix: reuse containers per session #489

Open
areebahmeddd wants to merge 4 commits into docker:main from areebahmeddd:fix/483-container-leak

@areebahmeddd areebahmeddd commented May 16, 2026

What I did

  • Reused a single Docker MCP container per client session instead of creating a new container for every tool call, preventing container leaks and zombie processes under concurrent load
  • Added safer concurrent client initialization and automatic cleanup of cached containers when client sessions disconnect
  • Used VS Code Copilot (Claude Sonnet 4.6) to understand the issue context and help generate a fix; the changes were self-reviewed

Related issue

Fixes #483

@areebahmeddd areebahmeddd requested a review from a team as a code owner May 16, 2026 18:50
@areebahmeddd areebahmeddd deleted the fix/483-container-leak branch May 18, 2026 21:33
@areebahmeddd areebahmeddd restored the fix/483-container-leak branch May 18, 2026 21:33
@areebahmeddd areebahmeddd reopened this May 18, 2026
@areebahmeddd
Author

gentle ping! @cutecatfann 🙂

Contributor

@cutecatfann cutecatfann left a comment

Thanks for the PR and for digging into this issue. Looks like the right start :)

A few things I'd like addressed before merging:

  1. The Spec.Image != "" override in longLived() is too broad
  if serverConfig.Spec.Image != "" {
      return true
  }

This forces every container-based server to be long-lived regardless of its LongLived flag. Some servers are intentionally stateless and expect a fresh container per call. The double-checked locking alone already fixes the duplication race (multiple goroutines creating containers for the same key concurrently). Please remove this addition; the existing serverConfig.Spec.LongLived || cp.LongLived logic should continue to control caching behavior, and the race fix ensures that when caching IS enabled, only one container is created.

  2. The concurrency test doesn't test the actual code path
    TestAcquireClientNoDuplicatesUnderConcurrency manually reimplements the locking logic inside each goroutine instead of calling cp.AcquireClient(). This means the test would pass even if AcquireClient itself still had the race. Please rewrite it to call the real method (you may need to mock the Docker client / container creation to avoid starting real containers).

  3. ReleaseClientsForSession can trigger container creation
    If a clientGetter was stored in the map but GetClient was never called (container never started), calling kc.Getter.GetClient(context.TODO()) inside ReleaseClientsForSession will actually trigger the sync.Once and start a container just to close it.

  4. Minor: context.TODO()
    Both ReleaseClientsForSession and the existing Close() use context.TODO(). For the new code, consider accepting or deriving a proper context so the cleanup calls can be cancelled during gateway shutdown.
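
To make the first point concrete, here is a rough sketch of how the LongLived flags alone could decide caching, with double-checked locking guarding the cached path. It is not the gateway's code: the pool, serverSpec, and container types, the acquire and countStarts helpers, and the image name are all hypothetical stand-ins, and the container start is simulated with a counter.

```go
package main

import (
	"fmt"
	"sync"
)

// serverSpec mirrors only the two fields named in the review; hypothetical.
type serverSpec struct {
	Image     string
	LongLived bool
}

// container stands in for a started container.
type container struct{ id int }

// pool is a simplified stand-in for the client pool.
type pool struct {
	LongLived bool // gateway-wide setting

	mu     sync.RWMutex
	cached map[string]*container
	starts int // guarded by mu
}

// longLived consults only the flags; a non-empty Image does not force caching.
func (p *pool) longLived(spec serverSpec) bool {
	return spec.LongLived || p.LongLived
}

// start simulates creating a container. Callers must hold p.mu for writing.
func (p *pool) start() *container {
	p.starts++
	return &container{id: p.starts}
}

func (p *pool) acquire(name string, spec serverSpec) *container {
	if !p.longLived(spec) {
		p.mu.Lock()
		defer p.mu.Unlock()
		return p.start() // stateless server: fresh container per call
	}

	// First check under the read lock.
	p.mu.RLock()
	c, ok := p.cached[name]
	p.mu.RUnlock()
	if ok {
		return c
	}

	// Second check under the write lock: another goroutine may have
	// inserted the entry between the two locks.
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.cached[name]; ok {
		return c
	}
	c = p.start()
	p.cached[name] = c
	return c
}

// countStarts runs the given number of concurrent acquires for one server
// name and returns how many containers were started.
func countStarts(spec serverSpec, calls int) int {
	p := &pool{cached: make(map[string]*container)}
	var wg sync.WaitGroup
	for i := 0; i < calls; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p.acquire("server-a", spec)
		}()
	}
	wg.Wait()
	return p.starts
}

func main() {
	img := "example/image" // hypothetical image name
	fmt.Println(countStarts(serverSpec{Image: img, LongLived: true}, 20))  // 1
	fmt.Println(countStarts(serverSpec{Image: img, LongLived: false}, 20)) // 20
}
```

In this sketch the start happens while the write lock is held, which keeps the example short; a real pool would more likely cache a sync.Once-backed getter so a slow container start does not block lookups for other keys.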

@areebahmeddd
Author

The Spec.Image != "" override in longLived() is too broad

cool, i've removed the image check entirely

The concurrency test doesn't test the actual code path

rewrote the test to call cp.AcquireClient() directly
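
For reference, a test of this shape might look roughly like the sketch below: container creation sits behind an interface so a counting fake can be injected, and every goroutine calls the method under test instead of copying its locking. The starter interface, fakeStarter, the stand-in pool with its acquireClient method, and the hammer and runCheck helpers are all assumptions made for illustration, not the actual types in this PR.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// starter abstracts container creation so a test can inject a fake.
type starter interface {
	Start(key string) (string, error)
}

// fakeStarter counts calls and returns made-up container IDs.
type fakeStarter struct{ calls atomic.Int64 }

func (f *fakeStarter) Start(key string) (string, error) {
	return fmt.Sprintf("fake-%s-%d", key, f.calls.Add(1)), nil
}

// pool is a minimal stand-in for the code under test.
type pool struct {
	mu      sync.Mutex
	ids     map[string]string
	starter starter
}

// acquireClient returns the cached ID for key, starting a container once.
func (p *pool) acquireClient(key string) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if id, ok := p.ids[key]; ok {
		return id, nil
	}
	id, err := p.starter.Start(key)
	if err != nil {
		return "", err
	}
	p.ids[key] = id
	return id, nil
}

// hammer releases n goroutines at the same moment, each calling the real
// method, and returns the number of distinct container IDs observed.
func hammer(p *pool, key string, n int) (int, error) {
	gate := make(chan struct{})
	ids := make([]string, n)
	errs := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-gate // wait so all goroutines contend together
			ids[i], errs[i] = p.acquireClient(key)
		}(i)
	}
	close(gate)
	wg.Wait()

	distinct := make(map[string]bool)
	for i := range ids {
		if errs[i] != nil {
			return 0, errs[i]
		}
		distinct[ids[i]] = true
	}
	return len(distinct), nil
}

// runCheck wires the fake into the pool and reports starts and distinct IDs.
func runCheck(n int) (starts int64, distinct int, err error) {
	fake := &fakeStarter{}
	p := &pool{ids: make(map[string]string), starter: fake}
	distinct, err = hammer(p, "server-a", n)
	return fake.calls.Load(), distinct, err
}

func main() {
	starts, distinct, err := runCheck(100)
	fmt.Println(starts, distinct, err) // 1 1 <nil>
}
```

In a real _test.go file the same checks would live in a func TestXxx(t *testing.T) and fail through t.Fatalf; running it with go test -race adds a check for unsynchronized access that the counts alone would not catch.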

ReleaseClientsForSession can trigger container creation

let me add a check so we don’t start a container during cleanup
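
One possible shape for that check, sketched under assumptions: each cached entry records whether its sync.Once has actually run, and cleanup consults that flag instead of calling the getter. The context is derived from a parent so that shutdown can cancel the cleanup. The lazyClient and pool types and the releaseSession and demo helpers are hypothetical, and the container start and close are simulated.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// lazyClient starts its container on first use only (hypothetical type).
type lazyClient struct {
	once    sync.Once
	started atomic.Bool
	starts  *atomic.Int64 // shared count of simulated container starts
	closed  bool
}

// get starts the container at most once.
func (l *lazyClient) get() {
	l.once.Do(func() {
		l.starts.Add(1)
		l.started.Store(true)
	})
}

// closeIfStarted reports whether it closed anything. It never calls get,
// so cleanup cannot start a container just to stop it.
func (l *lazyClient) closeIfStarted(ctx context.Context) (bool, error) {
	if !l.started.Load() {
		return false, nil
	}
	if err := ctx.Err(); err != nil {
		return false, err // cleanup was cancelled, e.g. during shutdown
	}
	l.closed = true // simulated close
	return true, nil
}

// pool maps session ID -> server name -> lazily started client.
type pool struct {
	mu       sync.Mutex
	sessions map[string]map[string]*lazyClient
}

// releaseSession detaches the session's entries under the lock, then
// closes only the clients that were actually started.
func (p *pool) releaseSession(ctx context.Context, session string) (int, error) {
	p.mu.Lock()
	clients := p.sessions[session]
	delete(p.sessions, session)
	p.mu.Unlock()

	closed := 0
	for _, c := range clients {
		ok, err := c.closeIfStarted(ctx)
		if err != nil {
			return closed, err
		}
		if ok {
			closed++
		}
	}
	return closed, nil
}

// demo registers two clients, uses one, then releases the session. With
// cancelFirst set, the parent context is cancelled before cleanup runs.
func demo(cancelFirst bool) (closed int, starts int64, err error) {
	var counter atomic.Int64
	used := &lazyClient{starts: &counter}
	unused := &lazyClient{starts: &counter}
	p := &pool{sessions: map[string]map[string]*lazyClient{
		"session-1": {"server-a": used, "server-b": unused},
	}}
	used.get()

	parent, stop := context.WithCancel(context.Background())
	defer stop()
	if cancelFirst {
		stop()
	}
	ctx, cancel := context.WithTimeout(parent, 5*time.Second)
	defer cancel()

	closed, err = p.releaseSession(ctx, "session-1")
	return closed, counter.Load(), err
}

func main() {
	fmt.Println(demo(false)) // 1 1 <nil>
}
```

This sketch does not cover a start that is still in flight when the session is released: such an entry would be skipped and could leak, so a real implementation would need to wait for it or mark the entry as released.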

Minor: context.TODO()

sure 👍🏻

Development

Successfully merging this pull request may close these issues.

Gateway leaks containers and zombie processes under concurrent tool calls
