fix: non-blocking server initialization for faster gateway startup #472

Open
jchangx wants to merge 2 commits into docker:main from jchangx:fix/non-blocking-server-init


@jchangx jchangx commented Apr 7, 2026

Summary

  • Move reloadConfiguration to a background goroutine so the transport server starts immediately instead of blocking until every MCP server has responded
  • Add a 15s per-server timeout in listCapabilities to prevent one slow/unreachable server from blocking the rest

Problem

When VPN-only servers (e.g. grafana-remote, sigma-remote at *.s.us-east-1.aws.dckr.io) are unreachable, the gateway takes ~60s to start because reloadConfiguration blocks on listCapabilities, which waits for each server's 30s transport timeout. This causes MCP clients (like Claude Code's /mcp reconnect) to time out before the gateway becomes ready.

Approach

Non-blocking init: The transport server (stdio/sse/streaming) now starts immediately. reloadConfiguration runs in a background goroutine. The go-sdk's Server.AddTool is thread-safe and automatically sends notifications/tools/list_changed to connected clients, so tools appear progressively as each server responds.
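The non-blocking startup described above can be sketched roughly as follows. This is a minimal illustration, not the gateway's code: `toolRegistry`, `startBackgroundLoad`, `demo`, and the tool names are hypothetical stand-ins, the mutex-guarded slice only mimics the thread safety the description attributes to `Server.AddTool`, and the `tools/list_changed` notification is not modeled.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// toolRegistry is a hypothetical stand-in for the server's tool list.
// The mutex lets the background loader add tools while other goroutines read.
type toolRegistry struct {
	mu    sync.Mutex
	tools []string
}

func (r *toolRegistry) add(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.tools = append(r.tools, name)
}

func (r *toolRegistry) count() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.tools)
}

// startBackgroundLoad launches load in a goroutine and returns immediately,
// so the caller can start its transport without waiting for the load.
func startBackgroundLoad(ctx context.Context, reg *toolRegistry,
	load func(context.Context, *toolRegistry) error) <-chan error {
	done := make(chan error, 1)
	go func() { done <- load(ctx, reg) }()
	return done
}

// demo holds the load back with a channel to show that the registry is
// usable (and still empty) while the load is pending.
func demo() (before, after int, err error) {
	reg := &toolRegistry{}
	release := make(chan struct{})
	load := func(ctx context.Context, r *toolRegistry) error {
		select {
		case <-release: // simulated slow servers finally respond
		case <-ctx.Done():
			return ctx.Err()
		}
		r.add("grafana.query") // hypothetical tool names
		r.add("sigma.search")
		return nil
	}
	done := startBackgroundLoad(context.Background(), reg, load)
	before = reg.count() // a transport could already be serving at this point
	close(release)
	err = <-done
	after = reg.count()
	return before, after, err
}

func main() {
	before, after, err := demo()
	fmt.Println("tools before load:", before, "after load:", after, "err:", err)
}
```

The point of the sketch is the ordering: the caller gets control back before any tool is registered, and tools become visible once the load finishes.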

Per-server timeout: Each server in listCapabilities now gets a 15s context timeout. This prevents one unreachable server from consuming the full transport timeout (30s) and ensures healthy servers aren't delayed waiting for the errgroup to complete.
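A per-server deadline of this kind might look like the sketch below. It is an assumption-laden illustration: `lister`, `result`, `listAll`, and `demo` are hypothetical names, it uses `sync.WaitGroup` from the standard library rather than the errgroup mentioned above, and the timeout is shortened to 20ms so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// lister is a hypothetical stand-in for one server's capability listing call.
type lister func(ctx context.Context) ([]string, error)

type result struct {
	tools []string
	err   error
}

// listAll queries every server concurrently. Each call gets its own deadline
// derived from ctx, so an unreachable server is bounded by perServer.
func listAll(ctx context.Context, servers map[string]lister, perServer time.Duration) map[string]result {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out = make(map[string]result, len(servers))
	)
	for name, list := range servers {
		wg.Add(1)
		go func(name string, list lister) {
			defer wg.Done()
			sctx, cancel := context.WithTimeout(ctx, perServer)
			defer cancel()
			tools, err := list(sctx)
			mu.Lock()
			out[name] = result{tools: tools, err: err}
			mu.Unlock()
		}(name, list)
	}
	wg.Wait()
	return out
}

// demo runs one healthy and one unreachable server through listAll.
func demo() (healthyTools int, timedOut bool) {
	servers := map[string]lister{
		"healthy": func(ctx context.Context) ([]string, error) {
			return []string{"search"}, nil
		},
		"unreachable": func(ctx context.Context) ([]string, error) {
			<-ctx.Done() // never responds; only the deadline ends the call
			return nil, ctx.Err()
		},
	}
	res := listAll(context.Background(), servers, 20*time.Millisecond)
	return len(res["healthy"].tools),
		errors.Is(res["unreachable"].err, context.DeadlineExceeded)
}

func main() {
	n, timedOut := demo()
	fmt.Println("healthy tools:", n, "unreachable timed out:", timedOut)
}
```

In this sketch the whole listing still waits for the slowest server, but that wait is capped by the per-server deadline instead of the longer transport timeout.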

Risk analysis

| Component | Risk | Notes |
| --- | --- | --- |
| startStdioServer | None | No dependency on tools being loaded |
| Server.AddTool thread safety | None | go-sdk uses mutex; fully thread-safe |
| OAuth provider startup | None | Independent of reloadConfiguration |
| Profile loading middleware | Low | Runs after client initialize, giving background load time |
| Health state | Low | Stays unhealthy until load completes (correct behavior) |
| Initial tools/list response | Low | Clients may see incomplete list briefly; tools/list_changed notification triggers re-fetch |
| Background reload failure | Low | Logged; tools load on next config update |
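The "Health state" and "Background reload failure" rows can be illustrated with a small sketch. The `health` type, `loadInBackground`, and `demo` are hypothetical names rather than the gateway's API; the sketch assumes a readiness flag that is only set after a successful load and a failure that is logged and leaves the flag unset.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync/atomic"
)

// health is a hypothetical readiness flag (atomic.Bool needs Go 1.19+).
type health struct {
	ready atomic.Bool
}

// loadInBackground runs load in a goroutine. The flag flips to ready only
// on success; a failure is logged and the state stays unhealthy.
func (h *health) loadInBackground(load func() error) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := load(); err != nil {
			log.Printf("background reload failed: %v", err)
			return
		}
		h.ready.Store(true)
	}()
	return done
}

// demo reports the flag while a load is pending, after it succeeds,
// and after a separate load fails.
func demo() (duringLoad, afterSuccess, afterFailure bool) {
	gate := make(chan struct{})
	h1 := &health{}
	done := h1.loadInBackground(func() error {
		<-gate // simulated slow load
		return nil
	})
	duringLoad = h1.ready.Load()
	close(gate)
	<-done
	afterSuccess = h1.ready.Load()

	h2 := &health{}
	<-h2.loadInBackground(func() error { return errors.New("server unreachable") })
	afterFailure = h2.ready.Load()
	return duringLoad, afterSuccess, afterFailure
}

func main() {
	during, ok, failed := demo()
	fmt.Println("ready during load:", during, "after success:", ok, "after failure:", failed)
}
```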

Test plan

  • go build ./pkg/gateway/... compiles cleanly
  • go test ./pkg/gateway/... -short passes
  • Manual test: start gateway with VPN off, verify stdio server starts in <2s
  • Manual test: verify tools appear after background load completes
  • Manual test: Claude Code /mcp reconnect works on first try without VPN
  • Verify tools/list_changed notification reaches Claude Code and triggers tool re-fetch

🤖 Generated with Claude Code

Move reloadConfiguration to a background goroutine so the transport
server (stdio/sse/streaming) starts immediately instead of waiting
for all MCP servers to respond. Unreachable servers (e.g. VPN-only
endpoints like grafana-remote and sigma-remote) previously caused
~60s startup delays due to 30s transport timeouts.

Also add a 15s per-server timeout in listCapabilities to prevent a
single slow/unreachable server from blocking all other servers in
the concurrent capability listing.

The go-sdk's Server.AddTool is thread-safe and automatically sends
tools/list_changed notifications to connected clients, so tools
appear progressively as each server responds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jchangx jchangx requested a review from a team as a code owner April 7, 2026 03:48

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bcd0e0b060

Comment thread pkg/gateway/run.go
Comment on lines +428 to +429
go func() {
	if err := g.reloadConfiguration(ctx, configuration, nil, nil); err != nil {
P1: Keep initial reload synchronous in dry-run mode

Launching reloadConfiguration in a goroutine here means gateway run --dry-run can return at line 436 before any capability discovery executes, because dry-run exits immediately and the process can terminate before the background goroutine runs. This regresses dry-run from a real configuration/capability validation pass into a best-effort no-op, so broken/unreachable server configs may now appear successful and expected dry-run output (discovered tools / initialization summary) can be missing.

Move the dry-run early-return before the background goroutine so that
dry-run mode runs reloadConfiguration synchronously. This ensures
server configs are fully validated and discovered tools are reported
before the process exits, instead of racing with a background goroutine
that may never complete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
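The ordering described in the commit above, with the dry-run check placed before the goroutine launch, can be sketched as below. The `run` and `demo` functions and their parameters are hypothetical and do not mirror the real signatures in pkg/gateway/run.go.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// run sketches the control flow: dry-run reloads synchronously and returns
// the result; a normal run reloads in the background and serves at once.
func run(dryRun bool, reload func() error, serve func()) error {
	if dryRun {
		// Synchronous: a broken config surfaces as an error before exit.
		return reload()
	}
	go func() {
		if err := reload(); err != nil {
			log.Printf("background reload failed: %v", err)
		}
	}()
	serve() // the transport starts without waiting for reload
	return nil
}

// demo exercises both paths with a reload that always fails.
func demo() (dryRunFailed, dryRunServed, normalServed bool) {
	broken := func() error { return errors.New("server unreachable") }
	dryRunFailed = run(true, broken, func() { dryRunServed = true }) != nil
	_ = run(false, broken, func() { normalServed = true })
	return dryRunFailed, dryRunServed, normalServed
}

func main() {
	failed, dryServed, served := demo()
	fmt.Println("dry-run reported failure:", failed,
		"dry-run served:", dryServed, "normal run served:", served)
}
```

With this ordering a dry run cannot exit before the reload has finished, which is the race the review comment pointed out.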