Skip to content

Fix #16344: graceful HTTP/2 GOAWAY migration to eliminate request failure window#16345

Open
devjue wants to merge 1 commit into
apache:3.3from
devjue:fix/16344-graceful-goaway-migration
Open

Fix #16344: graceful HTTP/2 GOAWAY migration to eliminate request failure window#16345
devjue wants to merge 1 commit into
apache:3.3from
devjue:fix/16344-graceful-goaway-migration

Conversation

@devjue

@devjue devjue commented Jun 16, 2026

Copy link
Copy Markdown

What is the purpose of the change

Fixes #16344.

Eliminate the single-request failure window that occurs every time a Dubbo Triple Consumer receives an HTTP/2 GOAWAY(errorCode=0, lastStreamId=MAX_INT) frame from gateways / reverse proxies / providers that gracefully rotate connections (e.g. max_requests_per_connection, idle_timeout, hot-restart drain).

Previously, AbstractNettyConnectionClient#onGoaway nulled the channel reference immediately and NettyConnectionHandler#onGoAway then scheduled a reconnect, leaving a brief window in which AbstractClusterInvoker#checkInvokers throws RpcException before FailoverCluster's retry loop — so retries=N did NOT mitigate the issue. This PR implements graceful migration: keep the old channel serving until a new channel is ready, then atomically swap.

Brief changelog

  • NettyConnectionHandler#onGoAway: keep old channel alive; schedule attemptGracefulMigration on connectivityExecutor after GRACEFUL_RECONNECT_DELAY_MS = 200ms; one retry on failure, then fall back to AbstractNettyConnectionClient#onGoaway.
  • NettyConnectionClient#initBootstrap: closeFuture listener now calls compareAndClearNettyChannel(ch) instead of clearNettyChannel() — prevents the old channel's close listener from wiping a freshly swapped-in new channel.
  • AbstractNettyConnectionClient: add compareAndClearNettyChannel, plus package-private getConnectivityExecutor() / getReconnectDuration() accessors.
  • NettyChannel#getChannelIfPresent: lookup-only API used by channelInactive to avoid allocating a transient NettyChannel purely for logging.
  • TripleGoAwayHandler: log errorCode and lastStreamId so the GOAWAY trigger is observable.

Verifying this change

Validated against a local Triple demo with sustained 1 QPS traffic:

  • 26 GOAWAY frames received → 100% graceful migration success → zero No provider available errors and zero request failures.
  • Channel rotation chain (first 18 migrations):
    :45956 → :45110 → :53222 → :58366 → :56372 → :55436 → :55188 → :52926 → :50682 → :48682 → :47626 → :44014 → :43116 → :45044 → :48602 → :51880 → :50992 → :51672
  • No regression in existing dubbo-remoting-netty4 and dubbo-rpc-triple unit tests.

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency)
  • The public API
  • The runtime per-connection behavior (HTTP/2 GOAWAY handling on the Consumer side)
  • The persistence format of the configurations
  • The default values of configurations
  • The threading model (uses existing connectivityExecutor; does NOT introduce new threads)
  • The serialization protocol
  • The compatibility with previous versions

Backward-compatible: defaults are unchanged; the only observable difference is that GOAWAY frames no longer cause a request failure window.

Documentation

  • Does this pull request introduce a new feature? No (bug fix)
  • If yes, how is the feature documented? Not applicable

Happy to backport to 3.2 once this lands on 3.3.

…st failure window

Keep the old channel alive after a GOAWAY frame and bring up a new
connection asynchronously on the connectivity executor; atomically
swap the channel reference via a closeFuture CAS so the old channel's
close listener cannot wipe a freshly swapped-in new channel.

- NettyConnectionHandler: schedule attemptGracefulMigration on
  connectivityExecutor with GRACEFUL_RECONNECT_DELAY_MS=200ms; one
  retry on failure, then fall back to AbstractNettyConnectionClient#onGoaway.
- NettyConnectionClient#initBootstrap: closeFuture listener now
  compareAndClearNettyChannel(ch) instead of clearNettyChannel().
- AbstractNettyConnectionClient: add compareAndClearNettyChannel,
  plus package-private getConnectivityExecutor / getReconnectDuration.
- NettyChannel#getChannelIfPresent: lookup-only API used by
  channelInactive.
- TripleGoAwayHandler: log errorCode and lastStreamId.

Signed-off-by: Jue Zhang <devjue@gmail.com>
@devjue devjue force-pushed the fix/16344-graceful-goaway-migration branch from b111c6d to 02e4fa9 Compare June 16, 2026 06:36
@uuuyuqi

uuuyuqi commented Jun 16, 2026

Copy link
Copy Markdown

@chickenlj PTAL

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 61.90476% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.82%. Comparing base (866860d) to head (02e4fa9).

Files with missing lines Patch % Lines
...oting/transport/netty4/NettyConnectionHandler.java 51.85% 10 Missing and 3 partials ⚠️
.../dubbo/remoting/transport/netty4/NettyChannel.java 33.33% 1 Missing and 1 partial ⚠️
...ransport/netty4/AbstractNettyConnectionClient.java 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                3.3   #16345      +/-   ##
============================================
+ Coverage     60.80%   60.82%   +0.01%     
- Complexity       15    11771   +11756     
============================================
  Files          1953     1953              
  Lines         89208    89239      +31     
  Branches      13458    13459       +1     
============================================
+ Hits          54245    54276      +31     
+ Misses        29397    29395       -2     
- Partials       5566     5568       +2     
Flag Coverage Δ
integration-tests-java21 32.14% <40.47%> (+0.03%) ⬆️
integration-tests-java8 32.30% <40.47%> (+<0.01%) ⬆️
samples-tests-java21 32.17% <14.28%> (-0.01%) ⬇️
samples-tests-java8 29.82% <11.90%> (+0.10%) ⬆️
unit-tests-java11 59.05% <57.14%> (+0.02%) ⬆️
unit-tests-java17 58.48% <11.90%> (-0.04%) ⬇️
unit-tests-java21 58.50% <11.90%> (+<0.01%) ⬆️
unit-tests-java25 58.45% <11.90%> (+<0.01%) ⬆️
unit-tests-java8 59.03% <11.90%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Single-request failure window when receiving HTTP/2 GOAWAY due to immediate channel nullification

3 participants