HDDS-14793. Fix race condition in XceiverClientGrpc#connectToDatanode causing intermittent NPEs.#9997

Draft
ptlrs wants to merge 6 commits into apache:master from ptlrs:HDDS-14793-Intermittent-NPE-in-XceiverClientGrpc
Conversation

@ptlrs
Contributor

@ptlrs ptlrs commented Mar 29, 2026

What changes were proposed in this pull request?

XceiverClientGrpc#connectToDatanode intermittently fails with an NPE. The root cause is a race condition between creating a channel and creating its stub for a given datanode.

When a new channel is created for a DN, it is put into the channels map. However, the presence of a channel in that map does not imply that the corresponding stub for the same DN already exists in the asyncStubs map.

If another thread accesses the stub after the channel has been created but before the stub has been, it gets an NPE.
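The failure mode can be sketched with two plain maps standing in for the channels and asyncStubs fields (the names, key, and string values below are simplified stand-ins, not the actual Ozone code; the interleaving is written out sequentially for clarity):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TwoMapRace {
  // Simplified stand-ins for XceiverClientGrpc's two independent maps.
  static final Map<String, String> channels = new ConcurrentHashMap<>();
  static final Map<String, String> asyncStubs = new ConcurrentHashMap<>();

  public static void main(String[] args) {
    String dnId = "dn-1";

    // Thread A (connectToDatanode): the channel is created and published...
    channels.put(dnId, "channel");
    // ...but the stub has not been created yet.

    // Thread B: sees the channel and assumes the stub must exist too.
    if (channels.containsKey(dnId)) {
      String stub = asyncStubs.get(dnId);  // null: stub not published yet
      System.out.println("stub is null: " + (stub == null));
    }

    // Thread A finally publishes the stub, too late for thread B.
    asyncStubs.put(dnId, "stub");
  }
}
```

In the real client, thread B dereferences the missing stub directly, which is where the NPE surfaces.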

This PR fixes the problem by:

  • maintaining a single dnChannelInfoMap for both channels and stubs, instead of two independent maps
  • introducing a ChannelInfo class that groups a channel with its stub
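A minimal sketch of the single-map idea, assuming a simplified ChannelInfo whose Object fields merely stand in for the real gRPC channel and stub types:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SingleMapFix {
  // Groups a channel with its stub so they are published atomically.
  static final class ChannelInfo {
    final Object channel;
    final Object stub;
    ChannelInfo(Object channel, Object stub) {
      this.channel = channel;
      this.stub = stub;
    }
  }

  static final Map<String, ChannelInfo> dnChannelInfoMap =
      new ConcurrentHashMap<>();

  static ChannelInfo connect(String dnId) {
    // The factory creates channel and stub together, and computeIfAbsent
    // publishes the finished ChannelInfo in one atomic step.
    return dnChannelInfoMap.computeIfAbsent(dnId,
        id -> new ChannelInfo(new Object() /* channel */,
                              new Object() /* stub */));
  }

  public static void main(String[] args) {
    ChannelInfo info = connect("dn-1");
    System.out.println(info.channel != null && info.stub != null);
  }
}
```

Because the map entry is created in one step, a reader can never observe a channel without its stub, which is exactly the window the old two-map layout left open.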

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14793

How was this patch tested?

CI: https://github.com/ptlrs/ozone/actions/runs/23703558972
Flaky test runner: https://github.com/ptlrs/ozone/actions/runs/23823575690

Contributor

@adoroszlai adoroszlai left a comment


Thanks @ptlrs for working on this.

@ptlrs
Contributor Author

ptlrs commented Mar 29, 2026

Thanks for the review @adoroszlai, I have pushed the fixes.

@ptlrs ptlrs requested a review from adoroszlai March 30, 2026 16:28
Contributor

@adoroszlai adoroszlai left a comment


Thanks @ptlrs for updating the patch. The compute logic looks good now.

Comment on lines +239 to +243

```diff
+    if (details == null || !dnChannelInfoMap.containsKey(details.getID())) {
+      return false;
+    }
-  private boolean isConnected(ManagedChannel channel) {
-    return channel != null && !channel.isTerminated() && !channel.isShutdown();
+    return !dnChannelInfoMap.get(details.getID()).isChannelInactive();
```
Contributor

Avoid separate containsKey and get.

```java
if (details == null) {
  return false;
}
ChannelInfo info = dnChannelInfoMap.get(details.getID());
return info != null && !info.isChannelInactive();
```
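The reason the single lookup matters can be demonstrated with a plain ConcurrentHashMap (a hypothetical stand-in for dnChannelInfoMap): between a containsKey and the subsequent get, another thread may remove the entry, so the second call can return null even though the first check passed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SingleGet {
  static final Map<String, String> map = new ConcurrentHashMap<>();

  // Racy: the entry can vanish between containsKey() and get(),
  // so dereferencing the result of get() may throw an NPE.
  static boolean activeRacy(String key) {
    if (!map.containsKey(key)) {
      return false;
    }
    return map.get(key) != null;
  }

  // Safe: one lookup, then a null check on that single snapshot.
  static boolean activeSafe(String key) {
    String value = map.get(key);
    return value != null;
  }

  public static void main(String[] args) {
    map.put("dn-1", "open");
    System.out.println(activeSafe("dn-1"));
    System.out.println(activeSafe("dn-2"));
  }
}
```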

Comment on lines +266 to +267

```java
    .map(ChannelInfo::getChannel)
    .collect(Collectors.toList());
```
Contributor

ChannelInfo.channel may be null (or at least there are checks for that elsewhere).

Suggested change

```diff
     .map(ChannelInfo::getChannel)
+    .filter(Objects::nonNull)
     .collect(Collectors.toList());
```
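A self-contained illustration of the Objects::nonNull filter (the list below is hypothetical data, not the real channel collection):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class NonNullFilter {
  public static void main(String[] args) {
    // A null element stands in for a ChannelInfo whose channel was never set.
    List<String> channels = Arrays.asList("ch-1", null, "ch-2");

    List<String> live = channels.stream()
        .filter(Objects::nonNull)   // drop entries with no channel
        .collect(Collectors.toList());

    System.out.println(live);
  }
}
```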

Contributor Author

Fixed

Comment on lines 735 to 754

```diff
       throw new IOException("This channel is not connected.");
     }

-    ManagedChannel channel = channels.get(dn.getID());
-    // If the channel doesn't exist for this specific datanode or the channel
-    // is closed, just reconnect
-    if (!isConnected(channel)) {
+    // If the channel doesn't exist for this specific datanode or the channel is closed, just reconnect
+    if (!isConnected(dn)) {
       reconnect(dn);
     }
   }

   private void reconnect(DatanodeDetails dn)
       throws IOException {
-    ManagedChannel channel;
     try {
       connectToDatanode(dn);
-      channel = channels.get(dn.getID());
     } catch (Exception e) {
       throw new IOException("Error while connecting", e);
     }

-    if (!isConnected(channel)) {
+    if (!isConnected(dn)) {
       throw new IOException("This channel is not connected.");
     }
```
Contributor

Given that connectToDatanode handles all cases ("closed", "already connected", "needs new connection"), checkOpen can be simplified:

```java
connectToDatanode(dn);

if (!isConnected(dn)) {
  throw new IOException("This channel is not connected.");
}
```

and reconnect can be removed.

Comment on lines -201 to -203

```java
} catch (RuntimeException e) {
  LOG.error("Failed to create channel to datanode {}", dn, e);
  throw new IOException(e.getCause());
```
Contributor

We need to keep the translation back from unchecked exception to IOException, because callers may not handle the former.

```java
try {
  dnChannelInfoMap.compute(...);
} catch (UncheckedIOException e) {
  LOG.error("Failed to create channel to datanode {}", dn, e);
  throw e.getCause();
}
```
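The wrap-and-unwrap pattern can be shown end to end with only the JDK (connect and the map below are hypothetical stand-ins for the real client code). Because a compute lambda cannot throw a checked exception, the lambda wraps the IOException in an UncheckedIOException, and the caller translates it back:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ComputeIOException {
  static final Map<String, String> map = new ConcurrentHashMap<>();

  static String connect(String dnId) throws IOException {
    try {
      return map.compute(dnId, (id, old) -> {
        if (old == null) {
          // Stand-in for a channel-creation failure inside the lambda.
          throw new UncheckedIOException(
              new IOException("cannot reach " + id));
        }
        return old;
      });
    } catch (UncheckedIOException e) {
      // Unwrap back to the checked IOException callers expect.
      throw e.getCause();
    }
  }

  public static void main(String[] args) {
    try {
      connect("dn-1");
    } catch (IOException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

If the lambda throws, ConcurrentHashMap leaves the mapping unchanged, so a failed connection attempt does not pollute the map.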

Contributor Author

Fixed

@ptlrs
Contributor Author

ptlrs commented Apr 3, 2026

@adoroszlai thanks for the instructions on how to run the flaky test CI.

After running it on master and on a test branch containing the latest changes of this PR, we see the following results:

| Branch | Failures |
| --- | --- |
| master | 14/100 |
| FlakyTest-HDDS-14793-Intermittent-NPE-in-XceiverClientGrpc | 0/100 |

@ptlrs ptlrs requested a review from adoroszlai April 3, 2026 00:43