Race condition in NodeManager.SendData causes crash: "Node X does not have a provider" #13362

@YuliiaKovalova

Issue Description

Telemetry caught this crash: "Node 2 does not have a provider".

A race condition between node shutdown and packet processing causes NodeManager.SendData to throw an InternalErrorException with the message "Node {X} does not have a provider".

Stack trace:

at Microsoft.Build.Shared.ErrorUtilities.ThrowInternalError(String message, Object[] args)
at Microsoft.Build.BackEnd.NodeManager.SendData(Int32 node, INodePacket packet)
at Microsoft.Build.Execution.BuildManager.HandleConfigurationRequest(Int32 node, BuildRequestConfiguration unresolvedConfiguration)
at Microsoft.Build.Execution.BuildManager.ProcessPacket(Int32 node, INodePacket packet)
at Microsoft.Build.Execution.BuildManager.ProcessWorkQueue(Action action)

Steps to Reproduce

NodeManager.DeserializeAndRoutePacket eagerly removes a node from _nodeIdToProvider on the communication thread when it receives a NodeShutdown packet, before the packet is routed to BuildManager. Meanwhile, BuildManager processes packets asynchronously via a work queue (ActionBlock<Action>). This creates a window where:

  1. An out-of-proc worker node (Node 2) sends a BuildRequestConfiguration packet, which is enqueued in BuildManager._workQueue.
  2. The same node shuts down (or dies). The communication thread receives the NodeShutdown packet and immediately calls NodeManager.RemoveNodeFromMapping(nodeId), removing the node from _nodeIdToProvider.
  3. The NodeShutdown packet is then routed to BuildManager.PacketReceived, which enqueues it in the work queue after the configuration request from step 1.
  4. The work queue processes the BuildRequestConfiguration first (FIFO). HandleConfigurationRequest resolves the configuration and calls _nodeManager.SendData(node, response).
  5. SendData calls _nodeIdToProvider.TryGetValue(node, ...) - the node has already been removed -> throws InternalErrorException.
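
The ordering in steps 1 to 5 can be replayed deterministically on a single thread. The sketch below is a simplified model, not MSBuild code: `RaceReplay`, its `Dictionary<int, string>` map, and the hand-drained `Queue<Action>` are hypothetical stand-ins for `NodeManager._nodeIdToProvider` and `BuildManager._workQueue`, and `InvalidOperationException` stands in for `InternalErrorException`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins only; none of these are the real MSBuild types.
public static class RaceReplay
{
    // Models NodeManager._nodeIdToProvider (the provider is reduced to a string).
    private static readonly Dictionary<int, string> NodeIdToProvider =
        new Dictionary<int, string>();

    // Models BuildManager._workQueue as a plain FIFO queue drained by hand.
    private static readonly Queue<Action> WorkQueue = new Queue<Action>();

    // Models NodeManager.SendData: throws when the node has no provider.
    private static void SendData(int node)
    {
        if (!NodeIdToProvider.TryGetValue(node, out _))
        {
            throw new InvalidOperationException(
                $"Node {node} does not have a provider");
        }
    }

    // Replays the ordering from the issue and returns the resulting message.
    public static string Run()
    {
        NodeIdToProvider.Clear();
        WorkQueue.Clear();
        NodeIdToProvider[2] = "out-of-proc";

        // Step 1: the configuration request from node 2 is enqueued.
        WorkQueue.Enqueue(() => SendData(2));

        // Step 2: the "communication thread" removes the mapping eagerly.
        NodeIdToProvider.Remove(2);

        // Step 3: the shutdown packet is enqueued behind the request.
        WorkQueue.Enqueue(() => { /* shutdown handling would run here */ });

        // Steps 4 and 5: the queue is drained in FIFO order, so the
        // request runs first and finds the mapping already gone.
        try
        {
            while (WorkQueue.Count > 0)
            {
                WorkQueue.Dequeue()();
            }
            return "no error";
        }
        catch (InvalidOperationException e)
        {
            return e.Message;
        }
    }
}
```

In this model, `RaceReplay.Run()` returns the same message as the telemetry report, because the removal in step 2 happens before the queued request from step 1 is processed.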

Expected Behavior

No crash occurs. SendData should not throw an InternalErrorException when the target node has already shut down.

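Actual Behavior

NodeManager.SendData throws an InternalErrorException with the message "Node {X} does not have a provider" (see the stack trace above).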

Additional Concerns

  • No synchronization in NodeManager: _nodeIdToProvider is a plain Dictionary<int, INodeProvider> with no locking. RemoveNodeFromMapping runs on the communication thread while SendData runs on the work queue thread, creating a potential concurrent dictionary access issue.
  • HandleResourceRequest is also vulnerable: It calls SendData from a ContinueWith callback outside _syncLock, so the node can be removed between the resource grant and the response send.
  • Hang potential: The crash is caught by ProcessWorkQueue's generic exception handler and routed to OnThreadException, which aborts the build. In the VS host, EndBuild then waits for _noNodesActiveEvent - which depends on the NodeShutdown packet (still in the queue) being processed. This usually resolves, but if the node died without sending a proper NodeShutdown, VS can hang permanently.
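
One possible direction for the first two concerns, sketched under assumptions: guard the node map with a lock and let the send path report a missing node instead of throwing. `GuardedNodeMap`, `TrySendData`, and the `Action<string>` sender are hypothetical names for illustration; this is not the real NodeManager API and not necessarily the fix the maintainers would choose.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; not the real NodeManager.
public sealed class GuardedNodeMap
{
    private readonly object _gate = new object();

    // Maps a node id to a delegate that delivers a packet to that node.
    private readonly Dictionary<int, Action<string>> _nodeIdToSender =
        new Dictionary<int, Action<string>>();

    public void AddNode(int nodeId, Action<string> sender)
    {
        lock (_gate) { _nodeIdToSender[nodeId] = sender; }
    }

    public void RemoveNode(int nodeId)
    {
        lock (_gate) { _nodeIdToSender.Remove(nodeId); }
    }

    // Returns false instead of throwing when the node is already gone,
    // so the caller can decide to drop the response.
    public bool TrySendData(int nodeId, string packet)
    {
        Action<string> sender;
        lock (_gate)
        {
            if (!_nodeIdToSender.TryGetValue(nodeId, out sender))
            {
                return false;
            }
        }

        // Invoked outside the lock so a slow send does not block removal.
        sender(packet);
        return true;
    }
}
```

The lock only makes the dictionary access safe. A node can still be removed between the lookup and the send, so the sender itself would need to tolerate a closed connection. A `ConcurrentDictionary<int, INodeProvider>` would be another way to address the unsynchronized access, with the same caveat.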

Analysis

No response

Versions & Configurations

Environment: VS Enterprise 18.6, .NET Framework, in-process (devenv) build host (but this appears to be a long-standing issue)
