Issue Description
This is caught by telemetry: "Node 2 does not have a provider".
A race condition between node shutdown and packet processing causes an InternalErrorException crash with the message "Node {X} does not have a provider" in NodeManager.SendData.
Stack trace:
at Microsoft.Build.Shared.ErrorUtilities.ThrowInternalError(String message, Object[] args)
at Microsoft.Build.BackEnd.NodeManager.SendData(Int32 node, INodePacket packet)
at Microsoft.Build.Execution.BuildManager.HandleConfigurationRequest(Int32 node, BuildRequestConfiguration unresolvedConfiguration)
at Microsoft.Build.Execution.BuildManager.ProcessPacket(Int32 node, INodePacket packet)
at Microsoft.Build.Execution.BuildManager.ProcessWorkQueue(Action action)
Steps to Reproduce
NodeManager.DeserializeAndRoutePacket eagerly removes a node from _nodeIdToProvider on the communication thread when it receives a NodeShutdown packet, before the packet is routed to BuildManager. Meanwhile, BuildManager processes packets asynchronously via a work queue (ActionBlock<Action>). This creates a window where:
1. An out-of-proc worker node (Node 2) sends a BuildRequestConfiguration packet, which is enqueued in BuildManager._workQueue.
2. The same node shuts down (or dies). The communication thread receives the NodeShutdown packet and immediately calls NodeManager.RemoveNodeFromMapping(nodeId), removing the node from _nodeIdToProvider.
3. The NodeShutdown packet is then routed to BuildManager.PacketReceived, which enqueues it in the work queue after the configuration request from step 1.
4. The work queue processes the BuildRequestConfiguration first (FIFO). HandleConfigurationRequest resolves the configuration and calls _nodeManager.SendData(node, response).
5. SendData calls _nodeIdToProvider.TryGetValue(node, ...); the node has already been removed, so it throws InternalErrorException.
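The ordering above can be replayed deterministically on a single thread, which may help when writing a regression test. The following is only a minimal sketch: ToyNodeManager and RaceDemo are hypothetical stand-ins, not MSBuild types, providers are modeled as strings, and InvalidOperationException stands in for InternalErrorException.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for NodeManager; only the mapping behavior is modeled.
public sealed class ToyNodeManager
{
    private readonly Dictionary<int, string> _nodeIdToProvider = new Dictionary<int, string>();

    public void AddNode(int nodeId, string provider) => _nodeIdToProvider[nodeId] = provider;

    // Models the eager removal performed on the communication thread.
    public void RemoveNodeFromMapping(int nodeId) => _nodeIdToProvider.Remove(nodeId);

    // Models SendData: throws when the node is no longer mapped.
    public void SendData(int nodeId, string packet)
    {
        if (!_nodeIdToProvider.TryGetValue(nodeId, out _))
        {
            throw new InvalidOperationException($"Node {nodeId} does not have a provider");
        }
    }
}

public static class RaceDemo
{
    // Replays the sequence in a fixed order and returns the failure message,
    // or "no failure" if every queued action completes.
    public static string Run()
    {
        var nodeManager = new ToyNodeManager();
        nodeManager.AddNode(2, "out-of-proc");
        var workQueue = new Queue<Action>();

        // Step 1: the configuration request is queued; handling it sends a response.
        workQueue.Enqueue(() => nodeManager.SendData(2, "configuration response"));

        // Step 2: the shutdown is seen first by the communication thread,
        // which removes the node before anything is dequeued.
        nodeManager.RemoveNodeFromMapping(2);

        // Step 3: the shutdown packet is queued behind the configuration request.
        workQueue.Enqueue(() => { /* shutdown handling would run here */ });

        try
        {
            // Steps 4 and 5: FIFO processing reaches SendData for a removed node.
            while (workQueue.Count > 0)
            {
                workQueue.Dequeue()();
            }
        }
        catch (InvalidOperationException ex)
        {
            return ex.Message;
        }

        return "no failure";
    }
}
```

Calling RaceDemo.Run() returns the same message that telemetry reports for Node 2. In the real code the two sides run on different threads, so the failure depends on timing rather than occurring on every run.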
Expected Behavior
No crash occurs when a node shuts down while its packets are still queued.
Actual Behavior
NodeManager.SendData throws InternalErrorException with the message "Node {X} does not have a provider", which aborts the build.
Additional Concerns
- No synchronization in NodeManager: _nodeIdToProvider is a plain Dictionary<int, INodeProvider> with no locking. RemoveNodeFromMapping runs on the communication thread while SendData runs on the work queue thread, creating a potential concurrent dictionary access issue.
- HandleResourceRequest is also vulnerable: It calls SendData from a ContinueWith callback outside _syncLock, so the node can be removed between the resource grant and the response send.
- Hang potential: The crash is caught by ProcessWorkQueue's generic exception handler and routed to OnThreadException, which aborts the build. In the VS host, EndBuild then waits for _noNodesActiveEvent, which depends on the NodeShutdown packet (still in the queue) being processed. This usually resolves, but if the node died without sending a proper NodeShutdown, VS can hang permanently.
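To make the first concern concrete, here is one possible shape for a synchronized mapping whose send path tolerates a node that has already been removed. This is a hedged sketch rather than a proposed patch: GuardedNodeMap and TrySend are hypothetical names, providers are modeled as strings, and whether silently dropping a response to a departed node is acceptable for MSBuild is an assumption that would need to be checked against the build protocol.

```csharp
using System.Collections.Generic;

// Hypothetical lock-guarded replacement for a plain Dictionary-based node map.
public sealed class GuardedNodeMap
{
    private readonly object _lock = new object();
    private readonly Dictionary<int, string> _nodeIdToProvider = new Dictionary<int, string>();

    // Records what was "sent", so the behavior is observable in this sketch.
    public List<string> Sent { get; } = new List<string>();

    public void Add(int nodeId, string provider)
    {
        lock (_lock)
        {
            _nodeIdToProvider[nodeId] = provider;
        }
    }

    public bool Remove(int nodeId)
    {
        lock (_lock)
        {
            return _nodeIdToProvider.Remove(nodeId);
        }
    }

    // Returns false instead of throwing when the node is no longer mapped.
    public bool TrySend(int nodeId, string packet)
    {
        lock (_lock)
        {
            string provider;
            if (!_nodeIdToProvider.TryGetValue(nodeId, out provider))
            {
                return false;
            }

            // A real implementation would hand the packet to the provider here.
            Sent.Add(provider + ":" + packet);
            return true;
        }
    }
}
```

With this shape, a caller such as a configuration or resource response path could check the return value and skip the send when the node is gone. It does not address the hang described in the last bullet, which depends on the NodeShutdown packet actually arriving.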
Analysis
No response
Versions & Configurations
Environment: VS Enterprise 18.6, .NET Framework, in-process (devenv) build host (though this appears to be a long-standing issue)