Fix process crash when testing Stderr#1449
Fix process crash when testing Stderr#1449ericstj wants to merge 3 commits intomodelcontextprotocol:mainfrom
Conversation
| // WaitForExit(int) (as used by KillTree) does not. We must call | ||
| // this whether the process exited on its own or was killed above, | ||
| // otherwise ErrorDataReceived handlers may fire after Dispose(). | ||
| if (HasExited(process)) |
There was a problem hiding this comment.
I don't understand this change or the comment on it. If HasExited return false, how does this prevent ErrorDataReceived from firing after Dispose?
There was a problem hiding this comment.
Here's what Copilot says:
The guard is there because WaitForExit() (no-arg) blocks indefinitely if the process hasn't exited.
Consider the scenarios at that point in the code:
| Scenario | HasExited | What happens |
|---|---|---|
| Process exited on its own | true | Call WaitForExit() to flush events ✓ |
| KillTree killed it | true | Call WaitForExit() to flush events ✓ |
| KillTree called but process didn't die (edge case) | false | Skip — avoids hanging forever |
| Process never started (processStarted=false) | throws → true | Caught by the outer try |
Without the guard, an unkillable process (or a race between kill and exit check) would cause WaitForExit() to hang the calling thread indefinitely. The tradeoff is that in that rare edge case, some ErrorDataReceived events may not be fully dispatched before Dispose() — but that's better than a deadlock.
HasExited returns true when the process has exited, but the no-arg WaitForExit() does additional work beyond waiting for exit — it drains the async stdout/stderr output readers.
Here are the docs that confirm that indeed you must call WaitForExit() to flush redirected events.
https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.process.waitforexit?view=net-10.0#system-diagnostics-process-waitforexit(system-int32)
Seems we could have a better name such an API - like Flush() or something.
There was a problem hiding this comment.
I do not understand that response. My question wasn't about whether there needs to be a guard on the HasExited call... there already was a guard prior to this PR. This PR is just changing that guard from:
if (!processRunning && HasExited(process))to
if (HasExited(process))and the value of processRunning is not impacted at all based on whether KillTree was successful or not.
Moreover, the comment that was added suggests that "we must call [WaitForExit] [...] otherwise ErrorDataReceived handlers may fire after Dispose()" but there's nothing here preventing that race condition when we don't call WaitForExit because the process hasn't yet exited by the time we get here. So the logic is lost on me.
What am I missing?
There was a problem hiding this comment.
Did the comment update help? Your !processRunning check was causing this to not call WaitForExit() in the KillTree case, since we only call KillTree when processRunning is true. So in that KillTree case we wouldn't flush the handlers at all. KillTree, when successful should cause the process to synchronously exit and HasExited will be true. The process exiting of its own accord also causes the HasExited to be true. I don't think we need a guard on HasExited but I can't be sure why it was placed there before. It seems to contradict the comment.
We can't remove HasExited check since that would let us call WaitForExit even in the case that the process was hung, so we need to keep that check. There is of course the case that the process might exit on it's own after our HasExited check (if it just exceeds the shutown timeout) - that will still result in a race, but we accept uncommon race condition in favor of a hang. Maybe we should any event handlers instead?
This code is really difficult to understand because those names - HasExited means the native process actually exited, WaitForExit is called not to wait for that to happen but to wait to flush any pending redirected output events.
|
This may not be the root cause of the problem. We've now got dumps on both windows and linux. I'll have a look and see if the dumps reveal a better root-cause. |
Fix #1448
@stephentoub