fix(dashboard): fix Windows named pipe race on daemon shutdown #40642

Skn0tt merged 2 commits into microsoft:main
Conversation
Force-pushed from 8b12ec2 to cec169c
The previous test left the Windows named pipe connectable while the daemon was still shutting down (`gracefullyCloseAll`). The next test could connect to the dying daemon, get rejected, and leave `server.endpoint` undefined, cascading failures across many tests. Two changes fix this:

1. Daemon side: send the PID to the client and only call `gracefullyProcessExitDoNotHang` in the `socket.end()` flush callback, ensuring the client receives the PID before teardown begins. Also move the handler before `statePromise.then()` so it fires immediately.
2. Client side: after receiving the daemon PID, poll `process.kill(pid, 0)` until `ESRCH` (up to 35s). This guarantees the named pipe is released before the next test tries to acquire the singleton.

Also reverts f536e37 (`EBUSY` check + `server.close` in singleton handler), which was a previous incomplete attempt, and gives the `connectToDashboard` fixture a 60s timeout to accommodate graceful browser teardown.
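The client-side polling described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code; `waitForProcessExit` is a hypothetical helper name, and the timeout and poll interval are assumptions:

```typescript
// Poll the daemon PID with signal 0 (a liveness probe, not a kill) until the
// process is gone, at which point the OS has released the named pipe.
function waitForProcessExit(pid: number, timeoutMs = 35_000): Promise<void> {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = () => {
      try {
        process.kill(pid, 0); // throws ESRCH once the process has exited
      } catch (e) {
        if ((e as any).code === 'ESRCH')
          return resolve(); // process is gone; the pipe handle is released
        return reject(e);
      }
      if (Date.now() > deadline)
        return reject(new Error(`process ${pid} did not exit in time`));
      setTimeout(check, 100); // still alive; poll again shortly
    };
    check();
  });
}
```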
Force-pushed from cec169c to 5fe3cb5
```ts
  else
    resolve(pid);
});
client.once('error', () => reject(new Error('no dashboard running')));
```
I'd instead resolve with `undefined`; checking the error message always feels brittle.
```ts
}
// Poll until the daemon process exits — at that point the OS has released all
// its handles, including the named pipe, so the next acquisition won't see a stale pipe.
const deadline = Date.now() + 35000;
```
- Let's use `monotonicTime()`.
- Why do we `process.kill()` in a loop? Shouldn't once be enough?
`process.kill(pid, 0)` is for checking liveness, not killing. So we have to loop it.
I see, that was not clear to me 😄 I guess there is still a small chance of a race due to PID reuse? Perhaps we can first `gracefullyCloseAll()` and then reply with the PID? This way exiting the process should be instant, so we reduce the probability of a race.
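The signal-0 probe discussed in this thread can be shown in isolation. This is an illustrative sketch, not code from the PR; `isAlive` is a hypothetical helper name:

```typescript
// process.kill(pid, 0) performs only existence/permission checks;
// no signal is actually delivered to the target process.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (e) {
    // ESRCH: no such process. EPERM would mean alive but owned by another user.
    return (e as any).code !== 'ESRCH';
  }
}
```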
The failures above are unrelated; they also occur elsewhere. Gonna look at them next. This diff has the risk of PID reuse, see above. I'll come back later; for now this should already help our CI.
Test results for "MCP": 9 failed, 6934 passed, 1052 skipped. Merge workflow run.
Summary

Fixes a Windows CI flake where dashboard tests sporadically failed with `Cannot read properties of undefined (reading 'endpoint')`. The root cause: after `--kill`, Windows keeps the named pipe connectable during `gracefullyCloseAll`. The next test could connect to the dying daemon, get rejected mid-handshake, and see `server.endpoint` as undefined, cascading failures across many tests.

Daemon side (`dashboardApp.ts`): move the `kill` handler before `statePromise.then()` so it fires immediately. Send the daemon PID in `socket.end(data, callback)` and only call `gracefullyProcessExitDoNotHang` inside the flush callback, ensuring the client receives the PID before teardown begins.

Client side (`runKillClient`): after receiving the daemon PID, poll `process.kill(pid, 0)` until `ESRCH` (up to 35 s). This guarantees the named pipe is fully released before the next test tries to acquire the singleton.

Test fixture (`cli-fixtures.ts`): give `connectToDashboard` a 60 s fixture timeout.

Also reverts f536e37 (added `EBUSY` check + `server.close` in singleton handler), which was a previous incomplete attempt at fixing this.

Fixes #40626