Fix flaky SendAsync_Success_ConnectionSetupActivityGraphRecorded test (Http1.1 + Http2)#129822
Fix flaky SendAsync_Success_ConnectionSetupActivityGraphRecorded test (Http1.1 + Http2)#129822Copilot wants to merge 3 commits into
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
… with TaskCompletionSource sync Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
lewing
left a comment
There was a problem hiding this comment.
This PR is misdiagnosed and shouldn't be merged. The "race between client assertions and server reading req2" framing isn't the cause of the flake — there is no such race. The loopback request/response pump already serializes the two sides.
Actual cause: the JIT regression in #128384 (merged 2026-06-22 23:47 UTC, commit b426fdf). The client-side assertion at DiagnosticsTests.cs:657 — req1.Links.Single(l => l.Context == conn.Context) — throws "Sequence contains no matching element" because the auto-promoted runtime-async variant of a sync Task/ValueTask-returning forwarder in the System.Net.Http send chain is miscompiled (same class of bug as the deterministic NRE Rolf reported in #129813). The server-side "Unexpected EOF" is purely downstream: once the client Single() throws, WhenAllOrAnyFailed cancels, the loopback socket closes mid-ReadRequestDataAsync, and the EOF gets bundled into the AggregateException.
Evidence this is a runtime regression, not a test race:
This test was added in #103922 (July 2024) and has been clean on rolling-main runtime CI continuously through 2026-06-22. First failure on 2026-06-23 — the first rolling-main build containing #128384:
| Build (def 129 main) | Finished (UTC) | sourceVersion | Has #128384? | ConnectionSetupActivityGraphRecorded failures |
|---|---|---|---|---|
| 1466004 | 06-16 14:12 | 98d5787 | no | 0 |
| 1469451 | 06-18 07:59 | 26243ad | no | 0 |
| 1472138 | 06-19 10:18 | 63ac729 | no | 0 |
| 1475054 | 06-22 11:28 | 59979e6 | no | 0 |
| 1476126 | 06-23 00:09 | 97702a1 | no | 0 |
| 1476799 | 06-23 10:50 | 461fd2a | yes | 12 |
| 1478126 | 06-24 00:49 | b5e2bef | yes | 12 |
| 1478862 | 06-24 10:16 | 46b72c7 | yes | 12 |
A test race that lay dormant for ~2 years and started intermittently failing on exactly the commit that introduced auto-promotion of synchronous task-returning methods to runtime-async is not a test race.
The "fix" also makes the failure mode worse when the regression still fires:
- Today:
AggregateExceptionwith the realSingle()assertion failure + server-side EOF. Ugly, but the real failure is in the message. - With this PR: client's
Single()throws →req1AssertionsDone.SetResult()is never called → server blocks forever inawait req1AssertionsDone.Task→WhenAllOrAnyFailedwaits for the server task → test hangs until xunit's per-test timeout instead of failing fast. The PR description claims this as a feature ("the server waits indefinitely, andWhenAllOrAnyFailedsurfaces only the real assertion failure"), butWhenAllOrAnyFailedpropagates the first failure immediately; a hung server task just delays surfacing the real one.
Recommendation: Close this PR. The fix belongs in src/coreclr/jit/importer.cpp (impWrapTopOfStackInAwait / impSetupAsyncCall) — Jakob is already assigned to #129813 which has a deterministic minimal repro (Stream.ReadAsync(Memory<byte>) NRE) and is the same bug. The diagnostics test should stay as-is so it keeps surfacing the regression until the JIT fix lands; suppressing the secondary EOF noise here would mask the JIT bug across other code paths #128384 has broken.
Note
This comment was authored with assistance from GitHub Copilot.
Co-authored-by: lewing <24063+lewing@users.noreply.github.com>
You're right. The diagnosis was incorrect and the fix makes things worse by turning a fast failure into a hang. I've reverted |
I'm doubtful. The other bug is related to stripping of IL bodies and the interpreter trying to execute the stripped bodies. This one looks like x86, I don't think we do anything like that there. |
SendAsync_Success_ConnectionSetupActivityGraphRecordedwas intermittently failing with anAggregateExceptioncontaining both the real assertion failure and a spurious server-side EOF — because the loopback server raced to read req2 before the client finished asserting req1 and sent it. On connection close, Http/1.1 threw"Unexpected EOF trying to read request header"and Http/2 threw"Failed to read Headers frame.".Fix: Add a
TaskCompletionSourceto synchronize client and server within the test:req1AssertionsDone.SetResult()after all req1 assertions pass, immediately before sending req2awaitsreq1AssertionsDone.Taskbefore callingReadRequestDataAsync()for req2This eliminates the race for both Http/1.1 and Http/2 variants. If a client assertion fails, the TCS is never signaled, the server waits indefinitely, and
WhenAllOrAnyFailedsurfaces only the real assertion failure.