windows-amd-dxc-d3d12 has been failing roughly half of its post-commit runs: 16 of the last 30 scheduled runs failed. When we manually re-ran the workflow 8 times against a single SHA to characterize the flakiness, each run produced a different set of failing tests, which points to a race rather than a regression.
In the post-commit pipeline, all failures look the same:
gpu-exec: error: Failed to create PSO
Exit 1, no crash signature in the captured logs. The pre-commit pipeline has separately been seeing actual crashes originating from the same CreatePSO call. We don't yet have an explanation for why the two pipelines surface differently.
Runner: RX 9070 / driver 32.0.31007.1017 (from the dxdiag artifact).
Tests run in parallel via lit, each in its own offloader.exe process. The root cause is still unclear; one possibility is a race condition in the AMD UMD (user-mode driver) around CreateComputePipelineState.
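One way to probe the concurrency hypothesis outside of lit is to launch many copies of a single test command at once and count how often the PSO error appears. The following is a minimal sketch, not the project's tooling: the helper names `run_once` and `stress` are hypothetical, and the command to pass in (an offloader.exe invocation for one test) is not specified here because its arguments are not known from this report.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Error text taken from the post-commit logs.
PSO_ERROR = "Failed to create PSO"


def run_once(cmd):
    """Run one process; return True if it failed with the PSO creation error."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode != 0 and PSO_ERROR in (proc.stdout + proc.stderr)


def stress(cmd, total, workers):
    """Launch `total` copies of `cmd`, at most `workers` at a time.

    Returns the number of runs that hit the PSO creation error.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_once, [cmd] * total))
```

Comparing `stress(cmd, 64, 1)` against `stress(cmd, 64, 16)` on the runner would show whether failures depend on concurrency. If they only appear with more than one worker, that would be consistent with a cross-process race, though it would not by itself locate the race in the driver.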
Next steps I'd like to take:
- Enable WER LocalDumps on the runner so the next pre-commit crash captures a user-mode dump. This needs a remote/admin session; it can't be done from a workflow.
- Get an RX 9070 + matching driver into a dev box for local investigation. I can't reproduce on RX 6800 / 32.0.21043.10005, and having admin + free re-run cycles would unblock most things.
- Longer-term: kernel-mode debugging if the user-mode dump isn't enough.
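For the LocalDumps step, the registry values involved can be sketched as follows. This assumes the crashing image is offloader.exe and uses a hypothetical dump folder; `local_dumps_commands` and `apply_commands` are illustrative helper names. The script only builds the `reg add` commands, and applying them still requires an elevated session on the runner.

```python
import subprocess

# Per-application WER LocalDumps settings live under a subkey named after the executable.
LOCAL_DUMPS = r"HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"


def local_dumps_commands(exe_name, dump_folder, dump_count=10):
    """Build `reg add` commands that enable full user-mode dumps for one executable."""
    key = LOCAL_DUMPS + "\\" + exe_name
    values = [
        ("DumpFolder", "REG_EXPAND_SZ", dump_folder),  # where dumps are written
        ("DumpCount", "REG_DWORD", str(dump_count)),   # max dumps kept in the folder
        ("DumpType", "REG_DWORD", "2"),                # 2 = full dump
    ]
    return [
        ["reg", "add", key, "/v", name, "/t", reg_type, "/d", data, "/f"]
        for name, reg_type, data in values
    ]


def apply_commands(commands):
    """Run the commands; must be called from an elevated (admin) session on Windows."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `local_dumps_commands("offloader.exe", r"C:\CrashDumps")` yields three commands, where `C:\CrashDumps` is a placeholder path. Scoping the key to one executable keeps other processes on the runner from filling the dump folder.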
Per-test frequency and per-run timeline in a comment below.