
windows-amd-dxc-d3d12 fails ~50% of the time in post-commit run with 'Failed to create PSO' across many tests #1210

@alsepkow

windows-amd-dxc-d3d12 has been failing in roughly half of post-commit runs: 16 of the last 30 scheduled runs failed. When we manually re-ran the workflow 8 times against a single SHA to characterize the flakiness, each run produced a different set of failing tests, so this looks like a race rather than a regression.
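For anyone repeating the same-SHA re-run experiment, here is a minimal sketch of how the per-run failure sets could be compared. It assumes the failing test names have already been pulled out of each run's log; the test names below are placeholders, not results from the real runs.

```python
from collections import Counter


def failure_frequency(runs):
    """Count how many runs each test failed in.

    `runs` is a list of sets, one per workflow run, holding the names of
    the tests that failed in that run.
    """
    counts = Counter()
    for failed in runs:
        counts.update(failed)
    return counts


def is_stable_failure_set(runs):
    """Return True if every failing run failed exactly the same tests."""
    failing = [frozenset(r) for r in runs if r]
    return len(set(failing)) <= 1


# Placeholder data: hypothetical test names, NOT taken from the real runs.
example_runs = [
    {"test_a", "test_b"},
    {"test_b", "test_c"},
    set(),          # a run that passed
    {"test_a"},
]

print(failure_frequency(example_runs).most_common())
print("same tests fail every time:", is_stable_failure_set(example_runs))
```

If the same tests fail in every run, that suggests a deterministic regression. Failure sets that vary between runs, as seen here, are more consistent with a timing-dependent problem.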

In the post-commit pipeline, all failures look the same:

gpu-exec: error: Failed to create PSO

Exit 1, no crash signature in the captured logs. The pre-commit pipeline has separately been seeing actual crashes originating from the same CreatePSO call. We don't yet have an explanation for why the two pipelines surface differently.

Runner: RX 9070 / driver 32.0.31007.1017 (from the dxdiag artifact).

Tests run in parallel via lit, each in its own offloader.exe process. The root cause is still unclear — it's possible there's a race condition in the AMD UMD around CreateComputePipelineState.
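Once matching hardware is available, a rough local stress harness might look like the sketch below. It launches many copies of one command in parallel, each as its own process, and counts how many exit non-zero with the "Failed to create PSO" message. The command to run is an assumption left to the caller: it would be whatever command line lit uses for a single test, which is not shown in this issue. The worker and iteration counts are arbitrary.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Error text as it appears in the post-commit logs.
PSO_ERROR = "Failed to create PSO"


def run_once(cmd):
    """Run one process; return True if it failed with the PSO error."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    return proc.returncode != 0 and PSO_ERROR in output


def stress(cmd, total=32, workers=8):
    """Launch `total` copies of `cmd`, at most `workers` at a time.

    Returns the number of runs that hit the PSO error.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_once, [cmd] * total))
    return sum(results)


if __name__ == "__main__":
    # Usage: python stress.py <command> [args...]
    failures = stress(sys.argv[1:])
    print(f"{failures} PSO failures")
```

Comparing the failure count at `workers=1` against higher values would be one way to check whether parallelism is actually required to trigger the failure.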

Next steps I'd like to take:

  1. Enable WER LocalDumps on the runner so the next pre-commit crash captures a user-mode dump. This needs a remote/admin session — it can't be done from a workflow.
  2. Get an RX 9070 + matching driver into a dev box for local investigation — can't reproduce on RX 6800 / 32.0.21043.10005, and having admin + free re-run cycles would unblock most things.
  3. Longer-term: kernel-mode debugging if the user-mode dump isn't enough.
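For step 1, WER LocalDumps is configured through registry values under `HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps`, optionally scoped to a single executable by adding a subkey named after it. The sketch below only builds the `reg add` command lines so they can be reviewed before someone runs them in an elevated session on the runner. The dump folder path and dump count are assumptions, and scoping the key to `offloader.exe` assumes that is the process that crashes.

```python
LOCAL_DUMPS_KEY = (
    r"HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"
)


def local_dumps_commands(exe_name, dump_folder, dump_count=10, full_dump=True):
    """Build `reg add` commands enabling WER LocalDumps for one executable.

    The commands are returned as strings and are not executed here; they
    need to be run from an elevated prompt on the target machine.
    """
    key = LOCAL_DUMPS_KEY + "\\" + exe_name
    dump_type = 2 if full_dump else 1  # WER DumpType: 1 = mini dump, 2 = full dump
    return [
        f'reg add "{key}" /v DumpFolder /t REG_EXPAND_SZ /d "{dump_folder}" /f',
        f'reg add "{key}" /v DumpCount /t REG_DWORD /d {dump_count} /f',
        f'reg add "{key}" /v DumpType /t REG_DWORD /d {dump_type} /f',
    ]


if __name__ == "__main__":
    # C:\CrashDumps is a placeholder location, not the runner's real layout.
    for cmd in local_dumps_commands("offloader.exe", r"C:\CrashDumps"):
        print(cmd)
```

LocalDumps only writes a dump when the process actually crashes, so this would cover the pre-commit crashes but not the clean exit-1 failures seen in post-commit.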

Per-test frequency and per-run timeline in a comment below.


Labels: driver-bug (bugs that are likely or confirmed GPU driver bugs), needs-triage (issues needing further triage)
