[AMDGPU] asyncmark/wait.asyncmark intrinsics produce incorrect code for 3-stage software-pipelined loops

The `llvm.amdgcn.asyncmark` and `llvm.amdgcn.wait.asyncmark` intrinsics (introduced in #180467) produce numerically incorrect results when used in 3-stage software-pipelined loops with `buffer_load_dwordx4 ... lds` (async DMA to LDS) on gfx950. 2-stage pipelines are unaffected.

### Reproducer

Given the `.optimized.ll` from a 4096×4096×4096 f32 GEMM with 3-stage pipelining, replacing the asyncmark intrinsics with explicit `s_waitcnt` calls produces correct results while the original IR produces ~40% wrong output elements.

The replacement:
```llvm
; REMOVED:
call void @llvm.amdgcn.asyncmark()

; REPLACED:
call void @llvm.amdgcn.wait.asyncmark(i16 2)
→ call void @llvm.amdgcn.s.waitcnt(i32 8184)  ; vmcnt(8) = 2 groups × 4 loads/group

call void @llvm.amdgcn.wait.asyncmark(i16 0)
→ call void @llvm.amdgcn.s.waitcnt(i32 8176)  ; vmcnt(0)
```
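The immediates in the replacement can be sanity-checked by decoding them. The sketch below assumes the gfx9 `s_waitcnt` field layout as used by LLVM's AMDGPU backend (vmcnt low bits in [3:0], expcnt in [6:4], lgkmcnt in [11:8], vmcnt high bits in [15:14]). `decode_waitcnt_gfx9` is a hypothetical helper written for this issue, not an LLVM API.

```python
def decode_waitcnt_gfx9(imm):
    """Split an s_waitcnt immediate into its counters (assumed gfx9 layout)."""
    vmcnt_lo = imm & 0xF          # bits [3:0]
    vmcnt_hi = (imm >> 14) & 0x3  # bits [15:14]
    return {
        "vmcnt": vmcnt_lo | (vmcnt_hi << 4),
        "expcnt": (imm >> 4) & 0x7,   # bits [6:4]
        "lgkmcnt": (imm >> 8) & 0xF,  # bits [11:8]
    }

for imm in (8184, 8176):
    print(imm, decode_waitcnt_gfx9(imm))
```

Under that assumed layout, 8184 decodes to `vmcnt(8)` and 8176 to `vmcnt(0)`, with expcnt and lgkmcnt left at their maximum values (no wait on those counters), which matches the comments in the replacement.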

Both versions were compiled with `llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 -O3`. The LLVM IR and assembly for both the broken (original) and working (fixed) cases are attached.

A self-contained reproducer is available at: https://github.com/jerryyin/scripts/tree/master/iree/reproducers/asyncmark_bug

**Minimal reproduction from assembly** (`reproduce_from_asm.sh`): assembles both checked-in `.s` files to HSACOs, substitutes each into an otherwise-identical IREE vmfb (same compilation, same dispatch, same inputs), and compares against a baseline. The only variable is which `.hsaco` is loaded. See [this comment](https://github.com/llvm/llvm-project/issues/186878#issuecomment-4100692072) for how to reproduce.

### Results

| Case | max abs diff vs baseline | Wrong elements | Verdict |
|------|--------------------------|----------------|---------|
| Original (with asyncmark) | ~50 | ~6.8M / 16.8M (40%) | **FAIL** |
| Modified (asyncmark → s_waitcnt) | 0.000000 | 0 / 16.8M | **PASS** |

Consistently reproducible across runs. The exact wrong values vary between runs (characteristic of a race condition), but the failure rate is consistently ~40%.
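For reference, the two metrics in the table can be computed as in the following sketch. The helper name `compare_outputs` and the tolerance `atol` are assumptions made for illustration; the comparison in the reproducer script may differ.

```python
def compare_outputs(actual, baseline, atol=1e-3):
    """Return (max abs diff, count of elements differing by more than atol)."""
    if len(actual) != len(baseline):
        raise ValueError("length mismatch")
    diffs = [abs(a - b) for a, b in zip(actual, baseline)]
    max_diff = max(diffs, default=0.0)
    wrong = sum(1 for d in diffs if d > atol)
    return max_diff, wrong

# Toy data, not taken from the real GEMM output.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0]
actual = [1.0, 2.0, 53.0, 4.0, 4.5]
max_diff, wrong = compare_outputs(actual, baseline)
print(f"max abs diff = {max_diff}, wrong = {wrong} / {len(baseline)}")
```

The output has 4096 × 4096 = 16,777,216 elements, so ~6.8M wrong elements corresponds to roughly 40%.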

### What We Know
- **922 instructions in both** versions
- **Identical `s_waitcnt` values**: both produce `vmcnt(8)` and `vmcnt(0)` at the same positions, so `mergeAsyncMarks()` computes the correct values
- **Identical register allocation and loop structure**
- **Identical kernel descriptors** (LDS size, VGPRs, SGPRs)
- **The only difference**: 4 ALU instructions are reordered in the prolog (`v_bfe_u32`, `v_and_b32`, `v_lshlrev_b32`, `s_mov_b32`), all of them independent address computations
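One way to check that two listings differ only by reordering is to compare them as multisets and then position by position. The sketch below uses toy instruction strings (the operands are made up, not taken from the attached assembly), and both helpers are hypothetical names introduced for illustration.

```python
from collections import Counter

def same_instructions_reordered(a, b):
    """True if a and b hold the same instructions but in a different order."""
    return Counter(a) == Counter(b) and a != b

def moved_positions(a, b):
    """Indices at which two equally long listings disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

# Toy listings; operands are illustrative only.
original = [
    "v_bfe_u32 v1, v0, 4, 2",
    "v_and_b32 v2, 15, v0",
    "s_mov_b32 s4, 0",
    "s_waitcnt vmcnt(8)",
]
modified = [
    "v_and_b32 v2, 15, v0",
    "v_bfe_u32 v1, v0, 4, 2",
    "s_mov_b32 s4, 0",
    "s_waitcnt vmcnt(8)",
]

print(same_instructions_reordered(original, modified))
print(moved_positions(original, modified))
```

Applied to the two attached `.s` files (after stripping comments and directives), a check of this kind would be expected to report only the prolog positions of the four reordered instructions.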

The `asyncmark` intrinsics carry `IntrHasSideEffects`, which makes them scheduling barriers in the machine scheduler DAG (`isGlobalMemoryObject()` returns true). Removing them changes the scheduler's freedom in the prolog, producing a slightly different instruction ordering. Despite the reordering involving only data-independent ALU instructions, one ordering consistently produces correct results and the other consistently produces ~40% wrong elements on gfx950.

The failure pattern (non-deterministic wrong values, consistent ~40% failure rate) is characteristic of a timing-dependent hardware race condition in the async DMA-to-LDS path, triggered by the specific instruction ordering in the prolog.
