[AMDGPU] asyncmark/wait.asyncmark intrinsics produce incorrect code for 3-stage software-pipelined loops #186878

@jerryyin

Description

The llvm.amdgcn.asyncmark and llvm.amdgcn.wait.asyncmark intrinsics (introduced in #180467) produce numerically incorrect results when used in 3-stage software-pipelined loops with buffer_load_dwordx4 ... lds (async DMA to LDS) on gfx950. 2-stage pipelines are unaffected.

Reproducer

Given the .optimized.ll from a 4096×4096×4096 f32 GEMM with 3-stage pipelining, replacing the asyncmark intrinsics with explicit s_waitcnt calls produces correct results, whereas the original IR produces wrong values in ~40% of the output elements.

The replacement:

; REMOVED:
call void @llvm.amdgcn.asyncmark()

; REPLACED:
call void @llvm.amdgcn.wait.asyncmark(i16 2)
→ call void @llvm.amdgcn.s.waitcnt(i32 8184)  ; vmcnt(8) = 2 groups × 4 loads/group

call void @llvm.amdgcn.wait.asyncmark(i16 0)
→ call void @llvm.amdgcn.s.waitcnt(i32 8176)  ; vmcnt(0)
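
As a sanity check on the two immediates above, here is a minimal Python sketch that splits an s_waitcnt immediate into its counters. It assumes the gfx9-family field layout as I understand it (vmcnt in bits [3:0] plus [15:14], expcnt in bits [6:4], lgkmcnt in bits [11:8]); the helper name `decode_waitcnt_gfx9` is hypothetical and is not part of LLVM or the reproducer. Bits outside those fields are ignored by this sketch.

```python
def decode_waitcnt_gfx9(imm):
    """Split an s_waitcnt immediate into counters (assumed gfx9 field layout)."""
    # vmcnt is split: low 4 bits at [3:0], high 2 bits at [15:14].
    vmcnt = (imm & 0xF) | (((imm >> 14) & 0x3) << 4)
    # expcnt occupies bits [6:4]; 7 means "do not wait on this counter".
    expcnt = (imm >> 4) & 0x7
    # lgkmcnt occupies bits [11:8]; 15 means "do not wait on this counter".
    lgkmcnt = (imm >> 8) & 0xF
    return {"vmcnt": vmcnt, "expcnt": expcnt, "lgkmcnt": lgkmcnt}


if __name__ == "__main__":
    for imm in (8184, 8176):
        print(imm, decode_waitcnt_gfx9(imm))
```

Under that assumed layout, 8184 decodes to vmcnt(8) and 8176 to vmcnt(0), with expcnt and lgkmcnt left at their maximum (no-wait) values, which matches the comments in the replacement.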

Both versions compiled with llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 -O3. The LLVM IR and assembly for both the broken (original) and working (fixed) cases are attached.

A self-contained reproducer is available at: https://github.com/jerryyin/scripts/tree/master/iree/reproducers/asyncmark_bug

Minimal reproduction from assembly (reproduce_from_asm.sh): the script assembles both checked-in .s files to HSACOs, substitutes each into an otherwise-identical IREE vmfb (same compilation, same dispatch, same inputs), and compares the output against a baseline. The only variable is which .hsaco is loaded. This is the recommended way to reproduce the issue.

Results

| Case | Max abs diff vs baseline | Wrong elements | Verdict |
|---|---|---|---|
| Original (with asyncmark) | ~50 | ~6.8M / 16.8M (40%) | FAIL |
| Modified (asyncmark → s_waitcnt) | 0.000000 | 0 / 16.8M | PASS |

The failure reproduces consistently across runs. The exact wrong values vary between runs (characteristic of a race condition), but the fraction of wrong elements is consistently ~40%.
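
For reference, the two metrics in the results table (max abs diff and wrong-element count) can be computed along these lines. This is a minimal sketch, not the reproducer's actual comparison code: the function name `compare_outputs` and the tolerance `atol` are assumptions made for illustration.

```python
def compare_outputs(baseline, candidate, atol=1e-3):
    """Return (max_abs_diff, wrong_count, wrong_fraction) for two flat sequences.

    `atol` is an assumed tolerance; an element counts as wrong when its
    absolute difference from the baseline exceeds it.
    """
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must have the same length")
    if not baseline:
        return 0.0, 0, 0.0
    max_abs_diff = 0.0
    wrong = 0
    for b, c in zip(baseline, candidate):
        diff = abs(b - c)
        max_abs_diff = max(max_abs_diff, diff)
        if diff > atol:
            wrong += 1
    return max_abs_diff, wrong, wrong / len(baseline)


if __name__ == "__main__":
    # Toy data: one of four elements is off by 50.
    print(compare_outputs([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 53.0, 4.0]))
```

For the full 4096×4096 output one would flatten both result buffers before passing them in; a vectorized version would be faster but the logic is the same.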

What We Know

  • Identical instruction count: 922 instructions in both the original and modified assembly
  • Identical s_waitcnt values: both produce vmcnt(8) and vmcnt(0) at the same positions — mergeAsyncMarks() computes the correct values
  • Identical register allocation and loop structure
  • Identical kernel descriptors (LDS size, VGPRs, SGPRs)
  • The only difference: 4 ALU instructions reordered in the prolog (v_bfe_u32, v_and_b32, v_lshlrev_b32, s_mov_b32 — all independent address computations)
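
The comparison summarized in the list above can be approximated with a short script. The following is a rough sketch that assumes AMDGPU assembly text in which `;` starts a comment, directives start with `.`, and labels end with `:`; the helper names are hypothetical. It compares only opcodes, so it confirms that both listings contain the same instructions and reports the positions where the order differs, but it does not check operands.

```python
from collections import Counter


def opcodes(asm_text):
    """Extract the opcode of each instruction line, skipping comments, directives, labels."""
    ops = []
    for line in asm_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith(".") or line.endswith(":"):
            continue
        ops.append(line.split()[0])
    return ops


def summarize(asm_a, asm_b):
    """Return (count_a, count_b, same_opcode_multiset, positions_that_differ)."""
    ops_a, ops_b = opcodes(asm_a), opcodes(asm_b)
    same_multiset = Counter(ops_a) == Counter(ops_b)
    moved = [i for i, (x, y) in enumerate(zip(ops_a, ops_b)) if x != y]
    return len(ops_a), len(ops_b), same_multiset, moved


if __name__ == "__main__":
    a = "v_and_b32 v0, 1, v1\ns_mov_b32 s0, 0\ns_waitcnt vmcnt(8)\n"
    b = "s_mov_b32 s0, 0\nv_and_b32 v0, 1, v1\ns_waitcnt vmcnt(8)\n"
    print(summarize(a, b))
```

Equal counts, an identical opcode multiset, and a short list of differing positions confined to the prolog would be consistent with the "reordered, otherwise identical" observation.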

The asyncmark intrinsics carry IntrHasSideEffects, which makes them scheduling barriers in the machine scheduler DAG (isGlobalMemoryObject() returns true). Removing them changes the scheduler's freedom in the prolog, producing a slightly different instruction ordering. Despite the reordering involving only data-independent ALU instructions, one ordering consistently produces correct results and the other consistently produces ~40% wrong elements on gfx950.

The failure pattern (non-deterministic wrong values, with a consistent ~40% of elements wrong) is characteristic of a timing-dependent hardware race condition in the async DMA-to-LDS path, triggered by the specific instruction ordering in the prolog.
