test_comm_gemm_overlap.py::test_multi_layer_with_overlap_bf16 fails on A100 #3097

@francesco-bertolotti

Summary

On 4x A100, test_multi_layer_with_overlap_bf16[ TransformerLayer - BULK DGRAD/WGRAD - 2 layers - BF16 -False] fails its numerical check deterministically:

[rank0] NUMERICAL CHECK FAILED: layers.1.self_attention.layernorm_qkv.bias.grad not close
enough at index 771 with 0.1171875 vs 0.0703125 | rel. error = 0.6666666666666666
(tol = 0.025) | abs. error = 0.046875 (tol = 0.00125)
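The reported errors can be checked by hand from the two values in the log. A minimal sketch, assuming the relative error is taken against the second value (which is what the printed numbers are consistent with); the variable names are illustrative and not taken from the test script:

```python
# Values and tolerances copied from the rank0 log line above.
first, second = 0.1171875, 0.0703125
rtol, atol = 0.025, 0.00125

abs_err = abs(first - second)
# Assumption: the relative error is measured against the second value.
rel_err = abs_err / second

print(abs_err)  # 0.046875
print(rel_err)  # 0.6666666666666666

# How far each error is above its tolerance.
print(abs_err / atol, rel_err / rtol)
```

Under that assumption the absolute error is about 37 times its tolerance and the relative error about 27 times its tolerance, so this is not a borderline failure.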

This does appear to be a precision problem: the same configuration passes with --seq-length=256 or --num-layers=1. I also tried forcing NVTE_FUSED_ATTN=0 in addition to NVTE_FLASH_ATTN=0, but got the same error.

Environment

  • 4x A100 64GB (sm80), single node, NVLink (UB_SKIPMC=1 path, no CUDA Multicast)
  • TE @ 720ec27e (current main at time of writing), built with NVTE_CUDA_ARCHS=80
  • torch 2.12.0+cu126, cuDNN 9.10.2.21, flash-attn 2.8.3, CUDA runtime 12.6
  • driver: 535.274.02

Reproduction

UB_SKIPMC=1 NVTE_FLASH_ATTN=0 PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 \
torchrun --nproc_per_node=4 tests/pytorch/distributed/run_layer_with_overlap.py \
  --seed=42 --seq-length=1024 --batch-size=2 --num-heads=32 --head-dim=48 \
  --layer-type=TransformerLayer --num-layers=2
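To narrow down where the check starts failing, the command above could be swept over --seq-length and --num-layers. A minimal sketch, assuming the script path, flags, and environment variables exactly as given in the reproduction; `build_repro` is a hypothetical helper that is not part of the TE test suite, and the intermediate value 512 is only illustrative:

```python
import os
import shlex

# Environment variables copied from the reproduction command above.
REPRO_ENV = {
    "UB_SKIPMC": "1",
    "NVTE_FLASH_ATTN": "0",
    "PYTORCH_JIT": "0",
    "NVTE_TORCH_COMPILE": "0",
    "NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
}
SCRIPT = "tests/pytorch/distributed/run_layer_with_overlap.py"


def build_repro(seq_length=1024, num_layers=2, nproc=4):
    """Return (env, argv) for one run. Hypothetical helper, not part of TE."""
    argv = [
        "torchrun", f"--nproc_per_node={nproc}", SCRIPT,
        "--seed=42", f"--seq-length={seq_length}", "--batch-size=2",
        "--num-heads=32", "--head-dim=48",
        "--layer-type=TransformerLayer", f"--num-layers={num_layers}",
    ]
    # Overlay the reproduction variables on the current environment.
    return {**os.environ, **REPRO_ENV}, argv


if __name__ == "__main__":
    # Print the commands instead of launching them.
    for seq_length in (256, 512, 1024):
        for num_layers in (1, 2):
            _, argv = build_repro(seq_length, num_layers)
            print(shlex.join(argv))
```

Each `(env, argv)` pair can be passed to `subprocess.run(argv, env=env)` on a machine with the same 4-GPU setup.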

Suggested resolution

I would simply reduce --seq-length for this test. Let me know if you'd welcome a PR for this.

Labels: bug