`test_comm_gemm_overlap.py::test_multi_layer_with_overlap_bf16` fails on A100 #3097



## Summary

On 4x A100, `test_multi_layer_with_overlap_bf16[ TransformerLayer - BULK DGRAD/WGRAD - 2 layers - BF16  -False]` fails its numerical check deterministically:

```
[rank0] NUMERICAL CHECK FAILED: layers.1.self_attention.layernorm_qkv.bias.grad not close
enough at index 771 with 0.1171875 vs 0.0703125 | rel. error = 0.6666666666666666
(tol = 0.025) | abs. error = 0.046875 (tol = 0.00125)
```

This appears to be a precision problem: the same configuration passes with `--seq-length=256` or `--num-layers=1`. I also tried forcing `NVTE_FUSED_ATTN=0` in addition to `NVTE_FLASH_ATTN=0`, but got the same error.
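For reference, the numbers in the log can be reproduced by hand. The sketch below is not the test script's code. It assumes the relative error is taken with respect to the second value in the log (treated here as the reference) and that the check fails only when both tolerances are exceeded.

```python
# Values copied from the failure log above.
test_val = 0.1171875
ref_val = 0.0703125
rtol, atol = 0.025, 0.00125

abs_err = abs(test_val - ref_val)
# Assumption: the relative error is measured against the second logged value.
rel_err = abs_err / abs(ref_val)
# Assumption: the check fails only if both tolerances are exceeded.
failed = rel_err > rtol and abs_err > atol

print(f"abs. error = {abs_err} ({abs_err / atol:.1f}x atol)")
print(f"rel. error = {rel_err} ({rel_err / rtol:.1f}x rtol)")
print(f"failed = {failed}")
```

Under these assumptions both errors exceed their tolerances by more than an order of magnitude, so the failure is not a borderline case.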

## Environment

- 4x A100 64GB (sm80), single node, NVLink (`UB_SKIPMC=1` path, no CUDA Multicast)
- TE @ `720ec27e` (current `main` at time of writing), built with `NVTE_CUDA_ARCHS=80`
- torch 2.12.0+cu126, cuDNN 9.10.2.21, flash-attn 2.8.3, CUDA runtime 12.6
- driver: 535.274.02

## Reproduction

```bash
UB_SKIPMC=1 NVTE_FLASH_ATTN=0 PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 \
torchrun --nproc_per_node=4 tests/pytorch/distributed/run_layer_with_overlap.py \
  --seed=42 --seq-length=1024 --batch-size=2 --num-heads=32 --head-dim=48 \
  --layer-type=TransformerLayer --num-layers=2
```
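To narrow down where the failure starts, the command above can be parameterized over `--seq-length` and `--num-layers`. The following is a minimal sketch: `build_repro` is a hypothetical helper, not part of TE, and the intermediate value 512 is only an illustrative midpoint between the passing (256) and failing (1024) settings. It prints the commands rather than launching them, since running them needs the 4-GPU node.

```python
import os

# Environment variables copied from the reproduction command above.
REPRO_ENV = {
    "UB_SKIPMC": "1",
    "NVTE_FLASH_ATTN": "0",
    "PYTORCH_JIT": "0",
    "NVTE_TORCH_COMPILE": "0",
    "NVTE_ALLOW_NONDETERMINISTIC_ALGO": "0",
}


def build_repro(seq_length=1024, num_layers=2, nproc=4):
    """Return (env, argv) for one run of the reproduction command.

    Hypothetical helper: only the two arguments varied in this report
    are exposed; every other flag is fixed to the value used above.
    """
    env = {**os.environ, **REPRO_ENV}
    argv = [
        "torchrun",
        f"--nproc_per_node={nproc}",
        "tests/pytorch/distributed/run_layer_with_overlap.py",
        "--seed=42",
        f"--seq-length={seq_length}",
        "--batch-size=2",
        "--num-heads=32",
        "--head-dim=48",
        "--layer-type=TransformerLayer",
        f"--num-layers={num_layers}",
    ]
    return env, argv


# 256 passes and 1024 fails per this report; 512 is an illustrative midpoint.
for seq_length in (256, 512, 1024):
    env, argv = build_repro(seq_length=seq_length)
    print(" ".join(argv))
    # To actually launch on the multi-GPU node:
    #   import subprocess; subprocess.run(argv, env=env, check=False)
```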

## Suggested resolution

I would simply reduce `--seq-length` for this test. Let me know if you'd welcome a PR for this.


