increased a bit tolerance for pytorch/distributed/run_numerics.py #3095
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Greptile Summary
This PR loosens the absolute tolerance for
Confidence Score: 5/5
Safe to merge: it is a one-line tolerance bump in a test helper with no impact on library code.
The change is narrowly scoped to a single constant in a test utility function. The new value of 5e-5 is calibrated to the observed worst-case error (2.27e-5) with ~2.2x headroom, and the comment clearly attributes the looser bound to TF32 reduction-order noise in TP-sharded GEMMs. No library code, model logic, or other test configurations are touched. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["_get_tolerances(dtype)"] --> B{QUANTIZATION?}
B -->|fp8_cs| C["rtol=0.4, atol=0.25"]
B -->|nvfp4| D["rtol=0.125, atol=0.12"]
B -->|other quantized| E["rtol=0.125, atol=0.0625"]
B -->|None| F{dtype}
F -->|float16| G["rtol=1e-3, atol=1e-5"]
F -->|bfloat16| H["rtol=1.6e-2, atol=1e-5"]
F -->|float32 / TF32| I["rtol=1e-3, atol=5e-5 (changed from 1e-5)"]
F -->|other| J["raise ValueError"]
Reviews (1): Last reviewed commit: "increased a bit tolerance for pytorch/di..."
# TF32 has same mantissa bits as FP16. The atol is looser than for FP16
# because near-zero gradient elements can differ by a few 1e-5 between
# the TP-sharded and single-device GEMM reduction orders (observed on A100).
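The decision tree in the flowchart above can be sketched in plain Python. This is a reconstruction from the flowchart only, not the code in run_numerics.py: the function name `get_tolerances`, the `quantization` parameter, the use of string dtype names instead of `torch.dtype`, and the error message are all assumptions made to keep the sketch self-contained.

```python
from typing import Optional, Tuple


def get_tolerances(dtype: str, quantization: Optional[str] = None) -> Tuple[float, float]:
    """Return (rtol, atol) following the flowchart (hypothetical signature)."""
    # Quantized recipes use coarse tolerances regardless of dtype.
    if quantization is not None:
        if quantization == "fp8_cs":
            return 0.4, 0.25
        if quantization == "nvfp4":
            return 0.125, 0.12
        return 0.125, 0.0625  # any other quantized recipe

    # Unquantized path: tolerances depend on the dtype.
    table = {
        "float16": (1e-3, 1e-5),
        "bfloat16": (1.6e-2, 1e-5),
        "float32": (1e-3, 5e-5),  # fp32 / TF32: atol changed from 1e-5
    }
    if dtype not in table:
        raise ValueError(f"Unsupported dtype ({dtype})")
    return table[dtype]
```

Only the fp32 / TF32 entry differs from the previous behavior; every other branch returns the same pair as before.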
This is a bit disturbing. Even if TF32 has errors, shouldn't it be strictly better than FP16?
This makes me think there are other differences going on, like maybe the FP32 GEMM kernel is different between TP and non-TP, while it is consistent for FP16?
Hi @timmoon10, I did go down this rabbit hole, and this is what I think is going on. The failing test compares a sharded transformer layer (TP=4) against an unsharded one: if they match, the test passes; otherwise it fails. The test runs with several configurations, and the only one that fails is the one with ReLU activation. Next, I ran the sharded and the unsharded transformer separately with CUBLASLT_LOG_LEVEL=5 and found a small difference in the algorithm selection:
# the sharded one
[2026-06-06 15:31:42][cublasLt][321674][Trace][cublasLtMatmul]
A=0X153DBE21A800 Adesc=[type=R_32F rows=8 cols=64 ld=8]
B=0X153DBE3B6000 Bdesc=[type=R_32F rows=8 cols=1024 ld=8]
C=0X153D47880000 Cdesc=[type=R_32F rows=64 cols=1024 ld=64]
D=0X153D47880000 Ddesc=[type=R_32F rows=64 cols=1024 ld=64]
computeDesc=[computeType=COMPUTE_32F_FAST_TF32 scaleType=R_32F transa=OP_T]
algo=[algoId=0 tile=MATMUL_TILE_128x32 ctaSwizzling=1]
workSpace=0X153D4E000000 workSpaceSizeInBytes=4194304 beta=0 outOfPlace=0 stream=0X0
# the unsharded one
[2026-06-06 15:30:06][cublasLt][321377][Trace][cublasLtMatmul]
A=0X1530C6213000 Adesc=[type=R_32F rows=32 cols=64 ld=32]
B=0X1530C6358A00 Bdesc=[type=R_32F rows=32 cols=1024 ld=32]
C=0X1530C6378A00 Cdesc=[type=R_32F rows=64 cols=1024 ld=64]
D=0X1530C6378A00 Ddesc=[type=R_32F rows=64 cols=1024 ld=64]
computeDesc=[computeType=COMPUTE_32F_FAST_TF32 scaleType=R_32F transa=OP_T]
algo=[algoId=21 tile=MATMUL_TILE_64x64 stages=MATMUL_STAGES_16x6]
workSpace=0X153056000000 workSpaceSizeInBytes=4194304 beta=0 outOfPlace=0 stream=0X0
In particular, the sharded transformer selects algoId=0 while the unsharded one selects algoId=21. Looking at the flag definitions:
#define CUBLASLT_NUMERICAL_IMPL_FLAGS_FMA (0x01ull << 0) // = 0x00001
#define CUBLASLT_NUMERICAL_IMPL_FLAGS_HMMA (0x02ull << 0) // = 0x00002
#define CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_32F (0x02ull << 8) // = 0x00200
#define CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_TF32 (0x04ull << 16) // = 0x40000
#define CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_32F (0x08ull << 16) // = 0x80000
which means:
This is not exactly easy to read. However, I believe this means that algo 0 runs in full precision while algo 21 runs in half precision. For further confirmation, I set the feedforward size in the transformer layer to 256, which leads to algo 21 being selected for the sharded case as well:
[2026-06-06 14:52:43][cublasLt][317853][Trace][cublasLtMatmul]
A=0X151072213000 Adesc=[type=R_32F rows=32 cols=64 ld=32]
B=0X15107239AC00 Bdesc=[type=R_32F rows=32 cols=1024 ld=32]
C=0X151007600000 Cdesc=[type=R_32F rows=64 cols=1024 ld=64]
D=0X151007600000 Ddesc=[type=R_32F rows=64 cols=1024 ld=64]
computeDesc=[computeType=COMPUTE_32F_FAST_TF32 scaleType=R_32F transa=OP_T]
algo=[algoId=21 tile=MATMUL_TILE_64x64 stages=MATMUL_STAGES_16x6]
workSpace=0X151006000000 workSpaceSizeInBytes=4194304 beta=0 outOfPlace=0 stream=0X0
which ultimately makes the test pass. Given all of this, I think it is fair to increase the tolerance a bit to make this test pass. Alternatively, we can make it pass by increasing the feedforward size so that cuBLAS selects the same algoId in both cases. Let me know which you prefer.
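The flag definitions quoted in the reply above form a bitmask, so a small decoder makes such values easier to read. The sketch below uses only the five bit values quoted in this thread (the real header defines more), and the two example masks are hypothetical combinations chosen for illustration, not the values observed for algoId=0 or algoId=21.

```python
from typing import List

# Bit values copied from the #define lines quoted in this thread.
NUMERICAL_IMPL_FLAGS = {
    "FMA": 0x01 << 0,              # 0x00001
    "HMMA": 0x02 << 0,             # 0x00002
    "ACCUMULATOR_32F": 0x02 << 8,  # 0x00200
    "INPUT_TF32": 0x04 << 16,      # 0x40000
    "INPUT_32F": 0x08 << 16,       # 0x80000
}


def decode_numerical_impl_flags(mask: int) -> List[str]:
    """Return the names of the known flags that are set in `mask`."""
    return [name for name, bit in NUMERICAL_IMPL_FLAGS.items() if mask & bit]


# Hypothetical masks for illustration only (not taken from the logs above).
full_precision_like = 0x80201  # INPUT_32F | ACCUMULATOR_32F | FMA
tf32_like = 0x40202            # INPUT_TF32 | ACCUMULATOR_32F | HMMA

print(decode_numerical_impl_flags(full_precision_like))
print(decode_numerical_impl_flags(tf32_like))
```

A mask with INPUT_32F and FMA set would indicate full fp32 inputs on regular FMA units, while INPUT_TF32 with HMMA would indicate TF32 inputs on tensor cores, which is the kind of difference described in the reply.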
Description
tests/pytorch/distributed/test_numerics.py::test_distributed[None] (the unquantized fp32/TF32 configuration) fails on A100 in the TransformerLayer gradient check:
Parameter index 12 is layernorm_mlp.fc1_weight; the failure is a single near-zero gradient element in _test_transformer_layer_parallel(sequence_parallel=False).
Reproduction (observed on 4x A100 / sm80, TF32 enabled):
Fix
Raise the fp32 atol from 1e-5 to 5e-5 (2x headroom over the observed miss), keeping rtol = 1e-3 unchanged.
Fixes # (issue)
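A quick check shows why the old bound rejects a near-zero element and the new one accepts it. The sketch assumes the test uses the usual closeness criterion |actual - expected| <= atol + rtol * |expected| (the form used by torch.testing.assert_close and numpy.allclose). The helper `within_tolerance` and the zero reference value are hypothetical, and 2.27e-5 is the worst-case error quoted in the review summary.

```python
def within_tolerance(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Elementwise closeness check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


observed_error = 2.27e-5  # worst-case absolute error quoted in the review summary
expected = 0.0            # hypothetical near-zero reference gradient element
actual = expected + observed_error

# For a near-zero reference, the rtol term vanishes and only atol matters.
passes_old = within_tolerance(actual, expected, rtol=1e-3, atol=1e-5)
passes_new = within_tolerance(actual, expected, rtol=1e-3, atol=5e-5)
headroom = 5e-5 / observed_error

print(passes_old, passes_new, round(headroom, 2))
```

Because the relative term scales with the reference magnitude, raising rtol would not help for elements this close to zero, which is why only atol changes.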
Type of change
Changes
Raise the fp32 atol in run_numerics.py::_get_tolerances from 1e-5 to 5e-5, with a comment explaining the TP-sharded vs single-GPU reduction-order noise on near-zero gradient elements.
rtol and the fp16/bf16/quantized tolerances are unchanged.
Checklist: