fix: ensure absmax_offset is of type float32 before passing to gemm_4bit #1971
kathsucurry wants to merge 2 commits
Thanks for catching this! Out of curiosity, can you please provide some more detail on your environment? Specifically, the versions of torch, accelerate, and transformers.

This seems reasonable as a defensive measure, but I don't think the root issue is in the serialized models themselves. There's a quirk in the 4bit serialization where this offset value is actually stored in a utf-8 encoded uint8 tensor as a JSON string. When deserialized, it uses the Python

bitsandbytes/bitsandbytes/functional.py Lines 521 to 523 in 9dad665

I can see how in some circumstance (e.g.

The other possibility is

One last, but less likely, possibility here is that you have an AVX512BF16 enabled CPU, and moved between CPU and GPU at some point after running a forward pass on CPU. This may have inadvertently cast the offset. Since you only observe the issue with pre-quantized models, it leans heavily toward being related to the deserialization.
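To make the serialization quirk concrete, here is a minimal sketch of a float taking a round trip through a JSON string stored as UTF-8 bytes. It is not the bitsandbytes code: `pack_state`, `unpack_state`, and the `offset` key are hypothetical names, and a plain `bytes` object stands in for the uint8 tensor. The only point it illustrates is that JSON carries no tensor dtype, so the value comes back as a plain Python float and its final dtype is decided by whatever code rebuilds the tensor.

```python
import json


def pack_state(state: dict) -> bytes:
    """Serialize a dict to JSON and store it as raw UTF-8 bytes.

    Hypothetical helper: the bytes play the role of the uint8 tensor.
    """
    return json.dumps(state).encode("utf-8")


def unpack_state(raw: bytes) -> dict:
    """Decode the UTF-8 bytes and parse the JSON back into a dict."""
    return json.loads(raw.decode("utf-8"))


# "offset" is an illustrative key, not necessarily the real field name.
raw = pack_state({"offset": 0.25})
restored = unpack_state(raw)

# The value survives, but only as a dtype-less Python float.
print(type(restored["offset"]).__name__, restored["offset"])
```

Because nothing in the payload records float32 versus bfloat16, an explicit cast at the point of use is a reasonable guard regardless of how the value was restored.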
|
Ah I should have included the versions in the PR description. Below are the versions (and I'll add them in the description as well):
Thank you so much for the comments! The detailed breakdown gives really helpful context on how the offset gets stored and deserialized. I still lack context on the serialization/deserialization path or |
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
TLDR
`gemm_4bit` kernel functions expect a float32 `absmax_offset`, but the offsets from some pre-quantized models are bfloat16. The discrepancy leads to gibberish offset values in the kernel, which further affects the outputs.

Relevant package versions:
Background
I was playing around with Unsloth_Puzzles.ipynb part B on an RTX 5070 Ti (sm120) with the same model, `unsloth/meta-Llama-3.1-8B-Instruct-bnb-4bit`, when I noticed that the training loss values changed drastically after commit 5453368 "[CUDA] New 4bit GEMM kernels for inference (#1949)". Specifically, the loss starts off much larger and quickly becomes 0 (NaN values).

Note that the training loss before the mentioned commit is similar to that in the notebook.
By trying out different models, I found that the loss values show the same issue when 1) a pre-quantized base model is used and 2) `gemm_4bit` is called (when `use_custom` is `True`). Some of the models I used are as follows:

Findings
The bug appears to be caused by the type of `absmax_offset`. The `absmax_offset` values coming from the pre-quantized models I've used so far are bfloat16, while the kernel function expects a float32 tensor. Consequently, the offset value becomes gibberish and extremely large, which causes overflow.

I've made a small change along with the corresponding test. The training loss for all the models mentioned above is now more stable. The model `unsloth/meta-Llama-3.1-8B-Instruct-bnb-4bit`, for instance, now gives the following outputs:

I've only tried Unsloth's pre-quantized models, so do let me know if there are other pre-quantized models you'd like me to try (or if there are other tests you'd like me to do).
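
For anyone curious why a bfloat16 offset turns into gibberish when read as float32, here is a small stdlib-only sketch of the bit-level effect. It is an illustration under assumptions, not the kernel's code: `float_to_bf16_bits` and `read_as_float32` are hypothetical helpers, the bfloat16 conversion truncates instead of rounding, and the two "neighbor" bytes stand in for whatever happens to sit next to the 2-byte offset in memory.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern of x (upper half of its float32 bits).

    Truncates the low mantissa bits; real conversions usually round.
    """
    f32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return f32_bits >> 16


def read_as_float32(bf16_bits: int, neighbor_bits: int) -> float:
    """Reinterpret a 2-byte bfloat16 value plus 2 adjacent bytes as one float32.

    Little-endian layout: the bfloat16 lands in the LOW half of the float32,
    and the neighboring bytes (an assumption here) become the sign/exponent.
    """
    raw = struct.pack("<HH", bf16_bits, neighbor_bits)
    return struct.unpack("<f", raw)[0]


offset_bits = float_to_bf16_bits(0.5)          # 0x3F00
tiny = read_as_float32(offset_bits, 0x0000)    # neighbor bytes are zero
huge = read_as_float32(offset_bits, 0x7F00)    # neighbor bytes set a large exponent

print(hex(offset_bits), tiny, huge)
```

The intended 0.5 never comes back: depending on the adjacent bytes, the misread value ranges from a denormal near zero to something close to the float32 maximum, which is consistent with the overflow described above.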