[Bugfix] Consolidate Gemma2/3 GGUF fixes for correctness on Blackwell#37220
kitaekatt wants to merge 5 commits into vllm-project:main
Conversation
This PR consolidates four related GGUF bug fixes for Gemma2 and Gemma3 models, plus a style improvement from reviewer feedback.

**1. Add quant_config to embedding layer (PR vllm-project#30424)**

Pass quant_config to VocabParallelEmbedding in Gemma2Model so that GGUFEmbeddingMethod is selected instead of UnquantizedEmbeddingMethod. Without this, quantized bytes are read as raw floats, producing gibberish.

**2. Fix EOS token extraction for HF blob paths (PR vllm-project#30434)**

GGUF files served from HuggingFace Hub use blob paths that don't match the expected filename pattern. Extract the EOS token ID directly from GGUF metadata as a reliable fallback.

**3. Skip missing parameters during GGUF weight loading (PR vllm-project#30699)**

Gemma2 GGUF files may lack certain weight keys (e.g. embed_tokens.qweight_type). Skip missing parameters gracefully instead of raising KeyError.

**4. Use RMSNorm instead of GemmaRMSNorm for GGUF (PR vllm-project#31464)**

GGUF files store RMSNorm weights with the +1 baked in (llama.cpp convention); GemmaRMSNorm adds 1 again in its forward pass, causing a double addition. Select plain RMSNorm at construction time for GGUF models. Applied to all norm layers in Gemma2 and Gemma3 (including q_norm/k_norm).

**Style: compact rms_norm_kwargs pattern (reviewer feedback)**

Use an rms_norm_kwargs dict to avoid repeating constructor arguments, per hmellor's review on PR vllm-project#31464.

Tested on RTX 5090 (Blackwell, SM 120) with gemma2-2b-gguf and gemma3-1b.

Supersedes PRs vllm-project#30424, vllm-project#30434, vllm-project#30699, vllm-project#31464.

Signed-off-by: Christina <truffle@gmail.com>
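For fix 4, the arithmetic is easy to see in isolation. The following is a toy sketch (made-up values and standalone functions, not vLLM's layer code) showing that applying a Gemma-style `1 + weight` scale to a GGUF weight that already has the +1 baked in doubles the offset:

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # Plain RMSNorm: uses the stored weight as the scale directly.
    normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return normed * weight

def gemma_rms_norm(x, weight, eps=1e-6):
    # Gemma-style RMSNorm: adds 1 to the stored weight in forward().
    normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return normed * (1.0 + weight)

x = torch.randn(4, 8)
w_hf = torch.full((8,), 0.5)   # HF-style checkpoint: effective scale = 1 + 0.5
w_gguf = w_hf + 1.0            # GGUF bakes the +1 in: stores 1.5 directly

# Matching conventions agree: both apply an effective scale of 1.5.
assert torch.allclose(gemma_rms_norm(x, w_hf), rms_norm(x, w_gguf))

# The bug: GemmaRMSNorm on a GGUF weight scales by 2.5 instead of 1.5.
assert not torch.allclose(gemma_rms_norm(x, w_gguf), rms_norm(x, w_gguf))
```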
Code Review
This pull request consolidates several bug fixes for Gemma2 and Gemma3 GGUF models, enhancing correctness and robustness. The changes correctly address issues with embedding quantization, EOS token handling, missing weight parameters, and normalization layers. The implementation is solid. I have one suggestion to improve maintainability by refactoring duplicated code.
```python
quant_name = quant_config.get_name() if quant_config else None
rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
```
The logic to determine the normalization class based on the quantization config is duplicated in Gemma2DecoderLayer and Gemma2Model within this file, and also multiple times in vllm/model_executor/models/gemma3.py. This code duplication increases maintenance overhead and the risk of introducing inconsistencies in the future if the logic needs to be updated.
To improve this, you could extract the logic into a module-level helper function. For example:
```python
from typing import Type

from torch import nn

from vllm.model_executor.layers.layernorm import GemmaRMSNorm, RMSNorm
from vllm.model_executor.layers.quantization import QuantizationConfig


def _get_norm_cls(quant_config: QuantizationConfig | None) -> Type[nn.Module]:
    quant_name = quant_config.get_name() if quant_config else None
    return RMSNorm if quant_name == "gguf" else GemmaRMSNorm
```

This helper can then be called from all the locations where this logic is needed, making the code more DRY and easier to maintain.
```diff
-quant_name = quant_config.get_name() if quant_config else None
-rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
+rms_norm_cls = _get_norm_cls(quant_config)
```
Done in c7a1a30 — extracted _get_norm_cls() helper in both gemma2.py and gemma3.py. All 5 occurrences now use it.
Extract the repeated norm class selection logic into a module-level _get_norm_cls() helper in both gemma2.py and gemma3.py, as requested in review. Signed-off-by: kitaekatt <kitaekatt@users.noreply.github.com> Signed-off-by: Christina <truffle@gmail.com>
```python
if gguf_file:
    gguf_path = Path(path_or_repo_id) / gguf_file
    gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
    if gguf_eos_id is not None:
        hf_eos_id = tokenizer.eos_token_id
        if hf_eos_id != gguf_eos_id:
            logger.info(
                "Patching tokenizer eos_token_id from %d to %d "
                "(using GGUF metadata)",
                hf_eos_id,
                gguf_eos_id,
            )
            tokenizer.eos_token_id = gguf_eos_id
```
What if we provided a repo id instead of a local path here?

BTW, I'd prefer to add an extra maybe_patch_gguf_tokenizer function in gguf_utils.py instead, because it's not very elegant to handle the GGUF file explicitly here:
```python
def maybe_patch_gguf_tokenizer(
    tokenizer,
    path_or_repo_id: str,
    **kwargs,
):
    ...


maybe_patch_gguf_tokenizer(...)
```

Addresses review feedback on vllm-project#37220 from @Isotr0py:

- Move GGUF EOS-token patching out of vllm/tokenizers/hf.py into a dedicated maybe_patch_gguf_tokenizer() helper in gguf_utils.py, mirroring the existing maybe_patch_hf_config_from_gguf naming.
- Resolve HuggingFace repo ids (not just local directory paths) via hf_hub_download when the file is not found at a local path. Behavior is unchanged for local-directory callers.

Signed-off-by: Christina <truffle@gmail.com>
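For reference, a sketch of what such a helper could look like, including the repo-id fallback via `hf_hub_download` (illustrative only; `extract_eos_token_id_from_gguf` is the helper from this PR, and the real implementation may differ):

```python
from pathlib import Path

from huggingface_hub import hf_hub_download


def maybe_patch_gguf_tokenizer(tokenizer, path_or_repo_id: str,
                               gguf_file: str, **kwargs) -> None:
    """Align tokenizer.eos_token_id with the GGUF file's metadata (sketch)."""
    local_path = Path(path_or_repo_id) / gguf_file
    if not local_path.is_file():
        # Not a local directory, so treat path_or_repo_id as an HF repo id.
        local_path = Path(hf_hub_download(repo_id=path_or_repo_id,
                                          filename=gguf_file))
    # extract_eos_token_id_from_gguf() is introduced elsewhere in this PR.
    gguf_eos_id = extract_eos_token_id_from_gguf(str(local_path))
    if gguf_eos_id is not None and tokenizer.eos_token_id != gguf_eos_id:
        tokenizer.eos_token_id = gguf_eos_id
```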
@Isotr0py thanks for the review — addressed both points in 86fa040:
Also merged the latest main.

Verification:
Ready for another look.
Summary
Consolidates four related GGUF bug fixes for Gemma2 and Gemma3 models into a single PR, as requested by @Isotr0py in #30434. Also applies reviewer feedback from @hmellor on #31464 (compact `rms_norm_kwargs` pattern).

**Fixes included:**
**1. Add quant_config to embedding layer** (#30424)

Pass `quant_config` to `VocabParallelEmbedding` in `Gemma2Model` so that `GGUFEmbeddingMethod` is selected instead of `UnquantizedEmbeddingMethod`. Without this, quantized bytes are read as raw floats → gibberish output.
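A simplified sketch of what this change amounts to (constructor arguments abbreviated; not the exact diff):

```python
# Inside Gemma2Model.__init__ (sketch).
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,  # previously omitted, so the layer fell
                                # back to UnquantizedEmbeddingMethod
)
```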
**2. Fix EOS token extraction for HF blob paths** (#30434)

GGUF files served from HuggingFace Hub use blob paths that don't match the expected filename pattern. Extract the EOS token ID directly from GGUF metadata as a reliable fallback.
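For illustration, metadata extraction along these lines can be done with the `gguf` package's `GGUFReader`; the helper name mirrors the diff in this PR, but the body here is a sketch, not vLLM's exact implementation:

```python
from gguf import GGUFReader  # gguf-py, shipped alongside llama.cpp


def extract_eos_token_id_from_gguf(gguf_path: str) -> int | None:
    """Read tokenizer.ggml.eos_token_id straight from GGUF metadata."""
    reader = GGUFReader(gguf_path)
    field = reader.fields.get("tokenizer.ggml.eos_token_id")
    if field is None:
        return None
    # Scalar fields store their payload in parts, indexed via data.
    return int(field.parts[field.data[0]][0])
```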
**3. Skip missing parameters during GGUF weight loading** (#30699)

Gemma2 GGUF files may lack certain weight keys (e.g. `embed_tokens.qweight_type`). Skip missing parameters gracefully instead of raising `KeyError`.
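The pattern is essentially a `continue` on unknown names in the weight-loading loop. A simplified sketch (the real Gemma2 loader also handles stacked/fused parameters):

```python
from vllm.model_executor.model_loader.weight_utils import default_weight_loader


def load_weights(self, weights):
    """Simplified sketch of the skip-missing-parameters guard."""
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        if name not in params_dict:
            # e.g. embed_tokens.qweight_type absent from this GGUF export
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
```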
**4. Use RMSNorm instead of GemmaRMSNorm for GGUF** (#31464)

GGUF files store RMSNorm weights with the +1 baked in (llama.cpp convention); `GemmaRMSNorm` adds 1 again in its forward pass, causing a double addition. Select plain `RMSNorm` at construction time for GGUF models. Applied to all norm layers in both Gemma2 and Gemma3 (including `q_norm`/`k_norm`), using the compact `rms_norm_kwargs` pattern per @hmellor's review.
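Taken together, construction reduces to one class selection plus a shared kwargs dict. A sketch of the pattern (names follow the review thread above; call sites abbreviated):

```python
# Inside a decoder-layer __init__ (sketch).
rms_norm_cls = _get_norm_cls(quant_config)
rms_norm_kwargs = {"eps": config.rms_norm_eps}

self.input_layernorm = rms_norm_cls(config.hidden_size, **rms_norm_kwargs)
self.post_attention_layernorm = rms_norm_cls(config.hidden_size, **rms_norm_kwargs)
```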
**Files changed:**

- `vllm/model_executor/models/gemma2.py` — fixes 1, 3, 4
- `vllm/model_executor/models/gemma3.py` — fix 4 (Gemma3 norms)
- `vllm/tokenizers/hf.py` — fix 2
- `vllm/transformers_utils/gguf_utils.py` — fix 2

**Supersedes:** #30424, #30434, #30699, #31464
Test Results
All tests performed on RTX 5090 (Blackwell, SM 120).
Consolidated PR validation (fresh benchmarks, 2026-03-16):
*gemma3-1b vLLM server loaded and served requests correctly using the new `rms_norm_cls` code path. The benchmark framework hit a response parser error on specific items (unrelated to the model code changes).

Prior individual PR validation (2026-03-11, all PRs cherry-picked):
Code path coverage:
Pre-commit:
Signed-off-by: Christina <truffle@gmail.com>