[Bugfix] Consolidate Gemma2/3 GGUF fixes for correctness on Blackwell #37220

Open

kitaekatt wants to merge 5 commits into vllm-project:main from kitaekatt:fix/gemma-gguf-consolidated

Conversation

@kitaekatt (Contributor) commented Mar 16, 2026

Summary

Consolidates four related GGUF bug fixes for Gemma2 and Gemma3 models into a single PR, as requested by @Isotr0py in #30434. Also applies reviewer feedback from @hmellor on #31464 (compact rms_norm_kwargs pattern).

Fixes included:

  1. **Add quant_config to embedding layer** (#30424)
    Pass quant_config to VocabParallelEmbedding in Gemma2Model so that GGUFEmbeddingMethod is selected instead of UnquantizedEmbeddingMethod. Without this, quantized bytes are read as raw floats, producing gibberish output.

  2. **Fix EOS token extraction for HF blob paths** (#30434)
    GGUF files served from the Hugging Face Hub use blob paths that don't match the expected filename pattern. Extract the EOS token ID directly from the GGUF metadata as a reliable fallback (see the metadata-read sketch after the file list below).

  3. **Skip missing parameters during GGUF weight loading** (#30699)
    Gemma2 GGUF files may lack certain weight keys (e.g. embed_tokens.qweight_type). Skip missing parameters gracefully instead of raising KeyError.

  4. **Use RMSNorm instead of GemmaRMSNorm for GGUF** (#31464)
    GGUF files store RMSNorm weights with the +1 already baked in (llama.cpp convention). GemmaRMSNorm adds 1 again in its forward pass, so the effective scale becomes w + 2 instead of the intended w + 1. Select plain RMSNorm at construction time for GGUF models. Applied to all norm layers in both Gemma2 and Gemma3 (including q_norm/k_norm), using the compact rms_norm_kwargs pattern per @hmellor's review. A condensed sketch of fixes 1, 3, and 4 follows this list.
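An illustrative sketch of fixes 1, 3, and 4 on the model side — not the exact diff; constructor arguments are abbreviated, and variable names follow the snippets quoted later in this thread:

```python
# Fix 1: pass quant_config so that GGUFEmbeddingMethod (rather than
# UnquantizedEmbeddingMethod) is selected for GGUF checkpoints.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,
)

# Fix 4: GGUF stores norm weights as (w + 1); GemmaRMSNorm adds 1 again
# in forward, scaling by (w + 2). Plain RMSNorm uses the weight as-is.
quant_name = quant_config.get_name() if quant_config else None
rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
self.input_layernorm = rms_norm_cls(config.hidden_size, eps=config.rms_norm_eps)

# Fix 3: inside load_weights(), skip checkpoint names with no matching
# parameter (e.g. embed_tokens.qweight_type) instead of raising KeyError.
if name not in params_dict:
    continue
```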

Files changed:

  • vllm/model_executor/models/gemma2.py — fixes 1, 3, 4
  • vllm/model_executor/models/gemma3.py — fix 4 (Gemma3 norms)
  • vllm/tokenizers/hf.py — fix 2
  • vllm/transformers_utils/gguf_utils.py — fix 2
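
For fix 2, the metadata read is roughly the following — a sketch assuming the gguf Python package (the exact field-access API varies between gguf versions); extract_eos_token_id_from_gguf is the helper name used in this PR:

```python
from gguf import GGUFReader


def extract_eos_token_id_from_gguf(gguf_path: str) -> int | None:
    """Read tokenizer.ggml.eos_token_id directly from GGUF metadata."""
    reader = GGUFReader(gguf_path)
    field = reader.get_field("tokenizer.ggml.eos_token_id")
    if field is None:
        return None
    # For a scalar field, the value lives in the part indexed by data[0].
    return int(field.parts[field.data[0]][0])
```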

Supersedes: #30424, #30434, #30699, #31464

Test Results

All tests performed on RTX 5090 (Blackwell, SM 120).

Consolidated PR validation (fresh benchmarks, 2026-03-16):

| Model | Benchmark | Score | Throughput | Status |
|-------|-----------|-------|------------|--------|
| gem2-2b-gguf (Gemma2 GGUF) | HumanEval | 43.9% (72/164) | 373 tok/s | ✅ SUCCESS |
| gem2-2b-gguf (Gemma2 GGUF) | IFEval | 66.5% (360/541) | 77 tok/s | ✅ SUCCESS |
| gemma3-1b (Gemma3 GGUF) | Server + inference | 65% IFEval acc | 110 tok/s | ✅ Server OK* |

*gemma3-1b vLLM server loaded and served requests correctly using the new rms_norm_cls code path. Benchmark framework hit a response parser error on specific items (unrelated to model code changes).

Prior individual PR validation (2026-03-11, all PRs cherry-picked):

| Model | Benchmark | Score | Status |
|-------|-----------|-------|--------|
| gem2-2b-gguf (Gemma2 GGUF) | HumanEval | 42.1% (69/164) | ✅ SUCCESS |
| gem2-2b-gguf (Gemma2 GGUF) | IFEval | 65.6% (355/541) | ✅ SUCCESS |
| gemma3-1b (Gemma3) | HumanEval | 26.8% (44/164) | ✅ SUCCESS |

Code path coverage:

| Code path | Tested by |
|-----------|-----------|
| gemma2.py: RMSNorm selection (GGUF) | gem2-2b-gguf HumanEval + IFEval |
| gemma2.py: embedding quant_config | gem2-2b-gguf (all benchmarks) |
| gemma2.py: skip missing params | gem2-2b-gguf (all benchmarks) |
| gemma3.py: RMSNorm selection (GGUF) | gemma3-1b server + inference |
| gemma3.py: q_norm/k_norm in attention | gemma3-1b server + inference |
| EOS token extraction (blob paths) | gem2-2b-gguf (HF blob path model) |
| compact rms_norm_kwargs pattern | Both models (all 5 locations) |

Pre-commit:

  • ✅ ruff check: Passed
  • ✅ ruff format: Passed (applied)
  • ✅ mypy: Passed
  • ✅ typos: Passed

Signed-off-by: Christina <truffle@gmail.com>

@mergify bot added the bug (Something isn't working) label Mar 16, 2026
@gemini-code-assist bot left a comment

Code Review

This pull request consolidates several bug fixes for Gemma2 and Gemma3 GGUF models, enhancing correctness and robustness. The changes correctly address issues with embedding quantization, EOS token handling, missing weight parameters, and normalization layers. The implementation is solid. I have one suggestion to improve maintainability by refactoring duplicated code.

Comment thread: vllm/model_executor/models/gemma2.py (Outdated)
Comment on lines +219 to +220:

```python
quant_name = quant_config.get_name() if quant_config else None
rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
```
Severity: high

The logic to determine the normalization class based on the quantization config is duplicated in Gemma2DecoderLayer and Gemma2Model within this file, and also multiple times in vllm/model_executor/models/gemma3.py. This code duplication increases maintenance overhead and the risk of introducing inconsistencies in the future if the logic needs to be updated.

To improve this, you could extract the logic into a module-level helper function. For example:

```python
from torch import nn

from vllm.model_executor.layers.layernorm import GemmaRMSNorm, RMSNorm
from vllm.model_executor.layers.quantization import QuantizationConfig


def _get_norm_cls(quant_config: QuantizationConfig | None) -> type[nn.Module]:
    """Pick the norm class: plain RMSNorm for GGUF, GemmaRMSNorm otherwise."""
    quant_name = quant_config.get_name() if quant_config else None
    return RMSNorm if quant_name == "gguf" else GemmaRMSNorm
```

This helper can then be called from all the locations where this logic is needed, making the code more DRY and easier to maintain.

Suggested change:

```diff
-quant_name = quant_config.get_name() if quant_config else None
-rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
+rms_norm_cls = _get_norm_cls(quant_config)
```

Member commented:

Can you address this?

@kitaekatt (author) replied:

Done in c7a1a30 — extracted a _get_norm_cls() helper in both gemma2.py and gemma3.py. All 5 occurrences now use it.
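For reference, a minimal sketch of how the helper and the compact rms_norm_kwargs pattern combine at a call site — illustrative only; the exact constructor arguments in the PR may differ:

```python
# Shared kwargs for every norm layer in the block; _get_norm_cls() picks
# RMSNorm for GGUF checkpoints (whose weights already include the +1)
# and GemmaRMSNorm otherwise.
rms_norm_cls = _get_norm_cls(quant_config)
rms_norm_kwargs = {"eps": config.rms_norm_eps}
self.input_layernorm = rms_norm_cls(config.hidden_size, **rms_norm_kwargs)
self.post_attention_layernorm = rms_norm_cls(config.hidden_size, **rms_norm_kwargs)
```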

@kitaekatt commented:

@hmellor Would you be willing to review this consolidated PR? It includes your compact rms_norm_kwargs pattern from #31464 applied to all 5 locations across gemma2.py and gemma3.py, plus three other Gemma2-specific GGUF fixes consolidated per @Isotr0py's request.

@kitaekatt commented:

@Isotr0py This consolidates the Gemma2-specific PRs as you requested in #30434. The four PRs (#30424, #30434, #30699, #31464) are now closed in favor of this one.

Extract the repeated norm class selection logic into a module-level
_get_norm_cls() helper in both gemma2.py and gemma3.py, as requested
in review.

Signed-off-by: kitaekatt <kitaekatt@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
Comment thread: vllm/tokenizers/hf.py (Outdated)
Comment on lines +137 to +149:

```python
if gguf_file:
    gguf_path = Path(path_or_repo_id) / gguf_file
    gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
    if gguf_eos_id is not None:
        hf_eos_id = tokenizer.eos_token_id
        if hf_eos_id != gguf_eos_id:
            logger.info(
                "Patching tokenizer eos_token_id from %d to %d "
                "(using GGUF metadata)",
                hf_eos_id,
                gguf_eos_id,
            )
            tokenizer.eos_token_id = gguf_eos_id
```
@Isotr0py (Member) commented:

What if we provided a repo id instead of local path here?

BTW, I prefer to add an extra maybe_patch_gguf_tokenizer function at gguf_utils.py instead, because it's not very elegant to explicitly handle gguf file here:

```python
def maybe_patch_gguf_tokenizer(
    tokenizer,
    path_or_repo_id: str,
    **kwargs,
):
    ...

maybe_patch_gguf_tokenizer(...)
```

Addresses review feedback on vllm-project#37220 from @Isotr0py:

- Move GGUF EOS-token patching out of vllm/tokenizers/hf.py into a
  dedicated maybe_patch_gguf_tokenizer() helper in gguf_utils.py,
  mirroring the existing maybe_patch_hf_config_from_gguf naming.
- Resolve HuggingFace repo ids (not just local directory paths) via
  hf_hub_download when the file is not found at a local path.

Behavior is unchanged for local-directory callers.

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt commented:

@Isotr0py thanks for the review — addressed both points in 86fa040:

  • Extracted maybe_patch_gguf_tokenizer into gguf_utils.py, mirroring the existing maybe_patch_hf_config_from_gguf helper. vllm/tokenizers/hf.py now calls it in one line; the inline GGUF handling is gone (+62/−19 across the two files).
  • Handles Hugging Face repo ids in addition to local paths. The helper tries Path(path_or_repo_id) / gguf_file first; if that is not a file, it falls back to hf_hub_download(repo_id=path_or_repo_id, filename=gguf_file) and degrades to a debug log + no-op on any resolution failure (sketched below).
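
A minimal sketch of that resolution logic — assuming the names from the diff above (extract_eos_token_id_from_gguf, a module-level logger); the actual implementation in 86fa040 may differ in details:

```python
from pathlib import Path

from huggingface_hub import hf_hub_download


def maybe_patch_gguf_tokenizer(tokenizer, path_or_repo_id: str,
                               gguf_file: str | None = None) -> None:
    """Patch tokenizer.eos_token_id from GGUF metadata when they disagree."""
    if gguf_file is None:
        return  # non-GGUF tokenizer: nothing to do
    # Try a local directory first, then fall back to treating
    # path_or_repo_id as a Hugging Face repo id.
    gguf_path = Path(path_or_repo_id) / gguf_file
    if not gguf_path.is_file():
        try:
            gguf_path = Path(
                hf_hub_download(repo_id=path_or_repo_id, filename=gguf_file)
            )
        except Exception:
            # Degrade to a no-op on any resolution failure.
            logger.debug("Could not resolve GGUF file %s; skipping EOS patch.",
                         gguf_file)
            return
    gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
    if gguf_eos_id is not None and tokenizer.eos_token_id != gguf_eos_id:
        tokenizer.eos_token_id = gguf_eos_id
```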

Also merged latest upstream/main so the PR is current.

Verification:

  • Direct unit check against a cached bartowski/gemma-2-2b-it-GGUF blob (this GGUF reports EOS=1):
    • local-dir path + mismatched HF eos (set to 0) → patched 0 → 1 ✅
    • local-dir path + matching HF eos → unchanged ✅
    • gguf_file=None → no-op ✅
    • repo-id path + mismatched HF eos → hf_hub_download resolves from cache, patched 0 → 1 ✅
  • End-to-end: TheBloke/Mistral-7B-Instruct-v0.2-GGUF (non-gated base tokenizer) HumanEval through the refactored path on RTX 5090 — 164/164 questions generated, 265 tok/s, no errors.

Ready for another look.
