[Bugfix] Consolidate Gemma2/3 GGUF fixes for correctness on Blackwell #37220

Open

kitaekatt wants to merge 5 commits into vllm-project:main from kitaekatt:fix/gemma-gguf-consolidated

Conversation

@kitaekatt (Contributor) commented Mar 16, 2026

Summary

Consolidates four related GGUF bug fixes for Gemma2 and Gemma3 models into a single PR, as requested by @Isotr0py in #30434. Also applies reviewer feedback from @hmellor on #31464 (compact rms_norm_kwargs pattern).

Fixes included:

  1. **Add quant_config to embedding layer** (#30424)
    Pass quant_config to VocabParallelEmbedding in Gemma2Model so that GGUFEmbeddingMethod is selected instead of UnquantizedEmbeddingMethod. Without this, quantized bytes are read as raw floats, producing gibberish output.

  2. **Fix EOS token extraction for HF blob paths** (#30434)
    GGUF files served from the Hugging Face Hub use blob paths that don't match the expected filename pattern. Extract the EOS token ID directly from the GGUF metadata as a reliable fallback (see the metadata-read sketch after the file list below).

  3. **Skip missing parameters during GGUF weight loading** (#30699)
    Gemma2 GGUF files may lack certain weight keys (e.g. embed_tokens.qweight_type). Skip missing parameters gracefully instead of raising KeyError.

  4. **Use RMSNorm instead of GemmaRMSNorm for GGUF** (#31464)
    GGUF files store RMSNorm weights with the +1 already baked in (llama.cpp convention). GemmaRMSNorm adds 1 again in its forward pass, so the effective scale becomes w + 2 instead of the intended w + 1. Select plain RMSNorm at construction time for GGUF models. Applied to all norm layers in both Gemma2 and Gemma3 (including q_norm/k_norm), using the compact rms_norm_kwargs pattern per @hmellor's review. A condensed sketch of fixes 1, 3, and 4 follows this list.
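An illustrative sketch of fixes 1, 3, and 4 on the model side — not the exact diff; constructor arguments are abbreviated, and variable names follow the snippets quoted later in this thread:

```python
# Fix 1: pass quant_config so that GGUFEmbeddingMethod (rather than
# UnquantizedEmbeddingMethod) is selected for GGUF checkpoints.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,
)

# Fix 4: GGUF stores norm weights as (w + 1); GemmaRMSNorm adds 1 again
# in forward, scaling by (w + 2). Plain RMSNorm uses the weight as-is.
quant_name = quant_config.get_name() if quant_config else None
rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
self.input_layernorm = rms_norm_cls(config.hidden_size, eps=config.rms_norm_eps)

# Fix 3: inside load_weights(), skip checkpoint names with no matching
# parameter (e.g. embed_tokens.qweight_type) instead of raising KeyError.
if name not in params_dict:
    continue
```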

Files changed:

  • vllm/model_executor/models/gemma2.py — fixes 1, 3, 4
  • vllm/model_executor/models/gemma3.py — fix 4 (Gemma3 norms)
  • vllm/tokenizers/hf.py — fix 2
  • vllm/transformers_utils/gguf_utils.py — fix 2
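
For fix 2, the metadata read is roughly the following — a sketch assuming the gguf Python package (the exact field-access API varies between gguf versions); extract_eos_token_id_from_gguf is the helper name used in this PR:

```python
from gguf import GGUFReader


def extract_eos_token_id_from_gguf(gguf_path: str) -> int | None:
    """Read tokenizer.ggml.eos_token_id directly from GGUF metadata."""
    reader = GGUFReader(gguf_path)
    field = reader.get_field("tokenizer.ggml.eos_token_id")
    if field is None:
        return None
    # For a scalar field, the value lives in the part indexed by data[0].
    return int(field.parts[field.data[0]][0])
```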

Supersedes: #30424, #30434, #30699, #31464

Test Results

All tests performed on RTX 5090 (Blackwell, SM 120).

Consolidated PR validation (fresh benchmarks, 2026-03-16):

| Model | Benchmark | Score | Throughput | Status |
|-------|-----------|-------|------------|--------|
| gem2-2b-gguf (Gemma2 GGUF) | HumanEval | 43.9% (72/164) | 373 tok/s | ✅ SUCCESS |
| gem2-2b-gguf (Gemma2 GGUF) | IFEval | 66.5% (360/541) | 77 tok/s | ✅ SUCCESS |
| gemma3-1b (Gemma3 GGUF) | Server + inference | 65% IFEval acc | 110 tok/s | ✅ Server OK* |

*gemma3-1b vLLM server loaded and served requests correctly using the new rms_norm_cls code path. Benchmark framework hit a response parser error on specific items (unrelated to model code changes).

Prior individual PR validation (2026-03-11, all PRs cherry-picked):

| Model | Benchmark | Score | Status |
|-------|-----------|-------|--------|
| gem2-2b-gguf (Gemma2 GGUF) | HumanEval | 42.1% (69/164) | ✅ SUCCESS |
| gem2-2b-gguf (Gemma2 GGUF) | IFEval | 65.6% (355/541) | ✅ SUCCESS |
| gemma3-1b (Gemma3) | HumanEval | 26.8% (44/164) | ✅ SUCCESS |

Code path coverage:

| Code path | Tested by |
|-----------|-----------|
| gemma2.py: RMSNorm selection (GGUF) | gem2-2b-gguf HumanEval + IFEval |
| gemma2.py: embedding quant_config | gem2-2b-gguf (all benchmarks) |
| gemma2.py: skip missing params | gem2-2b-gguf (all benchmarks) |
| gemma3.py: RMSNorm selection (GGUF) | gemma3-1b server + inference |
| gemma3.py: q_norm/k_norm in attention | gemma3-1b server + inference |
| EOS token extraction (blob paths) | gem2-2b-gguf (HF blob path model) |
| compact rms_norm_kwargs pattern | Both models (all 5 locations) |

Pre-commit:

  • ✅ ruff check: Passed
  • ✅ ruff format: Passed (applied)
  • ✅ mypy: Passed
  • ✅ typos: Passed

Signed-off-by: Christina <truffle@gmail.com>

@mergify bot added the bug (Something isn't working) label Mar 16, 2026
@gemini-code-assist bot left a comment

Code Review

This pull request consolidates several bug fixes for Gemma2 and Gemma3 GGUF models, enhancing correctness and robustness. The changes correctly address issues with embedding quantization, EOS token handling, missing weight parameters, and normalization layers. The implementation is solid. I have one suggestion to improve maintainability by refactoring duplicated code.

Comment thread: vllm/model_executor/models/gemma2.py (Outdated)
Comment on lines +219 to +220:

```python
quant_name = quant_config.get_name() if quant_config else None
rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
```
Severity: high

The logic to determine the normalization class based on the quantization config is duplicated in Gemma2DecoderLayer and Gemma2Model within this file, and also multiple times in vllm/model_executor/models/gemma3.py. This code duplication increases maintenance overhead and the risk of introducing inconsistencies in the future if the logic needs to be updated.

To improve this, you could extract the logic into a module-level helper function. For example:

```python
from torch import nn

from vllm.model_executor.layers.layernorm import GemmaRMSNorm, RMSNorm
from vllm.model_executor.layers.quantization import QuantizationConfig


def _get_norm_cls(quant_config: QuantizationConfig | None) -> type[nn.Module]:
    """Pick the norm class: plain RMSNorm for GGUF, GemmaRMSNorm otherwise."""
    quant_name = quant_config.get_name() if quant_config else None
    return RMSNorm if quant_name == "gguf" else GemmaRMSNorm
```

This helper can then be called from all the locations where this logic is needed, making the code more DRY and easier to maintain.

Suggested change:

```diff
-quant_name = quant_config.get_name() if quant_config else None
-rms_norm_cls = RMSNorm if quant_name == "gguf" else GemmaRMSNorm
+rms_norm_cls = _get_norm_cls(quant_config)
```

Member commented:

Can you address this?

@kitaekatt (author) replied:

Done in c7a1a30 — extracted a _get_norm_cls() helper in both gemma2.py and gemma3.py. All 5 occurrences now use it.
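For reference, a minimal sketch of how the helper and the compact rms_norm_kwargs pattern combine at a call site — illustrative only; the exact constructor arguments in the PR may differ:

```python
# Shared kwargs for every norm layer in the block; _get_norm_cls() picks
# RMSNorm for GGUF checkpoints (whose weights already include the +1)
# and GemmaRMSNorm otherwise.
rms_norm_cls = _get_norm_cls(quant_config)
rms_norm_kwargs = {"eps": config.rms_norm_eps}
self.input_layernorm = rms_norm_cls(config.hidden_size, **rms_norm_kwargs)
self.post_attention_layernorm = rms_norm_cls(config.hidden_size, **rms_norm_kwargs)
```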

@kitaekatt commented:

@hmellor Would you be willing to review this consolidated PR? It includes your compact rms_norm_kwargs pattern from #31464 applied to all 5 locations across gemma2.py and gemma3.py, plus three other Gemma2-specific GGUF fixes consolidated per @Isotr0py's request.

@kitaekatt commented:

@Isotr0py This consolidates the Gemma2-specific PRs as you requested in #30434. The four PRs (#30424, #30434, #30699, #31464) are now closed in favor of this one.

Extract the repeated norm class selection logic into a module-level
_get_norm_cls() helper in both gemma2.py and gemma3.py, as requested
in review.

Signed-off-by: kitaekatt <kitaekatt@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
Comment thread: vllm/tokenizers/hf.py (Outdated)
Comment on lines +137 to +149:

```python
if gguf_file:
    gguf_path = Path(path_or_repo_id) / gguf_file
    gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
    if gguf_eos_id is not None:
        hf_eos_id = tokenizer.eos_token_id
        if hf_eos_id != gguf_eos_id:
            logger.info(
                "Patching tokenizer eos_token_id from %d to %d "
                "(using GGUF metadata)",
                hf_eos_id,
                gguf_eos_id,
            )
            tokenizer.eos_token_id = gguf_eos_id
```
@Isotr0py (Member) commented:

What if we provided a repo id instead of local path here?

BTW, I prefer to add an extra maybe_patch_gguf_tokenizer function at gguf_utils.py instead, because it's not very elegant to explicitly handle gguf file here:

```python
def maybe_patch_gguf_tokenizer(
    tokenizer,
    path_or_repo_id: str,
    **kwargs,
):
    ...

maybe_patch_gguf_tokenizer(...)
```

Addresses review feedback on vllm-project#37220 from @Isotr0py:

- Move GGUF EOS-token patching out of vllm/tokenizers/hf.py into a
  dedicated maybe_patch_gguf_tokenizer() helper in gguf_utils.py,
  mirroring the existing maybe_patch_hf_config_from_gguf naming.
- Resolve HuggingFace repo ids (not just local directory paths) via
  hf_hub_download when the file is not found at a local path.

Behavior is unchanged for local-directory callers.

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt commented:

@Isotr0py thanks for the review — addressed both points in 86fa040:

  • Extracted maybe_patch_gguf_tokenizer into gguf_utils.py, mirroring the existing maybe_patch_hf_config_from_gguf helper. vllm/tokenizers/hf.py now calls it in one line; the inline GGUF handling is gone (+62/−19 across the two files).
  • Handles Hugging Face repo ids in addition to local paths. The helper tries Path(path_or_repo_id) / gguf_file first; if that is not a file, it falls back to hf_hub_download(repo_id=path_or_repo_id, filename=gguf_file) and degrades to a debug log + no-op on any resolution failure (sketched below).
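
A minimal sketch of that resolution logic — assuming the names from the diff above (extract_eos_token_id_from_gguf, a module-level logger); the actual implementation in 86fa040 may differ in details:

```python
from pathlib import Path

from huggingface_hub import hf_hub_download


def maybe_patch_gguf_tokenizer(tokenizer, path_or_repo_id: str,
                               gguf_file: str | None = None) -> None:
    """Patch tokenizer.eos_token_id from GGUF metadata when they disagree."""
    if gguf_file is None:
        return  # non-GGUF tokenizer: nothing to do
    # Try a local directory first, then fall back to treating
    # path_or_repo_id as a Hugging Face repo id.
    gguf_path = Path(path_or_repo_id) / gguf_file
    if not gguf_path.is_file():
        try:
            gguf_path = Path(
                hf_hub_download(repo_id=path_or_repo_id, filename=gguf_file)
            )
        except Exception:
            # Degrade to a no-op on any resolution failure.
            logger.debug("Could not resolve GGUF file %s; skipping EOS patch.",
                         gguf_file)
            return
    gguf_eos_id = extract_eos_token_id_from_gguf(str(gguf_path))
    if gguf_eos_id is not None and tokenizer.eos_token_id != gguf_eos_id:
        tokenizer.eos_token_id = gguf_eos_id
```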

Also merged latest upstream/main so the PR is current.

Verification:

  • Direct unit check against a cached bartowski/gemma-2-2b-it-GGUF blob (this GGUF reports EOS=1):
    • local-dir path + mismatched HF eos (set to 0) → patched 0 → 1 ✅
    • local-dir path + matching HF eos → unchanged ✅
    • gguf_file=None → no-op ✅
    • repo-id path + mismatched HF eos → hf_hub_download resolves from cache, patched 0 → 1 ✅
  • End-to-end: TheBloke/Mistral-7B-Instruct-v0.2-GGUF (non-gated base tokenizer) HumanEval through the refactored path on RTX 5090 — 164/164 questions generated, 265 tok/s, no errors.

Ready for another look.
