| Property | Value |
|---|---|
| Provider | |
| GGUF architecture key | gemma3 |
| Source class | Gemma3Model |
| Vision encoder | Gemma3VisionEncoder |
| Image processor | Gemma3ImageProcessor |
| Example models | gemma-3-4b, gemma-3-12b, gemma-3-27b |
| Modalities | Text, image |
| Thinking mode | No |
| Tool calling | No |
| Output parser | PassthroughOutputParser |
Gemma 3 is Google's third-generation open-weights LLM family, distilled from larger Gemini-class models. The defining architectural choices are:
- Hybrid attention pattern — every sixth layer is a full causal attention layer; the other five are sliding-window attention (SWA) layers. SWA bounds the per-layer attention cost on long contexts.
- Per-head QK normalization — both Q and K are RMSNorm-ed per head before the attention dot-product, which stabilizes attention scores in the wide head dimension Gemma 3 uses (256 by default).
- GeGLU FFN with a heavy intermediate-to-hidden ratio — uses a fused
ffn_gate_upweight and the GELU non-linearity, giving slightly stronger signal preservation than SwiGLU at the cost of a small additional FLOP. - Embedding scaling — token embeddings are multiplied by
sqrt(hidden_size)before the first layer, which keeps the residual stream norm independent of hidden size. - Optional logit softcap —
tanh(logits / cap) * capclips extreme logits.
Gemma 3 is a vision-text model: a separate Gemma3VisionEncoder (loaded from a
multimodal projector GGUF, typically mmproj-gemma3-4b-f16.gguf) produces
fixed-length image token embeddings that are spliced into the residual stream
at <start_of_image> token positions.
┌──────────────────────────┐
│ token_embd.weight │
tokens (int[]) ──────►│ (× sqrt(hidden_size)) │
└────────────┬─────────────┘
│
[optional] InjectVisionEmbeddings
│
▼
┌────────────── × NumLayers ──────────────┐
│ RMSNorm (attn_norm) │
│ Q, K, V projections │
│ per-head RMSNorm (attn_q_norm/k_norm) │
│ RoPE (NeoX, local OR global base) │
│ scale Q │
│ Attention (SWA window OR full causal) │
│ Output projection │
│ RMSNorm (post_attention_norm) + residual│
│ RMSNorm (ffn_norm) │
│ GeGLU: GELU(gate) * up → down │
│ RMSNorm (post_ffw_norm) + residual │
└─────────────────────────────────────────┘
│
RMSNorm (output_norm)
│
LM head (output.weight or tied)
│
[optional] tanh-softcap
│
▼
logits
IsGlobalLayer(layer) is true when (layer + 1) % 6 == 0. Global layers use
full causal attention; the other five-of-six layers attend only to the trailing
sliding_window tokens.
Token-by-token (decode) and per-chunk (prefill) flow through the same
TransformerBlock. The single token position is the only difference: prefill
broadcasts [seqLen, hidden] through the GEMMs and applies a triangular causal
mask, while decode runs a [1, hidden] row through the cached attention
window.
For each layer L:
hidden ─► RMSNorm(attn_norm.weight, eps)
─► attn_q.weight ─► Q (per-head reshape)
─► attn_k.weight ─► K (per-head reshape)
─► attn_v.weight ─► V (per-head reshape)
─► RMSNorm(attn_q_norm.weight) on Q (per head)
─► RMSNorm(attn_k_norm.weight) on K (per head)
─► RoPE_NeoX(Q, K, freqs[layer]) // local OR global base
─► Q ← Q * (1/sqrt(headDim))
─► append K, V to KV cache (positions startPos..startPos+seqLen-1)
─► attention(Q, KCache, VCache, window=W if SWA else totalSeq)
─► attn_output.weight matmul → o
─► RMSNorm(post_attention_norm.weight, eps) on o
─► hidden = hidden + o
─► h2 = RMSNorm(ffn_norm.weight, eps) on hidden
─► ffn_gate_up.weight matmul → [gate ‖ up] (fused projection)
─► g = GELU(gate)
─► h3 = ffn_down.weight × (g * up)
─► RMSNorm(post_ffw_norm.weight, eps) on h3
─► hidden = hidden + h3
After all layers:
hidden ─► narrow(seq_len-1) if prefill // only the last row matters for the LM head
─► RMSNorm(output_norm.weight, eps)
─► matmul against output.weight (or token_embd.weight when tied)
─► [optional] tanh(logits/cap) * cap // _finalLogitSoftcap > 0
─► copy to float[VocabSize] for the sampler
- GQA with separate
key_lengthandvalue_length(default 256/256). - Pattern:
IsGlobalLayer(layer)returns true when(layer + 1) % 6 == 0. Global layers are full causal; the other five layers in every six attend only to positions in[totalSeqLen − slidingWindow, totalSeqLen). - QK normalization: per-head RMSNorm with weights
attn_q_norm.weight/attn_k_norm.weight. - RoPE: NeoX-style. Local layers use
_ropeFreqsLocal[]derived fromgemma3.rope.local.freq_base(default 10000). Global layers use_ropeFreqsGlobal[]fromgemma3.rope.freq_basedivided by_ropeScale. The 27B variant (NumLayers == 34) hardcodes_ropeScale = 8.0. - Decode SWA bounding:
AttentionDecodeWithWindow()reads only positions[max(0, totalSeqLen − slidingWindow), totalSeqLen)from the cache.
ffn_gate_up.weight packs [gate ‖ up] along the row dimension so a single
matmul produces both halves. Ops.GELUMul then computes GELU(gate) * up
(the activation is the Gaussian Error Linear Unit, not the cheaper SiLU used by
Qwen / Mistral). The ffn_down.weight matmul follows.
Four RMSNorms per block — attn_norm, post_attention_norm, ffn_norm,
post_ffw_norm — plus the per-head attn_q_norm / attn_k_norm. The final
output_norm is shared across layers. Epsilon comes from general GGUF
metadata (Config.Eps, typically 1e-6).
The token embedding tensor token_embd.weight is also reused as the LM head
when output.weight is absent (_hasTiedOutput). Embeddings are scaled by
sqrt(Config.HiddenSize) in ScaleEmbedding().
If gemma3.final_logit_softcapping > 0, the LM head output is clipped via
logits = tanh(logits / cap) * cap to keep extreme logits from dominating the
softmax. Skipped otherwise.
Gemma3VisionEncoder is a CLIP-style ViT that produces a fixed
TokensPerImage (256 by default) embeddings of size hidden. The image
processor:
- Composites RGBA over white.
- Resizes the image to
image_size × image_size(typically 896×896). - Normalizes with the encoder's mean/std.
- Converts to NCHW float tensor.
Embeddings are injected into the residual stream by
InjectVisionEmbeddings(), which copies the [256, hidden] block over the
positions occupied by <start_of_image> placeholders. The input prompt is
expanded ahead of time by ChatTemplate.ExpandGemma3ImageTokens() so that
each image becomes one start sentinel + 256 padding tokens + one end sentinel
inside the tokenized prompt.
| Key | Type | Default | Meaning |
|---|---|---|---|
gemma3.attention.sliding_window |
uint32 | 1024 | SWA window length for non-global layers |
gemma3.attention.key_length |
uint32 | 256 | Per-head key dim |
gemma3.attention.value_length |
uint32 | 256 | Per-head value dim |
gemma3.rope.local.freq_base |
float32 | 10000 | RoPE base for SWA layers |
gemma3.rope.freq_base |
float32 | from general |
RoPE base for global layers |
gemma3.final_logit_softcapping |
float32 | 0 (disabled) | tanh-softcap level for the LM head |
Plus the standard general.* keys (hidden_length, block_count,
attention.head_count, attention.head_count_kv, embedding_length, etc.)
that ParseBaseConfig reads.
token_embd.weight # [vocab, hidden]
blk.{L}.attn_norm.weight # pre-attention RMSNorm
blk.{L}.attn_q.weight # Q projection
blk.{L}.attn_k.weight # K projection
blk.{L}.attn_v.weight # V projection
blk.{L}.attn_q_norm.weight # per-head Q RMSNorm
blk.{L}.attn_k_norm.weight # per-head K RMSNorm
blk.{L}.attn_output.weight # output projection
blk.{L}.post_attention_norm.weight # post-attention RMSNorm
blk.{L}.ffn_norm.weight # pre-FFN RMSNorm
blk.{L}.ffn_gate.weight } # before fusion: separate gate/up
blk.{L}.ffn_up.weight }
blk.{L}.ffn_gate_up.weight # after fusion: [2*intermediate, hidden]
blk.{L}.ffn_down.weight # FFN down projection
blk.{L}.post_ffw_norm.weight # post-FFN RMSNorm
output_norm.weight # final RMSNorm
output.weight # LM head (optional if tied)
Gemma3Model calls FuseGateUpWeights() at load time to concatenate
ffn_gate.weight and ffn_up.weight into a single ffn_gate_up.weight. When
the GGUF already ships fused weights, the call is a no-op.
The constructor (Gemma3Model(string ggufPath, BackendType backend)) does
exactly the steps a generic ModelBase workflow expects:
ParseBaseConfig()— readsgeneral.*keys and fillsConfig.- Reads the Gemma 3-specific metadata, decides the per-head dims, the SWA
window, both RoPE bases, the optional softcap, and detects the 27B variant
to override
_ropeScale. ParseTokenizer()— builds the SentencePiece tokenizer from GGUF metadata.LoadWeights()— streams F32 norms / embeddings into managed tensors and leaves quantized matmul weights in the_quantWeightsdictionary._hasTiedOutput = !_weights.ContainsKey("output.weight") && !_quantWeights.ContainsKey("output.weight").FuseGateUpWeights()— concatenates gate / up if the GGUF stored them separately.PrepareCudaQuantizedWeightsForInference()— uploads and reorders quant blobs for the direct CUDA backend (no-op on other backends).PrecomputeRoPE()— fills_ropeFreqsLocal/_ropeFreqsGlobal.InitKVCache(maxSeqLen)— allocates per-layer K and V tensors of shape[NumKVHeads, maxSeqLen, headDim]in the configured KV-cache dtype (f32,f16, orq8_0).
Forward(int[] tokens) then runs the transformer block-by-block. There is no
fused single-graph kernel for Gemma 3 in this branch — each op dispatches
independently — which is one of the optimization opportunities listed below.
InjectVisionEmbeddings() is called before the loop when
_pendingVisionEmbeddingsList is non-empty. The list is populated by the
multimodal injector (TensorSharp.Models/ModelMultimodalInjector.cs) which
runs the image processor + vision encoder once per unique image path and caches
the resulting [TokensPerImage, hidden] embedding tensor.
Prefill walks seqLen > 1 tokens through the same TransformerBlock and falls
back to the standard managed code path. The two notable details are:
- SWA mask cache —
_cachedSWAMaskWidthsstores per-row sliding-window mask widths for the current(queryLen, startPos). All five SWA layers in any block-of-six share the same mask, so the mask is built once and reused. When chunked KV-cache reuse changesstartPos, the cache invalidates. - Single-pass last-row narrow — the LM head only ever runs on the last
row, so prefill keeps the full
[seqLen, hidden]tensor only until the last layer, then narrows to[1, hidden]beforeoutput_normand the LM head.
Forward([token]) dispatches each op individually because Gemma 3 does not
yet have a fused single-graph kernel (unlike Gemma 4). The decode hot path
relies on:
- Pre-resolved per-layer weight name strings (no string interpolation in the hot loop after init).
- The KV-cache append being a
Narrow + Copyrather than a full slab rewrite. AttentionDecodeWithWindow()which restricts theK^T Qand softmax window for SWA layers to the trailingslidingWindowpositions instead of the full cache.
For the GGML CUDA / Metal backends, every matmul, RMSNorm, RoPE, softmax, and
attention op runs through the registered native kernel; for the direct CUDA
backend it goes through cuBLAS GEMM and PTX kernels; for cpu it goes through
the SIMD-optimized managed kernels in TensorSharp.Core.
- Per-layer K and V tensors are sized for the full
maxSeqLenand every layer (both SWA and global) gets the same capacity. SWA bounding is a per-decode windowing operation, not a smaller allocation. (This is one of the listed optimization opportunities below.) ResetKVCache()zeroes every cache tensor and callsInvalidateTensorDeviceCache()so the GGML / CUDA copies see the reset.TruncateKVCache(int tokenCount)keeps the firsttokenCountpositions (used by the multi-turn KV reuse path in the server).
Quantized weights load through LoadWeights() and stay quantized in the
_quantWeights dictionary; matmul calls run through the backend's native
quantized matmul without dequantizing to FP32.
Gemma3OutputParseris the same asPassthroughOutputParser— Gemma 3 has no thinking / tool-call wire format.- Chat template falls back to the hardcoded Gemma chat template when the GGUF does not ship a Jinja2 template.
- Fused QKV — Q, K, V are still three separate projections. Concatenating them into a single matmul (the way Gemma 4 / Qwen 3 do) would halve the attention dispatch count.
- Fused single-graph decode — adopting a
Gemma3ModelDecodekernel analogous toGemma4ModelDecodewould eliminate per-op CPU/GPU round-trips on Metal / CUDA. - Circular SWA cache — SWA layers could allocate
slidingWindowslots instead ofmaxSeqLen, mirroring Gemma 4'sCopyToCacheCircular()path. - Chunked prefill — long prompts currently materialize the full
[seqLen × seqLen]mask for global layers. Chunking (as in Gemma 4) would bound the per-step memory usage.