Problem
When using Gemma 4 models via create_chat_completion() with tools, the Python API returns tool calls as raw native tokens inside message.content instead of the expected message.tool_calls list. This completely breaks OpenAI-compatible tool calling for any application using Gemma 4.
Environment
| Item | Value |
| --- | --- |
| llama-cpp-python | 0.3.23 |
| Model | gemma-4-E4B-it-Q4_K_M.gguf (Unsloth) |
| Platform | macOS (Apple Silicon) |
| Python | 3.12 |
Reproduction
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-E4B-it-Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1, verbose=False)

tools = [
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["file_path", "content"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                },
                "required": ["command"],
            },
        },
    },
]

r = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write 'print(\"hello\")' to hello.py"}],
    tools=tools,
    tool_choice="auto",
)
msg = r["choices"][0]["message"]
print("tool_calls:", msg.get("tool_calls"))  # None ← BUG
print("content: ", msg.get("content"))  # raw native tokens ← BUG
Actual Output
tool_calls: None
content: '<|tool_call>call:write_file{content:<|"|>print("hello")<|"|>,file_path:<|"|>hello.py<|"|>}<tool_call|>'
tool_calls: None
content: '<|tool_call>call:run_command{command:<|"|>ls -la<|"|>}<tool_call|>'
Expected Output
tool_calls: [
    {
        "id": "...",
        "type": "function",
        "function": {
            "name": "write_file",
            "arguments": "{\"file_path\": \"hello.py\", \"content\": \"print(\\\"hello\\\")\"}"
        }
    }
]
content: None
Root Cause
llama_chat_format.py registers a "gemma" format handler that covers Gemma 1/2/3, but there is no "gemma4" handler:
# llama_chat_format.py: only this exists, nothing for gemma4
@register_chat_format("gemma")
def format_gemma(messages, **kwargs):
    ...  # Basic turn formatting, no tool call parsing
Verified with:
import inspect, re
from llama_cpp import llama_chat_format
src = inspect.getsource(llama_chat_format)
formats = re.findall(r'register_chat_format\(["\'](.+?)["\']\)', src)
print([f for f in formats if "gemma" in f.lower()])
# Output: ['gemma'] ← no gemma4
Without a dedicated handler, the library falls back to the Jinja2 template embedded in the GGUF file for chat formatting. That template correctly prompts the model to emit its native tool call tokens, but nothing in the Python API then parses those tokens back into tool_calls.
The C++ server (llama-server) solves this with a PEG grammar parser added in ggml-org/llama.cpp PR #21326. That fix has never been ported to the Python API.
Gemma 4 Native Token Format
Gemma 4 encodes tool calls using these native tokens:
<|tool_call>call:FUNCTION_NAME{ARG_PAIRS}<tool_call|>
Where ARG_PAIRS encodes values by type:
| Type | Encoding | Example |
| --- | --- | --- |
| string | `key:<\|"\|>value<\|"\|>` | `file_path:<\|"\|>hello.py<\|"\|>` |
| integer | `key:30` | `timeout:30` |
| float | `key:3.5` | `temperature:3.5` |
| boolean | `key:true` / `key:false` | `background:false` |
| list[str] | `key:[<\|"\|>a<\|"\|>,<\|"\|>b<\|"\|>]` | `files:[<\|"\|>main.py<\|"\|>]` |
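The encoding rules in the table can be sketched as a small hand-written parser. This is only an illustration covering the five types listed above, not the PEG grammar used by llama-server; the names parse_value and parse_arg_pairs are hypothetical, and nested objects are assumed not to occur.

```python
from typing import Any

QUOTE = '<|"|>'  # Gemma 4 string delimiter, as shown in the table above


def parse_value(text: str, pos: int) -> tuple[Any, int]:
    """Parse one value starting at text[pos]; return (value, next position)."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if text.startswith(QUOTE, pos):
        # String: everything up to the next closing delimiter.
        start = pos + len(QUOTE)
        end = text.index(QUOTE, start)
        return text[start:end], end + len(QUOTE)
    if text.startswith("[", pos):
        # List: comma-separated values up to the closing bracket.
        items = []
        pos += 1
        while not text.startswith("]", pos):
            if pos >= len(text):
                raise ValueError("unterminated list")
            item, pos = parse_value(text, pos)
            items.append(item)
            if text.startswith(",", pos):
                pos += 1
        return items, pos + 1
    # Bare scalar: runs until the next comma or closing bracket/brace.
    end = pos
    while end < len(text) and text[end] not in ",]}":
        end += 1
    token = text[pos:end].strip()
    if token in ("true", "false"):
        return token == "true", end
    for cast in (int, float):
        try:
            return cast(token), end
        except ValueError:
            continue
    return token, end  # unknown scalar: keep the raw text


def parse_arg_pairs(body: str) -> dict[str, Any]:
    """Parse the ARG_PAIRS text found between the braces of a tool call."""
    args: dict[str, Any] = {}
    pos = 0
    while pos < len(body):
        colon = body.index(":", pos)  # raises ValueError on malformed input
        key = body[pos:colon].strip()
        args[key], pos = parse_value(body, colon + 1)
        if body.startswith(",", pos):
            pos += 1
    return args


print(parse_arg_pairs('timeout:30,background:false,files:[<|"|>main.py<|"|>]'))
```

Strings are scanned up to the next delimiter rather than split on commas, so values containing commas, braces, or ordinary double quotes survive intact.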
Gemma 4 also outputs an internal thinking channel before tool calls (when thinking is enabled):
<|channel>thought
[internal reasoning here]
<channel|><|tool_call>call:FUNCTION_NAME{...}<tool_call|>
A complete handler needs to strip the thought block before parsing the tool call tokens.
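A minimal sketch of that stripping step, assuming the thought block is always delimited exactly as in the example above; the helper name strip_thought_block is hypothetical.

```python
import re

# Assumed delimiters, taken from the example above:
# "<|channel>thought" opens the block and "<channel|>" closes it.
_THOUGHT_BLOCK = re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL)


def strip_thought_block(content: str) -> str:
    """Remove thinking-channel blocks and any surrounding whitespace."""
    return _THOUGHT_BLOCK.sub("", content).strip()


raw = "<|channel>thought\nI should call the tool.\n<channel|><|tool_call>call:run_command{}<tool_call|>"
print(strip_thought_block(raw))
```

The non-greedy match with re.DOTALL keeps the pattern from swallowing the tool call when the reasoning spans several lines.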
Suggested Fix
Add a @register_chat_completion_handler("gemma4") handler in llama_chat_format.py that:
- Uses the Jinja2 template from the GGUF metadata for message formatting (already works)
- After generation, checks if message.content contains <|tool_call> tokens
- Strips any <|channel>thought...<channel|> block
- Parses the native token format into the standard OpenAI tool_calls structure
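The post-generation steps above could look roughly like the following. This is a sketch under stated assumptions, not the library's implementation: it does not register anything with llama-cpp-python, the names to_openai_message and parse_string_args are hypothetical, the id scheme is made up, and the default argument parser handles string values only (a full parser for the other types would be passed in via parse_args).

```python
import json
import re
import uuid
from typing import Any, Callable

_THOUGHT = re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL)
_CALL = re.compile(r"<\|tool_call>call:([^{]+)\{(.*?)\}<tool_call\|>", re.DOTALL)
_STRING_ARG = re.compile(r'(\w+):<\|"\|>(.*?)<\|"\|>', re.DOTALL)


def parse_string_args(body: str) -> dict[str, Any]:
    """Stand-in argument parser: extracts string-typed arguments only."""
    return dict(_STRING_ARG.findall(body))


def to_openai_message(
    content: str,
    parse_args: Callable[[str], dict[str, Any]] = parse_string_args,
) -> dict[str, Any]:
    """Convert raw Gemma 4 output into an OpenAI-style assistant message."""
    text = _THOUGHT.sub("", content).strip()  # drop the thinking channel
    tool_calls = [
        {
            "id": "call_" + uuid.uuid4().hex,  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": name.strip(),
                # OpenAI-style arguments are a JSON-encoded string.
                "arguments": json.dumps(parse_args(body)),
            },
        }
        for name, body in _CALL.findall(text)
    ]
    if not tool_calls:
        return {"role": "assistant", "content": text}
    return {"role": "assistant", "content": None, "tool_calls": tool_calls}


raw = '<|tool_call>call:write_file{content:<|"|>print("hello")<|"|>,file_path:<|"|>hello.py<|"|>}<tool_call|>'
print(json.dumps(to_openai_message(raw), indent=2))
```

Until a handler exists upstream, a function like this can also be applied by callers to message.content as a workaround.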
Impact
- All applications using create_chat_completion() with Gemma 4 + tools receive tool_calls=None
- The tool call is silently dropped: no error, no warning
- Gemma 4 was released April 2, 2026. This affects every version of llama-cpp-python up to and including 0.3.23 (latest)
- The C++ server already has this fix. The Python API does not.
References