Gemma 4 tool calls returned as raw native tokens in content instead of tool_calls #2227

@ThiruvarankanM

Problem

When using Gemma 4 models via create_chat_completion() with tools, the Python API returns tool calls as raw native tokens inside message.content instead of the expected message.tool_calls list. This completely breaks OpenAI-compatible tool calling for any application using Gemma 4.

Environment

llama-cpp-python: 0.3.23
Model: gemma-4-E4B-it-Q4_K_M.gguf (Unsloth)
Platform: macOS (Apple Silicon)
Python: 3.12

Reproduction

from llama_cpp import Llama

llm = Llama(model_path="gemma-4-E4B-it-Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1, verbose=False)

tools = [
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {"type": "string"},
                    "content":   {"type": "string"},
                },
                "required": ["file_path", "content"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                },
                "required": ["command"],
            },
        },
    },
]

r = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write 'print(\"hello\")' to hello.py"}],
    tools=tools,
    tool_choice="auto",
)

msg = r["choices"][0]["message"]
print("tool_calls:", msg.get("tool_calls"))   # None ← BUG
print("content:   ", msg.get("content"))      # raw native tokens ← BUG

Actual Output

tool_calls: None
content:    '<|tool_call>call:write_file{content:<|"|>print("hello")<|"|>,file_path:<|"|>hello.py<|"|>}<tool_call|>'
tool_calls: None
content:    '<|tool_call>call:run_command{command:<|"|>ls -la<|"|>}<tool_call|>'

Expected Output

tool_calls: [
  {
    "id": "...",
    "type": "function",
    "function": {
      "name": "write_file",
      "arguments": "{\"file_path\": \"hello.py\", \"content\": \"print(\\\"hello\\\")\"}"
    }
  }
]
content: None

Root Cause

llama_chat_format.py registers a "gemma" format handler that covers Gemma 1/2/3, but there is no "gemma4" handler:

# llama_chat_format.py — only this exists, nothing for gemma4
@register_chat_format("gemma")
def format_gemma(messages, **kwargs):
    ...  # Basic turn formatting, no tool call parsing

Verified with:

import inspect, re
from llama_cpp import llama_chat_format
src = inspect.getsource(llama_chat_format)
formats = re.findall(r'register_chat_format\(["\'](.+?)["\']\)', src)
print([f for f in formats if "gemma" in f.lower()])
# Output: ['gemma']   ← no gemma4

Without a dedicated handler, the library falls back to the Jinja2 template embedded in the GGUF file for chat formatting. The Jinja2 template correctly prompts the model to use its native tool call tokens — but no code in the Python API then parses those native tokens back into tool_calls.

The C++ server (llama-server) solves this with a PEG grammar parser added in ggml-org/llama.cpp PR #21326. That fix has never been ported to the Python API.

Gemma 4 Native Token Format

Gemma 4 encodes tool calls using these native tokens:

<|tool_call>call:FUNCTION_NAME{ARG_PAIRS}<tool_call|>

Where ARG_PAIRS encodes values by type:

Type       Encoding                       Example
string     key:<|"|>value<|"|>            file_path:<|"|>hello.py<|"|>
integer    key:30                         timeout:30
float      key:3.5                        temperature:3.5
boolean    key:true / key:false           background:false
list[str]  key:[<|"|>a<|"|>,<|"|>b<|"|>]  files:[<|"|>main.py<|"|>]
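
The encoding table above can be turned into a small decoder. The following is a minimal sketch, not the parser used by llama.cpp: the helper names parse_value and parse_arg_pairs are hypothetical, and it assumes only the value types listed in the table (no nested objects, and no literal <|"|> inside a string value).

```python
import json

QUOTE = '<|"|>'  # Gemma 4 string delimiter, as shown in the table above


def parse_value(s, i):
    """Parse one value starting at s[i]; return (value, next_index)."""
    if s.startswith(QUOTE, i):
        # String: everything up to the next delimiter, taken verbatim.
        end = s.index(QUOTE, i + len(QUOTE))
        return s[i + len(QUOTE):end], end + len(QUOTE)
    if s[i] == "[":
        # List: comma-separated values up to the closing bracket.
        items, i = [], i + 1
        while s[i] != "]":
            item, i = parse_value(s, i)
            items.append(item)
            if s[i] == ",":
                i += 1
        return items, i + 1
    # Bare scalar (int, float, true/false): runs to the next ',' or ']'.
    j = i
    while j < len(s) and s[j] not in ",]":
        j += 1
    return json.loads(s[i:j]), j


def parse_arg_pairs(s):
    """Decode 'key:value,key:value' into a dict (hypothetical helper)."""
    args, i = {}, 0
    while i < len(s):
        colon = s.index(":", i)
        key = s[i:colon].strip()
        value, i = parse_value(s, colon + 1)
        args[key] = value
        if i < len(s) and s[i] == ",":
            i += 1
    return args


print(parse_arg_pairs('timeout:30,background:false,files:[<|"|>main.py<|"|>]'))
# {'timeout': 30, 'background': False, 'files': ['main.py']}
```

A bare scalar that is not valid JSON raises json.JSONDecodeError here, which seems preferable to guessing at a type.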

Gemma 4 also outputs an internal thinking channel before tool calls (when thinking is enabled):

<|channel>thought
[internal reasoning here]
<channel|><|tool_call>call:FUNCTION_NAME{...}<tool_call|>

A complete handler needs to strip the thought block before parsing the tool call tokens.
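
As an illustration of that stripping step, here is a rough sketch that removes the thought block and then pulls out each call's function name and raw argument string. The function name split_tool_calls and both regular expressions are assumptions based only on the token layout shown in this issue; the real model output may contain variations they do not cover.

```python
import re

# Patterns derived from the token layout described above (assumed, not official).
THOUGHT_RE = re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL)
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:([\w.-]+)\{(.*?)\}<tool_call\|>", re.DOTALL
)


def split_tool_calls(content):
    """Return ([(function_name, raw_arg_pairs), ...], remaining_text)."""
    # Drop the internal reasoning first so it cannot confuse the call regex.
    visible = THOUGHT_RE.sub("", content)
    calls = [(m.group(1), m.group(2)) for m in TOOL_CALL_RE.finditer(visible)]
    # Whatever is left after removing the calls is ordinary assistant text.
    remaining = TOOL_CALL_RE.sub("", visible).strip()
    return calls, remaining


raw = (
    "<|channel>thought\nI should list the files.\n<channel|>"
    '<|tool_call>call:run_command{command:<|"|>ls -la<|"|>}<tool_call|>'
)
print(split_tool_calls(raw))
# ([('run_command', 'command:<|"|>ls -la<|"|>')], '')
```

The argument string is returned undecoded on purpose, so the type-specific decoding can be handled and tested separately.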

Suggested Fix

Add a handler registered with @register_chat_completion_handler("gemma4") in llama_chat_format.py that:

  1. Uses the Jinja2 template from the GGUF metadata for message formatting (already works)
  2. After generation, checks if message.content contains <|tool_call> tokens
  3. Strips any <|channel>thought...<channel|> block
  4. Parses the native token format into the standard OpenAI tool_calls structure
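
Until such a handler exists, steps 2 to 4 could be approximated as a post-processing pass over the dict returned by create_chat_completion(). This is only a sketch under assumptions: patch_gemma4_response is a hypothetical name, the regular expressions are inferred from the token layout in this issue, and parse_args is a caller-supplied function that decodes the raw ARG_PAIRS string into a dict. Registering it as a real chat completion handler is not shown.

```python
import json
import re
import uuid

# Patterns inferred from the native token layout in this issue (assumed).
THOUGHT_RE = re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL)
TOOL_CALL_RE = re.compile(
    r"<\|tool_call>call:([\w.-]+)\{(.*?)\}<tool_call\|>", re.DOTALL
)


def patch_gemma4_response(response, parse_args):
    """Rewrite raw Gemma 4 tool-call tokens into OpenAI-style tool_calls.

    Mutates `response` in place and returns it. `parse_args` maps the raw
    ARG_PAIRS string to a dict of arguments (supplied by the caller).
    """
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        content = message.get("content") or ""
        if "<|tool_call>" not in content:      # step 2: nothing to do
            continue
        content = THOUGHT_RE.sub("", content)  # step 3: drop thought block
        tool_calls = [                         # step 4: build tool_calls
            {
                "id": "call_" + uuid.uuid4().hex[:24],
                "type": "function",
                "function": {
                    "name": m.group(1),
                    # OpenAI-style arguments are a JSON-encoded string.
                    "arguments": json.dumps(parse_args(m.group(2))),
                },
            }
            for m in TOOL_CALL_RE.finditer(content)
        ]
        if not tool_calls:
            continue
        message["tool_calls"] = tool_calls
        message["content"] = TOOL_CALL_RE.sub("", content).strip() or None
        choice["finish_reason"] = "tool_calls"
    return response
```

Calling patch_gemma4_response(r, parse_args) on the result from the reproduction above would then expose the call under msg["tool_calls"], provided parse_args handles the argument encoding.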

Impact

  • All applications using create_chat_completion() with Gemma 4 + tools receive tool_calls=None
  • The tool call never reaches tool_calls; it stays in content as raw tokens, with no error and no warning
  • Gemma 4 was released April 2, 2026. This affects every version of llama-cpp-python up to and including 0.3.23 (latest)
  • The C++ server already has this fix. The Python API does not.
