Gemma 4 tool calls returned as raw native tokens in `content` instead of `tool_calls`

## Problem

When `create_chat_completion()` is called with `tools` on a Gemma 4 model, the Python API returns the tool call as raw native tokens inside `message.content` instead of populating the expected `message.tool_calls` list. This breaks OpenAI-compatible tool calling for any application using Gemma 4.

## Environment

| Component | Value |
|---|---|
| **llama-cpp-python** | 0.3.23 |
| **Model** | `gemma-4-E4B-it-Q4_K_M.gguf` (Unsloth) |
| **Platform** | macOS (Apple Silicon) |
| **Python** | 3.12 |

## Reproduction

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-E4B-it-Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1, verbose=False)

tools = [
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "file_path": {"type": "string"},
                    "content":   {"type": "string"},
                },
                "required": ["file_path", "content"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                },
                "required": ["command"],
            },
        },
    },
]

r = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write 'print(\"hello\")' to hello.py"}],
    tools=tools,
    tool_choice="auto",
)

msg = r["choices"][0]["message"]
print("tool_calls:", msg.get("tool_calls"))   # None ← BUG
print("content:   ", msg.get("content"))      # raw native tokens ← BUG
```

## Actual Output

```
tool_calls: None
content:    '<|tool_call>call:write_file{content:<|"|>print("hello")<|"|>,file_path:<|"|>hello.py<|"|>}<tool_call|>'
```

The same failure occurs when the model selects `run_command`:

```
tool_calls: None
content:    '<|tool_call>call:run_command{command:<|"|>ls -la<|"|>}<tool_call|>'
```

## Expected Output

```
tool_calls: [
  {
    "id": "...",
    "type": "function",
    "function": {
      "name": "write_file",
      "arguments": "{\"file_path\": \"hello.py\", \"content\": \"print(\\\"hello\\\")\"}"
    }
  }
]
content: None
```

## Root Cause

`llama_chat_format.py` registers a `"gemma"` format handler that covers Gemma 1/2/3, but **there is no `"gemma4"` handler**:

```python
# llama_chat_format.py — only this exists, nothing for gemma4
@register_chat_format("gemma")
def format_gemma(messages, **kwargs):
    ...  # Basic turn formatting, no tool call parsing
```

Verified with:
```python
import inspect, re
from llama_cpp import llama_chat_format
src = inspect.getsource(llama_chat_format)
formats = re.findall(r'register_chat_format\(["\'](.+?)["\']\)', src)
print([f for f in formats if "gemma" in f.lower()])
# Output: ['gemma']   ← no gemma4
```

Without a dedicated handler, the library falls back to the Jinja2 template embedded in the GGUF file for chat formatting. The Jinja2 template correctly prompts the model to use its native tool call tokens — but **no code in the Python API then parses those native tokens back into `tool_calls`**.

The C++ server (`llama-server`) solves this with a PEG grammar parser added in [ggml-org/llama.cpp PR #21326](https://github.com/ggml-org/llama.cpp/pull/21326). That fix has never been ported to the Python API.

## Gemma 4 Native Token Format

Gemma 4 encodes tool calls using these native tokens:

```
<|tool_call>call:FUNCTION_NAME{ARG_PAIRS}<tool_call|>
```

Where `ARG_PAIRS` encodes values by type:

| Type | Encoding | Example |
|------|----------|---------|
| string | `key:<\|"\|>value<\|"\|>` | `file_path:<\|"\|>hello.py<\|"\|>` |
| integer | `key:30` | `timeout:30` |
| float | `key:3.5` | `temperature:3.5` |
| boolean | `key:true` / `key:false` | `background:false` |
| list[str] | `key:[<\|"\|>a<\|"\|>,<\|"\|>b<\|"\|>]` | `files:[<\|"\|>main.py<\|"\|>]` |
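
The encoding in the table can be decoded with a small cursor-based parser. The following is a minimal sketch, not the parser used by `llama-server`: the helper names `parse_tool_calls`, `parse_args`, and `_parse_value` are hypothetical, and it assumes only the five value types listed above, with no nested objects and no whitespace inside string delimiters.

```python
import re

QUOTE = '<|"|>'  # Gemma 4's string delimiter token
CALL_RE = re.compile(r"<\|tool_call>call:([\w.\-]+)\{(.*?)\}<tool_call\|>", re.DOTALL)
SCALAR_RE = re.compile(r"[^,\]]+")  # a bare scalar runs until ',' or ']'


def _parse_value(text, pos):
    """Parse one value starting at text[pos]; return (value, next_pos)."""
    if text.startswith(QUOTE, pos):  # string
        end = text.index(QUOTE, pos + len(QUOTE))
        return text[pos + len(QUOTE):end], end + len(QUOTE)
    if text.startswith("[", pos):  # list
        items, pos = [], pos + 1
        while not text.startswith("]", pos):
            item, pos = _parse_value(text, pos)
            items.append(item)
            if text.startswith(",", pos):
                pos += 1
        return items, pos + 1
    match = SCALAR_RE.match(text, pos)  # boolean, integer, or float
    if match is None:
        raise ValueError(f"expected a value at offset {pos}")
    raw = match.group().strip()
    if raw in ("true", "false"):
        return raw == "true", match.end()
    try:
        return int(raw), match.end()
    except ValueError:
        return float(raw), match.end()  # raises ValueError on unknown tokens


def parse_args(body):
    """Parse the text between '{' and '}' into a dict of arguments."""
    args, pos = {}, 0
    while pos < len(body):
        colon = body.index(":", pos)
        key = body[pos:colon].strip()
        args[key], pos = _parse_value(body, colon + 1)
        if body.startswith(",", pos):
            pos += 1
    return args


def parse_tool_calls(content):
    """Return a list of (function_name, args_dict) found in content."""
    return [(m.group(1), parse_args(m.group(2))) for m in CALL_RE.finditer(content)]


raw = '<|tool_call>call:write_file{content:<|"|>print("hello")<|"|>,file_path:<|"|>hello.py<|"|>}<tool_call|>'
print(parse_tool_calls(raw))
# [('write_file', {'content': 'print("hello")', 'file_path': 'hello.py'})]
```

Strings are read between delimiter tokens rather than split on commas, so commas and braces inside a string value do not confuse the parser. Malformed input raises `ValueError`.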

When thinking is enabled, Gemma 4 also emits an internal thought channel before the tool call:

```
<|channel>thought
[internal reasoning here]
<channel|><|tool_call>call:FUNCTION_NAME{...}<tool_call|>
```

A complete handler needs to strip the thought block before parsing the tool call tokens.
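
Stripping the thought block can be done with a single regular expression. This is a sketch under the assumption that the block always has the shape shown above and is properly closed. `split_thought` is a hypothetical helper name, not part of llama-cpp-python.

```python
import re

# Matches '<|channel>thought ... <channel|>' across multiple lines.
THOUGHT_RE = re.compile(r"<\|channel>thought(.*?)<channel\|>", re.DOTALL)


def split_thought(content):
    """Return (thought_text_or_None, content_without_thought_block)."""
    match = THOUGHT_RE.search(content)
    if match is None:
        return None, content  # no complete thought block found
    thought = match.group(1).strip()
    rest = content[:match.start()] + content[match.end():]
    return thought, rest


raw = (
    "<|channel>thought\n"
    "Need to list files.\n"
    '<channel|><|tool_call>call:run_command{command:<|"|>ls -la<|"|>}<tool_call|>'
)
thought, rest = split_thought(raw)
print(thought)  # Need to list files.
print(rest)     # <|tool_call>call:run_command{command:<|"|>ls -la<|"|>}<tool_call|>
```

Returning the thought text instead of discarding it leaves the caller free to log it or expose it separately. An unterminated block is left untouched in this sketch.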

## Suggested Fix

Add a handler to `llama_chat_format.py`, registered with `@register_chat_completion_handler("gemma4")`, that:

1. Uses the Jinja2 template from the GGUF metadata for message formatting (already works)
2. After generation, checks if `message.content` contains `<|tool_call>` tokens
3. Strips any `<|channel>thought...<channel|>` block
4. Parses the native token format into the standard OpenAI `tool_calls` structure
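
The last step, building the OpenAI-style structure, could look like the sketch below. It assumes the native tokens have already been parsed into `(name, args)` pairs. `to_openai_message` is a hypothetical helper name, and the `call_` prefix on the generated ID is an illustrative choice rather than a requirement of the library.

```python
import json
import uuid


def to_openai_message(calls):
    """Build an assistant message from a list of (name, args_dict) pairs."""
    tool_calls = [
        {
            "id": f"call_{uuid.uuid4().hex[:24]}",  # assumed ID format
            "type": "function",
            "function": {
                "name": name,
                # OpenAI-style clients expect arguments as a JSON string.
                "arguments": json.dumps(args),
            },
        }
        for name, args in calls
    ]
    return {"role": "assistant", "content": None, "tool_calls": tool_calls}


message = to_openai_message(
    [("write_file", {"file_path": "hello.py", "content": 'print("hello")'})]
)
print(message["tool_calls"][0]["function"]["arguments"])
# {"file_path": "hello.py", "content": "print(\"hello\")"}
```

A full handler would presumably also set `finish_reason` to `"tool_calls"` on the enclosing choice so that clients can tell the turn ended with a tool call.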

## Impact

- All applications using `create_chat_completion()` with Gemma 4 + tools receive `tool_calls=None`
- The tool call is silently dropped — no error, no warning
- Gemma 4 was released April 2, 2026. This affects every version of llama-cpp-python up to and including **0.3.23** (latest)
- The C++ server already has this fix. The Python API does not.
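
Until a fix lands, an application can at least turn the silent drop into a visible error. The guard below is a sketch of that idea. `assert_no_leaked_tool_call` is a hypothetical helper, and it relies only on the marker tokens shown in this report.

```python
LEAK_MARKERS = ("<|tool_call>", "<tool_call|>")


def assert_no_leaked_tool_call(message):
    """Raise if a tool call was left as raw native tokens in content."""
    content = message.get("content") or ""
    if not message.get("tool_calls") and any(m in content for m in LEAK_MARKERS):
        raise RuntimeError(
            "Gemma 4 tool call returned as raw tokens in message.content"
        )


# A message shaped like the actual output reported above.
leaked = {
    "role": "assistant",
    "content": '<|tool_call>call:run_command{command:<|"|>ls -la<|"|>}<tool_call|>',
}
try:
    assert_no_leaked_tool_call(leaked)
except RuntimeError as exc:
    print(exc)  # Gemma 4 tool call returned as raw tokens in message.content
```

Calling this on `r["choices"][0]["message"]` after each completion makes the failure explicit instead of letting the application treat the tokens as ordinary text.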

## References

- [ggml-org/llama.cpp PR #21326 — Gemma 4 template parser fixes](https://github.com/ggml-org/llama.cpp/pull/21326)
- [ggml-org/llama.cpp Issue #21316 — Gemma4 tool calling leaves unexpected tokens](https://github.com/ggml-org/llama.cpp/issues/21316)
- [ggml-org/llama.cpp Issue #22786 — Gemma 4 tool call returned as content](https://github.com/ggml-org/llama.cpp/issues/22786)
- [Gemma 4 prompt formatting — Google AI for Developers](https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4)
- [Gemma 4 thinking mode — Google AI for Developers](https://ai.google.dev/gemma/docs/capabilities/thinking)



