
[llama] Add chat format support for Llama 3 Instruct models #16987

Open

seyeong-han wants to merge 1 commit into pytorch:main from seyeong-han:chat-format-support-llama-runner

Conversation

@seyeong-han
Contributor

Summary

Without chat formatting, Instruct models behave like base models and never generate end-of-turn tokens.

So, I added chat template support to the llama_main runner for Llama 3 Instruct models.

Problem

When running Llama 3 Instruct models without proper chat formatting:

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --prompt="What's the capital of France?"

Output:

Paris is incorrect answer so was stated so some major tech companies may choose the other city, and for instance the CEO of Apple is Tim Cook and he is from California, and the CEO of Google is Sundar Pichai and he is from India.

==> The model rambles until it hits the max_tokens limit; it never emits an end-of-turn token.

Solution

New CLI flags enable proper Instruct model behavior:

| Flag | Description | Default |
|------|-------------|---------|
| --chat_format | Chat template (llama3, none) | none |
| --system_prompt | System prompt for assistant behavior | (empty) |
| --echo | Echo input prompt in output | true |

Examples

Basic Usage

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="llama3" \
    --prompt="What's the capital of France?"

Output:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What's the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The capital of France is Paris.

Clean Output (Recommended for Apps)

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="llama3" \
    --echo=false \
    --prompt="What's the capital of France?"

Output:

The capital of France is Paris.

With System Prompt

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="llama3" \
    --system_prompt="You are a helpful assistant. Be concise and respond in one sentence." \
    --echo=false \
    --prompt="Explain quantum computing"

Output:

Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously to solve complex problems faster than classical computers.

Backward Compatible (No Chat Format)

# For base models or pre-formatted prompts
cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="none" \
    --prompt="Once upon a time"

Files Changed

| File | Description |
|------|-------------|
| examples/models/llama/runner/chat_formatter.h | NEW - ChatFormatter for Llama 3 |
| examples/models/llama/main.cpp | Added CLI flags |
| examples/models/llama/README.md | Documentation with examples |
| extension/llm/runner/text_llm_runner.cpp | Special token filtering |
| extension/llm/runner/llm_runner_helper.cpp | EOS token merge logic |

Before/After Comparison

| Metric | Before (no chat format) | After (--chat_format=llama3 --echo=false) |
|--------|-------------------------|-------------------------------------------|
| Tokens generated | 120 (max limit) | 7 (stops at EOS) |
| Output | Rambling continuation | Clean answer |
| Special tokens | Visible in output | Filtered out |

Test Plan

  • Build with make llama-cpu
  • Test --chat_format=llama3 with Llama-3.2-1B-Instruct
  • Verify generation stops at <|eot_id|> token
  • Test --echo=false produces clean output without special tokens
  • Test --system_prompt affects model behavior
  • Backward compatible with --chat_format=none (default)

@pytorch-bot

pytorch-bot Bot commented Jan 29, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16987

Note: Links to docs will display an error until the docs builds have been completed.

❌ 100 New Failures, 2 Cancelled Jobs

As of commit 2d09315 with merge base 4930e7c:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 29, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@seyeong-han
Contributor Author

seyeong-han commented Feb 2, 2026

Summary

  • Add Jinja2Cpp-based chat template support for the Llama runner, including embedded templates, template types, and Jinja formatter bindings.
  • Import vLLM Llama3.2 and Gemma3 Jinja templates and normalize them for Jinja2Cpp compatibility.
  • Wire Jinja formatting into C++/Python runners (CLI flags, pybinds), add tests, and update docs.
  • Fix build/link integration for Jinja2Cpp and address missing nonstd headers during install.

Result

CMAKE_POLICY_VERSION_MINIMUM=3.5 make llama-cpu
cmake-out/examples/models/llama/llama_main \
--model_path llama/Llama-3.2-1B-Instruct/model.pte \
--tokenizer_path llama/tokenizer.json \
--chat_template_file extension/llm/runner/templates/tool_chat_template_llama3.2_pythonic.jinja \
--prompt "What is the capital of France?"
I tokenizers:regex.cpp:27] Registering override fallback regex
I tokenizers:hf_tokenizer.cpp:152] Setting up normalizer...
I tokenizers:hf_tokenizer.cpp:158] Normalizer field is null, skipping
I tokenizers:hf_tokenizer.cpp:170] Setting up pretokenizer...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1770067519.722113  161182 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n...': invalid perl operator: (?!                                                                                                           
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!                                                                                            
This may be ok if a fallback regex is used.
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:174] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:190] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:250] Loaded 280147 BPE merge rules
I tokenizers:hf_tokenizer.cpp:262] Built merge ranks map with 127744 entries
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|end_of_text|>' (id=128001)
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|eom_id|>' (id=128008)
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|eot_id|>' (id=128009)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant with tool calling capabilities. Only reply with a tool call if the function exists in the library provided by the user. If it doesn't exist, just reply directly in natural language. When you receive a tool call response, use the output to format an answer to the original user question.<|eot_id|><|start_header_id|>user<|end_header_id|>                                                                                            

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


"Bonjour, I'm happy to help. The capital of France is Paris."

PyTorchObserver {"prompt_tokens":104,"generated_tokens":16,"model_load_start_ms":1770067520100,"model_load_end_ms":1770067522061,"inference_start_ms":1770067522061,"inference_end_ms":1770067524284,"prompt_eval_end_ms":1770067523133,"first_token_ms":1770067523133,"aggregate_sampling_time_ms":5,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

@larryliu0820

Comment on lines +25 to +26
Llama3, // Llama 3.x Instruct models
Gemma3, // Gemma 3 Instruct models
Contributor


Why are these 2 special?

Contributor Author


I just wanted to cover llama3 and gemma3 for the initial support; I'll cover more models, such as Qwen and other ExecuTorch-supported LLMs, in follow-ups.

Comment thread examples/models/llama/runner/eager.py Outdated
)

parser.add_argument(
"--chat_format",
Contributor


Is chat_format an existing flag?

Should we standardize on chat_template_file?

Contributor Author


Updated: eager.py now uses --chat_template_file only. The --chat_format flag was removed for consistency with the C++ runner.

If you build with `make llama-cpu` and hit a RapidJSON CMake error, run it as:
CMAKE_POLICY_VERSION_MINIMUM=3.5 make llama-cpu
Contributor


Is this error always hit?

What cmake version do we require in default build?

Contributor Author


The CMAKE_POLICY_VERSION_MINIMUM=3.5 workaround is required because RapidJSON’s CMakeLists uses a minimum version that CMake 4+ no longer accepts, and Jinja2Cpp pulls RapidJSON in its internal deps. This is a dependency‑compatibility workaround for specific build environments, not a permanent ExecuTorch requirement.

@@ -0,0 +1,123 @@
{#- Begin-of-sequence token to start the model prompt -#}
Contributor


Are these templates available from download from HF?

Do we need to check them in?

Contributor Author


We won’t check in the vLLM templates. Instead we’ll reference/download them from Hugging Face and document this in the LLM runner README with a --chat_template_file example.

auto value = eos_id.toScalar().to<int64_t>();
eos_ids.emplace(value);
ET_LOG(Info, "eos_id = %" PRId64, value);
if (eos_ids.find(value) == eos_ids.end()) {
Contributor


Do EOS still come from model metadata? Is there a standard way to get them from tokenizer in HF?

Contributor Author


Yes—EOS still comes from model metadata if present, but we first read it from the tokenizer. In llm_runner_helper.cpp we use tokenizer->eos_tok() and tokenizer->stop_tokens() (populated from HF tokenizer.json), then merge any additional EOS IDs from model metadata. So the standard HF path is supported, and metadata is an additive fallback when the model exports extra stop IDs.
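
A minimal sketch of that merge order (standalone C++, not the actual llm_runner_helper.cpp code; primary_eos and stop_tokens stand in for tokenizer->eos_tok() / tokenizer->stop_tokens(), and metadata_eos_ids stands in for the IDs read from model metadata):

#include <cstdint>
#include <unordered_set>
#include <vector>

// Tokenizer-derived stop tokens come first; metadata IDs are merged in
// additively, never clearing what the tokenizer provided.
std::unordered_set<uint64_t> merge_eos_ids(
    uint64_t primary_eos,
    const std::vector<uint64_t>& stop_tokens,
    const std::vector<int64_t>& metadata_eos_ids) {
  std::unordered_set<uint64_t> eos_ids;
  eos_ids.insert(primary_eos); // tokenizer's primary EOS
  eos_ids.insert(stop_tokens.begin(), stop_tokens.end()); // HF stop tokens
  for (int64_t extra : metadata_eos_ids) { // e.g. extra stop IDs the model exports
    eos_ids.insert(static_cast<uint64_t>(extra));
  }
  return eos_ids;
}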

@@ -0,0 +1,38 @@
/*
Contributor


This file is just testing the 2 special cases of llama3 and gemma3?

Is there a test for the more general jinja file?

Contributor Author


For now we only test the supported model templates (Llama3/Gemma3). A generic Jinja file smoke test can be added later if needed, but it’s not required to validate supported templates in this PR.

" warming=" + (config.warming ? "True" : "False") + ">";
});

// Bind chat template helpers
Contributor


We don't have bindings for text llm runner right? Is this being used in multimodal runner?

Contributor Author


Currently, this will only be used for multimodal runner, but will be extended to text_llm_runner in the next PR.

Comment thread CMakeLists.txt Outdated
string_view.hpp
"${_jinja2cpp_nonstd_root}/string-view-lite/include"
)
endif()
Contributor


This whole section should live inside extension/llm/chat_template/ directory.

Contributor Author


Moved the Jinja2Cpp FetchContent and nonstd/RapidJSON workaround into extension/llm/chat_template/CMakeLists.txt; the root now only adds that subdirectory.

@@ -0,0 +1,51 @@
#pragma once
Contributor


I think we should start a new directory extension/llm/chat_template/ and put these files inside.

Contributor Author


Done. chat_templates.h now lives under extension/llm/chat_template/ and all includes were updated.

Comment thread CMakeLists.txt
Comment on lines +314 to +320
if(TARGET jinja2cpp)
install(
TARGETS jinja2cpp
EXPORT ExecuTorchTargets
DESTINATION ${CMAKE_INSTALL_LIBDIR}
)
endif()
Contributor


Can we move this to chat_template/CMakeLists.txt as well?

Contributor

@larryliu0820 larryliu0820 left a comment


I think this is a good start! Thank you for adding this

@meta-codesync
Contributor

meta-codesync Bot commented Feb 4, 2026

@seyeong-han has imported this pull request. If you are a Meta employee, you can view this in D92221320.

@meta-codesync
Contributor

meta-codesync Bot commented Feb 10, 2026

@seyeong-han has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92221320.

@kirklandsign kirklandsign requested a review from Copilot February 24, 2026 23:38
@kirklandsign
Contributor

Thanks! Do we plan to integrate into llm runner?


Copilot AI left a comment


Pull request overview

Adds Jinja2-based chat template formatting support to the ExecuTorch LLM runner stack so Instruct models (notably Llama 3.x / 3.2) generate proper end-of-turn behavior, and exposes these capabilities through the C++ runner, Python bindings, and llama example apps.

Changes:

  • Introduces JinjaChatFormatter + chat types + embedded templates (Llama3/Llama3.2/Gemma3), with C++ and Python bindings plus tests.
  • Adds llama example CLI + Python runner support for chat formatting (system prompt, template file, echo control) and ships sample templates.
  • Wires in build system changes (CMake/Buck/Bazel-ish) and updates EOS handling / output token filtering.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 11 comments.

| File | Description |
|------|-------------|
| shim_et/xplat/executorch/build/build_variables.bzl | Adds new runner source to xplat build variables |
| extension/llm/runner/text_llm_runner.cpp | Filters special tokens from printed output |
| extension/llm/runner/test/test_runner_pybindings.py | Adds Python tests for new chat template bindings |
| extension/llm/runner/test/test_jinja_chat_formatter.cpp | New C++ unit tests for chat formatter + parsing |
| extension/llm/runner/test/targets.bzl | Registers new C++ test target |
| extension/llm/runner/test/CMakeLists.txt | Adds new test source to CMake test suite |
| extension/llm/runner/test/BUCK | Updates fbcode test target definitions/cleanup |
| extension/llm/runner/templates/tool_chat_template_llama3.2_pythonic.jinja | New sample Llama3.2 tool chat template |
| extension/llm/runner/templates/tool_chat_template_gemma3_pythonic.jinja | New sample Gemma3 tool chat template |
| extension/llm/runner/targets.bzl | Exports new headers/sources + adds template deps |
| extension/llm/runner/pybindings.cpp | Exposes chat types/template formatter to Python |
| extension/llm/runner/llm_runner_helper.cpp | Tweaks EOS IDs collection/logging |
| extension/llm/runner/jinja_chat_formatter.h | New Jinja chat formatter API |
| extension/llm/runner/jinja_chat_formatter.cpp | Implements template loading/normalization/rendering |
| extension/llm/runner/chat_types.h | New C++ structs for messages/conversations |
| extension/llm/runner/_llm_runner.pyi | Adds Python typing for new bindings |
| extension/llm/runner/__init__.py | Exports new bindings from package |
| extension/llm/runner/README.md | Documents chat templates usage for runner/llama_main |
| extension/llm/runner/CMakeLists.txt | Links runner lib against jinja2cpp + defines macro |
| extension/llm/chat_template/targets.bzl | New Buck target exporting embedded templates header |
| extension/llm/chat_template/chat_templates.h | New embedded templates + model token metadata |
| extension/llm/chat_template/CMakeLists.txt | Fetches/builds jinja2cpp for CMake builds |
| extension/llm/chat_template/BUCK | Registers chat_template targets |
| examples/models/llama/runner/targets.bzl | Exports new chat_formatter header |
| examples/models/llama/runner/generation.py | Adds chat formatting support to Python runner |
| examples/models/llama/runner/eager.py | Adds CLI args for system prompt/template file |
| examples/models/llama/runner/chat_formatter.h | New C++ chat formatter wrapper for llama_main |
| examples/models/llama/runner/CMakeLists.txt | Links llama_runner against jinja2cpp |
| examples/models/llama/main.cpp | Adds chat-format CLI flags + integrates echo into config |
| examples/models/llama/README.md | Documents new CLI flags + examples |
| examples/models/llama/CMakeLists.txt | Fetches jinja2cpp for standalone llama example build |
| CMakeLists.txt | Adds FetchContent include, installs jinja2cpp if present, adds chat_template subdir |


Comment on lines +253 to +254
if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
    return chat_template_type.Llama3

Copilot AI Feb 24, 2026


_resolve_template_type maps Llama 3.2 variants ("llama3.2", "llama32", "llama3_2") to ChatTemplateType.Llama3, even though the bindings expose a distinct ChatTemplateType.Llama32 and the C++ parseChatTemplateType maps these variants to Llama32. This inconsistency can be surprising for callers; map these variants to ChatTemplateType.Llama32 to keep behavior consistent across languages.

Suggested change
-if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
-    return chat_template_type.Llama3
+if normalized == "llama3":
+    return chat_template_type.Llama3
+if normalized in ("llama3.2", "llama32", "llama3_2"):
+    return chat_template_type.Llama32

Comment on lines +5 to +34
include(FetchContent)
cmake_policy(SET CMP0077 NEW)

FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_BUILD_SHARED
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_INSTALL
OFF
CACHE BOOL ""
FORCE
)

FetchContent_MakeAvailable(jinja2cpp)
if(NOT TARGET jinja2cpp)
message(FATAL_ERROR "Jinja2Cpp target not found after FetchContent.")
endif()

Copilot AI Feb 24, 2026


SUPPORT_REGEX_LOOKAHEAD is set after FetchContent_MakeAvailable(jinja2cpp), so it won't affect any dependencies that are configured/built as part of jinja2cpp's CMake (e.g., if the option is meant to influence its regex backend). If this option is required to avoid the lookahead compilation error, set it in the cache before calling FetchContent_MakeAvailable.

Comment on lines 56 to 62
max_batch_size: int,
use_kv_cache: bool,
vocab_size: int,
chat_format: str = "llama3",
system_prompt: str = "",
chat_template_file: Optional[str] = None,
device: str = "cpu",

Copilot AI Feb 24, 2026


The default chat_format is now "llama3", which changes behavior for all existing callers that instantiate LlamaRunner without specifying chat formatting (e.g., NativeLlamaRunner, EagerLlamaRunner). This is likely a backward-incompatible change (and it also diverges from llama_main where --chat_format defaults to none). Consider defaulting to "none" here as well, or inferring the format from model metadata/name so base models don't get wrapped in an instruct template by default.

Comment on lines +100 to +115
inline ChatFormat parse_chat_format(const std::string& format_str) {
  static const std::unordered_map<std::string, ChatFormat> format_map = {
      {"none", ChatFormat::None},
      {"llama3", ChatFormat::Llama3},
      {"llama3.2", ChatFormat::Llama3},
      {"llama32", ChatFormat::Llama3},
      {"llama3_2", ChatFormat::Llama3},
      {"gemma3", ChatFormat::Gemma3},
      {"jinja", ChatFormat::Jinja},
  };
  auto it = format_map.find(format_str);
  if (it != format_map.end()) {
    return it->second;
  }
  return ChatFormat::None;
}

Copilot AI Feb 24, 2026


parse_chat_format does an exact lookup without normalizing case/whitespace, so values like "Llama3" or " llama3 " will silently fall back to None. Since this is wired to a CLI flag, it should be more forgiving (e.g., lowercasing and trimming before the lookup).

Comment on lines +130 to +153
inline std::unique_ptr<ChatFormatter> create_chat_formatter(
    ChatFormat format,
    const std::string& template_file = "") {
  using executorch::extension::llm::ChatTemplateType;
  using executorch::extension::llm::JinjaChatFormatter;

  if (!template_file.empty()) {
    return std::make_unique<JinjaChatFormatterAdapter>(
        JinjaChatFormatter::fromFile(template_file));
  }

  switch (format) {
    case ChatFormat::Llama3:
      return std::make_unique<JinjaChatFormatterAdapter>(
          JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama3));
    case ChatFormat::Gemma3:
      return std::make_unique<JinjaChatFormatterAdapter>(
          JinjaChatFormatter::fromTemplate(ChatTemplateType::Gemma3));
    case ChatFormat::Jinja:
      return std::make_unique<NoChatFormatter>();
    case ChatFormat::None:
    default:
      return std::make_unique<NoChatFormatter>();
  }

Copilot AI Feb 24, 2026


ChatFormat::Jinja currently returns NoChatFormatter when template_file is empty, but the CLI supports --chat_format=jinja and the README describes jinja as “custom template from file”. This leads to a confusing no-op when the user forgets --chat_template_file. Consider emitting an error (or at least a warning) when format == ChatFormat::Jinja and no template file is provided.

Comment on lines 133 to 142
// Wrap the token_callback with print function
std::function<void(const std::string&)> wrapped_callback =
    [token_callback, config](const std::string& piece) {
      if (!config.warming) {
-       llm::safe_printf(piece.c_str());
-       fflush(stdout);
+       // Filter out special tokens when not echoing or for cleaner output
+       if (!is_special_token(piece)) {
+         llm::safe_printf(piece.c_str());
+         fflush(stdout);
+       }
      }

Copilot AI Feb 24, 2026


wrapped_callback filters out any piece that looks like a special token regardless of GenerationConfig.echo. This means tokens like <|eot_id|> will be suppressed even when users explicitly want to see raw model output (and it can also hide literal text that happens to match <|...|>). Consider gating this filtering on !config.echo (or adding an explicit filter_special_tokens config), and/or filtering by token IDs (e.g., stop token set) rather than a string-pattern heuristic.

Comment on lines +45 to +46
{"not tools is none", "not tools"},
{"not tools is None", "not tools"},

Copilot AI Feb 24, 2026


In normalizeTemplate, the replacement "not tools is none" -> "not tools" changes semantics when tools is a non-empty list. In Jinja, not tools is none is effectively tools is not none (true even when tools is non-empty), but not tools becomes false for non-empty lists, which will skip tool blocks in templates like tool_chat_template_llama3.2_pythonic.jinja. Adjust this normalization to preserve the intended meaning (e.g., map not tools is none to tools or to an explicit non-empty check, depending on the desired behavior).

Suggested change
-{"not tools is none", "not tools"},
-{"not tools is None", "not tools"},
+{"not tools is none", "tools"},
+{"not tools is None", "tools"},

Comment on lines +1 to +16
if(NOT EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER)
return()
endif()

include(FetchContent)
cmake_policy(SET CMP0077 NEW)

FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS
OFF

Copilot AI Feb 24, 2026


chat_templates.h is included by public runner headers (e.g., jinja_chat_formatter.h), but this CMake subdir doesn't install the extension/llm/chat_template/*.h headers. As a result, make install / consumers of the installed SDK may fail to compile due to missing headers. Add an install(DIRECTORY ...) for this header (and/or define/install an interface target that exports the include dir).

Comment on lines +8 to +32
FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_BUILD_SHARED
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_INSTALL
OFF
CACHE BOOL ""
FORCE
)

FetchContent_MakeAvailable(jinja2cpp)
if(NOT TARGET jinja2cpp)

Copilot AI Feb 24, 2026


This CMake file downloads and builds the jinja2cpp dependency at configure/build time via FetchContent_Declare from a remote Git repository pinned only to a Git tag, without any checksum or signature verification. If the upstream repository or the 1.3.2 tag is compromised or retagged, a malicious actor could inject arbitrary code into your build artifacts and any environment where this target is built. To harden the supply chain, vendor this dependency or fetch it from a trusted mirror, pin GIT_TAG to an immutable commit SHA, and add integrity verification (or otherwise avoid network-based fetching) for production builds.

Comment on lines +51 to +75
include(FetchContent)
cmake_policy(SET CMP0077 NEW)
FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)
set(JINJA2CPP_BUILD_TESTS
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_BUILD_SHARED
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_INSTALL
OFF
CACHE BOOL ""
FORCE
)
FetchContent_MakeAvailable(jinja2cpp)


Copilot AI Feb 24, 2026


This example CMake configuration pulls jinja2cpp directly from GitHub at configure/build time using FetchContent_Declare with only a Git tag and no integrity verification. A compromise of the upstream repository or the 1.3.2 tag could transparently inject malicious code into developers’ builds or any CI system that runs these example targets. Consider vendoring the dependency or using an internal mirror, pinning to an immutable commit SHA, and adding checksum/signature verification instead of relying on a network fetch to an unverified third-party Git repo.

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions Bot added the Stale PRs inactive for over 60 days label Apr 26, 2026
@kirklandsign
Contributor

@claude review

@claude

claude Bot commented Apr 30, 2026

Claude finished @kirklandsign's task in 2m 29s


Code Review: Chat Format Support for Llama 3 Instruct Models

  • Read core implementation files
  • Review build system changes
  • Review test coverage
  • Check for security/correctness issues
  • Provide comprehensive feedback

Overall this is a useful feature. The architecture is clean — a Jinja2Cpp-backed formatter behind an abstract ChatFormatter interface, with factory functions and proper CLI integration. Below are the issues I found, ordered by severity.


High Severity

1. normalizeTemplate changes semantics for tools handling
extension/llm/runner/jinja_chat_formatter.cpp:45-46

{"not tools is none", "not tools"},
{"not tools is None", "not tools"},

In Jinja2, not tools is none means tools is not none (i.e., true when tools is defined and non-null — even if non-empty). But not tools evaluates to false for a non-empty list, which will incorrectly skip tool-use blocks in templates like tool_chat_template_llama3.2_pythonic.jinja when tools is passed as []. The correct replacement should be tools is defined or simply tools (truthy check for non-None), depending on the template's intent. Given that tools is always passed as [] in params, this likely works by accident today but will break if tool-use support is added later.
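
A quick way to see the difference, as a standalone sketch against the Jinja2Cpp API this PR already uses (Template::Load / RenderAsString; the template string here is illustrative, not from the PR):

#include <iostream>
#include <jinja2cpp/template.h>

int main() {
  // A template gated on `not tools` skips its tool block exactly when
  // tools are actually supplied, inverting the intended check.
  jinja2::Template tpl;
  tpl.Load("{% if not tools %}no tools{% else %}tools present{% endif %}");
  // Empty list: prints "no tools"
  std::cout << tpl.RenderAsString({{"tools", jinja2::ValuesList{}}}).value()
            << "\n";
  // Non-empty list: prints "tools present"
  std::cout << tpl.RenderAsString(
                   {{"tools", jinja2::ValuesList{"get_weather"}}})
                   .value()
            << "\n";
}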


2. Special token filtering is unconditional and uses string heuristics
extension/llm/runner/text_llm_runner.cpp:28-61

The is_special_token function unconditionally filters tokens from output regardless of config.echo. This means:

  • When echo=true, users still can't see special tokens even if they explicitly want raw output.
  • The <|...|> regex-style heuristic (text[1] == '|' && text[text.size()-2] == '|') can match legitimate user content that happens to look like a special token.

The filtering should be gated on !config.echo or a separate filter_special_tokens config option, and ideally match by token ID (using the EOS/stop token set) rather than string patterns.
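
One possible shape for that gating (a sketch with stand-in types, not the actual runner code; is_special_token is the helper this PR adds, and the GenerationConfig fields mirror the excerpt above):

#include <cstdio>
#include <functional>
#include <string>

struct GenerationConfig { // stand-in for the runner's config
  bool warming = false;
  bool echo = true;
};

bool is_special_token(const std::string& piece); // added by this PR

std::function<void(const std::string&)> make_wrapped_callback(
    GenerationConfig config) {
  return [config](const std::string& piece) {
    if (config.warming) {
      return;
    }
    // Filter special tokens only when echo is off; echo == true keeps
    // the raw stream (including <|...|> tokens) intact.
    if (config.echo || !is_special_token(piece)) {
      fputs(piece.c_str(), stdout); // real code uses llm::safe_printf
      fflush(stdout);
    }
  };
}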


3. Backward-incompatible default in Python runner
examples/models/llama/runner/generation.py:59

chat_format: str = "llama3",

The Python LlamaRunner defaults to "llama3", while the C++ llama_main defaults to "none". This breaks existing callers of EagerLlamaRunner/NativeLlamaRunner that don't pass chat_format — they'll now get Llama3 chat formatting applied to their prompts unexpectedly. This should default to "none" for consistency and backward compatibility.



Medium Severity

4. parse_chat_format doesn't normalize input
examples/models/llama/runner/chat_formatter.h:100-115

Unlike the extension-level parseChatTemplateType (which lowercases), the example-level parse_chat_format does an exact std::unordered_map lookup. CLI values like "Llama3" or "LLAMA3" silently fall through to ChatFormat::None. Since this is user-facing, it should lowercase before lookup:

inline ChatFormat parse_chat_format(const std::string& format_str) {
  std::string lower = format_str;
  std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
  // ... lookup on lower ...
}


5. --chat_format=jinja without --chat_template_file silently does nothing
examples/models/llama/runner/chat_formatter.h:148-149

When a user passes --chat_format=jinja but forgets --chat_template_file, create_chat_formatter returns a NoChatFormatter. This is a confusing no-op. It should emit an error or at minimum a warning.
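
A fail-fast sketch (the enum and flag names come from this PR; the message text is illustrative):

#include <stdexcept>
#include <string>

enum class ChatFormat { None, Llama3, Gemma3, Jinja }; // as in chat_formatter.h

void validate_chat_format(
    ChatFormat format, const std::string& template_file) {
  if (format == ChatFormat::Jinja && template_file.empty()) {
    throw std::invalid_argument(
        "--chat_format=jinja requires --chat_template_file to be set");
  }
}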

6. SUPPORT_REGEX_LOOKAHEAD set after FetchContent_MakeAvailable
extension/llm/chat_template/CMakeLists.txt:102-106

This cache variable is set after FetchContent_MakeAvailable(jinja2cpp) (line 31), so if it's meant to influence jinja2cpp's regex backend configuration, it won't take effect. Move it before line 31 if it's relevant to jinja2cpp.

7. Duplicate FetchContent_Declare(jinja2cpp) in two CMakeLists
Both extension/llm/chat_template/CMakeLists.txt and examples/models/llama/CMakeLists.txt declare and fetch jinja2cpp independently. The example one doesn't benefit from the nonstd header copying logic, and the duplication creates maintenance burden. Consider having the example CMake just find_package or rely on the parent build's target.

8. _resolve_template_type maps Llama 3.2 variants to Llama3 instead of Llama32
examples/models/llama/runner/generation.py:253-254

if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
    return chat_template_type.Llama3

The C++ parseChatTemplateType maps "llama3.2" variants to ChatTemplateType::Llama32. The Python should do the same for consistency. While they currently share the same template, if they diverge later this will be a subtle bug.


Low Severity / Suggestions

9. Supply-chain pinning: Both FetchContent_Declare calls pin jinja2cpp by Git tag (1.3.2), not by commit SHA. If the tag is rewritten upstream, a different (possibly malicious) revision could be fetched. Consider pinning to an immutable commit SHA.

10. Missing tools in Jinja params: formatConversation doesn't inject a tools variable into the template params. The normalizations in normalizeTemplate paper over this for the embedded templates, but any custom template that references tools (like the shipped tool_chat_template_*.jinja files) will fail at render time or produce unexpected output. Consider passing tools = jinja2::ValuesList() in the params map; a sketch follows this list.

11. kKnownSpecialTokens as a global unordered_set: This is initialized at static-init time. It's small enough that this is fine, but for a library intended for embedded use, consider a constexpr array with linear search or a switch-based approach to avoid heap allocation at startup.

12. Test coverage: The tests validate the formatter in isolation but don't test the end-to-end integration (e.g., that --chat_format=llama3 --echo=false actually suppresses the template prefix in generated output). An integration test would catch issues like the special-token filtering being unconditional.
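
Returning to item 10, a sketch of injecting the defaults before rendering (ValuesMap/ValuesList are Jinja2Cpp types; the key names follow common HuggingFace/vLLM template variables and are assumptions, not this PR's exact code):

#include <jinja2cpp/value.h>

// Defaults merged into the render params so templates that reference
// `tools` (or check add_generation_prompt) render instead of erroring.
jinja2::ValuesMap make_default_params() {
  jinja2::ValuesMap params;
  params["tools"] = jinja2::ValuesList(); // empty list rather than undefined
  params["add_generation_prompt"] = true;
  return params;
}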


CI Failures

The 100 CI failures appear to be build failures stemming from Boost regex headers pulled in by jinja2cpp (-Wshorten-64-to-32 errors on Apple builds) and general compilation issues. The Apple builds explicitly fail on:

boost/regex/v5/c_regex_traits.hpp:461:17: error: implicit conversion loses integer precision: 'long' to 'int' [-Werror,-Wshorten-64-to-32]

This needs to be addressed — either by suppressing the warning for jinja2cpp's transitive dependencies or by finding a version of jinja2cpp/boost that doesn't trigger it on Apple toolchains.


Summary

The feature is well-designed and the test suite covers the core formatter logic well. The main actionable items are: (1) fix the normalizeTemplate semantics for tools, (2) gate special-token filtering on !config.echo, (3) fix the Python default to "none", and (4) resolve the Apple CI build failures from Boost. The inconsistency between parse_chat_format (no case normalization) and parseChatTemplateType (has case normalization) should also be fixed since it's user-facing.

@kirklandsign
Contributor

@claude any suggestions to split this into smaller PR stack?

@claude

claude Bot commented May 4, 2026

Claude finished @kirklandsign's task in 1m 50s


PR Split Suggestion

  • Review changed files and dependencies
  • Identify logical boundaries for splitting
  • Propose PR stack with ordering and dependencies

Here's how I'd split this into a 4-PR stack, ordered by dependency:


PR 1: Core Jinja2Cpp chat formatter library (foundation)

Files:

  • extension/llm/chat_template/CMakeLists.txt — FetchContent for jinja2cpp
  • extension/llm/chat_template/chat_templates.h — ChatTemplateType enum, embedded templates, ModelTokens
  • extension/llm/chat_template/targets.bzl / BUCK
  • extension/llm/runner/chat_types.h — ChatMessage, ChatConversation structs
  • extension/llm/runner/jinja_chat_formatter.h / .cpp — JinjaChatFormatter class + parseChatTemplateType
  • extension/llm/runner/CMakeLists.txt changes (link against jinja2cpp)
  • extension/llm/runner/targets.bzl changes (new srcs/deps)
  • extension/llm/runner/test/test_jinja_chat_formatter.cpp + test build files
  • CMakeLists.txt — minimal changes to add extension/llm/chat_template subdir

Why first: This is the self-contained library with no downstream impact. It can be reviewed/tested in isolation. Addresses the CI build issues (Boost -Wshorten-64-to-32 on Apple) early.


PR 2: Special token filtering + EOS handling in TextLLMRunner

Files:

  • extension/llm/runner/text_llm_runner.cpp — is_special_token(), kKnownSpecialTokens, wrapped_callback filtering logic
  • extension/llm/runner/llm_runner_helper.cpp — EOS token merge logic changes

Why separate: These are behavior changes to the shared runner that affect all users of TextLLMRunner. They deserve focused review since (a) the special-token filtering is unconditional regardless of config.echo (flagged in previous reviews), and (b) the EOS merge logic touches tokenizer/model metadata interaction. Keeping it isolated makes it easy to evaluate backward-compat impact.


PR 3: Python bindings + Python runner support

Files:

  • extension/llm/runner/pybindings.cpp — expose ChatMessage, ChatConversation, ChatTemplateType, JinjaChatFormatter
  • extension/llm/runner/__init__.py / _llm_runner.pyi
  • examples/models/llama/runner/generation.py — _resolve_template_type, _format_prompt, chat formatting in generate()
  • examples/models/llama/runner/eager.py — CLI args
  • extension/llm/runner/test/test_runner_pybindings.py

Why separate: Python changes are independently testable, and reviewers may want different eyes on the Python vs C++ paths. Also isolates the backward-compat concern (default chat_format="llama3" in Python vs "none" in C++).


PR 4: Llama example CLI integration + templates + docs

Files:

  • examples/models/llama/runner/chat_formatter.h — ChatFormat enum, parse_chat_format, create_chat_formatter factory
  • examples/models/llama/main.cpp — new gflags, formatter wiring
  • examples/models/llama/CMakeLists.txt / runner/CMakeLists.txt / runner/targets.bzl
  • extension/llm/runner/templates/*.jinja — sample templates
  • examples/models/llama/README.md / extension/llm/runner/README.md
  • shim_et/xplat/executorch/build/build_variables.bzl

Why last: This is the user-facing integration that depends on all prior PRs. It's also the most reviewable in isolation since it's example code — lower blast radius.


Merge order

PR 1 (library) → PR 2 (runner behavior) → PR 3 (python) → PR 4 (CLI + docs)

PRs 2 and 3 are independent of each other (both depend only on PR 1), so they could land in parallel if desired:

         ┌─ PR 2 (runner behavior) ─┐
PR 1 ───┤                           ├──► PR 4 (CLI + docs)
         └─ PR 3 (python bindings) ─┘

Notes

  • PR 1 is the critical path — the Apple CI failures from Boost/jinja2cpp headers need to be fixed here (either suppress -Wshorten-64-to-32 for the dependency, pin to a commit SHA, or vendor a patched version).
  • PR 2 should address the review feedback about gating is_special_token on !config.echo before landing.
  • PR 3 should fix the default from chat_format="llama3" to "none" for backward compat.
  • Each PR is independently buildable and testable, which will make CI green on each step.

@seyeong-han
Contributor Author

Per @kirklandsign's request to split this into a smaller PR stack, this work is now broken into 4 reviewable PRs:

| PR | Subject | LOC |
|----|---------|-----|
| 1/4 — #19533 | Library + tests (extension/llm/chat_template/* + JinjaChatFormatter) | +821 / -28 |
| 2/4 — #19534 | TextLLMRunner echo-gated special-token filter + EOS merge | +58 / -6 (incremental) |
| 3/4 — #19535 | Python bindings + Python LlamaRunner integration | +256 / -12 (incremental) |
| 4/4 — #19536 | llama_main CLI flags + chat_formatter wrapper + universal Jinja docs | +424 / -84 (incremental) |

Each PR is reviewable in isolation. PR-A is the foundation (no behavior change); PRs B, C, and D each layer on a focused scope. The stack also addresses the reviewer feedback from this PR and the earlier @claude review:

  • ✅ normalizeTemplate not tools is none → tools (PR-A, addresses Copilot review)
  • ✅ vLLM-style params (tools=[], date_string, etc.) injected so any HF / vLLM template works (PR-A)
  • SUPPORT_REGEX_LOOKAHEAD ordered before FetchContent_MakeAvailable (PR-A)
  • chat_templates.h installed (PR-A)
  • ✅ Echo-gated special-token filter (PR-B, addresses Copilot review)
  • ✅ EOS metadata merge instead of clear (PR-B)
  • ✅ Python chat_format="none" default + Llama32 mapping (PR-C, addresses Copilot review)
  • parse_chat_format case-insensitive + trim (PR-D, addresses Copilot review)
  • chat_format=jinja w/o template_file throws (PR-D, addresses Copilot review)
  • ✅ FetchContent guard in llama example so no duplicate declaration (PR-D)
  • ✅ Sample vLLM .jinja templates removed; docs reference HF / vLLM template directories (per @metascroy)
  • ✅ Universal Jinja support: any HuggingFace / vLLM Jinja template works via --chat_template_file, with regression tests

Closing this PR in favor of the stack.

cc @kirklandsign @larryliu0820 @metascroy @lucylq @mergennachin

seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()'
  with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and
  generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for
  Llama-format <|...|> tokens. wrapped_callback now suppresses these
  from the printed stream when GenerationConfig.echo == false. When
  echo == true, raw model output (including chat-template tokens) is
  emitted unchanged - this preserves backward compatibility for users
  who explicitly want to see raw tokens.

* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES
  the tokenizer's primary eos_tok() with any additional EOS IDs the
  model metadata exports under kEosIds, instead of clearing the set
  when metadata is present. This is correct for HF-tokenizer models
  (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model
  also wants <|eot_id|> as a stop token. Also logs the primary tok
  and only logs metadata IDs that are newly inserted.

Why this is split out
---------------------
These are runner-behavior changes that affect ALL TextLLMRunner users,
not just the new chat-template path. They deserve focused review for
backward-compat impact (echo gating) and EOS-set semantics (merge vs
clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library) — only for stack ordering; this PR has no
            include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
…ration

Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file
    static factories, format(prompt, system_prompt) and
    format_conversation(ChatConversation) methods, includes_bos().
* extension/llm/runner/__init__.py: re-exports the new bindings via
  __all__.
* extension/llm/runner/_llm_runner.pyi: type stubs for the new
  classes so consumers get IDE / mypy support.
* extension/llm/runner/test/test_runner_pybindings.py: Python tests
  covering the new bindings end-to-end.
* examples/models/llama/runner/generation.py: LlamaRunner now accepts
  chat_format / system_prompt / chat_template_file kwargs and exposes
  _format_prompt + chat_completion using the JinjaChatFormatter.
  Default chat_format is 'none' (matches llama_main, preserves
  backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner
  callers). _resolve_template_type maps 'llama3.2' / 'llama32' /
  'llama3_2' to ChatTemplateType.Llama32 (consistent with C++
  parseChatTemplateType).
* examples/models/llama/runner/eager.py: adds --chat_template_file CLI
  flag for chat mode.

Why this is split out
---------------------
Python changes are independently testable and reviewers may want
different eyes on the Python vs C++ paths. Also isolates the
backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library headers/symbols).

Original PR (full stack): pytorch#16987
seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
…Jinja docs

Part 4 of the chat-template support stack split out of pytorch#16987. This is
the user-facing surface that wires everything together.

What this PR adds
-----------------
* examples/models/llama/runner/chat_formatter.h: Example-local
  ChatFormatter abstraction with NoChatFormatter and a
  JinjaChatFormatterAdapter wrapping
  executorch::extension::llm::JinjaChatFormatter. parse_chat_format is
  case-insensitive and trims whitespace, so 'Llama3', ' llama3 ',
  'LLAMA3' all map correctly. create_chat_formatter throws
  std::invalid_argument when chat_format=jinja is passed without
  --chat_template_file (no more silent no-op).
* examples/models/llama/main.cpp: Adds --chat_format,
  --chat_template_file, --system_prompt, --echo flags. Wraps the
  prompt with the chat formatter, catches invalid_argument /
  std::exception from formatter creation with clear error messages.
  Wires GenerationConfig.echo from the new --echo flag.
* examples/models/llama/runner/CMakeLists.txt + targets.bzl: link
  llama_runner against jinja2cpp (transitive include in chat_formatter.h).
* examples/models/llama/CMakeLists.txt: add a guarded
  FetchContent_Declare(jinja2cpp) so the example builds standalone
  (when the parent build hasn't already added jinja2cpp via
  extension/llm/chat_template), without redeclaring when it has.
* examples/models/llama/README.md: documents the new flags AND the
  recommended workflow of passing any HuggingFace / vLLM Jinja
  template via --chat_template_file (universal Jinja support).
* extension/llm/runner/README.md: documents universal Jinja support
  for the LLM runner library — points at vLLM examples and HF
  tokenizer_config.json files as supported template sources.

Why this is split out
---------------------
This is the user-facing CLI integration that depends on PRs A and C.
It's the most reviewable in isolation since it's example code with
lower blast radius — reviewers can focus on the CLI ergonomics and
docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per reviewer feedback);
documentation here points users to vLLM's examples directory and
HuggingFace tokenizer_config.json files, which the universal Jinja
support handles directly.

Depends on:
  - PR-A: extension/llm/chat_template/* + JinjaChatFormatter library
  - PR-C: chat_formatter.h includes JinjaChatFormatter (header-only),
          but generation.py / eager.py changes are independent

Original PR (full stack): pytorch#16987
seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based
JinjaChatFormatter, supporting chat-types, embedded Llama3/Llama3.2/Gemma3
templates, build glue (CMake/Buck), and a focused C++ unit-test suite.
This PR is reviewable in isolation — it has no behavior change for any
existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

  1/4 (this PR)  Library + tests
  2/4            TextLLMRunner echo-gated special-token filter + EOS merge
  3/4            Python bindings + Python LlamaRunner integration
  4/4            llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------
* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt,
  targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the
  ChatTemplateType enum + ModelTokens. The CMake file FetchContent's
  Jinja2Cpp 1.3.2, with SUPPORT_REGEX_LOOKAHEAD set BEFORE
  FetchContent_MakeAvailable so it propagates correctly, plus header
  staging for nonstd headers that some Jinja2Cpp installations omit.
  Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the
  Universal Jinja chat formatter that supports any HuggingFace / vLLM
  chat template, not just the embedded ones. Loadable via fromTemplate
  (built-in), fromString (any string), or fromFile (any .jinja file).
  formatConversation injects vLLM/HuggingFace-standard params (tools=[],
  tool_choice=None, date_string, chat_template_kwargs) so any template
  that references those variables renders correctly.
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp:
  notably, 'not tools is none' maps to 'tools' (truthy check), preserving
  the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link
  extension_llm_runner against jinja2cpp (PRIVATE) and define
  EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt,
  targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3
  embedded templates, parseChatTemplateType (case-insensitive), and
  three universal-Jinja regression tests:
    - generic HuggingFace-style template (proves it's not Llama-specific)
    - tools-aware template (validates the tools=[] default)
    - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template)
  guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds
  jinja_chat_formatter.cpp to the runner sources.

Notes
-----
* No behavior change for existing TextLLMRunner / MultimodalRunner users:
  the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback);
  documentation in the follow-up CLI PR points users to vLLM's examples
  directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported Stale PRs inactive for over 60 days



5 participants