[llama] Add chat format support for Llama 3 Instruct models#16987
seyeong-han wants to merge 1 commit.
Conversation
🔗 Helpful links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16987
Note: Links to docs will display an error until the docs builds have been completed.

❌ 100 New Failures, 2 Cancelled Jobs — as of commit 2d09315 with merge base 4930e7c.

NEW FAILURES - The following jobs have failed: …

CANCELLED JOBS - The following jobs were cancelled. Please retry: …

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary
Result:

```
CMAKE_POLICY_VERSION_MINIMUM=3.5 make llama-cpu

cmake-out/examples/models/llama/llama_main \
  --model_path llama/Llama-3.2-1B-Instruct/model.pte \
  --tokenizer_path llama/tokenizer.json \
  --chat_template_file extension/llm/runner/templates/tool_chat_template_llama3.2_pythonic.jinja \
  --prompt "What is the capital of France?"

I tokenizers:regex.cpp:27] Registering override fallback regex
I tokenizers:hf_tokenizer.cpp:152] Setting up normalizer...
I tokenizers:hf_tokenizer.cpp:158] Normalizer field is null, skipping
I tokenizers:hf_tokenizer.cpp:170] Setting up pretokenizer...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1770067519.722113 161182 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n...': invalid perl operator: (?!
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!
This may be ok if a fallback regex is used.
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:174] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:190] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:250] Loaded 280147 BPE merge rules
I tokenizers:hf_tokenizer.cpp:262] Built merge ranks map with 127744 entries
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|end_of_text|>' (id=128001)
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|eom_id|>' (id=128008)
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|eot_id|>' (id=128009)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
You are a helpful assistant with tool calling capabilities. Only reply with a tool call if the function exists in the library provided by the user. If it doesn't exist, just reply directly in natural language. When you receive a tool call response, use the output to format an answer to the original user question.<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"Bonjour, I'm happy to help. The capital of France is Paris."
PyTorchObserver {"prompt_tokens":104,"generated_tokens":16,"model_load_start_ms":1770067520100,"model_load_end_ms":1770067522061,"inference_start_ms":1770067522061,"inference_end_ms":1770067524284,"prompt_eval_end_ms":1770067523133,"first_token_ms":1770067523133,"aggregate_sampling_time_ms":5,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```
```
Llama3, // Llama 3.x Instruct models
Gemma3, // Gemma 3 Instruct models
```
Why are these 2 special?
I just wanted to cover Llama3 and Gemma3 for the initial support; I will cover more models later, such as Qwen and the other LLMs ExecuTorch supports.
```
)

parser.add_argument(
    "--chat_format",
```
Is chat_format an existing flag?
Should we standardize on chat_template_file?
Updated: eager.py now uses --chat_template_file only. The --chat_format flag was removed for consistency with the C++ runner.
If you build with `make llama-cpu` and hit a RapidJSON CMake error, run it as:

```
CMAKE_POLICY_VERSION_MINIMUM=3.5 make llama-cpu
```
Is this error always hit?
What cmake version do we require in default build?
The CMAKE_POLICY_VERSION_MINIMUM=3.5 workaround is required because RapidJSON’s CMakeLists uses a minimum version that CMake 4+ no longer accepts, and Jinja2Cpp pulls RapidJSON in its internal deps. This is a dependency‑compatibility workaround for specific build environments, not a permanent ExecuTorch requirement.
```
@@ -0,0 +1,123 @@
{#- Begin-of-sequence token to start the model prompt -#}
```
Are these templates available from download from HF?
Do we need to check them in?
We won’t check in the vLLM templates. Instead we’ll reference/download them from Hugging Face and document this in the LLM runner README with a --chat_template_file example.
```
auto value = eos_id.toScalar().to<int64_t>();
eos_ids.emplace(value);
ET_LOG(Info, "eos_id = %" PRId64, value);
if (eos_ids.find(value) == eos_ids.end()) {
```
Do EOS tokens still come from model metadata? Is there a standard way to get them from the HF tokenizer?
Yes—EOS still comes from model metadata if present, but we first read it from the tokenizer. In llm_runner_helper.cpp we use tokenizer->eos_tok() and tokenizer->stop_tokens() (populated from HF tokenizer.json), then merge any additional EOS IDs from model metadata. So the standard HF path is supported, and metadata is an additive fallback when the model exports extra stop IDs.
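For concreteness, here is a rough sketch of that merge logic. The accessor names (eos_tok(), stop_tokens(), kEosIds) come from this thread; the exact types and signatures are assumptions, not the PR's verbatim code.

```cpp
// Hedged sketch of the EOS-merging behavior described above; types and
// accessors are assumptions based on this discussion, not verified APIs.
#include <cstdint>
#include <unordered_set>
#include <vector>

std::unordered_set<uint64_t> collect_eos_ids(
    uint64_t tokenizer_eos_tok,                                // tokenizer->eos_tok()
    const std::unordered_set<uint64_t>& tokenizer_stop_tokens, // tokenizer->stop_tokens()
    const std::vector<uint64_t>& metadata_eos_ids) {           // model metadata kEosIds
  // Start from the tokenizer's view (populated from HF tokenizer.json).
  std::unordered_set<uint64_t> eos_ids = tokenizer_stop_tokens;
  eos_ids.insert(tokenizer_eos_tok);
  // Metadata is additive: merge any extra stop IDs the model exports
  // (e.g. <|eot_id|> for Llama 3.x) instead of replacing the set.
  eos_ids.insert(metadata_eos_ids.begin(), metadata_eos_ids.end());
  return eos_ids;
}
```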
```
@@ -0,0 +1,38 @@
/*
```
This file is just testing the 2 special cases of llama3 and gemma3?
Is there a test for the more general jinja file?
For now we only test the supported model templates (Llama3/Gemma3). A generic Jinja file smoke test can be added later if needed, but it’s not required to validate supported templates in this PR.
| " warming=" + (config.warming ? "True" : "False") + ">"; | ||
| }); | ||
|
|
||
| // Bind chat template helpers |
There was a problem hiding this comment.
We don't have bindings for the text LLM runner, right? Is this being used in the multimodal runner?
Currently this is only used by the multimodal runner, but it will be extended to text_llm_runner in the next PR.
```
    string_view.hpp
    "${_jinja2cpp_nonstd_root}/string-view-lite/include"
  )
endif()
```
This whole section should live inside extension/llm/chat_template/ directory.
Moved the Jinja2Cpp FetchContent and nonstd/RapidJSON workaround into extension/llm/chat_template/CMakeLists.txt; the root now only adds that subdirectory.
```
@@ -0,0 +1,51 @@
#pragma once
```
I think we should start a new directory extension/llm/chat_template/ and put these files inside.
Done. chat_templates.h now lives under extension/llm/chat_template/ and all includes were updated.
```
if(TARGET jinja2cpp)
  install(
    TARGETS jinja2cpp
    EXPORT ExecuTorchTargets
    DESTINATION ${CMAKE_INSTALL_LIBDIR}
  )
endif()
```
Can we move this to chat_template/CMakeLists.txt as well?
larryliu0820 left a comment:
I think this is a good start! Thank you for adding this
@seyeong-han has imported this pull request. If you are a Meta employee, you can view this in D92221320.
Summary:
Without chat formatting, Instruct models behave like base models and never generate end-of-turn tokens.
So, I added chat template support to the `llama_main` runner for Llama 3 Instruct models.
## Problem
When running Llama 3 Instruct models without proper chat formatting:
```bash
cmake-out/examples/models/llama/llama_main \
--model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
--tokenizer_path="llama/original/tokenizer.model" \
--prompt="What's the capital of France?"
Paris is incorrect answer so was stated so some major tech companies may choose the other city, and for instance the CEO of Apple is Tim Cook and he is from California, and the CEO of Google is Sundar Pichai and he is from India.
```
==> Generation only stops when it hits the max_tokens limit.
### Solution
New CLI flags enable proper Instruct model behavior:
| Flag | Description | Default |
|------|-------------|---------|
| `--chat_format` | Chat template (llama3, none) | `none` |
| `--system_prompt` | System prompt for assistant behavior | (empty) |
| `--echo` | Echo input prompt in output | `true` |
### Examples
#### Basic Usage
```bash
cmake-out/examples/models/llama/llama_main \
--model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
--tokenizer_path="llama/original/tokenizer.model" \
--chat_format="llama3" \
--prompt="What's the capital of France?"
```
**Output:**
```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>
What's the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris.
```
#### Clean Output (Recommended for Apps)
```bash
cmake-out/examples/models/llama/llama_main \
--model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
--tokenizer_path="llama/original/tokenizer.model" \
--chat_format="llama3" \
--echo=false \
--prompt="What's the capital of France?"
```
**Output:**
```
The capital of France is Paris.
```
#### With System Prompt
```bash
cmake-out/examples/models/llama/llama_main \
--model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
--tokenizer_path="llama/original/tokenizer.model" \
--chat_format="llama3" \
--system_prompt="You are a helpful assistant. Be concise and respond in one sentence." \
--echo=false \
--prompt="Explain quantum computing"
```
**Output:**
```
Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously to solve complex problems faster than classical computers.
```
#### Backward Compatible (No Chat Format)
```bash
# For base models or pre-formatted prompts
cmake-out/examples/models/llama/llama_main \
--model_path="llama/Llama-3.2-1B/model.pte" \
--tokenizer_path="llama/original/tokenizer.model" \
--chat_format="none" \
--prompt="Once upon a time"
```
### Files Changed
| File | Description |
|------|-------------|
| `examples/models/llama/runner/chat_formatter.h` | **NEW** - ChatFormatter for Llama 3 |
| `examples/models/llama/main.cpp` | Added CLI flags |
| `examples/models/llama/README.md` | Documentation with examples |
| `extension/llm/runner/text_llm_runner.cpp` | Special token filtering |
| `extension/llm/runner/llm_runner_helper.cpp` | EOS token merge logic |
### Before/After Comparison
| Metric | Before (no chat format) | After (--chat_format=llama3 --echo=false) |
|--------|------------------------|------------------------------------------|
| Tokens generated | 120 (max limit) | 7 (stops at EOS) |
| Output | Rambling continuation | Clean answer |
| Special tokens | Visible in output | Filtered out |
Pull Request resolved: pytorch#16987
Test Plan:
- [x] Build with `make llama-cpu`
- [x] Test `--chat_format=llama3` with Llama-3.2-1B-Instruct
- [x] Verify generation stops at `<|eot_id|>` token
- [x] Test `--echo=false` produces clean output without special tokens
- [x] Test `--system_prompt` affects model behavior
- [x] Backward compatible with `--chat_format=none` (default)
Differential Revision: D92221320
Pulled By: seyeong-han
Force-pushed from 19cca81 to 2d09315.
@seyeong-han has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92221320.
Thanks! Do we plan to integrate this into the LLM runner?
Pull request overview
Adds Jinja2-based chat template formatting support to the ExecuTorch LLM runner stack so Instruct models (notably Llama 3.x / 3.2) generate proper end-of-turn behavior, and exposes these capabilities through the C++ runner, Python bindings, and llama example apps.
Changes:

- Introduces JinjaChatFormatter + chat types + embedded templates (Llama3/Llama3.2/Gemma3), with C++ and Python bindings plus tests (a usage sketch follows this list).
- Adds llama example CLI + Python runner support for chat formatting (system prompt, template file, echo control) and ships sample templates.
- Wires in build system changes (CMake/Buck/Bazel-ish) and updates EOS handling / output token filtering.
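A brief usage sketch of the new formatter API, to orient the per-file review below. The names (JinjaChatFormatter, fromTemplate, fromFile, format, ChatTemplateType) appear in this PR; the exact signatures and return types are assumptions.

```cpp
// Hedged sketch, not verbatim PR code: factory/method names are taken from
// the PR text; signatures and return types are assumed for illustration.
#include <string>

#include <executorch/extension/llm/runner/jinja_chat_formatter.h>

using executorch::extension::llm::ChatTemplateType;
using executorch::extension::llm::JinjaChatFormatter;

std::string build_llama3_prompt(
    const std::string& user_prompt,
    const std::string& system_prompt) {
  // Embedded Llama3 template; JinjaChatFormatter::fromFile("my.jinja") would
  // load a custom HuggingFace/vLLM template instead.
  auto formatter = JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama3);
  return formatter.format(user_prompt, system_prompt);
}
```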
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 11 comments.
Summary per file:
| File | Description |
|---|---|
| shim_et/xplat/executorch/build/build_variables.bzl | Adds new runner source to xplat build variables |
| extension/llm/runner/text_llm_runner.cpp | Filters special tokens from printed output |
| extension/llm/runner/test/test_runner_pybindings.py | Adds Python tests for new chat template bindings |
| extension/llm/runner/test/test_jinja_chat_formatter.cpp | New C++ unit tests for chat formatter + parsing |
| extension/llm/runner/test/targets.bzl | Registers new C++ test target |
| extension/llm/runner/test/CMakeLists.txt | Adds new test source to CMake test suite |
| extension/llm/runner/test/BUCK | Updates fbcode test target definitions/cleanup |
| extension/llm/runner/templates/tool_chat_template_llama3.2_pythonic.jinja | New sample Llama3.2 tool chat template |
| extension/llm/runner/templates/tool_chat_template_gemma3_pythonic.jinja | New sample Gemma3 tool chat template |
| extension/llm/runner/targets.bzl | Exports new headers/sources + adds template deps |
| extension/llm/runner/pybindings.cpp | Exposes chat types/template formatter to Python |
| extension/llm/runner/llm_runner_helper.cpp | Tweaks EOS IDs collection/logging |
| extension/llm/runner/jinja_chat_formatter.h | New Jinja chat formatter API |
| extension/llm/runner/jinja_chat_formatter.cpp | Implements template loading/normalization/rendering |
| extension/llm/runner/chat_types.h | New C++ structs for messages/conversations |
| extension/llm/runner/_llm_runner.pyi | Adds Python typing for new bindings |
| extension/llm/runner/`__init__.py` | Exports new bindings from package |
| extension/llm/runner/README.md | Documents chat templates usage for runner/llama_main |
| extension/llm/runner/CMakeLists.txt | Links runner lib against jinja2cpp + defines macro |
| extension/llm/chat_template/targets.bzl | New Buck target exporting embedded templates header |
| extension/llm/chat_template/chat_templates.h | New embedded templates + model token metadata |
| extension/llm/chat_template/CMakeLists.txt | Fetches/builds jinja2cpp for CMake builds |
| extension/llm/chat_template/BUCK | Registers chat_template targets |
| examples/models/llama/runner/targets.bzl | Exports new chat_formatter header |
| examples/models/llama/runner/generation.py | Adds chat formatting support to Python runner |
| examples/models/llama/runner/eager.py | Adds CLI args for system prompt/template file |
| examples/models/llama/runner/chat_formatter.h | New C++ chat formatter wrapper for llama_main |
| examples/models/llama/runner/CMakeLists.txt | Links llama_runner against jinja2cpp |
| examples/models/llama/main.cpp | Adds chat-format CLI flags + integrates echo into config |
| examples/models/llama/README.md | Documents new CLI flags + examples |
| examples/models/llama/CMakeLists.txt | Fetches jinja2cpp for standalone llama example build |
| CMakeLists.txt | Adds FetchContent include, installs jinja2cpp if present, adds chat_template subdir |
```
if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
    return chat_template_type.Llama3
```
_resolve_template_type maps Llama 3.2 variants ("llama3.2", "llama32", "llama3_2") to ChatTemplateType.Llama3, even though the bindings expose a distinct ChatTemplateType.Llama32 and the C++ parseChatTemplateType maps these variants to Llama32. This inconsistency can be surprising for callers; map these variants to ChatTemplateType.Llama32 to keep behavior consistent across languages.
Suggested change:

```
if normalized == "llama3":
    return chat_template_type.Llama3
if normalized in ("llama3.2", "llama32", "llama3_2"):
    return chat_template_type.Llama32
```
```
include(FetchContent)
cmake_policy(SET CMP0077 NEW)

FetchContent_Declare(
  jinja2cpp
  GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
  GIT_TAG 1.3.2
  GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS OFF CACHE BOOL "" FORCE)
set(JINJA2CPP_BUILD_SHARED OFF CACHE BOOL "" FORCE)
set(JINJA2CPP_INSTALL OFF CACHE BOOL "" FORCE)

FetchContent_MakeAvailable(jinja2cpp)
if(NOT TARGET jinja2cpp)
  message(FATAL_ERROR "Jinja2Cpp target not found after FetchContent.")
endif()
```
SUPPORT_REGEX_LOOKAHEAD is set after FetchContent_MakeAvailable(jinja2cpp), so it won't affect any dependencies that are configured/built as part of jinja2cpp's CMake (e.g., if the option is meant to influence its regex backend). If this option is required to avoid the lookahead compilation error, set it in the cache before calling FetchContent_MakeAvailable.
```
max_batch_size: int,
use_kv_cache: bool,
vocab_size: int,
chat_format: str = "llama3",
system_prompt: str = "",
chat_template_file: Optional[str] = None,
device: str = "cpu",
```
The default chat_format is now "llama3", which changes behavior for all existing callers that instantiate LlamaRunner without specifying chat formatting (e.g., NativeLlamaRunner, EagerLlamaRunner). This is likely a backward-incompatible change (and it also diverges from llama_main where --chat_format defaults to none). Consider defaulting to "none" here as well, or inferring the format from model metadata/name so base models don't get wrapped in an instruct template by default.
```
inline ChatFormat parse_chat_format(const std::string& format_str) {
  static const std::unordered_map<std::string, ChatFormat> format_map = {
      {"none", ChatFormat::None},
      {"llama3", ChatFormat::Llama3},
      {"llama3.2", ChatFormat::Llama3},
      {"llama32", ChatFormat::Llama3},
      {"llama3_2", ChatFormat::Llama3},
      {"gemma3", ChatFormat::Gemma3},
      {"jinja", ChatFormat::Jinja},
  };
  auto it = format_map.find(format_str);
  if (it != format_map.end()) {
    return it->second;
  }
  return ChatFormat::None;
}
```
parse_chat_format does an exact lookup without normalizing case/whitespace, so values like "Llama3" or " llama3 " will silently fall back to None. Since this is wired to a CLI flag, it should be more forgiving (e.g., lowercasing and trimming before the lookup).
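A minimal sketch of the suggested normalization, as a hypothetical helper; only the trim-and-lowercase idea comes from the comment above.

```cpp
// Hypothetical helper sketching the suggested fix: trim and lowercase the
// flag value before the map lookup, so "Llama3" and " llama3 " both resolve.
#include <algorithm>
#include <cctype>
#include <string>

inline std::string normalize_format_string(std::string s) {
  auto not_space = [](unsigned char c) { return !std::isspace(c); };
  // Trim leading and trailing whitespace.
  s.erase(s.begin(), std::find_if(s.begin(), s.end(), not_space));
  s.erase(std::find_if(s.rbegin(), s.rend(), not_space).base(), s.end());
  // Lowercase in place.
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  return s;
}
// Usage: auto it = format_map.find(normalize_format_string(format_str));
```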
```
inline std::unique_ptr<ChatFormatter> create_chat_formatter(
    ChatFormat format,
    const std::string& template_file = "") {
  using executorch::extension::llm::ChatTemplateType;
  using executorch::extension::llm::JinjaChatFormatter;

  if (!template_file.empty()) {
    return std::make_unique<JinjaChatFormatterAdapter>(
        JinjaChatFormatter::fromFile(template_file));
  }

  switch (format) {
    case ChatFormat::Llama3:
      return std::make_unique<JinjaChatFormatterAdapter>(
          JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama3));
    case ChatFormat::Gemma3:
      return std::make_unique<JinjaChatFormatterAdapter>(
          JinjaChatFormatter::fromTemplate(ChatTemplateType::Gemma3));
    case ChatFormat::Jinja:
      return std::make_unique<NoChatFormatter>();
    case ChatFormat::None:
    default:
      return std::make_unique<NoChatFormatter>();
  }
```
ChatFormat::Jinja currently returns NoChatFormatter when template_file is empty, but the CLI supports --chat_format=jinja and the README describes jinja as “custom template from file”. This leads to a confusing no-op when the user forgets --chat_template_file. Consider emitting an error (or at least a warning) when format == ChatFormat::Jinja and no template file is provided.
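A sketch of the suggested guard, written as a standalone helper rather than the PR's actual switch branch; the types come from the quoted snippet above. (The later split-out PR, part 4, describes adopting exactly this std::invalid_argument behavior.)

```cpp
// Hedged sketch of the suggested fix, not the PR's verbatim code: fail loudly
// when --chat_format=jinja is selected without a template file. ChatFormatter,
// JinjaChatFormatterAdapter and JinjaChatFormatter are from the quoted code.
#include <memory>
#include <stdexcept>
#include <string>

std::unique_ptr<ChatFormatter> make_jinja_formatter(
    const std::string& template_file) {
  if (template_file.empty()) {
    throw std::invalid_argument(
        "--chat_format=jinja requires --chat_template_file");
  }
  return std::make_unique<JinjaChatFormatterAdapter>(
      JinjaChatFormatter::fromFile(template_file));
}
```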
```
// Wrap the token_callback with print function
std::function<void(const std::string&)> wrapped_callback =
    [token_callback, config](const std::string& piece) {
      if (!config.warming) {
        // Filter out special tokens when not echoing or for cleaner output
        if (!is_special_token(piece)) {
          llm::safe_printf(piece.c_str());
          fflush(stdout);
        }
      }
```
wrapped_callback filters out any piece that looks like a special token regardless of GenerationConfig.echo. This means tokens like <|eot_id|> will be suppressed even when users explicitly want to see raw model output (and it can also hide literal text that happens to match <|...|>). Consider gating this filtering on !config.echo (or adding an explicit filter_special_tokens config), and/or filtering by token IDs (e.g., stop token set) rather than a string-pattern heuristic.
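A sketch of the gating this comment proposes, adapted from the quoted callback. The trailing token_callback forwarding is an assumption, since the quoted hunk cuts off before it.

```cpp
// Hedged sketch of the suggested echo-gated filter, not the PR's verbatim
// code: suppress special tokens only when the caller did not request echo.
std::function<void(const std::string&)> wrapped_callback =
    [token_callback, config](const std::string& piece) {
      if (!config.warming) {
        const bool suppress = !config.echo && is_special_token(piece);
        if (!suppress) {
          llm::safe_printf(piece.c_str());
          fflush(stdout);
        }
      }
      token_callback(piece); // assumed: forwarded as in the original runner
    };
```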
| {"not tools is none", "not tools"}, | ||
| {"not tools is None", "not tools"}, |
There was a problem hiding this comment.
In normalizeTemplate, the replacement "not tools is none" -> "not tools" changes semantics when tools is a non-empty list. In Jinja, not tools is none is effectively tools is not none (true even when tools is non-empty), but not tools becomes false for non-empty lists, which will skip tool blocks in templates like tool_chat_template_llama3.2_pythonic.jinja. Adjust this normalization to preserve the intended meaning (e.g., map not tools is none to tools or to an explicit non-empty check, depending on the desired behavior).
| {"not tools is none", "not tools"}, | |
| {"not tools is None", "not tools"}, | |
| {"not tools is none", "tools"}, | |
| {"not tools is None", "tools"}, |
```
if(NOT EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER)
  return()
endif()

include(FetchContent)
cmake_policy(SET CMP0077 NEW)

FetchContent_Declare(
  jinja2cpp
  GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
  GIT_TAG 1.3.2
  GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS
    OFF
```
chat_templates.h is included by public runner headers (e.g., jinja_chat_formatter.h), but this CMake subdir doesn't install the extension/llm/chat_template/*.h headers. As a result, make install / consumers of the installed SDK may fail to compile due to missing headers. Add an install(DIRECTORY ...) for this header (and/or define/install an interface target that exports the include dir).
```
FetchContent_Declare(
  jinja2cpp
  GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
  GIT_TAG 1.3.2
  GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS OFF CACHE BOOL "" FORCE)
set(JINJA2CPP_BUILD_SHARED OFF CACHE BOOL "" FORCE)
set(JINJA2CPP_INSTALL OFF CACHE BOOL "" FORCE)

FetchContent_MakeAvailable(jinja2cpp)
if(NOT TARGET jinja2cpp)
```
This CMake file downloads and builds the jinja2cpp dependency at configure/build time via FetchContent_Declare from a remote Git repository pinned only to a Git tag, without any checksum or signature verification. If the upstream repository or the 1.3.2 tag is compromised or retagged, a malicious actor could inject arbitrary code into your build artifacts and any environment where this target is built. To harden the supply chain, vendor this dependency or fetch it from a trusted mirror, pin GIT_TAG to an immutable commit SHA, and add integrity verification (or otherwise avoid network-based fetching) for production builds.
```
include(FetchContent)
cmake_policy(SET CMP0077 NEW)
FetchContent_Declare(
  jinja2cpp
  GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
  GIT_TAG 1.3.2
  GIT_SUBMODULES_RECURSE TRUE
)
set(JINJA2CPP_BUILD_TESTS OFF CACHE BOOL "" FORCE)
set(JINJA2CPP_BUILD_SHARED OFF CACHE BOOL "" FORCE)
set(JINJA2CPP_INSTALL OFF CACHE BOOL "" FORCE)
FetchContent_MakeAvailable(jinja2cpp)
```
This example CMake configuration pulls jinja2cpp directly from GitHub at configure/build time using FetchContent_Declare with only a Git tag and no integrity verification. A compromise of the upstream repository or the 1.3.2 tag could transparently inject malicious code into developers’ builds or any CI system that runs these example targets. Consider vendoring the dependency or using an internal mirror, pinning to an immutable commit SHA, and adding checksum/signature verification instead of relying on a network fetch to an unverified third-party Git repo.
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as …
@claude review
Claude finished @kirklandsign's task in 2m 29s — View job

Code Review: Chat Format Support for Llama 3 Instruct Models

Overall this is a useful feature. The architecture is clean — a Jinja2Cpp-backed formatter behind an abstract interface.

High Severity

1. Template normalization inverts meaning:

```
{"not tools is none", "not tools"},
{"not tools is None", "not tools"},
```

In Jinja2, `not tools is none` behaves like `tools is not none`, so this rewrite flips the result for non-empty tool lists.

2. Special token filtering is unconditional and uses string heuristics. The filtering should be gated on `config.echo` (or an explicit option).

3. Backward-incompatible default in the Python runner: `chat_format: str = "llama3"`. The Python runner should default to `"none"`, matching `llama_main`.

Medium Severity

4. Unlike the extension-level parser, `parse_chat_format` is case-sensitive; normalize first:

```
inline ChatFormat parse_chat_format(const std::string& format_str) {
  std::string lower = format_str;
  std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
  // ... lookup on lower ...
}
```

5. When a user passes `--chat_format=jinja` without a template file, the runner silently falls back to no formatting.

6. The `SUPPORT_REGEX_LOOKAHEAD` cache variable is set after `FetchContent_MakeAvailable`, too late to affect the jinja2cpp build.

7. Duplicate jinja2cpp FetchContent declarations across CMake files.

8. Python/C++ template-type mapping mismatch:

```
if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
    return chat_template_type.Llama3
```

The C++ parser maps the 3.2 variants to Llama32.

Low Severity / Suggestions

9. Supply-chain pinning: both FetchContent declarations pin only a Git tag; …

10. Missing …

11. …

12. Test coverage: the tests validate the formatter in isolation but don't test the end-to-end integration (e.g., that …).

CI Failures

The 100 CI failures appear to be build failures stemming from Boost regex headers pulled in by jinja2cpp (…). This needs to be addressed — either by suppressing the warning for jinja2cpp's transitive dependencies or by finding a version of jinja2cpp/boost that doesn't trigger it on Apple toolchains.

Summary

The feature is well-designed and the test suite covers the core formatter logic well. The main actionable items are: (1) fix the …
@claude any suggestions to split this into a smaller PR stack?
Claude finished @kirklandsign's task in 1m 50s — View job

PR Split Suggestion

Here's how I'd split this into a 4-PR stack, ordered by dependency:

PR 1: Core Jinja2Cpp chat formatter library (foundation)

Files: …

Why first: This is the self-contained library with no downstream impact. It can be reviewed/tested in isolation. Addresses the CI build issues (Boost …).

PR 2: Special token filtering + EOS handling in …
Per @kirklandsign's request to split this into a smaller PR stack, this work is now broken into 4 reviewable PRs:

Each PR is reviewable in isolation. PR-A is the foundation (no behavior change). PRs B, C, and D each layer on focused scope. The stack also addresses the reviewer feedback from this PR and the earlier …

Closing this PR in favor of the stack. cc @kirklandsign @larryliu0820 @metascroy @lucylq @mergennachin
Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()' with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for Llama-format <|...|> tokens. wrapped_callback now suppresses these from the printed stream when GenerationConfig.echo == false. When echo == true, raw model output (including chat-template tokens) is emitted unchanged - this preserves backward compatibility for users who explicitly want to see raw tokens.
* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES the tokenizer's primary eos_tok() with any additional EOS IDs the model metadata exports under kEosIds, instead of clearing the set when metadata is present. This is correct for HF-tokenizer models (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model also wants <|eot_id|> as a stop token. Also logs the primary tok and only logs metadata IDs that are newly inserted.

Why this is split out
---------------------
These are runner-behavior changes that affect ALL TextLLMRunner users, not just the new chat-template path. They deserve focused review for backward-compat impact (echo gating) and EOS-set semantics (merge vs clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter library) — only for stack ordering; this PR has no include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
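A hedged sketch of the is_special_token() filter this commit message describes; the concrete token set and signature are assumptions based on the text above, not the PR's verbatim code.

```cpp
// Hedged sketch of is_special_token() per the commit message above; the
// exact set contents and signature are assumptions for illustration.
#include <string>
#include <unordered_set>

static bool is_special_token(const std::string& piece) {
  static const std::unordered_set<std::string> kKnownSpecialTokens = {
      "<s>", "</s>", "<pad>", "<unk>",    // generic tokens named above
      "<start_of_turn>", "<end_of_turn>", // Gemma-style (assumed spellings)
  };
  if (kKnownSpecialTokens.count(piece) != 0) {
    return true;
  }
  // Llama-format special tokens look like <|...|>, e.g. <|eot_id|>.
  return piece.size() >= 4 && piece.compare(0, 2, "<|") == 0 &&
      piece.compare(piece.size() - 2, 2, "|>") == 0;
}
```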
…ration

Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file static factories, format(prompt, system_prompt) and format_conversation(ChatConversation) methods, includes_bos().
* extension/llm/runner/__init__.py: re-exports the new bindings via __all__.
* extension/llm/runner/_llm_runner.pyi: type stubs for the new classes so consumers get IDE / mypy support.
* extension/llm/runner/test/test_runner_pybindings.py: Python tests covering the new bindings end-to-end.
* examples/models/llama/runner/generation.py: LlamaRunner now accepts chat_format / system_prompt / chat_template_file kwargs and exposes _format_prompt + chat_completion using the JinjaChatFormatter. Default chat_format is 'none' (matches llama_main, preserves backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner callers). _resolve_template_type maps 'llama3.2' / 'llama32' / 'llama3_2' to ChatTemplateType.Llama32 (consistent with C++ parseChatTemplateType).
* examples/models/llama/runner/eager.py: adds --chat_template_file CLI flag for chat mode.

Why this is split out
---------------------
Python changes are independently testable and reviewers may want different eyes on the Python vs C++ paths. Also isolates the backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter library headers/symbols).

Original PR (full stack): pytorch#16987
…Jinja docs

Part 4 of the chat-template support stack split out of pytorch#16987. This is the user-facing surface that wires everything together.

What this PR adds
-----------------
* examples/models/llama/runner/chat_formatter.h: Example-local ChatFormatter abstraction with NoChatFormatter and a JinjaChatFormatterAdapter wrapping executorch::extension::llm::JinjaChatFormatter. parse_chat_format is case-insensitive and trims whitespace, so 'Llama3', ' llama3 ', 'LLAMA3' all map correctly. create_chat_formatter throws std::invalid_argument when chat_format=jinja is passed without --chat_template_file (no more silent no-op).
* examples/models/llama/main.cpp: Adds --chat_format, --chat_template_file, --system_prompt, --echo flags. Wraps the prompt with the chat formatter, catches invalid_argument / std::exception from formatter creation with clear error messages. Wires GenerationConfig.echo from the new --echo flag.
* examples/models/llama/runner/CMakeLists.txt + targets.bzl: link llama_runner against jinja2cpp (transitive include in chat_formatter.h).
* examples/models/llama/CMakeLists.txt: add a guarded FetchContent_Declare(jinja2cpp) so the example builds standalone (when the parent build hasn't already added jinja2cpp via extension/llm/chat_template), without redeclaring when it has.
* examples/models/llama/README.md: documents the new flags AND the recommended workflow of passing any HuggingFace / vLLM Jinja template via --chat_template_file (universal Jinja support).
* extension/llm/runner/README.md: documents universal Jinja support for the LLM runner library — points at vLLM examples and HF tokenizer_config.json files as supported template sources.

Why this is split out
---------------------
This is the user-facing CLI integration that depends on PRs A and C. It's the most reviewable in isolation since it's example code with lower blast radius — reviewers can focus on the CLI ergonomics and docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per reviewer feedback); documentation here points users to vLLM's examples directory and HuggingFace tokenizer_config.json files, which the universal Jinja support handles directly.

Depends on:
- PR-A: extension/llm/chat_template/* + JinjaChatFormatter library
- PR-C: chat_formatter.h includes JinjaChatFormatter (header-only), but generation.py / eager.py changes are independent

Original PR (full stack): pytorch#16987
Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based JinjaChatFormatter, supporting chat-types, embedded Llama3/Llama3.2/Gemma3 templates, build glue (CMake/Buck), and a focused C++ unit-test suite. This PR is reviewable in isolation — it has no behavior change for any existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:
1/4 (this PR) Library + tests
2/4 TextLLMRunner echo-gated special-token filter + EOS merge
3/4 Python bindings + Python LlamaRunner integration
4/4 llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------
* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt, targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the ChatTemplateType enum + ModelTokens. The CMake file FetchContent's Jinja2Cpp 1.3.2, with SUPPORT_REGEX_LOOKAHEAD set BEFORE FetchContent_MakeAvailable so it propagates correctly, plus header staging for nonstd headers that some Jinja2Cpp installations omit. Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the Universal Jinja chat formatter that supports any HuggingFace / vLLM chat template, not just the embedded ones. Loadable via fromTemplate (built-in), fromString (any string), or fromFile (any .jinja file). formatConversation injects vLLM/HuggingFace-standard params (tools=[], tool_choice=None, date_string, chat_template_kwargs) so any template that references those variables renders correctly.
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp: notably, 'not tools is none' maps to 'tools' (truthy check), preserving the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link extension_llm_runner against jinja2cpp (PRIVATE) and define EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt, targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3 embedded templates, parseChatTemplateType (case-insensitive), and three universal-Jinja regression tests:
  - generic HuggingFace-style template (proves it's not Llama-specific)
  - tools-aware template (validates the tools=[] default)
  - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template) guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds jinja_chat_formatter.cpp to the runner sources.

Notes
-----
* No behavior change for existing TextLLMRunner / MultimodalRunner users: the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback); documentation in the follow-up CLI PR points users to vLLM's examples directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987
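To make the normalizeTemplate step above concrete, a minimal sketch of such a rewrite pass follows; the helper name and loop structure are assumptions, and only the 'not tools is none' to 'tools' replacement pair comes from the commit text.

```cpp
// Hedged sketch of a normalizeTemplate-style pass: rewrite Jinja constructs
// that Jinja2Cpp cannot parse into equivalents it can. Helper name and
// structure are assumptions; the replacement pair mirrors the commit text.
#include <string>
#include <utility>
#include <vector>

std::string normalize_template(std::string tmpl) {
  static const std::vector<std::pair<std::string, std::string>> kRewrites = {
      {"not tools is none", "tools"}, // preserves 'tools is not none' intent
      {"not tools is None", "tools"},
  };
  for (const auto& rewrite : kRewrites) {
    const std::string& from = rewrite.first;
    const std::string& to = rewrite.second;
    // Replace every occurrence, resuming the search after each substitution.
    for (std::string::size_type pos = tmpl.find(from);
         pos != std::string::npos;
         pos = tmpl.find(from, pos + to.size())) {
      tmpl.replace(pos, from.size(), to);
    }
  }
  return tmpl;
}
```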