[llm][4/4] llama_main CLI flags + chat_formatter wrapper + universal Jinja docs #19536
Open
seyeong-han wants to merge 4 commits into
Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based JinjaChatFormatter, supporting chat-types, embedded Llama3/Llama3.2/Gemma3 templates, build glue (CMake/Buck), and a focused C++ unit-test suite. This PR is reviewable in isolation — it has no behavior change for any existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

1/4 (this PR) Library + tests
2/4 TextLLMRunner echo-gated special-token filter + EOS merge
3/4 Python bindings + Python LlamaRunner integration
4/4 llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------

* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt, targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the ChatTemplateType enum + ModelTokens. The CMake file FetchContent's Jinja2Cpp 1.3.2, with SUPPORT_REGEX_LOOKAHEAD set BEFORE FetchContent_MakeAvailable so it propagates correctly, plus header staging for nonstd headers that some Jinja2Cpp installations omit. Installs chat_templates.h so SDK consumers can include it.

* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the Universal Jinja chat formatter that supports any HuggingFace / vLLM chat template, not just the embedded ones. Loadable via fromTemplate (built-in), fromString (any string), or fromFile (any .jinja file). formatConversation injects vLLM/HuggingFace-standard params (tools=[], tool_choice=None, date_string, chat_template_kwargs) so any template that references those variables renders correctly.

* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp: notably, 'not tools is none' maps to 'tools' (truthy check), preserving the intent of 'tools is not none' for empty-list defaults.

* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link extension_llm_runner against jinja2cpp (PRIVATE) and define EXECUTORCH_USE_JINJA2CPP.

* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt, targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3 embedded templates, parseChatTemplateType (case-insensitive), and three universal-Jinja regression tests:
  - generic HuggingFace-style template (proves it's not Llama-specific)
  - tools-aware template (validates the tools=[] default)
  - 'not tools is none' normalization regression test

* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template) guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.

* shim_et/xplat/executorch/build/build_variables.bzl — adds jinja_chat_formatter.cpp to the runner sources.

Notes
-----

* No behavior change for existing TextLLMRunner / MultimodalRunner users: the formatter is opt-in, only invoked when downstream code calls it.

* Sample vLLM templates are NOT checked in (per reviewer feedback); documentation in the follow-up CLI PR points users to vLLM's examples directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987
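To make the library surface concrete, here is a minimal usage sketch. Only the names (`JinjaChatFormatter`, `fromTemplate`, `ChatTemplateType`, `formatConversation`, `ChatMessage`, `ChatConversation`) appear in this stack's description; the signatures, field names, and whether the factory returns a value or a pointer are assumptions, not the actual headers.

```cpp
// Hedged sketch, not the real API: signatures and field names are assumed.
// The actual declarations live in extension/llm/runner/jinja_chat_formatter.h
// and extension/llm/runner/chat_types.h.
#include <string>
#include <vector>

#include <executorch/extension/llm/runner/chat_types.h>
#include <executorch/extension/llm/runner/jinja_chat_formatter.h>

using namespace executorch::extension::llm;

std::string build_prompt() {
  // Built-in Llama 3.2 template embedded via extension/llm/chat_template.
  auto formatter = JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama32);

  // Field names assumed from the pybind constructors described in PR 3/4.
  ChatConversation conversation;
  conversation.messages = {
      ChatMessage{"system", "You are a helpful assistant."},
      ChatMessage{"user", "Summarize what ExecuTorch is."},
  };
  conversation.add_generation_prompt = true;

  // formatConversation injects the vLLM/HF-standard params (tools=[],
  // tool_choice=None, date_string, chat_template_kwargs) before rendering.
  return formatter.formatConversation(conversation);
}
```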
Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------

* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()' with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for Llama-format <|...|> tokens. wrapped_callback now suppresses these from the printed stream when GenerationConfig.echo == false. When echo == true, raw model output (including chat-template tokens) is emitted unchanged - this preserves backward compatibility for users who explicitly want to see raw tokens.

* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES the tokenizer's primary eos_tok() with any additional EOS IDs the model metadata exports under kEosIds, instead of clearing the set when metadata is present. This is correct for HF-tokenizer models (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model also wants <|eot_id|> as a stop token. Also logs the primary tok and only logs metadata IDs that are newly inserted.

Why this is split out
---------------------

These are runner-behavior changes that affect ALL TextLLMRunner users, not just the new chat-template path. They deserve focused review for backward-compat impact (echo gating) and EOS-set semantics (merge vs clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter library) — only for stack ordering; this PR has no include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
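The gating rule itself is small; the sketch below restates the described behavior in isolation. It is illustrative only: the real token set and callback wiring live in text_llm_runner.cpp and may differ.

```cpp
// Illustrative sketch of the echo-gated filter described above; the real
// kKnownSpecialTokens set and wrapped_callback wiring live in text_llm_runner.cpp.
#include <string>
#include <unordered_set>

bool is_special_token(const std::string& token) {
  // Only <s>/</s>/<pad>/<unk> are named explicitly in the PR text; the
  // Gemma turn markers below are an assumption about what the set covers.
  static const std::unordered_set<std::string> known = {
      "<s>", "</s>", "<pad>", "<unk>", "<start_of_turn>", "<end_of_turn>"};
  if (known.count(token) != 0) {
    return true;
  }
  // Llama-format special tokens look like <|...|>.
  return token.size() >= 4 && token.compare(0, 2, "<|") == 0 &&
         token.compare(token.size() - 2, 2, "|>") == 0;
}

// wrapped_callback-style gating: special tokens are suppressed from the
// printed stream only when GenerationConfig.echo == false.
template <typename Print>
void emit_token(const std::string& token, bool echo, Print&& print) {
  if (!echo && is_special_token(token)) {
    return;  // filtered from the printed stream
  }
  print(token);  // echo == true passes raw model output through unchanged
}
```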
Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------

* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file static factories, format(prompt, system_prompt) and format_conversation(ChatConversation) methods, includes_bos().

* extension/llm/runner/__init__.py: re-exports the new bindings via __all__.

* extension/llm/runner/_llm_runner.pyi: type stubs for the new classes so consumers get IDE / mypy support.

* extension/llm/runner/test/test_runner_pybindings.py: Python tests covering the new bindings end-to-end.

* examples/models/llama/runner/generation.py: LlamaRunner now accepts chat_format / system_prompt / chat_template_file kwargs and exposes _format_prompt + chat_completion using the JinjaChatFormatter. Default chat_format is 'none' (matches llama_main, preserves backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner callers). _resolve_template_type maps 'llama3.2' / 'llama32' / 'llama3_2' to ChatTemplateType.Llama32 (consistent with C++ parseChatTemplateType).

* examples/models/llama/runner/eager.py: adds --chat_template_file CLI flag for chat mode.

Why this is split out
---------------------

Python changes are independently testable and reviewers may want different eyes on the Python vs C++ paths. Also isolates the backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter library headers/symbols).

Original PR (full stack): pytorch#16987
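For orientation, a hedged pybind11 sketch of what the bindings in pybindings.cpp might look like. The Python-facing names come from this description; the C++ method names (e.g. includesBos), constructor signatures, and module wiring are guesses, not the actual file.

```cpp
// Hedged sketch of the new bindings; the real pybindings.cpp may differ in
// signatures, holders, default arguments, and which overloads are exposed.
#include <string>
#include <vector>

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include <executorch/extension/llm/runner/chat_types.h>
#include <executorch/extension/llm/runner/jinja_chat_formatter.h>

namespace py = pybind11;
using namespace executorch::extension::llm;

void bind_chat_formatter(py::module_& m) {
  py::class_<ChatMessage>(m, "ChatMessage")
      .def(py::init<std::string, std::string>(), py::arg("role"), py::arg("content"));

  py::class_<ChatConversation>(m, "ChatConversation")
      .def(py::init<std::vector<ChatMessage>, std::string, std::string, bool>(),
           py::arg("messages"), py::arg("bos_token"), py::arg("eos_token"),
           py::arg("add_generation_prompt"));

  py::class_<JinjaChatFormatter>(m, "JinjaChatFormatter")
      .def_static("from_template", &JinjaChatFormatter::fromTemplate)
      .def_static("from_string", &JinjaChatFormatter::fromString)
      .def_static("from_file", &JinjaChatFormatter::fromFile)
      .def("format", &JinjaChatFormatter::format)
      .def("format_conversation", &JinjaChatFormatter::formatConversation)
      .def("includes_bos", &JinjaChatFormatter::includesBos);
}
```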
Part 4 of the chat-template support stack split out of pytorch#16987. This is the user-facing surface that wires everything together.

What this PR adds
-----------------

* examples/models/llama/runner/chat_formatter.h: Example-local ChatFormatter abstraction with NoChatFormatter and a JinjaChatFormatterAdapter wrapping executorch::extension::llm::JinjaChatFormatter. parse_chat_format is case-insensitive and trims whitespace, so 'Llama3', ' llama3 ', 'LLAMA3' all map correctly. create_chat_formatter throws std::invalid_argument when chat_format=jinja is passed without --chat_template_file (no more silent no-op).

* examples/models/llama/main.cpp: Adds --chat_format, --chat_template_file, --system_prompt, --echo flags. Wraps the prompt with the chat formatter, catches invalid_argument / std::exception from formatter creation with clear error messages. Wires GenerationConfig.echo from the new --echo flag.

* examples/models/llama/runner/CMakeLists.txt + targets.bzl: link llama_runner against jinja2cpp (transitive include in chat_formatter.h).

* examples/models/llama/CMakeLists.txt: add a guarded FetchContent_Declare(jinja2cpp) so the example builds standalone (when the parent build hasn't already added jinja2cpp via extension/llm/chat_template), without redeclaring when it has.

* examples/models/llama/README.md: documents the new flags AND the recommended workflow of passing any HuggingFace / vLLM Jinja template via --chat_template_file (universal Jinja support).

* extension/llm/runner/README.md: documents universal Jinja support for the LLM runner library — points at vLLM examples and HF tokenizer_config.json files as supported template sources.

Why this is split out
---------------------

This is the user-facing CLI integration that depends on PRs A and C. It's the most reviewable in isolation since it's example code with lower blast radius — reviewers can focus on the CLI ergonomics and docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per reviewer feedback); documentation here points users to vLLM's examples directory and HuggingFace tokenizer_config.json files, which the universal Jinja support handles directly.

Depends on:
- PR-A: extension/llm/chat_template/* + JinjaChatFormatter library
- PR-C: chat_formatter.h includes JinjaChatFormatter (header-only), but generation.py / eager.py changes are independent

Original PR (full stack): pytorch#16987
Summary
Part 4 of the chat-template support stack split out of #16987 per @kirklandsign's request. This is the user-facing surface that wires everything together — the CLI flags for `llama_main` and the universal-Jinja documentation.

Stack overview

- 1/4: `JinjaChatFormatter` library + tests
- 2/4: `TextLLMRunner` echo-gated special-token filter + EOS merge
- 3/4: Python bindings + Python `LlamaRunner` integration
- 4/4 (this PR): `llama_main` CLI flags + `chat_formatter` wrapper + universal Jinja docs
What this PR adds
Example-local ChatFormatter (`examples/models/llama/runner/chat_formatter.h`)

- `ChatFormatter` abstraction with `NoChatFormatter` and a `JinjaChatFormatterAdapter` wrapping `executorch::extension::llm::JinjaChatFormatter` (usage sketch below).
- `parse_chat_format` is case-insensitive and trims whitespace, so `"Llama3"`, `" llama3 "`, `"LLAMA3"` all map correctly.
- `create_chat_formatter` throws `std::invalid_argument` when `chat_format=jinja` is passed without `--chat_template_file` (no more silent no-op — addresses the Copilot review comment from [llama] Add chat format support for Llama 3 Instruct models #16987).
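A rough sketch of how the wrapper might be used from `main.cpp`. Only the identifiers and flag semantics (`ChatFormatter`, `parse_chat_format`, `create_chat_formatter`, the error behavior) come from this PR; the declarations below stand in for the real header and their signatures are assumptions.

```cpp
// Hypothetical sketch only; the real interface lives in
// examples/models/llama/runner/chat_formatter.h and may differ in signatures.
#include <memory>
#include <stdexcept>
#include <string>

// Assumed shapes for the example-local API (names from the PR, signatures guessed):
enum class ChatFormat { None, Llama3, Gemma3, Jinja };

class ChatFormatter {
 public:
  virtual ~ChatFormatter() = default;
  virtual std::string format(const std::string& prompt,
                             const std::string& system_prompt) const = 0;
};

ChatFormat parse_chat_format(std::string s);  // case-insensitive, trims whitespace
std::unique_ptr<ChatFormatter> create_chat_formatter(ChatFormat format,
                                                      const std::string& template_file);

std::string wrap_prompt(const std::string& chat_format,
                        const std::string& chat_template_file,
                        const std::string& system_prompt,
                        const std::string& prompt) {
  // Throws std::invalid_argument when chat_format=jinja but --chat_template_file
  // was not supplied; main.cpp catches this and prints a clear error.
  auto formatter =
      create_chat_formatter(parse_chat_format(chat_format), chat_template_file);
  return formatter->format(prompt, system_prompt);
}
```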
CLI flags (`examples/models/llama/main.cpp`)

Adds:

- `--chat_format=llama3|gemma3|jinja|none` (default: `none`)
- `--chat_template_file=<path>` (any HF / vLLM Jinja template)
- `--system_prompt="<text>"`
- `--echo=true|false` (wired to `GenerationConfig.echo`)

Wraps the prompt with the chat formatter, catches `invalid_argument` / `std::exception` from formatter creation with clear error messages.

Build wiring
- `examples/models/llama/runner/CMakeLists.txt` + `targets.bzl`: link `llama_runner` against `jinja2cpp` (transitive include in `chat_formatter.h`)
- `examples/models/llama/CMakeLists.txt`: a guarded `FetchContent_Declare(jinja2cpp)` so the example builds standalone (when the parent build hasn't already added `jinja2cpp` via `extension/llm/chat_template`), without redeclaring when it has

Universal Jinja docs

- `examples/models/llama/README.md`: documents the new flags AND the recommended workflow of passing any HuggingFace / vLLM Jinja template via `--chat_template_file`
- `extension/llm/runner/README.md`: documents universal Jinja support — points at vLLM examples and HF `tokenizer_config.json` files as supported template sources

Universal Jinja example
After this PR, users can drop in any chat template by pointing `--chat_template_file` at a HuggingFace / vLLM Jinja file — no recompile required.
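At the library level, the same drop-in path looks roughly like the sketch below. `JinjaChatFormatter::fromFile` is named in PR 1/4 of this stack; its exact signature, the return type, and the `format(prompt, system_prompt)` call shape are assumptions.

```cpp
// Hedged sketch: load an arbitrary HF/vLLM .jinja template at runtime.
// Only the class and factory names are from the PR; signatures are assumed.
#include <string>

#include <executorch/extension/llm/runner/jinja_chat_formatter.h>

using executorch::extension::llm::JinjaChatFormatter;

std::string format_with_custom_template(const std::string& template_path,
                                        const std::string& system_prompt,
                                        const std::string& prompt) {
  // Any HuggingFace tokenizer_config.json chat_template or vLLM example
  // template can be saved to a file and loaded here; no rebuild needed.
  auto formatter = JinjaChatFormatter::fromFile(template_path);
  return formatter.format(prompt, system_prompt);
}
```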
Why this is split out
This is the user-facing CLI integration that depends on PRs A and C. It's the most reviewable in isolation since it's example code with lower blast radius — reviewers can focus on the CLI ergonomics and docs without re-reading library internals.
Sample vLLM templates are NOT checked in (per @metascroy's reviewer feedback); documentation here points users to vLLM's `examples/` directory and HuggingFace `tokenizer_config.json` files, which the universal Jinja support handles directly.

Test Plan
- Built with `cmake --workflow llm-release` / `make llama-cpu`
- `--chat_format=llama3` with `Llama-3.2-1B-Instruct` (stops at the `<|eot_id|>` token)
- `--echo=false` produces clean output without special tokens
- `--echo=true` produces raw output WITH special tokens
- `--system_prompt` affects model behavior
- `--chat_template_file=<vLLM template>` works for arbitrary HF/vLLM Jinja files
- `--chat_format=jinja` without `--chat_template_file` now produces a clear error (not silent no-op)
- `--chat_format="Llama3"` / `" llama3 "` (case-insensitive + trim) parses correctly
- `--chat_format=none` (default) behaves as before
- Standalone build of `examples/models/llama` works (FetchContent guard does not redeclare)

Depends on

- PR A: the `JinjaChatFormatter` library
- PR C: the Python `LlamaRunner` integration (independent runtime path; this PR's `chat_formatter.h` is C++-only)

Original PR
Splitting #16987 into 4 reviewable PRs. Once this stack lands, #16987 will be closed as superseded.
cc @kirklandsign @larryliu0820 @metascroy