
[llm][4/4] llama_main CLI flags + chat_formatter wrapper + universal Jinja docs #19536

Open
seyeong-han wants to merge 4 commits into pytorch:main from seyeong-han:chat-llama-cli

Conversation

@seyeong-han
Contributor

Summary

Part 4 of the chat-template support stack split out of #16987 per @kirklandsign's request. This is the user-facing surface that wires everything together: the llama_main CLI flags, the example-local chat_formatter wrapper, and the universal-Jinja documentation.

Stack overview

| PR | Subject |
| --- | --- |
| 1/4 | #19533: Library + tests |
| 2/4 | #19534: TextLLMRunner echo gating + EOS merge |
| 3/4 | #19535: Python bindings + LlamaRunner integration |
| 4/4 | (this PR) llama_main CLI flags + chat_formatter wrapper + universal Jinja docs |

What this PR adds

Example-local ChatFormatter (examples/models/llama/runner/chat_formatter.h)

Adds a ChatFormatter abstraction with NoChatFormatter and a JinjaChatFormatterAdapter wrapping executorch::extension::llm::JinjaChatFormatter; a rough sketch of the header's shape follows the list below.

  • parse_chat_format is case-insensitive and trims whitespace, so "Llama3", " llama3 ", "LLAMA3" all map correctly.
  • create_chat_formatter throws std::invalid_argument when chat_format=jinja is passed without --chat_template_file (no more silent no-op — addresses the Copilot review comment from [llama] Add chat format support for Llama 3 Instruct models #16987).
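
To make the shape concrete, here is a minimal hedged sketch of what chat_formatter.h could look like. The class and function names (ChatFormatter, NoChatFormatter, parse_chat_format, create_chat_formatter) come from this PR's summary; the signatures, return types, and error-message text below are illustrative assumptions, not the actual header.

```cpp
// Hypothetical sketch of examples/models/llama/runner/chat_formatter.h;
// names follow the PR summary, signatures and messages are illustrative.
#include <algorithm>
#include <cctype>
#include <memory>
#include <stdexcept>
#include <string>

class ChatFormatter {
 public:
  virtual ~ChatFormatter() = default;
  // Wraps the user prompt (and optional system prompt) in a chat template.
  virtual std::string format(
      const std::string& prompt, const std::string& system_prompt) const = 0;
};

// Pass-through used for --chat_format=none (the default).
class NoChatFormatter : public ChatFormatter {
 public:
  std::string format(const std::string& prompt,
                     const std::string& /*system_prompt*/) const override {
    return prompt;
  }
};

// Trims whitespace and lower-cases, so "Llama3", " llama3 ", "LLAMA3"
// all normalize to "llama3".
inline std::string parse_chat_format(std::string s) {
  const auto not_space = [](unsigned char c) { return !std::isspace(c); };
  s.erase(s.begin(), std::find_if(s.begin(), s.end(), not_space));
  s.erase(std::find_if(s.rbegin(), s.rend(), not_space).base(), s.end());
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  return s;
}

// Fails loudly instead of silently ignoring a missing template file.
inline std::unique_ptr<ChatFormatter> create_chat_formatter(
    const std::string& chat_format, const std::string& chat_template_file) {
  if (parse_chat_format(chat_format) == "jinja" && chat_template_file.empty()) {
    throw std::invalid_argument(
        "--chat_format=jinja requires --chat_template_file");
  }
  // A JinjaChatFormatterAdapter wrapping
  // executorch::extension::llm::JinjaChatFormatter handles the other modes.
  return std::make_unique<NoChatFormatter>();
}
```

The key behavior mirrored here is the fail-loud path: jinja without a template file throws rather than silently falling back to the raw prompt.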

CLI flags (examples/models/llama/main.cpp)

Adds:

  • --chat_format=llama3|gemma3|jinja|none (default: none)
  • --chat_template_file=<path> (any HF / vLLM Jinja template)
  • --system_prompt="<text>"
  • --echo=true|false (wired to GenerationConfig.echo)

llama_main wraps the prompt with the selected chat formatter and catches std::invalid_argument / std::exception from formatter creation, printing clear error messages (see the sketch below).
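
For context, a rough sketch of how the flag wiring could look, assuming the gflags-style DEFINE_* macros and the chat_formatter.h sketch above. Only the flag names and the GenerationConfig.echo field are taken from this PR; the include paths, gflags usage, and runner calls are assumptions, not the actual main.cpp diff.

```cpp
// Illustrative wiring only: flag names and GenerationConfig.echo come from
// this PR; the gflags usage, include paths, and runner calls are assumptions.
#include <iostream>
#include <stdexcept>
#include <string>

#include <gflags/gflags.h>

#include <executorch/examples/models/llama/runner/chat_formatter.h>  // assumed path
#include <executorch/extension/llm/runner/irunner.h>  // assumed home of GenerationConfig

DEFINE_string(prompt, "", "User prompt");
DEFINE_string(chat_format, "none", "llama3|gemma3|jinja|none");
DEFINE_string(chat_template_file, "", "Path to any HF/vLLM Jinja template");
DEFINE_string(system_prompt, "", "Optional system prompt");
DEFINE_bool(echo, true, "Echo raw output, including special tokens");

int main(int argc, char** argv) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);

  std::string prompt = FLAGS_prompt;
  try {
    auto formatter =
        create_chat_formatter(FLAGS_chat_format, FLAGS_chat_template_file);
    prompt = formatter->format(prompt, FLAGS_system_prompt);
  } catch (const std::invalid_argument& e) {
    std::cerr << "Invalid chat-format configuration: " << e.what() << "\n";
    return 1;
  } catch (const std::exception& e) {
    std::cerr << "Failed to create chat formatter: " << e.what() << "\n";
    return 1;
  }

  executorch::extension::llm::GenerationConfig config;
  config.echo = FLAGS_echo;  // --echo=false keeps chat-template tokens out of the output
  // ... build the runner and call generate(prompt, config, token_callback) as before ...
  return 0;
}
```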

Build wiring

  • examples/models/llama/runner/CMakeLists.txt + targets.bzl: link llama_runner against jinja2cpp (transitive include in chat_formatter.h)
  • examples/models/llama/CMakeLists.txt: a guarded FetchContent_Declare(jinja2cpp) so the example builds standalone when the parent build hasn't already added jinja2cpp via extension/llm/chat_template, without redeclaring it when it has

Universal Jinja docs

  • examples/models/llama/README.md: documents the new flags AND the recommended workflow of passing any HuggingFace / vLLM Jinja template via --chat_template_file
  • extension/llm/runner/README.md: documents universal Jinja support — points at vLLM examples and HF tokenizer_config.json files as supported template sources

Universal Jinja example

After this PR, users can drop in any chat template — no recompile required:

# Use any vLLM template:
curl -L -o llama3.2.jinja \
  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_pythonic.jinja

cmake-out/examples/models/llama/llama_main \
  --model_path=<model.pte> \
  --tokenizer_path=<tokenizer.json> \
  --chat_template_file=llama3.2.jinja \
  --system_prompt="You are a helpful assistant." \
  --echo=false \
  --prompt="What is the capital of France?"
# Output: The capital of France is Paris.

Why this is split out

This is the user-facing CLI integration that depends on PRs A and C. It's the most reviewable in isolation since it's example code with lower blast radius — reviewers can focus on the CLI ergonomics and docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per @metascroy's reviewer feedback); documentation here points users to vLLM's examples/ directory and HuggingFace tokenizer_config.json files, which the universal Jinja support handles directly.

Test Plan

  • Build with cmake --workflow llm-release
  • Build with make llama-cpu
  • Test --chat_format=llama3 with Llama-3.2-1B-Instruct
  • Verify generation stops at <|eot_id|> token
  • Test --echo=false produces clean output without special tokens
  • Test --echo=true produces raw output WITH special tokens
  • Test --system_prompt affects model behavior
  • Test --chat_template_file=<vLLM template> works for arbitrary HF/vLLM Jinja files
  • Test --chat_format=jinja without --chat_template_file now produces a clear error (not silent no-op)
  • Test --chat_format="Llama3" / " llama3 " (case-insensitive + trim) parses correctly
  • Backward compatible with --chat_format=none (default)
  • Standalone build of examples/models/llama works (FetchContent guard does not redeclare)

Depends on

  • PR-A: #19533 (JinjaChatFormatter library)
  • PR-C: #19535 — Python LlamaRunner (independent runtime path; this PR's chat_formatter.h is C++-only)

Original PR

Splitting #16987 into 4 reviewable PRs. Once this stack lands, #16987 will be closed as superseded.

cc @kirklandsign @larryliu0820 @metascroy

@pytorch-bot

pytorch-bot Bot commented May 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19536

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 13, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Commit 1/4: Library + tests

Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based
JinjaChatFormatter, the supporting chat types, embedded Llama3/Llama3.2/Gemma3
templates, build glue (CMake/Buck), and a focused C++ unit-test suite.
This PR is reviewable in isolation — it has no behavior change for any
existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

  1/4 (this PR)  Library + tests
  2/4            TextLLMRunner echo-gated special-token filter + EOS merge
  3/4            Python bindings + Python LlamaRunner integration
  4/4            llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------
* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt,
  targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the
  ChatTemplateType enum + ModelTokens. The CMake file FetchContent's
  Jinja2Cpp 1.3.2, with SUPPORT_REGEX_LOOKAHEAD set BEFORE
  FetchContent_MakeAvailable so it propagates correctly, plus header
  staging for nonstd headers that some Jinja2Cpp installations omit.
  Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the
  Universal Jinja chat formatter that supports any HuggingFace / vLLM
  chat template, not just the embedded ones. Loadable via fromTemplate
  (built-in), fromString (any string), or fromFile (any .jinja file).
  formatConversation injects vLLM/HuggingFace-standard params (tools=[],
  tool_choice=None, date_string, chat_template_kwargs) so any template
  that references those variables renders correctly (a usage sketch
  follows this list).
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp:
  notably, 'not tools is none' maps to 'tools' (truthy check), preserving
  the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link
  extension_llm_runner against jinja2cpp (PRIVATE) and define
  EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt,
  targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3
  embedded templates, parseChatTemplateType (case-insensitive), and
  three universal-Jinja regression tests:
    - generic HuggingFace-style template (proves it's not Llama-specific)
    - tools-aware template (validates the tools=[] default)
    - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template)
  guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds
  jinja_chat_formatter.cpp to the runner sources.
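
To make the factory surface concrete, a hedged C++ usage sketch: fromTemplate / fromFile / format and ChatTemplateType are named in the message above, but the include paths, namespaces, return types, and error handling below are assumptions.

```cpp
// Hedged usage sketch; factory and method names follow the commit message,
// everything else (paths, namespaces, return types) is an assumption.
#include <iostream>
#include <string>

#include <executorch/extension/llm/chat_template/chat_templates.h>  // assumed
#include <executorch/extension/llm/runner/jinja_chat_formatter.h>   // assumed

int main() {
  using namespace executorch::extension::llm;

  // One of the embedded templates (Llama3 / Llama3.2 / Gemma3)...
  auto builtin = JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama3);

  // ...or any HuggingFace / vLLM template loaded from disk.
  auto custom = JinjaChatFormatter::fromFile("llama3.2.jinja");
  (void)custom;  // a real program would pick one path or the other

  // Render a single-turn prompt with an optional system prompt.
  std::cout << builtin.format("What is the capital of France?",
                              "You are a helpful assistant.")
            << std::endl;
  return 0;
}
```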

Notes
-----
* No behavior change for existing TextLLMRunner / MultimodalRunner users:
  the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback);
  documentation in the follow-up CLI PR points users to vLLM's examples
  directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987
Commit 2/4: TextLLMRunner echo gating + EOS merge

Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()'
  with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and
  generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for
  Llama-format <|...|> tokens. wrapped_callback now suppresses these
  from the printed stream when GenerationConfig.echo == false. When
  echo == true, raw model output (including chat-template tokens) is
  emitted unchanged; this preserves backward compatibility for users
  who explicitly want to see raw tokens (a sketch of the filter follows
  this list).

* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES
  the tokenizer's primary eos_tok() with any additional EOS IDs the
  model metadata exports under kEosIds, instead of clearing the set
  when metadata is present. This is correct for HF-tokenizer models
  (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model
  also wants <|eot_id|> as a stop token. Also logs the primary tok
  and only logs metadata IDs that are newly inserted.
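
For illustration, a paraphrased sketch of the echo-gated filter described above. The actual token set and matching rules live in text_llm_runner.cpp and may differ; make_wrapped_callback is a hypothetical helper used here only to show the gating.

```cpp
// Paraphrased sketch of the echo-gated special-token filter described above;
// the real set and match rules in text_llm_runner.cpp may differ, and
// make_wrapped_callback is a hypothetical helper shown only for the gating.
#include <functional>
#include <string>
#include <unordered_set>

static bool is_special_token(const std::string& piece) {
  static const std::unordered_set<std::string> kKnownSpecialTokens = {
      "<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>",
      "<start_of_turn>", "<end_of_turn>",
      "<s>", "</s>", "<pad>", "<unk>",
  };
  if (kKnownSpecialTokens.count(piece) != 0) {
    return true;
  }
  // Llama-format special tokens look like <|...|>.
  return piece.size() >= 4 && piece.rfind("<|", 0) == 0 &&
         piece.compare(piece.size() - 2, 2, "|>") == 0;
}

// When echo == false, special tokens are dropped from the printed stream;
// when echo == true, every piece is forwarded unchanged.
std::function<void(const std::string&)> make_wrapped_callback(
    std::function<void(const std::string&)> user_cb, bool echo) {
  return [user_cb = std::move(user_cb), echo](const std::string& piece) {
    if (!echo && is_special_token(piece)) {
      return;
    }
    user_cb(piece);
  };
}
```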

Why this is split out
---------------------
These are runner-behavior changes that affect ALL TextLLMRunner users,
not just the new chat-template path. They deserve focused review for
backward-compat impact (echo gating) and EOS-set semantics (merge vs
clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library) — only for stack ordering; this PR has no
            include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
Commit 3/4: Python bindings + LlamaRunner integration

Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file
    static factories, format(prompt, system_prompt) and
    format_conversation(ChatConversation) methods, and includes_bos()
    (a rough binding sketch follows this list).
* extension/llm/runner/__init__.py: re-exports the new bindings via
  __all__.
* extension/llm/runner/_llm_runner.pyi: type stubs for the new
  classes so consumers get IDE / mypy support.
* extension/llm/runner/test/test_runner_pybindings.py: Python tests
  covering the new bindings end-to-end.
* examples/models/llama/runner/generation.py: LlamaRunner now accepts
  chat_format / system_prompt / chat_template_file kwargs and exposes
  _format_prompt + chat_completion using the JinjaChatFormatter.
  Default chat_format is 'none' (matches llama_main, preserves
  backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner
  callers). _resolve_template_type maps 'llama3.2' / 'llama32' /
  'llama3_2' to ChatTemplateType.Llama32 (consistent with C++
  parseChatTemplateType).
* examples/models/llama/runner/eager.py: adds --chat_template_file CLI
  flag for chat mode.
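
As a rough idea of what the binding surface could look like (not the actual pybindings.cpp diff), assuming pybind11 and C++ member names that mirror the Python ones; every C++ identifier below other than the Python-visible names listed above is an assumption.

```cpp
// Hedged pybind11 sketch; Python-visible names follow the commit message,
// the C++ namespaces, member names, and enum values are assumed.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include <executorch/extension/llm/runner/jinja_chat_formatter.h>  // assumed

namespace py = pybind11;
namespace llm = executorch::extension::llm;

void register_chat_template_bindings(py::module_& m) {
  py::enum_<llm::ChatTemplateType>(m, "ChatTemplateType")
      .value("None_", llm::ChatTemplateType::None_)
      .value("Llama3", llm::ChatTemplateType::Llama3)
      .value("Llama32", llm::ChatTemplateType::Llama32)
      .value("Gemma3", llm::ChatTemplateType::Gemma3)
      .value("Custom", llm::ChatTemplateType::Custom);

  py::class_<llm::ChatMessage>(m, "ChatMessage")
      .def(py::init<std::string, std::string>(),
           py::arg("role"), py::arg("content"));

  py::class_<llm::JinjaChatFormatter>(m, "JinjaChatFormatter")
      .def_static("from_template", &llm::JinjaChatFormatter::fromTemplate)
      .def_static("from_string", &llm::JinjaChatFormatter::fromString)
      .def_static("from_file", &llm::JinjaChatFormatter::fromFile)
      .def("format", &llm::JinjaChatFormatter::format,
           py::arg("prompt"), py::arg("system_prompt") = "")
      .def("includes_bos", &llm::JinjaChatFormatter::includesBos);
  // ChatConversation and format_conversation would be bound the same way.
}
```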

Why this is split out
---------------------
Python changes are independently testable and reviewers may want
different eyes on the Python vs C++ paths. Also isolates the
backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library headers/symbols).

Original PR (full stack): pytorch#16987
Commit 4/4: llama_main CLI flags + chat_formatter wrapper + universal Jinja docs

Part 4 of the chat-template support stack split out of pytorch#16987. This is
the user-facing surface that wires everything together.

What this PR adds
-----------------
* examples/models/llama/runner/chat_formatter.h: Example-local
  ChatFormatter abstraction with NoChatFormatter and a
  JinjaChatFormatterAdapter wrapping
  executorch::extension::llm::JinjaChatFormatter. parse_chat_format is
  case-insensitive and trims whitespace, so 'Llama3', ' llama3 ',
  'LLAMA3' all map correctly. create_chat_formatter throws
  std::invalid_argument when chat_format=jinja is passed without
  --chat_template_file (no more silent no-op).
* examples/models/llama/main.cpp: Adds --chat_format,
  --chat_template_file, --system_prompt, --echo flags. Wraps the
  prompt with the chat formatter, catches invalid_argument /
  std::exception from formatter creation with clear error messages.
  Wires GenerationConfig.echo from the new --echo flag.
* examples/models/llama/runner/CMakeLists.txt + targets.bzl: link
  llama_runner against jinja2cpp (transitive include in chat_formatter.h).
* examples/models/llama/CMakeLists.txt: add a guarded
  FetchContent_Declare(jinja2cpp) so the example builds standalone
  (when the parent build hasn't already added jinja2cpp via
  extension/llm/chat_template), without redeclaring when it has.
* examples/models/llama/README.md: documents the new flags AND the
  recommended workflow of passing any HuggingFace / vLLM Jinja
  template via --chat_template_file (universal Jinja support).
* extension/llm/runner/README.md: documents universal Jinja support
  for the LLM runner library — points at vLLM examples and HF
  tokenizer_config.json files as supported template sources.

Why this is split out
---------------------
This is the user-facing CLI integration that depends on PRs A and C.
It's the most reviewable in isolation since it's example code with
lower blast radius — reviewers can focus on the CLI ergonomics and
docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per reviewer feedback);
documentation here points users to vLLM's examples directory and
HuggingFace tokenizer_config.json files, which the universal Jinja
support handles directly.

Depends on:
  - PR-A: extension/llm/chat_template/* + JinjaChatFormatter library
  - PR-C: chat_formatter.h includes JinjaChatFormatter (header-only),
          but generation.py / eager.py changes are independent

Original PR (full stack): pytorch#16987