[llm][3/4] Python bindings for JinjaChatFormatter + LlamaRunner integration#19535
Open
seyeong-han wants to merge 3 commits into
Conversation
Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based JinjaChatFormatter, supporting chat types, embedded Llama3/Llama3.2/Gemma3 templates, build glue (CMake/Buck), and a focused C++ unit-test suite. This PR is reviewable in isolation — it has no behavior change for any existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

* 1/4 (this PR) Library + tests
* 2/4 TextLLMRunner echo-gated special-token filter + EOS merge
* 3/4 Python bindings + Python LlamaRunner integration
* 4/4 llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------

* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt, targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the ChatTemplateType enum + ModelTokens. The CMake file FetchContent's Jinja2Cpp 1.3.2, with SUPPORT_REGEX_LOOKAHEAD set BEFORE FetchContent_MakeAvailable so it propagates correctly, plus header staging for nonstd headers that some Jinja2Cpp installations omit. Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the universal Jinja chat formatter that supports any HuggingFace / vLLM chat template, not just the embedded ones. Loadable via fromTemplate (built-in), fromString (any string), or fromFile (any .jinja file). formatConversation injects vLLM/HuggingFace-standard params (tools=[], tool_choice=None, date_string, chat_template_kwargs) so any template that references those variables renders correctly.
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp: notably, 'not tools is none' maps to 'tools' (truthy check), preserving the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link extension_llm_runner against jinja2cpp (PRIVATE) and define EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt, targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3 embedded templates, parseChatTemplateType (case-insensitive), and three universal-Jinja regression tests:
  - generic HuggingFace-style template (proves it's not Llama-specific)
  - tools-aware template (validates the tools=[] default)
  - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template), guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds jinja_chat_formatter.cpp to the runner sources.

Notes
-----

* No behavior change for existing TextLLMRunner / MultimodalRunner users: the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback); documentation in the follow-up CLI PR points users to vLLM's examples directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987
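To make the 'not tools is none' normalization concrete, here is a hedged Python paraphrase of the idea (the real normalizeTemplate is C++ inside jinja_chat_formatter.cpp and may do more than this single rewrite):

```python
def normalize_template(template_src: str) -> str:
    """Sketch of the normalization idea; the actual rules live in C++."""
    # Jinja2Cpp mishandles the vLLM/HF idiom "not tools is none". Because
    # formatConversation injects tools=[] by default, a plain truthiness
    # check preserves the author's intent: the tools section renders only
    # when the caller actually supplied tools (an empty list is falsy).
    return template_src.replace("not tools is none", "tools")

# "{%- if not tools is none %}" effectively becomes "{%- if tools %}".
```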
Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------

* extension/llm/runner/text_llm_runner.cpp: Add is_special_token() with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for Llama-format <|...|> tokens. wrapped_callback now suppresses these from the printed stream when GenerationConfig.echo == false. When echo == true, raw model output (including chat-template tokens) is emitted unchanged — this preserves backward compatibility for users who explicitly want to see raw tokens.
* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES the tokenizer's primary eos_tok() with any additional EOS IDs the model metadata exports under kEosIds, instead of clearing the set when metadata is present. This is correct for HF-tokenizer models (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model also wants <|eot_id|> as a stop token. Also logs the primary token and only logs metadata IDs that are newly inserted.

Why this is split out
---------------------

These are runner-behavior changes that affect ALL TextLLMRunner users, not just the new chat-template path. They deserve focused review for backward-compat impact (echo gating) and EOS-set semantics (merge vs. clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter library) — only for stack ordering; this PR has no include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
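A hedged Python paraphrase of the two behavior changes above (the real implementations are C++ in text_llm_runner.cpp and llm_runner_helper.cpp; the token set and helper names below are illustrative):

```python
import re
from typing import Callable, Optional, Set

# Abbreviated stand-in for kKnownSpecialTokens (Llama 3.x / Gemma / generic).
KNOWN_SPECIAL_TOKENS = {"<s>", "</s>", "<pad>", "<unk>"}
LLAMA_FORMAT_TOKEN = re.compile(r"^<\|[^|]*\|>$")  # Llama-format <|...|> tokens

def is_special_token(token: str) -> bool:
    return token in KNOWN_SPECIAL_TOKENS or bool(LLAMA_FORMAT_TOKEN.match(token))

def wrapped_callback(token: str, echo: bool, emit: Callable[[str], None]) -> None:
    # echo == False: suppress chat-template special tokens from the printed stream.
    # echo == True: emit raw model output unchanged (backward compatible).
    if echo or not is_special_token(token):
        emit(token)

def get_eos_ids(primary_eos: int, metadata_eos_ids: Optional[Set[int]]) -> Set[int]:
    # Merge, don't clear: keep the tokenizer's primary eos_tok() and add any
    # metadata kEosIds. For Llama 3.x this keeps <|end_of_text|> while also
    # honoring <|eot_id|> as a stop token.
    eos = {primary_eos}
    if metadata_eos_ids:
        eos |= metadata_eos_ids
    return eos
```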
[llm][3/4] Python bindings for JinjaChatFormatter + LlamaRunner integration

Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------

* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file static factories, format(prompt, system_prompt) and format_conversation(ChatConversation) methods, and includes_bos().
* extension/llm/runner/__init__.py: re-exports the new bindings via __all__.
* extension/llm/runner/_llm_runner.pyi: type stubs for the new classes so consumers get IDE / mypy support.
* extension/llm/runner/test/test_runner_pybindings.py: Python tests covering the new bindings end-to-end.
* examples/models/llama/runner/generation.py: LlamaRunner now accepts chat_format / system_prompt / chat_template_file kwargs and exposes _format_prompt + chat_completion using the JinjaChatFormatter. Default chat_format is 'none' (matches llama_main, preserving backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner callers). _resolve_template_type maps 'llama3.2' / 'llama32' / 'llama3_2' to ChatTemplateType.Llama32 (consistent with C++ parseChatTemplateType).
* examples/models/llama/runner/eager.py: adds a --chat_template_file CLI flag for chat mode.

Why this is split out
---------------------

Python changes are independently testable, and reviewers may want different eyes on the Python vs. C++ paths. This also isolates the backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter library headers/symbols).

Original PR (full stack): pytorch#16987
Summary
Part 3 of the chat-template support stack split out of #16987 per @kirklandsign's request.
This PR exposes the `JinjaChatFormatter` (added in PR-A #19533) to Python via pybind11 and integrates it into the example `LlamaRunner` Python class.

Stack overview

- 1/4: Library + tests (PR-A #19533)
- 2/4: `TextLLMRunner` echo-gated special-token filter + EOS merge
- 3/4 (this PR): Python bindings + Python `LlamaRunner` integration
- 4/4: `llama_main` CLI flags + `chat_formatter` wrapper + docs
What this PR adds
C++ pybind11 bindings (`extension/llm/runner/pybindings.cpp`)

- `ChatMessage(role, content)`
- `ChatConversation(messages, bos_token, eos_token, add_generation_prompt)`
- `ChatTemplateType` enum (`None_`, `Llama3`, `Llama32`, `Gemma3`, `Custom`)
- `JinjaChatFormatter` with `from_template` / `from_string` / `from_file` static factories, `format(prompt, system_prompt)` and `format_conversation(ChatConversation)` methods, and `includes_bos()`
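A minimal usage sketch of these bindings, assuming the import path shown in the test plan below; positional arguments are used since keyword support depends on the binding definitions, and the BOS/EOS token strings are illustrative:

```python
from executorch.extension.llm.runner import (
    ChatConversation,
    ChatMessage,
    ChatTemplateType,
    JinjaChatFormatter,
)

# Build a formatter from an embedded template.
formatter = JinjaChatFormatter.from_template(ChatTemplateType.Llama32)

# Single-turn helper: format(prompt, system_prompt).
print(formatter.format("What is ExecuTorch?", "You are a helpful assistant."))

# Multi-turn path: ChatConversation(messages, bos_token, eos_token, add_generation_prompt).
conversation = ChatConversation(
    [
        ChatMessage("system", "You are a helpful assistant."),
        ChatMessage("user", "What is ExecuTorch?"),
    ],
    "<|begin_of_text|>",  # illustrative BOS for Llama 3.x
    "<|eot_id|>",         # illustrative EOS
    True,                 # add_generation_prompt
)
print(formatter.format_conversation(conversation))

# Whether the rendered text already contains BOS (so the caller should not add it again).
print(formatter.includes_bos())
```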
Python package surface

- `extension/llm/runner/__init__.py` — re-exports the new bindings via `__all__`
- `extension/llm/runner/_llm_runner.pyi` — type stubs for the new classes (IDE / mypy support)
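The re-export plausibly follows the standard pattern sketched below; the compiled module name `_llm_runner` is inferred from the stub filename and is an assumption:

```python
# extension/llm/runner/__init__.py (sketch, not the actual file)
from ._llm_runner import (  # compiled pybind11 module; name assumed from the .pyi stub
    ChatConversation,
    ChatMessage,
    ChatTemplateType,
    JinjaChatFormatter,
)

__all__ = [
    "ChatConversation",
    "ChatMessage",
    "ChatTemplateType",
    "JinjaChatFormatter",
]
```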
Python `LlamaRunner` integration (`examples/models/llama/runner/generation.py`)

- `LlamaRunner` now accepts `chat_format` / `system_prompt` / `chat_template_file` kwargs and exposes `_format_prompt` + `chat_completion` using the `JinjaChatFormatter`.
- Backward-compat: the default `chat_format` is `"none"` (matches `llama_main`, preserving backward compatibility for existing `EagerLlamaRunner` / `NativeLlamaRunner` callers that don't pass `chat_format`).
- `_resolve_template_type` maps `"llama3.2"` / `"llama32"` / `"llama3_2"` to `ChatTemplateType.Llama32` (consistent with the C++ `parseChatTemplateType`) — addresses the cross-language consistency comment from the Copilot review on the original PR. A sketch of the mapping follows this list.
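A hedged sketch of the `_resolve_template_type` mapping described above (the real helper lives in `generation.py`; only the three Llama 3.2 spellings are guaranteed by this PR, the rest of the table is assumed):

```python
from executorch.extension.llm.runner import ChatTemplateType

# Alias table; only the three llama3.2 spellings are confirmed by the PR text.
_TEMPLATE_ALIASES = {
    "none": ChatTemplateType.None_,
    "llama3": ChatTemplateType.Llama3,
    "llama3.2": ChatTemplateType.Llama32,
    "llama32": ChatTemplateType.Llama32,
    "llama3_2": ChatTemplateType.Llama32,
    "gemma3": ChatTemplateType.Gemma3,
}

def _resolve_template_type(chat_format: str) -> ChatTemplateType:
    # Case-insensitive lookup, mirroring the C++ parseChatTemplateType.
    try:
        return _TEMPLATE_ALIASES[chat_format.lower()]
    except KeyError:
        raise ValueError(f"Unknown chat_format: {chat_format!r}")
```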
CLI integration (`examples/models/llama/runner/eager.py`)

- Adds a `--chat_template_file` CLI flag for chat mode.

Tests (`extension/llm/runner/test/test_runner_pybindings.py`)
extension/llm/runner/test/test_runner_pybindings.py)Python tests covering the new bindings end-to-end.
Why this is split out
Python changes are independently testable, and reviewers may want different eyes on the Python vs. C++ paths. This split also isolates the backward-compat concern around the `chat_format` default.

Test Plan
- Built with `EXECUTORCH_BUILD_PYBIND=ON`
- `pytest extension/llm/runner/test/test_runner_pybindings.py`
- `from executorch.extension.llm.runner import JinjaChatFormatter` works
- `LlamaRunner.chat_completion()` formats prompts correctly with the default Llama3 template
- `LlamaRunner` constructor with `chat_format="none"` (default) is backward-compatible
- `_resolve_template_type` maps Llama 3.2 variants to `Llama32`

Depends on
- PR-A #19533 (`extension/llm/chat_template/*` + `JinjaChatFormatter` library headers/symbols)

Original PR
Splitting #16987 into 4 reviewable PRs.
cc @kirklandsign @larryliu0820 @metascroy