
[llm][4/4] llama_main CLI flags + chat_formatter wrapper + universal Jinja docs #19536

Open
seyeong-han wants to merge 4 commits into pytorch:main from seyeong-han:chat-llama-cli

Conversation

@seyeong-han
Contributor

Summary

Part 4 of the chat-template support stack split out of #16987 per @kirklandsign's request. This is the user-facing surface that wires everything together: the llama_main CLI flags, the example-local chat_formatter wrapper, and the universal-Jinja documentation.

Stack overview

| PR | Subject |
| --- | --- |
| 1/4 | #19533: Library + tests |
| 2/4 | #19534: TextLLMRunner echo gating + EOS merge |
| 3/4 | #19535: Python bindings + LlamaRunner integration |
| 4/4 | (this PR) llama_main CLI flags + chat_formatter wrapper + universal Jinja docs |

What this PR adds

Example-local ChatFormatter (examples/models/llama/runner/chat_formatter.h)

Adds a ChatFormatter abstraction with NoChatFormatter and a JinjaChatFormatterAdapter wrapping executorch::extension::llm::JinjaChatFormatter; a rough sketch of the header's shape follows the list below.

  • parse_chat_format is case-insensitive and trims whitespace, so "Llama3", " llama3 ", "LLAMA3" all map correctly.
  • create_chat_formatter throws std::invalid_argument when chat_format=jinja is passed without --chat_template_file (no more silent no-op — addresses the Copilot review comment from [llama] Add chat format support for Llama 3 Instruct models #16987).
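
To make the shape concrete, here is a minimal hedged sketch of what chat_formatter.h could look like. The class and function names (ChatFormatter, NoChatFormatter, parse_chat_format, create_chat_formatter) come from this PR's summary; the signatures, return types, and error-message text below are illustrative assumptions, not the actual header.

```cpp
// Hypothetical sketch of examples/models/llama/runner/chat_formatter.h;
// names follow the PR summary, signatures and messages are illustrative.
#include <algorithm>
#include <cctype>
#include <memory>
#include <stdexcept>
#include <string>

class ChatFormatter {
 public:
  virtual ~ChatFormatter() = default;
  // Wraps the user prompt (and optional system prompt) in a chat template.
  virtual std::string format(
      const std::string& prompt, const std::string& system_prompt) const = 0;
};

// Pass-through used for --chat_format=none (the default).
class NoChatFormatter : public ChatFormatter {
 public:
  std::string format(const std::string& prompt,
                     const std::string& /*system_prompt*/) const override {
    return prompt;
  }
};

// Trims whitespace and lower-cases, so "Llama3", " llama3 ", "LLAMA3"
// all normalize to "llama3".
inline std::string parse_chat_format(std::string s) {
  const auto not_space = [](unsigned char c) { return !std::isspace(c); };
  s.erase(s.begin(), std::find_if(s.begin(), s.end(), not_space));
  s.erase(std::find_if(s.rbegin(), s.rend(), not_space).base(), s.end());
  std::transform(s.begin(), s.end(), s.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  return s;
}

// Fails loudly instead of silently ignoring a missing template file.
inline std::unique_ptr<ChatFormatter> create_chat_formatter(
    const std::string& chat_format, const std::string& chat_template_file) {
  if (parse_chat_format(chat_format) == "jinja" && chat_template_file.empty()) {
    throw std::invalid_argument(
        "--chat_format=jinja requires --chat_template_file");
  }
  // A JinjaChatFormatterAdapter wrapping
  // executorch::extension::llm::JinjaChatFormatter handles the other modes.
  return std::make_unique<NoChatFormatter>();
}
```

The key behavior mirrored here is the fail-loud path: jinja without a template file throws rather than silently falling back to the raw prompt.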

CLI flags (examples/models/llama/main.cpp)

Adds:

  • --chat_format=llama3|gemma3|jinja|none (default: none)
  • --chat_template_file=<path> (any HF / vLLM Jinja template)
  • --system_prompt="<text>"
  • --echo=true|false (wired to GenerationConfig.echo)

llama_main wraps the prompt with the selected chat formatter and catches std::invalid_argument / std::exception from formatter creation, printing clear error messages (see the sketch below).
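
For context, a rough sketch of how the flag wiring could look, assuming the gflags-style DEFINE_* macros and the chat_formatter.h sketch above. Only the flag names and the GenerationConfig.echo field are taken from this PR; the include paths, gflags usage, and runner calls are assumptions, not the actual main.cpp diff.

```cpp
// Illustrative wiring only: flag names and GenerationConfig.echo come from
// this PR; the gflags usage, include paths, and runner calls are assumptions.
#include <iostream>
#include <stdexcept>
#include <string>

#include <gflags/gflags.h>

#include <executorch/examples/models/llama/runner/chat_formatter.h>  // assumed path
#include <executorch/extension/llm/runner/irunner.h>  // assumed home of GenerationConfig

DEFINE_string(prompt, "", "User prompt");
DEFINE_string(chat_format, "none", "llama3|gemma3|jinja|none");
DEFINE_string(chat_template_file, "", "Path to any HF/vLLM Jinja template");
DEFINE_string(system_prompt, "", "Optional system prompt");
DEFINE_bool(echo, true, "Echo raw output, including special tokens");

int main(int argc, char** argv) {
  gflags::ParseCommandLineFlags(&argc, &argv, true);

  std::string prompt = FLAGS_prompt;
  try {
    auto formatter =
        create_chat_formatter(FLAGS_chat_format, FLAGS_chat_template_file);
    prompt = formatter->format(prompt, FLAGS_system_prompt);
  } catch (const std::invalid_argument& e) {
    std::cerr << "Invalid chat-format configuration: " << e.what() << "\n";
    return 1;
  } catch (const std::exception& e) {
    std::cerr << "Failed to create chat formatter: " << e.what() << "\n";
    return 1;
  }

  executorch::extension::llm::GenerationConfig config;
  config.echo = FLAGS_echo;  // --echo=false keeps chat-template tokens out of the output
  // ... build the runner and call generate(prompt, config, token_callback) as before ...
  return 0;
}
```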

Build wiring

  • examples/models/llama/runner/CMakeLists.txt + targets.bzl: link llama_runner against jinja2cpp (transitive include in chat_formatter.h)
  • examples/models/llama/CMakeLists.txt: a guarded FetchContent_Declare(jinja2cpp) so the example builds standalone when the parent build hasn't already added jinja2cpp via extension/llm/chat_template, without redeclaring it when it has

Universal Jinja docs

  • examples/models/llama/README.md: documents the new flags AND the recommended workflow of passing any HuggingFace / vLLM Jinja template via --chat_template_file
  • extension/llm/runner/README.md: documents universal Jinja support — points at vLLM examples and HF tokenizer_config.json files as supported template sources

Universal Jinja example

After this PR, users can drop in any chat template — no recompile required:

# Use any vLLM template:
curl -L -o llama3.2.jinja \
  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_llama3.2_pythonic.jinja

cmake-out/examples/models/llama/llama_main \
  --model_path=<model.pte> \
  --tokenizer_path=<tokenizer.json> \
  --chat_template_file=llama3.2.jinja \
  --system_prompt="You are a helpful assistant." \
  --echo=false \
  --prompt="What is the capital of France?"
# Output: The capital of France is Paris.

Why this is split out

This is the user-facing CLI integration that depends on PRs A and C. It's the most reviewable in isolation since it's example code with lower blast radius — reviewers can focus on the CLI ergonomics and docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per @metascroy's reviewer feedback); documentation here points users to vLLM's examples/ directory and HuggingFace tokenizer_config.json files, which the universal Jinja support handles directly.

Test Plan

  • Build with cmake --workflow llm-release
  • Build with make llama-cpu
  • Test --chat_format=llama3 with Llama-3.2-1B-Instruct
  • Verify generation stops at <|eot_id|> token
  • Test --echo=false produces clean output without special tokens
  • Test --echo=true produces raw output WITH special tokens
  • Test --system_prompt affects model behavior
  • Test --chat_template_file=<vLLM template> works for arbitrary HF/vLLM Jinja files
  • Test --chat_format=jinja without --chat_template_file now produces a clear error (not silent no-op)
  • Test --chat_format="Llama3" / " llama3 " (case-insensitive + trim) parses correctly
  • Backward compatible with --chat_format=none (default)
  • Standalone build of examples/models/llama works (FetchContent guard does not redeclare)

Depends on

  • PR-A: #19533 (JinjaChatFormatter library)
  • PR-C: #19535 — Python LlamaRunner (independent runtime path; this PR's chat_formatter.h is C++-only)

Original PR

Splitting #16987 into 4 reviewable PRs. Once this stack lands, #16987 will be closed as superseded.

cc @kirklandsign @larryliu0820 @metascroy

@pytorch-bot

pytorch-bot Bot commented May 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19536

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 13, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Commit 1/4: Library + tests

Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based
JinjaChatFormatter, the supporting chat types, embedded Llama3/Llama3.2/Gemma3
templates, build glue (CMake/Buck), and a focused C++ unit-test suite.
This PR is reviewable in isolation — it has no behavior change for any
existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

  1/4 (this PR)  Library + tests
  2/4            TextLLMRunner echo-gated special-token filter + EOS merge
  3/4            Python bindings + Python LlamaRunner integration
  4/4            llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------
* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt,
  targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the
  ChatTemplateType enum + ModelTokens. The CMake file FetchContent's
  Jinja2Cpp 1.3.2, with SUPPORT_REGEX_LOOKAHEAD set BEFORE
  FetchContent_MakeAvailable so it propagates correctly, plus header
  staging for nonstd headers that some Jinja2Cpp installations omit.
  Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the
  Universal Jinja chat formatter that supports any HuggingFace / vLLM
  chat template, not just the embedded ones. Loadable via fromTemplate
  (built-in), fromString (any string), or fromFile (any .jinja file).
  formatConversation injects vLLM/HuggingFace-standard params (tools=[],
  tool_choice=None, date_string, chat_template_kwargs) so any template
  that references those variables renders correctly (a usage sketch
  follows this list).
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp:
  notably, 'not tools is none' maps to 'tools' (truthy check), preserving
  the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link
  extension_llm_runner against jinja2cpp (PRIVATE) and define
  EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt,
  targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3
  embedded templates, parseChatTemplateType (case-insensitive), and
  three universal-Jinja regression tests:
    - generic HuggingFace-style template (proves it's not Llama-specific)
    - tools-aware template (validates the tools=[] default)
    - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template)
  guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds
  jinja_chat_formatter.cpp to the runner sources.
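
To make the factory surface concrete, a hedged C++ usage sketch: fromTemplate / fromFile / format and ChatTemplateType are named in the message above, but the include paths, namespaces, return types, and error handling below are assumptions.

```cpp
// Hedged usage sketch; factory and method names follow the commit message,
// everything else (paths, namespaces, return types) is an assumption.
#include <iostream>
#include <string>

#include <executorch/extension/llm/chat_template/chat_templates.h>  // assumed
#include <executorch/extension/llm/runner/jinja_chat_formatter.h>   // assumed

int main() {
  using namespace executorch::extension::llm;

  // One of the embedded templates (Llama3 / Llama3.2 / Gemma3)...
  auto builtin = JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama3);

  // ...or any HuggingFace / vLLM template loaded from disk.
  auto custom = JinjaChatFormatter::fromFile("llama3.2.jinja");
  (void)custom;  // a real program would pick one path or the other

  // Render a single-turn prompt with an optional system prompt.
  std::cout << builtin.format("What is the capital of France?",
                              "You are a helpful assistant.")
            << std::endl;
  return 0;
}
```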

Notes
-----
* No behavior change for existing TextLLMRunner / MultimodalRunner users:
  the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback);
  documentation in the follow-up CLI PR points users to vLLM's examples
  directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987
Commit 2/4: TextLLMRunner echo gating + EOS merge

Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()'
  with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and
  generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for
  Llama-format <|...|> tokens. wrapped_callback now suppresses these
  from the printed stream when GenerationConfig.echo == false. When
  echo == true, raw model output (including chat-template tokens) is
  emitted unchanged; this preserves backward compatibility for users
  who explicitly want to see raw tokens (a sketch of the filter follows
  this list).

* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES
  the tokenizer's primary eos_tok() with any additional EOS IDs the
  model metadata exports under kEosIds, instead of clearing the set
  when metadata is present. This is correct for HF-tokenizer models
  (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model
  also wants <|eot_id|> as a stop token. Also logs the primary tok
  and only logs metadata IDs that are newly inserted.
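
For illustration, a paraphrased sketch of the echo-gated filter described above. The actual token set and matching rules live in text_llm_runner.cpp and may differ; make_wrapped_callback is a hypothetical helper used here only to show the gating.

```cpp
// Paraphrased sketch of the echo-gated special-token filter described above;
// the real set and match rules in text_llm_runner.cpp may differ, and
// make_wrapped_callback is a hypothetical helper shown only for the gating.
#include <functional>
#include <string>
#include <unordered_set>

static bool is_special_token(const std::string& piece) {
  static const std::unordered_set<std::string> kKnownSpecialTokens = {
      "<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>",
      "<start_of_turn>", "<end_of_turn>",
      "<s>", "</s>", "<pad>", "<unk>",
  };
  if (kKnownSpecialTokens.count(piece) != 0) {
    return true;
  }
  // Llama-format special tokens look like <|...|>.
  return piece.size() >= 4 && piece.rfind("<|", 0) == 0 &&
         piece.compare(piece.size() - 2, 2, "|>") == 0;
}

// When echo == false, special tokens are dropped from the printed stream;
// when echo == true, every piece is forwarded unchanged.
std::function<void(const std::string&)> make_wrapped_callback(
    std::function<void(const std::string&)> user_cb, bool echo) {
  return [user_cb = std::move(user_cb), echo](const std::string& piece) {
    if (!echo && is_special_token(piece)) {
      return;
    }
    user_cb(piece);
  };
}
```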

Why this is split out
---------------------
These are runner-behavior changes that affect ALL TextLLMRunner users,
not just the new chat-template path. They deserve focused review for
backward-compat impact (echo gating) and EOS-set semantics (merge vs
clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library) — only for stack ordering; this PR has no
            include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
Commit 3/4: Python bindings + LlamaRunner integration

Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file
    static factories, format(prompt, system_prompt) and
    format_conversation(ChatConversation) methods, and includes_bos()
    (a rough binding sketch follows this list).
* extension/llm/runner/__init__.py: re-exports the new bindings via
  __all__.
* extension/llm/runner/_llm_runner.pyi: type stubs for the new
  classes so consumers get IDE / mypy support.
* extension/llm/runner/test/test_runner_pybindings.py: Python tests
  covering the new bindings end-to-end.
* examples/models/llama/runner/generation.py: LlamaRunner now accepts
  chat_format / system_prompt / chat_template_file kwargs and exposes
  _format_prompt + chat_completion using the JinjaChatFormatter.
  Default chat_format is 'none' (matches llama_main, preserves
  backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner
  callers). _resolve_template_type maps 'llama3.2' / 'llama32' /
  'llama3_2' to ChatTemplateType.Llama32 (consistent with C++
  parseChatTemplateType).
* examples/models/llama/runner/eager.py: adds --chat_template_file CLI
  flag for chat mode.
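
As a rough idea of what the binding surface could look like (not the actual pybindings.cpp diff), assuming pybind11 and C++ member names that mirror the Python ones; every C++ identifier below other than the Python-visible names listed above is an assumption.

```cpp
// Hedged pybind11 sketch; Python-visible names follow the commit message,
// the C++ namespaces, member names, and enum values are assumed.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include <executorch/extension/llm/runner/jinja_chat_formatter.h>  // assumed

namespace py = pybind11;
namespace llm = executorch::extension::llm;

void register_chat_template_bindings(py::module_& m) {
  py::enum_<llm::ChatTemplateType>(m, "ChatTemplateType")
      .value("None_", llm::ChatTemplateType::None_)
      .value("Llama3", llm::ChatTemplateType::Llama3)
      .value("Llama32", llm::ChatTemplateType::Llama32)
      .value("Gemma3", llm::ChatTemplateType::Gemma3)
      .value("Custom", llm::ChatTemplateType::Custom);

  py::class_<llm::ChatMessage>(m, "ChatMessage")
      .def(py::init<std::string, std::string>(),
           py::arg("role"), py::arg("content"));

  py::class_<llm::JinjaChatFormatter>(m, "JinjaChatFormatter")
      .def_static("from_template", &llm::JinjaChatFormatter::fromTemplate)
      .def_static("from_string", &llm::JinjaChatFormatter::fromString)
      .def_static("from_file", &llm::JinjaChatFormatter::fromFile)
      .def("format", &llm::JinjaChatFormatter::format,
           py::arg("prompt"), py::arg("system_prompt") = "")
      .def("includes_bos", &llm::JinjaChatFormatter::includesBos);
  // ChatConversation and format_conversation would be bound the same way.
}
```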

Why this is split out
---------------------
Python changes are independently testable and reviewers may want
different eyes on the Python vs C++ paths. Also isolates the
backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library headers/symbols).

Original PR (full stack): pytorch#16987
Commit 4/4: llama_main CLI flags + chat_formatter wrapper + universal Jinja docs

Part 4 of the chat-template support stack split out of pytorch#16987. This is
the user-facing surface that wires everything together.

What this PR adds
-----------------
* examples/models/llama/runner/chat_formatter.h: Example-local
  ChatFormatter abstraction with NoChatFormatter and a
  JinjaChatFormatterAdapter wrapping
  executorch::extension::llm::JinjaChatFormatter. parse_chat_format is
  case-insensitive and trims whitespace, so 'Llama3', ' llama3 ',
  'LLAMA3' all map correctly. create_chat_formatter throws
  std::invalid_argument when chat_format=jinja is passed without
  --chat_template_file (no more silent no-op).
* examples/models/llama/main.cpp: Adds --chat_format,
  --chat_template_file, --system_prompt, --echo flags. Wraps the
  prompt with the chat formatter, catches invalid_argument /
  std::exception from formatter creation with clear error messages.
  Wires GenerationConfig.echo from the new --echo flag.
* examples/models/llama/runner/CMakeLists.txt + targets.bzl: link
  llama_runner against jinja2cpp (transitive include in chat_formatter.h).
* examples/models/llama/CMakeLists.txt: add a guarded
  FetchContent_Declare(jinja2cpp) so the example builds standalone
  (when the parent build hasn't already added jinja2cpp via
  extension/llm/chat_template), without redeclaring when it has.
* examples/models/llama/README.md: documents the new flags AND the
  recommended workflow of passing any HuggingFace / vLLM Jinja
  template via --chat_template_file (universal Jinja support).
* extension/llm/runner/README.md: documents universal Jinja support
  for the LLM runner library — points at vLLM examples and HF
  tokenizer_config.json files as supported template sources.

Why this is split out
---------------------
This is the user-facing CLI integration that depends on PRs A and C.
It's the most reviewable in isolation since it's example code with
lower blast radius — reviewers can focus on the CLI ergonomics and
docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per reviewer feedback);
documentation here points users to vLLM's examples directory and
HuggingFace tokenizer_config.json files, which the universal Jinja
support handles directly.

Depends on:
  - PR-A: extension/llm/chat_template/* + JinjaChatFormatter library
  - PR-C: chat_formatter.h includes JinjaChatFormatter (header-only),
          but generation.py / eager.py changes are independent

Original PR (full stack): pytorch#16987