
[llama] Add chat format support for Llama 3 Instruct models #16987

Open

seyeong-han wants to merge 1 commit into pytorch:main from seyeong-han:chat-format-support-llama-runner

Conversation

@seyeong-han
Contributor

Summary

Without chat formatting, Instruct models behave like base models and never generate end-of-turn tokens.

So, I added chat template support to the llama_main runner for Llama 3 Instruct models.

Problem

When running Llama 3 Instruct models without proper chat formatting:

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --prompt="What's the capital of France?"

Output:

Paris is incorrect answer so was stated so some major tech companies may choose the other city, and for instance the CEO of Apple is Tim Cook and he is from California, and the CEO of Google is Sundar Pichai and he is from India.

==> The model rambles until it hits the max_tokens limit; it never emits an end-of-turn token.

Solution

New CLI flags enable proper Instruct model behavior:

| Flag | Description | Default |
|------|-------------|---------|
| --chat_format | Chat template (llama3, none) | none |
| --system_prompt | System prompt for assistant behavior | (empty) |
| --echo | Echo input prompt in output | true |

Examples

Basic Usage

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="llama3" \
    --prompt="What's the capital of France?"

Output:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What's the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The capital of France is Paris.

Clean Output (Recommended for Apps)

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="llama3" \
    --echo=false \
    --prompt="What's the capital of France?"

Output:

The capital of France is Paris.

With System Prompt

cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B-Instruct/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="llama3" \
    --system_prompt="You are a helpful assistant. Be concise and respond in one sentence." \
    --echo=false \
    --prompt="Explain quantum computing"

Output:

Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously to solve complex problems faster than classical computers.

Backward Compatible (No Chat Format)

# For base models or pre-formatted prompts
cmake-out/examples/models/llama/llama_main \
    --model_path="llama/Llama-3.2-1B/model.pte" \
    --tokenizer_path="llama/original/tokenizer.model" \
    --chat_format="none" \
    --prompt="Once upon a time"

Files Changed

| File | Description |
|------|-------------|
| examples/models/llama/runner/chat_formatter.h | NEW - ChatFormatter for Llama 3 |
| examples/models/llama/main.cpp | Added CLI flags |
| examples/models/llama/README.md | Documentation with examples |
| extension/llm/runner/text_llm_runner.cpp | Special token filtering |
| extension/llm/runner/llm_runner_helper.cpp | EOS token merge logic |

Before/After Comparison

| Metric | Before (no chat format) | After (--chat_format=llama3 --echo=false) |
|--------|-------------------------|-------------------------------------------|
| Tokens generated | 120 (max limit) | 7 (stops at EOS) |
| Output | Rambling continuation | Clean answer |
| Special tokens | Visible in output | Filtered out |

Test Plan

  • Build with make llama-cpu
  • Test --chat_format=llama3 with Llama-3.2-1B-Instruct
  • Verify generation stops at <|eot_id|> token
  • Test --echo=false produces clean output without special tokens
  • Test --system_prompt affects model behavior
  • Backward compatible with --chat_format=none (default)

@pytorch-bot

pytorch-bot Bot commented Jan 29, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16987

Note: Links to docs will display an error until the docs builds have been completed.

❌ 100 New Failures, 2 Cancelled Jobs

As of commit 2d09315 with merge base 4930e7c:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 29, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@seyeong-han
Contributor Author

seyeong-han commented Feb 2, 2026

Summary

  • Add Jinja2Cpp-based chat template support for the Llama runner, including embedded templates, template types, and Jinja formatter bindings.
  • Import vLLM Llama3.2 and Gemma3 Jinja templates and normalize them for Jinja2Cpp compatibility.
  • Wire Jinja formatting into C++/Python runners (CLI flags, pybinds), add tests, and update docs.
  • Fix build/link integration for Jinja2Cpp and address missing nonstd headers during install.

Result

CMAKE_POLICY_VERSION_MINIMUM=3.5 make llama-cpu
cmake-out/examples/models/llama/llama_main \
--model_path llama/Llama-3.2-1B-Instruct/model.pte \
--tokenizer_path llama/tokenizer.json \
--chat_template_file extension/llm/runner/templates/tool_chat_template_llama3.2_pythonic.jinja \
--prompt "What is the capital of France?"
I tokenizers:regex.cpp:27] Registering override fallback regex
I tokenizers:hf_tokenizer.cpp:152] Setting up normalizer...
I tokenizers:hf_tokenizer.cpp:158] Normalizer field is null, skipping
I tokenizers:hf_tokenizer.cpp:170] Setting up pretokenizer...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1770067519.722113  161182 re2.cc:237] Error parsing '((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n...': invalid perl operator: (?!                                                                                                           
I tokenizers:re2_regex.cpp:27] Re2 failed to compile regex: ((?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid perl operator: (?!                                                                                            
This may be ok if a fallback regex is used.
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:hf_tokenizer.cpp:174] Pretokenizer set up
I tokenizers:hf_tokenizer.cpp:190] Loading BPE merges...
I tokenizers:hf_tokenizer.cpp:250] Loaded 280147 BPE merge rules
I tokenizers:hf_tokenizer.cpp:262] Built merge ranks map with 127744 entries
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|end_of_text|>' (id=128001)
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|eom_id|>' (id=128008)
I tokenizers:hf_tokenizer.cpp:417] Detected stop token: '<|eot_id|>' (id=128009)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant with tool calling capabilities. Only reply with a tool call if the function exists in the library provided by the user. If it doesn't exist, just reply directly in natural language. When you receive a tool call response, use the output to format an answer to the original user question.<|eot_id|><|start_header_id|>user<|end_header_id|>                                                                                            

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


"Bonjour, I'm happy to help. The capital of France is Paris."

PyTorchObserver {"prompt_tokens":104,"generated_tokens":16,"model_load_start_ms":1770067520100,"model_load_end_ms":1770067522061,"inference_start_ms":1770067522061,"inference_end_ms":1770067524284,"prompt_eval_end_ms":1770067523133,"first_token_ms":1770067523133,"aggregate_sampling_time_ms":5,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

@larryliu0820

Comment on lines +25 to +26
Llama3, // Llama 3.x Instruct models
Gemma3, // Gemma 3 Instruct models
Contributor


Why are these 2 special?

Contributor Author


I just wanted to cover llama3 and gemma3 for the initial support; I'll cover more models, such as Qwen and other ExecuTorch-supported LLMs, in follow-ups.

Comment thread examples/models/llama/runner/eager.py Outdated
)

parser.add_argument(
"--chat_format",
Contributor


Is chat_format an existing flag?

Should we standardize on chat_template_file?

Contributor Author


Updated: eager.py now uses --chat_template_file only. The --chat_format flag was removed for consistency with the C++ runner.

If you build with `make llama-cpu` and hit a RapidJSON CMake error, run it as:
CMAKE_POLICY_VERSION_MINIMUM=3.5 make llama-cpu
Contributor


Is this error always hit?

What cmake version do we require in default build?

Contributor Author


The CMAKE_POLICY_VERSION_MINIMUM=3.5 workaround is required because RapidJSON’s CMakeLists uses a minimum version that CMake 4+ no longer accepts, and Jinja2Cpp pulls RapidJSON in its internal deps. This is a dependency‑compatibility workaround for specific build environments, not a permanent ExecuTorch requirement.

@@ -0,0 +1,123 @@
{#- Begin-of-sequence token to start the model prompt -#}
Contributor


Are these templates available from download from HF?

Do we need to check them in?

Contributor Author


We won’t check in the vLLM templates. Instead we’ll reference/download them from Hugging Face and document this in the LLM runner README with a --chat_template_file example.

auto value = eos_id.toScalar().to<int64_t>();
eos_ids.emplace(value);
ET_LOG(Info, "eos_id = %" PRId64, value);
if (eos_ids.find(value) == eos_ids.end()) {
Contributor


Do EOS still come from model metadata? Is there a standard way to get them from tokenizer in HF?

Contributor Author


Yes—EOS still comes from model metadata if present, but we first read it from the tokenizer. In llm_runner_helper.cpp we use tokenizer->eos_tok() and tokenizer->stop_tokens() (populated from HF tokenizer.json), then merge any additional EOS IDs from model metadata. So the standard HF path is supported, and metadata is an additive fallback when the model exports extra stop IDs.
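
A minimal sketch of that merge order (standalone C++, not the actual llm_runner_helper.cpp code; primary_eos and stop_tokens stand in for tokenizer->eos_tok() / tokenizer->stop_tokens(), and metadata_eos_ids stands in for the IDs read from model metadata):

#include <cstdint>
#include <unordered_set>
#include <vector>

// Tokenizer-derived stop tokens come first; metadata IDs are merged in
// additively, never clearing what the tokenizer provided.
std::unordered_set<uint64_t> merge_eos_ids(
    uint64_t primary_eos,
    const std::vector<uint64_t>& stop_tokens,
    const std::vector<int64_t>& metadata_eos_ids) {
  std::unordered_set<uint64_t> eos_ids;
  eos_ids.insert(primary_eos); // tokenizer's primary EOS
  eos_ids.insert(stop_tokens.begin(), stop_tokens.end()); // HF stop tokens
  for (int64_t extra : metadata_eos_ids) { // e.g. extra stop IDs the model exports
    eos_ids.insert(static_cast<uint64_t>(extra));
  }
  return eos_ids;
}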

@@ -0,0 +1,38 @@
/*
Contributor


This file is just testing the 2 special cases of llama3 and gemma3?

Is there a test for the more general jinja file?

Contributor Author


For now we only test the supported model templates (Llama3/Gemma3). A generic Jinja file smoke test can be added later if needed, but it’s not required to validate supported templates in this PR.

" warming=" + (config.warming ? "True" : "False") + ">";
});

// Bind chat template helpers
Contributor


We don't have bindings for text llm runner right? Is this being used in multimodal runner?

Contributor Author


Currently, this will only be used for multimodal runner, but will be extended to text_llm_runner in the next PR.

Comment thread CMakeLists.txt Outdated
string_view.hpp
"${_jinja2cpp_nonstd_root}/string-view-lite/include"
)
endif()
Contributor


This whole section should live inside extension/llm/chat_template/ directory.

Contributor Author


Moved the Jinja2Cpp FetchContent and nonstd/RapidJSON workaround into extension/llm/chat_template/CMakeLists.txt; the root now only adds that subdirectory.

@@ -0,0 +1,51 @@
#pragma once
Contributor


I think we should start a new directory extension/llm/chat_template/ and put these files inside.

Contributor Author


Done. chat_templates.h now lives under extension/llm/chat_template/ and all includes were updated.

Comment thread CMakeLists.txt
Comment on lines +314 to +320
if(TARGET jinja2cpp)
install(
TARGETS jinja2cpp
EXPORT ExecuTorchTargets
DESTINATION ${CMAKE_INSTALL_LIBDIR}
)
endif()
Contributor


Can we move this to chat_template/CMakeLists.txt as well?

Contributor

@larryliu0820 larryliu0820 left a comment


I think this is a good start! Thank you for adding this

@meta-codesync
Contributor

meta-codesync Bot commented Feb 4, 2026

@seyeong-han has imported this pull request. If you are a Meta employee, you can view this in D92221320.

@meta-codesync
Contributor

meta-codesync Bot commented Feb 10, 2026

@seyeong-han has exported this pull request. If you are a Meta employee, you can view the originating Diff in D92221320.

@kirklandsign kirklandsign requested a review from Copilot February 24, 2026 23:38
@kirklandsign
Contributor

Thanks! Do we plan to integrate into llm runner?


Copilot AI left a comment


Pull request overview

Adds Jinja2-based chat template formatting support to the ExecuTorch LLM runner stack so Instruct models (notably Llama 3.x / 3.2) generate proper end-of-turn behavior, and exposes these capabilities through the C++ runner, Python bindings, and llama example apps.

Changes:

  • Introduces JinjaChatFormatter + chat types + embedded templates (Llama3/Llama3.2/Gemma3), with C++ and Python bindings plus tests.
  • Adds llama example CLI + Python runner support for chat formatting (system prompt, template file, echo control) and ships sample templates.
  • Wires in build system changes (CMake/Buck/Bazel-ish) and updates EOS handling / output token filtering.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 11 comments.

| File | Description |
|------|-------------|
| shim_et/xplat/executorch/build/build_variables.bzl | Adds new runner source to xplat build variables |
| extension/llm/runner/text_llm_runner.cpp | Filters special tokens from printed output |
| extension/llm/runner/test/test_runner_pybindings.py | Adds Python tests for new chat template bindings |
| extension/llm/runner/test/test_jinja_chat_formatter.cpp | New C++ unit tests for chat formatter + parsing |
| extension/llm/runner/test/targets.bzl | Registers new C++ test target |
| extension/llm/runner/test/CMakeLists.txt | Adds new test source to CMake test suite |
| extension/llm/runner/test/BUCK | Updates fbcode test target definitions/cleanup |
| extension/llm/runner/templates/tool_chat_template_llama3.2_pythonic.jinja | New sample Llama3.2 tool chat template |
| extension/llm/runner/templates/tool_chat_template_gemma3_pythonic.jinja | New sample Gemma3 tool chat template |
| extension/llm/runner/targets.bzl | Exports new headers/sources + adds template deps |
| extension/llm/runner/pybindings.cpp | Exposes chat types/template formatter to Python |
| extension/llm/runner/llm_runner_helper.cpp | Tweaks EOS IDs collection/logging |
| extension/llm/runner/jinja_chat_formatter.h | New Jinja chat formatter API |
| extension/llm/runner/jinja_chat_formatter.cpp | Implements template loading/normalization/rendering |
| extension/llm/runner/chat_types.h | New C++ structs for messages/conversations |
| extension/llm/runner/_llm_runner.pyi | Adds Python typing for new bindings |
| extension/llm/runner/__init__.py | Exports new bindings from package |
| extension/llm/runner/README.md | Documents chat templates usage for runner/llama_main |
| extension/llm/runner/CMakeLists.txt | Links runner lib against jinja2cpp + defines macro |
| extension/llm/chat_template/targets.bzl | New Buck target exporting embedded templates header |
| extension/llm/chat_template/chat_templates.h | New embedded templates + model token metadata |
| extension/llm/chat_template/CMakeLists.txt | Fetches/builds jinja2cpp for CMake builds |
| extension/llm/chat_template/BUCK | Registers chat_template targets |
| examples/models/llama/runner/targets.bzl | Exports new chat_formatter header |
| examples/models/llama/runner/generation.py | Adds chat formatting support to Python runner |
| examples/models/llama/runner/eager.py | Adds CLI args for system prompt/template file |
| examples/models/llama/runner/chat_formatter.h | New C++ chat formatter wrapper for llama_main |
| examples/models/llama/runner/CMakeLists.txt | Links llama_runner against jinja2cpp |
| examples/models/llama/main.cpp | Adds chat-format CLI flags + integrates echo into config |
| examples/models/llama/README.md | Documents new CLI flags + examples |
| examples/models/llama/CMakeLists.txt | Fetches jinja2cpp for standalone llama example build |
| CMakeLists.txt | Adds FetchContent include, installs jinja2cpp if present, adds chat_template subdir |


Comment on lines +253 to +254
if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
    return chat_template_type.Llama3

Copilot AI Feb 24, 2026


_resolve_template_type maps Llama 3.2 variants ("llama3.2", "llama32", "llama3_2") to ChatTemplateType.Llama3, even though the bindings expose a distinct ChatTemplateType.Llama32 and the C++ parseChatTemplateType maps these variants to Llama32. This inconsistency can be surprising for callers; map these variants to ChatTemplateType.Llama32 to keep behavior consistent across languages.

Suggested change
-if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
-    return chat_template_type.Llama3
+if normalized == "llama3":
+    return chat_template_type.Llama3
+if normalized in ("llama3.2", "llama32", "llama3_2"):
+    return chat_template_type.Llama32

Comment on lines +5 to +34
include(FetchContent)
cmake_policy(SET CMP0077 NEW)

FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_BUILD_SHARED
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_INSTALL
OFF
CACHE BOOL ""
FORCE
)

FetchContent_MakeAvailable(jinja2cpp)
if(NOT TARGET jinja2cpp)
message(FATAL_ERROR "Jinja2Cpp target not found after FetchContent.")
endif()

Copilot AI Feb 24, 2026


SUPPORT_REGEX_LOOKAHEAD is set after FetchContent_MakeAvailable(jinja2cpp), so it won't affect any dependencies that are configured/built as part of jinja2cpp's CMake (e.g., if the option is meant to influence its regex backend). If this option is required to avoid the lookahead compilation error, set it in the cache before calling FetchContent_MakeAvailable.

Comment on lines 56 to 62
max_batch_size: int,
use_kv_cache: bool,
vocab_size: int,
chat_format: str = "llama3",
system_prompt: str = "",
chat_template_file: Optional[str] = None,
device: str = "cpu",

Copilot AI Feb 24, 2026


The default chat_format is now "llama3", which changes behavior for all existing callers that instantiate LlamaRunner without specifying chat formatting (e.g., NativeLlamaRunner, EagerLlamaRunner). This is likely a backward-incompatible change (and it also diverges from llama_main where --chat_format defaults to none). Consider defaulting to "none" here as well, or inferring the format from model metadata/name so base models don't get wrapped in an instruct template by default.

Comment on lines +100 to +115
inline ChatFormat parse_chat_format(const std::string& format_str) {
  static const std::unordered_map<std::string, ChatFormat> format_map = {
      {"none", ChatFormat::None},
      {"llama3", ChatFormat::Llama3},
      {"llama3.2", ChatFormat::Llama3},
      {"llama32", ChatFormat::Llama3},
      {"llama3_2", ChatFormat::Llama3},
      {"gemma3", ChatFormat::Gemma3},
      {"jinja", ChatFormat::Jinja},
  };
  auto it = format_map.find(format_str);
  if (it != format_map.end()) {
    return it->second;
  }
  return ChatFormat::None;
}

Copilot AI Feb 24, 2026


parse_chat_format does an exact lookup without normalizing case/whitespace, so values like "Llama3" or " llama3 " will silently fall back to None. Since this is wired to a CLI flag, it should be more forgiving (e.g., lowercasing and trimming before the lookup).

Comment on lines +130 to +153
inline std::unique_ptr<ChatFormatter> create_chat_formatter(
    ChatFormat format,
    const std::string& template_file = "") {
  using executorch::extension::llm::ChatTemplateType;
  using executorch::extension::llm::JinjaChatFormatter;

  if (!template_file.empty()) {
    return std::make_unique<JinjaChatFormatterAdapter>(
        JinjaChatFormatter::fromFile(template_file));
  }

  switch (format) {
    case ChatFormat::Llama3:
      return std::make_unique<JinjaChatFormatterAdapter>(
          JinjaChatFormatter::fromTemplate(ChatTemplateType::Llama3));
    case ChatFormat::Gemma3:
      return std::make_unique<JinjaChatFormatterAdapter>(
          JinjaChatFormatter::fromTemplate(ChatTemplateType::Gemma3));
    case ChatFormat::Jinja:
      return std::make_unique<NoChatFormatter>();
    case ChatFormat::None:
    default:
      return std::make_unique<NoChatFormatter>();
  }

Copilot AI Feb 24, 2026


ChatFormat::Jinja currently returns NoChatFormatter when template_file is empty, but the CLI supports --chat_format=jinja and the README describes jinja as “custom template from file”. This leads to a confusing no-op when the user forgets --chat_template_file. Consider emitting an error (or at least a warning) when format == ChatFormat::Jinja and no template file is provided.

Comment on lines 133 to 142
// Wrap the token_callback with print function
std::function<void(const std::string&)> wrapped_callback =
    [token_callback, config](const std::string& piece) {
      if (!config.warming) {
-       llm::safe_printf(piece.c_str());
-       fflush(stdout);
+       // Filter out special tokens when not echoing or for cleaner output
+       if (!is_special_token(piece)) {
+         llm::safe_printf(piece.c_str());
+         fflush(stdout);
+       }
      }

Copilot AI Feb 24, 2026


wrapped_callback filters out any piece that looks like a special token regardless of GenerationConfig.echo. This means tokens like <|eot_id|> will be suppressed even when users explicitly want to see raw model output (and it can also hide literal text that happens to match <|...|>). Consider gating this filtering on !config.echo (or adding an explicit filter_special_tokens config), and/or filtering by token IDs (e.g., stop token set) rather than a string-pattern heuristic.

Comment on lines +45 to +46
{"not tools is none", "not tools"},
{"not tools is None", "not tools"},

Copilot AI Feb 24, 2026


In normalizeTemplate, the replacement "not tools is none" -> "not tools" changes semantics when tools is a non-empty list. In Jinja, not tools is none is effectively tools is not none (true even when tools is non-empty), but not tools becomes false for non-empty lists, which will skip tool blocks in templates like tool_chat_template_llama3.2_pythonic.jinja. Adjust this normalization to preserve the intended meaning (e.g., map not tools is none to tools or to an explicit non-empty check, depending on the desired behavior).

Suggested change
-{"not tools is none", "not tools"},
-{"not tools is None", "not tools"},
+{"not tools is none", "tools"},
+{"not tools is None", "tools"},

Comment on lines +1 to +16
if(NOT EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER)
return()
endif()

include(FetchContent)
cmake_policy(SET CMP0077 NEW)

FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS
OFF

Copilot AI Feb 24, 2026


chat_templates.h is included by public runner headers (e.g., jinja_chat_formatter.h), but this CMake subdir doesn't install the extension/llm/chat_template/*.h headers. As a result, make install / consumers of the installed SDK may fail to compile due to missing headers. Add an install(DIRECTORY ...) for this header (and/or define/install an interface target that exports the include dir).

Comment on lines +8 to +32
FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)

set(JINJA2CPP_BUILD_TESTS
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_BUILD_SHARED
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_INSTALL
OFF
CACHE BOOL ""
FORCE
)

FetchContent_MakeAvailable(jinja2cpp)
if(NOT TARGET jinja2cpp)

Copilot AI Feb 24, 2026


This CMake file downloads and builds the jinja2cpp dependency at configure/build time via FetchContent_Declare from a remote Git repository pinned only to a Git tag, without any checksum or signature verification. If the upstream repository or the 1.3.2 tag is compromised or retagged, a malicious actor could inject arbitrary code into your build artifacts and any environment where this target is built. To harden the supply chain, vendor this dependency or fetch it from a trusted mirror, pin GIT_TAG to an immutable commit SHA, and add integrity verification (or otherwise avoid network-based fetching) for production builds.

Comment on lines +51 to +75
include(FetchContent)
cmake_policy(SET CMP0077 NEW)
FetchContent_Declare(
jinja2cpp
GIT_REPOSITORY https://github.com/jinja2cpp/Jinja2Cpp.git
GIT_TAG 1.3.2
GIT_SUBMODULES_RECURSE TRUE
)
set(JINJA2CPP_BUILD_TESTS
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_BUILD_SHARED
OFF
CACHE BOOL ""
FORCE
)
set(JINJA2CPP_INSTALL
OFF
CACHE BOOL ""
FORCE
)
FetchContent_MakeAvailable(jinja2cpp)


Copilot AI Feb 24, 2026


This example CMake configuration pulls jinja2cpp directly from GitHub at configure/build time using FetchContent_Declare with only a Git tag and no integrity verification. A compromise of the upstream repository or the 1.3.2 tag could transparently inject malicious code into developers’ builds or any CI system that runs these example targets. Consider vendoring the dependency or using an internal mirror, pinning to an immutable commit SHA, and adding checksum/signature verification instead of relying on a network fetch to an unverified third-party Git repo.

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions Bot added the Stale PRs inactive for over 60 days label Apr 26, 2026
@kirklandsign
Contributor

@claude review

@claude

claude Bot commented Apr 30, 2026

Claude finished @kirklandsign's task in 2m 29s


Code Review: Chat Format Support for Llama 3 Instruct Models

  • Read core implementation files
  • Review build system changes
  • Review test coverage
  • Check for security/correctness issues
  • Provide comprehensive feedback

Overall this is a useful feature. The architecture is clean — a Jinja2Cpp-backed formatter behind an abstract ChatFormatter interface, with factory functions and proper CLI integration. Below are the issues I found, ordered by severity.


High Severity

1. normalizeTemplate changes semantics for tools handling
extension/llm/runner/jinja_chat_formatter.cpp:45-46

{"not tools is none", "not tools"},
{"not tools is None", "not tools"},

In Jinja2, not tools is none means tools is not none (i.e., true when tools is defined and non-null — even if non-empty). But not tools evaluates to false for a non-empty list, which will incorrectly skip tool-use blocks in templates like tool_chat_template_llama3.2_pythonic.jinja when tools is passed as []. The correct replacement should be tools is defined or simply tools (truthy check for non-None), depending on the template's intent. Given that tools is always passed as [] in params, this likely works by accident today but will break if tool-use support is added later.
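
A quick way to see the difference, as a standalone sketch against the Jinja2Cpp API this PR already uses (Template::Load / RenderAsString; the template string here is illustrative, not from the PR):

#include <iostream>
#include <jinja2cpp/template.h>

int main() {
  // A template gated on `not tools` skips its tool block exactly when
  // tools are actually supplied, inverting the intended check.
  jinja2::Template tpl;
  tpl.Load("{% if not tools %}no tools{% else %}tools present{% endif %}");
  // Empty list: prints "no tools"
  std::cout << tpl.RenderAsString({{"tools", jinja2::ValuesList{}}}).value()
            << "\n";
  // Non-empty list: prints "tools present"
  std::cout << tpl.RenderAsString(
                   {{"tools", jinja2::ValuesList{"get_weather"}}})
                   .value()
            << "\n";
}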


2. Special token filtering is unconditional and uses string heuristics
extension/llm/runner/text_llm_runner.cpp:28-61

The is_special_token function unconditionally filters tokens from output regardless of config.echo. This means:

  • When echo=true, users still can't see special tokens even if they explicitly want raw output.
  • The <|...|> regex-style heuristic (text[1] == '|' && text[text.size()-2] == '|') can match legitimate user content that happens to look like a special token.

The filtering should be gated on !config.echo or a separate filter_special_tokens config option, and ideally match by token ID (using the EOS/stop token set) rather than string patterns.
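
One possible shape for that gating (a sketch with stand-in types, not the actual runner code; is_special_token is the helper this PR adds, and the GenerationConfig fields mirror the excerpt above):

#include <cstdio>
#include <functional>
#include <string>

struct GenerationConfig { // stand-in for the runner's config
  bool warming = false;
  bool echo = true;
};

bool is_special_token(const std::string& piece); // added by this PR

std::function<void(const std::string&)> make_wrapped_callback(
    GenerationConfig config) {
  return [config](const std::string& piece) {
    if (config.warming) {
      return;
    }
    // Filter special tokens only when echo is off; echo == true keeps
    // the raw stream (including <|...|> tokens) intact.
    if (config.echo || !is_special_token(piece)) {
      fputs(piece.c_str(), stdout); // real code uses llm::safe_printf
      fflush(stdout);
    }
  };
}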


3. Backward-incompatible default in Python runner
examples/models/llama/runner/generation.py:59

chat_format: str = "llama3",

The Python LlamaRunner defaults to "llama3", while the C++ llama_main defaults to "none". This breaks existing callers of EagerLlamaRunner/NativeLlamaRunner that don't pass chat_format — they'll now get Llama3 chat formatting applied to their prompts unexpectedly. This should default to "none" for consistency and backward compatibility.



Medium Severity

4. parse_chat_format doesn't normalize input
examples/models/llama/runner/chat_formatter.h:100-115

Unlike the extension-level parseChatTemplateType (which lowercases), the example-level parse_chat_format does an exact std::unordered_map lookup. CLI values like "Llama3" or "LLAMA3" silently fall through to ChatFormat::None. Since this is user-facing, it should lowercase before lookup:

inline ChatFormat parse_chat_format(const std::string& format_str) {
  std::string lower = format_str;
  std::transform(lower.begin(), lower.end(), lower.begin(), ::tolower);
  // ... lookup on lower ...
}


5. --chat_format=jinja without --chat_template_file silently does nothing
examples/models/llama/runner/chat_formatter.h:148-149

When a user passes --chat_format=jinja but forgets --chat_template_file, create_chat_formatter returns a NoChatFormatter. This is a confusing no-op. It should emit an error or at minimum a warning.
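
A fail-fast sketch (the enum and flag names come from this PR; the message text is illustrative):

#include <stdexcept>
#include <string>

enum class ChatFormat { None, Llama3, Gemma3, Jinja }; // as in chat_formatter.h

void validate_chat_format(
    ChatFormat format, const std::string& template_file) {
  if (format == ChatFormat::Jinja && template_file.empty()) {
    throw std::invalid_argument(
        "--chat_format=jinja requires --chat_template_file to be set");
  }
}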

6. SUPPORT_REGEX_LOOKAHEAD set after FetchContent_MakeAvailable
extension/llm/chat_template/CMakeLists.txt:102-106

This cache variable is set after FetchContent_MakeAvailable(jinja2cpp) (line 31), so if it's meant to influence jinja2cpp's regex backend configuration, it won't take effect. Move it before line 31 if it's relevant to jinja2cpp.

7. Duplicate FetchContent_Declare(jinja2cpp) in two CMakeLists
Both extension/llm/chat_template/CMakeLists.txt and examples/models/llama/CMakeLists.txt declare and fetch jinja2cpp independently. The example one doesn't benefit from the nonstd header copying logic, and the duplication creates maintenance burden. Consider having the example CMake just find_package or rely on the parent build's target.

8. _resolve_template_type maps Llama 3.2 variants to Llama3 instead of Llama32
examples/models/llama/runner/generation.py:253-254

if normalized in ("llama3", "llama3.2", "llama32", "llama3_2"):
    return chat_template_type.Llama3

The C++ parseChatTemplateType maps "llama3.2" variants to ChatTemplateType::Llama32. The Python should do the same for consistency. While they currently share the same template, if they diverge later this will be a subtle bug.


Low Severity / Suggestions

9. Supply-chain pinning: Both FetchContent_Declare calls pin jinja2cpp by Git tag (1.3.2), not by commit SHA. If the tag is rewritten upstream, a different (possibly malicious) revision could be fetched. Consider pinning to an immutable commit SHA.

10. Missing tools in Jinja params: formatConversation doesn't inject a tools variable into the template params. The normalizations in normalizeTemplate paper over this for the embedded templates, but any custom template that references tools (like the shipped tool_chat_template_*.jinja files) will fail at render time or produce unexpected output. Consider passing tools = jinja2::ValuesList() in the params map; a sketch follows this list.

11. kKnownSpecialTokens as a global unordered_set: This is initialized at static-init time. It's small enough that this is fine, but for a library intended for embedded use, consider a constexpr array with linear search or a switch-based approach to avoid heap allocation at startup.

12. Test coverage: The tests validate the formatter in isolation but don't test the end-to-end integration (e.g., that --chat_format=llama3 --echo=false actually suppresses the template prefix in generated output). An integration test would catch issues like the special-token filtering being unconditional.
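
Returning to item 10, a sketch of injecting the defaults before rendering (ValuesMap/ValuesList are Jinja2Cpp types; the key names follow common HuggingFace/vLLM template variables and are assumptions, not this PR's exact code):

#include <jinja2cpp/value.h>

// Defaults merged into the render params so templates that reference
// `tools` (or check add_generation_prompt) render instead of erroring.
jinja2::ValuesMap make_default_params() {
  jinja2::ValuesMap params;
  params["tools"] = jinja2::ValuesList(); // empty list rather than undefined
  params["add_generation_prompt"] = true;
  return params;
}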


CI Failures

The 100 CI failures appear to be build failures stemming from Boost regex headers pulled in by jinja2cpp (-Wshorten-64-to-32 errors on Apple builds) and general compilation issues. The Apple builds explicitly fail on:

boost/regex/v5/c_regex_traits.hpp:461:17: error: implicit conversion loses integer precision: 'long' to 'int' [-Werror,-Wshorten-64-to-32]

This needs to be addressed — either by suppressing the warning for jinja2cpp's transitive dependencies or by finding a version of jinja2cpp/boost that doesn't trigger it on Apple toolchains.


Summary

The feature is well-designed and the test suite covers the core formatter logic well. The main actionable items are: (1) fix the normalizeTemplate semantics for tools, (2) gate special-token filtering on !config.echo, (3) fix the Python default to "none", and (4) resolve the Apple CI build failures from Boost. The inconsistency between parse_chat_format (no case normalization) and parseChatTemplateType (has case normalization) should also be fixed since it's user-facing.

@kirklandsign
Contributor

@claude any suggestions to split this into smaller PR stack?

@claude

claude Bot commented May 4, 2026

Claude finished @kirklandsign's task in 1m 50s


PR Split Suggestion

  • Review changed files and dependencies
  • Identify logical boundaries for splitting
  • Propose PR stack with ordering and dependencies

Here's how I'd split this into a 4-PR stack, ordered by dependency:


PR 1: Core Jinja2Cpp chat formatter library (foundation)

Files:

  • extension/llm/chat_template/CMakeLists.txt — FetchContent for jinja2cpp
  • extension/llm/chat_template/chat_templates.h — ChatTemplateType enum, embedded templates, ModelTokens
  • extension/llm/chat_template/targets.bzl / BUCK
  • extension/llm/runner/chat_types.h — ChatMessage, ChatConversation structs
  • extension/llm/runner/jinja_chat_formatter.h / .cpp — JinjaChatFormatter class + parseChatTemplateType
  • extension/llm/runner/CMakeLists.txt changes (link against jinja2cpp)
  • extension/llm/runner/targets.bzl changes (new srcs/deps)
  • extension/llm/runner/test/test_jinja_chat_formatter.cpp + test build files
  • CMakeLists.txt — minimal changes to add extension/llm/chat_template subdir

Why first: This is the self-contained library with no downstream impact. It can be reviewed/tested in isolation. Addresses the CI build issues (Boost -Wshorten-64-to-32 on Apple) early.


PR 2: Special token filtering + EOS handling in TextLLMRunner

Files:

  • extension/llm/runner/text_llm_runner.cpp — is_special_token(), kKnownSpecialTokens, wrapped_callback filtering logic
  • extension/llm/runner/llm_runner_helper.cpp — EOS token merge logic changes

Why separate: These are behavior changes to the shared runner that affect all users of TextLLMRunner. They deserve focused review since (a) the special-token filtering is unconditional regardless of config.echo (flagged in previous reviews), and (b) the EOS merge logic touches tokenizer/model metadata interaction. Keeping it isolated makes it easy to evaluate backward-compat impact.


PR 3: Python bindings + Python runner support

Files:

  • extension/llm/runner/pybindings.cpp — expose ChatMessage, ChatConversation, ChatTemplateType, JinjaChatFormatter
  • extension/llm/runner/__init__.py / _llm_runner.pyi
  • examples/models/llama/runner/generation.py — _resolve_template_type, _format_prompt, chat formatting in generate()
  • examples/models/llama/runner/eager.py — CLI args
  • extension/llm/runner/test/test_runner_pybindings.py

Why separate: Python changes are independently testable, and reviewers may want different eyes on the Python vs C++ paths. Also isolates the backward-compat concern (default chat_format="llama3" in Python vs "none" in C++).


PR 4: Llama example CLI integration + templates + docs

Files:

  • examples/models/llama/runner/chat_formatter.h — ChatFormat enum, parse_chat_format, create_chat_formatter factory
  • examples/models/llama/main.cpp — new gflags, formatter wiring
  • examples/models/llama/CMakeLists.txt / runner/CMakeLists.txt / runner/targets.bzl
  • extension/llm/runner/templates/*.jinja — sample templates
  • examples/models/llama/README.md / extension/llm/runner/README.md
  • shim_et/xplat/executorch/build/build_variables.bzl

Why last: This is the user-facing integration that depends on all prior PRs. It's also the most reviewable in isolation since it's example code — lower blast radius.


Merge order

PR 1 (library) → PR 2 (runner behavior) → PR 3 (python) → PR 4 (CLI + docs)

PRs 2 and 3 are independent of each other (both depend only on PR 1), so they could land in parallel if desired:

         ┌─ PR 2 (runner behavior) ─┐
PR 1 ───┤                           ├──► PR 4 (CLI + docs)
         └─ PR 3 (python bindings) ─┘

Notes

  • PR 1 is the critical path — the Apple CI failures from Boost/jinja2cpp headers need to be fixed here (either suppress -Wshorten-64-to-32 for the dependency, pin to a commit SHA, or vendor a patched version).
  • PR 2 should address the review feedback about gating is_special_token on !config.echo before landing.
  • PR 3 should fix the default from chat_format="llama3" to "none" for backward compat.
  • Each PR is independently buildable and testable, which will make CI green on each step.

@seyeong-han
Contributor Author

Per @kirklandsign's request to split this into a smaller PR stack, this work is now broken into 4 reviewable PRs:

| PR | Subject | LOC |
|----|---------|-----|
| 1/4 — #19533 | Library + tests (extension/llm/chat_template/* + JinjaChatFormatter) | +821 / -28 |
| 2/4 — #19534 | TextLLMRunner echo-gated special-token filter + EOS merge | +58 / -6 (incremental) |
| 3/4 — #19535 | Python bindings + Python LlamaRunner integration | +256 / -12 (incremental) |
| 4/4 — #19536 | llama_main CLI flags + chat_formatter wrapper + universal Jinja docs | +424 / -84 (incremental) |

Each PR is reviewable in isolation. PR-A is the foundation (no behavior change); PRs B, C, and D each layer on a focused scope. The stack also addresses the reviewer feedback from this PR and the earlier @claude review:

  • ✅ normalizeTemplate not tools is none → tools (PR-A, addresses Copilot review)
  • ✅ vLLM-style params (tools=[], date_string, etc.) injected so any HF / vLLM template works (PR-A)
  • SUPPORT_REGEX_LOOKAHEAD ordered before FetchContent_MakeAvailable (PR-A)
  • chat_templates.h installed (PR-A)
  • ✅ Echo-gated special-token filter (PR-B, addresses Copilot review)
  • ✅ EOS metadata merge instead of clear (PR-B)
  • ✅ Python chat_format="none" default + Llama32 mapping (PR-C, addresses Copilot review)
  • parse_chat_format case-insensitive + trim (PR-D, addresses Copilot review)
  • chat_format=jinja w/o template_file throws (PR-D, addresses Copilot review)
  • ✅ FetchContent guard in llama example so no duplicate declaration (PR-D)
  • ✅ Sample vLLM .jinja templates removed; docs reference HF / vLLM template directories (per @metascroy)
  • ✅ Universal Jinja support: any HuggingFace / vLLM Jinja template works via --chat_template_file, with regression tests

Closing this PR in favor of the stack.

cc @kirklandsign @larryliu0820 @metascroy @lucylq @mergennachin

seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
Part 2 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/text_llm_runner.cpp: Add 'is_special_token()'
  with a small kKnownSpecialTokens set covering Llama 3.x, Gemma, and
  generic <s>/</s>/<pad>/<unk> tokens, plus a regex-style match for
  Llama-format <|...|> tokens. wrapped_callback now suppresses these
  from the printed stream when GenerationConfig.echo == false. When
  echo == true, raw model output (including chat-template tokens) is
  emitted unchanged - this preserves backward compatibility for users
  who explicitly want to see raw tokens.

* extension/llm/runner/llm_runner_helper.cpp: get_eos_ids() now MERGES
  the tokenizer's primary eos_tok() with any additional EOS IDs the
  model metadata exports under kEosIds, instead of clearing the set
  when metadata is present. This is correct for HF-tokenizer models
  (e.g. Llama 3.x) where eos_tok() = <|end_of_text|> but the model
  also wants <|eot_id|> as a stop token. Also logs the primary tok
  and only logs metadata IDs that are newly inserted.

Why this is split out
---------------------
These are runner-behavior changes that affect ALL TextLLMRunner users,
not just the new chat-template path. They deserve focused review for
backward-compat impact (echo gating) and EOS-set semantics (merge vs
clear).

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library) — only for stack ordering; this PR has no
            include or symbol dependency on that library.

Original PR (full stack): pytorch#16987
seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
…ration

Part 3 of the chat-template support stack split out of pytorch#16987.

What this PR adds
-----------------
* extension/llm/runner/pybindings.cpp: New pybind11 classes:
  - ChatMessage(role, content)
  - ChatConversation(messages, bos_token, eos_token, add_generation_prompt)
  - ChatTemplateType enum (None_, Llama3, Llama32, Gemma3, Custom)
  - JinjaChatFormatter with from_template / from_string / from_file
    static factories, format(prompt, system_prompt) and
    format_conversation(ChatConversation) methods, includes_bos().
* extension/llm/runner/__init__.py: re-exports the new bindings via
  __all__.
* extension/llm/runner/_llm_runner.pyi: type stubs for the new
  classes so consumers get IDE / mypy support.
* extension/llm/runner/test/test_runner_pybindings.py: Python tests
  covering the new bindings end-to-end.
* examples/models/llama/runner/generation.py: LlamaRunner now accepts
  chat_format / system_prompt / chat_template_file kwargs and exposes
  _format_prompt + chat_completion using the JinjaChatFormatter.
  Default chat_format is 'none' (matches llama_main, preserves
  backward compatibility for existing EagerLlamaRunner / NativeLlamaRunner
  callers). _resolve_template_type maps 'llama3.2' / 'llama32' /
  'llama3_2' to ChatTemplateType.Llama32 (consistent with C++
  parseChatTemplateType).
* examples/models/llama/runner/eager.py: adds --chat_template_file CLI
  flag for chat mode.

Why this is split out
---------------------
Python changes are independently testable and reviewers may want
different eyes on the Python vs C++ paths. Also isolates the
backward-compat concern around the chat_format default.

Depends on: PR-A (extension/llm/chat_template/* + JinjaChatFormatter
            library headers/symbols).

Original PR (full stack): pytorch#16987
seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
…Jinja docs

Part 4 of the chat-template support stack split out of pytorch#16987. This is
the user-facing surface that wires everything together.

What this PR adds
-----------------
* examples/models/llama/runner/chat_formatter.h: Example-local
  ChatFormatter abstraction with NoChatFormatter and a
  JinjaChatFormatterAdapter wrapping
  executorch::extension::llm::JinjaChatFormatter. parse_chat_format is
  case-insensitive and trims whitespace, so 'Llama3', ' llama3 ',
  'LLAMA3' all map correctly. create_chat_formatter throws
  std::invalid_argument when chat_format=jinja is passed without
  --chat_template_file (no more silent no-op).
* examples/models/llama/main.cpp: Adds --chat_format,
  --chat_template_file, --system_prompt, --echo flags. Wraps the
  prompt with the chat formatter, catches invalid_argument /
  std::exception from formatter creation with clear error messages.
  Wires GenerationConfig.echo from the new --echo flag.
* examples/models/llama/runner/CMakeLists.txt + targets.bzl: link
  llama_runner against jinja2cpp (transitive include in chat_formatter.h).
* examples/models/llama/CMakeLists.txt: add a guarded
  FetchContent_Declare(jinja2cpp) so the example builds standalone
  (when the parent build hasn't already added jinja2cpp via
  extension/llm/chat_template), without redeclaring when it has.
* examples/models/llama/README.md: documents the new flags AND the
  recommended workflow of passing any HuggingFace / vLLM Jinja
  template via --chat_template_file (universal Jinja support).
* extension/llm/runner/README.md: documents universal Jinja support
  for the LLM runner library — points at vLLM examples and HF
  tokenizer_config.json files as supported template sources.

Why this is split out
---------------------
This is the user-facing CLI integration that depends on PRs A and C.
It's the most reviewable in isolation since it's example code with
lower blast radius — reviewers can focus on the CLI ergonomics and
docs without re-reading library internals.

Sample vLLM templates are NOT checked in (per reviewer feedback);
documentation here points users to vLLM's examples directory and
HuggingFace tokenizer_config.json files, which the universal Jinja
support handles directly.

Depends on:
  - PR-A: extension/llm/chat_template/* + JinjaChatFormatter library
  - PR-C: chat_formatter.h includes JinjaChatFormatter (header-only),
          but generation.py / eager.py changes are independent

Original PR (full stack): pytorch#16987
seyeong-han added a commit to seyeong-han/executorch that referenced this pull request May 13, 2026
Foundation PR for the chat-template support stack. Adds the Jinja2Cpp-based
JinjaChatFormatter, supporting chat-types, embedded Llama3/Llama3.2/Gemma3
templates, build glue (CMake/Buck), and a focused C++ unit-test suite.
This PR is reviewable in isolation — it has no behavior change for any
existing runner; downstream PRs (B/C/D) plug it in.

This is part 1 of a 4-PR stack split out of pytorch#16987 per reviewer request:

  1/4 (this PR)  Library + tests
  2/4            TextLLMRunner echo-gated special-token filter + EOS merge
  3/4            Python bindings + Python LlamaRunner integration
  4/4            llama_main CLI flags + chat_formatter wrapper + docs

What this PR adds
-----------------
* extension/llm/chat_template/{chat_templates.h, BUCK, CMakeLists.txt,
  targets.bzl} — embedded Llama3/Llama3.2/Gemma3 templates and the
  ChatTemplateType enum + ModelTokens. The CMake file FetchContent's
  Jinja2Cpp 1.3.2, with SUPPORT_REGEX_LOOKAHEAD set BEFORE
  FetchContent_MakeAvailable so it propagates correctly, plus header
  staging for nonstd headers that some Jinja2Cpp installations omit.
  Installs chat_templates.h so SDK consumers can include it.
* extension/llm/runner/{chat_types.h, jinja_chat_formatter.{h,cpp}} — the
  Universal Jinja chat formatter that supports any HuggingFace / vLLM
  chat template, not just the embedded ones. Loadable via fromTemplate
  (built-in), fromString (any string), or fromFile (any .jinja file).
  formatConversation injects vLLM/HuggingFace-standard params (tools=[],
  tool_choice=None, date_string, chat_template_kwargs) so any template
  that references those variables renders correctly.
* normalizeTemplate handles vLLM/HF template quirks for Jinja2Cpp:
  notably, 'not tools is none' maps to 'tools' (truthy check), preserving
  the intent of 'tools is not none' for empty-list defaults.
* extension/llm/runner/{CMakeLists.txt, targets.bzl} — link
  extension_llm_runner against jinja2cpp (PRIVATE) and define
  EXECUTORCH_USE_JINJA2CPP.
* extension/llm/runner/test/{test_jinja_chat_formatter.cpp, CMakeLists.txt,
  targets.bzl, BUCK} — unit tests covering Llama3 / Llama3.2 / Gemma3
  embedded templates, parseChatTemplateType (case-insensitive), and
  three universal-Jinja regression tests:
    - generic HuggingFace-style template (proves it's not Llama-specific)
    - tools-aware template (validates the tools=[] default)
    - 'not tools is none' normalization regression test
* CMakeLists.txt — adds add_subdirectory(extension/llm/chat_template)
  guarded by EXECUTORCH_BUILD_EXTENSION_LLM_RUNNER.
* shim_et/xplat/executorch/build/build_variables.bzl — adds
  jinja_chat_formatter.cpp to the runner sources.

Notes
-----
* No behavior change for existing TextLLMRunner / MultimodalRunner users:
  the formatter is opt-in, only invoked when downstream code calls it.
* Sample vLLM templates are NOT checked in (per reviewer feedback);
  documentation in the follow-up CLI PR points users to vLLM's examples
  directory and HuggingFace tokenizer_config.json files.

Original PR (full stack): pytorch#16987

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported Stale PRs inactive for over 60 days



5 participants