feat(llama3.1-8b): expose vLLM engine params as CLI arguments #2553
ssaketh-ch wants to merge 4 commits into mlcommons:master
Conversation
Hardcoded vLLM memory and scheduling parameters are now configurable via CLI flags, with defaults that preserve existing behavior. No change to model outputs or sampling -- only memory allocation and scheduling.

New flags:
- `--gpu-memory-utilization`
- `--max-num-batched-tokens`
- `--max-num-seqs`
- `--enable-prefix-caching` / `--no-enable-prefix-caching`
- `--block-size`
- `--enforce-eager` / `--no-enforce-eager`
- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`
- `--max-model-len`
WG: Assigned to Thomas to review it.
attafosu left a comment:

Please address the comment on prefix-caching and it should be good to go.
    gpu_memory_utilization=gpu_memory_utilization,
    max_num_batched_tokens=max_num_batched_tokens,
    max_num_seqs=max_num_seqs,
    enable_prefix_caching=enable_prefix_caching,
Prefix caching should strictly be off (this is per the rules of the benchmark).
Hi, I pushed two new commits to remove it. Please take a look.
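The review point above can be made mechanical rather than relying on a CLI default. Below is a minimal sketch (helper and attribute names are illustrative, not the harness's actual code) of assembling the vLLM engine kwargs so that prefix caching is pinned off regardless of what the user passes, while unset flags fall through to vLLM's own defaults:

```python
def build_engine_kwargs(args):
    """Sketch: collect vLLM engine kwargs from parsed CLI args.

    `enable_prefix_caching` is pinned to False per the benchmark rules
    rather than read from the CLI. Names are illustrative only.
    """
    kwargs = {
        "gpu_memory_utilization": args.gpu_memory_utilization,
        "max_num_batched_tokens": args.max_num_batched_tokens,
        "max_num_seqs": args.max_num_seqs,
        # Benchmark rules require prefix caching to be disabled.
        "enable_prefix_caching": False,
    }
    # Drop unset (None) values so vLLM's built-in defaults still apply.
    return {k: v for k, v in kwargs.items() if v is not None}
```

Dropping `None` entries is what keeps the "defaults preserve existing behavior" guarantee: anything the user did not set never reaches the engine constructor.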
        default=256,
        help="Max concurrent sequences (default: 256)",
    )
    parser.add_argument(
I'd remove this from the CLI to avoid any confusion - the rules of the benchmark do not allow prefix-caching.
Hi, I pushed two new commits to remove it. Please take a look.
As requested, I removed prefix caching from main.py, since the benchmark requires it to always be False; this avoids any potential confusion.
Again, as mentioned, I removed prefix caching as a tunable parameter entirely, since the benchmark doesn't allow it.
Summary
Expose vLLM engine memory and scheduling parameters as CLI arguments in the
Llama 3.1-8B benchmark harness, removing the need to edit source code to tune them.
Problem
When running the vLLM-backed SUT, parameters like `gpu_memory_utilization`, `max_num_seqs`, and `max_num_batched_tokens` are hardcoded in `load_model()`. This causes two practical problems:
1. When the benchmark runs alongside other processes on a shared GPU, there is no way to reduce memory usage without modifying source code.
2. Users hitting OOM or scheduling issues have no obvious way to know which parameters exist or what values to try. The only option is to read the source, edit it, and re-run.
This is especially painful during bring-up on new hardware where the right
memory configuration is not known in advance.
Solution
Add 7 CLI flags to `main.py` and thread them through to the `SUT` and `SUTServer` constructors and `load_model()` calls:

- `--gpu-memory-utilization` -- reduce if hitting OOM (default: 0.90)
- `--max-num-batched-tokens` -- cap total tokens scheduled per step
- `--max-num-seqs` -- limit concurrent request slots
- `--block-size` -- KV cache paging granularity
- `--enforce-eager` / `--no-enforce-eager`
- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`
- `--max-model-len` -- cap the KV cache context window

Backward Compatibility
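The flag wiring above can be sketched with stdlib `argparse`. This is a hedged illustration of the pattern, not the PR's exact code: `None` defaults mean "not set, defer to vLLM", and `argparse.BooleanOptionalAction` (Python 3.9+) generates the paired `--enforce-eager` / `--no-enforce-eager` forms automatically:

```python
import argparse


def make_parser():
    # Sketch of the CLI surface described above; None defaults mean
    # "not set", so vLLM's built-in defaults remain in effect.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu-memory-utilization", type=float, default=None,
                        help="Fraction of GPU memory vLLM may use; reduce if hitting OOM")
    parser.add_argument("--max-num-batched-tokens", type=int, default=None,
                        help="Cap on total tokens scheduled per step")
    parser.add_argument("--max-num-seqs", type=int, default=None,
                        help="Limit on concurrent request slots")
    parser.add_argument("--block-size", type=int, default=None,
                        help="KV cache paging granularity")
    parser.add_argument("--max-model-len", type=int, default=None,
                        help="Cap on the KV cache context window")
    # BooleanOptionalAction generates the --no-* variants for free.
    parser.add_argument("--enforce-eager", action=argparse.BooleanOptionalAction,
                        default=None)
    parser.add_argument("--enable-chunked-prefill", action=argparse.BooleanOptionalAction,
                        default=None)
    return parser
```

The parsed namespace can then be forwarded to the `SUT` / `SUTServer` constructors, with `None` values filtered out before reaching the engine.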
All flags default to vLLM's own defaults. Existing run scripts and `user.conf` setups are completely unaffected.

Example
Reducing memory pressure on a smaller GPU:
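A plausible invocation using the flags introduced above (the values and any other `main.py` arguments are illustrative; tune them to the target GPU):

```shell
python main.py \
    --gpu-memory-utilization 0.80 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 2048 \
    --max-model-len 4096
```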