
feat(llama3.1-8b): expose vLLM engine params as CLI arguments#2553

Open
ssaketh-ch wants to merge 4 commits into mlcommons:master from ssaketh-ch:feat/vllm-configurable-engine-params

Conversation

@ssaketh-ch ssaketh-ch commented Feb 26, 2026

Summary

Expose vLLM engine memory and scheduling parameters as CLI arguments in the
Llama 3.1-8B benchmark harness, removing the need to edit source code to tune them.

Problem

When running the vLLM-backed SUT, parameters like gpu_memory_utilization,
max_num_seqs, and max_num_batched_tokens are hardcoded in load_model().
This causes two practical problems:

  1. Out-of-memory (OOM) crashes -- on GPUs with less VRAM, or when running
    alongside other processes, there is no way to reduce memory usage without
    modifying source code.
  2. No visibility into tunable knobs -- users hitting performance or memory
    issues have no obvious way to know which parameters exist or what values to
    try. The only option is to read the source, edit it, and re-run.

This is especially painful during bring-up on new hardware where the right
memory configuration is not known in advance.

Solution

Add 7 CLI flags to main.py and thread them through to the SUT and
SUTServer constructors and load_model() calls:

  • --gpu-memory-utilization -- reduce if hitting OOM (default: 0.90)
  • --max-num-batched-tokens -- cap total tokens scheduled per step
  • --max-num-seqs -- limit concurrent request slots
  • --block-size -- KV cache paging granularity
  • --enforce-eager / --no-enforce-eager -- skip CUDA graph capture (trades some speed for lower memory)
  • --enable-chunked-prefill / --no-enable-chunked-prefill -- split long prefills across scheduler steps
  • --max-model-len -- cap the KV cache context window
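A minimal sketch of how such flags can be declared in main.py. The helper name and exact defaults here are illustrative, not necessarily the PR's code; None means "fall back to vLLM's own default":

```python
import argparse

def add_vllm_engine_args(parser: argparse.ArgumentParser) -> None:
    # Memory and scheduling knobs threaded through to the vLLM engine.
    parser.add_argument("--gpu-memory-utilization", type=float, default=0.90,
                        help="Fraction of GPU VRAM vLLM may claim (reduce if hitting OOM)")
    parser.add_argument("--max-num-batched-tokens", type=int, default=None,
                        help="Cap on total tokens scheduled per engine step")
    parser.add_argument("--max-num-seqs", type=int, default=None,
                        help="Limit on concurrent request slots")
    parser.add_argument("--block-size", type=int, default=None,
                        help="KV cache paging granularity")
    # BooleanOptionalAction (Python 3.9+) generates the --no-* variants automatically.
    parser.add_argument("--enforce-eager", action=argparse.BooleanOptionalAction,
                        default=None, help="Skip CUDA graph capture (saves memory)")
    parser.add_argument("--enable-chunked-prefill", action=argparse.BooleanOptionalAction,
                        default=None, help="Split long prefills across scheduler steps")
    parser.add_argument("--max-model-len", type=int, default=None,
                        help="Cap the KV cache context window")

parser = argparse.ArgumentParser()
add_vllm_engine_args(parser)
args = parser.parse_args(["--gpu-memory-utilization", "0.80", "--no-enforce-eager"])
```

With this wiring, any flag left unset stays None and never overrides the engine's built-in default.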

Backward Compatibility

All flags default to vLLM's own defaults. Existing run scripts and
user.conf setups are completely unaffected.

Example

Reducing memory pressure on a smaller GPU:

python main.py --vllm --scenario Offline \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 128
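One way load_model() can thread these flags through while preserving vLLM's defaults is to drop unset values before constructing the engine. The function and variable names below are illustrative, not the PR's exact code:

```python
import types

def build_engine_kwargs(args) -> dict:
    """Collect only the flags the user actually set, so anything left as
    None falls back to vLLM's own default inside LLM(...)."""
    candidates = {
        "gpu_memory_utilization": args.gpu_memory_utilization,
        "max_num_batched_tokens": args.max_num_batched_tokens,
        "max_num_seqs": args.max_num_seqs,
        "block_size": args.block_size,
        "enforce_eager": args.enforce_eager,
        "enable_chunked_prefill": args.enable_chunked_prefill,
        "max_model_len": args.max_model_len,
    }
    return {k: v for k, v in candidates.items() if v is not None}

# Inside load_model() this would feed the engine constructor, e.g.:
#   from vllm import LLM
#   self.model = LLM(model=self.model_path, **build_engine_kwargs(args))

# Stand-in for parsed CLI args, mirroring the example invocation above:
args = types.SimpleNamespace(
    gpu_memory_utilization=0.80, max_num_batched_tokens=None,
    max_num_seqs=128, block_size=None, enforce_eager=None,
    enable_chunked_prefill=None, max_model_len=None)
kwargs = build_engine_kwargs(args)
```

Filtering out None keeps the "all flags default to vLLM's own defaults" guarantee without hardcoding those defaults in the harness.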

Hardcoded vLLM memory and scheduling parameters are now configurable
via CLI flags with defaults that preserve existing behavior. No change
to model outputs or sampling -- only memory allocation and scheduling.

New flags:
  --gpu-memory-utilization
  --max-num-batched-tokens
  --max-num-seqs
  --enable-prefix-caching / --no-enable-prefix-caching
  --block-size
  --enforce-eager / --no-enforce-eager
  --enable-chunked-prefill / --no-enable-chunked-prefill
  --max-model-len
@ssaketh-ch ssaketh-ch requested a review from a team as a code owner February 26, 2026 15:01
@github-actions
Contributor

MLCommons CLA bot:
Thank you very much for your submission, we really appreciate it. Before we can accept your contribution, we ask that you sign the MLCommons CLA (Apache 2). Please use this [Google form](https://forms.gle/Ew1KkBVpyeJDuRw67) to initiate authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA. If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact support@mlcommons.org.
0 out of 1 committers have signed the MLCommons CLA.
@ssaketh-ch
ssaketh-ch does not appear to be a GitHub user. A GitHub account is required once you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request

@hanyunfan
Contributor

WG: Assigned to Thomas to review it.

Contributor

@attafosu attafosu left a comment


Please address the comment on prefix-caching and this should be good to go.

Comment thread: language/llama3.1-8b/SUT_VLLM.py (Outdated)
gpu_memory_utilization=gpu_memory_utilization,
max_num_batched_tokens=max_num_batched_tokens,
max_num_seqs=max_num_seqs,
enable_prefix_caching=enable_prefix_caching,
Contributor


Prefix caching should strictly be off (this is per the rules of the benchmark)

Author


Hi,
I pushed two new commits to remove it. Please take a look.

Comment thread: language/llama3.1-8b/main.py (Outdated)
default=256,
help="Max concurrent sequences (default: 256)",
)
parser.add_argument(
Contributor


I'd remove this from the CLI to avoid any confusion -- the rules for the benchmark do not allow prefix-caching.

Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
I pushed two new commits to remove it. Please take a look.

As requested, I removed prefix caching from main.py, since the benchmark requires it to always be False; this avoids any potential confusion.
As mentioned, I removed prefix caching as a tunable parameter entirely, since the benchmark doesn't allow it.
@ssaketh-ch ssaketh-ch requested a review from attafosu April 1, 2026 08:05
Contributor

@attafosu attafosu left a comment


Please sign the CLA and we'll merge next week.



3 participants