feat(llama3.1-8b): expose vLLM engine params as CLI arguments #2553
ssaketh-ch wants to merge 4 commits into mlcommons:master
Conversation
Hardcoded vLLM memory and scheduling parameters are now configurable via CLI flags, with defaults that preserve existing behavior. No change to model outputs or sampling -- only memory allocation and scheduling.

New flags:
- `--gpu-memory-utilization`
- `--max-num-batched-tokens`
- `--max-num-seqs`
- `--enable-prefix-caching` / `--no-enable-prefix-caching`
- `--block-size`
- `--enforce-eager` / `--no-enforce-eager`
- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`
- `--max-model-len`
WG: Assigned to Thomas to review it.
attafosu left a comment:

Please address the comment on prefix-caching and it should be good to go.
    gpu_memory_utilization=gpu_memory_utilization,
    max_num_batched_tokens=max_num_batched_tokens,
    max_num_seqs=max_num_seqs,
    enable_prefix_caching=enable_prefix_caching,
Prefix caching should strictly be off (this is per the rules of the benchmark).
Hi, I pushed two new commits to remove it. Please take a look.
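The review point above can be made mechanical rather than relying on a CLI default. Below is a minimal sketch (helper and attribute names are illustrative, not the harness's actual code) of assembling the vLLM engine kwargs so that prefix caching is pinned off regardless of what the user passes, while unset flags fall through to vLLM's own defaults:

```python
def build_engine_kwargs(args):
    """Sketch: collect vLLM engine kwargs from parsed CLI args.

    `enable_prefix_caching` is pinned to False per the benchmark rules
    rather than read from the CLI. Names are illustrative only.
    """
    kwargs = {
        "gpu_memory_utilization": args.gpu_memory_utilization,
        "max_num_batched_tokens": args.max_num_batched_tokens,
        "max_num_seqs": args.max_num_seqs,
        # Benchmark rules require prefix caching to be disabled.
        "enable_prefix_caching": False,
    }
    # Drop unset (None) values so vLLM's built-in defaults still apply.
    return {k: v for k, v in kwargs.items() if v is not None}
```

Dropping `None` entries is what keeps the "defaults preserve existing behavior" guarantee: anything the user did not set never reaches the engine constructor.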
        default=256,
        help="Max concurrent sequences (default: 256)",
    )
    parser.add_argument(
I'd remove this from the CLI to avoid any confusion - the rules of the benchmark do not allow prefix-caching.
Hi, I pushed two new commits to remove it. Please take a look.
As requested, I removed prefix caching from main.py, since the benchmark requires it to always be False; this avoids any potential confusion.
Again, as mentioned, I removed prefix caching as a tunable parameter entirely, since the benchmark doesn't allow it.
Summary
Expose vLLM engine memory and scheduling parameters as CLI arguments in the
Llama 3.1-8B benchmark harness, removing the need to edit source code to tune them.
Problem
When running the vLLM-backed SUT, parameters like `gpu_memory_utilization`, `max_num_seqs`, and `max_num_batched_tokens` are hardcoded in `load_model()`. This causes two practical problems:
1. When the benchmark runs alongside other processes on a shared GPU, there is no way to reduce memory usage without modifying source code.
2. Users hitting OOM or scheduling issues have no obvious way to know which parameters exist or what values to try. The only option is to read the source, edit it, and re-run.
This is especially painful during bring-up on new hardware where the right
memory configuration is not known in advance.
Solution
Add 7 CLI flags to `main.py` and thread them through to the `SUT` and `SUTServer` constructors and `load_model()` calls:

- `--gpu-memory-utilization` -- reduce if hitting OOM (default: 0.90)
- `--max-num-batched-tokens` -- cap total tokens scheduled per step
- `--max-num-seqs` -- limit concurrent request slots
- `--block-size` -- KV cache paging granularity
- `--enforce-eager` / `--no-enforce-eager`
- `--enable-chunked-prefill` / `--no-enable-chunked-prefill`
- `--max-model-len` -- cap the KV cache context window

Backward Compatibility
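The flag wiring above can be sketched with stdlib `argparse`. This is a hedged illustration of the pattern, not the PR's exact code: `None` defaults mean "not set, defer to vLLM", and `argparse.BooleanOptionalAction` (Python 3.9+) generates the paired `--enforce-eager` / `--no-enforce-eager` forms automatically:

```python
import argparse


def make_parser():
    # Sketch of the CLI surface described above; None defaults mean
    # "not set", so vLLM's built-in defaults remain in effect.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu-memory-utilization", type=float, default=None,
                        help="Fraction of GPU memory vLLM may use; reduce if hitting OOM")
    parser.add_argument("--max-num-batched-tokens", type=int, default=None,
                        help="Cap on total tokens scheduled per step")
    parser.add_argument("--max-num-seqs", type=int, default=None,
                        help="Limit on concurrent request slots")
    parser.add_argument("--block-size", type=int, default=None,
                        help="KV cache paging granularity")
    parser.add_argument("--max-model-len", type=int, default=None,
                        help="Cap on the KV cache context window")
    # BooleanOptionalAction generates the --no-* variants for free.
    parser.add_argument("--enforce-eager", action=argparse.BooleanOptionalAction,
                        default=None)
    parser.add_argument("--enable-chunked-prefill", action=argparse.BooleanOptionalAction,
                        default=None)
    return parser
```

The parsed namespace can then be forwarded to the `SUT` / `SUTServer` constructors, with `None` values filtered out before reaching the engine.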
All flags default to vLLM's own defaults. Existing run scripts and `user.conf` setups are completely unaffected.

Example
Reducing memory pressure on a smaller GPU:
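A plausible invocation using the flags introduced above (the values and any other `main.py` arguments are illustrative; tune them to the target GPU):

```shell
python main.py \
    --gpu-memory-utilization 0.80 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 2048 \
    --max-model-len 4096
```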