[Router] Add prefix_min_match_length threshold to PrefixAwareRouter #959
ywc668 wants to merge 1 commit into
Code Review
This pull request introduces a minimum prefix match length threshold for prefix-aware routing. A new CLI argument, --prefix-min-match-length, allows the router to fall back to QPS-based routing if the longest prefix match is shorter than the specified value. The PrefixAwareRouter and its initialization logic have been updated to support this parameter, and new tests verify the fallback behavior. Feedback suggests clarifying the help text for the new argument, as the match length is calculated in chunks (defaulting to 128 characters), which may affect the expected precision of the threshold.
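To make the precision concern concrete, here is a minimal sketch of how a chunk-based match length behaves. It assumes the match length advances in whole chunks of 128 characters and that a trailing partial chunk does not count; `chunked_match_length` is a hypothetical helper for illustration, not the router's actual code.

```python
CHUNK_SIZE = 128  # assumed default chunk size, per the review comment


def chunked_match_length(cached: str, prompt: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Hypothetical helper: length of the shared prefix, counted in whole chunks."""
    matched = 0
    # Compare the two strings one full chunk at a time.
    while matched + chunk_size <= min(len(cached), len(prompt)):
        if cached[matched:matched + chunk_size] != prompt[matched:matched + chunk_size]:
            break
        matched += chunk_size
    return matched


shared = "s" * 300
# 300 shared characters only count as two full chunks.
print(chunked_match_length(shared + "A" * 100, shared + "B" * 100))  # 256
```

Under this assumption, any threshold from 129 to 256 behaves identically, so the help text could say that the value is effectively rounded up to a multiple of the chunk size.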
Add a configurable --prefix-min-match-length option for the prefixaware routing logic. When the longest prefix match is shorter than this threshold, the request falls back to QPS-based routing instead of using the matched endpoint, mitigating load hotspots caused by long shared prefixes (e.g. common system prompts). Defaults to 0, which preserves the original behavior.
Partially addresses vllm-project#957.
Signed-off-by: Max Li <hitliqiwei@gmail.com>
Force-pushed from 703e1c8 to 29fa890.
@ywc668 Thank you very much for your work. We may also need to update the Helm configuration to match, for example https://github.com/vllm-project/production-stack/blob/main/helm/templates/deployment-router.yaml, adding the corresponding parameter and its description.
Summary
Adds a configurable --prefix-min-match-length option to the prefix-aware
routing logic. When the longest prefix match for an incoming request is
shorter than this threshold, the request falls back to QPS-based routing
instead of being pinned to the matched endpoint.
Partially addresses #957.
Motivation
With prefixaware routing, requests sharing a long common prefix (e.g. a
large shared system prompt) are all routed to the same endpoint. This is
the intended behavior for KV-cache reuse, but it can create load hotspots:
a popular prefix concentrates traffic on a single engine while others sit
idle. This option lets operators require a minimum match length before
the prefix affinity kicks in, so short/incidental matches don't override
load balancing.
Changes
- parser.py: new --prefix-min-match-length arg (int, default 0).
- routing_logic.py: PrefixAwareRouter.__init__ now takes prefix_min_match_length; route_request falls back to _qps_routing when match_length < prefix_min_match_length.
- app.py / initialize_routing_logic: thread the new arg through.
- New tests in src/tests/test_prefixaware_router.py covering the fallback, the matched-endpoint, and the default-behavior paths.
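The changes listed above can be sketched roughly as follows. This is a simplified stand-in, not the PR's code: `SketchPrefixRouter`, its dict-based prefix store, and `_longest_match` are hypothetical, and the real router uses a hash trie and its own QPS statistics. Only the flag name, the default of 0, and the `match_length < prefix_min_match_length` check come from the description.

```python
import argparse
import asyncio
import random


class SketchPrefixRouter:
    """Illustrative stand-in for PrefixAwareRouter; not the actual implementation."""

    def __init__(self, prefix_min_match_length: int = 0):
        self.prefix_min_match_length = prefix_min_match_length
        self.cache = {}  # hypothetical prefix store: prompt -> endpoint

    def _longest_match(self, prompt):
        # Hypothetical lookup: longest cached prompt that prefixes the request.
        best_len, best_ep = 0, None
        for cached, ep in self.cache.items():
            if prompt.startswith(cached) and len(cached) > best_len:
                best_len, best_ep = len(cached), ep
        return best_len, best_ep

    def _qps_routing(self, endpoints, qps):
        # Pick the endpoint with the lowest observed QPS.
        return min(endpoints, key=lambda ep: qps.get(ep, 0.0))

    async def route_request(self, endpoints, qps, prompt):
        match_length, matched = self._longest_match(prompt)
        if match_length < self.prefix_min_match_length:
            # Below threshold: fall back to QPS routing, skip the prefix store.
            return self._qps_routing(endpoints, qps)
        selected = matched if matched in endpoints else random.choice(endpoints)
        self.cache[prompt] = selected
        return selected


parser = argparse.ArgumentParser()
parser.add_argument("--prefix-min-match-length", type=int, default=0)
args = parser.parse_args(["--prefix-min-match-length", "16"])

router = SketchPrefixRouter(args.prefix_min_match_length)
router.cache["short"] = "http://engine-a"  # 5-character cached prefix
qps = {"http://engine-a": 9.0, "http://engine-b": 1.0}
chosen = asyncio.run(router.route_request(list(qps), qps, "short prompt"))
print(chosen)  # http://engine-b
```

With the threshold at 16, the 5-character match is ignored and the less loaded endpoint wins. With the default of 0 the comparison `match_length < 0` is never true, so the same request would go to the matched endpoint.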
Design decisions
A few deliberate choices worth surfacing for review:
- A default of 0 preserves existing behavior. With the default, even a match_length of 0 (no prefix match at all) still uses the matched endpoint and random selection, identical to the behavior before this change. The threshold is strictly opt-in.
- No trie insert on fallback. When a request falls back to QPS routing, hashtrie.insert is intentionally skipped, so a below-threshold request doesn't pollute prefix state for future requests. A test asserts insert is not awaited on this path.
- Scope is limited to prefix-min-match-length. This PR intentionally does not touch other prefix-aware routing knobs or the QPS routing logic itself. Broader hotspot-mitigation strategies are left for follow-up so this change stays small and reviewable.
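The no-insert decision above can be checked with unittest.mock.AsyncMock. The sketch below is a stand-in under stated assumptions, not the PR's test: `route_with_threshold` is a hypothetical function, and the trie is assumed to expose async `longest_prefix_match` (returning a match length and a set of endpoints) and `insert` methods.

```python
import asyncio
from unittest.mock import AsyncMock


async def route_with_threshold(trie, endpoints, qps, prompt, min_match_length):
    """Hypothetical routing function mirroring the fallback described above."""
    match_length, matched = await trie.longest_prefix_match(prompt)
    if match_length < min_match_length:
        # Fallback path: choose by lowest QPS and leave the trie untouched.
        return min(endpoints, key=lambda ep: qps[ep])
    selected = sorted(matched)[0]
    await trie.insert(prompt, selected)
    return selected


async def check_fallback_skips_insert():
    trie = AsyncMock()
    # Assumed return shape: (match_length, matched_endpoints).
    trie.longest_prefix_match.return_value = (128, {"http://engine-a"})
    qps = {"http://engine-a": 9.0, "http://engine-b": 1.0}
    chosen = await route_with_threshold(trie, list(qps), qps, "prompt", 256)
    # The below-threshold request must not have written to the trie.
    trie.insert.assert_not_awaited()
    return chosen


print(asyncio.run(check_fallback_skips_insert()))  # http://engine-b
```

Asserting on the mock's await count, rather than on trie contents, keeps the check independent of how the trie stores prefixes.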
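Testing
src/tests/test_prefixaware_router.py adds three async tests:
- a below-threshold match falls back to QPS routing and does not write to the trie;
- a match at or above the threshold uses the matched endpoint;
- the default (0) uses the matched endpoint even with no match.
PR Checklist
- Commits are signed off (Signed-off-by).
- PR title is prefixed with [Router].
- ... the design decisions above.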