[Router] Add prefix_min_match_length threshold to PrefixAwareRouter by ywc668 · Pull Request #959 · vllm-project/production-stack

ywc668 · 2026-05-23T19:10:24Z

Summary

Adds a configurable --prefix-min-match-length option to the prefix-aware
routing logic. When the longest prefix match for an incoming request is
shorter than this threshold, the request falls back to QPS-based routing
instead of being pinned to the matched endpoint.

Partially addresses #957.

Motivation

With prefixaware routing, requests sharing a long common prefix (e.g. a
large shared system prompt) are all routed to the same endpoint. This is
the intended behavior for KV-cache reuse, but it can create load hotspots:
a popular prefix concentrates traffic on a single engine while others sit
idle. This option lets operators require a minimum match length before
the prefix affinity kicks in, so short/incidental matches don't override
load balancing.

Changes

parser.py: new --prefix-min-match-length arg (int, default 0).
routing_logic.py: PrefixAwareRouter.__init__ now takes
prefix_min_match_length; route_request falls back to _qps_routing
when match_length < prefix_min_match_length.
app.py / initialize_routing_logic: thread the new arg through.
New test file src/tests/test_prefixaware_router.py covering the
fallback, the matched-endpoint, and the default-behavior paths.
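
For reviewers who want to see the shape of the new argument, here is a minimal sketch. It assumes parser.py is built on argparse; the flag name, type, and default come from the list above, while build_parser and the help text are hypothetical and not taken from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: the real parser.py defines many more arguments.
    parser = argparse.ArgumentParser(description="router argument sketch")
    parser.add_argument(
        "--prefix-min-match-length",
        type=int,
        default=0,  # 0 keeps the pre-existing routing behavior
        help=(
            "Minimum longest-prefix match length required before prefix "
            "affinity is used; shorter matches fall back to QPS routing."
        ),
    )
    return parser
```

argparse exposes the value as args.prefix_min_match_length, which matches the keyword that PrefixAwareRouter.__init__ is described as taking.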

Design decisions

A few deliberate choices worth surfacing for review:

Default 0 preserves existing behavior. A match length is never
negative, so with the default the check
match_length < prefix_min_match_length never triggers: even a
match_length of 0 (no prefix match at all) still takes the
matched-endpoint path with random selection, identical to the behavior
before this change. The threshold is strictly opt-in.

The fallback path does not write to the trie. When a request falls
back to QPS routing, hashtrie.insert is intentionally skipped, so a
below-threshold request doesn't pollute prefix state for future
requests. A test asserts insert is not awaited on this path.

Scope is limited to prefix-min-match-length. This PR intentionally
does not touch other prefix-aware routing knobs or the QPS routing logic
itself. Broader hotspot-mitigation strategies are left for follow-up so
this change stays small and reviewable.
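
The first two decisions can be sketched as a toy model. This is not the code in routing_logic.py: ToyTrie, route, and the character-level matching are hypothetical stand-ins, and only the threshold check, the skipped insert, and the random selection among matched endpoints follow the description above.

```python
import os
import random
from typing import Dict, List, Set, Tuple


class ToyTrie:
    """Hypothetical stand-in for the router's hash trie."""

    def __init__(self) -> None:
        self.entries: Dict[str, Set[str]] = {}

    async def longest_prefix_match(self, prompt: str) -> Tuple[int, Set[str]]:
        best_len, best_eps = 0, set()
        for seen, eps in self.entries.items():
            # Character-wise common prefix of the stored and incoming prompt.
            n = len(os.path.commonprefix([seen, prompt]))
            if n > best_len:
                best_len, best_eps = n, set(eps)
        return best_len, best_eps

    async def insert(self, prompt: str, endpoint: str) -> None:
        self.entries.setdefault(prompt, set()).add(endpoint)


async def route(
    trie: ToyTrie,
    prompt: str,
    endpoints: List[str],
    qps: Dict[str, float],
    prefix_min_match_length: int = 0,
) -> str:
    match_length, matched = await trie.longest_prefix_match(prompt)
    if match_length < prefix_min_match_length:
        # Fallback: pick the least-loaded endpoint and skip trie.insert,
        # so a below-threshold request leaves prefix state untouched.
        return min(endpoints, key=lambda e: qps.get(e, 0.0))
    # With threshold 0 this branch is always taken, even when nothing matched.
    candidates = sorted(matched) or endpoints
    chosen = random.choice(candidates)
    await trie.insert(prompt, chosen)
    return chosen
```

Because the comparison is a strict less-than, a match exactly equal to the threshold keeps prefix affinity.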

Testing

src/tests/test_prefixaware_router.py adds three async tests:

below-threshold match falls back to QPS and picks the lowest-QPS engine,
and does not write to the trie;
above-threshold match uses the matched endpoint and writes to the trie;
default (threshold 0) uses the matched endpoint even with no match.
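
As a rough illustration of how the first two cases can be asserted, the sketch below uses unittest.mock.AsyncMock in place of the trie. It is not the contents of src/tests/test_prefixaware_router.py: route_request here is a reduced, hypothetical stand-in that picks the first matched endpoint instead of a random one so the checks are deterministic, and the engine names are made up.

```python
import asyncio
from unittest.mock import AsyncMock


async def route_request(trie, prompt, qps_by_endpoint, prefix_min_match_length=0):
    """Hypothetical, reduced stand-in for the router method under test."""
    match_length, matched = await trie.longest_prefix_match(prompt)
    if match_length < prefix_min_match_length:
        # Below threshold: lowest-QPS engine, no trie write.
        return min(qps_by_endpoint, key=qps_by_endpoint.get)
    chosen = sorted(matched)[0] if matched else sorted(qps_by_endpoint)[0]
    await trie.insert(prompt, chosen)
    return chosen


async def check_fallback_skips_insert():
    trie = AsyncMock()
    trie.longest_prefix_match.return_value = (3, {"engine-a"})
    qps = {"engine-a": 9.0, "engine-b": 2.0}
    chosen = await route_request(trie, "hi", qps, prefix_min_match_length=128)
    assert chosen == "engine-b"
    trie.insert.assert_not_awaited()


async def check_match_writes_to_trie():
    trie = AsyncMock()
    trie.longest_prefix_match.return_value = (256, {"engine-a"})
    qps = {"engine-a": 9.0, "engine-b": 2.0}
    chosen = await route_request(trie, "hi", qps, prefix_min_match_length=128)
    assert chosen == "engine-a"
    trie.insert.assert_awaited_once_with("hi", "engine-a")
```

Each check can be run with asyncio.run(...); the real tests may rely on a pytest async plugin instead.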

PR Checklist

DCO sign-off present on the commit (Signed-off-by).
PR title is prefixed with [Router].
New tests added for the new behavior.
Opened as draft — pending self-review / maintainer feedback on
the design decisions above.

gemini-code-assist

Code Review

This pull request introduces a minimum prefix match length threshold for prefix-aware routing. A new CLI argument, --prefix-min-match-length, allows the router to fall back to QPS-based routing if the longest prefix match is shorter than the specified value. The PrefixAwareRouter and its initialization logic have been updated to support this parameter, and new tests verify the fallback behavior. Feedback suggests clarifying the help text for the new argument, as the match length is calculated in chunks (defaulting to 128 characters), which may affect the expected precision of the threshold.
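
To make the granularity point concrete, here is a small sketch of chunk-based matching. It assumes that only complete, identical chunks count toward the match length; that counting rule and the function name are assumptions for illustration, not the router's actual implementation. Only the 128-character default comes from the review above.

```python
def chunked_match_length(cached: str, prompt: str, chunk_size: int = 128) -> int:
    """Length of the shared prefix, counted in whole chunks (assumed rule)."""
    matched = 0
    shortest = min(len(cached), len(prompt))
    # Walk over complete chunks only; a trailing partial chunk is ignored.
    for start in range(0, shortest - chunk_size + 1, chunk_size):
        end = start + chunk_size
        if cached[start:end] != prompt[start:end]:
            break
        matched += chunk_size
    return matched
```

Under this assumption, two prompts that share 200 characters report a match length of 128, so a threshold of 150 would trigger the QPS fallback even though more than 150 characters are shared. Any threshold between two multiples of the chunk size behaves like the next multiple.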

Add a configurable --prefix-min-match-length option for the prefixaware routing logic. When the longest prefix match is shorter than this threshold, the request falls back to QPS-based routing instead of using the matched endpoint, mitigating load hotspots caused by long shared prefixes (e.g. common system prompts). Defaults to 0, which preserves the original behavior.

Partially addresses vllm-project#957.

Signed-off-by: Max Li <hitliqiwei@gmail.com>

AndrewTsao · 2026-05-25T02:23:46Z

@ywc668 Thank you very much for your work. We may also need to update the Helm configuration to match, for example by adding the corresponding parameter and its description to https://github.com/vllm-project/production-stack/blob/main/helm/templates/deployment-router.yaml.

gemini-code-assist Bot reviewed May 23, 2026

Comment thread src/vllm_router/parsers/parser.py

ywc668 force-pushed the router-prefix-min-match-length branch from 703e1c8 to 29fa890 on May 23, 2026 21:12

ywc668 marked this pull request as ready for review May 23, 2026 21:14

ywc668 wants to merge 1 commit into vllm-project:main from ywc668:router-prefix-min-match-length
