[Router] Add prefix_min_match_length threshold to PrefixAwareRouter #959

Open
ywc668 wants to merge 1 commit into vllm-project:main from ywc668:router-prefix-min-match-length

@ywc668 commented May 23, 2026

Summary

Adds a configurable --prefix-min-match-length option to the prefix-aware
routing logic. When the longest prefix match for an incoming request is
shorter than this threshold, the request falls back to QPS-based routing
instead of being pinned to the matched endpoint.

Partially addresses #957.

Motivation

With prefixaware routing, requests sharing a long common prefix (e.g. a
large shared system prompt) are all routed to the same endpoint. This is
the intended behavior for KV-cache reuse, but it can create load hotspots:
a popular prefix concentrates traffic on a single engine while others sit
idle. This option lets operators require a minimum match length before
the prefix affinity kicks in, so short/incidental matches don't override
load balancing.

Changes

  • parser.py: new --prefix-min-match-length arg (int, default 0).
  • routing_logic.py: PrefixAwareRouter.__init__ now takes
    prefix_min_match_length; route_request falls back to _qps_routing
    when match_length < prefix_min_match_length.
  • app.py / initialize_routing_logic: thread the new arg through.
  • New test file src/tests/test_prefixaware_router.py covering the
    fallback, the matched-endpoint, and the default-behavior paths.
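For reviewers who want to see the shape of the new argument, here is a minimal sketch of how an int option with default 0 could be registered with argparse. The `build_parser` helper and the help text are assumptions for illustration; only the flag name, type, and default come from the change list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the router's real parser; only the new flag is shown.
    parser = argparse.ArgumentParser(description="router argument sketch")
    parser.add_argument(
        "--prefix-min-match-length",
        type=int,
        default=0,  # 0 keeps the pre-existing behavior, per the PR description
        help=(
            "Minimum longest-prefix match length required before prefix "
            "affinity applies; shorter matches fall back to QPS routing."
        ),
    )
    return parser


# argparse maps the dashed flag to the attribute prefix_min_match_length.
default_args = build_parser().parse_args([])
custom_args = build_parser().parse_args(["--prefix-min-match-length", "256"])
print(default_args.prefix_min_match_length, custom_args.prefix_min_match_length)
```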

Design decisions

A few deliberate choices worth surfacing for review:

  1. Default 0 preserves existing behavior. With the default, even a
    match_length of 0 (no prefix match at all) still takes the
    matched-endpoint path with random selection, identical to the
    behavior before this change. The threshold is strictly opt-in.
  2. The fallback path does not write to the trie. When a request falls
    back to QPS routing, hashtrie.insert is intentionally skipped, so a
    below-threshold request doesn't pollute prefix state for future
    requests. A test asserts insert is not awaited on this path.
  3. Scope is limited to prefix-min-match-length. This PR intentionally
    does not touch other prefix-aware routing knobs or the QPS routing logic
    itself. Broader hotspot-mitigation strategies are left for follow-up so
    this change stays small and reviewable.
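The first two decisions can be illustrated with a small, self-contained sketch. This is not the code in routing_logic.py: `ToyTrie` and `SketchRouter` are hypothetical stand-ins, the trie does plain string-prefix lookup instead of hashing, and QPS is passed in as a dict. It only shows the control flow described above: fall back to the lowest-QPS endpoint and skip the insert when the match is below the threshold, otherwise pick among the matched endpoints and insert.

```python
import asyncio
import random
from typing import Dict, List, Set, Tuple


class ToyTrie:
    """Hypothetical stand-in for the router's hash trie (plain prefix lookup)."""

    def __init__(self) -> None:
        self.entries: Dict[str, Set[str]] = {}
        self.insert_calls = 0

    async def longest_prefix_match(
        self, prompt: str, available: Set[str]
    ) -> Tuple[int, Set[str]]:
        # With no match, every available endpoint is a candidate.
        best_len, best_eps = 0, set(available)
        for prefix, eps in self.entries.items():
            live = eps & available
            if live and prompt.startswith(prefix) and len(prefix) > best_len:
                best_len, best_eps = len(prefix), live
        return best_len, best_eps

    async def insert(self, prompt: str, endpoint: str) -> None:
        self.insert_calls += 1
        self.entries.setdefault(prompt, set()).add(endpoint)


class SketchRouter:
    """Simplified model of the thresholded routing decision."""

    def __init__(self, trie: ToyTrie, prefix_min_match_length: int = 0) -> None:
        self.trie = trie
        self.prefix_min_match_length = prefix_min_match_length

    async def route_request(
        self, endpoints: List[str], qps: Dict[str, float], prompt: str
    ) -> str:
        match_length, matched = await self.trie.longest_prefix_match(
            prompt, set(endpoints)
        )
        if match_length < self.prefix_min_match_length:
            # Below threshold: lowest-QPS endpoint, and no trie insert.
            return min(endpoints, key=lambda url: qps.get(url, 0.0))
        selected = random.choice(sorted(matched))
        await self.trie.insert(prompt, selected)
        return selected


async def demo() -> Tuple[str, int, str, int]:
    endpoints = ["http://engine-a", "http://engine-b"]
    qps = {"http://engine-a": 9.0, "http://engine-b": 1.0}
    prompt = "You are a helpful assistant."

    trie = ToyTrie()
    await trie.insert("You are a helpful", "http://engine-a")  # 17-char prefix
    trie.insert_calls = 0  # count only inserts made by the routers below

    thresholded = SketchRouter(trie, prefix_min_match_length=64)
    fallback = await thresholded.route_request(endpoints, qps, prompt)
    inserts_after_fallback = trie.insert_calls

    default_router = SketchRouter(trie)  # threshold 0: affinity always applies
    pinned = await default_router.route_request(endpoints, qps, prompt)
    return fallback, inserts_after_fallback, pinned, trie.insert_calls


fallback, inserts_after_fallback, pinned, total_inserts = asyncio.run(demo())
```

In this sketch a 17-character match is below the threshold of 64, so the request goes to the less loaded engine and the trie is left untouched; with the default threshold the same request is pinned to the engine that already holds the prefix.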

Testing

src/tests/test_prefixaware_router.py adds three async tests:

  • below-threshold match falls back to QPS and picks the lowest-QPS engine,
    and does not write to the trie;
  • above-threshold match uses the matched endpoint and writes to the trie;
  • default (threshold 0) uses the matched endpoint even with no match.
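A rough sketch of how the first two cases might be asserted with `unittest.mock.AsyncMock` follows. It is not the contents of the new test file: `route_with_threshold` is a hypothetical, simplified copy of the decision under test (it picks the first matched endpoint rather than a random one so the outcome is deterministic), and `asyncio.run` is used instead of a test framework to keep the example standalone.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

ENDPOINTS = ["http://engine-a", "http://engine-b"]
QPS = {"http://engine-a": 9.0, "http://engine-b": 1.0}


async def route_with_threshold(hashtrie, endpoints, qps, prompt, min_len):
    """Hypothetical, simplified version of the routing decision under test."""
    match_length, matched = await hashtrie.longest_prefix_match(
        prompt, set(endpoints)
    )
    if match_length < min_len:
        return min(endpoints, key=lambda url: qps[url])
    selected = sorted(matched)[0]  # deterministic stand-in for random choice
    await hashtrie.insert(prompt, selected)
    return selected


def make_trie(match_length, matched):
    # Both trie methods are async in this sketch, so both are AsyncMock objects.
    return SimpleNamespace(
        longest_prefix_match=AsyncMock(return_value=(match_length, matched)),
        insert=AsyncMock(),
    )


async def check_fallback():
    trie = make_trie(128, {"http://engine-a"})
    chosen = await route_with_threshold(trie, ENDPOINTS, QPS, "prompt", 256)
    trie.insert.assert_not_awaited()  # fallback path must not write to the trie
    return chosen


async def check_match():
    trie = make_trie(512, {"http://engine-a"})
    chosen = await route_with_threshold(trie, ENDPOINTS, QPS, "prompt", 256)
    trie.insert.assert_awaited_once_with("prompt", "http://engine-a")
    return chosen


fallback_choice = asyncio.run(check_fallback())
match_choice = asyncio.run(check_match())
```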

PR Checklist

  • DCO sign-off present on the commit (Signed-off-by).
  • PR title is prefixed with [Router].
  • New tests added for the new behavior.
  • Opened as draft — pending self-review / maintainer feedback on
    the design decisions above.

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a minimum prefix match length threshold for prefix-aware routing. A new CLI argument, --prefix-min-match-length, allows the router to fall back to QPS-based routing if the longest prefix match is shorter than the specified value. The PrefixAwareRouter and its initialization logic have been updated to support this parameter, and new tests verify the fallback behavior. Feedback suggests clarifying the help text for the new argument, as the match length is calculated in chunks (defaulting to 128 characters), which may affect the expected precision of the threshold.
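To make the precision point concrete, here is a small arithmetic sketch. It assumes the match length is measured in characters and only grows in whole chunks of 128; the review states the chunking and the 128-character default, but the exact unit and rounding are assumptions here, and both helper names are hypothetical.

```python
import math

CHUNK_SIZE = 128  # default chunk size in characters, as quoted in the review


def reported_match_length(shared_chars: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Assumed model: only fully matched chunks count toward the match length."""
    return (shared_chars // chunk_size) * chunk_size


def effective_threshold(threshold: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Smallest reported match length that is not below the threshold."""
    return math.ceil(threshold / chunk_size) * chunk_size


# Under this model a threshold of 200 behaves like 256: a 255-character
# shared prefix reports 128 and falls back, while 256 characters passes.
threshold = 200
falls_back_at_255 = reported_match_length(255) < threshold
falls_back_at_256 = reported_match_length(256) < threshold
```

If the model holds, thresholds are effectively rounded up to the next multiple of the chunk size, which is worth stating in the help text.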

Comment thread src/vllm_router/parsers/parser.py

Add a configurable --prefix-min-match-length option for the
prefixaware routing logic. When the longest prefix match is shorter
than this threshold, the request falls back to QPS-based routing
instead of using the matched endpoint, mitigating load hotspots
caused by long shared prefixes (e.g. common system prompts).

Defaults to 0, which preserves the original behavior.

Partially addresses vllm-project#957.

Signed-off-by: Max Li <hitliqiwei@gmail.com>

@ywc668 force-pushed the router-prefix-min-match-length branch from 703e1c8 to 29fa890 on May 23, 2026 21:12
@ywc668 marked this pull request as ready for review on May 23, 2026 21:14

@AndrewTsao commented

@ywc668 Thank you very much for your work. We may also need to update the Helm configuration in step with this change, for example by adding the corresponding parameter and its description to https://github.com/vllm-project/production-stack/blob/main/helm/templates/deployment-router.yaml.
