[Model Runner V2] Spec decode rejection sampler greedy support#37238
Conversation
```python
    num_reqs, num_speculative_steps + 1, dtype=torch.int64
)
# [num_reqs]
rejected_steps = sampled.new_empty(num_reqs)
_probabilistic_rejection_sample_kernel[(num_reqs,)](
# [num_reqs]
rejected_pos = pos.new_empty(num_reqs)
```
I felt it made more sense to compute this in _probabilistic_rejection_kernel rather than _compute_residual_logits_kernel, so I moved it here. I also renamed it from residual_pos to rejected_pos.
Hi @TheEpicDolphin, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Code Review
This pull request adds support for greedy sampling (temperature=0) in the speculative decoding rejection sampler. The changes are well-structured, introducing new Triton kernels to handle greedy and probabilistic paths efficiently. The logic for rejection sampling and resampling in the greedy case is sound. I've found one potential issue in a newly added but currently unused kernel that should be addressed.
WoosukKwon left a comment
Thanks for the PR!
I think we can fuse more kernels to minimize the materialization of *_logits tensors, but we can probably follow up after this.
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
f188893 to
47f633e
Compare
…project#37238) Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Hi @TheEpicDolphin, what GPU was used in this test, and what vLLM settings regarding max_batched_tokens etc.? Also, have you done any tests with prefix caching enabled?
@geraldstanje1 I used an H200, and I didn't have the chance to test with max_batched_tokens or prefix caching. Also, please ignore the benchmark results from this PR; there was a bug skewing the acceptance rates that I fixed later. This PR has the most recent benchmark results for rejection sampling :)
@TheEpicDolphin can you show how you run those benchmarks in #38496? I assume you also have speculative decoding enabled?
@geraldstanje1 Yep, I used spec decoding and compared strict vs. probabilistic rejection sampling methods. I added the server/benchmark commands to #38496.
Purpose
Following up on #35461, specifically with support for greedy sampling (temperature = 0).
To support this efficiently, I get the local argmax/max from the target logits for greedy requests in _gather_draft_logits_and_target_argmax_kernel. Then, during _probabilistic_rejection_kernel, the target argmax token is sampled only for the greedy requests. This limits the performance impact of greedy requests on the rest of the batch.
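For reference, the greedy (temperature = 0) path can be sketched in plain PyTorch. This is a hypothetical, non-Triton illustration, not the PR's actual kernel code; the function and tensor names are made up for clarity. For a greedy request, a draft token is accepted only if it equals the target model's argmax at that position; at the first mismatch, the target argmax token is emitted instead, and if every draft token matches, the bonus token (the target argmax at the final position) is appended.

```python
import torch

def greedy_rejection_sample(draft_tokens: torch.Tensor,
                            target_argmax: torch.Tensor) -> torch.Tensor:
    """Reference greedy rejection sampling for a single request.

    draft_tokens:  [num_spec] tokens proposed by the draft model.
    target_argmax: [num_spec + 1] argmax of the target logits at each
                   position (one extra slot for the bonus token).
    Returns the longest matching prefix of draft_tokens, followed by one
    token taken from the target model's argmax.
    """
    num_spec = draft_tokens.numel()
    # Positions where the draft token equals the target argmax.
    matches = draft_tokens == target_argmax[:num_spec]
    # Index of the first mismatch (num_spec if everything matched).
    rejected_pos = int(matches.cumprod(dim=0).sum())
    # Accepted prefix plus the target token at the first rejected
    # position (or the bonus token when nothing was rejected).
    return torch.cat([draft_tokens[:rejected_pos],
                      target_argmax[rejected_pos:rejected_pos + 1]])

# Example: draft proposes [5, 7, 9]; target argmax is [5, 7, 2, 8].
# The first two tokens match, 9 != 2 is rejected, so the output is [5, 7, 2].
out = greedy_rejection_sample(torch.tensor([5, 7, 9]),
                              torch.tensor([5, 7, 2, 8]))
```

In the PR itself this per-request logic runs inside the fused Triton kernel, so greedy requests only add an argmax lookup rather than a full probabilistic accept/reject pass.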