Eval bug: Multi GPU with draft errors: - the tokens for sequence 0 in the input batch have a starting position of Y = 1360 it is required that the sequence positions remain consecutive: Y = X + 1 #28

@FortinFred

Name and Version

$ ./llama-server --version
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 23822 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 11910 MiB
  Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, VRAM: 11911 MiB
version: 9459 (07ac3cec6)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:04:00.0 Off |                  N/A |
| 71%   52C    P2             34W /  170W |   11482MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        Off |   00000000:07:00.0  On |                  N/A |
| 68%   57C    P2            118W /  350W |   11067MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1165      G   /usr/lib/Xorg                             4MiB |
|    0   N/A  N/A            1955    C+G   /usr/bin/walker                          92MiB |
|    0   N/A  N/A           10440    C+G   ...rack-uuid=3190708988185955192          4MiB |
|    0   N/A  N/A           30847      C   ...ma.cpp/build/bin/llama-server      11332MiB |
|    1   N/A  N/A            1165      G   /usr/lib/Xorg                            36MiB |
|    1   N/A  N/A            1747      G   Hyprland                                233MiB |
|    1   N/A  N/A            1889      G   Xwayland                                  4MiB |
|    1   N/A  N/A            2316      G   alacritty                                54MiB |
|    1   N/A  N/A           10440      G   ...rack-uuid=3190708988185955192        118MiB |
|    1   N/A  N/A           30847      C   ...ma.cpp/build/bin/llama-server      10396MiB |
+-----------------------------------------------------------------------------------------+

Models

unsloth Qwen3.6-27B

Problem description & steps to reproduce

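Ping the model using the llama-swap config below. The drafter then fails with the errors shown under "Relevant log output".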

# llama-swap config
  bee_qwen3.6-27b:
    ttl: 1800 # Auto-unload after 30 minutes of inactivity (1800 seconds)
    cmd: >
      /home/fred/workspaces/ai/beellama.cpp/build/bin/llama-server
      --port ${PORT}
      -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_S
      --spec-draft-model "/home/fred/.cache/huggingface/hub/Qwen3.6-27B-DFlash-Q4_K_M.gguf"
      --spec-type dflash
      --spec-draft-ngl all
      --jinja
      --flash-attn on
      --no-mmproj
      --no-mmap
      --mlock
      --fit-target 512
      --cache-type-k turbo4 --cache-type-v turbo3_tcq
      --parallel 1
      --kv-unified
      --ctx-size 32000
    filters:
      stripParams: "temperature, top_p, top_k, min_p, presence_penalty, repetition_penalty"
      setParamsByID:
        "${MODEL_ID}:thinking":
          chat_template_kwargs:
            enable_thinking: true
            preserve_thinking: true
          reasoning_budget: 4096
          temperature: 1.0
          top_p: 0.95
          top_k: 20
          min_p: 0.05
          presence_penalty: 1.5
          repetition_penalty: 1.0

        "${MODEL_ID}:thinking-coding":
          chat_template_kwargs:
            enable_thinking: true
            preserve_thinking: true
          reasoning_budget: 4096
          temperature: 0.6
          top_p: 0.95
          top_k: 20
          min_p: 0.0
          presence_penalty: 0.0
          repetition_penalty: 1.0

        "${MODEL_ID}:instruct":
          chat_template_kwargs:
            enable_thinking: false
            preserve_thinking: false
          reasoning_budget: 4096
          temperature: 0.7
          top_p: 0.8
          top_k: 20
          min_p: 0.0
          presence_penalty: 1.5
          repetition_penalty: 1.0

        "${MODEL_ID}:instruct-reasoning":
          chat_template_kwargs:
            enable_thinking: false
            preserve_thinking: false
          reasoning_budget: 4096
          temperature: 1.0
          top_p: 0.95
          top_k: 20
          min_p: 0.0
          presence_penalty: 1.5
          repetition_penalty: 1.0

First Bad Commit

No response

Relevant log output

 - the tokens for sequence 0 in the input batch have a starting position of Y = 1360
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
dflash: drafter decode failed with -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 259
 - the tokens for sequence 0 in the input batch have a starting position of Y = 1361
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
dflash: drafter decode failed with -1
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 259
 - the tokens for sequence 0 in the input batch have a starting position of Y = 1362
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
dflash: drafter decode failed with -1
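
To make the rule in the log concrete, here is a minimal Python sketch of the check it describes (Y = X + 1), plus a small parser that pulls the X/Y pairs out of the log text and reports how far the batch start is from the expected position. This is not llama.cpp code: the names `positions_are_consecutive` and `find_gaps` are hypothetical, and the regular expressions assume the exact message wording shown above.

```python
import re

# Hypothetical helper (not from llama.cpp): mirrors the rule quoted in the log,
# "it is required that the sequence positions remain consecutive: Y = X + 1".
def positions_are_consecutive(last_cached_pos: int, batch_start_pos: int) -> bool:
    return batch_start_pos == last_cached_pos + 1

# Patterns assume the log wording shown in this report.
X_RE = re.compile(r"is X = (\d+)")
Y_RE = re.compile(r"starting position of Y = (\d+)")

def find_gaps(log_text: str) -> list[tuple[int, int, int]]:
    """Return (X, Y, gap) per X/Y pair, where gap = Y - (X + 1).

    A Y line with no preceding X line (e.g. a truncated first entry) is skipped.
    """
    results = []
    last_x = None
    for line in log_text.splitlines():
        m = X_RE.search(line)
        if m:
            last_x = int(m.group(1))
            continue
        m = Y_RE.search(line)
        if m and last_x is not None:
            y = int(m.group(1))
            results.append((last_x, y, y - (last_x + 1)))
            last_x = None  # each X is paired with at most one Y
    return results

# Two lines copied from the log above.
sample = """\
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 259
 - the tokens for sequence 0 in the input batch have a starting position of Y = 1361
"""

for x, y, gap in find_gaps(sample):
    print(f"X={x} Y={y} consecutive={positions_are_consecutive(x, y)} gap={gap}")
```

For the pair taken from the log, the batch starts 1101 positions past where the cache for sequence 0 ends. The sketch only quantifies that mismatch; it does not say why the positions diverged.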
