
vulkan: change gated_delta_net to shard a column across a subgroup #20662

Merged: 0cc4m merged 2 commits into ggml-org:master from jeffbolznv:gdc_sharding on Mar 20, 2026

Conversation

jeffbolznv (Contributor) opened this pull request:

This is based on #20391. I used an LLM to port the CUDA code to Vulkan and guided it through the various fixes needed for Vulkan (e.g. handling different subgroup sizes, the unknown mapping of subgroup invocations to invocation IDs, using subgroupAdd only optionally, etc.).
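
For readers unfamiliar with those portability issues, here is a minimal GLSL sketch of a subgroup-size-agnostic reduction of the kind described. This is not the PR's shader: the `HAVE_SUBGROUP_ADD` macro name is made up, and the actual fallback in ggml-vulkan may go through shared memory rather than shuffles.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#ifdef HAVE_SUBGROUP_ADD
#extension GL_KHR_shader_subgroup_arithmetic : require
#else
#extension GL_KHR_shader_subgroup_shuffle : require
#endif

// 128 is a multiple of every power-of-two subgroup size up to 128, so the
// subgroups of this workgroup are always full.
layout(local_size_x = 128) in;

layout(binding = 0) readonly buffer In { float data[]; };
layout(binding = 1) writeonly buffer Out { float sums[]; };

float lane_sum(float v) {
#ifdef HAVE_SUBGROUP_ADD
    return subgroupAdd(v);  // single instruction where supported
#else
    // Manual butterfly reduction; the loop bound is gl_SubgroupSize, so the
    // same code handles 32-wide and 64-wide subgroups without a hard-coded
    // warp size.
    for (uint off = gl_SubgroupSize / 2u; off > 0u; off /= 2u) {
        v += subgroupShuffleXor(v, off);
    }
    return v;
#endif
}

void main() {
    float s = lane_sum(data[gl_GlobalInvocationID.x]);
    // Results are indexed by gl_SubgroupID; nothing assumes a particular
    // mapping between subgroups and gl_LocalInvocationIndex.
    if (subgroupElect()) {
        sums[gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID] = s;
    }
}
```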

This fixes a perf regression caused by the transposing of the values in memory (!20443).

I had also tried some other options, like using vec4 loads or transposing the values through shared memory, but they didn't recover all of the lost performance. Oliver pointed out to me that his sharding change made the memory accesses less spread out, which is why CUDA didn't see a regression from the transpose change.
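
As a rough illustration of that effect, here is a hedged sketch of a dot product against one column of a transposed (column-contiguous) state. The buffer names, `HEAD_DIM`, and the layout are assumptions for the sketch, not taken from the PR's shader:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 128) in;  // multiple of every power-of-two subgroup size

const uint HEAD_DIM = 128;  // assumed head size, for illustration

// Assumed layout: element (k, col) of the state lives at s[col*HEAD_DIM + k].
layout(binding = 0) readonly buffer S { float s[]; };
layout(binding = 1) readonly buffer Q { float q[]; };
layout(binding = 2) writeonly buffer O { float o[]; };

void main() {
    // One column per subgroup.
    uint col = gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID;

    // With one lane per column, the subgroup's addresses would be HEAD_DIM
    // floats apart at every load. Striding the k-loop by gl_SubgroupSize
    // instead makes adjacent lanes read adjacent floats of the transposed
    // layout, so the loads coalesce.
    float part = 0.0;
    for (uint k = gl_SubgroupInvocationID; k < HEAD_DIM; k += gl_SubgroupSize) {
        part += s[col * HEAD_DIM + k] * q[k];
    }
    float dotp = subgroupAdd(part);  // combine per-lane partial sums
    if (subgroupElect()) {
        o[col] = dotp;
    }
}
```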

Before !20443:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -p 512 -n 128 -r 10 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      7758.52 ± 58.26 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        223.27 ± 0.88 |

After !20443:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -p 512 -n 128 -r 10 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      7730.15 ± 43.73 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        205.70 ± 0.22 |

This PR:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -p 512 -n 128 -r 10 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      7851.12 ± 68.33 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        227.73 ± 1.50 |

About the AI usage: If I strictly interpret the contributing guidelines, maybe this use of AI would be rejected. But using AI to translate a shader from one backend to another seems pretty reasonable to me, at least provided that I understand and am able to maintain the translated code.

jeffbolznv requested a review from a team as a code owner on March 17, 2026.
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 17, 2026.
lemmi commented Mar 17, 2026:

On Strix Halo (8060S, 40 CUs), TG improves by about 1 t/s (50 → 51 t/s), while PP takes a hit:

master (b6c83aa):

| model | n_ubatch | test | t/s |
| :--- | ---: | :--- | ---: |
| qwen3next 80B.A3B Q4_K - Medium | 512 | pp2048 | 580.90 ± 5.84 |
| qwen3next 80B.A3B Q4_K - Medium | 1024 | pp2048 | 662.96 ± 6.53 |
| qwen3next 80B.A3B Q4_K - Medium | 2048 | pp2048 | 594.46 ± 8.22 |
| qwen35moe 35B.A3B Q8_0 | 512 | pp2048 | 896.11 ± 2.50 |
| qwen35moe 35B.A3B Q8_0 | 1024 | pp2048 | 966.07 ± 5.04 |
| qwen35moe 35B.A3B Q8_0 | 2048 | pp2048 | 918.76 ± 7.64 |

PR:

| model | n_ubatch | test | t/s |
| :--- | ---: | :--- | ---: |
| qwen3next 80B.A3B Q4_K - Medium | 512 | pp2048 | 554.87 ± 5.09 |
| qwen3next 80B.A3B Q4_K - Medium | 1024 | pp2048 | 561.44 ± 5.46 |
| qwen3next 80B.A3B Q4_K - Medium | 2048 | pp2048 | 517.33 ± 3.94 |
| qwen35moe 35B.A3B Q8_0 | 512 | pp2048 | 835.33 ± 2.81 |
| qwen35moe 35B.A3B Q8_0 | 1024 | pp2048 | 855.55 ± 4.93 |
| qwen35moe 35B.A3B Q8_0 | 2048 | pp2048 | 707.76 ± 4.06 |

0cc4m (Contributor) commented Mar 17, 2026:

I also see pp regressions, even on Nvidia.

AMD RX 8060S

| model | size | params | ngl | n_ubatch | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 954.06 ± 20.53 | 955.49 ± 27.22 | 911.09 ± 27.58 | -4.6% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 47.48 ± 0.05 | 57.58 ± 0.26 | 57.13 ± 0.62 | -0.8% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 1158.11 ± 6.68 | 1032.73 ± 3.00 | 927.02 ± 2.78 | -10.2% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 47.46 ± 0.03 | 57.43 ± 0.24 | 56.73 ± 0.64 | -1.2% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 1214.06 ± 11.33 | 982.73 ± 0.70 | 633.83 ± 25.20 | -35.5% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 47.45 ± 0.10 | 57.53 ± 0.21 | 56.54 ± 0.61 | -1.7% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 512 | 1 | pp2048 | 278.74 ± 3.08 | 264.75 ± 3.55 | 234.03 ± 2.40 | -11.6% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 512 | 1 | tg128 | 23.88 ± 0.29 | 26.79 ± 0.02 | 27.95 ± 0.04 | +4.3% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 1024 | 1 | pp2048 | 286.68 ± 0.53 | 326.03 ± 0.30 | 248.05 ± 0.85 | -23.9% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 1024 | 1 | tg128 | 24.41 ± 0.04 | 27.17 ± 0.03 | 28.09 ± 0.01 | +3.4% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 2048 | 1 | pp2048 | 279.77 ± 1.24 | 325.53 ± 1.49 | 294.39 ± 1.04 | -9.6% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 2048 | 1 | tg128 | 24.39 ± 0.09 | 27.17 ± 0.06 | 28.09 ± 0.01 | +3.4% |
AMD RX 9070 XT

| model | size | params | ngl | n_ubatch | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 2171.98 ± 6.98 | 3008.03 ± 8.25 | 3011.10 ± 9.33 | +0.1% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 83.31 ± 0.63 | 136.65 ± 1.04 | 137.68 ± 0.99 | +0.8% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 2814.64 ± 6.15 | 3410.45 ± 6.11 | 3417.58 ± 7.84 | +0.2% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 83.26 ± 0.65 | 136.85 ± 0.82 | 137.63 ± 1.03 | +0.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 3230.59 ± 2.27 | 3477.58 ± 7.31 | 3458.95 ± 5.32 | -0.5% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 83.19 ± 0.61 | 137.02 ± 0.85 | 137.68 ± 1.11 | +0.5% |
AMD Radeon Pro VII

| model | size | params | ngl | n_ubatch | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 356.74 ± 0.66 | 737.56 ± 2.52 | 705.92 ± 2.26 | -4.3% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 58.51 ± 0.02 | 80.18 ± 0.42 | 83.47 ± 0.27 | +4.1% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 467.88 ± 1.54 | 868.48 ± 1.40 | 836.91 ± 1.92 | -3.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 58.59 ± 0.06 | 80.84 ± 0.22 | 83.12 ± 0.20 | +2.8% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 549.35 ± 0.62 | 921.88 ± 1.90 | 876.97 ± 0.84 | -4.9% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 58.46 ± 0.03 | 80.50 ± 0.33 | 83.33 ± 0.06 | +3.5% |
Intel A770

| model | size | params | ngl | n_ubatch | fa | test | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 700.89 ± 2.33 | 726.45 ± 2.43 | +3.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 38.59 ± 0.35 | 39.75 ± 0.02 | +3.0% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 864.59 ± 1.18 | 921.97 ± 1.16 | +6.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 38.70 ± 0.11 | 39.76 ± 0.03 | +2.7% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 987.11 ± 2.25 | 1069.47 ± 1.96 | +8.3% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 38.75 ± 0.02 | 39.78 ± 0.02 | +2.7% |
Nvidia RTX 3090

| model | size | params | ngl | n_ubatch | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 2529.81 ± 7.43 | 3148.10 ± 12.63 | 3029.12 ± 9.20 | -3.8% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 145.91 ± 1.12 | 146.95 ± 0.78 | 163.80 ± 1.71 | +11.5% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 3195.98 ± 7.26 | 3758.22 ± 15.71 | 3582.31 ± 8.26 | -4.7% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 145.85 ± 0.40 | 148.72 ± 0.66 | 164.67 ± 0.81 | +10.7% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 3654.74 ± 6.89 | 4175.09 ± 5.14 | 3959.85 ± 5.51 | -5.2% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 145.25 ± 0.39 | 148.12 ± 0.50 | 164.14 ± 0.56 | +10.8% |

jeffbolznv (Contributor, Author) commented:

Thanks for the testing. I think my initial version spread the work out too much (e.g. 1024 workgroups, each with 4 subgroups); while this worked fine on the 5090, it didn't work well on smaller GPUs. I saw some significant slowdowns in the backend perf tests on a 4070. I changed it to spread each column across a configurable number of lanes, from 1 up to subgroup_size; it currently chooses lanes_per_column = 8, which seems to work well (see the sketch after the benchmark below). Perf is even up a bit more on the 5090:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      8119.16 ± 40.09 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        227.63 ± 0.74 |
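
Continuing the hypothetical sketch from the description (same assumed buffer names, `HEAD_DIM`, and column-contiguous layout), the lanes_per_column idea can be expressed with a clustered subgroup reduction; the PR's shader may implement the partial reduction differently:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_clustered : require

layout(local_size_x = 128) in;

// LANES lanes cooperate on one column, so a 32-wide subgroup covers
// 32 / LANES = 4 columns and the dispatch needs a quarter as many
// workgroups as a whole-subgroup-per-column split would.
const uint LANES    = 8;    // power of two, assumed <= gl_SubgroupSize
const uint HEAD_DIM = 128;  // assumed head size

layout(binding = 0) readonly buffer S { float s[]; };  // column-contiguous
layout(binding = 1) readonly buffer Q { float q[]; };
layout(binding = 2) writeonly buffer O { float o[]; };

void main() {
    uint lane        = gl_SubgroupInvocationID % LANES;
    uint cols_per_sg = gl_SubgroupSize / LANES;
    uint col         = (gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID) * cols_per_sg
                     + gl_SubgroupInvocationID / LANES;

    // Each step still loads LANES consecutive floats per column, so the
    // accesses stay coalesced while one subgroup covers several columns.
    float part = 0.0;
    for (uint k = lane; k < HEAD_DIM; k += LANES) {
        part += s[col * HEAD_DIM + k] * q[k];
    }
    // Reduce across only the LANES consecutive lanes sharing this column.
    float col_sum = subgroupClusteredAdd(part, LANES);
    if (lane == 0u) {
        o[col] = col_sum;
    }
}
```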

lemmi commented Mar 17, 2026:

Numbers are up for me as well:

| Model | Microbatch size | Test | t/s master | t/s 20662 | Speedup |
| :--- | ---: | :--- | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q8_0 | 512 | pp2048 | 888.08 | 907.29 | 1.02 |
| qwen35moe 35B.A3B Q8_0 | 1024 | pp2048 | 954.94 | 994.80 | 1.04 |
| qwen35moe 35B.A3B Q8_0 | 2048 | pp2048 | 874.99 | 952.12 | 1.09 |
| qwen3next 80B.A3B Q4_K_M | 512 | pp2048 | 586.07 | 587.49 | 1.00 |
| qwen3next 80B.A3B Q4_K_M | 1024 | pp2048 | 655.31 | 667.98 | 1.02 |
| qwen3next 80B.A3B Q4_K_M | 2048 | pp2048 | 589.75 | 642.97 | 1.09 |

0cc4m (Contributor) left a review:

Performance looks good now.

0cc4m merged commit e06c3ab into ggml-org:master on Mar 20, 2026. 49 checks passed.
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request on Mar 20, 2026.
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026.