
vulkan: change gated_delta_net to shard a column across a subgroup #20662

Merged: 0cc4m merged 2 commits into ggml-org:master from jeffbolznv:gdc_sharding on Mar 20, 2026

Conversation

jeffbolznv (Contributor) opened this pull request:

This is based on #20391. I used an LLM to port the CUDA code to Vulkan and guided it through the various fixes needed for Vulkan (e.g. handling different subgroup sizes, the unknown mapping of subgroup invocations to invocation IDs, using subgroupAdd only optionally, etc.).
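
For readers unfamiliar with those portability issues, here is a minimal GLSL sketch of a subgroup-size-agnostic reduction of the kind described. This is not the PR's shader: the `HAVE_SUBGROUP_ADD` macro name is made up, and the actual fallback in ggml-vulkan may go through shared memory rather than shuffles.

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#ifdef HAVE_SUBGROUP_ADD
#extension GL_KHR_shader_subgroup_arithmetic : require
#else
#extension GL_KHR_shader_subgroup_shuffle : require
#endif

// 128 is a multiple of every power-of-two subgroup size up to 128, so the
// subgroups of this workgroup are always full.
layout(local_size_x = 128) in;

layout(binding = 0) readonly buffer In { float data[]; };
layout(binding = 1) writeonly buffer Out { float sums[]; };

float lane_sum(float v) {
#ifdef HAVE_SUBGROUP_ADD
    return subgroupAdd(v);  // single instruction where supported
#else
    // Manual butterfly reduction; the loop bound is gl_SubgroupSize, so the
    // same code handles 32-wide and 64-wide subgroups without a hard-coded
    // warp size.
    for (uint off = gl_SubgroupSize / 2u; off > 0u; off /= 2u) {
        v += subgroupShuffleXor(v, off);
    }
    return v;
#endif
}

void main() {
    float s = lane_sum(data[gl_GlobalInvocationID.x]);
    // Results are indexed by gl_SubgroupID; nothing assumes a particular
    // mapping between subgroups and gl_LocalInvocationIndex.
    if (subgroupElect()) {
        sums[gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID] = s;
    }
}
```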

This fixes a perf regression caused by the transposing of the values in memory (!20443).

I had also tried some other options, like using vec4 loads or transposing the values through shared memory, but they didn't recover all of the lost performance. Oliver pointed out to me that his sharding change made the memory accesses less spread out, which is why CUDA didn't see a regression from the transpose change.
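
As a rough illustration of that effect, here is a hedged sketch of a dot product against one column of a transposed (column-contiguous) state. The buffer names, `HEAD_DIM`, and the layout are assumptions for the sketch, not taken from the PR's shader:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require

layout(local_size_x = 128) in;  // multiple of every power-of-two subgroup size

const uint HEAD_DIM = 128;  // assumed head size, for illustration

// Assumed layout: element (k, col) of the state lives at s[col*HEAD_DIM + k].
layout(binding = 0) readonly buffer S { float s[]; };
layout(binding = 1) readonly buffer Q { float q[]; };
layout(binding = 2) writeonly buffer O { float o[]; };

void main() {
    // One column per subgroup.
    uint col = gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID;

    // With one lane per column, the subgroup's addresses would be HEAD_DIM
    // floats apart at every load. Striding the k-loop by gl_SubgroupSize
    // instead makes adjacent lanes read adjacent floats of the transposed
    // layout, so the loads coalesce.
    float part = 0.0;
    for (uint k = gl_SubgroupInvocationID; k < HEAD_DIM; k += gl_SubgroupSize) {
        part += s[col * HEAD_DIM + k] * q[k];
    }
    float dotp = subgroupAdd(part);  // combine per-lane partial sums
    if (subgroupElect()) {
        o[col] = dotp;
    }
}
```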

Before !20443:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -p 512 -n 128 -r 10 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      7758.52 ± 58.26 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        223.27 ± 0.88 |

After !20443:
Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -p 512 -n 128 -r 10 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      7730.15 ± 43.73 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        205.70 ± 0.22 |

This PR:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m c:\models\Qwen3.5-35B-A3B-Q4_K_M.gguf -p 512 -n 128 -r 10 --prio 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      7851.12 ± 68.33 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        227.73 ± 1.50 |

About the AI usage: If I strictly interpret the contributing guidelines, maybe this use of AI would be rejected. But using AI to translate a shader from one backend to another seems pretty reasonable to me, at least provided that I understand and am able to maintain the translated code.

jeffbolznv requested a review from a team as a code owner on March 17, 2026.
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 17, 2026.
lemmi commented Mar 17, 2026:

On Strix Halo (8060S, 40 CUs), TG improves by about 1 t/s (50 → 51 t/s), while PP takes a hit:

master (b6c83aa):

| model | n_ubatch | test | t/s |
| :--- | ---: | :--- | ---: |
| qwen3next 80B.A3B Q4_K - Medium | 512 | pp2048 | 580.90 ± 5.84 |
| qwen3next 80B.A3B Q4_K - Medium | 1024 | pp2048 | 662.96 ± 6.53 |
| qwen3next 80B.A3B Q4_K - Medium | 2048 | pp2048 | 594.46 ± 8.22 |
| qwen35moe 35B.A3B Q8_0 | 512 | pp2048 | 896.11 ± 2.50 |
| qwen35moe 35B.A3B Q8_0 | 1024 | pp2048 | 966.07 ± 5.04 |
| qwen35moe 35B.A3B Q8_0 | 2048 | pp2048 | 918.76 ± 7.64 |

PR:

| model | n_ubatch | test | t/s |
| :--- | ---: | :--- | ---: |
| qwen3next 80B.A3B Q4_K - Medium | 512 | pp2048 | 554.87 ± 5.09 |
| qwen3next 80B.A3B Q4_K - Medium | 1024 | pp2048 | 561.44 ± 5.46 |
| qwen3next 80B.A3B Q4_K - Medium | 2048 | pp2048 | 517.33 ± 3.94 |
| qwen35moe 35B.A3B Q8_0 | 512 | pp2048 | 835.33 ± 2.81 |
| qwen35moe 35B.A3B Q8_0 | 1024 | pp2048 | 855.55 ± 4.93 |
| qwen35moe 35B.A3B Q8_0 | 2048 | pp2048 | 707.76 ± 4.06 |

0cc4m (Contributor) commented Mar 17, 2026:

I also see pp regressions, even on Nvidia.

AMD RX 8060S

| model | size | params | ngl | n_ubatch | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 954.06 ± 20.53 | 955.49 ± 27.22 | 911.09 ± 27.58 | -4.6% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 47.48 ± 0.05 | 57.58 ± 0.26 | 57.13 ± 0.62 | -0.8% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 1158.11 ± 6.68 | 1032.73 ± 3.00 | 927.02 ± 2.78 | -10.2% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 47.46 ± 0.03 | 57.43 ± 0.24 | 56.73 ± 0.64 | -1.2% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 1214.06 ± 11.33 | 982.73 ± 0.70 | 633.83 ± 25.20 | -35.5% |
| qwen35moe 35B.A3B Q4_0 | 19.78 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 47.45 ± 0.10 | 57.53 ± 0.21 | 56.54 ± 0.61 | -1.7% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 512 | 1 | pp2048 | 278.74 ± 3.08 | 264.75 ± 3.55 | 234.03 ± 2.40 | -11.6% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 512 | 1 | tg128 | 23.88 ± 0.29 | 26.79 ± 0.02 | 27.95 ± 0.04 | +4.3% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 1024 | 1 | pp2048 | 286.68 ± 0.53 | 326.03 ± 0.30 | 248.05 ± 0.85 | -23.9% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 1024 | 1 | tg128 | 24.41 ± 0.04 | 27.17 ± 0.03 | 28.09 ± 0.01 | +3.4% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 2048 | 1 | pp2048 | 279.77 ± 1.24 | 325.53 ± 1.49 | 294.39 ± 1.04 | -9.6% |
| qwen35moe 122B.A10B Q4_0 | 65.88 GiB | 122.11 B | 99 | 2048 | 1 | tg128 | 24.39 ± 0.09 | 27.17 ± 0.06 | 28.09 ± 0.01 | +3.4% |
AMD RX 9070 XT

| model | size | params | ngl | n_ubatch | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 2171.98 ± 6.98 | 3008.03 ± 8.25 | 3011.10 ± 9.33 | +0.1% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 83.31 ± 0.63 | 136.65 ± 1.04 | 137.68 ± 0.99 | +0.8% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 2814.64 ± 6.15 | 3410.45 ± 6.11 | 3417.58 ± 7.84 | +0.2% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 83.26 ± 0.65 | 136.85 ± 0.82 | 137.63 ± 1.03 | +0.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 3230.59 ± 2.27 | 3477.58 ± 7.31 | 3458.95 ± 5.32 | -0.5% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 83.19 ± 0.61 | 137.02 ± 0.85 | 137.68 ± 1.11 | +0.5% |
AMD Radeon Pro VII

| model | size | params | ngl | n_ubatch | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 356.74 ± 0.66 | 737.56 ± 2.52 | 705.92 ± 2.26 | -4.3% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 58.51 ± 0.02 | 80.18 ± 0.42 | 83.47 ± 0.27 | +4.1% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 467.88 ± 1.54 | 868.48 ± 1.40 | 836.91 ± 1.92 | -3.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 58.59 ± 0.06 | 80.84 ± 0.22 | 83.12 ± 0.20 | +2.8% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 549.35 ± 0.62 | 921.88 ± 1.90 | 876.97 ± 0.84 | -4.9% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 58.46 ± 0.03 | 80.50 ± 0.33 | 83.33 ± 0.06 | +3.5% |
Intel A770

| model | size | params | ngl | n_ubatch | fa | test | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 700.89 ± 2.33 | 726.45 ± 2.43 | +3.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 38.59 ± 0.35 | 39.75 ± 0.02 | +3.0% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 864.59 ± 1.18 | 921.97 ± 1.16 | +6.6% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 38.70 ± 0.11 | 39.76 ± 0.03 | +2.7% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 987.11 ± 2.25 | 1069.47 ± 1.96 | +8.3% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 38.75 ± 0.02 | 39.78 ± 0.02 | +2.7% |
Nvidia RTX 3090

| model | size | params | ngl | n_ubatch | fa | test | t/s (CUDA) | t/s (before) | t/s (after) | diff |
| :--- | ---: | ---: | --: | ---: | --: | :--- | ---: | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | pp2048 | 2529.81 ± 7.43 | 3148.10 ± 12.63 | 3029.12 ± 9.20 | -3.8% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 512 | 1 | tg128 | 145.91 ± 1.12 | 146.95 ± 0.78 | 163.80 ± 1.71 | +11.5% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | pp2048 | 3195.98 ± 7.26 | 3758.22 ± 15.71 | 3582.31 ± 8.26 | -4.7% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 1024 | 1 | tg128 | 145.85 ± 0.40 | 148.72 ± 0.66 | 164.67 ± 0.81 | +10.7% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | pp2048 | 3654.74 ± 6.89 | 4175.09 ± 5.14 | 3959.85 ± 5.51 | -5.2% |
| qwen35moe 35B.A3B Q2_K - Medium | 11.76 GiB | 34.66 B | 99 | 2048 | 1 | tg128 | 145.25 ± 0.39 | 148.12 ± 0.50 | 164.14 ± 0.56 | +10.8% |

jeffbolznv (Contributor, Author) commented:

Thanks for the testing. I think my initial version spread the work out too much (e.g. 1024 workgroups, each with 4 subgroups); while this worked fine on the 5090, it didn't work well on smaller GPUs. I saw some significant slowdowns in the backend perf tests on a 4070. I changed it to spread each column across a configurable number of lanes, from 1 up to subgroup_size; it currently chooses lanes_per_column = 8, which seems to work well (see the sketch after the benchmark below). Perf is even up a bit more on the 5090:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           pp512 |      8119.16 ± 40.09 |
| qwen35moe 35B.A3B Q4_K - Medium |  20.49 GiB |    34.66 B | Vulkan     |  99 |           tg128 |        227.63 ± 0.74 |
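
Continuing the hypothetical sketch from the description (same assumed buffer names, `HEAD_DIM`, and column-contiguous layout), the lanes_per_column idea can be expressed with a clustered subgroup reduction; the PR's shader may implement the partial reduction differently:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_clustered : require

layout(local_size_x = 128) in;

// LANES lanes cooperate on one column, so a 32-wide subgroup covers
// 32 / LANES = 4 columns and the dispatch needs a quarter as many
// workgroups as a whole-subgroup-per-column split would.
const uint LANES    = 8;    // power of two, assumed <= gl_SubgroupSize
const uint HEAD_DIM = 128;  // assumed head size

layout(binding = 0) readonly buffer S { float s[]; };  // column-contiguous
layout(binding = 1) readonly buffer Q { float q[]; };
layout(binding = 2) writeonly buffer O { float o[]; };

void main() {
    uint lane        = gl_SubgroupInvocationID % LANES;
    uint cols_per_sg = gl_SubgroupSize / LANES;
    uint col         = (gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID) * cols_per_sg
                     + gl_SubgroupInvocationID / LANES;

    // Each step still loads LANES consecutive floats per column, so the
    // accesses stay coalesced while one subgroup covers several columns.
    float part = 0.0;
    for (uint k = lane; k < HEAD_DIM; k += LANES) {
        part += s[col * HEAD_DIM + k] * q[k];
    }
    // Reduce across only the LANES consecutive lanes sharing this column.
    float col_sum = subgroupClusteredAdd(part, LANES);
    if (lane == 0u) {
        o[col] = col_sum;
    }
}
```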

lemmi commented Mar 17, 2026:

Numbers are up for me as well:

| Model | Microbatch size | Test | t/s master | t/s 20662 | Speedup |
| :--- | ---: | :--- | ---: | ---: | ---: |
| qwen35moe 35B.A3B Q8_0 | 512 | pp2048 | 888.08 | 907.29 | 1.02 |
| qwen35moe 35B.A3B Q8_0 | 1024 | pp2048 | 954.94 | 994.80 | 1.04 |
| qwen35moe 35B.A3B Q8_0 | 2048 | pp2048 | 874.99 | 952.12 | 1.09 |
| qwen3next 80B.A3B Q4_K_M | 512 | pp2048 | 586.07 | 587.49 | 1.00 |
| qwen3next 80B.A3B Q4_K_M | 1024 | pp2048 | 655.31 | 667.98 | 1.02 |
| qwen3next 80B.A3B Q4_K_M | 2048 | pp2048 | 589.75 | 642.97 | 1.09 |

0cc4m (Contributor) left a review:

Performance looks good now.

0cc4m merged commit e06c3ab into ggml-org:master on Mar 20, 2026. 49 checks passed.
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request on Mar 20, 2026.
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026.