vulkan: change gated_delta_net to shard a column across a subgroup #20662
Merged
0cc4m merged 2 commits into ggml-org:master on Mar 20, 2026
Conversation
This is based on ggml-org#20391. I used an LLM to port the CUDA code to Vulkan and guided it to make various fixes needed for Vulkan (e.g. handling different subgroup sizes, the unknown mapping of subgroups to invocation IDs, using subgroupAdd optionally, etc.). This fixes a perf regression from the transposing of the values in memory (!20443).

I had also tried some other options, like using vec4 loads or transposing the values through shared memory, but they didn't recover all the perf. Oliver pointed out to me that his sharding change made the memory accesses less spread out, so CUDA didn't have a regression from the transpose change.

About the AI usage: if I strictly interpret the contributing guidelines, maybe this use of AI would be rejected. But using AI to translate a shader from one backend to another seems pretty reasonable to me, at least provided that I understand and am able to maintain the translated code.
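For readers unfamiliar with the pattern, here is a minimal, hypothetical GLSL sketch of the column-per-subgroup idea and of the Vulkan-specific fixes mentioned above. It is not the actual gated_delta_net shader; the buffer names, push constants, and the `USE_SUBGROUP_ADD` macro are all made up for illustration. It derives the column from subgroup coordinates (because the mapping of invocations to subgroups is implementation-defined) and uses subgroupAdd only when available:

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#ifdef USE_SUBGROUP_ADD
#extension GL_KHR_shader_subgroup_arithmetic : enable
#endif

// Hypothetical column-sum kernel: one subgroup reduces one column.
layout(local_size_x = 128) in;

layout(binding = 0) readonly  buffer A { float a[]; };
layout(binding = 1) writeonly buffer D { float dst[]; };

layout(push_constant) uniform PC {
    uint n_rows;   // elements per column
    uint n_cols;   // number of columns
} pc;

// Assumes the workgroup size (128) is a multiple of the subgroup size,
// so every subgroup is full.
shared float tmp[128];

void main() {
    // Derive the column from subgroup coordinates, not gl_LocalInvocationID:
    // the spec does not fix how invocations map onto subgroups.
    const uint col = gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID;
    const bool active = col < pc.n_cols;

    // Each lane strides over the rows of its subgroup's column.
    float sum = 0.0;
    if (active) {
        for (uint r = gl_SubgroupInvocationID; r < pc.n_rows; r += gl_SubgroupSize) {
            sum += a[col * pc.n_rows + r];
        }
    }

#ifdef USE_SUBGROUP_ADD
    // Fast path: subgroup arithmetic is supported.
    sum = subgroupAdd(sum);
#else
    // Portable fallback: reduce through shared memory, indexed by subgroup
    // coordinates for the same mapping reason as above.
    tmp[gl_SubgroupID * gl_SubgroupSize + gl_SubgroupInvocationID] = sum;
    barrier();
    if (gl_SubgroupInvocationID == 0) {
        sum = 0.0;
        for (uint i = 0; i < gl_SubgroupSize; ++i) {
            sum += tmp[gl_SubgroupID * gl_SubgroupSize + i];
        }
    }
#endif

    if (active && gl_SubgroupInvocationID == 0) {
        dst[col] = sum;
    }
}
```

With this layout, one workgroup covers gl_NumSubgroups columns, so the dispatch needs roughly ceil(n_cols / gl_NumSubgroups) workgroups.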
On Strix Halo (8060S, 40 CUs), TG improves by about 1 t/s (50 -> 51 t/s), while PP takes a hit (benchmark tables for master b6c83aa and this PR omitted).
Contributor

I also see pp regressions, even on Nvidia. Tested on AMD RX 8060S, AMD RX 9070 XT, AMD Radeon Pro VII, Intel A770, and Nvidia RTX 3090 (tables omitted).
Contributor (Author)

Thanks for the testing. I think my initial version spread the work out too much (e.g. 1024 workgroups, each with 4 subgroups), and while this worked fine on a 5090, it didn't work well on smaller GPUs. I was able to see some significant slowdowns in the backend perf tests on a 4070. I changed it to spread the work across a configurable number of lanes, from 1 to subgroup_size; it currently chooses lanes_per_column = 8, which seems to work well. Perf is even up a bit more on the 5090 (table omitted).
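To illustrate what lanes_per_column changes, here is another hypothetical sketch (again not the actual gated_delta_net shader; it assumes clustered subgroup ops are available, and uses a `#define` where the real shader makes the lane count configurable):

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_clustered : enable

// Each column is handled by a small cluster of lanes instead of a whole
// subgroup, so one subgroup covers several columns and fewer workgroups
// are dispatched.
layout(local_size_x = 128) in;

// Compile-time tunable: any power of two from 1 up to the subgroup size.
#define LANES_PER_COLUMN 8

layout(binding = 0) readonly  buffer A { float a[]; };
layout(binding = 1) writeonly buffer D { float dst[]; };

layout(push_constant) uniform PC {
    uint n_rows;
    uint n_cols;
} pc;

void main() {
    const uint cols_per_subgroup = gl_SubgroupSize / LANES_PER_COLUMN;
    const uint col = (gl_WorkGroupID.x * gl_NumSubgroups + gl_SubgroupID) * cols_per_subgroup
                   + gl_SubgroupInvocationID / LANES_PER_COLUMN;
    const uint lane = gl_SubgroupInvocationID % LANES_PER_COLUMN;

    // Each cluster of LANES_PER_COLUMN lanes strides over one column.
    float sum = 0.0;
    if (col < pc.n_cols) {
        for (uint r = lane; r < pc.n_rows; r += LANES_PER_COLUMN) {
            sum += a[col * pc.n_rows + r];
        }
    }

    // Reduce within each cluster; the cluster size must be a compile-time
    // power of two no larger than the subgroup size.
    sum = subgroupClusteredAdd(sum, LANES_PER_COLUMN);

    if (col < pc.n_cols && lane == 0) {
        dst[col] = sum;
    }
}
```

With LANES_PER_COLUMN = 8 and a 32-wide subgroup, each subgroup covers 4 columns instead of 1, so the same number of columns needs a quarter of the workgroups, which is what helps the smaller GPUs.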
Numbers are up for me as well (table omitted).
0cc4m approved these changes on Mar 20, 2026:

Performance looks good now.
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request on Mar 20, 2026 (…gml-org#20662):

* vulkan: change gated_delta_net to shard a column across a subgroup
* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request on Apr 26, 2026, with the same commit message.