Optimized RVV q1_0 dot #31
Conversation
Thanks, that's impressive speed on such a device :) Do people need a special setup to build and run this, or do the llama.cpp build tools work? Would be happy to merge it to our fork; I don't have a similar device to test it myself though. Will review more closely later this week. For some reason I stopped getting email notifications from GitHub.
Pull request overview
Adds a RISC-V RVV-specific implementation for the q1_0 × q8_0 dot product in the CPU backend, continuing the codebase’s architecture-specific quantized dot-product optimizations.
Changes:
- Added two fixed-width RVV kernels for `ggml_vec_dot_q1_0_q8_0` targeting 128-bit and 256-bit vector configurations.
- Added RVV runtime dispatch in the RISC-V quantized dot-product path.
- Updated the RISC-V fallback aliasing so this path can call the true generic implementation when needed.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `ggml/src/ggml-cpu/arch/riscv/quants.c` | Adds the new RVV q1_0×q8_0 kernels, helper tables, and runtime dispatch logic. |
| `ggml/src/ggml-cpu/arch-fallback.h` | Removes the RISC-V alias for the q1_0 generic dot product so the arch-specific implementation can fall back correctly. |
I don't actually know if llama.cpp accounts for Zvl64b; it seems to be meant for embedded or 32-bit cores.
Yeah, Copilot might be confused. Saw a similar PR in mainline llama.cpp: ggml-org#22500
@khosravipasha No, this PR is independent from mine, but yes, it's about the same dot-product op. The implementation there is not very efficient, though: it uses neither VLA nor fixed-VLEN specialized kernels, and it relies on LMUL==4, which forces the hardware to do four ops per instruction. It also uses slightly different logic for gathering and masking. I think I need to open my PR too and also try similar approaches.
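For context on the LMUL point above, here is a minimal sketch (illustrative only, not code from either PR) of the same int8 add written at LMUL=1 and LMUL=4. An LMUL=4 instruction names a group of four vector registers, so it saves loop overhead, but most in-order RVV cores crack it into four back-to-back internal ops, which is why relying on LMUL=4 alone does not buy arithmetic throughput.

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// LMUL=1: each instruction touches one vector register (VLEN bits of data).
static void add_i8_m1(int8_t *dst, const int8_t *a, const int8_t *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e8m1(n - i);
        vint8m1_t va = __riscv_vle8_v_i8m1(a + i, vl);
        vint8m1_t vb = __riscv_vle8_v_i8m1(b + i, vl);
        __riscv_vse8_v_i8m1(dst + i, __riscv_vadd_vv_i8m1(va, vb, vl), vl);
        i += vl;
    }
}

// LMUL=4: each instruction covers a four-register group (4*VLEN bits).
// Fewer instructions retire, but in-order cores typically split each one
// into four internal ops, so execution throughput is roughly unchanged.
static void add_i8_m4(int8_t *dst, const int8_t *a, const int8_t *b, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e8m4(n - i);
        vint8m4_t va = __riscv_vle8_v_i8m4(a + i, vl);
        vint8m4_t vb = __riscv_vle8_v_i8m4(b + i, vl);
        __riscv_vse8_v_i8m4(dst + i, __riscv_vadd_vv_i8m4(va, vb, vl), vl);
        i += vl;
    }
}
```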
Turns out I had been sleeping on some free performance by not noticing a better instruction for mask loading:
Perplexity is alright, will prepare a PR for mainline
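The thread doesn't say which instruction this refers to; a plausible candidate (an assumption, not confirmed above) is `vlm.v`, which loads a packed bitmask straight into a mask register instead of rebuilding the mask from a byte vector with a compare. A minimal sketch of the two approaches:

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Rebuild a mask from one byte per element: a vector load plus a compare.
static vbool8_t mask_from_bytes(const uint8_t *flags, size_t vl) {
    vuint8m1_t v = __riscv_vle8_v_u8m1(flags, vl);
    return __riscv_vmsne_vx_u8m1_b8(v, 0, vl);
}

// Load a packed bitmask (one bit per element) directly with vlm.v:
// a single mask-register load, no temporary data vector and no compare.
static vbool8_t mask_from_bits(const uint8_t *bits, size_t vl) {
    return __riscv_vlm_v_b8(bits, vl);
}
```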
@khosravipasha The thingy got merged, but one question about
The question has resolved itself; there was some confusion about the last comment in the main PR.
Continuation of #10 for the RISC-V V extension.
Implemented two fixed-VLEN kernels, loosely inspired by the AVX2 implementation.
VLA causes severe overhead, and the task only has two realistic VL combinations (in its simple form).
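A rough sketch of what the two-kernel dispatch could look like, based only on the VLEN/LMUL note in the footnote below (function names are hypothetical stand-ins, not the actual kernels): pick a kernel once from the hardware's VLEN, which `__riscv_vsetvlmax_e8m1()` exposes as VLEN/8 bytes, and fall back to the generic C path on anything narrower.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Hypothetical prototypes -- stand-ins for the two fixed-VLEN paths and the
// generic C fallback; names are illustrative, not taken from the PR.
void dot_q1_0_q8_0_vl128(int n, float *s, const void *x, const void *y);  // VLEN == 128, LMUL=2
void dot_q1_0_q8_0_vl256(int n, float *s, const void *x, const void *y);  // VLEN >= 256, LMUL=1
void dot_q1_0_q8_0_generic(int n, float *s, const void *x, const void *y);

void dot_q1_0_q8_0_dispatch(int n, float *s, const void *x, const void *y) {
    // VLEN/8: bytes per vector register at SEW=8, LMUL=1.
    const size_t vlenb = __riscv_vsetvlmax_e8m1();
    if (vlenb >= 32) {
        dot_q1_0_q8_0_vl256(n, s, x, y);    // 256-bit (or wider) kernel
    } else if (vlenb >= 16) {
        dot_q1_0_q8_0_vl128(n, s, x, y);    // 128-bit kernel
    } else {
        dot_q1_0_q8_0_generic(n, s, x, y);  // e.g. Zvl64b cores: generic C
    }
}
```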
Benchmarks were performed with:
- OrangePi RV2 SBC (Ky X1 / SpacemiT K1), 8 GB
- Armbian Debian trixie rolling release, kernel 6.18.26-current-spacemit
- Built with the official SpacemiT toolchain, but IME was not used.
Command:
`llama-bench -m Bonsai-1.7B.gguf -p 64 -n 16 -t 8 -r 3 -fa 1 -mmp 0`

Perplexity for 5×512 chunks: mean KLD 0.00027, PPL 21.09, same top p 99.22%
| | pp 64 t/s | tg 16 t/s |
|---|---|---|
| VL128\* | | |
| VL256\* | | |

\* Forced VLEN 128 kernel with LMUL=2; for VLEN >= 256: LMUL=1.

As always, I would appreciate your feedback.