Bit-interleaved Q1_0 8x32 repack kernels for x86 AVX2 #29
pl752 wants to merge 4 commits into PrismML-Eng:prism from
Conversation
…ng#29 Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed PrismML-Eng#3 (TURBO_D). Mintplex-Labs#1 and Mintplex-Labs#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ling (Issue PrismML-Eng#29) Three bugs from the block-size-32 refactor: 1. kernel_set_rows_turbo hardcoded turbo3 packing for turbo4 — split into separate kernel_set_rows_turbo3 and kernel_set_rows_turbo4 kernels. turbo4 now correctly does 3-bit PolarQuant + QJL residual correction. 2. Integer division in n_groups = nk0 / blocks_per_group silently dropped tail blocks for non-128-aligned head dims (e.g. dk=192). Added ceiling division with tail-group bounds checking in turbo3, and GGML_ASSERT in WHT dispatch to catch non-128-aligned tensors. 3. TURBO_D constant was semantically coupled to QK_TURBO4 — replaced with TURBO_ROT_DIM (= QK_TURBO3_GROUP) and added static_assert that QK_TURBO4 == QK_TURBO3_GROUP to guard against future drift. Closes PrismML-Eng#29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-cache fix: turbo4 SET_ROWS, tail-block truncation, constant coupling, stack overflow (Issue PrismML-Eng#29)
Are there instructions on how to test? I was trying to test the PR on a machine that has AVX2 and SSE3 with the Bonsai 1.7b gguf and did not notice a difference in pp or tg vs the current llama.cpp implementation. Likely I am doing something wrong.
@twoxfh Hello, which build flags do you use, and what does llama-bench (or another executable that you have used) say with …? Does the log have lines like this? If it is llama-server, what does it say about enabled features?
@twoxfh If Bonsai is the new ternary one, then it won't work right now, as optimized kernels are not implemented yet (Q2_0 is reported in my case due to some kind of issue)
@pl752 I am using the Bonsai 1.7b 1-bit. Ah, I see you have AVX512 and my CPU does not support it. My build parameters are simple and without any Intel drivers: -DGGML_CURL=OFF -DGGML_CUDA=OFF -DGGML_AVX512=ON. I did install BLAS, but it cut my tg in half, so I removed it.
@twoxfh Weird, because the kernel should not require AVX512, only AVX2
@twoxfh Which command was used for launching the model, and what happens if …
@pl752 With -v it definitely repacks to q1_0_8x32, but it's the same speed for me. I am using the following command: …
@pl752 I am getting done_getting_tensors tensor 'token_embed.weight' (q1_0) (and 114 others) cannot be used with preferred buffer type CPU_REPACK, using CPU instead. So it tries then fails. using the gguf from https://huggingface.co/prism-ml/Bonsai-1.7B-gguf/tree/main |
|
@twoxfh Does llama-bench (like |
|
@twoxfh However there is indeed something wrong here, as I just found out that when I use llama cli there is no difference between dot implementation and repack (except maybe for preprocessing slightly), so it might be me too doing something wrong, or this thing is just hyper sensitive to ram bandwidth, let me check it further... |
|
@twoxfh
@twoxfh I can see some kind of intel with e-cores, is this why numactl is used? |
|
@twoxfh
@twoxfh I think I might need to find somebody else to find out whether this optimization is useful only on systems like mine, or whether there is something subtle off.
@pl752 Exactly, I try to ensure I get performance cores since I typically only use a couple. llama-bench gives me pp512 of 70.96 (your branch) vs 61.51 (llama.cpp), and tg128 of 42.71 (your branch) vs 38.97 (llama.cpp). It appears to be more of a divide than I thought, though not as much as yours. That might be due to mine being a mobile processor vs a desktop one?
@twoxfh So there actually IS some difference. My current theory is that if this laptop uses an older memory type like DDR4-3200, its bandwidth can be maxed out. My CPU is a mobile Zen 4 Ryzen with 6 cores and LPDDR5-6400. There is a real known issue that the outer accumulators in my new kernels don't fit in 16 ymm registers and spill to memory, which makes the bandwidth issue worse, just enough to kill most of the boost (plus the fact that it is a heterogeneous CPU with multiple clusters, and the E-cores have significantly reduced L1 data cache and NO L3 cache (L2 is shared per core cluster instead)). This is also indicated by the fact that enabling AVX512 helps performance in my case, even though the only difference is that 32 ymm registers become available (32 zmm in 256-bit instead of 512-bit mode). The purpose of special gemm and gemv kernels is not only to save cycles by reusing some register contents over multiple rows/columns, but also to make the memory access order more convenient, thus reducing overhead; however, my kernels are definitely suboptimal in that sense.
@twoxfh Also, on my system there is a DIRECT 1:1 inverse correlation between tg speed and model size (same number of parameters, different quant), so memory is indeed the limiting factor for tg, though spilling occurs only for gemm and gemv is fine in that sense. (gemv is matrix*vector for number of rows < 4 (repeated nrows times), gemm is matrix*matrix with …
@pl752 That makes a lot of sense; also, mine is an Intel Core Ultra 7 165U with DDR5-5600 vs your DDR5-6400. I saw your jump of 25% and got jealous :). Really appreciate all the effort you're putting into the kernels.
@twoxfh Hmm, only 2 threads are set for the server; that usually means the memory is not being maxed out. Windows 11 / WSL, by any chance?
@pl752 I turned it up to 6 cores; that's about the sweet spot before losing performance by adding an E-core. A lot of the time it runs a little toasty and I cut it back to 2. For the benchmarks I ran at 6 cores. As soon as I go above that, performance tanks.
Performance tanks due to the aforementioned shenanigans with the CPU arch; that's expected. Do you use Windows with/without WSL2, or Linux? Also, can you check somehow (for example via htop with CPU clock display turned on) what happens to CPU clock speeds when using repack and no repack (can be toggled off with …
@twoxfh Why I am asking about Windows 11 and WSL: it has a security feature called "Core isolation and memory integrity" which essentially wraps the whole OS in a thin VM. This gives some additional hardening, but sometimes OBLITERATES the performance of memory-intensive programs like LLMs or video games. Also, if VT-x is enabled and allowed in bcdboot, Win 11 is notorious for trying to wrap most processes in a sandbox, which can cause some overhead too. And WSL2 is essentially a well-optimized VM.
@pl752 Ah, I am on Windows with WSL and Docker Desktop. I am running llama-server in a Debian Docker container. CPU utilization with or without repack is roughly the same from my spot checks; it boosts to 3.5 GHz then settles to 2.8 GHz after about 10 seconds. I have about 10 GB of RAM free.
@twoxfh This is the most likely reason for the weirdness then (a container inside Linux inside a VM inside Windows, plus power throttling)
@twoxfh Also, some AMD insanity: my CPU holds 4.5 GHz for tg and drops to 4.2 GHz for pp with repack, …
Thanks, this is cool; seems there is more juice on the CPU side :)
That's kinda odd. Is the ggml id changed when you do the repacking? There are a few different enums for each type; one of them might have been mixed up. I will need to take a closer look. This is only during llama-bench, right?
Pull request overview
This PR extends ggml’s CPU repack/matmul pipeline to support bit-interleaved Q1_0 (8x32) repacked kernels on x86 (AVX2-focused), including the required repack format, quantization path, and kernel dispatch selection.
Changes:
- Add a new repacked block layout for Q1_0 (`block_q1_0x8`) and implement repack from native `block_q1_0`.
- Introduce Q8_0 quantization for the `4x32` layout and add Q1_0 `gemv`/`gemm` kernels for the new repack type (generic + x86 AVX2/AVX512 builds).
- Extend repack dispatch to select the Q1_0 repack/kernels on AVX2-capable systems when dimensions are compatible.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `ggml/src/ggml-cpu/repack.h` | Adds Q1_0 support to templated repack block machinery and declares new Q1_0/Q8_0 kernel entrypoints. |
| `ggml/src/ggml-cpu/repack.cpp` | Implements generic Q8_0 4x32 quantization, generic Q1_0 gemv/gemm, the Q1_0 repack routine, and dispatch selection for the new repack type. |
| `ggml/src/ggml-cpu/arch/x86/repack.cpp` | Adds x86 implementations for Q8_0 4x32 quantization and AVX2/AVX512F Q1_0 gemv/gemm kernels. |
| `ggml/src/ggml-cpu/arch-fallback.h` | Wires new entrypoints into the existing "rename `_generic` when no native impl exists" mechanism for relevant architectures/build modes. |
Some adjustments for the Copilot suggestions and to avoid problems with some compiler versions; no performance changes expected
I have tried to alleviate the issue with register pressure in gemm and achieved slight improvements; tg is not affected.
Might be interesting for @twoxfh
Continuation of #21 and #10
Been a hot minute
Decided to drop `nrc==2` (might revisit if plain AVX and SSSE3 are needed), as it is mostly used in specific situations for ARM_DOTPROD, and focus on optimized gemv and gemm.

Also, I have finally moved to native Linux from WSL2, so benchmarks are now run with `-fa 1 -mmp 0 -r 5 -t 6` instead of `-t 10`, as SMT threads no longer help significantly with performance but do increase memory pressure. So benchmark baselines have shifted again.

[Benchmark table: AVX2 pp512 | AVX2 tg128 | AVX512* pp512 | AVX512* tg128; values not captured]
\* register file increase only, no special kernel

AVX512 is in theory usable, but I couldn't yet implement a kernel that won't regress Zen 4 AVX512 performance, so currently relying on AVX2 code
Perplexity
For some reason the model identifies its type as Q2_0
Benchmarks for various thread counts for repack AVX512