Bit-interleaved Q1_0 8x32 repack kernels for x86 AVX2 #29

Open
pl752 wants to merge 4 commits into PrismML-Eng:prism from pl752:perf/q1_0_8x32_repack_AVX2

Conversation


pl752 commented May 2, 2026

Continuation of #21 and #10

Been a hot minute

Decided to drop nrc==2 (might revisit if plain AVX and SSSE3 are needed), as it is mostly useful in specific situations for ARM_DOTPROD, and to focus on optimized gemv and gemm instead.

Also, I have finally moved to native Linux from WSL2, so benchmarks are now run with -fa 1 -mmp 0 -r 5 -t 6 instead of -t 10, as SMT threads no longer help performance significantly but do increase memory pressure. So benchmark baselines have shifted again.
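
For reference, a typical run looks like this (model path assumed; same flags as above):

```
llama-bench -m model.gguf -fa 1 -mmp 0 -r 5 -t 6
```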

| flow | run | dot | repack | delta |
| --- | --- | --- | --- | --- |
| AVX2 | pp512 | 139.80 t/s | 190.98 t/s | +36.61% |
| AVX2 | tg128 | 91.70 t/s | 115.17 t/s | +25.59% |
| AVX512* | pp512 | 145.09 t/s | 219.96 t/s | +51.60% |
| AVX512* | tg128 | 93.34 t/s | 120.47 t/s | +29.07% |

* - register file increase only, no special kernel

AVX512 is in theory usable, but I couldn't yet implement a kernel that doesn't regress Zen 4 AVX512 performance, so the code currently relies on the AVX2 path.

Perplexity

```
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9558 ±    3.1805      -0.00009 ±    0.00239       0.00021 ±    0.00003     0.396 ±  0.056 %    99.608 ±  0.392 %
   2      20.2053 ±    3.4389       0.01465 ±    0.01152       0.00022 ±    0.00002     0.386 ±  0.034 %    99.412 ±  0.339 %
   3      20.8472 ±    2.7882       0.00892 ±    0.00770       0.00022 ±    0.00001     0.365 ±  0.026 %    99.085 ±  0.344 %
   4      21.1986 ±    2.3887       0.00633 ±    0.00579       0.00022 ±    0.00001     0.377 ±  0.026 %    99.216 ±  0.276 %
   5      21.0772 ±    2.1025       0.00518 ±    0.00466       0.00023 ±    0.00001     0.365 ±  0.022 %    99.216 ±  0.247 %

====== Perplexity statistics ======
Mean PPL(Q)                   :  21.077184 ±   2.102473
Mean PPL(base)                :  20.968387 ±   2.074795
Cor(ln(PPL(Q)), ln(PPL(base))):  99.89%
Mean ln(PPL(Q)/PPL(base))     :   0.005175 ±   0.004663
Mean PPL(Q)/PPL(base)         :   1.005189 ±   0.004688
Mean PPL(Q)-PPL(base)         :   0.108796 ±   0.100463

====== KL divergence statistics ======
Mean    KLD:   0.000226 ±   0.000011
Maximum KLD:   0.006768
99.9%   KLD:   0.005245
99.0%   KLD:   0.001404
95.0%   KLD:   0.000682
90.0%   KLD:   0.000481
Median  KLD:   0.000135
10.0%   KLD:   0.000002
 5.0%   KLD:   0.000000
 1.0%   KLD:  -0.000010
 0.1%   KLD:  -0.000033
Minimum KLD:  -0.000039

====== Token probability statistics ======
Mean    Δp:  0.020 ± 0.010 %
Maximum Δp:  3.536%
99.9%   Δp:  2.703%
99.0%   Δp:  1.293%
95.0%   Δp:  0.595%
90.0%   Δp:  0.300%
75.0%   Δp:  0.065%
Median  Δp:  0.000%
25.0%   Δp: -0.041%
10.0%   Δp: -0.277%
 5.0%   Δp: -0.472%
 1.0%   Δp: -1.087%
 0.1%   Δp: -1.576%
Minimum Δp: -1.698%
RMS Δp    :  0.365 ± 0.022 %
Same top p: 99.216 ± 0.247 %
```

For some reason the model identifies its type as Q2_0.

Benchmarks for various numbers of threads, repack AVX512:
| model | size | params | backend | threads | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 1.7B Q2_0 (HUH!? Y?) | 231.13 MiB | 1.72 B | CPU | 4 | 1 | 0 | pp512 | 167.58 ± 2.59 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 4 | 1 | 0 | tg128 | 94.55 ± 0.14 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 6 | 1 | 0 | pp512 | 219.96 ± 0.17 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 6 | 1 | 0 | tg128 | 120.47 ± 0.16 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | pp512 | 200.69 ± 0.23 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 8 | 1 | 0 | tg128 | 120.49 ± 0.08 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 10 | 1 | 0 | pp512 | 197.99 ± 1.67 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 10 | 1 | 0 | tg128 | 116.79 ± 1.11 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 12 | 1 | 0 | pp512 | 210.22 ± 0.35 |
| qwen3 1.7B Q2_0 | 231.13 MiB | 1.72 B | CPU | 12 | 1 | 0 | tg128 | 121.91 ± 0.16 |

github-actions bot added the ggml label May 2, 2026
pl752 marked this pull request as ready for review May 2, 2026 11:09
retroheim pushed a commit to retroheim/prism-ml-llama.cpp that referenced this pull request May 3, 2026
…ng#29

Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims

Fixed PrismML-Eng#3 (TURBO_D). Mintplex-Labs#1 and Mintplex-Labs#2 don't affect turbo3+dk128 path.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
retroheim pushed a commit to retroheim/prism-ml-llama.cpp that referenced this pull request May 3, 2026
…ling (Issue PrismML-Eng#29)

Three bugs from the block-size-32 refactor:

1. kernel_set_rows_turbo hardcoded turbo3 packing for turbo4 — split into
   separate kernel_set_rows_turbo3 and kernel_set_rows_turbo4 kernels.
   turbo4 now correctly does 3-bit PolarQuant + QJL residual correction.

2. Integer division in n_groups = nk0 / blocks_per_group silently dropped
   tail blocks for non-128-aligned head dims (e.g. dk=192). Added ceiling
   division with tail-group bounds checking in turbo3, and GGML_ASSERT in
   WHT dispatch to catch non-128-aligned tensors.

3. TURBO_D constant was semantically coupled to QK_TURBO4 — replaced with
   TURBO_ROT_DIM (= QK_TURBO3_GROUP) and added static_assert that
   QK_TURBO4 == QK_TURBO3_GROUP to guard against future drift.

Closes PrismML-Eng#29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
retroheim pushed a commit to retroheim/prism-ml-llama.cpp that referenced this pull request May 3, 2026
…-cache

fix: turbo4 SET_ROWS, tail-block truncation, constant coupling, stack overflow (Issue PrismML-Eng#29)

twoxfh commented May 4, 2026

Are there instructions on how to test? I was trying to test the PR on a machine that has AVX2 and SSE3 with the Bonsai 1.7b gguf and did not notice a difference in pp or tg vs the current llama.cpp implementation. Likely I am doing something wrong.


pl752 commented May 4, 2026

@twoxfh Hello, which build flags do you use, and what does llama-bench (or whichever other executable you used) say with the -v flag?

Does the log have lines like this?

```
load_tensors:   CPU_REPACK model buffer size =   189.00 MiB
```

or

```
repack: repack tensor blk.0.attn_q.weight with q1_0_8x32
```

or

```
llama_memory_breakdown_print: |   - CPU_REPACK         |                  189 =   189 +       0 +       0                |
```

If it is llama-server, what does it say about enabled features?
(like this)

```
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
```


pl752 commented May 4, 2026

@twoxfh If Bonsai is the new ternary one, then it won't work right now, as optimized kernels are not implemented for it yet (Q2_0 in my case is reported due to some kind of issue).


twoxfh commented May 4, 2026

@pl752 I am using the Bonsai 1.7b 1-bit. Ah, I see you have AVX512 and my CPU does not support it. My build parameters are simple and without any Intel drivers: -DGGML_CURL=OFF -DGGML_CUDA=OFF -DGGML_AVX512=ON. I did install BLAS, but it cut my tg in half, so I removed it.

```
system_info: n_threads = 2 (n_threads_batch = 2) / 14 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX-VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
```


pl752 commented May 4, 2026

@twoxfh Weird, because the kernel should not require AVX512, only AVX2.


pl752 commented May 4, 2026

@twoxfh Which command was used for launching the model, and what happens if -v is appended?


twoxfh commented May 4, 2026

> @twoxfh Which command was used for launching the model, and what happens if -v is appended?

@pl752 With -v it definitely repacks to q1_0_8x32, but it's the same speed for me. I am using the following command:

```
numactl -C 0-2 ./llama-server -m Bonsai-1.7b.gguf -c 4000 --numa distribute -fa on --mmap -jinja -r 5 -t 2 -v
```


twoxfh commented May 4, 2026

@pl752 I am getting `done_getting_tensors tensor 'token_embed.weight' (q1_0) (and 114 others) cannot be used with preferred buffer type CPU_REPACK, using CPU instead`. So it tries, then fails.

using the gguf from https://huggingface.co/prism-ml/Bonsai-1.7B-gguf/tree/main


pl752 commented May 4, 2026

@twoxfh Does llama-bench (e.g. bin/llama-bench -m Bonsai-1.7B.gguf -fa 1 -mmp 0 -r 5 -t 6) produce higher speeds? I tested with llama-cli just now and for some reason I can't reproduce the high speeds anymore, even though it worked perfectly just a few days ago. nvm, I just trolled myself: I hadn't plugged my laptop into wall power, and llama-bench somehow just didn't lose performance.


pl752 commented May 4, 2026

@twoxfh However, there is indeed something wrong here: I just found out that when I use llama-cli there is no difference between the dot implementation and repack (except maybe slightly for preprocessing), so it might be me doing something wrong too, or this thing is just hypersensitive to RAM bandwidth. Let me check it further...


pl752 commented May 4, 2026

@twoxfh Now I am a little bit frustrated: rebuilding it cleanly somehow solved the issue for me, even though it was clearly repacked before too. AVX2 works fine as well.


pl752 commented May 4, 2026

@twoxfh I can see some kind of Intel with e-cores; is this why numactl is used?


pl752 commented May 4, 2026

@twoxfh 'token_embed.weight' (q1_0) (and 114 others) is fine, as repack isn't used for embeddings (they are just a retrieval-by-token-id plus a dequant op, not a mat_mul) or for the various norm tensors.


pl752 commented May 4, 2026

@twoxfh I think I might need to find somebody else to figure out whether this optimization is useful only on systems like mine, or whether there is something subtly off.


twoxfh commented May 4, 2026

> @twoxfh I can see some kind of Intel with e-cores; is this why numactl is used?

@pl752 Exactly, I try to ensure I get performance cores since I typically only use a couple. llama-bench gives me pp512 70.96 (your branch) vs 61.51 (llama.cpp) and tg128 42.71 (your branch) vs 38.97 (llama.cpp). It appears to be more of a divide than I thought, but not as much as yours. That might be due to mine being a mobile processor vs desktop?


pl752 commented May 4, 2026

@twoxfh So there actually IS some difference. My current theory: if your laptop uses an older memory type like DDR4-3200, its bandwidth can be maxed out. My CPU is a mobile Zen 4 Ryzen with 6 cores and LPDDR5-6400. There is a real known issue that the outer accumulators in my new kernels don't fit into the 16 ymm registers and spill to memory, which makes the bandwidth problem just bad enough to kill most of the boost (plus the fact that yours is a heterogeneous CPU with multiple clusters, where the e-cores have significantly reduced L1 data cache and NO L3 cache (the L2 is shared per core cluster instead)). This is also indicated by the fact that enabling AVX512 helps performance in my case, even though the only difference is that 32 ymm registers become available (32 zmm used in 256-bit rather than 512-bit mode). The purpose of a special gemm/gemv kernel is not only to save cycles by reusing register contents across multiple rows/columns, but also to make the memory access order more convenient and thus reduce overhead; my kernels are definitely suboptimal in that sense.
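
To make the register-pressure point concrete, here is a minimal standalone sketch (not this PR's kernel; the 2x4 accumulator tile and the int8 dot-product scheme are assumptions for illustration) of an AVX2 GEMM inner loop where the accumulator tile alone pins 8 of the 16 architectural ymm registers:

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical 2 row-groups x 4 columns inner loop; k must be a multiple of 32.
// 8 accumulators + 2 LHS loads + 1 RHS load + the madd temporaries already
// crowd the 16 ymm registers AVX2 exposes, so a real kernel that also keeps
// scales and bit-expansion temporaries in flight starts spilling to the stack.
static void gemm_tile_sketch(const uint8_t *a, const int8_t *b,
                             int32_t *out, int k) {
    __m256i acc[2][4];
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 4; ++c)
            acc[r][c] = _mm256_setzero_si256();

    const __m256i ones = _mm256_set1_epi16(1);
    for (int kk = 0; kk < k; kk += 32) {
        __m256i a0 = _mm256_loadu_si256((const __m256i *)(a + 0*k + kk));
        __m256i a1 = _mm256_loadu_si256((const __m256i *)(a + 1*k + kk));
        for (int c = 0; c < 4; ++c) {
            __m256i bc = _mm256_loadu_si256((const __m256i *)(b + c*k + kk));
            // u8 x s8 -> 16-bit pair sums, then widen to 32-bit and accumulate
            acc[0][c] = _mm256_add_epi32(acc[0][c],
                _mm256_madd_epi16(_mm256_maddubs_epi16(a0, bc), ones));
            acc[1][c] = _mm256_add_epi32(acc[1][c],
                _mm256_madd_epi16(_mm256_maddubs_epi16(a1, bc), ones));
        }
    }
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 4; ++c)
            _mm256_storeu_si256((__m256i *)(out + (r*4 + c)*8), acc[r][c]);
}
```

With AVX512's 32 architectural registers the same tile stays fully resident even at 256-bit width, which matches the register-file-only speedup in the tables above.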


pl752 commented May 4, 2026

@twoxfh Also, on my system there is a DIRECT 1:1 inverse correlation between tg speed and model size (same number of parameters, different quant), so memory is indeed the limiting factor for tg, though the spilling only occurs for gemm; gemv is fine in that sense. (gemv is matrix*vector for number of RHS rows < 4, repeated nrows times; gemm is matrix*matrix with rows % 4 == 0 in this case.)
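
In dispatch terms that split looks roughly like this (hypothetical sketch, not this PR's code; names are made up):

```c
// GEMM consumes right-hand-side columns four at a time (batched prompt
// processing); GEMV covers the leftover / single-column case, which is
// what token generation hits.
static void gemv_q1_0_8x32(int col) { (void)col; /* matrix*vector kernel   */ }
static void gemm_q1_0_8x32(int col) { (void)col; /* 4-column matrix*matrix */ }

static void mul_mat_dispatch(int n_rhs_cols) {
    int c = 0;
    for (; c + 4 <= n_rhs_cols; c += 4) gemm_q1_0_8x32(c);
    for (; c < n_rhs_cols; ++c)         gemv_q1_0_8x32(c);
}
```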


twoxfh commented May 4, 2026

@pl752 That makes a lot of sense; also, mine is an Intel Core Ultra 7 165U with DDR5 5600 MT/s vs your 6400 MT/s. I saw your jump of 25% and got jealous :). Really appreciate all the effort you're putting into the kernels.


pl752 commented May 4, 2026

@twoxfh Hmm, only 2 threads are set for the server; that usually means the memory isn't getting maxed out. Windows 11 / WSL, by any chance?


twoxfh commented May 4, 2026

> @twoxfh Hmm, only 2 threads are set for the server; that usually means the memory isn't getting maxed out. Windows 11 / WSL, by any chance?

@pl752 I turned it up to 6 cores; that's about the sweet spot before losing performance by adding an e-core. A lot of the time it runs a little toasty and I cut it back to 2. For the benchmarks I ran at 6 cores. As soon as I go above that, performance tanks.


pl752 commented May 4, 2026

Performance tanks due to the aforementioned shenanigans with the CPU arch; that's expected. Do you use Windows with/without WSL2, or Linux? Also, can you check somehow (for example via htop with CPU clock display turned on) what happens to CPU clock speeds when using repack and no repack (it can be toggled off with the -nr flag)? This CPU has 2+10 cores (vs my 6) and 15 W of sustained TDP with a 57 W peak (mine is advertised as 35-54 W depending on cooling and VRM capabilities, and I also use a very beefy cooling pad to minimize performance variance due to thermals), so clocks can play a role: denser computation can cause higher power draw, forcing clock speeds to drop to fit the TDP limit. (There are also the PL1 and PL2 temporary boost power limits, which last only a few seconds to a few minutes.)


pl752 commented May 4, 2026

@twoxfh Why I am asking about Windows 11 and WSL: it has a security feature called "Core isolation and memory integrity" which essentially wraps the whole OS in a thin VM. That gives some additional hardening, but sometimes OBLITERATES the performance of memory-intensive programs like LLMs or video games. Also, if VT-x is enabled and allowed in bcdboot, Win 11 is notorious for trying to wrap most processes in a sandbox, which can cause some overhead too. And WSL2 is essentially a well-optimized VM.


twoxfh commented May 4, 2026

> @twoxfh Why I am asking about Windows 11 and WSL: it has a security feature called "Core isolation and memory integrity" which essentially wraps the whole OS in a thin VM. That gives some additional hardening, but sometimes OBLITERATES the performance of memory-intensive programs like LLMs or video games. Also, if VT-x is enabled and allowed in bcdboot, Win 11 is notorious for trying to wrap most processes in a sandbox, which can cause some overhead too. And WSL2 is essentially a well-optimized VM.

@pl752 Ah, I am on Windows with WSL and Docker Desktop. I am running llama-server in a Debian Docker container. CPU utilization with or without repack is roughly the same from my spot checks; it boosts to 3.5 GHz then settles to 2.8 GHz after about 10 seconds. I have about 10 GB of RAM free.


pl752 commented May 4, 2026

@twoxfh That is the most likely reason for the weirdness then (a container inside Linux inside a VM inside Windows, plus power throttling).


pl752 commented May 4, 2026

@twoxfh Also some AMD insanity: my CPU holds 4.5 GHz for tg and drops to 4.2 GHz for pp with repack, -t 6 and AVX512, and around 100 MHz less on AVX2 (note: runs are relatively short and the cooling pad is set to deafening mode).


khosravipasha commented May 4, 2026

Thanks, this is cool, seems there is more juice on cpu side :)

> For some reason the model identifies its type as Q2_0.

That's kinda odd, is the ggml type id changed when you do the repacking? There are a few different enums for each type; one of them might have been mixed up. I will need to take a closer look. This is only during llama-bench, right?


Copilot AI left a comment


Pull request overview

This PR extends ggml’s CPU repack/matmul pipeline to support bit-interleaved Q1_0 (8x32) repacked kernels on x86 (AVX2-focused), including the required repack format, quantization path, and kernel dispatch selection.

Changes:

- Add a new repacked block layout for Q1_0 (block_q1_0x8) and implement repack from native block_q1_0 (see the sketch after this list).
- Introduce Q8_0 quantization for the 4x32 layout and add Q1_0 gemv/gemm kernels for the new repack type (generic + x86 AVX2/AVX512 builds).
- Extend repack dispatch to select the Q1_0 repack/kernels on AVX2-capable systems when dimensions are compatible.
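
For orientation, a hypothetical sketch of what such a repacked block can look like (field names and sizes are assumptions derived from the 8x32 naming and the pattern of existing ggml repack types such as block_q4_0x8, not taken from the diff):

```c
#include <stdint.h>

typedef uint16_t ggml_half;       // fp16 storage type, as in ggml.h

#define QK1_0 32                  // assumed Q1_0 block size

typedef struct {
    ggml_half d[8];               // one scale per interleaved source row
    uint8_t   qs[8 * QK1_0 / 8];  // 8 rows x 32 x 1 bit = 32 bytes, bit-interleaved
} block_q1_0x8;
```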

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-cpu/repack.h | Adds Q1_0 support to templated repack block machinery and declares new Q1_0/Q8_0 kernel entrypoints. |
| ggml/src/ggml-cpu/repack.cpp | Implements generic Q8_0 4x32 quantization, generic Q1_0 gemv/gemm, Q1_0 repack routine, and dispatch selection for the new repack type. |
| ggml/src/ggml-cpu/arch/x86/repack.cpp | Adds x86 implementation for Q8_0 4x32 quantization and AVX2/AVX512F Q1_0 gemv/gemm kernels. |
| ggml/src/ggml-cpu/arch-fallback.h | Wires new entrypoints into the existing "rename _generic when no native impl exists" mechanism for relevant architectures/build modes. |


Comment thread ggml/src/ggml-cpu/repack.cpp
Comment thread ggml/src/ggml-cpu/arch/x86/repack.cpp Outdated

pl752 commented May 5, 2026

Some adjustments for the Copilot suggestions and to avoid problems with some compiler versions; no performance changes expected.


pl752 commented May 5, 2026

I have tried to alleviate the register pressure issue in gemm and achieved slight improvements; tg is not affected.

| flow | run | dot | repack | delta |
| --- | --- | --- | --- | --- |
| AVX2 | pp512 | 190.98 t/s | 213.79 t/s | +11.94% |
| AVX512 | pp512 | 219.96 t/s | 228.14 t/s | +3.71% |

Might be interesting for @twoxfh
I have another repacking approach in mind, will experiment with it later.
