docs: Gemma-4-E4B-It perplexity benchmark — turbo4 vs F16 cross-corpus by WillowOneVision · Pull Request #20 · AtomicBot-ai/atomic-llama-cpp-turboquant

WillowOneVision · 2026-05-24T09:21:58Z

Summary

Adds docs/benchmarks/gemma_e4b_turbo4_ppl.md, which reports empirical perplexity measurements on a Raspberry Pi 16 GB (Cortex-A76 aarch64) comparing F16 dense KV against turbo4 K + turbo4 V on gemma-4-E4B-it-Q4_K_M.gguf.

Headline finding

| Corpus | F16 PPL | turbo4 PPL | Relative Δ (turbo4 vs F16) | Paired pattern |
| --- | --- | --- | --- | --- |
| WikiText-2 raw test | 55.0055 ± 8.83 | 50.6767 ± 7.96 | −7.87% | turbo4 lower on 4/4 chunks |
| HumanEval (prompt + canonical_solution) | 4.2657 ± 0.469 | 4.1310 ± 0.450 | −3.16% | turbo4 lower on 4/4 chunks |
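As a sanity check on the relative Δ figures, the short sketch below recomputes them from the reported means and also tests whether the reported ± intervals overlap. The helper names are illustrative and are not taken from the benchmark doc; the ± values are treated as symmetric half-widths, which is an assumption about how they were reported.

```python
# Values copied from the headline table: (PPL mean, reported +/- half-width).
results = {
    "WikiText-2 raw test": {"f16": (55.0055, 8.83), "turbo4": (50.6767, 7.96)},
    "HumanEval": {"f16": (4.2657, 0.469), "turbo4": (4.1310, 0.450)},
}


def relative_delta_pct(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if [mean - err, mean + err] of a and b intersect (assumed symmetric)."""
    (mean_a, err_a), (mean_b, err_b) = a, b
    return (mean_a - err_a) <= (mean_b + err_b) and (mean_b - err_b) <= (mean_a + err_a)


for corpus, r in results.items():
    delta = relative_delta_pct(r["f16"][0], r["turbo4"][0])
    overlap = intervals_overlap(r["f16"], r["turbo4"])
    print(f"{corpus}: {delta:.2f}% (intervals overlap: {overlap})")
```

The recomputed deltas match the table to two decimals. The intervals overlap on both corpora, so the absolute magnitude of the gap should be read with caution.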

turbo4 produces lower perplexity than F16 dense on both corpora, and the paired direction is consistent across all 4 chunks of each benchmark. The most likely explanation is that the Walsh-Hadamard Transform (WHT) rotation that TurboQuant applies before quantization acts as light implicit regularization (analogous to QuaRot / SpinQuant smoothing). Under this hypothesis, for an instruction-tuned model evaluated outside its native chat-format distribution, the smoothing slightly improves attention quality even after 4-bit quantization.
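To make the rotation idea concrete, here is a toy sketch in pure Python. It is not TurboQuant's kernel: the symmetric absmax quantizer, the vector length of 16, and the test vector are all assumptions chosen for illustration. It shows only that an orthonormal WHT spreads a single outlier across all coordinates, so a 4-bit quantizer wastes less of its range. It does not by itself demonstrate the regularization effect hypothesized above.

```python
import math


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_4bit(v):
    """Toy symmetric absmax quantizer (integer levels -7..7), returned dequantized."""
    absmax = max(abs(x) for x in v)
    if absmax == 0.0:
        return list(v)
    step = absmax / 7.0
    return [round(x / step) * step for x in v]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical vector: moderate values (|x| in [0.3, 0.5]) plus one outlier channel.
x = [(0.3 + 0.2 * ((i * 5) % 7) / 6.0) * (1 if i % 3 else -1) for i in range(16)]
x[3] = 8.0

direct = quantize_4bit(x)                # quantize in the original basis
rotated = fwht(quantize_4bit(fwht(x)))   # rotate, quantize, rotate back

err_direct = mse(x, direct)
err_rotated = mse(x, rotated)
print(f"direct MSE:  {err_direct:.4f}")
print(f"rotated MSE: {err_rotated:.4f}")
```

Because the normalized WHT is its own inverse, the same function serves as both the forward and backward rotation. In the direct path the outlier sets the step size and every moderate value rounds to zero, while in the rotated path the energy is spread out and the reconstruction error is smaller.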

Why submit this

For users deciding whether to adopt turbo4 on ARM edge hardware: the usual mental model "lossy quant = quality trade-off" did not hold for this model class in these measurements. turbo4 was quality-neutral or slightly positive on the plain-text and code corpora tested.

Cited in the doc:

Build: cecil/phase-c2-dispatch HEAD ab632e4
Direct llama-perplexity invocation (MTP + mmproj disabled to isolate KV quant impact)
Standard -c 512 --chunks 4 -t 4 protocol, paired-difference test across chunks
Reproducibility instructions inline
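The protocol above can be sketched as a small helper that assembles the two paired invocations. This is an illustration under assumptions, not the doc's actual command: the `-ctk`/`-ctv` cache-type flags are the upstream llama.cpp spellings and `turbo4` is assumed to be the value this fork accepts, the corpus path is hypothetical, and whatever switches disable MTP and mmproj are not shown because the PR text does not name them.

```python
import shlex


def build_ppl_command(model_path: str, corpus_path: str, kv_type: str) -> list:
    """Assemble a llama-perplexity argv for one KV cache type (flags are assumptions)."""
    return [
        "llama-perplexity",
        "-m", model_path,
        "-f", corpus_path,
        "-c", "512",        # context size from the stated protocol
        "--chunks", "4",    # number of chunks from the stated protocol
        "-t", "4",          # thread count from the stated protocol
        "-ctk", kv_type,    # K cache type (assumed flag spelling)
        "-ctv", kv_type,    # V cache type (assumed flag spelling)
    ]


MODEL = "gemma-4-E4B-it-Q4_K_M.gguf"
CORPUS = "wiki.test.raw"  # hypothetical local path to the WikiText-2 raw test file

# Same model, corpus, and settings for both runs; only the KV cache type differs.
for kv in ("f16", "turbo4"):
    print(shlex.join(build_ppl_command(MODEL, CORPUS, kv)))
```

Keeping everything except the cache type identical between the two runs is what makes the chunk-by-chunk comparison a paired one.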

Caveats listed in the doc

With only 4 chunks the confidence intervals are wide (paired direction robust, absolute magnitude noisy)
Two corpora tested; chat-format prompts (the model's native domain) not measured
Single hardware target (Cortex-A76 aarch64); x86_64 SIMD path not validated
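To put the first caveat in numbers, the sketch below computes an exact sign-test p-value for "turbo4 lower on k of n chunks". The sign test is an assumption here, chosen because per-chunk values are not given in this PR text; the paired-difference test used in the doc may be a different one.

```python
from math import comb


def sign_test_p_value(n_pairs: int, n_favorable: int, two_sided: bool = True) -> float:
    """Exact sign-test p-value under the null that each direction is equally likely."""
    total = 2 ** n_pairs
    upper = sum(comb(n_pairs, i) for i in range(n_favorable, n_pairs + 1)) / total
    if not two_sided:
        return upper
    lower = sum(comb(n_pairs, i) for i in range(0, n_favorable + 1)) / total
    return min(1.0, 2 * min(upper, lower))


def min_unanimous_pairs(alpha: float = 0.05) -> int:
    """Smallest n for which an n/n result is two-sided significant at level alpha."""
    n = 1
    while sign_test_p_value(n, n) >= alpha:
        n += 1
    return n


print(sign_test_p_value(4, 4, two_sided=False))  # one-sided p for 4/4
print(sign_test_p_value(4, 4))                   # two-sided p for 4/4
print(min_unanimous_pairs())                     # chunks needed for a unanimous result at 0.05
```

Under this test a 4/4 result gives a two-sided p-value of 0.125, so the direction is suggestive rather than conclusive, and at least 6 unanimous chunks would be needed to reach the 0.05 level.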

Test plan

Maintainer review of measurement methodology
Optional: tighter re-bench with chunks=32 if the absolute magnitude is relevant
Optional: add a chat-format corpus to close the native-domain gap

🤖 Generated with Claude Code

Adds docs/benchmarks/gemma_e4b_turbo4_ppl.md reporting perplexity measurements showing turbo4 K+V is quality-neutral or slightly positive vs F16 dense KV on Gemma-4-E4B-It across two corpora:

- WikiText-2 raw test: F16 PPL 55.01 -> turbo4 50.68 (-7.87%)
- HumanEval: F16 PPL 4.27 -> turbo4 4.13 (-3.16%)

turbo4 lower on 4/4 chunks in both corpora. Probable cause: WHT pre-quant rotation acts as light implicit regularization (analogous to QuaRot/SpinQuant smoothing). Effect on chat-format distribution not measured.

Hardware: Raspberry Pi 16GB Cortex-A76 aarch64, NEON+DOTPROD+KLEIDIAI.
Build: cecil/phase-c2-dispatch HEAD ab632e4.
Direct llama-perplexity invocation, MTP and mmproj disabled to isolate KV quant impact.

Co-Authored-By: Cecil

github-actions bot added the "documentation" label (Improvements or additions to documentation) on May 24, 2026

WillowOneVision wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from WillowOneVision:cecil/docs-gemma-e4b-turbo4-ppl
