docs: Gemma-4-E4B-It perplexity benchmark — turbo4 vs F16 cross-corpus#20

Open
WillowOneVision wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from WillowOneVision:cecil/docs-gemma-e4b-turbo4-ppl

Summary

Adds docs/benchmarks/gemma_e4b_turbo4_ppl.md reporting empirical perplexity measurements on Raspberry Pi 16 GB (Cortex-A76 aarch64) comparing F16 dense KV vs turbo4 K + turbo4 V on gemma-4-E4B-it-Q4_K_M.gguf.

Headline finding

| Corpus | F16 PPL | turbo4 PPL | Δ relative | Paired pattern |
|---|---|---|---|---|
| WikiText-2 raw test | 55.0055 ± 8.83 | 50.6767 ± 7.96 | −7.87% | turbo4 lower on 4/4 chunks |
| HumanEval (prompt + canonical_solution) | 4.2657 ± 0.469 | 4.1310 ± 0.450 | −3.16% | turbo4 lower on 4/4 chunks |
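
The Δ relative column can be re-derived from the two PPL columns. A minimal Python sketch, assuming the usual definition (turbo4 − F16) / F16; the function name is illustrative and not taken from the benchmark doc:

```python
def relative_delta_pct(baseline_ppl, candidate_ppl):
    """Relative PPL change of candidate vs baseline, in percent."""
    return (candidate_ppl - baseline_ppl) / baseline_ppl * 100.0


# Mean PPL values copied from the headline table.
wikitext = relative_delta_pct(55.0055, 50.6767)
humaneval = relative_delta_pct(4.2657, 4.1310)

print(f"WikiText-2: {wikitext:+.2f}%")
print(f"HumanEval:  {humaneval:+.2f}%")
```

A negative value means turbo4 scored lower (better) perplexity than the F16 baseline.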

turbo4 yields lower perplexity than F16 dense KV on both corpora, and the paired direction is consistent across all 4 chunks of each benchmark. Most likely explanation: the Walsh-Hadamard Transform (WHT) rotation that TurboQuant applies before quantization acts as light implicit regularization (analogous to QuaRot / SpinQuant smoothing). For an instruction-tuned model evaluated outside its native chat-format distribution, this smoothing may slightly improve attention quality even after 4-bit quantization.
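
To illustrate what a pre-quantization WHT rotation does, here is a small pure-Python sketch. It is not TurboQuant's implementation: the function name, the toy vector, and the symmetric 4-bit step size (`max_abs / 7`) are assumptions chosen for illustration. It only shows that an orthonormal WHT spreads a single outlier across all coordinates while preserving the vector's norm, which shrinks the step a max-scaled quantizer would need.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal and self-inverse
    return [x * scale for x in out]


# Toy vector with one large outlier (illustrative values, not measured data).
keys = [0.1, -0.2, 0.05, 8.0, 0.15, -0.1, 0.2, -0.05]
rotated = fwht(keys)

for name, v in (("original", keys), ("rotated", rotated)):
    max_abs = max(abs(x) for x in v)
    # Step size of a hypothetical symmetric 4-bit quantizer with levels -7..7.
    print(f"{name}: max |x| = {max_abs:.3f}, int4 step = {max_abs / 7:.3f}")
```

Whether the smaller step translates into lower end-to-end error depends on the data; the sketch only demonstrates the outlier-spreading property that the explanation above relies on.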

Why submit this

For users deciding whether to adopt turbo4 on ARM edge hardware: the usual mental model "lossy quant = quality trade-off" does not hold for this model class in these measurements. turbo4 is quality-neutral or slightly positive on plain-text and code corpora.

Cited in the doc:

  • Build: cecil/phase-c2-dispatch HEAD ab632e4
  • Direct llama-perplexity invocation (MTP + mmproj disabled to isolate KV quant impact)
  • Standard -c 512 --chunks 4 -t 4 protocol, paired-difference test across chunks
  • Reproducibility instructions inline
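
As a rough guide to how much evidence a "4/4 chunks" paired result carries, the sketch below computes an exact two-sided sign test. This is an assumed choice of test for illustration; the PR does not say which paired-difference test the doc uses, and the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(n_pairs, n_lower):
    """Exact two-sided sign test: probability of a split at least this
    lopsided if each chunk were equally likely to favor either KV type."""
    k = max(n_lower, n_pairs - n_lower)
    tail = sum(comb(n_pairs, i) for i in range(k, n_pairs + 1)) / 2 ** n_pairs
    return min(1.0, 2 * tail)


# turbo4 was lower on 4 of 4 chunks in each corpus.
print(sign_test_p_value(4, 4))
# The same all-one-direction outcome with 32 chunks would be far stronger evidence.
print(sign_test_p_value(32, 32))
```

With only 4 pairs, even a unanimous direction gives p = 0.125 under this test, which is consistent with the caveat that the direction is suggestive but the sample is small.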

Caveats listed in the doc

  • Only 4 chunks per corpus, so the CI is wide (paired direction robust, absolute magnitude noisy)
  • Two corpora tested; chat-format prompts (the model's native domain) not measured
  • Single hardware target (Cortex-A76 aarch64); x86_64 SIMD path not validated
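
The "absolute magnitude noisy" caveat can be checked directly against the headline numbers. A minimal sketch, assuming the reported ± values can be read as symmetric one-sigma uncertainties on each mean (the PR does not define them); helper names are illustrative:

```python
def interval(mean, err):
    """Symmetric uncertainty interval around a reported mean."""
    return (mean - err, mean + err)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (mean PPL, reported ±) per KV type, copied from the headline table.
results = {
    "WikiText-2": ((55.0055, 8.83), (50.6767, 7.96)),
    "HumanEval": ((4.2657, 0.469), (4.1310, 0.450)),
}

for corpus, (f16, turbo4) in results.items():
    gap = f16[0] - turbo4[0]
    shared = overlaps(interval(*f16), interval(*turbo4))
    print(f"{corpus}: gap = {gap:.4f}, intervals overlap = {shared}")
```

On both corpora the unpaired intervals overlap, so the case for turbo4 rests on the paired per-chunk direction rather than on the gap between the two means.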

Test plan

  • Maintainer review of measurement methodology
  • Optional: tighter chunks=32 re-bench if the absolute magnitude is relevant
  • Optional: add a chat-format corpus to close the native-domain gap
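
For the optional chunks=32 re-bench, the two runs could be scripted along these lines. This is a sketch under assumptions: it only builds and prints the command lines, `corpus.txt` is a placeholder path, `turbo4` is assumed to be accepted as a cache type by this branch's build, and the options used in the PR to disable MTP and mmproj are not shown because they are not given here.

```python
import shlex


def perplexity_argv(model_path, corpus_path, kv_type, chunks=32, ctx=512, threads=4):
    """Build a llama-perplexity command line for one KV cache type."""
    return [
        "llama-perplexity",
        "-m", model_path,
        "-f", corpus_path,
        "-c", str(ctx),
        "--chunks", str(chunks),
        "-t", str(threads),
        "-ctk", kv_type,  # K cache type
        "-ctv", kv_type,  # V cache type
    ]


# One run per KV type so the per-chunk results can be paired afterwards.
for kv_type in ("f16", "turbo4"):
    argv = perplexity_argv("gemma-4-E4B-it-Q4_K_M.gguf", "corpus.txt", kv_type)
    print(shlex.join(argv))
```

Keeping everything except the cache type identical between the two invocations is what makes the per-chunk comparison a paired one.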

🤖 Generated with Claude Code

Adds docs/benchmarks/gemma_e4b_turbo4_ppl.md reporting perplexity
measurements showing turbo4 K+V is quality-neutral or slightly
positive vs F16 dense KV on Gemma-4-E4B-It across two corpora:

- WikiText-2 raw test: F16 PPL 55.01 -> turbo4 50.68 (-7.87%)
- HumanEval: F16 PPL 4.27 -> turbo4 4.13 (-3.16%)

turbo4 lower on 4/4 chunks in both corpora. Probable cause: WHT
pre-quant rotation acts as light implicit regularization (analogous
to QuaRot/SpinQuant smoothing). Effect on chat-format distribution
not measured.

Hardware: Raspberry Pi 16GB Cortex-A76 aarch64, NEON+DOTPROD+KLEIDIAI.
Build: cecil/phase-c2-dispatch HEAD ab632e4. Direct llama-perplexity
invocation, MTP and mmproj disabled to isolate KV quant impact.

Co-Authored-By: Cecil

github-actions bot added the documentation label on May 24, 2026