docs: Gemma-4-E4B-It perplexity benchmark — turbo4 vs F16 cross-corpus#20
Open
WillowOneVision wants to merge 1 commit into
Adds docs/benchmarks/gemma_e4b_turbo4_ppl.md reporting perplexity measurements showing turbo4 K+V is quality-neutral or slightly positive vs F16 dense KV on Gemma-4-E4B-It across two corpora:

- WikiText-2 raw test: F16 PPL 55.01 -> turbo4 50.68 (-7.87%)
- HumanEval: F16 PPL 4.27 -> turbo4 4.13 (-3.16%)

turbo4 lower on 4/4 chunks in both corpora. Probable cause: WHT pre-quant rotation acts as light implicit regularization (analog to QuaRot/SpinQuant smoothing). Effect on chat-format distribution not measured.

Hardware: Raspberry Pi 16GB Cortex-A76 aarch64, NEON+DOTPROD+KLEIDIAI.
Build: cecil/phase-c2-dispatch HEAD ab632e4.
Direct llama-perplexity invocation, MTP and mmproj disabled to isolate KV quant impact.

Co-Authored-By: Cecil
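The relative changes quoted in the commit message can be re-derived from the rounded PPL values, and the "4/4 chunks" statement can be read through a simple sign test. The sketch below is illustrative only: the helper names are hypothetical, and the PR does not say which paired test the doc actually uses. Note that from the rounded HumanEval values the change works out to about -3.28% rather than the reported -3.16%; the reported figure presumably comes from unrounded PPL values, which is an assumption.

```python
from math import comb

def relative_change_pct(baseline, candidate):
    """Percent change of candidate relative to baseline (negative = lower PPL)."""
    return (candidate - baseline) / baseline * 100.0

def one_sided_sign_p(n_pairs, n_favorable):
    """P(at least n_favorable of n_pairs go one way) under a fair-coin null."""
    tail = sum(comb(n_pairs, k) for k in range(n_favorable, n_pairs + 1))
    return tail / 2 ** n_pairs

# Rounded PPL values quoted in the commit message: (F16, turbo4).
reported = {
    "WikiText-2 raw test": (55.01, 50.68),
    "HumanEval": (4.27, 4.13),
}

for corpus, (f16, turbo4) in reported.items():
    print(f"{corpus}: {relative_change_pct(f16, turbo4):+.2f}%")

# With only 4 paired chunks, 4/4 in one direction gives p = 1/16.
print(f"sign test, 4/4 chunks: p = {one_sided_sign_p(4, 4):.4f}")
```

Under this reading, four chunks per corpus cannot push a sign test below p = 0.0625, so the consistent direction is suggestive rather than conclusive on its own.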
Summary
Adds docs/benchmarks/gemma_e4b_turbo4_ppl.md, reporting empirical perplexity measurements on a Raspberry Pi 16 GB (Cortex-A76 aarch64) comparing F16 dense KV vs turbo4 K + turbo4 V on gemma-4-E4B-it-Q4_K_M.gguf.
Headline finding
turbo4 produces lower perplexity than F16 dense KV on both corpora, and the paired direction is consistent across all 4 chunks of each benchmark. Most likely explanation: the Walsh-Hadamard Transform (WHT) rotation that TurboQuant applies before quantization acts as light implicit regularization (analogous to QuaRot / SpinQuant smoothing). For an instruction-tuned model evaluated outside its native chat-format distribution, this smoothing may slightly improve attention quality even after 4-bit quantization.
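To make the rotation idea concrete, here is a minimal pure-Python sketch of an orthonormal Walsh-Hadamard rotation applied before a toy symmetric 4-bit quantizer. It is not TurboQuant's implementation: the function names, the per-vector scaling scheme, and the toy input are all assumptions chosen for illustration. It only shows the outlier-spreading property that rotation-based schemes rely on, and says nothing about the regularization hypothesis itself.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize_4bit(values):
    """Toy symmetric per-vector 4-bit round trip (integer levels -7..7)."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    step = peak / 7.0
    return [max(-7, min(7, round(v / step))) * step for v in values]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Toy vector (an assumption): small entries plus one large outlier channel.
x = [0.5 if i % 3 else -0.5 for i in range(16)]
x[3] = 8.0

direct = quantize_4bit(x)
# The normalized WHT is its own inverse, so applying it again undoes the rotation.
rotated = fwht(quantize_4bit(fwht(x)))

print("peak before rotation   :", max(abs(v) for v in x))
print("peak after rotation    :", max(abs(v) for v in fwht(x)))
print("RMS error, direct 4-bit:", rms_error(x, direct))
print("RMS error, WHT + 4-bit :", rms_error(x, rotated))
```

In this toy case the single outlier forces a coarse step size when quantizing directly, so the small entries collapse to zero; after rotation the outlier's energy is spread across all coordinates and the step size shrinks.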
Why submit this
For users deciding whether to adopt turbo4 on ARM edge hardware: the usual mental model "lossy quant = quality trade-off" does not appear to apply to this model in these measurements. turbo4 is quality-neutral or slightly positive on the plain-text and code corpora tested.
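For anyone wanting to check this on their own hardware, the two runs could be assembled roughly as follows. This is a sketch under assumptions: the corpus path is hypothetical, the cache-type value turbo4 is assumed to be the name this fork accepts for --cache-type-k / --cache-type-v, and the way MTP and mmproj are disabled is not shown in this PR, so it is left out here.

```python
import shlex

def perplexity_argv(model, corpus, cache_type=None):
    """Build a llama-perplexity argv for one KV-cache configuration."""
    argv = [
        "llama-perplexity",
        "-m", model,
        "-f", corpus,
        "-c", "512",      # context size from the PR's protocol
        "--chunks", "4",  # number of chunks from the PR's protocol
        "-t", "4",        # thread count from the PR's protocol
    ]
    if cache_type is not None:
        # Same cache type for K and V, matching the "turbo4 K + turbo4 V" setup.
        argv += ["--cache-type-k", cache_type, "--cache-type-v", cache_type]
    return argv

MODEL = "gemma-4-E4B-it-Q4_K_M.gguf"
CORPUS = "wiki.test.raw"  # hypothetical path to the WikiText-2 raw test file

# "turbo4" is an assumed value name specific to this fork; "f16" is the baseline.
for label, cache_type in (("F16 baseline", "f16"), ("turbo4 K+V", "turbo4")):
    print(f"# {label}")
    print(shlex.join(perplexity_argv(MODEL, CORPUS, cache_type)))
```

Keeping every flag except the cache type identical between the two runs is what makes the per-chunk comparison paired.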
Cited in the doc:
- cecil/phase-c2-dispatch HEAD ab632e4
- llama-perplexity invocation (MTP + mmproj disabled to isolate KV quant impact)
- -c 512 --chunks 4 -t 4 protocol, paired-difference test across chunks
Caveats listed in the doc
Test plan
🤖 Generated with Claude Code