feat: Add SYCL kernels for Q1_0 and Q1_0_g128 quantization types (Intel Arc / oneAPI) #1
Open
vrwallace wants to merge 2 commits into Mintplex-Labs:prism from
Conversation
- vecdotq.hpp: vec_dot_q1_0_q8_1, vec_dot_q1_0_g128_q8_1
- mmvq.cpp: dispatch functions + switch cases for both types
- convert.cpp: dequantize functions for fp16/fp32 conversion paths

Tested on Intel Arc Pro B50 (BMG G21) with Bonsai-8B Q1_0_g128 GGUF.
Achieves ~46 tok/s generation with 37/37 layers on SYCL0.

Fixes: fatal error: unsupport data type=q1_0_g128
Author
@timothycarambat happy to discuss any changes needed
Summary
Adds SYCL/oneAPI compute kernels for the Q1_0 and Q1_0_g128 quantization types used by PrismML Bonsai 8B models, enabling full GPU offload on Intel Arc GPUs via the Intel oneAPI Level Zero backend.

Without this patch, loading a Q1_0_g128 model aborts with `fatal error: unsupport data type=q1_0_g128`.

With this patch, all 37 model layers offload to the GPU and inference runs at ~43-46 tok/s on an Intel Arc Pro B50 (BMG G21).
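For intuition, here is a minimal host-side sketch of the 1-bit dequantization the new kernels perform (bit=1 → +d, bit=0 → -d). The struct layout and names here are hypothetical, for illustration only; the real ggml blocks use a half-precision scale and live in SYCL device memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical Q1_0 block: 32 one-bit weights packed into 4 bytes plus a
// per-block scale d. Illustrative only, not the actual ggml struct.
struct block_q1_0_sketch {
    float   d;       // block scale
    uint8_t qs[4];   // 32 sign bits, LSB-first
};

// bit = 1 -> +d, bit = 0 -> -d
static void dequantize_block_q1_0_sketch(const block_q1_0_sketch &b, float *out) {
    for (int i = 0; i < 32; ++i) {
        const int bit = (b.qs[i / 8] >> (i % 8)) & 1;
        out[i] = bit ? b.d : -b.d;
    }
}
```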
Changes
ggml/src/ggml-sycl/vecdotq.hpp

- vec_dot_q1_0_q8_1 — dot product for Q1_0 (32-weight blocks)
- vec_dot_q1_0_g128_q8_1 — dot product for Q1_0_g128 (128-weight blocks)
- Follows the structure of the existing ggml-cuda/vecdotq.cuh kernels
- Sign mapping: bit=1 → +d, bit=0 → -d; Q8_1 scale factor applied correctly per block

ggml/src/ggml-sycl/mmvq.cpp

- mul_mat_vec_q1_0_q8_1_sycl dispatch function
- mul_mat_vec_q1_0_g128_q8_1_sycl dispatch function
- GGML_TYPE_Q1_0 and GGML_TYPE_Q1_0_g128 cases added to the switch in ggml_sycl_op_mul_mat_vec_q

ggml/src/ggml-sycl/convert.cpp

- dequantize_row_q1_0_sycl
- dequantize_row_q1_0_g128_sycl
- Both registered in ggml_get_to_fp16_sycl and ggml_get_to_fp32_sycl

Test Hardware

Intel Arc Pro B50 (BMG G21)
Test Model
prism-ml/Bonsai-8B-gguf (Q1_0_g128, 1.08 GB)

Results

~43-46 tok/s generation with 37/37 layers offloaded on SYCL0
Notes
Battlemage lacks VK_KHR_shader_integer_dot_product, so SYCL is currently the only working GPU path on Arc for Bonsai models.
--reasoning off is required to suppress the embedded Qwen3 thinking template in Bonsai GGUFs.
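The per-block dot product described under Changes (sign bits mapped to +d/-d, with both block scales applied once per block) can be modeled on the host as follows. These structs and names are illustrative assumptions, not the actual ggml block layouts; the real kernels run as SYCL device code over packed buffers.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical host-side model of the Q1_0 x Q8_1 block dot product.
struct q1_block { float d; uint8_t qs[4]; };   // 32 sign bits + block scale
struct q8_block { float d; int8_t  qs[32]; };  // 32 int8 values + block scale

// Accumulate signed integer products, then apply both block scales once:
// bit = 1 -> +1, bit = 0 -> -1 (i.e. +/- d after scaling).
static float vec_dot_q1_q8_sketch(const q1_block &a, const q8_block &b) {
    int sumi = 0;
    for (int i = 0; i < 32; ++i) {
        const int sign = ((a.qs[i / 8] >> (i % 8)) & 1) ? 1 : -1;
        sumi += sign * b.qs[i];
    }
    return a.d * b.d * (float) sumi;
}
```

Keeping the accumulation in integers and multiplying by the two scales only once per block mirrors how the Q8_1 scale factor is applied per block in the kernels.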