
feat: Add SYCL kernels for Q1_0 and Q1_0_g128 quantization types (Intel Arc / oneAPI)#1

Open
vrwallace wants to merge 2 commits into Mintplex-Labs:prism from vrwallace:feat/sycl-q1_0-q1_0_g128-kernels

Conversation

@vrwallace

Summary

Adds SYCL/oneAPI compute kernels for the Q1_0 and Q1_0_g128 quantization
types used by PrismML Bonsai 8B models, enabling full GPU offload on Intel Arc
GPUs via the Intel oneAPI Level Zero backend.

Without this patch:

fatal error: unsupport data type=q1_0_g128
Aborted

With this patch, all 37 model layers offload to the GPU and inference runs at
~43-46 tok/s on an Intel Arc Pro B50 (BMG G21).


Changes

ggml/src/ggml-sycl/vecdotq.hpp

  • Added vec_dot_q1_0_q8_1 — dot product for Q1_0 (32-weight blocks)
  • Added vec_dot_q1_0_g128_q8_1 — dot product for Q1_0_g128 (128-weight blocks)
  • Ported from PrismML CUDA kernels in ggml-cuda/vecdotq.cuh
  • Sign convention: bit=1 → +d, bit=0 → -d; the Q8_1 scale factor is applied once per block
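
The sign convention above can be sketched as a scalar reference, minus the SYCL parallelism. The block layout here (one scale plus 32 packed sign bits) is an assumption mirroring other ggml quant blocks; the real block_q1_0 struct lives in ggml's quant headers.

```cpp
#include <cstdint>

// Hypothetical 1-bit block: one scale `d` plus 32 sign bits packed into
// 4 bytes. Illustrative only -- not the verified block_q1_0 layout.
struct block_q1_0_ref {
    float   d;      // per-block scale (fp16 in the real format)
    uint8_t qs[4];  // 32 weights, 1 bit each: bit=1 -> +d, bit=0 -> -d
};

// Scalar reference for the dot product against 32 int8 activations:
// accumulate the signed sum of activations, then apply both block
// scales once, as vec_dot_q1_0_q8_1 does per block.
static float dot_q1_0_ref(const block_q1_0_ref &x, const int8_t *y, float y_scale) {
    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        const int bit = (x.qs[i / 8] >> (i % 8)) & 1;
        sum += bit ? y[i] : -y[i];  // bit=1 -> +y[i], bit=0 -> -y[i]
    }
    return x.d * y_scale * (float) sum;
}
```

Q1_0_g128 follows the same convention with a 128-weight group sharing one scale.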

ggml/src/ggml-sycl/mmvq.cpp

  • Added mul_mat_vec_q1_0_q8_1_sycl dispatch function
  • Added mul_mat_vec_q1_0_g128_q8_1_sycl dispatch function
  • Added GGML_TYPE_Q1_0 and GGML_TYPE_Q1_0_g128 cases to the switch in
    ggml_sycl_op_mul_mat_vec_q

ggml/src/ggml-sycl/convert.cpp

  • Added dequantize_row_q1_0_sycl
  • Added dequantize_row_q1_0_g128_sycl
  • Added cases to both ggml_get_to_fp16_sycl and ggml_get_to_fp32_sycl
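
The dequantize path is the same sign expansion without the dot product: each packed bit becomes +d or -d. A scalar sketch of the fp32 path, again assuming the packed-bit layout described above rather than the verified struct:

```cpp
#include <cstdint>

// Expand n packed 1-bit weights into floats: bit=1 -> +d, bit=0 -> -d.
// Mirrors what dequantize_row_q1_0_sycl does per block, scalar form.
static void dequantize_q1_0_ref(float d, const uint8_t *qs, float *out, int n) {
    for (int i = 0; i < n; ++i) {
        const int bit = (qs[i / 8] >> (i % 8)) & 1;
        out[i] = bit ? d : -d;
    }
}
```

The fp16 variant differs only in the output element type.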

Test Hardware

  • GPU: Intel Arc Pro B50 (BMG G21, Battlemage)
  • Driver: Mesa ANV 25.2.8 / Intel oneAPI Level Zero
  • OS: Ubuntu 24.04 (Noble), kernel 6.17

Test Model

  • prism-ml/Bonsai-8B-gguf (Q1_0_g128, 1.08 GB)

Results

load_tensors:   CPU_Mapped  =    83.31 MiB  (non-quantized tensors only)
load_tensors:   SYCL0       =  1015.99 MiB  (all Q1_0_g128 layers on GPU)
offloaded 37/37 layers to GPU

prompt:     ~55 tok/s
generation: ~46 tok/s

Notes

  • Vulkan backend does NOT work for Q1_0_g128 on Intel Arc — Mesa ANV for
    Battlemage lacks VK_KHR_shader_integer_dot_product. SYCL is currently
    the only working GPU path on Arc for Bonsai models.
  • --reasoning off is required to suppress the embedded Qwen3 thinking
    template in Bonsai GGUFs.

vrwallace added 2 commits May 3, 2026 12:46
- vecdotq.hpp: vec_dot_q1_0_q8_1, vec_dot_q1_0_g128_q8_1
- mmvq.cpp: dispatch functions + switch cases for both types
- convert.cpp: dequantize functions for fp16/fp32 conversion paths

Tested on Intel Arc Pro B50 (BMG G21) with Bonsai-8B Q1_0_g128 GGUF.
Achieves ~46 tok/s generation with 37/37 layers on SYCL0.

Fixes: fatal error: unsupport data type=q1_0_g128
@vrwallace (Author)

@timothycarambat happy to discuss any changes needed

