
feat: Add SYCL kernels for Q1_0 and Q1_0_g128 quantization types (Intel Arc / oneAPI)#1

Open
vrwallace wants to merge 2 commits into Mintplex-Labs:prism from vrwallace:feat/sycl-q1_0-q1_0_g128-kernels

Conversation

@vrwallace

Summary

Adds SYCL/oneAPI compute kernels for the Q1_0 and Q1_0_g128 quantization
types used by PrismML Bonsai 8B models, enabling full GPU offload on Intel Arc
GPUs via the Intel oneAPI Level Zero backend.

Without this patch:

fatal error: unsupport data type=q1_0_g128
Aborted

With this patch, all 37 model layers offload to the GPU and inference runs at
~43-46 tok/s on an Intel Arc Pro B50 (BMG G21).


Changes

ggml/src/ggml-sycl/vecdotq.hpp

  • Added vec_dot_q1_0_q8_1 — dot product for Q1_0 (32-weight blocks)
  • Added vec_dot_q1_0_g128_q8_1 — dot product for Q1_0_g128 (128-weight blocks)
  • Ported from PrismML CUDA kernels in ggml-cuda/vecdotq.cuh
  • Sign convention: bit=1 → +d, bit=0 → -d; the Q8_1 scale factor is applied once per block
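
The sign convention above can be sketched as a scalar reference, minus the SYCL parallelism. The block layout here (one scale plus 32 packed sign bits) is an assumption mirroring other ggml quant blocks; the real block_q1_0 struct lives in ggml's quant headers.

```cpp
#include <cstdint>

// Hypothetical 1-bit block: one scale `d` plus 32 sign bits packed into
// 4 bytes. Illustrative only -- not the verified block_q1_0 layout.
struct block_q1_0_ref {
    float   d;      // per-block scale (fp16 in the real format)
    uint8_t qs[4];  // 32 weights, 1 bit each: bit=1 -> +d, bit=0 -> -d
};

// Scalar reference for the dot product against 32 int8 activations:
// accumulate the signed sum of activations, then apply both block
// scales once, as vec_dot_q1_0_q8_1 does per block.
static float dot_q1_0_ref(const block_q1_0_ref &x, const int8_t *y, float y_scale) {
    int sum = 0;
    for (int i = 0; i < 32; ++i) {
        const int bit = (x.qs[i / 8] >> (i % 8)) & 1;
        sum += bit ? y[i] : -y[i];  // bit=1 -> +y[i], bit=0 -> -y[i]
    }
    return x.d * y_scale * (float) sum;
}
```

Q1_0_g128 follows the same convention with a 128-weight group sharing one scale.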

ggml/src/ggml-sycl/mmvq.cpp

  • Added mul_mat_vec_q1_0_q8_1_sycl dispatch function
  • Added mul_mat_vec_q1_0_g128_q8_1_sycl dispatch function
  • Added GGML_TYPE_Q1_0 and GGML_TYPE_Q1_0_g128 cases to the switch in
    ggml_sycl_op_mul_mat_vec_q

ggml/src/ggml-sycl/convert.cpp

  • Added dequantize_row_q1_0_sycl
  • Added dequantize_row_q1_0_g128_sycl
  • Added cases to both ggml_get_to_fp16_sycl and ggml_get_to_fp32_sycl
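
The dequantize path is the same sign expansion without the dot product: each packed bit becomes +d or -d. A scalar sketch of the fp32 path, again assuming the packed-bit layout described above rather than the verified struct:

```cpp
#include <cstdint>

// Expand n packed 1-bit weights into floats: bit=1 -> +d, bit=0 -> -d.
// Mirrors what dequantize_row_q1_0_sycl does per block, scalar form.
static void dequantize_q1_0_ref(float d, const uint8_t *qs, float *out, int n) {
    for (int i = 0; i < n; ++i) {
        const int bit = (qs[i / 8] >> (i % 8)) & 1;
        out[i] = bit ? d : -d;
    }
}
```

The fp16 variant differs only in the output element type.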

Test Hardware

  • GPU: Intel Arc Pro B50 (BMG G21, Battlemage)
  • Driver: Mesa ANV 25.2.8 / Intel oneAPI Level Zero
  • OS: Ubuntu 24.04 (Noble), kernel 6.17

Test Model

  • prism-ml/Bonsai-8B-gguf (Q1_0_g128, 1.08 GB)

Results

load_tensors:   CPU_Mapped  =    83.31 MiB  (non-quantized tensors only)
load_tensors:   SYCL0       =  1015.99 MiB  (all Q1_0_g128 layers on GPU)
offloaded 37/37 layers to GPU

prompt:     ~55 tok/s
generation: ~46 tok/s

Notes

  • Vulkan backend does NOT work for Q1_0_g128 on Intel Arc — Mesa ANV for
    Battlemage lacks VK_KHR_shader_integer_dot_product. SYCL is currently
    the only working GPU path on Arc for Bonsai models.
  • --reasoning off is required to suppress the embedded Qwen3 thinking
    template in Bonsai GGUFs.

vrwallace added 2 commits May 3, 2026 12:46
- vecdotq.hpp: vec_dot_q1_0_q8_1, vec_dot_q1_0_g128_q8_1
- mmvq.cpp: dispatch functions + switch cases for both types
- convert.cpp: dequantize functions for fp16/fp32 conversion paths

Tested on Intel Arc Pro B50 (BMG G21) with Bonsai-8B Q1_0_g128 GGUF.
Achieves ~46 tok/s generation with 37/37 layers on SYCL0.

Fixes: fatal error: unsupport data type=q1_0_g128
@vrwallace (Author)

@timothycarambat happy to discuss any changes needed

