Hi,
I’m currently working on FP4-based quantized GEMM kernels and referring to the “1D Block Scaling Factors Layout (128×4 tile)” described in the documentation.
However, I couldn’t find guidance on how to handle cases where K is too small to fill a tile along the K dimension (fewer than 4 scaling factors per row), and I’d like to clarify the expected scaling-factor layout in that case.
My understanding
For matrix A (M × K):
When M = 512, K = 64, the scaling factors are:
M_scale = 512
K_scale = 4 (since 64 / 16 = 4)
This gives a 512 × 4 scale matrix, i.e. four 128 × 4 tiles stacked along M, so the scaling factors map directly onto the documented 128 × 4 tile layout.
My question
If K = 32, then:
K_scale = 2 (since 32 / 16 = 2)
In this case, the scaling factor tile becomes effectively 128 × 2, which does not match the documented 128 × 4 layout.
What is the correct way to handle this situation?
Specifically:
Should the scaling factors be padded (e.g., zero-filled) along the K dimension to match the required 128 × 4 tile layout?
Or should the layout be compacted (i.e., use 128 × 2 tiles without padding)?
Or is there another expected handling (e.g., different scaling mode or alignment requirement)?
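To make the first two options concrete, here is a small sketch of the arithmetic I have in mind. The helper names (scale_shape, padded_scale_shape) are my own and not part of any library, and the padded variant only illustrates option 1 as I understand it; it is not a confirmed cuBLASLt requirement.

```python
import math

BLOCK = 16        # elements covered by one scaling factor (vec16-style, as in my setup)
TILE_ROWS = 128   # documented scale tile height
TILE_COLS = 4     # documented scale tile width


def scale_shape(rows, k, block=BLOCK):
    """Compact scale shape for a rows x k operand: (rows, ceil(k / block))."""
    return rows, math.ceil(k / block)


def padded_scale_shape(rows, k, block=BLOCK):
    """Hypothetical option 1: round the scale shape up to whole 128 x 4 tiles."""
    r, c = scale_shape(rows, k, block)
    return (math.ceil(r / TILE_ROWS) * TILE_ROWS,
            math.ceil(c / TILE_COLS) * TILE_COLS)


for k in (64, 32):
    print(f"K={k}: compact={scale_shape(512, k)}, padded={padded_scale_shape(512, k)}")
```

For K = 64 both shapes are (512, 4). For K = 32 the compact shape is (512, 2) while the padded one is (512, 4), and that is exactly the difference I am asking about.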
Additional context
I am using FP4 quantization with block scaling (e.g., vec16-style scaling), and I want to make sure my scale layout is fully compatible with what Tensor Cores / cuBLASLt expect.
Any clarification or reference would be greatly appreciated. Thanks!