fix: CUDA bitpacked sliced output allocation #8622
report: A sliced bit-packed array carries a non-zero …

Likely fix: allocate …

Reproduction

Add to the …

    #[crate::test]
    fn test_sliced_offset_overruns_output() -> VortexResult<()> {
        use crate::executor::CudaArrayExt;

        let mut ctx = vortex_array::array_session().create_execution_ctx();
        let mut cuda_ctx = CudaSession::create_execution_ctx(&crate::cuda_session())
            .vortex_expect("failed to create execution context");

        // 2048 values (two 1024-blocks); all < 64 so they fit in 6 bits (no patches).
        let array = PrimitiveArray::new(
            (0u32..64).cycle().take(2048).collect::<Buffer<_>>(),
            NonNullable,
        );
        let bp = BitPacked::encode(&array.into_array(), 6, &mut ctx)?;

        // Slice to a 1024-long window at offset 1 -> offset = 1, len = 1024.
        // Decoder allocates 1024 but slices 1..1025.
        let sliced = bp.into_array().slice(1..1025)?;

        let gpu_result = block_on(async {
            sliced
                .clone()
                .execute_cuda(&mut cuda_ctx)
                .await
                .vortex_expect("GPU decompression failed")
                .into_host()
                .await
                .map(|a| a.into_array())
        })?;

        assert_arrays_eq!(sliced, gpu_result, &mut ctx);
        Ok(())
    }

Output: ( …

Impact

Dictionary columns slice their bit-packed …
Decode sliced bit-packed arrays in padded coordinates by sizing and launching for offset + len. This keeps the returned offset..offset+len device slice in bounds and ensures the final touched 1024-value chunk is decoded.

Signed-off-by: "Alexander Droste" <alexander.droste@protonmail.com>
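To make the sizing rule in the commit message concrete, here is a minimal Rust sketch. It is not the code from this PR: the helper names (`required_len`, `chunks_to_decode`, `slice_in_bounds`) are hypothetical, and the only facts taken from the description are the 1024-value chunk size and the use of offset + len for sizing and launching.

```rust
/// Values per bit-packed chunk, as stated in the PR description.
const CHUNK_LEN: usize = 1024;

/// Hypothetical helper: minimum number of output elements needed so that
/// the device slice `offset..offset + len` stays in bounds.
fn required_len(offset: usize, len: usize) -> usize {
    offset + len
}

/// Hypothetical helper: number of 1024-value chunks that must be decoded
/// so that the last chunk touched by `offset..offset + len` is included.
fn chunks_to_decode(offset: usize, len: usize) -> usize {
    required_len(offset, len).div_ceil(CHUNK_LEN)
}

/// Hypothetical helper: does `offset..offset + len` fit in an output
/// buffer of `alloc_len` elements?
fn slice_in_bounds(alloc_len: usize, offset: usize, len: usize) -> bool {
    offset + len <= alloc_len
}
```

With the values from the reproduction (offset = 1, len = 1024), `required_len` is 1025 and two chunks must be decoded, whereas an output sized for `len` alone (1024 elements) cannot hold the slice 1..1025.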
6927825 to f8a35cc
Thanks for the heads-up on this one, @gargiulofrancesco!
Merging this PR will improve performance by 16.39%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | slice_empty_vortex | 339.4 ns | 397.8 ns | -14.66% |
| ⚡ | Simulation | chunked_bool_canonical_into[(1000, 10)] | 26.3 µs | 15.9 µs | +65.8% |
| ⚡ | Simulation | encode_varbin[(1000, 32)] | 163.7 µs | 146.9 µs | +11.45% |

Comparing ad/fix-cuda-bitpacked-slice-offset (f8a35cc) with develop (a9f77d1)
Footnotes

1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.