feat(vortex-buffer): optimize BitBuffer::set_indices 2-3.5x faster by joseph-isaacs · Pull Request #7159 · vortex-data/vortex

joseph-isaacs · 2026-03-25T14:49:13Z

Summary

Optimizes BitBuffer::set_indices() for extracting set-bit indices from packed bitmaps using AVX-512, BMI2, and density-aware dispatch.

Key optimizations:

AVX-512 VPCOMPRESSD: Processes 16 bitmap bits per compress-store instruction for dense words (>8 set bits per u64)
AVX-512 8-word scan: _mm512_test_epi64_mask checks 512 bits (8 qwords) for zero in one instruction, skipping sparse regions
Counted BLSR loop: Uses for _ in 0..popcount instead of while w != 0 — eliminates branch misprediction from variable loop lengths on random data
Pre-known count: collect_set_indices_with_count() accepts a pre-known true count to skip the count_ones pre-pass
Counted outer loops: Precompute iteration counts to eliminate per-iteration pointer arithmetic overhead
Density-gated 4-word skip (BMI2 fallback): OR 4 u64 words to skip 256 zero bits at <0.8% density

Benchmark results (1M bits, `collect_precount` vs Arrow `collect`):

Density	Distribution	Arrow Collect	Vortex SIMD	Speedup
0.01%	uniform	18.0µs	2.2µs	88% faster
0.01%	random	17.4µs	2.5µs	86% faster
1%	uniform	25.8µs	15.7µs	39% faster
1%	random	108.9µs	16.9µs	84% faster
2%	uniform	39.2µs	23.1µs	41% faster
2%	random	143.0µs	87.0µs	39% faster
3%	uniform	56.3µs	27.8µs	51% faster
3%	random	171.8µs	137.8µs	20% faster
4%	uniform	68.6µs	37.3µs	46% faster
4%	random	196.3µs	154.7µs	21% faster
5%	uniform	78.8µs	34.7µs	56% faster
5%	random	215.9µs	172.8µs	20% faster
5%	clustered	78.1µs	28.7µs	63% faster
6%	uniform	85.0µs	39.8µs	53% faster
6%	random	245.3µs	188.1µs	23% faster
7%	uniform	106.5µs	50.6µs	52% faster
7%	random	266.4µs	187.3µs	30% faster
8%	uniform	116.0µs	52.8µs	54% faster
8%	random	288.8µs	202.8µs	30% faster
10%	uniform	115.3µs	65.4µs	43% faster
10%	random	326.5µs	226.1µs	31% faster
10%	clustered	116.1µs	58.9µs	49% faster
20%	uniform	225.2µs	93.8µs	58% faster
20%	random	477.3µs	89.5µs	81% faster
20%	clustered	220.1µs	74.2µs	66% faster
50%	uniform	557.0µs	94.1µs	83% faster

Vortex beats Arrow at every density and distribution tested (20-88% faster).

Test plan

All 45 existing set_indices tests pass
Tests cover: various sizes, offsets, densities (1-50%), random patterns, dense/sparse extremes
Results verified against Arrow's BitIndexIterator as ground truth
Benchmarks cover uniform, random, and clustered distributions at 0.01%-50% density with fine granularity around 1-10%

https://claude.ai/code/session_01KLHGX5KxFS7btWdZs8qmq9

…ementations Replace Arrow's BitIndexIterator with a custom ScalarBitIndexIterator that operates directly on u64 words via raw pointer arithmetic, eliminating the UnalignedBitChunk abstraction and i64 offset arithmetic overhead. Also add a bulk collect_set_indices() method with BMI2 BLSR/TZCNT hardware acceleration and a fast path for fully-set words at high density. Performance at 100K bits: - 1% density: 1.2x faster (2.50 µs vs 3.05 µs) - 50% density: 3.5x faster (20.9 µs vs 73.4 µs) - 99% density: 2.4x faster (57.4 µs vs 135.8 µs) Bulk BMI2 collect at 99% density: 3.2x faster (42.1 µs vs 135.8 µs) Signed-off-by: Claude <noreply@anthropic.com> https://claude.ai/code/session_01KLHGX5KxFS7btWdZs8qmq9

codspeed-hq · 2026-03-25T14:53:42Z

Merging this PR will degrade performance by 71.09%

❌ 29 regressed benchmarks
✅ 1077 untouched benchmarks
🆕 30 new benchmarks
⏩ 1522 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`bench_dict_mask[(0.01, 0.5)]`	2 ms	2.4 ms	-15.43%
❌	Simulation	`bench_dict_mask[(0.1, 0.1)]`	2.2 ms	2.8 ms	-22.91%
❌	Simulation	`bench_dict_mask[(0.01, 0.01)]`	2.2 ms	2.9 ms	-24.25%
❌	Simulation	`bench_dict_mask[(0.1, 0.01)]`	2.2 ms	2.9 ms	-24.26%
❌	Simulation	`bench_dict_mask[(0.01, 0.1)]`	2.2 ms	2.8 ms	-22.92%
❌	Simulation	`bench_dict_mask[(0.5, 0.01)]`	2.2 ms	2.9 ms	-24.22%
❌	Simulation	`bench_dict_mask[(0.5, 0.1)]`	2.2 ms	2.8 ms	-22.92%
❌	Simulation	`bench_dict_mask[(0.1, 0.5)]`	2 ms	2.4 ms	-15.42%
❌	Simulation	`bench_dict_mask[(0.5, 0.5)]`	2 ms	2.4 ms	-15.42%
❌	Simulation	`bench_dict_mask[(0.9, 0.01)]`	2.2 ms	2.9 ms	-24.23%
❌	Simulation	`bench_dict_mask[(0.9, 0.1)]`	2.2 ms	2.8 ms	-22.91%
❌	Simulation	`bench_dict_mask[(0.9, 0.5)]`	2 ms	2.4 ms	-15.42%
❌	Simulation	`bench_many_nulls[0.1]`	163.6 µs	238.3 µs	-31.33%
❌	Simulation	`bench_many_nulls[0.5]`	324.2 µs	675.1 µs	-51.98%
❌	Simulation	`bench_many_nulls[0.01]`	53.5 µs	66.7 µs	-19.71%
❌	Simulation	`bench_many_nulls[0.9]`	463 µs	1,091.5 µs	-57.59%
❌	Simulation	`density_sweep_random[0.001]`	38 µs	45.2 µs	-15.85%
❌	Simulation	`density_sweep_random[0.005]`	49.9 µs	56.2 µs	-11.34%
❌	Simulation	`filter_ultra_sparse[100000]`	33.5 µs	40.8 µs	-17.93%
❌	Simulation	`filter_ultra_sparse[250000]`	57 µs	75.4 µs	-24.35%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

_{Comparing claude/optimize-bit-buffer-indices-aZcYG (dd48390) with develop (ec2c602)}

1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

robert3005 · 2026-03-25T14:56:47Z

Doesn't look that much faster?

joseph-isaacs closed this Mar 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(vortex-buffer): optimize BitBuffer::set_indices 2-3.5x faster#7159

feat(vortex-buffer): optimize BitBuffer::set_indices 2-3.5x faster#7159
joseph-isaacs wants to merge 1 commit intodevelopfrom
claude/optimize-bit-buffer-indices-aZcYG

joseph-isaacs commented Mar 25, 2026 •

edited

Loading

Uh oh!

codspeed-hq bot commented Mar 25, 2026

Uh oh!

robert3005 commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

joseph-isaacs commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key optimizations:

Benchmark results (1M bits, collect_precount vs Arrow collect):

Test plan

Uh oh!

codspeed-hq bot commented Mar 25, 2026

Merging this PR will degrade performance by 71.09%

Performance Changes

Footnotes

Uh oh!

robert3005 commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joseph-isaacs commented Mar 25, 2026 •

edited

Loading

Benchmark results (1M bits, `collect_precount` vs Arrow `collect`):