Skip to content

feat(vortex-buffer): optimize BitBuffer::set_indices 2-3.5x faster#7159

Closed
joseph-isaacs wants to merge 1 commit intodevelopfrom
claude/optimize-bit-buffer-indices-aZcYG
Closed

feat(vortex-buffer): optimize BitBuffer::set_indices 2-3.5x faster#7159
joseph-isaacs wants to merge 1 commit intodevelopfrom
claude/optimize-bit-buffer-indices-aZcYG

Conversation

@joseph-isaacs
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs commented Mar 25, 2026

Summary

Optimizes BitBuffer::set_indices() for extracting set-bit indices from packed bitmaps using AVX-512, BMI2, and density-aware dispatch.

Key optimizations:

  • AVX-512 VPCOMPRESSD: Processes 16 bitmap bits per compress-store instruction for dense words (>8 set bits per u64)
  • AVX-512 8-word scan: _mm512_test_epi64_mask checks 512 bits (8 qwords) for zero in one instruction, skipping sparse regions
  • Counted BLSR loop: Uses for _ in 0..popcount instead of while w != 0 — eliminates branch misprediction from variable loop lengths on random data
  • Pre-known count: collect_set_indices_with_count() accepts a pre-known true count to skip the count_ones pre-pass
  • Counted outer loops: Precompute iteration counts to eliminate per-iteration pointer arithmetic overhead
  • Density-gated 4-word skip (BMI2 fallback): OR 4 u64 words to skip 256 zero bits at <0.8% density

Benchmark results (1M bits, collect_precount vs Arrow collect):

Density Distribution Arrow Collect Vortex SIMD Speedup
0.01% uniform 18.0µs 2.2µs 88% faster
0.01% random 17.4µs 2.5µs 86% faster
1% uniform 25.8µs 15.7µs 39% faster
1% random 108.9µs 16.9µs 84% faster
2% uniform 39.2µs 23.1µs 41% faster
2% random 143.0µs 87.0µs 39% faster
3% uniform 56.3µs 27.8µs 51% faster
3% random 171.8µs 137.8µs 20% faster
4% uniform 68.6µs 37.3µs 46% faster
4% random 196.3µs 154.7µs 21% faster
5% uniform 78.8µs 34.7µs 56% faster
5% random 215.9µs 172.8µs 20% faster
5% clustered 78.1µs 28.7µs 63% faster
6% uniform 85.0µs 39.8µs 53% faster
6% random 245.3µs 188.1µs 23% faster
7% uniform 106.5µs 50.6µs 52% faster
7% random 266.4µs 187.3µs 30% faster
8% uniform 116.0µs 52.8µs 54% faster
8% random 288.8µs 202.8µs 30% faster
10% uniform 115.3µs 65.4µs 43% faster
10% random 326.5µs 226.1µs 31% faster
10% clustered 116.1µs 58.9µs 49% faster
20% uniform 225.2µs 93.8µs 58% faster
20% random 477.3µs 89.5µs 81% faster
20% clustered 220.1µs 74.2µs 66% faster
50% uniform 557.0µs 94.1µs 83% faster

Vortex beats Arrow at every density and distribution tested (20-88% faster).

Test plan

  • All 45 existing set_indices tests pass
  • Tests cover: various sizes, offsets, densities (1-50%), random patterns, dense/sparse extremes
  • Results verified against Arrow's BitIndexIterator as ground truth
  • Benchmarks cover uniform, random, and clustered distributions at 0.01%-50% density with fine granularity around 1-10%

https://claude.ai/code/session_01KLHGX5KxFS7btWdZs8qmq9

…ementations

Replace Arrow's BitIndexIterator with a custom ScalarBitIndexIterator that
operates directly on u64 words via raw pointer arithmetic, eliminating the
UnalignedBitChunk abstraction and i64 offset arithmetic overhead.

Also add a bulk collect_set_indices() method with BMI2 BLSR/TZCNT hardware
acceleration and a fast path for fully-set words at high density.

Performance at 100K bits:
- 1% density:  1.2x faster (2.50 µs vs 3.05 µs)
- 50% density: 3.5x faster (20.9 µs vs 73.4 µs)
- 99% density: 2.4x faster (57.4 µs vs 135.8 µs)

Bulk BMI2 collect at 99% density: 3.2x faster (42.1 µs vs 135.8 µs)

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_01KLHGX5KxFS7btWdZs8qmq9
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Mar 25, 2026

Merging this PR will degrade performance by 71.09%

❌ 29 regressed benchmarks
✅ 1077 untouched benchmarks
🆕 30 new benchmarks
⏩ 1522 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation bench_dict_mask[(0.01, 0.5)] 2 ms 2.4 ms -15.43%
Simulation bench_dict_mask[(0.1, 0.1)] 2.2 ms 2.8 ms -22.91%
Simulation bench_dict_mask[(0.01, 0.01)] 2.2 ms 2.9 ms -24.25%
Simulation bench_dict_mask[(0.1, 0.01)] 2.2 ms 2.9 ms -24.26%
Simulation bench_dict_mask[(0.01, 0.1)] 2.2 ms 2.8 ms -22.92%
Simulation bench_dict_mask[(0.5, 0.01)] 2.2 ms 2.9 ms -24.22%
Simulation bench_dict_mask[(0.5, 0.1)] 2.2 ms 2.8 ms -22.92%
Simulation bench_dict_mask[(0.1, 0.5)] 2 ms 2.4 ms -15.42%
Simulation bench_dict_mask[(0.5, 0.5)] 2 ms 2.4 ms -15.42%
Simulation bench_dict_mask[(0.9, 0.01)] 2.2 ms 2.9 ms -24.23%
Simulation bench_dict_mask[(0.9, 0.1)] 2.2 ms 2.8 ms -22.91%
Simulation bench_dict_mask[(0.9, 0.5)] 2 ms 2.4 ms -15.42%
Simulation bench_many_nulls[0.1] 163.6 µs 238.3 µs -31.33%
Simulation bench_many_nulls[0.5] 324.2 µs 675.1 µs -51.98%
Simulation bench_many_nulls[0.01] 53.5 µs 66.7 µs -19.71%
Simulation bench_many_nulls[0.9] 463 µs 1,091.5 µs -57.59%
Simulation density_sweep_random[0.001] 38 µs 45.2 µs -15.85%
Simulation density_sweep_random[0.005] 49.9 µs 56.2 µs -11.34%
Simulation filter_ultra_sparse[100000] 33.5 µs 40.8 µs -17.93%
Simulation filter_ultra_sparse[250000] 57 µs 75.4 µs -24.35%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing claude/optimize-bit-buffer-indices-aZcYG (dd48390) with develop (ec2c602)

Open in CodSpeed

Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@robert3005
Copy link
Copy Markdown
Contributor

Doesn't look that much faster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants