Add OnPair string compression encoding with predicate pushdown #7927
joseph-isaacs wants to merge 17 commits into
Introduces two new crates that integrate the OnPair C++ short-string
compression library (gargiulofrancesco/onpair_cpp, arXiv:2508.02280) as
a first-class Vortex array.
* `encodings/onpair-sys`: build.rs uses cmake-rs to FetchContent the
upstream onpair_cpp at configure time, applies a small in-tree patch
that swaps `boost::unordered_flat_map` for `std::unordered_map` (plus
a `std::hash<std::pair<...>>` specialisation), and links a C-ABI shim
(`cxx/onpair_shim.{h,cpp}`) into a static archive. Safe Rust wraps
the shim in an owning `Column` handle that exposes compress / serialise /
decompress plus the compressed-domain predicates.
* `encodings/onpair`: Vortex `Array` impl mirroring `vortex-fsst`.
Stores the serialised OnPair column (`ONPAIR01` magic + dictionary +
bit-packed token stream) as a single opaque buffer plus an
`uncompressed_lengths` child for cheap canonicalisation. Default
preset is "dict-12" (12-bit codes, dictionary capped at 4 096 entries).
Wires equals / starts-with / contains pushdown straight through to
the C++ scan implementation via `CompareKernel` and `LikeKernel`, so
`arr = const` and `arr LIKE 'prefix%' / '%substr%'` evaluate on the
compressed stream without decoding rows.
* Tests cover roundtrip, nullable canonicalisation, scalar_at, and all
three pushdown predicates end-to-end through the C++ stack (7/7
pass locally with cmake + g++).
Build requirements: cmake >= 3.21, a C++20 compiler, and network access
on the first build (subsequent builds are cached under
`$OUT_DIR/onpair-build/_deps`). No Boost dependency at build time.
Signed-off-by: Claude <noreply@anthropic.com>
Exercises the C++ → FFI → Vortex stack on a realistic-shape corpus
(synthetic URL / HTTP-log strings). Validates roundtrip byte-equality on
all 100 000 rows and checks each pushdown predicate result against a
brute-force scan.
Local results (release build): 100 000 rows, 4 332 157 -> 1 385 145
bytes (3.13x), compress 136 ms, canonicalize 5 ms; equals / starts_with /
contains all match the reference counts exactly.
Signed-off-by: Claude <noreply@anthropic.com>
Merging this PR will degrade performance by 16.85%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | new_bp_prim_test_between[i32, 16384] | 95.1 µs | 109.6 µs | -13.18% |
| ❌ | Simulation | new_bp_prim_test_between[i32, 32768] | 141.5 µs | 170.4 µs | -16.98% |
| ❌ | Simulation | new_bp_prim_test_between[i16, 32768] | 120.7 µs | 134.7 µs | -10.4% |
| ❌ | Simulation | new_bp_prim_test_between[i64, 16384] | 115.5 µs | 144.9 µs | -20.29% |
| ❌ | Simulation | new_bp_prim_test_between[i64, 32768] | 178.3 µs | 237.2 µs | -24.82% |
| ❌ | Simulation | new_alp_prim_test_between[f64, 16384] | 127.5 µs | 149.3 µs | -14.61% |
Comparing claude/vortex-array-rust-bindings-FQfIX (53c3ea4) with develop (b3e1673)
…code
Replaces the previous opaque-blob layout with one that mirrors how FSST
splits its symbols-as-buffer / codes-as-child encoding, and shifts every
read path off the C++ FFI.
Layout
------
Buffer 0 dict_bytes — dictionary blob built by C++ training
Slot 0 dict_offsets u32[] — len = dict_size + 1
Slot 1 codes u16[] — one token id per element, low `bits`
bits populated (FastLanes-bit-packable)
Slot 2 codes_offsets u32[] — per-row token offsets, len = n + 1
Slot 3 uncompressed_lengths — i32[], len = n
Slot 4 validity — optional Bool child
metadata = { bits: u32, uncompressed_lengths_ptype: i32 }
Decode path
-----------
At compress time we call OnPair's C++ trainer to produce the dictionary
and bit-packed token stream, then immediately unpack the stream into u16
codes in Rust (`vortex_onpair_sys::unpack_codes_to_u16`) and drop the
C++ column. After that, nothing on the read path touches C++:
decode_row(r):
for c in codes[codes_offsets[r] .. codes_offsets[r+1]]:
out.extend_from_slice(
dict_bytes[dict_offsets[c] .. dict_offsets[c+1]]
)
`canonicalize`, `scalar_at`, and the compute kernels all share a
`DecodeView` over the materialised children.
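As a concrete illustration, the decode path described above can be sketched in a few lines of safe Rust. The function and fixtures are ours, not the actual Vortex implementation; names follow the layout table (`dict_bytes`, `dict_offsets`, `codes`, `codes_offsets`).

```rust
/// Hypothetical sketch of decode_row from the layout above: each u16 code
/// indexes the dictionary, and dict_offsets brackets the token's bytes.
fn decode_row(
    dict_bytes: &[u8],
    dict_offsets: &[u32],
    codes: &[u16],
    codes_offsets: &[u32],
    row: usize,
    out: &mut Vec<u8>,
) {
    let lo = codes_offsets[row] as usize;
    let hi = codes_offsets[row + 1] as usize;
    for &c in &codes[lo..hi] {
        let s = dict_offsets[c as usize] as usize;
        let e = dict_offsets[c as usize + 1] as usize;
        out.extend_from_slice(&dict_bytes[s..e]);
    }
}

fn main() {
    // Tiny hand-built dictionary: entries "ab", "cd", "x".
    let dict_bytes = b"abcdx";
    let dict_offsets = [0u32, 2, 4, 5];
    // Row 0 = tokens [0, 2] -> "abx"; row 1 = token [1] -> "cd".
    let codes = [0u16, 2, 1];
    let codes_offsets = [0u32, 2, 3];
    let mut out = Vec::new();
    decode_row(dict_bytes, &dict_offsets, &codes, &codes_offsets, 0, &mut out);
    assert_eq!(out, b"abx");
    println!("ok");
}
```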
Compute kernels (pure Rust, no C++ scan)
----------------------------------------
* compare (Eq / NotEq): streams dict slices per row, short-circuits on
the first mismatch.
* like ('lit', 'pre%', '%sub%'): same streaming approach for prefix; a
full row decode + memmem for contains.
* filter: canonical round-trip + recompress (unchanged).
* slice: zero-copy — narrows codes_offsets / uncompressed_lengths /
validity and shares the dict blob + codes child.
* cast: identity rewrap, no payload touched.
Tests
-----
All 7 unit tests + the 100 000-row big_data smoke test pass. On the
smoke corpus (release): compress 147 ms, full canonicalize 7.5 ms,
equals / starts_with / contains pushdown counts match a brute-force
reference exactly.
Signed-off-by: Claude <noreply@anthropic.com>
curious how it would do if wired into the compressor
* Extract a small `parts_to_children` helper in `vortex-onpair`'s
`compress.rs` so the lift-out-of-C++ step reads top-to-bottom rather
than via a block-and-drop dance.
* Add `OnPairScheme` to `vortex-btrblocks::schemes::string`. The scheme
matches utf8 strings, declares its four primitive children
(dict_offsets / codes / codes_offsets / uncompressed_lengths) so the
cascading compressor can re-encode them downstream
(FastLanes-bit-pack on `codes`, etc.), defers the compression-ratio
estimate to the sample-based path (same as FSST / Zstd), and
reassembles the result via `OnPair::try_new`.
* Feature-gate it via a new `onpair` Cargo feature, enabled by default,
so out-of-the-box `BtrBlocksCompressorBuilder::default()` includes it
in `ALL_SCHEMES` and consumers without a C++ toolchain can opt out
with `default-features = false`.
* Update the FSST scheme-selection test to accept either FSST or OnPair
as the winning encoding — both target the same workload (short
strings with high lexical overlap) and the sample-based selector now
picks the one with the better ratio on the test corpus.
Test results
vortex-onpair 7 unit + 1 100k smoke all green
vortex-btrblocks 36 unit + 3 doctests all green (incl. new
`test_onpair_in_default_scheme_list`)
Signed-off-by: Claude <noreply@anthropic.com>
Polar Signals Profiling Results
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.159x ❌, 0↑ 17↓)
datafusion / vortex-compact (1.061x ➖, 0↑ 4↓)
datafusion / parquet (1.078x ➖, 1↑ 5↓)
datafusion / arrow (1.093x ➖, 0↑ 8↓)
duckdb / vortex-file-compressed (1.139x ❌, 0↑ 9↓)
duckdb / vortex-compact (1.061x ➖, 0↑ 4↓)
duckdb / parquet (1.041x ➖, 1↑ 4↓)
duckdb / duckdb (1.047x ➖, 0↑ 1↓)
Full attributed analysis
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.989x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.038x ➖, 0↑ 2↓)
datafusion / parquet (0.969x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.974x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.996x ➖, 0↑ 0↓)
duckdb / parquet (0.990x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.146x ➖, 0↑ 4↓)
datafusion / vortex-compact (1.063x ➖, 0↑ 1↓)
datafusion / parquet (1.019x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.169x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.024x ➖, 0↑ 0↓)
duckdb / parquet (0.986x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: FineWeb NVMe
Verdict: Likely regression (low confidence)
datafusion / vortex-file-compressed (4.075x ❌, 1↑ 7↓)
datafusion / vortex-compact (1.014x ➖, 0↑ 0↓)
datafusion / parquet (0.990x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (4.180x ❌, 0↑ 7↓)
duckdb / vortex-compact (1.008x ➖, 0↑ 0↓)
duckdb / parquet (0.994x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.978x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.997x ➖, 0↑ 1↓)
datafusion / parquet (1.002x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.967x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.000x ➖, 0↑ 0↓)
duckdb / parquet (0.957x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.972x ➖, 25↑ 3↓)
datafusion / vortex-compact (1.017x ➖, 3↑ 3↓)
datafusion / parquet (0.976x ➖, 5↑ 2↓)
duckdb / vortex-file-compressed (1.072x ➖, 3↑ 11↓)
duckdb / vortex-compact (0.976x ➖, 3↑ 0↓)
duckdb / parquet (1.005x ➖, 2↑ 7↓)
duckdb / duckdb (0.920x ➖, 30↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.027x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.004x ➖, 0↑ 0↓)
datafusion / parquet (0.997x ➖, 0↑ 0↓)
datafusion / arrow (0.989x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.045x ➖, 0↑ 2↓)
duckdb / vortex-compact (1.004x ➖, 0↑ 0↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
duckdb / duckdb (0.994x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.039x ➖, 4↑ 10↓)
datafusion / parquet (0.998x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.020x ➖, 1↑ 4↓)
duckdb / parquet (1.012x ➖, 0↑ 0↓)
duckdb / duckdb (1.000x ➖, 3↑ 1↓)
Full attributed analysis
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.026x ➖
datafusion / vortex-file-compressed (1.026x ➖, 0↑ 2↓)
File Sizes: PolarSignals Profiling
File Size Changes (1 file changed, +0.1% overall, 1↑ 0↓)
Two changes that together stop FSST from being the default and make OnPair work end-to-end through the file writer + reader.

vortex-btrblocks
* Remove `FSSTScheme` from `ALL_SCHEMES`. The struct and `Scheme` impl stay in place so callers can opt back in via `BtrBlocksCompressorBuilder::with_new_scheme(&FSSTScheme)`; it just isn't in the default cascade anymore. OnPair fills the string-fragmentation slot.
* Tighten `only_cuda_compatible` to exclude OnPair (heavier toolchain dep) instead of FSST.
* Tests: drop the FSST-vs-OnPair tie-break test; add `test_onpair_compressed` (FSST-style corpus → OnPair) and `test_fsst_opt_in_still_works` (empty builder + with_new_scheme + FSSTScheme).

vortex-file
* New `onpair` Cargo feature (default-on, mirrors `vortex-btrblocks`'s) that pulls in `vortex-onpair` and registers `OnPair` in both `register_default_encodings` and `ALLOWED_ENCODINGS`. Without this the normalizer rejects vortex.onpair with "normalize forbids encoding (vortex.onpair)" when round-tripping a file. Consumers without a C++ toolchain can set `default-features = false`.

CI / reproducibility
* Pin `onpair_cpp` to a full commit SHA in `cmake/onpair_pin.cmake` (was tracking `main`). CI's `FetchContent` step is now reproducible and won't break when upstream's main branch moves.

Tests: 109 across vortex-onpair, vortex-btrblocks, vortex-file; all green. Clippy clean.
Signed-off-by: Claude <noreply@anthropic.com>
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.974x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.984x ➖, 0↑ 0↓)
duckdb / parquet (0.990x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Compression
Vortex (geomean): 1.034x ➖
unknown / unknown (1.026x ➖, 7↑ 12↓)
Wiring `default = ["onpair"]` directly on `vortex-btrblocks` and
`vortex-file` meant any consumer that depended on those crates with
default features on (including `wasm-test`, which sets
`vortex = { default-features = false }` but cannot disable transitive
default features on a hard dep of `vortex`) ended up pulling
`vortex-onpair-sys` and its CMake / C++20 build, which fails on
wasm32-wasip1.
Move the default-on toggle to the umbrella `vortex` crate:
* `vortex-btrblocks` and `vortex-file` now declare `onpair` as a
feature with **no `default = [...]` line** — they're a la carte.
* `vortex/Cargo.toml`: `default = ["files", "zstd", "onpair"]` plus a
new `onpair = ["vortex-btrblocks/onpair", "vortex-file?/onpair"]`
alias so `vortex` consumers still get OnPair out of the box but
`default-features = false` callers (wasm-test) really do drop it.
* `only_cuda_compatible` annotates its now-conditionally-mutated
`excluded` list with `#[cfg_attr(not(feature = "onpair"),
allow(unused_mut))]` so no-default-features builds stop warning.
Verified:
cargo build --target wasm32-wasip1 \
--manifest-path wasm-test/Cargo.toml # green, no C++ build
Signed-off-by: Claude <noreply@anthropic.com>
* `vortex-onpair`: the cascading compressor narrows the integer slot
children to their tightest ptype (e.g. `codes` from u16 down to u8),
so the decoder's `as_slice::<u16>()` was tripping a panic. Widen all
three primitive children back to their canonical types
(`Buffer<u16>` for codes, `Buffer<u32>` for both offsets) at
materialisation time. Adds three round-trip tests in
`vortex-btrblocks/tests/onpair_roundtrip.rs` that exercise the full
compressor + decompressor on string arrays (non-nullable, nullable,
and an empty-string-heavy edge case) — all three are green.
* Fix the two `unresolved link` rustdoc warnings on `OnPair::compress`
by pointing at the actual entry point (`crate::onpair_compress`).
* `Cargo.toml`: re-sort `vortex-onpair` / `vortex-onpair-sys` into
alphabetical order in `[workspace.dependencies]` so `taplo fmt
--check` (= the `lint-toml` CI job) stops complaining.
* SPDX headers on the three CMake files
(`encodings/onpair-sys/cmake/{CMakeLists.txt,onpair_pin.cmake,strip_boost.cmake}`)
so the `reuse-check` job passes.
* Regenerate `public-api.lock` for `vortex-btrblocks` and add the two
missing locks (`encodings/onpair{,-sys}/public-api.lock`).
Test results
vortex-onpair 7 unit + 1 100k smoke all green
vortex-btrblocks 36 unit + 3 doctests +
3 new onpair_roundtrip all green
Signed-off-by: Claude <noreply@anthropic.com>
* `vortex-file/tests/test_onpair_string_roundtrip.rs`: a full parquet-bench-shape file write/read test for a single string column. Currently `#[ignore]`'d because when the cascading compressor leaves one of OnPair's primitive children (e.g. `dict_offsets` u32, or `codes_offsets` u32) as a raw `PrimitiveArray` rather than bit-packing it, the file roundtrip fails with `Misaligned buffer cannot be used to build PrimitiveArray of u32`. Tracked separately — the fix is to move the offset arrays into the OnPair array's `VTable::buffer` slots (where `BufferHandle::alignment` is preserved across the file format) instead of storing them as primitive slot children.
* For now the existing `BtrBlocksCompressor` round-trip tests (`vortex-btrblocks/tests/onpair_roundtrip.rs`) continue to pass — the compressor pipeline is correct; only the file-format serialisation has the alignment limitation.
Signed-off-by: Claude <noreply@anthropic.com>
Layout change driven by two related bugs:
1. The cascading compressor can narrow OnPair's primitive slot children
(e.g. `dict_offsets` u32 → u16). My `as_slice::<u32>()` panicked.
The user pointed out codes themselves can't narrow below u9 — only
the *offsets* arrays were ever at risk. Earlier fix (widen on
decode) addressed the symptom; the v3 layout removes the root cause
by keeping offsets as raw byte buffers all the way through.
2. The Vortex flat-segment writer aligns a segment to the alignment of
its *first* buffer only. Primitive slot children that follow a
variable-length buffer in the same segment end up at an arbitrary
offset, and on read `PrimitiveArray<u32>::deserialize` rejects them
with `Misaligned buffer`. This broke the file roundtrip end-to-end.
New layout (all alignment-stable):
Buffer 0 dict_bytes — dictionary blob from C++ trainer
Buffer 1 dict_offsets u32[] — raw little-endian bytes
Buffer 2 codes u16[] — raw little-endian bytes; each
value uses up to `bits` ≤ 16 bits
Buffer 3 codes_offsets u32[] — raw little-endian bytes
Slot 0 uncompressed_lengths — integer PrimitiveArray
Slot 1 validity — optional Bool child
`codes` stays full u16 width on disk (no bit-packing) so the decode
hot loop is a straight indexed dict lookup with no unpack:
for c in codes[lo..hi]:
out.extend_from_slice(dict_bytes[off[c]..off[c+1]])
`bytes_to_buffer_u{16,32}` copies from arbitrarily-aligned input bytes
to a typed `Buffer<uN>`; the inner `from_le_bytes` loop autovectorises
to a single load on LE targets so the decode setup cost is tiny.
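The widening copy can be sketched as follows, assuming a plain `Vec<u32>` output in place of the typed `Buffer<u32>`; the name `bytes_to_vec_u32` is hypothetical, not the crate's API.

```rust
/// Illustrative version of the bytes_to_buffer_u32 idea: copy from
/// arbitrarily-aligned little-endian bytes into a typed vector. On LE
/// targets the from_le_bytes loop compiles down to straight loads.
fn bytes_to_vec_u32(bytes: &[u8]) -> Vec<u32> {
    assert!(bytes.len() % 4 == 0, "input must be a whole number of u32s");
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // 1 and 258 (0x102) in little-endian byte order.
    let raw = [1u8, 0, 0, 0, 2, 1, 0, 0];
    assert_eq!(bytes_to_vec_u32(&raw), vec![1, 258]);
    println!("ok");
}
```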
OnPairScheme::compress now only sends `uncompressed_lengths` through
the cascading compressor (the rest are buffers, not children); the
buffer alignment travels with the `BufferHandle::alignment` marker so
the segment writer pads correctly on disk.
Tests
* `vortex-onpair` 7 unit + 1 100k smoke green
* `vortex-btrblocks` 35 unit + 3 doctests +
3 onpair_roundtrip green
* `vortex-file` 2 + 1 new `test_onpair_string_roundtrip`
(full file write/read of a Utf8 column) green
Smoke-test perf (release, 100k rows, 4.3 MB raw → still 25 % compressed):
compress 184 ms, canonicalize 9 ms; equals / starts_with / contains
pushdown counts match a brute-force scan exactly.
Signed-off-by: Claude <noreply@anthropic.com>
Expand the file-write round-trip suite from a single 4 K-row column to cover the call shapes that the CI bench actually exercises (and that surfaced the earlier `Misaligned buffer cannot be used to build PrimitiveArray of u32` regression on TPC-H `supplier_0.vortex`):
* `single_column_single_chunk` — baseline 4 K rows.
* `single_column_many_chunks` — 50 K rows split across chunks.
* `tpch_supplier_shape` — 32 K rows × 8 columns (`s_suppkey i64`, `s_name`, `s_address`, `s_nationkey i32`, `s_phone`, `s_acctbal i64`, `s_comment`, `s_city`) — five string columns interleaved with primitive columns, the exact mix where the alignment bug previously fired.
* `nullable_and_extreme_shapes` — 16 K rows of mixed string shapes (nulls, empties, 1 KiB-long blobs, short patterns) on a `Nullable` Utf8 column, hitting the validity child path.
All four pass after the buffer-only OnPair layout (commit f0e03a3).
Signed-off-by: Claude <noreply@anthropic.com>
Match `vortex::VortexSession::default()` precisely (DType + Array + Layout + ScalarFn + ArrayKernels + AggregateFn + Runtime sessions plus `register_default_encodings`). `vortex-file` can't depend on the umbrella `vortex` crate, but inlining the same composition gives the tests identical compressor + decompressor wiring to what `vortex-bench` and downstream applications use.

The write path was already using `WriteStrategyBuilder::default()` = `BtrBlocksCompressor::default()`; the helper now spells out that the in-memory write goes through the full cascading compressor and reads back via `OpenOptions::open_buffer` (no disk, no FS) so reviewers don't have to chase the call graph.

Signed-off-by: Claude <noreply@anthropic.com>
File Sizes: TPC-H SF=1 on NVME
File Size Changes (18 files changed, -5.0% overall, 7↑ 11↓)
File Sizes: FineWeb NVMe
File Size Changes (2 files changed, -12.1% overall, 1↑ 1↓)
File Sizes: TPC-DS SF=1 on NVME
File Size Changes (48 files changed, -0.3% overall, 26↑ 22↓)
Benchmarks: Random Access
Vortex (geomean): 0.941x ➖
unknown / unknown (0.947x ➖, 7↑ 0↓)
File Sizes: Statistical and Population Genetics
File Size Changes (2 files changed, -0.2% overall, 0↑ 2↓)
File Sizes: TPC-H SF=10 on NVME
File Size Changes (48 files changed, -4.3% overall, 9↑ 39↓)
File Sizes: Clickbench on NVME
File Size Changes (201 files changed, -14.2% overall, 7↑ 194↓)
Match OnPair C++ `decoder.h::decompress` exactly: copy a fixed
`MAX_TOKEN_SIZE = 16` bytes per token regardless of true token length,
then advance the output cursor by the *true* length so the next memcpy
overwrites the trailing slop. LLVM lowers the fixed-size copy to a
single 16-byte unaligned vector store on x86_64 / aarch64, making each
token a constant-time SIMD operation instead of a branchy variable
memcpy.
Changes:
* `MAX_TOKEN_SIZE` is now a public crate-level constant.
* `compress.rs` pads the dictionary blob with 16 trailing zero bytes so
the over-copy never reads past `dict_bytes`. The codes / offsets /
validity invariants are unchanged.
* `decode.rs::DecodeView::decode_row_into` becomes the fast path: a
two-pass loop that first sums true lengths to size the output buffer
once, then over-copies into a pre-reserved region using
`copy_nonoverlapping` and finishes with a single `set_len`.
* New `decode_rows_into(start, count, &mut Vec<u8>)` does the same
thing across a row window with no per-row reserve overhead. The
canonicalise path now bulk-decodes the entire array in one shot.
Benchmark (release, no FFI, real OnPair-compressed URL/log corpus):
rows | median canonicalize | ns / row
---------|----------------------|---------
10 000 | 280 µs | 28
100 000 | 3.12 ms | 31
1 000 000| 57.5 ms | 57 (L2-bound)
For comparison the earlier `extend_from_slice` decode was ~7.5 ms /
100 K rows; the new path is **~2.4× faster**.
Verified
* `cargo test -p vortex-onpair` all green
* `cargo test -p vortex-btrblocks ...` all green (3× roundtrip)
* `cargo test -p vortex-file ... onpair` all green (4× roundtrip
incl. TPC-H shape)
* `datafusion-bench tpch --opt scale-factor=0.01 --formats vortex
--queries 1` end-to-end Parquet →
Vortex (with OnPair) →
DataFusion query 1 in 12 ms
Signed-off-by: Claude <noreply@anthropic.com>
Root cause: Vortex's flat-layout segment writer aligns each segment to
the alignment of its *first* buffer only. With the old buffer order
[dict_bytes, dict_offsets, codes, codes_offsets]
`dict_bytes` is variable-length and has no alignment requirement, so
the segment was written u8-aligned. The next buffer (`dict_offsets`)
was a u32 array but ended up at an offset that was only u8-aligned in
the file, and on read `PrimitiveArray<u32>::deserialize` rejected it
with `Misaligned buffer cannot be used to build PrimitiveArray of u32`.
Single-column tests happened to pass because typical OnPair
dictionaries are coincidentally a multiple of 4 bytes; ClickBench's
wide string tables (and TPC-H's `supplier` post-encoding) hit the bad
case.
New buffer order:
Buffer 0 dict_offsets u32[] ← segment alignment = 4
Buffer 1 codes_offsets u32[] ← length already 4-multiple
Buffer 2 codes u16[] ← starts at 4-aligned offset, OK for u16
Buffer 3 dict_bytes u8[] ← variable length, no alignment needed
Each buffer's natural length is a multiple of its alignment, so every
buffer inside the segment stays correctly aligned. The 16-byte
over-copy padding on `dict_bytes` still applies for the decoder.
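The invariant can be checked with a few lines: a segment is aligned to its first buffer, and because each buffer's byte length is a multiple of its own alignment, every subsequent offset stays aligned too. `first_misaligned` is a hypothetical checker, not Vortex code; the lengths are made-up examples.

```rust
/// Returns the index of the first buffer whose offset within the segment
/// would violate its alignment, given (alignment, byte length) pairs laid
/// out back-to-back from a segment start aligned to the first buffer.
fn first_misaligned(bufs: &[(usize, usize)]) -> Option<usize> {
    let mut offset = 0usize;
    for (i, &(align, len)) in bufs.iter().enumerate() {
        if offset % align != 0 {
            return Some(i);
        }
        offset += len;
    }
    None
}

fn main() {
    // New order: dict_offsets u32, codes_offsets u32, codes u16, dict_bytes u8.
    let new_order = [(4, 20), (4, 12), (2, 6), (1, 33)];
    assert_eq!(first_misaligned(&new_order), None);
    // Old order: an odd-length dict_bytes first leaves dict_offsets misaligned.
    let old_order = [(1, 33), (4, 20), (2, 6), (4, 12)];
    assert_eq!(first_misaligned(&old_order), Some(1));
    println!("ok");
}
```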
Verified
* `cargo test -p vortex-onpair -p vortex-btrblocks -p vortex-file`
all green (5 new file-roundtrip tests pass, including a new
`odd_dict_length_alignment` test specifically exercising the
previously-broken case).
* `datafusion-bench tpch --opt scale-factor=0.01 --formats vortex
--queries 1,2,3,6 --iterations 1` runs all four queries
successfully end-to-end (Parquet → Vortex with OnPair → DataFusion).
Signed-off-by: Claude <noreply@anthropic.com>
… children

Move dict_offsets, codes, and codes_offsets out of the OnPair array's raw buffer list and into typed slot children, mirroring FSST. The cascading compressor now sees each integer offset/code array as a regular `PrimitiveArray` child and can re-encode them through the standard `compress_child` pipeline (FastLanes BitPacking on `codes` at exactly `bits` bits, FoR on the offsets, narrow-then-FoR on `uncompressed_lengths`, etc.).

New on-disk layout:
Buffer 0 dict_bytes (opaque, 8-aligned, +16 pad)
Slot 0 dict_offsets u32[] (may be narrowed by compressor)
Slot 1 codes u16[] (may be BitPacked to `bits` width)
Slot 2 codes_offsets u32[] (may be narrowed by compressor)
Slot 3 uncompressed_lengths (integer)
Slot 4 optional validity

Two pieces have to come along for the ride:
1. Per-child ptype recorded in `OnPairMetadata` (`dict_offsets_ptype`, `codes_ptype`, `codes_offsets_ptype`) so deserialize can ask for the actual narrowed dtype rather than hard-coded `U32` / `U16`. Without this fix `Primitive::deserialize` got handed a u16-aligned buffer for a U32 type and panicked with `Misaligned buffer cannot be used to build PrimitiveArray of u32`.
2. `OwnedDecodeInputs::collect` now widens whatever the compressor handed back (`u8`/`u16` for offsets, `u8` for `bits ≤ 8` codes) to the decode loop's native widths via `match_each_integer_ptype!` so the over-copy hot loop stays the same straight pointer arithmetic.

`OnPairScheme` in vortex-btrblocks declares `num_children = 4` and recursively compresses every child, matching FSSTScheme's shape.

Tests
* `cargo test -p vortex-onpair -p vortex-btrblocks` — all green (7 unit + 1 smoke + 3 btrblocks roundtrip).
* `cargo test -p vortex-file --features onpair,tokio --test test_onpair_string_roundtrip` — all 5 green (single chunk, many chunks, TPC-H supplier shape, nullable extremes, odd_dict_length_alignment).
* `datafusion-bench tpch --opt scale-factor=0.01 --formats vortex --queries 1,3,6,12 --iterations 1` — all four queries end-to-end through Parquet → Vortex with OnPair → DataFusion. Signed-off-by: Claude <noreply@anthropic.com>
…uble-copy
Two production improvements with measured benchmark backing. A side-by-side
microbench was used to compare four candidate decoders against each other on
the same compressed array; only the winning variant was kept (numbers below).
Combined `(offset << 16) | length` table
----------------------------------------
`OwnedDecodeInputs::collect` now packs `dict_offsets` into a single
`Buffer<u64>` table at materialise time. The hot decode loop loads one u64
per token instead of two adjacent u32s — `entry = *table_ptr.add(c);
off = entry >> 16; len = entry & 0xffff` — matching the strategy
`onpair_cpp/include/onpair/decoding/decoder.h` uses on its hot path. The
table costs `dict_size * 8` bytes (32 KiB at dict-12) which is amortised
over every row decode and trivially small next to the row payload.
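Building the combined table can be sketched as below; `build_dict_table` is an illustrative name, and the packing assumes token lengths fit in 16 bits (`MAX_TOKEN_SIZE` ≤ 16), as in the commit.

```rust
/// Sketch of the combined (offset << 16) | length table: the hot loop then
/// does one u64 load per token instead of two adjacent u32 loads.
fn build_dict_table(offsets: &[u32]) -> Vec<u64> {
    offsets
        .windows(2)
        .map(|w| ((w[0] as u64) << 16) | ((w[1] - w[0]) as u64))
        .collect()
}

fn main() {
    // Dictionary entries at byte ranges [0,2), [2,3), [3,7).
    let offsets = [0u32, 2, 3, 7];
    let table = build_dict_table(&offsets);
    let entry = table[2];
    assert_eq!(entry >> 16, 3);     // offset
    assert_eq!(entry & 0xffff, 4);  // length
    println!("ok");
}
```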
Drop double-copy in `canonicalize_onpair`
-----------------------------------------
Previously the canonical buffer was assembled as:
let mut buf: Vec<u8> = Vec::with_capacity(total + MAX_TOKEN_SIZE);
dv.decode_rows_into_with_size(0, n, total, &mut buf);
let mut out_bytes = ByteBufferMut::with_capacity(buf.len());
out_bytes.extend_from_slice(&buf); // ← second memcpy
Now we decode straight into `ByteBufferMut::spare_capacity_mut()`, so the
entire decoded payload is written exactly once.
Strategies that lost the bench (see git history for the full
benchmark + experimental variants):
* Padding every dict entry to 16 B (no `dict_offsets`, straight `c * 16`
lookup): 25 % faster on 10 K and 100 K rows but **3.6× slower on 1 M
rows** — extra working set blew out of L2.
* Non-temporal stores (`_mm_stream_si128`): catastrophic — the
`cursor % 16` realign branch + `sfence` per token tanked it by 17×.
Final numbers (release, URL/log corpus, dict-12, 30 samples)
------------------------------------------------------------
before after speedup
raw decode 10 K 60 µs 56 µs 1.07×
raw decode 100 K 693 µs 635 µs 1.09×
raw decode 1 M 9.5 ms 9.6 ms ≈ 1×
canonicalize 10 K 190 µs 171 µs 1.11×
canonicalize 100 K 2.35 ms 1.85 ms 1.27×
canonicalize 1 M 55 ms 29.7 ms **1.85×**
The raw-decode-only speedup is modest (the inner loop is already
memory-bound at 1 M), but the canonicalize end-to-end win is dominated
by the dropped second memcpy.
Verified
* `cargo test -p vortex-onpair -p vortex-btrblocks` — all green.
* `cargo test -p vortex-file --features onpair,tokio
--test test_onpair_string_roundtrip` — all 5 green.
Signed-off-by: Claude <noreply@anthropic.com>
Local-only follow-up to the combined-table decoder (15569bb). Five correctness-preserving micro-optimisations and some test/bench hygiene. Not pushed; user requested local-only review.
1. Drop `OwnedDecodeInputs::dict_offsets` — the decoder only needs the combined `(offset << 16) | length` `dict_table`, so `collect` no longer materialises a `Buffer<u32>` for the offsets at all. The table is built directly from whatever ptype the cascading compressor handed back via `match_each_integer_ptype!`. Saves one `dict_size`-element allocation per decode.
2. Single-allocation widen. `widen_to_{u16,u32}` now go through `BufferMut::with_capacity` + `push_unchecked` + `freeze` rather than `Vec → Buffer::copy_from`, halving allocator traffic.
3. Zero-copy widen fast path. When the cascading compressor did *not* narrow (the common case for small dicts / wide value ranges), the widen function refcount-bumps the underlying Arc via `PrimitiveArray::into_buffer::<u_N>()` instead of copying.
4. `for_each_dict_slice` + `decoded_len_rows` use `dict_table`. One `u64` load per token instead of two adjacent `u32` loads.
5. Tighter predicate kernels. `row_equals` / `row_starts_with` use raw slice pointer math on the needle/prefix after a single length check, instead of re-running bounds-checked subslicing on every iteration.

Tests + bench
* New `rstest`-parameterised `test_onpair_unroll_tail_boundaries` for `n ∈ {1, 2, 3, 4, 5, 7, 8, 9}` to stress the 4×-unrolled decode loop's scalar tail. Plus `test_onpair_empty`.
* Bench sweeps four corpus shapes (URL/log, short, long, high-card) across two row counts, so a regression on any shape surfaces clearly.

Benchmark (release, 30 samples, vs prior tip 15569bb)
canonicalize UrlLog 100 K 1.85 ms → 1.42 ms (-23 %)
canonicalize UrlLog 1 M 29.7 ms → 15.1 ms (-49 %)
decode_rows UrlLog 1 M 9.6 ms → 4.6 ms (-52 %)

Verified
* `cargo test -p vortex-onpair` — 16/16 (was 7/7).
* `cargo test -p vortex-btrblocks` — 35/35.
* `cargo test -p vortex-file --features onpair,tokio --test test_onpair_string_roundtrip` — 5/5.
* `cargo clippy -p vortex-onpair -p vortex-onpair-sys -p vortex-btrblocks --all-targets` — clean.
Signed-off-by: Claude <noreply@anthropic.com>
…ates + memchr contains
Three connected changes that drop the SF=10 regression and accelerate
predicate pushdown.
OnPair::filter — share the dictionary (was the SF=10 cause)
-----------------------------------------------------------
The previous implementation decoded the whole array, filtered the
canonical bytes, and re-trained a brand-new OnPair dictionary on the
surviving rows. TPC-H Q22 customer.c_phone goes through two consecutive
filters (`SUBSTRING(c_phone,1,2) IN (...)` and `c_acctbal > avg`), each
of which paid full `Column::compress` training overhead — a ~50–100 ms
constant cost per call that vanishes below noise at SF=1 but dominates
at SF=10.
The rewrite is FSST-shape: keep `dict_bytes` + `dict_offsets` byte-
identical to the input; rebuild only `codes`, `codes_offsets`,
`uncompressed_lengths`, and validity by walking the mask. No decode,
no retrain, no C++ on the read path. New unit test
`test_onpair_filter_shares_dict` asserts the dict is byte-identical
post-filter.
Bench (UrlLog 1 M, --sample-count 30, release):
filter_share_dict 4.8 ms median
(vs. ~70 ms estimated for the old recompress path)
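The dict-sharing rebuild can be sketched as follows. `filter_codes` and the fixtures are illustrative, not the crate's API: `dict_bytes` / `dict_offsets` stay untouched, and only the per-row code ranges are rebuilt by walking the mask.

```rust
/// Sketch of the FSST-shape filter: keep the dictionary byte-identical and
/// rebuild codes + codes_offsets for the surviving rows only.
fn filter_codes(
    codes: &[u16],
    codes_offsets: &[u32],
    keep: &[bool],
) -> (Vec<u16>, Vec<u32>) {
    let mut new_codes = Vec::new();
    let mut new_offsets = vec![0u32];
    for (row, &k) in keep.iter().enumerate() {
        if k {
            let lo = codes_offsets[row] as usize;
            let hi = codes_offsets[row + 1] as usize;
            new_codes.extend_from_slice(&codes[lo..hi]);
            new_offsets.push(new_codes.len() as u32);
        }
    }
    (new_codes, new_offsets)
}

fn main() {
    // Three rows with token ranges [0,2), [2,3), [3,5); keep rows 0 and 2.
    let codes = [1u16, 2, 3, 4, 5];
    let offsets = [0u32, 2, 3, 5];
    let (c, o) = filter_codes(&codes, &offsets, &[true, false, true]);
    assert_eq!(c, vec![1, 2, 4, 5]);
    assert_eq!(o, vec![0, 2, 4]);
    println!("ok");
}
```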
Token-aware Eq pushdown (no row decode)
---------------------------------------
New `lpm.rs` greedy longest-prefix-match tokeniser. OnPair's dictionary
is sorted lexicographically, so a 257-entry first-byte index gives
O(1) bucket lookup per byte; the inner loop scans the small bucket
to pick the longest matching dict entry. Two byte strings have equal
LPM token sequences iff they have equal bytes (LPM is deterministic
under the same dict), so `compute/compare.rs::compare(Eq)` LPM-tokenises
the needle once and then for each row compares `codes[lo..hi]` against
the tokenised needle as `&[u16]` — direct slice eq, no decode at all.
If the needle contains a byte that has no dict entry, no row can match
(every row was compressed against the same dict) — we leave the
bitmap zeroed and `NotEq` inverts.
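Once the needle is tokenised against the same dictionary, the per-row check really is a plain slice compare; a minimal sketch (`row_eq` and the fixtures are ours, not the crate's API):

```rust
/// Sketch of the token-aware Eq compare: a row matches iff its code slice
/// equals the LPM-tokenised needle, with no row decode at all.
fn row_eq(codes: &[u16], codes_offsets: &[u32], row: usize, needle_tokens: &[u16]) -> bool {
    let lo = codes_offsets[row] as usize;
    let hi = codes_offsets[row + 1] as usize;
    &codes[lo..hi] == needle_tokens
}

fn main() {
    // Two rows with token sequences [4, 7] and [4, 9].
    let codes = [4u16, 7, 4, 9];
    let codes_offsets = [0u32, 2, 4];
    let needle = [4u16, 7]; // needle LPM-tokenised once, up front
    assert!(row_eq(&codes, &codes_offsets, 0, &needle));
    assert!(!row_eq(&codes, &codes_offsets, 1, &needle));
    println!("ok");
}
```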
Bench (UrlLog 1 M):
eq_constant 6.8 ms median
(mostly OwnedDecodeInputs::collect; the actual token compare is
sub-millisecond)
LIKE pushdown
-------------
* `'literal'` — same token-aware path as Eq.
* `'prefix%'` — byte-streaming via `for_each_dict_slice`. The naive
"tokenise the prefix and compare token prefix"
trick is **wrong** for LIKE: the LPM of the row's
leading bytes may merge tokens past the literal
prefix's boundary. Streaming dict slices and
comparing prefix-wise is the correct minimum-work
option.
* `'%substring%'` — `memchr::memmem::Finder` (SSE2/AVX2 on x86_64,
NEON on aarch64, Two-Way underneath). Built once
per kernel call, reused across every row.
Everything else (escapes, `_`, mid-pattern wildcards,
case-insensitive) returns `None` so the framework decompresses + runs
the scalar `LIKE`.
Bench (UrlLog 1 M):
like_prefix 14.8 ms median
like_contains 36.4 ms median
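The `'%substring%'` kernel shape can be sketched as below. The real code uses `memchr::memmem::Finder` built once per kernel call; here a naive window search stands in so the sketch needs no external crate, and all names are illustrative.

```rust
/// Stand-in for memmem: does haystack contain needle as a byte substring?
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    // windows(0) would panic, so treat the empty needle as always matching.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // The searcher state (here just the needle) is shared across all rows.
    let rows: [&[u8]; 2] = [b"http://a/login", b"http://b/home"];
    let needle = b"login";
    let hits: Vec<bool> = rows.iter().map(|r| contains(r, needle)).collect();
    assert_eq!(hits, vec![true, false]);
    println!("ok");
}
```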
Bench surface
-------------
* New corpus shapes: `UrlLog`, `Short`, `Long`, `HighCard` × 2 row
counts (100 K, 1 M).
* New compute benches: `eq_constant`, `like_prefix`, `like_contains`,
`filter_share_dict`.
Verified
* `cargo test -p vortex-onpair` 19 / 19
* `cargo test -p vortex-btrblocks` 35 / 35
* `cargo test -p vortex-file --features
onpair,tokio --test test_onpair_string_roundtrip` — 5 / 5
* `cargo clippy -p vortex-onpair --all-targets` clean
Signed-off-by: Claude <noreply@anthropic.com>

Summary
This PR introduces a new vortex-onpair encoding that integrates the OnPair short-string compression library into Vortex. OnPair is a dictionary-based compressor optimized for string data; its headline feature here is compressed-domain predicate evaluation: equality, prefix matching (LIKE 'prefix%'), and substring matching (LIKE '%substr%') can be evaluated directly on compressed data without decompression.

Changes
vortex-onpair-sys: Low-level FFI bindings to the OnPair C++ library
* C-ABI shim (onpair_shim.h/cpp) wrapping the OnPair C++ API
* build.rs fetches upstream onpair_cpp via FetchContent (patched to swap boost::unordered_flat_map for std::unordered_map)
* Safe Rust wrapper (Column type) around the C++ OnPairColumn handle

vortex-onpair: Vortex array encoding and compute kernels
* OnPairArray type implementing the VTable trait for Vortex integration
* Compression entry point (onpair_compress) with configurable bit-width (9-16 bits, default 12)
* Canonicalisation to VarBinViewArray via bulk decompression
* CompareKernel: pushdown of Eq / NotEq to compressed-domain equals()
* LikeKernel: pushdown of LIKE patterns to starts_with() and contains()
* CastKernel: nullability-only casts between Utf8/Binary
* FilterKernel, Slice, and Reduce: fall back to canonicalization

Design Notes
The OnPairColumn is lazily reconstructed on first use (e.g., canonicalization or predicate pushdown), keeping clone-only paths cheap.

Testing
https://claude.ai/code/session_01T9bRd6nrSLwGbQE54NrVKd