perf(cubestore): faster repartition (prefetch, per-partition merge, range jobs) by waralexrom · Pull Request #11088 · cube-js/cube

waralexrom · 2026-06-15T18:38:50Z

Summary

Speeds up repartition of an inactive parent's persisted chunks — the dominant cost in large CSV imports. Adds three opt-in levers behind a single strategy selector; default behavior is unchanged.

Changes

Chunk parquet prefetch (CUBESTORE_REPARTITION_PREFETCH_BUDGET): a sequential producer downloads upcoming chunks (anchor last) while the current one is processed, hiding download latency behind processing. A byte-budget semaphore bounds fetched-but-unprocessed data on disk; idempotent/deduped downloads make it race-safe.
Repartition strategy selector (CUBESTORE_REPARTITION_STRATEGY): per_chunk (default), per_partition, range. Unknown value logs a warning and falls back to per_chunk.
per_partition: one job streams the partition's chunks through a k-way merge in groups and splits the sorted stream into the active children at the wal-split limit, producing full-size chunks in one pass (reuses the compaction streaming-writer machinery; no materialization).
range: schedule-time slicing into RepartitionRange jobs, each merging an inclusive [start, end] chunk-id range as one atomic swap on the worker chosen by the hash of its bounds — restoring cross-worker parallelism. Slicing walks all chunks (active + inactive) so ranges stay id-pinned; the range end is job data, not the dedup key, so a late "tail" chunk dedups on its start instead of spawning duplicate jobs. A GC gate keeps an inactive parent's chunks until it fully drains so slicing stays stable.
Merge caps CUBESTORE_REPARTITION_MERGE_MAX_INPUT_FILES (default 50) and CUBESTORE_REPARTITION_MERGE_MAX_ROWS (default 4000000) bound merge fan-in / group size.

Testing

New unit + SQL tests covering each strategy end-to-end (drain + count/sum consistency), range bounding, group yield, empty group, and strategy parsing.
Existing repartition_multi_node_consistency (in-process / cluster / multi-process) verified green with per_partition and range enabled.
cargo check clean on cubestore and cubestore-sql-tests; cargo fmt applied.

Rollout

JobType::RepartitionRange deserializes as an unknown variant on binaries that predate it, so range is only safe alongside the skip-unknown-jobs handling. All strategies default off (per_chunk); enable per-deployment.

claude · 2026-06-15T18:58:40Z

Claude finished @waralexrom's task in 4m 0s —— View job

Code Review — `perf(cubestore): faster repartition`

Verdict: looks solid overall — 0 high, 1 medium, 3 low. Default off (per_chunk) means no blast-radius risk for the baseline path. Posted 4 inline comments on specific concerns.

Details & checklist

Checklist

Gather PR context
Read diff and changed files (9 files, +1739/-57)
Analyze correctness, perf, concurrency
Post inline comments and final review

What's good

The compat reasoning around JobType::RepartitionRange(end) dedup key (end excluded from the index key so tail chunks don't spawn duplicate jobs, fully drained parents leave the keep-set) is well thought out and documented inline.
The range strategy slicing walks active+inactive chunks to keep boundaries id-pinned, combined with the scheduler GC gate — that's the right invariant.
Per-strategy end-to-end SQL tests (repartition_*_keeps_data_consistent) plus the merge / range / empty-group / yield-on-budget unit tests give good coverage of the new paths.
Swap-race tolerance is preserved across all three strategies (loser's chunks GC'd, source-already-inactive treated as drained).
Anchor-last ordering in the merged path preserves the dedup-anchor invariant for partial runs.
oldest_insert_at is carried across the merge — easy to miss, glad it's there.
Rollout note correctly flags that RepartitionRange requires skip-unknown-jobs handling on the binary.

Findings (see inline)

Medium — compute_repartition_children deactivates the table when no active children fall within range; a transient race could nuke a healthy table. (store/mod.rs:1386)
Low — Periodic all_inactive_partitions_to_repartition() call in schedule_chunks_to_delete hot path under range strategy. (scheduler/mod.rs:641)
Low — Prefetch capacity clamped to u32::MAX silently; chunks with no file_size billed at 1 byte — interacts with the documented disk-budget guarantee. (store/mod.rs:980)
Low — No clamp on repartition_merge_max_rows/max_input_files; a typo (=0) silently degrades range to one job per chunk. (cluster/mod.rs:991)

Things considered and dismissed

Semaphore overflow on 32-bit: capacity clamp covers it.
Producer/consumer shutdown on time-budget yield: producer.abort() cleans up correctly; in-flight downloads remain idempotent.
pick_worker_by_ids([start, end]) shifting under tail extension: dedup-on-start prevents incorrect notify; new jobs only created post-drain.
max_rows=0 infinite loop in scheduler: the i += 1 happens before the break check, so always ≥1 chunk per range. Not a livelock, just a degenerate config (still worth clamping — see Low finding).
merge_chunk_group_into_children temp paths: per-call random salt prevents collisions across groups and across reruns.

· branch: `cubestore-chunk-repartition-speed-up`

claude · 2026-06-15T19:19:00Z

Claude finished @waralexrom's task in 1m 0s —— View job

Code Review — follow-up on `2f67894` (review-fixes pass)

Verdict: all four prior findings addressed. No new issues. Resolved my four open threads.

Details

Checklist

Re-read diff vs origin/master (focus: commit 2f67894)
Verify each prior finding is addressed
Scan for any new regressions introduced by the hardening
Resolve stale review threads

Fix-by-fix verification

Prior finding	Fix in `2f67894`	Verdict
Medium — `compute_repartition_children` deactivates table on empty children (race)	`store/mod.rs:1150` — early `if children.is_empty()` returns `CubeError::internal(...)` for scheduler retry; tiling check only runs when children are present	✅ Correct. Deactivation now only fires on the genuine "children present but don't tile" corruption case.
Low — Periodic `all_inactive_partitions_to_repartition()` in GC hot path	`scheduler/mod.rs` — new `draining_repartition_parents()` with 5s-TTL `Mutex<Option<(Instant, Arc<HashSet<u64>>)>>` memo. Inline comment explains staleness is bounded by `not_used_timeout`.	✅ Correct. Cache is per-scheduler; a thundering-herd refresh would just do redundant metastore reads (harmless).
Low — Prefetch bills unknown `file_size` at 1 byte	`store/mod.rs:1006` — `.unwrap_or(capacity as u64)` so an unknown chunk takes the full budget and serializes against any other.	✅ Correct, and the right side of conservative — degrades to fully serialized prefetch on missing metadata rather than overshooting the on-disk budget.
Low — No clamp on `max_rows` / `max_files` (0 silently degrades)	`cluster/mod.rs:957` — both clamped with `.max(1)` and an inline note that the sane range is `>= 2`.	✅ Correct. Loop now always advances; the comment correctly flags that 1 still degrades quality (just doesn't break).

Things considered and dismissed

Cache refresh race: two concurrent callers can both miss and both refresh — extra metastore call, no correctness impact.
Unknown-file_size serialization cost: a missing file_size blocks all other prefetches; in practice persisted chunks have a size, so this is a safety net, not a hot path.
children.is_empty() retry loop: if children are genuinely missing (not a race), the job will retry forever — but the scheduler's existing job-failure backoff handles this, and the alternative (deactivating a healthy table) is strictly worse.

· branch: `cubestore-chunk-repartition-speed-up`

codecov · 2026-06-15T19:38:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.50%. Comparing base (87ae095) to head (c1dafa6).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #11088   +/-   ##
=======================================
  Coverage   58.50%   58.50%           
=======================================
  Files         216      216           
  Lines       17270    17270           
  Branches     3524     3524           
=======================================
  Hits        10103    10103           
  Misses       6652     6652           
  Partials      515      515

Flag	Coverage Δ
cube-backend	`58.50% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

claude · 2026-06-15T20:36:16Z

Claude finished @waralexrom's task in 50s —— View job

Code Review — follow-up on `bde556e` (per_partition row cap + dedup docs)

Verdict: clean follow-up — no new findings. Aligns per_partition grouping with the range strategy's slicing rules and documents the merge dedup semantics. All four prior threads remain resolved.

Details

Checklist

Read git diff 2f67894..HEAD
Verify per_partition group-cut matches range-job slicing
Check loop terminates and respects the .max(1) floor
Sanity-check the dedup commentary against merge_chunks
Confirm prior threads are still resolved

What changed

repartition_partition_chunks_merged (store/mod.rs:1099) now cuts a group when rows ≥ max_rows OR group.len() ≥ max_group, instead of slicing purely by chunk count. This matches the range strategy slicer and ensures a fat-chunk group doesn't blow past the row cap.
Loop advances i += 1 before the break check, so a misconfigured max_rows=0 (already clamped to ≥1 by PRRT_kwDOCOH19M6JrcDq) still progresses one chunk per group — same degenerate-but-safe behavior as the range path.
Anchor-last invariant preserved: chunks.sort_by_key((id == anchor, id)) still places the anchor last, so it lands in the final group and stays active until the run completes or yields on the time budget. ✅
New comment on merge_chunks invocation (store/mod.rs:1263) documents that the dedup is identical to compaction's: for unique-key tables the sort key ends with seq, so the latest wins; group order only decides ties on identical (unique key, seq) rows that are duplicates anyway. Matches the merge implementation behavior I traced in the prior review. ✅

Things considered and dismissed

Row-cap vs. anchor: since anchor is last and groups are cut greedily forward, the anchor can end up alone in its own final group if the preceding chunk just barely tipped a group over max_rows. That's fine — single-chunk merge is degenerate but correct, and the anchor still gets its "active until run end" treatment.
Group-local dedup vs. cross-group dedup: a key that appears in chunk A and chunk Z (different groups) won't dedup within this run, but downstream query-time merge and compaction re-apply it — same property as the per-chunk path. The new comment makes this explicit.
max_rows.max(1) here when config also clamps: defense-in-depth, harmless.

· branch: `cubestore-chunk-repartition-speed-up`

claude · 2026-06-15T20:55:43Z

Claude finished @waralexrom's task in 37s —— View job

Code Review — follow-up on `ae73d0b` (pin per-chunk unit tests)

Verdict: trivial test hardening — no findings. All four prior threads remain resolved.

Details

Checklist

Read git show ae73d0b
Verify test pinning doesn't change production behavior
Confirm prior review threads stay resolved

What changed

repartition_partition_chunks_yields_on_budget and repartition_partition_chunks_prefetch_drains now explicitly set RepartitionStrategy::PerChunk on their test configs (store/mod.rs:1830, store/mod.rs:2005).
Mirrors the existing pattern used by the merge/range tests — keeps these per-chunk-path assertions valid even if the default strategy is flipped later.
No production code touched.

· branch: `cubestore-chunk-repartition-speed-up`

Add an opt-in sequential prefetch to the batch repartition loop. A producer downloads upcoming chunk parquets (anchor last, others by ascending id) into the local cache while the consumer repartitions the current chunk, so the download latency is hidden behind processing. A byte-budget semaphore bounds how much fetched-but-unprocessed data sits on local disk: each chunk holds permits worth its file size from before its download until the consumer finishes it. Chunks prefetched past the time budget stay local and warm the follow-up job, which lands on the same node by partition. download_file is idempotent and dedups in-flight downloads, so the consumer's repartition_chunk just hits the warm file. The producer makes no metastore calls; the chunk list is still read once. Controlled by CUBESTORE_REPARTITION_PREFETCH_BUDGET (accepts size suffixes, e.g. 512MB); None (unset) or 0 disables prefetching. Added env_optparse_size to keep None distinct from an explicit value.

Add an opt-in merge path for repartitioning an inactive parent's persisted chunks. Instead of splitting each chunk independently into the active children, the parent's chunks are merged k-way (in groups of up to a configurable cap) and the sorted stream is split directly into the children at the wal-split limit in one streaming pass, producing full-size chunks and avoiding the per-chunk fragment fan-out plus the compaction that would later merge them. Each group commits with a single atomic swap_chunks; a group whose source was already repartitioned by a racing job is skipped (its new chunks stay inactive and are GC'd). A fully empty group deactivates its sources directly. Children must tile the parent's range exactly, otherwise the table is deactivated as on the per-chunk path. The anchor is processed last so it keeps holding the job dedup key until the run finishes or yields on the time budget. Reuses the compaction streaming writer machinery via a new write_chunks_split_into_children that cuts files on child boundaries and the row-count limit. Controlled by CUBESTORE_REPARTITION_MERGE_MAX_INPUT_FILES (Option<usize>); None / 0 / 1 disable it, Some(m >= 2) caps each group at m input chunks. Default off.

…ector Add a third repartition strategy that slices an inactive parent's persisted chunks into RepartitionRange jobs at schedule time. Slicing walks all chunks (active and inactive) ordered by id and cuts a range once it reaches the row or chunk-count cap, so the [start, end] bounds stay pinned to chunk ids and a re-slice reproduces them; the end is carried as job data, not the dedup key, so a tail that extends the trailing range dedups on its start instead of spawning a second job. Each range runs as one atomic swap on the worker chosen by the hash of its bounds, restoring cross-worker parallelism. A GC gate keeps an inactive parent's chunks until it fully drains so slicing stays stable. Replace the ad-hoc flags with a single CUBESTORE_REPARTITION_STRATEGY selector (per_chunk default, per_partition, range); an unknown value logs a warning and falls back to per_chunk. The merge caps (CUBESTORE_REPARTITION_MERGE_MAX_INPUT_FILES default 50, CUBESTORE_REPARTITION_MERGE_MAX_ROWS default 4000000) become plain caps with defaults. The per-partition merge core is shared between the in-job loop and the range handler. JobType::RepartitionRange deserializes as an unknown variant on binaries that predate it, so it is only safe behind the skip-unknown-jobs handling; enable the strategy per-deployment.

- compute_repartition_children: an empty children set is treated as a transient topology read and returns an error to retry, instead of deactivating the table; deactivation now only happens when children are present but do not tile the parent's range (genuine corruption). - Cache the draining-parents set (inactive parents with active chunks) with a short TTL so the GC loop does not re-scan the chunk table every cycle when the range strategy is enabled. - Prefetch: bill a chunk with no recorded file_size at the full budget instead of ~free, so an unknown-size fetch cannot overshoot the on-disk budget. - Clamp the range slicing caps to >= 1 so a misconfigured 0 does not break the slicing loop before adding a chunk.

…e dedup - repartition_partition_chunks_merged now groups chunks by max_rows OR max_input_files (whichever is hit first), matching how the range strategy slices its jobs, instead of grouping purely by chunk count. - Comment the last-row/aggregate dedup in merge_chunk_group_into_children: it is the same merge_chunks compaction uses; for unique-key tables the sort key ends with the seq column so the latest version wins, and group order only breaks ties between rows with an identical (unique key, seq).

… strategy repartition_partition_chunks_yields_on_budget and repartition_partition_chunks_prefetch_drains assert per-chunk-path behavior, so set the strategy explicitly instead of inheriting the default — keeps them valid regardless of the configured strategy (mirrors how the merge/range tests pin theirs).

…urrent download in merge Remove CUBESTORE_BATCH_REPARTITION / batch_repartition_enabled. The repartition strategy (per_chunk / per_partition / range) is now the sole selector: - per_chunk: one job per chunk -> repartition_chunk - per_partition: one anchor job per partition -> merge - range: RepartitionRange jobs -> merge The job handler and scheduler dispatch on the strategy directly. The dead per_chunk+batch hybrid (per-chunk loop and its producer/consumer prefetch) is gone, along with its two now-obsolete tests. Prefetch is reworked: drop the byte-budget producer/consumer and CUBESTORE_REPARTITION_PREFETCH_BUDGET; add a plain bool CUBESTORE_REPARTITION_CONCURRENT_DOWNLOAD (default off) that downloads a merge group's chunk parquets concurrently before building the merge inputs. It applies to both per_partition and range; the group is already bounded by repartition_merge_max_input_files and the pool by download_concurrency, so no extra budget is needed.

claude · 2026-06-15T22:08:52Z

Claude finished @waralexrom's task in 1m 40s —— View job

Code Review — follow-up on `1ed20eb` (collapse batch flag into strategy selector, concurrent download)

Verdict: clean refactor — no findings. All four prior review threads remain resolved.

Details

Checklist

Read git diff 57f4c8f..HEAD (commit 1ed20eb)
Verify the per-chunk batch hybrid is removed cleanly
Trace the new concurrent-download path in the merge group build
Re-verify dispatch and config consistency
Confirm prior threads still resolved

What changed

Removed CUBESTORE_BATCH_REPARTITION / batch_repartition_enabled and the per-chunk batch loop (repartition_chunk_tolerant, repartition_partition_chunks_prefetched) plus their two tests. The strategy selector is now the only dispatch — PerChunk → repartition_chunk per job, PerPartition → merge path keyed on the anchor chunk, Range → RepartitionRange jobs.
Removed CUBESTORE_REPARTITION_PREFETCH_BUDGET and the byte-budget producer/consumer semaphore.
Added CUBESTORE_REPARTITION_CONCURRENT_DOWNLOAD (bool, default off): in merge_chunk_group_into_children (store/mod.rs:1087), spawns one task per chunk in the group to warm the local cache concurrently before chunk_exec builds the merge inputs. Bounded by repartition_merge_max_input_files (group size) and download_concurrency (download pool). Idempotent — a failure here is swallowed and re-surfaces in chunk_exec.
Job dispatcher (job_processor.rs:222) and scheduler (cluster/mod.rs:1005) both switched from batch_repartition_enabled() to repartition_strategy() == PerPartition; comments updated to reflect that the anchor-chunk overload is for cross-channel compat, not the dead hybrid path.

Why this looks right

Backwards-compat reasoning still holds. PerPartition still reuses RepartitionChunk keyed on the smallest persisted chunk id; an older binary on the same job shard just repartitions that one anchor chunk and drains the rest via its own per-chunk path. The justification comments are tightened, not weakened.
Concurrent download is correctly idempotent. download_file dedups in-flight fetches; the for d in downloads { let _ = d.await; } drain awaits all spawned tasks before chunk_exec runs, so a partial failure here just means chunk_exec re-issues the download against the same cache and reports a real error then.
No new race surface. If the calling future is cancelled mid-merge, the spawned downloads keep running but only warm the cache — no metastore writes, no on-disk dangling state beyond what was there before.
Group size already bounds on-disk overshoot. Previous prefetch path bounded fetched-but-unprocessed bytes explicitly; the new path implicitly bounds them at max_input_files × chunk_size per group, but the group is processed immediately after download, so chunks don't sit unconsumed. The author's "no extra budget needed" claim checks out.
SQL test renamed and re-tuned. repartition_concurrent_download_keeps_data_consistent (sql/mod.rs:3619) exercises the new path end-to-end with max_input_files = 4, giving real concurrent downloads instead of the removed prefetch producer/consumer path.

Things considered and dismissed

Lost log::debug! on download failure. The previous prefetch path logged "Prefetch of chunk {} failed: {}"; the new path swallows silently. Minor observability loss, but chunk_exec will log/return the real error when it actually needs the file, so genuine failures aren't hidden.
Two tests deleted. repartition_partition_chunks_yields_on_budget and repartition_partition_chunks_prefetch_drains covered the per-chunk batch loop and the prefetch producer/consumer — both code paths are gone, so the tests are correctly removed. Yield/drain semantics for PerPartition are now covered by repartition_merge_drains_and_yields and the SQL drain test.
Unbounded cube_ext::spawn per group. Caps at max_input_files (default 50) per merge group; combined with download_concurrency throttling inside download_file, the parallelism stays bounded. Fine.
Config rename in test harness. sql/mod.rs:3608 flipped batch_repartition_enabled = false to repartition_strategy = PerChunk (same intent, new field). Consistent.

Prior threads

All four remain resolved — nothing in 1ed20eb reopens them. Re-affirmed by re-reading the thread bodies against the current diff.

· branch: `cubestore-chunk-repartition-speed-up`

The merge-based repartition path (per_partition / range strategies) commits its new chunks with swap_chunks, which enforces that the activated row count equals the deactivated row count. merge_chunks aggregates (aggregate indexes) and last-row-dedups (unique-key tables) the source group, so it legitimately emits fewer rows than it consumed. The checked swap then rejected the commit with "Deactivated row count (..) doesn't match activated row count (..) during swap", failing RepartitionRange / per-partition jobs. Commit with swap_chunks_without_check instead, matching how compaction commits its dedup'd merges. Adds a unit test on an aggregate index that reproduces the exact failure before the fix. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>

…dedup Adds range- and per-partition-strategy SQL tests on an aggregate-index table whose chunks share dimension keys across inserts. The repartition merge groups those rows by the sort key, so the swap activates fewer rows than it deactivates - the production RepartitionRange row-count-mismatch scenario. Without the unchecked-swap fix the repartition jobs never drain the inactive parents; with it the data stays consistent (sum(m) and per-key sums conserved). Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>

…apses rows The merge-based repartition path committed every group with the unchecked swap. That is required only for aggregate indexes and unique-key tables, where merge_chunks aggregates / last-row-dedups and the activated row count is legitimately smaller. For a plain regular index the merge is a pure sort-merge that conserves rows, so commit it through the checked swap_chunks to keep the row-count integrity guard.

claude · 2026-06-16T12:57:18Z

Claude finished @waralexrom's task in 1m 12s —— View job

Code Review — follow-up on `2604a5c`, `c48785a`, `39a6d3d` (unchecked-swap fix gated by row-collapse)

Verdict: clean correctness fix — no new findings. All four prior threads remain resolved.

Details

Checklist

Read git diff 1ed20eb..HEAD
Verify the swap-gate condition matches merge_chunks's collapse semantics
Confirm the regular-index integrity guard is preserved
Sanity-check the new tests reproduce the original failure
Re-verify prior threads stay resolved

What changed

merge_chunk_group_into_children (store/mod.rs:1082) now derives merge_collapses_rows = unique_key.is_some() || aggregate_columns.is_some() and routes the commit through swap_chunks_without_check only in that case; regular indexes still go through the checked swap_chunks (store/mod.rs:1227).
2604a5c first switched the whole path to the unchecked swap with a unit-test repro on an aggregate index (20 rows → 10 grouped). 39a6d3d then tightened it back to a gated decision so a regular sort-merge still gets the row-count integrity guard.
c48785a adds two end-to-end SQL tests (repartition_range_jobs_aggregate_index_keeps_data_consistent, repartition_merge_aggregate_index_keeps_data_consistent) that cycle g over 0..40 across 200 inserts. Without the fix the inactive parents never drain; with it sum(m)=200 and the per-g aggregation is conserved across the split.

Why this looks right

The gate exactly mirrors merge_chunks's own collapse conditions: aggregate indexes group by sort key, unique-key tables last-row-dedup on (unique_key, seq). Both legitimately emit fewer rows than they consumed. A plain regular index is a pure ordered merge — row-conservation is a real invariant there and the checked swap should fire if anything ever breaks it.
The compaction path uses the same swap_chunks_without_check for the same reason; the inline comment makes the parallel explicit.
Drain-and-verify SQL tests exercise the production code path (range scheduler → RepartitionRange job → merge → swap) on a table whose chunks share dimension keys, which is the exact shape the field bug took.
Both new tests use small thresholds (partition_split_threshold=20, repartition_merge_max_rows=40 / max_input_files=4) so the assertion isn't trivially passed by the merge collapsing to a single group — active.len() > 1 enforces that an actual split happens.

Things considered and dismissed

Could a unique-key table merge ever be row-conserving (no duplicate (unique_key, seq))? Yes, but the gate is by table type, not by observed row count, which is the right call — we don't want a flaky guard that fires only on specific data shapes.
Regular index + aggregate columns on the same table. aggregate_columns only resolves to Some for IndexType::Aggregate, so a regular index over an aggregated table still gets the checked path. Correct.
The replay-handle merge runs before the gate decision. Same as before this PR; unrelated to the swap-check choice.
Test coverage for the regular-index path still relies on the checked swap. The original repartition_keeps_data_consistent and the per-strategy variants already cover that. No gap.

Prior threads

All four resolved; nothing in 2604a5c..HEAD reopens them.

· branch: `cubestore-chunk-repartition-speed-up` ·

compute_repartition_children deactivated the table as corrupt whenever the inactive parent's active children did not tile its range exactly. That check compared data-extent bounds (min_val/max_val), but a split sets the first child's lower bound to its first segment's data min rather than the parent's lower bound, so for normal multi-level topologies the children legitimately leave gaps and the outer edges don't match the parent. The check fired in production on RepartitionRange jobs and deactivated healthy pre-aggregation tables. The streaming split never drops rows (the first child is the low catch-all, the last the high one) and the legacy per-chunk path performed no such check, so the strict tiling guard was wrong. Replace it with an optional non-overlap check, gated behind CUBESTORE_REPARTITION_CHECK_OVERLAPPING_CHILDREN and off by default: gaps are benign, only overlapping children are genuine corruption.

claude · 2026-06-16T18:00:52Z

Claude finished @waralexrom's task in 1m 5s —— View job

Code Review — follow-up on `c1dafa6` (don't deactivate on benign child gaps)

Verdict: clean correctness fix — no new findings. All four prior threads remain resolved.

Details

Checklist

Read git show c1dafa6
Verify the rationale for replacing the tiling check matches partition-bound semantics
Confirm the overlap check (off by default) doesn't reintroduce the false-positive surface
Re-verify prior threads stay resolved

What changed

compute_repartition_children (store/mod.rs:1003) keeps the children.is_empty() early-return-as-retry from the earlier hardening pass.
The strict "children must tile parent's range exactly" guard is gone. It compared partition.min_val/max_val (data extent, not range bounds) against child bounds, which legitimately don't coincide after a split — the first child's min_val is its data min, not the parent's lower bound. In multi-level topologies the children correctly leave gaps at the outer edges, and the old check was deactivating healthy pre-aggregation tables in production under RepartitionRange.
Replaced with an optional non-overlap-only check, gated by CUBESTORE_REPARTITION_CHECK_OVERLAPPING_CHILDREN, off by default. Overlap (a child's range starts before the previous child's range ends) is the only condition where a row could legitimately go to two children — that is real corruption. Gaps are not.
Config plumbed through ConfigObj::repartition_check_overlapping_children() consistently.

Why this looks right

The commit's reasoning checks out: Partition::min_val/max_val track data extents, not range bounds (range bounds live on the partition's place in the index ordering). Tile-equality against those extents was structurally wrong for the upper/lower edges and after sparse data, regardless of strategy.
The streaming split is row-conserving by construction (first child = low catch-all, last = high catch-all), so the guard never protected against row loss anyway — it was a metadata sanity check masquerading as a correctness invariant. The legacy per-chunk path didn't have it.
The new overlap check is narrowly scoped: only fires on a genuine ordering violation, and is opt-in so an operator who wants the assertion can turn it on without imposing it as a default.
The (None, _) | (_, None) interior-bound case being treated as overlap is the right call — an open bound mid-ordering is itself a metadata inconsistency.

Things considered and dismissed

Should the overlap check be on-by-default in test harnesses? The fakes in config/mod.rs set it to false, matching prod. That's fine — the existing per-strategy SQL tests already exercise normal topologies; adding the overlap assertion to tests would just risk re-introducing flakiness on legitimate multi-level layouts.
Could a real corruption (true tiling violation that isn't an overlap) now slip through silently? Yes, but it was always going to slip through the per-chunk path, and the streaming split still doesn't lose rows. The check was load-bearing only against a class of metadata bugs that the codebase doesn't have evidence of in practice; the false-positive cost was concrete.
Naming. CUBESTORE_REPARTITION_CHECK_OVERLAPPING_CHILDREN=true reads as "do the check," not "fail on overlap" — that's correct and consistent with similar env flags in the file.

Prior threads

All four remain resolved; nothing in c1dafa6 reopens them.

· branch: `cubestore-chunk-repartition-speed-up`

waralexrom requested a review from a team as a code owner June 15, 2026 18:38

github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code labels Jun 15, 2026

vercel Bot deployed to Preview June 15, 2026 18:40 View deployment

waralexrom force-pushed the cubestore-chunk-repartition-speed-up branch from b09e8e6 to 36caefe Compare June 15, 2026 18:58

vercel Bot deployed to Preview June 15, 2026 19:00 View deployment

claude Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread rust/cubestore/cubestore/src/store/mod.rs

claude Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread rust/cubestore/cubestore/src/scheduler/mod.rs Outdated

claude Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread rust/cubestore/cubestore/src/store/mod.rs Outdated

claude Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread rust/cubestore/cubestore/src/cluster/mod.rs

waralexrom added 7 commits June 16, 2026 00:03

waralexrom force-pushed the cubestore-chunk-repartition-speed-up branch from 18b967e to 1ed20eb Compare June 15, 2026 22:08

vercel Bot deployed to Preview June 15, 2026 22:10 View deployment

paveltiunov mentioned this pull request Jun 16, 2026

fix(cubestore): repartition merge commits with unchecked swap (fixes RepartitionRange row-count regression) #11091

Draft

cursoragent and others added 3 commits June 16, 2026 14:50

waralexrom mentioned this pull request Jun 16, 2026

perf(cubestore): reduce metastore RPC fan-out during partitioning #11095

Open

paveltiunov approved these changes Jun 17, 2026

View reviewed changes

Conversation

waralexrom commented Jun 15, 2026

Summary

Changes

Testing

Rollout

Uh oh!

claude Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — perf(cubestore): faster repartition

Checklist

What's good

Findings (see inline)

Things considered and dismissed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — follow-up on 2f67894 (review-fixes pass)

Checklist

Fix-by-fix verification

Things considered and dismissed

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — follow-up on bde556e (per_partition row cap + dedup docs)

Checklist

What changed

Things considered and dismissed

Uh oh!

claude Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — follow-up on ae73d0b (pin per-chunk unit tests)

Checklist

What changed

Uh oh!

claude Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — follow-up on 1ed20eb (collapse batch flag into strategy selector, concurrent download)

Checklist

What changed

Why this looks right

Things considered and dismissed

Prior threads

Uh oh!

claude Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — follow-up on 2604a5c, c48785a, 39a6d3d (unchecked-swap fix gated by row-collapse)

Checklist

What changed

Why this looks right

Things considered and dismissed

Prior threads

Uh oh!

claude Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — follow-up on c1dafa6 (don't deactivate on benign child gaps)

Checklist

What changed

Why this looks right

Things considered and dismissed

Prior threads

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude Bot commented Jun 15, 2026 •

edited

Loading

Code Review — `perf(cubestore): faster repartition`

claude Bot commented Jun 15, 2026 •

edited

Loading

Code Review — follow-up on `2f67894` (review-fixes pass)

codecov Bot commented Jun 15, 2026 •

edited

Loading

claude Bot commented Jun 15, 2026 •

edited

Loading

Code Review — follow-up on `bde556e` (per_partition row cap + dedup docs)

claude Bot commented Jun 15, 2026 •

edited

Loading

Code Review — follow-up on `ae73d0b` (pin per-chunk unit tests)

claude Bot commented Jun 15, 2026 •

edited

Loading

Code Review — follow-up on `1ed20eb` (collapse batch flag into strategy selector, concurrent download)

claude Bot commented Jun 16, 2026 •

edited

Loading

Code Review — follow-up on `2604a5c`, `c48785a`, `39a6d3d` (unchecked-swap fix gated by row-collapse)

claude Bot commented Jun 16, 2026 •

edited

Loading

Code Review — follow-up on `c1dafa6` (don't deactivate on benign child gaps)