Skip to content

perf(cubestore): faster repartition (prefetch, per-partition merge, range jobs)#11088

Open
waralexrom wants to merge 11 commits into
masterfrom
cubestore-chunk-repartition-speed-up
Open

perf(cubestore): faster repartition (prefetch, per-partition merge, range jobs)#11088
waralexrom wants to merge 11 commits into
masterfrom
cubestore-chunk-repartition-speed-up

Conversation

@waralexrom

Copy link
Copy Markdown
Member

Summary

Speeds up repartition of an inactive parent's persisted chunks — the dominant cost in large CSV imports. Adds three opt-in levers behind a single strategy selector; default behavior is unchanged.

Changes

  • Chunk parquet prefetch (CUBESTORE_REPARTITION_PREFETCH_BUDGET): a sequential producer downloads upcoming chunks (anchor last) while the current one is processed, hiding download latency behind processing. A byte-budget semaphore bounds fetched-but-unprocessed data on disk; idempotent/deduped downloads make it race-safe.
  • Repartition strategy selector (CUBESTORE_REPARTITION_STRATEGY): per_chunk (default), per_partition, range. Unknown value logs a warning and falls back to per_chunk.
  • per_partition: one job streams the partition's chunks through a k-way merge in groups and splits the sorted stream into the active children at the wal-split limit, producing full-size chunks in one pass (reuses the compaction streaming-writer machinery; no materialization).
  • range: schedule-time slicing into RepartitionRange jobs, each merging an inclusive [start, end] chunk-id range as one atomic swap on the worker chosen by the hash of its bounds — restoring cross-worker parallelism. Slicing walks all chunks (active + inactive) so ranges stay id-pinned; the range end is job data, not the dedup key, so a late "tail" chunk dedups on its start instead of spawning duplicate jobs. A GC gate keeps an inactive parent's chunks until it fully drains so slicing stays stable.
  • Merge caps CUBESTORE_REPARTITION_MERGE_MAX_INPUT_FILES (default 50) and CUBESTORE_REPARTITION_MERGE_MAX_ROWS (default 4000000) bound merge fan-in / group size.

Testing

  • New unit + SQL tests covering each strategy end-to-end (drain + count/sum consistency), range bounding, group yield, empty group, and strategy parsing.
  • Existing repartition_multi_node_consistency (in-process / cluster / multi-process) verified green with per_partition and range enabled.
  • cargo check clean on cubestore and cubestore-sql-tests; cargo fmt applied.

Rollout

JobType::RepartitionRange deserializes as an unknown variant on binaries that predate it, so range is only safe alongside the skip-unknown-jobs handling. All strategies default off (per_chunk); enable per-deployment.

@waralexrom waralexrom requested a review from a team as a code owner June 15, 2026 18:38
@github-actions github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code labels Jun 15, 2026
@waralexrom waralexrom force-pushed the cubestore-chunk-repartition-speed-up branch from b09e8e6 to 36caefe Compare June 15, 2026 18:58
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 4m 0s —— View job


Code Review — perf(cubestore): faster repartition

Verdict: looks solid overall — 0 high, 1 medium, 3 low. Default off (per_chunk) means no blast-radius risk for the baseline path. Posted 4 inline comments on specific concerns.

Details & checklist

Checklist

  • Gather PR context
  • Read diff and changed files (9 files, +1739/-57)
  • Analyze correctness, perf, concurrency
  • Post inline comments and final review

What's good

  • The compat reasoning around JobType::RepartitionRange(end) dedup key (end excluded from the index key so tail chunks don't spawn duplicate jobs, fully drained parents leave the keep-set) is well thought out and documented inline.
  • The range strategy slicing walks active+inactive chunks to keep boundaries id-pinned, combined with the scheduler GC gate — that's the right invariant.
  • Per-strategy end-to-end SQL tests (repartition_*_keeps_data_consistent) plus the merge / range / empty-group / yield-on-budget unit tests give good coverage of the new paths.
  • Swap-race tolerance is preserved across all three strategies (loser's chunks GC'd, source-already-inactive treated as drained).
  • Anchor-last ordering in the merged path preserves the dedup-anchor invariant for partial runs.
  • oldest_insert_at is carried across the merge — easy to miss, glad it's there.
  • Rollout note correctly flags that RepartitionRange requires skip-unknown-jobs handling on the binary.

Findings (see inline)

  • Mediumcompute_repartition_children deactivates the table when no active children fall within range; a transient race could nuke a healthy table. (store/mod.rs:1386)
  • Low — Periodic all_inactive_partitions_to_repartition() call in schedule_chunks_to_delete hot path under range strategy. (scheduler/mod.rs:641)
  • Low — Prefetch capacity clamped to u32::MAX silently; chunks with no file_size billed at 1 byte — interacts with the documented disk-budget guarantee. (store/mod.rs:980)
  • Low — No clamp on repartition_merge_max_rows/max_input_files; a typo (=0) silently degrades range to one job per chunk. (cluster/mod.rs:991)

Things considered and dismissed

  • Semaphore overflow on 32-bit: capacity clamp covers it.
  • Producer/consumer shutdown on time-budget yield: producer.abort() cleans up correctly; in-flight downloads remain idempotent.
  • pick_worker_by_ids([start, end]) shifting under tail extension: dedup-on-start prevents incorrect notify; new jobs only created post-drain.
  • max_rows=0 infinite loop in scheduler: the i += 1 happens before the break check, so always ≥1 chunk per range. Not a livelock, just a degenerate config (still worth clamping — see Low finding).
  • merge_chunk_group_into_children temp paths: per-call random salt prevents collisions across groups and across reruns.
· branch: `cubestore-chunk-repartition-speed-up`

Comment thread rust/cubestore/cubestore/src/store/mod.rs
Comment thread rust/cubestore/cubestore/src/scheduler/mod.rs Outdated
Comment thread rust/cubestore/cubestore/src/store/mod.rs Outdated
Comment thread rust/cubestore/cubestore/src/cluster/mod.rs
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 0s —— View job


Code Review — follow-up on 2f67894 (review-fixes pass)

Verdict: all four prior findings addressed. No new issues. Resolved my four open threads.

Details

Checklist

  • Re-read diff vs origin/master (focus: commit 2f67894)
  • Verify each prior finding is addressed
  • Scan for any new regressions introduced by the hardening
  • Resolve stale review threads

Fix-by-fix verification

Prior finding Fix in 2f67894 Verdict
Mediumcompute_repartition_children deactivates table on empty children (race) store/mod.rs:1150 — early if children.is_empty() returns CubeError::internal(...) for scheduler retry; tiling check only runs when children are present ✅ Correct. Deactivation now only fires on the genuine "children present but don't tile" corruption case.
Low — Periodic all_inactive_partitions_to_repartition() in GC hot path scheduler/mod.rs — new draining_repartition_parents() with 5s-TTL Mutex<Option<(Instant, Arc<HashSet<u64>>)>> memo. Inline comment explains staleness is bounded by not_used_timeout. ✅ Correct. Cache is per-scheduler; a thundering-herd refresh would just do redundant metastore reads (harmless).
Low — Prefetch bills unknown file_size at 1 byte store/mod.rs:1006.unwrap_or(capacity as u64) so an unknown chunk takes the full budget and serializes against any other. ✅ Correct, and the right side of conservative — degrades to fully serialized prefetch on missing metadata rather than overshooting the on-disk budget.
Low — No clamp on max_rows / max_files (0 silently degrades) cluster/mod.rs:957 — both clamped with .max(1) and an inline note that the sane range is >= 2. ✅ Correct. Loop now always advances; the comment correctly flags that 1 still degrades quality (just doesn't break).

Things considered and dismissed

  • Cache refresh race: two concurrent callers can both miss and both refresh — extra metastore call, no correctness impact.
  • Unknown-file_size serialization cost: a missing file_size blocks all other prefetches; in practice persisted chunks have a size, so this is a safety net, not a hot path.
  • children.is_empty() retry loop: if children are genuinely missing (not a race), the job will retry forever — but the scheduler's existing job-failure backoff handles this, and the alternative (deactivating a healthy table) is strictly worse.
· branch: `cubestore-chunk-repartition-speed-up`

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.50%. Comparing base (87ae095) to head (c1dafa6).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #11088   +/-   ##
=======================================
  Coverage   58.50%   58.50%           
=======================================
  Files         216      216           
  Lines       17270    17270           
  Branches     3524     3524           
=======================================
  Hits        10103    10103           
  Misses       6652     6652           
  Partials      515      515           
Flag Coverage Δ
cube-backend 58.50% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 50s —— View job


Code Review — follow-up on bde556e (per_partition row cap + dedup docs)

Verdict: clean follow-up — no new findings. Aligns per_partition grouping with the range strategy's slicing rules and documents the merge dedup semantics. All four prior threads remain resolved.

Details

Checklist

  • Read git diff 2f67894..HEAD
  • Verify per_partition group-cut matches range-job slicing
  • Check loop terminates and respects the .max(1) floor
  • Sanity-check the dedup commentary against merge_chunks
  • Confirm prior threads are still resolved

What changed

  • repartition_partition_chunks_merged (store/mod.rs:1099) now cuts a group when rows ≥ max_rows OR group.len() ≥ max_group, instead of slicing purely by chunk count. This matches the range strategy slicer and ensures a fat-chunk group doesn't blow past the row cap.
  • Loop advances i += 1 before the break check, so a misconfigured max_rows=0 (already clamped to ≥1 by PRRT_kwDOCOH19M6JrcDq) still progresses one chunk per group — same degenerate-but-safe behavior as the range path.
  • Anchor-last invariant preserved: chunks.sort_by_key((id == anchor, id)) still places the anchor last, so it lands in the final group and stays active until the run completes or yields on the time budget. ✅
  • New comment on merge_chunks invocation (store/mod.rs:1263) documents that the dedup is identical to compaction's: for unique-key tables the sort key ends with seq, so the latest wins; group order only decides ties on identical (unique key, seq) rows that are duplicates anyway. Matches the merge implementation behavior I traced in the prior review. ✅

Things considered and dismissed

  • Row-cap vs. anchor: since anchor is last and groups are cut greedily forward, the anchor can end up alone in its own final group if the preceding chunk just barely tipped a group over max_rows. That's fine — single-chunk merge is degenerate but correct, and the anchor still gets its "active until run end" treatment.
  • Group-local dedup vs. cross-group dedup: a key that appears in chunk A and chunk Z (different groups) won't dedup within this run, but downstream query-time merge and compaction re-apply it — same property as the per-chunk path. The new comment makes this explicit.
  • max_rows.max(1) here when config also clamps: defense-in-depth, harmless.
· branch: `cubestore-chunk-repartition-speed-up`

@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 37s —— View job


Code Review — follow-up on ae73d0b (pin per-chunk unit tests)

Verdict: trivial test hardening — no findings. All four prior threads remain resolved.

Details

Checklist

  • Read git show ae73d0b
  • Verify test pinning doesn't change production behavior
  • Confirm prior review threads stay resolved

What changed

  • repartition_partition_chunks_yields_on_budget and repartition_partition_chunks_prefetch_drains now explicitly set RepartitionStrategy::PerChunk on their test configs (store/mod.rs:1830, store/mod.rs:2005).
  • Mirrors the existing pattern used by the merge/range tests — keeps these per-chunk-path assertions valid even if the default strategy is flipped later.
  • No production code touched.
· branch: `cubestore-chunk-repartition-speed-up`

Add an opt-in sequential prefetch to the batch repartition loop. A producer
downloads upcoming chunk parquets (anchor last, others by ascending id) into
the local cache while the consumer repartitions the current chunk, so the
download latency is hidden behind processing. A byte-budget semaphore bounds
how much fetched-but-unprocessed data sits on local disk: each chunk holds
permits worth its file size from before its download until the consumer
finishes it. Chunks prefetched past the time budget stay local and warm the
follow-up job, which lands on the same node by partition.

download_file is idempotent and dedups in-flight downloads, so the consumer's
repartition_chunk just hits the warm file. The producer makes no metastore
calls; the chunk list is still read once.

Controlled by CUBESTORE_REPARTITION_PREFETCH_BUDGET (accepts size suffixes,
e.g. 512MB); None (unset) or 0 disables prefetching. Added env_optparse_size
to keep None distinct from an explicit value.
Add an opt-in merge path for repartitioning an inactive parent's persisted
chunks. Instead of splitting each chunk independently into the active children,
the parent's chunks are merged k-way (in groups of up to a configurable cap) and
the sorted stream is split directly into the children at the wal-split limit in
one streaming pass, producing full-size chunks and avoiding the per-chunk
fragment fan-out plus the compaction that would later merge them.

Each group commits with a single atomic swap_chunks; a group whose source was
already repartitioned by a racing job is skipped (its new chunks stay inactive
and are GC'd). A fully empty group deactivates its sources directly. Children
must tile the parent's range exactly, otherwise the table is deactivated as on
the per-chunk path. The anchor is processed last so it keeps holding the job
dedup key until the run finishes or yields on the time budget.

Reuses the compaction streaming writer machinery via a new
write_chunks_split_into_children that cuts files on child boundaries and the
row-count limit. Controlled by CUBESTORE_REPARTITION_MERGE_MAX_INPUT_FILES
(Option<usize>); None / 0 / 1 disable it, Some(m >= 2) caps each group at m
input chunks. Default off.
…ector

Add a third repartition strategy that slices an inactive parent's persisted
chunks into RepartitionRange jobs at schedule time. Slicing walks all chunks
(active and inactive) ordered by id and cuts a range once it reaches the row or
chunk-count cap, so the [start, end] bounds stay pinned to chunk ids and a
re-slice reproduces them; the end is carried as job data, not the dedup key, so
a tail that extends the trailing range dedups on its start instead of spawning a
second job. Each range runs as one atomic swap on the worker chosen by the hash
of its bounds, restoring cross-worker parallelism. A GC gate keeps an inactive
parent's chunks until it fully drains so slicing stays stable.

Replace the ad-hoc flags with a single CUBESTORE_REPARTITION_STRATEGY selector
(per_chunk default, per_partition, range); an unknown value logs a warning and
falls back to per_chunk. The merge caps (CUBESTORE_REPARTITION_MERGE_MAX_INPUT_FILES
default 50, CUBESTORE_REPARTITION_MERGE_MAX_ROWS default 4000000) become plain
caps with defaults. The per-partition merge core is shared between the in-job
loop and the range handler.

JobType::RepartitionRange deserializes as an unknown variant on binaries that
predate it, so it is only safe behind the skip-unknown-jobs handling; enable the
strategy per-deployment.
- compute_repartition_children: an empty children set is treated as a transient
  topology read and returns an error to retry, instead of deactivating the table;
  deactivation now only happens when children are present but do not tile the
  parent's range (genuine corruption).
- Cache the draining-parents set (inactive parents with active chunks) with a
  short TTL so the GC loop does not re-scan the chunk table every cycle when the
  range strategy is enabled.
- Prefetch: bill a chunk with no recorded file_size at the full budget instead of
  ~free, so an unknown-size fetch cannot overshoot the on-disk budget.
- Clamp the range slicing caps to >= 1 so a misconfigured 0 does not break the
  slicing loop before adding a chunk.
…e dedup

- repartition_partition_chunks_merged now groups chunks by max_rows OR
  max_input_files (whichever is hit first), matching how the range strategy slices
  its jobs, instead of grouping purely by chunk count.
- Comment the last-row/aggregate dedup in merge_chunk_group_into_children: it is the
  same merge_chunks compaction uses; for unique-key tables the sort key ends with the
  seq column so the latest version wins, and group order only breaks ties between
  rows with an identical (unique key, seq).
… strategy

repartition_partition_chunks_yields_on_budget and
repartition_partition_chunks_prefetch_drains assert per-chunk-path behavior, so
set the strategy explicitly instead of inheriting the default — keeps them valid
regardless of the configured strategy (mirrors how the merge/range tests pin
theirs).
…urrent download in merge

Remove CUBESTORE_BATCH_REPARTITION / batch_repartition_enabled. The repartition
strategy (per_chunk / per_partition / range) is now the sole selector:
- per_chunk: one job per chunk -> repartition_chunk
- per_partition: one anchor job per partition -> merge
- range: RepartitionRange jobs -> merge

The job handler and scheduler dispatch on the strategy directly. The dead
per_chunk+batch hybrid (per-chunk loop and its producer/consumer prefetch) is
gone, along with its two now-obsolete tests.

Prefetch is reworked: drop the byte-budget producer/consumer and
CUBESTORE_REPARTITION_PREFETCH_BUDGET; add a plain bool
CUBESTORE_REPARTITION_CONCURRENT_DOWNLOAD (default off) that downloads a merge
group's chunk parquets concurrently before building the merge inputs. It applies
to both per_partition and range; the group is already bounded by
repartition_merge_max_input_files and the pool by download_concurrency, so no
extra budget is needed.
@waralexrom waralexrom force-pushed the cubestore-chunk-repartition-speed-up branch from 18b967e to 1ed20eb Compare June 15, 2026 22:08
@claude

claude Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 40s —— View job


Code Review — follow-up on 1ed20eb (collapse batch flag into strategy selector, concurrent download)

Verdict: clean refactor — no findings. All four prior review threads remain resolved.

Details

Checklist

  • Read git diff 57f4c8f..HEAD (commit 1ed20eb)
  • Verify the per-chunk batch hybrid is removed cleanly
  • Trace the new concurrent-download path in the merge group build
  • Re-verify dispatch and config consistency
  • Confirm prior threads still resolved

What changed

  • Removed CUBESTORE_BATCH_REPARTITION / batch_repartition_enabled and the per-chunk batch loop (repartition_chunk_tolerant, repartition_partition_chunks_prefetched) plus their two tests. The strategy selector is now the only dispatch — PerChunkrepartition_chunk per job, PerPartition → merge path keyed on the anchor chunk, RangeRepartitionRange jobs.
  • Removed CUBESTORE_REPARTITION_PREFETCH_BUDGET and the byte-budget producer/consumer semaphore.
  • Added CUBESTORE_REPARTITION_CONCURRENT_DOWNLOAD (bool, default off): in merge_chunk_group_into_children (store/mod.rs:1087), spawns one task per chunk in the group to warm the local cache concurrently before chunk_exec builds the merge inputs. Bounded by repartition_merge_max_input_files (group size) and download_concurrency (download pool). Idempotent — a failure here is swallowed and re-surfaces in chunk_exec.
  • Job dispatcher (job_processor.rs:222) and scheduler (cluster/mod.rs:1005) both switched from batch_repartition_enabled() to repartition_strategy() == PerPartition; comments updated to reflect that the anchor-chunk overload is for cross-channel compat, not the dead hybrid path.

Why this looks right

  • Backwards-compat reasoning still holds. PerPartition still reuses RepartitionChunk keyed on the smallest persisted chunk id; an older binary on the same job shard just repartitions that one anchor chunk and drains the rest via its own per-chunk path. The justification comments are tightened, not weakened.
  • Concurrent download is correctly idempotent. download_file dedups in-flight fetches; the for d in downloads { let _ = d.await; } drain awaits all spawned tasks before chunk_exec runs, so a partial failure here just means chunk_exec re-issues the download against the same cache and reports a real error then.
  • No new race surface. If the calling future is cancelled mid-merge, the spawned downloads keep running but only warm the cache — no metastore writes, no on-disk dangling state beyond what was there before.
  • Group size already bounds on-disk overshoot. Previous prefetch path bounded fetched-but-unprocessed bytes explicitly; the new path implicitly bounds them at max_input_files × chunk_size per group, but the group is processed immediately after download, so chunks don't sit unconsumed. The author's "no extra budget needed" claim checks out.
  • SQL test renamed and re-tuned. repartition_concurrent_download_keeps_data_consistent (sql/mod.rs:3619) exercises the new path end-to-end with max_input_files = 4, giving real concurrent downloads instead of the removed prefetch producer/consumer path.

Things considered and dismissed

  • Lost log::debug! on download failure. The previous prefetch path logged "Prefetch of chunk {} failed: {}"; the new path swallows silently. Minor observability loss, but chunk_exec will log/return the real error when it actually needs the file, so genuine failures aren't hidden.
  • Two tests deleted. repartition_partition_chunks_yields_on_budget and repartition_partition_chunks_prefetch_drains covered the per-chunk batch loop and the prefetch producer/consumer — both code paths are gone, so the tests are correctly removed. Yield/drain semantics for PerPartition are now covered by repartition_merge_drains_and_yields and the SQL drain test.
  • Unbounded cube_ext::spawn per group. Caps at max_input_files (default 50) per merge group; combined with download_concurrency throttling inside download_file, the parallelism stays bounded. Fine.
  • Config rename in test harness. sql/mod.rs:3608 flipped batch_repartition_enabled = false to repartition_strategy = PerChunk (same intent, new field). Consistent.

Prior threads

All four remain resolved — nothing in 1ed20eb reopens them. Re-affirmed by re-reading the thread bodies against the current diff.

· branch: `cubestore-chunk-repartition-speed-up`

cursoragent and others added 3 commits June 16, 2026 14:50
The merge-based repartition path (per_partition / range strategies) commits
its new chunks with swap_chunks, which enforces that the activated row count
equals the deactivated row count. merge_chunks aggregates (aggregate indexes)
and last-row-dedups (unique-key tables) the source group, so it legitimately
emits fewer rows than it consumed. The checked swap then rejected the commit
with "Deactivated row count (..) doesn't match activated row count (..) during
swap", failing RepartitionRange / per-partition jobs.

Commit with swap_chunks_without_check instead, matching how compaction commits
its dedup'd merges. Adds a unit test on an aggregate index that reproduces the
exact failure before the fix.

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…dedup

Adds range- and per-partition-strategy SQL tests on an aggregate-index table
whose chunks share dimension keys across inserts. The repartition merge groups
those rows by the sort key, so the swap activates fewer rows than it deactivates
- the production RepartitionRange row-count-mismatch scenario. Without the
unchecked-swap fix the repartition jobs never drain the inactive parents; with
it the data stays consistent (sum(m) and per-key sums conserved).

Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…apses rows

The merge-based repartition path committed every group with the unchecked
swap. That is required only for aggregate indexes and unique-key tables, where
merge_chunks aggregates / last-row-dedups and the activated row count is
legitimately smaller. For a plain regular index the merge is a pure sort-merge
that conserves rows, so commit it through the checked swap_chunks to keep the
row-count integrity guard.
@claude

claude Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 12s —— View job


Code Review — follow-up on 2604a5c, c48785a, 39a6d3d (unchecked-swap fix gated by row-collapse)

Verdict: clean correctness fix — no new findings. All four prior threads remain resolved.

Details

Checklist

  • Read git diff 1ed20eb..HEAD
  • Verify the swap-gate condition matches merge_chunks's collapse semantics
  • Confirm the regular-index integrity guard is preserved
  • Sanity-check the new tests reproduce the original failure
  • Re-verify prior threads stay resolved

What changed

  • merge_chunk_group_into_children (store/mod.rs:1082) now derives merge_collapses_rows = unique_key.is_some() || aggregate_columns.is_some() and routes the commit through swap_chunks_without_check only in that case; regular indexes still go through the checked swap_chunks (store/mod.rs:1227).
  • 2604a5c first switched the whole path to the unchecked swap with a unit-test repro on an aggregate index (20 rows → 10 grouped). 39a6d3d then tightened it back to a gated decision so a regular sort-merge still gets the row-count integrity guard.
  • c48785a adds two end-to-end SQL tests (repartition_range_jobs_aggregate_index_keeps_data_consistent, repartition_merge_aggregate_index_keeps_data_consistent) that cycle g over 0..40 across 200 inserts. Without the fix the inactive parents never drain; with it sum(m)=200 and the per-g aggregation is conserved across the split.

Why this looks right

  • The gate exactly mirrors merge_chunks's own collapse conditions: aggregate indexes group by sort key, unique-key tables last-row-dedup on (unique_key, seq). Both legitimately emit fewer rows than they consumed. A plain regular index is a pure ordered merge — row-conservation is a real invariant there and the checked swap should fire if anything ever breaks it.
  • The compaction path uses the same swap_chunks_without_check for the same reason; the inline comment makes the parallel explicit.
  • Drain-and-verify SQL tests exercise the production code path (range scheduler → RepartitionRange job → merge → swap) on a table whose chunks share dimension keys, which is the exact shape the field bug took.
  • Both new tests use small thresholds (partition_split_threshold=20, repartition_merge_max_rows=40 / max_input_files=4) so the assertion isn't trivially passed by the merge collapsing to a single group — active.len() > 1 enforces that an actual split happens.

Things considered and dismissed

  • Could a unique-key table merge ever be row-conserving (no duplicate (unique_key, seq))? Yes, but the gate is by table type, not by observed row count, which is the right call — we don't want a flaky guard that fires only on specific data shapes.
  • Regular index + aggregate columns on the same table. aggregate_columns only resolves to Some for IndexType::Aggregate, so a regular index over an aggregated table still gets the checked path. Correct.
  • The replay-handle merge runs before the gate decision. Same as before this PR; unrelated to the swap-check choice.
  • Test coverage for the regular-index path still relies on the checked swap. The original repartition_keeps_data_consistent and the per-strategy variants already cover that. No gap.

Prior threads

All four resolved; nothing in 2604a5c..HEAD reopens them.

· branch: `cubestore-chunk-repartition-speed-up` ·

compute_repartition_children deactivated the table as corrupt whenever the
inactive parent's active children did not tile its range exactly. That check
compared data-extent bounds (min_val/max_val), but a split sets the first
child's lower bound to its first segment's data min rather than the parent's
lower bound, so for normal multi-level topologies the children legitimately
leave gaps and the outer edges don't match the parent. The check fired in
production on RepartitionRange jobs and deactivated healthy pre-aggregation
tables.

The streaming split never drops rows (the first child is the low catch-all,
the last the high one) and the legacy per-chunk path performed no such check,
so the strict tiling guard was wrong. Replace it with an optional non-overlap
check, gated behind CUBESTORE_REPARTITION_CHECK_OVERLAPPING_CHILDREN and off by
default: gaps are benign, only overlapping children are genuine corruption.
@claude

claude Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 5s —— View job


Code Review — follow-up on c1dafa6 (don't deactivate on benign child gaps)

Verdict: clean correctness fix — no new findings. All four prior threads remain resolved.

Details

Checklist

  • Read git show c1dafa6
  • Verify the rationale for replacing the tiling check matches partition-bound semantics
  • Confirm the overlap check (off by default) doesn't reintroduce the false-positive surface
  • Re-verify prior threads stay resolved

What changed

  • compute_repartition_children (store/mod.rs:1003) keeps the children.is_empty() early-return-as-retry from the earlier hardening pass.
  • The strict "children must tile parent's range exactly" guard is gone. It compared partition.min_val/max_val (data extent, not range bounds) against child bounds, which legitimately don't coincide after a split — the first child's min_val is its data min, not the parent's lower bound. In multi-level topologies the children correctly leave gaps at the outer edges, and the old check was deactivating healthy pre-aggregation tables in production under RepartitionRange.
  • Replaced with an optional non-overlap-only check, gated by CUBESTORE_REPARTITION_CHECK_OVERLAPPING_CHILDREN, off by default. Overlap (a child's range starts before the previous child's range ends) is the only condition where a row could legitimately go to two children — that is real corruption. Gaps are not.
  • Config plumbed through ConfigObj::repartition_check_overlapping_children() consistently.

Why this looks right

  • The commit's reasoning checks out: Partition::min_val/max_val track data extents, not range bounds (range bounds live on the partition's place in the index ordering). Tile-equality against those extents was structurally wrong for the upper/lower edges and after sparse data, regardless of strategy.
  • The streaming split is row-conserving by construction (first child = low catch-all, last = high catch-all), so the guard never protected against row loss anyway — it was a metadata sanity check masquerading as a correctness invariant. The legacy per-chunk path didn't have it.
  • The new overlap check is narrowly scoped: only fires on a genuine ordering violation, and is opt-in so an operator who wants the assertion can turn it on without imposing it as a default.
  • The (None, _) | (_, None) interior-bound case being treated as overlap is the right call — an open bound mid-ordering is itself a metadata inconsistency.

Things considered and dismissed

  • Should the overlap check be on-by-default in test harnesses? The fakes in config/mod.rs set it to false, matching prod. That's fine — the existing per-strategy SQL tests already exercise normal topologies; adding the overlap assertion to tests would just risk re-introducing flakiness on legitimate multi-level layouts.
  • Could a real corruption (true tiling violation that isn't an overlap) now slip through silently? Yes, but it was always going to slip through the per-chunk path, and the streaming split still doesn't lose rows. The check was load-bearing only against a class of metadata bugs that the codebase doesn't have evidence of in practice; the false-positive cost was concrete.
  • Naming. CUBESTORE_REPARTITION_CHECK_OVERLAPPING_CHILDREN=true reads as "do the check," not "fail on overlap" — that's correct and consistent with similar env flags in the file.

Prior threads

All four remain resolved; nothing in c1dafa6 reopens them.

· branch: `cubestore-chunk-repartition-speed-up`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cube store Issues relating to Cube Store rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants