fix(cubestore): repartition merge commits with unchecked swap (fixes RepartitionRange row-count regression)#11091
Draft
paveltiunov wants to merge 2 commits into
Conversation
The merge-based repartition path (per_partition / range strategies) commits its new chunks with swap_chunks, which enforces that the activated row count equals the deactivated row count. merge_chunks aggregates (aggregate indexes) and last-row-dedups (unique-key tables) the source group, so it legitimately emits fewer rows than it consumed. The checked swap then rejected the commit with "Deactivated row count (..) doesn't match activated row count (..) during swap", failing RepartitionRange / per-partition jobs. Commit with swap_chunks_without_check instead, matching how compaction commits its dedup'd merges. Adds a unit test on an aggregate index that reproduces the exact failure before the fix. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…dedup Adds range- and per-partition-strategy SQL tests on an aggregate-index table whose chunks share dimension keys across inserts. The repartition merge groups those rows by the sort key, so the swap activates fewer rows than it deactivates - the production RepartitionRange row-count-mismatch scenario. Without the unchecked-swap fix the repartition jobs never drain the inactive parents; with it the data stays consistent (sum(m) and per-key sums conserved). Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## cubestore-chunk-repartition-speed-up #11091 +/- ##
=======================================================================
Coverage ? 58.50%
=======================================================================
Files ? 216
Lines ? 17270
Branches ? 3524
=======================================================================
Hits ? 10103
Misses ? 6652
Partials ? 515
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR #11088 introduced a merge-based repartition path (
per_partitionandrangestrategies). In production withrangeenabled this regressed with a flood of failingRepartitionRangejobs:Root cause
The new merge path (
ChunkStore::merge_chunk_group_into_children) k-way merges a group of source chunks throughmerge_chunks, then commits the new child chunks.merge_chunksaggregates (for aggregate indexes — i.e. pre-aggregation rollups) and last-row-dedups (for unique-key tables) the group, so it legitimately emits fewer rows than it consumed.The commit used the checked
swap_chunks, which enforcesactivated_row_count == deactivated_row_count. For any aggregate/unique-key table the merge collapses rows, so the check rejected the swap and the repartition job failed permanently — the inactive parent never drained.The legacy per-chunk path (
partition_rows) never aggregated (it only routes rows by partition key), so the row count was always conserved there and the check passed — which is why this only surfaced on the new merge strategies.Fix
Commit the merge with
swap_chunks_without_check, exactly as compaction already does for its own dedup'd merges (compact_chunks_to_in_memory/compact_chunks_to_persistent). The row-count equality invariant does not hold for an aggregating/deduping merge, so enforcing it is incorrect for this path.Tests
Following the request to reproduce the error first, then fix it:
store::tests::repartition_merge_aggregate_index_collapses_rows: builds an aggregate index whose two source chunks share every key, drivesrepartition_chunk_range, and asserts the children conserve the aggregated row count. Before the fix this panics with the exact production error:Deactivated row count (20) doesn't match activated row count (10) during swap of (1, 2) to (3, 4) chunks.repartition_range_jobs_aggregate_index_keeps_data_consistentandrepartition_merge_aggregate_index_keeps_data_consistent: aggregate-index table, dimension keys repeated across inserts, forced split, full scheduler → range-slicing → job-runner path. Before the fix the inactive parents never drain (jobs keep failing the swap); after itsum(m)and every per-key sum are conserved.All 15
repartition*unit/SQL tests pass with the fix; the three new tests fail without it.Risk
Mirrors the long-standing compaction behavior; default strategy remains
per_chunk(unchanged). No metastore/schema changes.