Conversation
3553c37 to
a316546
Compare
There was a problem hiding this comment.
Pull request overview
Adds a one-shot “mature merge” capability to Quickwit to merge fully-mature published splits across nodes, exposed via a new quickwit tool merge-mature CLI subcommand.
Changes:
- Introduces a mature-merge planner (
mature_merge_plan) that groups eligible mature splits (by day/partition/doc-mapping) into merge operations. - Implements an execution runner (
mature_merge) that reuses the existing merge actor pipeline to run planned operations across indexes with configurable concurrency and dry-run output. - Extends CLI tooling and a few actor utilities/counters to support the new workflow (publisher counters, temp dir handling, merge permits).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| quickwit/quickwit-indexing/src/mature_merge.rs | New executor that scans indexes, plans merges, and runs the merge actor pipeline (plus dry-run logging and summary). |
| quickwit/quickwit-indexing/src/mature_merge_plan.rs | New pure planning logic and unit tests for selecting/grouping eligible mature splits. |
| quickwit/quickwit-indexing/src/lib.rs | Exposes the new mature-merge modules from quickwit-indexing. |
| quickwit/quickwit-indexing/src/actors/publisher.rs | Adds counter for number of replaced splits and updates tests accordingly. |
| quickwit/quickwit-indexing/src/actors/merge_split_downloader.rs | Uses TempDirectory::named_temp_child for scratch subdirectories. |
| quickwit/quickwit-indexing/src/actors/merge_scheduler_service.rs | Adds MergePermit::new helper for externally managed concurrency. |
| quickwit/quickwit-cli/src/tool.rs | Adds merge-mature subcommand, argument parsing, and optional metrics serving. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| Ok(IndexMergeOutcome { | ||
| num_published_merges: publisher_counters.num_replace_operations, |
There was a problem hiding this comment.
num_published_merges is derived from publisher_counters.num_replace_operations, which may undercount merges that legitimately publish no new split (e.g. a merge that results in an empty output but still replaces input splits). Consider using operations.len() on success, or incorporate publisher_counters.num_empty_splits (or another explicit counter) so the reported merge count matches executed operations.
| num_published_merges: publisher_counters.num_replace_operations, | |
| num_published_merges: publisher_counters.num_replace_operations | |
| + publisher_counters.num_empty_splits, |
| let list_splits_query = | ||
| ListSplitsQuery::for_index(index_uid).with_split_state(SplitState::Published); | ||
| let list_splits_request = ListSplitsRequest::try_from_list_splits_query(&list_splits_query)?; | ||
| let splits_stream = metastore.list_splits(list_splits_request).await?; |
There was a problem hiding this comment.
Wouldn't it be more efficient to filter as much as possible directly in the metastore ?
There was a problem hiding this comment.
I have added a filter on mature splits, but for the rest (e.g split sizes) I would need to add a method to the metastore. I think we can do that as an optimization later.
434cf50 to
df4b5ac
Compare
| continue; | ||
| } | ||
|
|
||
| let day_bucket = start_day * SECS_PER_DAY; |
12a4138 to
cd6a90c
Compare
Description
A cli tool that merges mature splits. It can be made into a service into the future.
How was this PR tested?
Tested on test and has some pretty good unit tests.