Add cluster command for PPL #5265
Signed-off-by: Ritvi Bhatt <ribhatt@amazon.com>
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit df7873a.
This PR is stalled because it has been open for 2 weeks with no activity.
    return threshold;
  }

  private static String validateMatchMode(String matchMode) {
Ideally this should be an enum instead of a String; an enum makes the interface hard to misuse.
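One way the enum suggestion could look is sketched below. This is only an illustration: `MatchMode`, its constants, and `parse` are hypothetical names, and the constant values are placeholders rather than the match modes this PR actually supports.

```java
import java.util.Locale;

// Hypothetical enum: the constants below are placeholders, not the PR's real match modes.
enum MatchMode {
  TERMLIST,
  TERMSET,
  NGRAMSET;

  /** Parses a user-supplied value case-insensitively and rejects unknown modes up front. */
  static MatchMode parse(String raw) {
    if (raw == null) {
      throw new IllegalArgumentException("match mode must not be null");
    }
    try {
      return MatchMode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      // Re-throw with a message that names the offending input.
      throw new IllegalArgumentException("Unsupported match mode: " + raw, e);
    }
  }
}
```

With this shape the string is validated once at the parsing boundary, and everything downstream takes a `MatchMode`, so an invalid mode cannot reach the clustering code.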
| `countfield` | Optional | Name of the field to store the cluster size. Default is `cluster_count`. |
| `showcount` | Optional | Whether to include the cluster count field in the output. Default is `false`. |
| `labelonly` | Optional | When `true`, keeps all rows and only adds the cluster label. When `false` (default), deduplicates by keeping only the first representative row per cluster. Default is `false`. |
| `delims` | Optional | Delimiter characters used for tokenization. Default is `non-alphanumeric` (splits on any non-alphanumeric character). |
Minor: this could probably also be tagged somehow instead of working directly on strings.
penghuo left a comment
Did you benchmark the cluster command on the http_logs dataset?
| `countfield` | Optional | Name of the field to store the cluster size. Default is `cluster_count`. |
| `showcount` | Optional | Whether to include the cluster count field in the output. Default is `false`. |
| `labelonly` | Optional | When `true`, keeps all rows and only adds the cluster label. When `false` (default), deduplicates by keeping only the first representative row per cluster. Default is `false`. |
| `delims` | Optional | Delimiter characters used for tokenization. Default is `non-alphanumeric` (splits on any non-alphanumeric character). |
What are the allowed values of `delims`?
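For reference, the documented default ("splits on any non-alphanumeric character") can be sketched as follows. This is an assumption-laden illustration, not the PR's tokenizer: `DelimTokenizer` and `tokenize` are hypothetical names, and the sketch treats only ASCII letters and digits as alphanumeric.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the documented default delimiter behavior.
final class DelimTokenizer {
  // Assumption: "alphanumeric" means ASCII letters and digits only.
  private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9]+");

  /** Splits text on runs of non-alphanumeric characters and drops empty tokens. */
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String token : NON_ALPHANUMERIC.split(text)) {
      if (!token.isEmpty()) {
        tokens.add(token);
      }
    }
    return tokens;
  }
}
```

Under these assumptions, `tokenize("GET /index.html 200")` yields `GET`, `index`, `html`, `200`. How a user-supplied `delims` value is interpreted (literal characters, a character class, or something else) is exactly what the question above asks and is not answered by this sketch.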
  // Cache vectorized representations to avoid recomputation
  private final Map<String, Map<CharSequence, Integer>> vectorCache =
      new LinkedHashMap<>(MAX_CACHE_SIZE, 0.75f, true) {
Is `initialCapacity` meant to equal `MAX_CACHE_SIZE`?
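Some context for the question: the first constructor argument of `LinkedHashMap` is the initial table capacity, not a size limit. The limit itself comes from overriding `removeEldestEntry`. A minimal sketch of an access-ordered LRU cache with a small initial capacity is shown below; `LruCache` and `maxEntries` are hypothetical names, and the initial capacity of 16 is an arbitrary choice for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache: the bound is enforced by removeEldestEntry, not by initialCapacity.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  LruCache(int maxEntries) {
    // Small initial capacity (assumed value); accessOrder=true keeps entries in LRU order.
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Called after each insertion; returning true evicts the least recently used entry.
    return size() > maxEntries;
  }
}
```

Passing `MAX_CACHE_SIZE` as the initial capacity is not wrong, but it sizes the table up front even if the cache never fills; starting small lets the map grow on demand at the cost of a few resizes.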
  private int bufferLimit = 50000; // Configurable buffer size
  private int maxClusters = 10000; // Limit cluster count to prevent memory explosion
Is this setting configurable by the user?
Description
Adds the cluster command to PPL, which groups documents into clusters based on text similarity. See #5255 for the design decisions behind it.
Syntax
Supported algorithms
All algorithms use cosine similarity against the configured threshold to determine cluster membership.
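The membership test can be illustrated with a minimal sketch of cosine similarity over term-count vectors. The `Map<CharSequence, Integer>` vector type mirrors the cache shown in the diff, but `CosineSketch`, `cosine`, `norm`, and `sameCluster` are hypothetical names, and treating "meets the threshold" as `>=` is an assumption rather than the PR's confirmed behavior.

```java
import java.util.Map;

// Hypothetical sketch of cosine similarity between two term-count vectors.
final class CosineSketch {
  /** Returns the cosine of the angle between two sparse term-count vectors, or 0 if either is empty. */
  static double cosine(Map<CharSequence, Integer> a, Map<CharSequence, Integer> b) {
    double denominator = norm(a) * norm(b);
    if (denominator == 0.0) {
      return 0.0; // an empty or all-zero vector is similar to nothing
    }
    // Iterate over the smaller map so the dot product touches fewer entries.
    Map<CharSequence, Integer> small = a.size() <= b.size() ? a : b;
    Map<CharSequence, Integer> large = small == a ? b : a;
    long dot = 0;
    for (Map.Entry<CharSequence, Integer> entry : small.entrySet()) {
      Integer other = large.get(entry.getKey());
      if (other != null) {
        dot += (long) entry.getValue() * other;
      }
    }
    return dot / denominator;
  }

  /** Euclidean length of a term-count vector. */
  static double norm(Map<CharSequence, Integer> vector) {
    long sumOfSquares = 0;
    for (int count : vector.values()) {
      sumOfSquares += (long) count * count;
    }
    return Math.sqrt(sumOfSquares);
  }

  /** Assumption: a document joins a cluster when similarity is at least the threshold. */
  static boolean sameCluster(
      Map<CharSequence, Integer> a, Map<CharSequence, Integer> b, double threshold) {
    return cosine(a, b) >= threshold;
  }
}
```

Note that `CharSequence` keys only match when their `equals` and `hashCode` agree, so the sketch is safest when every key is a `String`.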
Behavior
representative row per cluster
Changes
Related Issues
Resolves #5255
Check List
--signoff or -s.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.