PPL patterns command (BRAIN + SIMPLE) on the analytics-engine route #21797
Force-pushed: 972bc50 to a4d08d8
❌ Gradle check result for a4d08d8: FAILURE
Force-pushed: a4d08d8 to 5380a05
❌ Gradle check result for 5380a05: null
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@             Coverage Diff              @@
##               main   #21797      +/-   ##
============================================
+ Coverage     73.37%   73.45%   +0.08%
- Complexity    75448    75532      +84
============================================
  Files          6034     6033       -1
  Lines        342504   342572      +68
  Branches      49259    49276      +17
============================================
+ Hits         251310   251637     +327
+ Misses        71175    70947     -228
+ Partials      20019    19988      -31
Force-pushed: 9d67da9 to 9c7d359
❌ Gradle check result for 9c7d359: FAILURE
… adapter
Per @penghuo's review: DataFusion-specific concerns shouldn't live in SQL core. The 'g' flag is needed only because DataFusion's regexp_replace defaults to first-match-only; Calcite's 3-arg form is already replace-all on both pushdown and no-pushdown paths.
Restores SQL core, RexStandardizer, the patterns unit test, and the SIMPLE-patterns explain YAMLs to their upstream/main shape. The 'g' flag is appended in opensearch-project/OpenSearch#21797's RegexpReplaceAdapter when converting 3-arg REGEXP_REPLACE to DataFusion.
Same end-user behavior, smaller SQL diff, and the Calcite no-pushdown path no longer diverges from the pushdown YAML.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
❌ Gradle check result for ec6da64: FAILURE
…t scaffold
Starts the port of PPL `patterns` command's algorithm layer to native Rust so the BRAIN method and `PATTERN_PARSER` scalar can execute on the analytics-engine route. The end goal is `CalcitePPLPatternsIT` passing against parquet-backed indices.
A new `patterns/` module containing the pure-logic layer (no DataFusion dependency, fully unit-testable in isolation):
- `preprocess.rs`: Java `BrainLogParser.preprocess(...)` port. Regex-based variable detection (IP, ISO datetime, UUID, hex/letter-digit/floats), delimiter normalization, whitespace splitting. The "generic number surrounded by non-alphanumeric" rule is implemented manually because Rust's `regex` crate doesn't support lookaround.
- `utils.rs`: Java `PatternUtils` port. `parse_pattern`, `extract_variables`, `ParseResult` with `to_token_order_string`. `WILDCARD_PATTERN` and `TOKEN_PATTERN` regexes are equivalent to Java's `Pattern.compile` strings.
- `brain.rs`: `BrainLogParser` struct skeleton, `collapse_continuous_wildcards`, `PatternEntry` / `BrainParseStats` typed result types. The classifier internals (histogram, group token set, `parse_log_pattern`) are deliberate stubs in this milestone; they land in milestone 2.
- `tokens.rs`: `PatternResult` typed view of the per-row / per-group result the UDF layer will materialize into Arrow.
14 unit tests pass (1 `#[ignore]` placeholder for the full BRAIN classification once the algorithm port lands):
- `preprocess_simple_log_line_splits_on_whitespace`
- `preprocess_substitutes_ip_then_blk_number`: matches the expected preprocessed shape from `testBrainLabelMode_NotShowNumberedToken`
- `preprocess_substitutes_uuid`: matches `testBrainParseWithUUID_*`
- `preprocess_collapses_consecutive_wildcards_via_number_runs`
- `parse_wildcard_pattern_splits_on_email_separators`: `<*>@<*>.<*>` from `testSimplePatternLabelMode_*`
- `to_token_order_string_rewrites_wildcards_to_numbered_tokens`: produces `<token1>@<token2>.<token3>`
- `extract_variables_extracts_email_parts`: matches `testSimplePatternLabelMode_ShowNumberedToken` ImmutableMap expectation
- `extract_variables_handles_multi_sample_aggregation`: matches `testBrainAggregationMode_ShowNumberedToken` multi-sample tokens.list
- `extract_variables_returns_empty_when_static_mismatch`
- `token_pattern_matches_numbered_placeholders`
- `collapse_three_consecutive_wildcards` / variants
1. Full BRAIN classifier port: histogram + group-token-set + parse_log_pattern.
2. DataFusion ScalarUDF wrapper for PATTERN_PARSER (`udf/pattern_parser.rs`).
3. DataFusion AggregateUDF wrapper for INTERNAL_PATTERN (aggregation mode).
4. DataFusion WindowUDF wrapper for INTERNAL_PATTERN (label mode).
5. Substrait YAML signatures for `pattern_parser` and `internal_pattern`.
6. Java adapters in analytics-backend-datafusion + ScalarFunction enum + capability registration.
7. `PatternsCommandIT` mirroring `CalcitePPLPatternsIT` against parquet-backed indices.
8. Verification via `:integTestRemote -Dtests.analytics.force_routing=true`.
This milestone is a no-op at runtime: `patterns/` is unwired. Lands as the algorithmic foundation for the work above.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
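For illustration, the numbered-token rewrite pinned by `to_token_order_string_rewrites_wildcards_to_numbered_tokens` can be sketched in std-only Rust. This is a minimal sketch, not the code in `utils.rs`: the free-function signature is an assumption (the commit describes a method on `ParseResult`), and it only handles the plain `<*>` wildcard, not typed variants such as `<*IP*>`.

```rust
/// Hypothetical free-function form of the numbered-token rewrite.
/// Each `<*>` wildcard, left to right, becomes `<token1>`, `<token2>`, ...
fn to_token_order_string(pattern: &str) -> String {
    const WILDCARD: &str = "<*>";
    let mut out = String::with_capacity(pattern.len());
    let mut rest = pattern;
    let mut n = 0;
    // Copy the static text before each wildcard, then emit the numbered token.
    while let Some(pos) = rest.find(WILDCARD) {
        n += 1;
        out.push_str(&rest[..pos]);
        out.push_str(&format!("<token{}>", n));
        rest = &rest[pos + WILDCARD.len()..];
    }
    // Trailing static text after the last wildcard.
    out.push_str(rest);
    out
}
```

Applied to the email pattern `<*>@<*>.<*>`, the sketch yields `<token1>@<token2>.<token3>`, the shape named in the test list above.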
PPL `patterns` lowers its result-struct flatten via `INTERNAL_ITEM(struct, "sample_logs")` and `INTERNAL_ITEM(struct, "tokens")`, where `sample_logs` returns `ARRAY<VARCHAR>` and `tokens` returns `MAP<VARCHAR, ARRAY<VARCHAR>>`. The scalar form of ITEM was already in STANDARD_PROJECT_OPS (with SUPPORTED_FIELD_TYPES, which covers VARCHAR / numeric returns), but the ARRAY- and MAP-returning shapes weren't registered, so OpenSearchProjectRule rejected the call with `No backend supports scalar function [ITEM] among [datafusion]`.
Adds ITEM to both ARRAY_RETURNING_PROJECT_OPS and MAP_RETURNING_PROJECT_OPS.
Part of the PPL `patterns` command analytics-engine support stack.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ssifier port
Replaces the stub in patterns/brain.rs with a full port of BrainLogParser.java from the SQL plugin's common/src/main/java/org/opensearch/sql/common/patterns/.
## What lands
- preprocess_all_logs: tokenizes each line with default filters (IP/datetime/UUID/numbers), appends a synthetic logId token, and updates token_freq_map.
- process_token_histogram: positional token-frequency counter.
- calculate_group_token_freq: picks the representative WordCombination (sorted by same_freq_count desc, then word_freq desc) per row and populates the per-(tokens_len,candidate,position) group token set.
- parse_log_pattern: per-row classifier. Tokens whose frequency > repFreq are kept ONLY if unique in their group; tokens with frequency < repFreq are kept ONLY if the group has fewer than variable_count_threshold variants. Everything else becomes <*>.
- parse_all_log_patterns: full pipeline. Group equal pattern strings together, count occurrences, collect samples up to max_sample_count.
- WordCombination: typed pair with the Java compareTo ordering.
- collapse_continuous_wildcards: adjacent-<*> collapse, unchanged from milestone 1.
## Tests (18/18 pass)
Existing 14 unit tests from milestone 1 still pass. New tests cover the BRAIN classifier directly with fixtures from CalcitePPLPatternsIT:
- brain_groups_verification_succeeded_lines: two 'Verification succeeded' lines collapse to one pattern with 2 samples.
- brain_aggregates_hdfs_fixtures_into_four_groups: the 8-line HDFS fixture matches the IT's testBrainAggregationMode_NotShowNumberedToken expectation: exactly 4 patterns, every group has count == 2.
- brain_aggregates_brain_label_mode_blockstar_into_expected_pattern: spot-checks the exact pattern string the IT asserts on row 1 of testBrainLabelMode_NotShowNumberedToken: 'BLOCK* NameSystem.addStoredBlock: blockMap updated: <*IP*> is added to blk_<*> size <*>'.
The fixtures match the IT 1:1, so equivalence with the Java implementation on the test surface that motivated this work is enforced at unit test time.
Subsequent milestones wire the algorithm into DataFusion's ScalarUDF / AggregateUDF / WindowUDF APIs and register the opensearch_scalar_functions.yaml signatures so the analytics-engine route can dispatch to it.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
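The adjacent-`<*>` collapse mentioned above is simple enough to sketch in std-only Rust. This is an illustration under assumptions, not the code in `brain.rs`: the token-slice signature is hypothetical (the real function may well operate on a joined pattern string), and only the plain `<*>` wildcard is treated as collapsible.

```rust
/// Collapses every run of adjacent `<*>` tokens into a single `<*>`.
/// Hypothetical signature chosen for the sketch.
fn collapse_continuous_wildcards(tokens: &[&str]) -> Vec<String> {
    const WILDCARD: &str = "<*>";
    let mut out: Vec<String> = Vec::with_capacity(tokens.len());
    for &tok in tokens {
        // Skip a wildcard when the previously emitted token is also a wildcard.
        let prev_is_wildcard = out.last().map_or(false, |s| s.as_str() == WILDCARD);
        if tok == WILDCARD && prev_is_wildcard {
            continue;
        }
        out.push(tok.to_string());
    }
    out
}
```

Non-adjacent wildcards are left alone, so `["a", "<*>", "<*>", "<*>", "b", "<*>"]` becomes `["a", "<*>", "b", "<*>"]`.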
…unctions
Adds patterns/eval.rs with the three entry points the PATTERN_PARSER scalar UDF dispatches between. Mirrors PatternParserFunctionImpl's evalField / evalAgg / evalSamples (197-line Java class).
- eval_field(pattern, field): SIMPLE label mode + show_numbered_token. Parses the wildcard pattern, extracts field substrings into a token map, returns numbered-token rewrite.
- eval_samples(pattern, sample_logs): SIMPLE aggregation mode + show_numbered_token. Token map accumulates across all sample logs.
- eval_agg(field, agg_object, show_numbered_token): BRAIN label mode. Scores each candidate pattern against the preprocessed input tokens, picks the highest-similarity candidate, optionally rewrites to numbered tokens.
26/26 patterns module tests pass. New cases pin equivalence with the CalcitePPLPatternsIT expectations directly:
- eval_field_renames_email_wildcards_to_numbered_tokens: matches testSimplePatternLabelMode_ShowNumberedToken's <token1>@<token2>.<token3> on "amberduke@pyrami.com".
- eval_field_handles_custom_pattern: testSimplePatternLabelModeWithCustomPattern_* with the "amberduke<*>" prefix-anchored template.
- eval_samples_accumulates_tokens_across_samples: matches testSimplePatternAggregationMode_ShowNumberedToken's 3-sample case.
- eval_agg_picks_best_matching_candidate: best-fit similarity scoring against two BRAIN-aggregate candidates.
Next milestone: ScalarUDF wrapper in udf/pattern_parser.rs + opensearch_scalar_functions.yaml entry + ScalarFunction enum + Java adapter so the analytics-engine route can dispatch into these eval functions. Currently unused at runtime.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
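To make the "extracts field substrings into a token map" step concrete, here is a rough std-only Rust sketch. It is not the `eval_field` / `extract_variables` implementation from this PR: the function signature is hypothetical, the token map is simplified to an ordered list of `(name, value)` pairs rather than a map of lists, and wildcards are matched non-greedily against the static text between them, which is an assumption about the matching rule.

```rust
/// Extracts the substrings of `field` that line up with the `<*>` wildcards in
/// `pattern`. Returns an empty list when the static parts do not match.
fn extract_variables(pattern: &str, field: &str) -> Vec<(String, String)> {
    // Static text surrounding the wildcards; there are `statics.len() - 1` wildcards.
    let statics: Vec<&str> = pattern.split("<*>").collect();
    let last = statics.len() - 1;
    // The field must start with the leading static text.
    let mut rest = match field.strip_prefix(statics[0]) {
        Some(r) => r,
        None => return Vec::new(),
    };
    let mut vars = Vec::new();
    for i in 1..=last {
        let lit = statics[i];
        let cut = if i == last {
            // The final wildcard runs up to the trailing static text.
            if !rest.ends_with(lit) {
                return Vec::new();
            }
            rest.len() - lit.len()
        } else {
            // Inner wildcards stop at the next occurrence of the static text.
            match rest.find(lit) {
                Some(p) => p,
                None => return Vec::new(),
            }
        };
        vars.push((format!("token{}", i), rest[..cut].to_string()));
        rest = &rest[cut + lit.len()..];
    }
    vars
}
```

For the pattern `<*>@<*>.<*>` and the field `amberduke@pyrami.com`, the sketch produces `token1 = amberduke`, `token2 = pyrami`, `token3 = com`; a field without `@` produces an empty list.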
…R scalar UDF wiring
Wires the eval functions from milestone 3a into a DataFusion ScalarUDF
and registers all the cross-component plumbing the analytics-engine
route needs to dispatch to it. After this milestone the PPL SIMPLE
patterns + show_numbered_token call shape (evalField / evalSamples) is
reachable from a Calcite RelNode on the analytics-engine path.
- udf/pattern_parser.rs — DataFusion ScalarUDF wrapper. Accepts two
operand shapes:
pattern_parser(VARCHAR, VARCHAR) — evalField
pattern_parser(VARCHAR, List<VARCHAR>) — evalSamples
Return type is Struct<pattern: VARCHAR, tokens: Map<VARCHAR,
List<VARCHAR>>>. The 3-arg evalAgg shape used by BRAIN label mode
goes through a separate path (next milestone).
- opensearch_scalar_functions.yaml — substrait entry,
return type declared as any1 (same convention json_extract_all uses
for its concrete Map return).
- ScalarFunction.PATTERN_PARSER — new enum constant in
analytics-framework.
- PatternParserAdapter — rename adapter (AbstractNameMappingAdapter)
that routes PPL's INTERNAL_PATTERN_PARSER calls to the locally-
declared SqlFunction. The locally-declared operator
is the referent of the FunctionMappings.s entry that gives isthmus
the substrait extension name.
- DataFusionAnalyticsBackendPlugin —
* adapter map: PATTERN_PARSER → new PatternParserAdapter()
* MAP_RETURNING_PROJECT_OPS: + PATTERN_PARSER (its return type is
MAP<VARCHAR, ANY> per UserDefinedFunctionUtils.patternStruct)
- DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS —
FunctionMappings.s(LOCAL_PATTERN_PARSER_OP, "pattern_parser")
Existing 26 patterns module unit tests still pass; new 2 unit tests in
udf::pattern_parser pin the StructArray construction shape:
- struct_data_type_has_pattern_and_tokens_fields
- build_struct_array_populates_pattern_and_tokens_for_email_evalfield
Total Rust crate test pass: 28/28.
Native lib rebuild + IT run-through is the next step in this branch —
to verify the IT pass count moves from 3/14 to ≥ 5/14 (the 2 SIMPLE
patterns+show_numbered_token tests should pass once the runtime
substrait binding succeeds).
- INTERNAL_PATTERN aggregate UDF (3 tests — BRAIN aggregation mode)
- INTERNAL_PATTERN window UDF (3 tests — BRAIN label mode; this UDF is
the input to PATTERN_PARSER's 3-arg evalAgg shape)
- TAKE aggregate nullability fix (1 test)
- 3-arg evalAgg shape in this UDF (depends on window UDF landing)
Signed-off-by: Kai Huang <ahkcs@amazon.com>
… returns
PPL's flattenParsedPattern wraps INTERNAL_ITEM(struct, key) lookups in SAFE_CAST whenever the keyed field needs an explicit declared type. The flatten step targets:
- SAFE_CAST(ITEM(struct, "pattern"), VARCHAR): scalar (covered)
- SAFE_CAST(ITEM(struct, "pattern_count"), BIGINT): scalar (covered)
- SAFE_CAST(ITEM(struct, "tokens"), MAP<VARCHAR,...>): needs MAP entry
- SAFE_CAST(ITEM(struct, "sample_logs"), ARRAY<VAR>): needs ARRAY entry
SAFE_CAST is already in STANDARD_PROJECT_OPS for SUPPORTED_FIELD_TYPES (scalar) returns. Add it to ARRAY_RETURNING_PROJECT_OPS and MAP_RETURNING_PROJECT_OPS so the OpenSearchProjectRule planner check admits the call when its inferred return type is array- or map-shaped.
This is part of the PPL patterns command analytics-engine support stack. Effect on CalcitePPLPatternsIT: the 6 SAFE_CAST 'No backend supports' failures from milestone 3b shift to substrait-binding errors that need the INTERNAL_PATTERN window / aggregate UDFs to fully clear.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
…BRAIN aggregation mode)
Implements the BRAIN aggregate side of PPL's `patterns ... method=BRAIN` on
the analytics-engine route. Mirrors `LogPatternAggFunction` on the SQL plugin
side: collects per-group log lines and runs BRAIN over the corpus at
finalize time, emitting
List<Struct<pattern, pattern_count, tokens, sample_logs>>
so the downstream UNNEST + ITEM projections (PPL's `flattenParsedPattern`)
resolve as named struct-field access.
Wiring (top-down):
- `AggregateFunction` (analytics-framework): adds `PATTERN` enum constant.
Makes `fromNameOrError` case-insensitive — the PPL operator is registered
lower-case ("pattern") whereas the enum constants are upper-case, and the
raw `valueOf("pattern")` lookup was failing.
- `DataFusionAnalyticsBackendPlugin`: declares `AggregateFunction.PATTERN`
in `AGG_FUNCTIONS` so the capability registry advertises it as supported
on the datafusion backend.
- `DataFusionFragmentConvertor`: adds `LOCAL_INTERNAL_PATTERN_OP`
(substrait-bound to `internal_pattern`) and an `ADDITIONAL_AGGREGATE_SIGS`
entry — same pattern as `LOCAL_TAKE_OP` / `LOCAL_FIRST_OP`.
- `PplAggregateCallRewriter`: rewrites PPL `pattern(...)` calls onto
`LOCAL_INTERNAL_PATTERN_OP` and substitutes the call's return type with
the concrete struct shape (PPL's declared `ARRAY<MAP<VARCHAR, ANY>>` has
an embedded `ANY` that isthmus cannot serialize to Substrait).
- `opensearch_aggregate_functions.yaml`: registers `internal_pattern`
with 4-arg, 5-arg, 6-arg, and 1-arg (FINAL) overloads matching the PPL
emitted call shapes (max_sample_count, buffer_limit, show_numbered_token,
plus optional frequency_threshold_percentage and variable_count_threshold).
- `udaf/internal_pattern.rs`: the actual UDAF — variadic_any signature, an
Accumulator that collects log lines and runs `BrainLogParser` at
evaluate() time, with per-shard state shaped as `List<Utf8>` so the
coordinator's FINAL accumulator can concatenate via `merge_batch`. 3
unit tests pin behaviour (empty corpus, repeated-pattern grouping,
cross-shard merge).
Test impact (`CalcitePPLPatternsIT` via analytics-engine route): the three
BRAIN aggregation-mode tests advance past the previous
`AggregateFunction.pattern` enum lookup; they now surface a downstream
"Project rule encountered unmarked child [LogicalCorrelate]" from the
PPL Calcite path's UNNEST after the aggregate (separate planner work to
add Correlate support, not covered by this commit). Window UDF for BRAIN
label mode and the SAFE_CAST nullability fix for SIMPLE aggregation mode
are also pending.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
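The collect-then-evaluate accumulator shape described in this commit can be sketched without DataFusion. Everything below is an assumption made for illustration: `LineAccumulator`, its method names, and `toy_pattern` are hypothetical, the real UDAF implements DataFusion's `Accumulator` trait over Arrow arrays, and the toy pattern function merely stands in for `BrainLogParser`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the UDAF accumulator: per-shard state is just the
/// list of collected log lines, so partial states merge by concatenation.
#[derive(Default)]
struct LineAccumulator {
    lines: Vec<String>,
}

impl LineAccumulator {
    /// Collects one input row.
    fn update(&mut self, line: &str) {
        self.lines.push(line.to_string());
    }

    /// Partial state handed to the coordinator (the `List<Utf8>` analogue).
    fn state(&self) -> Vec<String> {
        self.lines.clone()
    }

    /// Merges another shard's partial state into this accumulator.
    fn merge(&mut self, other_state: Vec<String>) {
        self.lines.extend(other_state);
    }

    /// Runs the pattern function over the whole corpus and groups equal patterns,
    /// returning (pattern, pattern_count, sample_logs) with samples capped.
    fn evaluate<F: Fn(&str) -> String>(
        &self,
        to_pattern: F,
        max_sample_count: usize,
    ) -> Vec<(String, usize, Vec<String>)> {
        let mut groups: BTreeMap<String, (usize, Vec<String>)> = BTreeMap::new();
        for line in &self.lines {
            let entry = groups.entry(to_pattern(line)).or_insert((0, Vec::new()));
            entry.0 += 1;
            if entry.1.len() < max_sample_count {
                entry.1.push(line.clone());
            }
        }
        groups.into_iter().map(|(p, (c, s))| (p, c, s)).collect()
    }
}

/// Toy placeholder for BRAIN: masks every ASCII digit with `#`.
fn toy_pattern(line: &str) -> String {
    line.chars()
        .map(|c| if c.is_ascii_digit() { '#' } else { c })
        .collect()
}
```

Deferring the pattern run to `evaluate` is what makes the cross-shard merge a plain concatenation: the classifier sees the full per-group corpus only once, on the coordinator.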
…LOCAL_ARRAY_AGG_OP
PPL declares TAKE's return type via {@code PPLReturnTypes.ARG0_ARRAY} which
passes {@code nullable=true} to {@code SqlTypeUtil.createArrayType}. The
rewriter clones that nullable type onto the rewritten LOCAL_TAKE_OP call,
but the operator's default {@code ReturnTypes.TO_ARRAY} infers NOT NULL
(because aggregates over a non-empty group can't be null in standard SQL
semantics). The mismatch trips {@code AggregateCall.create}'s validation
with
type mismatch:
aggCall type: VARCHAR ARRAY
inferred type: VARCHAR ARRAY NOT NULL
Surface: CalcitePPLPatternsIT's SIMPLE aggregation tests — the Parse-then-
Aggregate path applies TAKE to the source field, and the field's nullable
typing flows into the aggregate's declared return type. The bug is general
though — any nullable-input TAKE rewritten through the analytics-engine
backend would have hit it as soon as the rewriter's explicit-type path
fired.
Fix: andThen FORCE_NULLABLE on both LOCAL_TAKE_OP and LOCAL_ARRAY_AGG_OP so
the operator's inferred type matches what PPL emits. Mirror of the PPL
side's ARG0_ARRAY.
Test impact: CalcitePPLPatternsIT 3/15 → 4/15 (one SIMPLE aggregation test
unblocked; the show_numbered_token variants still hit the separate
"Unable to convert the type ANY" issue from PATTERN_PARSER's MAP<VARCHAR,
ANY> return type).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
…list<string> overload (partial)
Wires up the PATTERN_PARSER ANY-type fixes that the SIMPLE pattern
show_numbered_token tests need, but does NOT yet bring them green —
the wrapping {@code map_extract} + {@code array_element} chain that
ArrayElementAdapter created (when ITEM was lowered against the
original PPL MAP<VARCHAR, ANY> declared type) keeps its frozen
ANY return type even after PATTERN_PARSER is rewritten to STRUCT.
Substrait's TypeConverter still rejects with "Unable to convert the
type ANY" when it walks the operand types of the wrappers.
Captures the necessary framework:
- `opensearch_scalar_functions.yaml`: adds a second
`pattern_parser(pattern: string, sample_logs: list<string>)`
overload — the SIMPLE aggregation mode emits this call shape from
the PPL Calcite visitor's `showNumberedToken=true` branch, and
without this overload isthmus's ScalarFunctionConverter rejects the
call earlier with "Unable to convert call pattern_parser(string?,
list<string?>?)" before even getting to operand-type validation.
- `PatternParserAdapter`: overrides `adapt` to substitute the PPL
declared MAP<VARCHAR, ANY> return type with the concrete struct
shape `STRUCT<pattern: VARCHAR, tokens: MAP<VARCHAR, ARRAY<VARCHAR>>>`
that matches the Rust UDF's Arrow output. Same pattern as the
PATTERN aggregate's PplAggregateCallRewriter case.
- `ItemTypeRebuilder` (new): pre-isthmus shuttle that walks every
Project / Filter expression tree. Rebuilds ITEM calls so their
return type is re-derived from operand 0 (handles the legacy
ITEM-on-STRUCT case for non-adapted plans), and substitutes any
pattern_parser call whose declared type is still MAP<VARCHAR, ANY>
with the concrete struct. Wired into both
`convertToSubstrait` and `convertStandalone` so attached-wrapper
paths (PARTIAL agg + FINAL agg conversion) get the same rewrite.
Remaining gap (for follow-up): ArrayElementAdapter has already
converted ITEM on MAP into `array_element(map_extract(map, key), 1)`
with ANY-derived return types BEFORE PatternParserAdapter substitutes
the inner struct type. The wrappers stay typed as ANY ARRAY / ANY,
and isthmus rejects them. Two paths to close this:
- SQL plugin: change PATTERN_PARSER's declared return type from
MAP<VARCHAR, ANY> to a concrete struct (touches v2 path's result
materialisation).
- Backend: extend ItemTypeRebuilder to detect the
`array_element(map_extract(STRUCT, key), 1)` anti-pattern and
rewrite to a direct STRUCT field access.
Test impact (CalcitePPLPatternsIT via analytics-engine route): no
change from previous 4/15 pass count — the SIMPLE pattern
show_numbered_token tests still hit the ANY-typed wrapper chain. The
framework is in place; a follow-up commit closes the remaining gap.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Self-check inside ensureMultiShardProvisioned() reads GET /<index>/_settings and asserts the index settings report number_of_shards=3. Makes the multi-shard nature of the aggregation tests provable from the test itself rather than implicit in the DatasetProvisioner.provision() call. Per @marc's review request to verify the multi-shard claim. Signed-off-by: Kai Huang <ahkcs@amazon.com>
…#5467)
* feat(api): add PATTERN_* settings defaults to UnifiedQueryContext
PPL `patterns` command's AstBuilder reads cluster settings for method/mode/max_sample_count/buffer_limit/show_numbered_token defaults when the query omits them. Without these in the analytics-engine path's settings map, the parser reads null, falls into `PatternMethod.valueOf("NULL")`, and every `patterns` query without an explicit `method=` or `mode=` argument fails at parse time with `No enum constant PatternMethod.NULL`.
Mirrors the OpenSearchSettings defaults (SIMPLE_PATTERN / LABEL / 10 / 100000 / false). Part of the analytics-engine route support for the `patterns` command.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* feat(core): emit 4-arg regexp_replace with 'g' flag for SIMPLE patterns
`buildParseRelNode` for `ParseMethod.PATTERNS` lowered through PPL's REPLACE handler, which always emits Calcite's 3-arg `REGEXP_REPLACE_3`. That works on the V2 / Calcite path (Calcite's default is replace-all), but the analytics-engine route converts the call to substrait + DataFusion, and DataFusion's `regexp_replace` defaults to first-match-only without an explicit "g" flag. The dashboard test for `source = bank | patterns email mode=label` returned `<*>@pyrami.com` instead of `<*>@<*>.<*>` because only the first `[a-zA-Z0-9]+` run was replaced.
Bypass the REPLACE handler for the PATTERNS branch and emit `REGEXP_REPLACE_PG_4` directly with a constant "g" flag. Same semantics on V2 / Calcite (Calcite's REGEXP_REPLACE_PG_4 with "g" = replace-all); fixes the analytics-engine path.
CalcitePPLPatternsTest plan-string expectations updated to match the 4-arg form. 17/17 unit tests pass. IT result on analytics-engine route: testSimplePatternLabelMode_NotShowNumberedToken now passes.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* test(integ-test): add CalcitePPLDashboardPatternsIT pinning BRAIN-label dashboard query
OpenSearch Dashboards renders BRAIN-pattern panels with the shape:
patterns ... method=BRAIN mode=label | stats count() as pattern_count, take(message, 1) as sample_logs by patterns_field | sort -pattern_count | fields patterns_field, pattern_count, sample_logs
This integration test pins that shape on the analytics-engine route so regressions surface immediately. Schema-only assertions because BRAIN's clustering output is dataset-version-sensitive; the contract we care about is "the query plans, executes, and returns three columns in the right order".
Currently red end-to-end pending the BRAIN label window-UDF type-cascade fix (see the OpenSearch-side WIP commit "BRAIN window UDF + dashboard query path scaffolding"; the {@code PplWindowCallRewriter} stub documents the remaining gap).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* style: apply spotless formatting
Spotless drift from cherry-picking the analytics-engine patterns work across upstream's recent formatting touch-ups. No behavior change.
Signed-off-by: Kai Huang <huangkaics@gmail.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* test(integ-test): update SIMPLE-patterns explain YAML for 4-arg regexp_replace
CalciteExplainIT's `testPatternsSimplePatternMethodWith{out,AggPushDown}Explain` expected the old 3-arg `REGEXP_REPLACE(...)` form, but after the `feat(core)` commit emits 4-arg `REGEXP_REPLACE(..., 'g':VARCHAR)` the plan output now includes the extra operand both in the logical line and in the base64-encoded compounded script of the physical/pushdown plan. Regenerate both YAML expectations against the live planner.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* fix(opensearch): collapse 4-arg REGEXP_REPLACE_PG_4 'g' to 3-arg at script pushdown
The `feat(core)` commit on this branch lowered PPL `patterns` to a 4-arg `REGEXP_REPLACE_PG_4(field, pattern, replacement, 'g')` so DataFusion (which defaults to first-match-only) does global replacement on the analytics-engine route. Calcite's enumerable runtime (which the V2 / Calcite-pushdown path uses to compile the serialized RexCall into Janino bytecode) has no matching `SqlFunctions.regexpReplace(String, String, String, String)` impl (only `(String, String, String, int[, ...])` variants where the 4th arg is start position, not a flags string). Janino codegen failed with `No applicable constructor/method found` for the 4-arg-with-flags call shape, breaking the patterns.md doctest (`source=apache | patterns message method=simple_pattern mode=aggregation`).
Two complementary fixes:
1. `RexStandardizer.visitCall`: before serializing for pushdown, collapse `REGEXP_REPLACE_PG_4(field, pattern, replacement, 'g')` to the 3-arg `REGEXP_REPLACE_3` form. Safe because Calcite's 3-arg variant is already replace-all (same semantics as PG_4 with `g`). Only fires when the flags literal is exactly `"g"` so any future `i`/`m`/etc. use cases pass through untouched.
2. `ExtendedRelJson.toOp`: pass operand count when looking up an operator on the deserialization side so multi-arity SQL names (REGEXP_REPLACE_3 vs REGEXP_REPLACE_PG_4 vs REGEXP_REPLACE_5 all share `name="REGEXP_REPLACE"`) resolve to the right overload. Defensive: the standardizer fix above is what actually unblocks the doctest, but the resolver was picking by name alone and would have surfaced the same bug for any other overloaded builtin.
Verified locally:
- doctest queries (`patterns ... method=simple_pattern mode=aggregation [...]`) now return fully-tokenized output;
- `CalcitePPLDashboardPatternsIT` still 1/1 PASS;
- `CalcitePPLPatternsIT` still 10/15 with the same five known-pending failures (LogicalCorrelate + `_ShowNumberedToken` BRAIN cases).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* fix(opensearch): revert arity-aware toOp; restore spath/JSON_EXTRACT doctest
The arity filter added to ExtendedRelJson.toOp in the previous commit broke SAFE_CAST → JSON_EXTRACT deserialization (used by `spath` lowering): the PPL JSON_EXTRACT UDF, registered as an anonymous UserDefinedFunctionBuilder subclass, doesn't expose a meaningful getOperandCountRange(), so my filter fell through to the firstKindMatch path and skipped the AvaticaUtils.instantiatePlugin "class" path that previously resolved the UDF. spath.md doctest started returning RuntimeException on `source=structured | spath input=doc_n n | eval n=cast(n as int) | stats sum(n)`.
The RexStandardizer collapse (4-arg `REGEXP_REPLACE_PG_4(..., 'g')` → 3-arg `REGEXP_REPLACE_3`) already fixes the patterns.md doctest at the source side: by the time pushdown serialization runs, no 4-arg call exists for toOp to disambiguate. The arity filter was defensive only and no longer carries its weight; revert toOp to the original first-kind-match behavior, plus a spotless re-flow that came in with the same change.
Verified locally on a fresh cluster:
- spath.md doctest query → returns sum(n)=6 (was 500).
- patterns.md doctest query → returns fully-tokenized aggregation rows.
- CalcitePPLDashboardPatternsIT → 1/1 PASS.
- CalcitePPLPatternsIT → 10/15 PASS (same baseline; same five known-pending BRAIN failures tracked separately).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* style: trim verbose comments per review
Per @penghuo: drop the verbose multi-line explanatory comments and tighten the class/method javadoc on the new dashboard IT.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* test(integ-test): add verifyDataRows to dashboard patterns IT
Per @dai-chen: schema-only verification doesn't catch "query succeeds but returns 0/wrong rows". Pin the 4 BRAIN clusters with their exact patterns, counts, and sample logs against the HDFS_LOGS fixture.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* refactor(core): fuse PATTERNS if-else in buildParseRelNode
Per @dai-chen: the two consecutive `if (PATTERNS)` branches in buildParseRelNode share a condition; merge into a single if/else with each branch fully co-located. Pure refactor: CalcitePPLPatternsTest (logical-plan unit test) passes.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* test(integ-test): include CalcitePPLDashboardPatternsIT in CalciteNoPushdownIT
Per CLAUDE.md guidance, new Calcite IT classes should be added to the no-pushdown suite. Verified locally that the dashboard query also passes with pushdown disabled (Dashboard 1/1, Patterns 10/15, same baseline).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* test(integ-test): regenerate agg-push explain YAML for 3-arg REGEXP_REPLACE
The previous YAML capture pre-dated the RexStandardizer 4-arg → 3-arg collapse landing. With the collapse, the pushed-down compounded script serializes the 3-arg form (SOURCES has 7 entries, no trailing 'g').
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* revert(core): drop SQL-side 'g' flag for patterns; move to DataFusion adapter
Per @penghuo's review: DataFusion-specific concerns shouldn't live in SQL core. The 'g' flag is needed only because DataFusion's regexp_replace defaults to first-match-only; Calcite's 3-arg form is already replace-all on both pushdown and no-pushdown paths.
Restores SQL core, RexStandardizer, the patterns unit test, and the SIMPLE-patterns explain YAMLs to their upstream/main shape. The 'g' flag is appended in opensearch-project/OpenSearch#21797's RegexpReplaceAdapter when converting 3-arg REGEXP_REPLACE to DataFusion.
Same end-user behavior, smaller SQL diff, and the Calcite no-pushdown path no longer diverges from the pushdown YAML.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* test(api): pin UnifiedQueryContext PATTERN_* defaults via planner test
Per @dai-chen: verify the RelNode produced when `patterns <field>` is run without explicit method=/mode= args. This exercises that the PATTERN_METHOD and PATTERN_MODE defaults flow through to AstBuilder.visitPatternsCommand and produce a valid SIMPLE/LABEL lowering with a `patterns_field` projection.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
* style: spotlessApply
Signed-off-by: Kai Huang <ahkcs@amazon.com>
---------
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Signed-off-by: Kai Huang <huangkaics@gmail.com>
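The behavioral gap behind the 'g' flag (first-match-only versus replace-all) can be reproduced without any regex engine. The sketch below is an illustration only: `replace_alnum_runs` and its `global` parameter are hypothetical names, it hand-scans for the `[a-zA-Z0-9]+` character class instead of calling DataFusion's or Calcite's `regexp_replace`, and it is not code from this PR.

```rust
/// Replaces runs of ASCII alphanumerics (the `[a-zA-Z0-9]+` class) with `<*>`.
/// With `global == false` only the first run is replaced, mimicking a
/// first-match-only regexp_replace; with `global == true` every run is replaced.
fn replace_alnum_runs(input: &str, global: bool) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_run = false; // currently inside an alphanumeric run
    let mut replacing = false; // the current run is being replaced
    let mut replaced_any = false; // at least one run has been replaced
    for ch in input.chars() {
        if ch.is_ascii_alphanumeric() {
            if !in_run {
                // Start of a new run: decide once whether to replace it.
                in_run = true;
                replacing = global || !replaced_any;
                if replacing {
                    out.push_str("<*>");
                    replaced_any = true;
                }
            }
            if !replacing {
                out.push(ch);
            }
        } else {
            in_run = false;
            out.push(ch);
        }
    }
    out
}
```

On `amberduke@pyrami.com`, the first-match-only mode gives `<*>@pyrami.com` and the global mode gives `<*>@<*>.<*>`, the two outputs contrasted in the `feat(core)` commit message above.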
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit f207f61.
Per @mch2's nit: pull the kind→name fallback out of OpenSearchProjectRule into a single WindowFunction.resolveFunction(SqlOperator) so future backends pick up the OTHER-kind name-lookup logic without copy-paste. Three unit tests for the new method cover sql-kind hit, OTHER-kind name fallback (PATTERN), and unknown-operator returning null. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @mch2's review: the prior wording implied SINGLE-on-SINGLETON was "the only viable choice on a single shard anyway", which is wrong now that PatternsCommandIT exercises this path on a 3-shard index. Restate: SINGLE-on-SINGLETON also runs correctly on multi-shard (gather to the coordinator, then aggregate), it just trades distributed parallelism for the type-mismatch workaround. Distributed parallelism is still the follow-up. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Persistent review updated to latest commit 5bee96e
Persistent review updated to latest commit ff3830f
…rop ad-hoc check

Per @mch2's review: PERCENTILE_APPROX is semantically a STATE_EXPANDING aggregate (per-key t-digest state grows with input cardinality), exactly like PERCENTILE_CONT and PERCENTILE_DISC, which are already in the enum. The "right spot" for it is the AggregateFunction registry, not an ad-hoc string check in OpenSearchAggregateSplitRule.

Changes:
- AggregateFunction.PERCENTILE_APPROX(Type.STATE_EXPANDING, SqlKind.OTHER).
- OpenSearchAggregateSplitRule: replace the isPercentileApprox + separate STATE_EXPANDING checks with a single isStateExpanding(SqlAggFunction) helper that handles the unregistered-op throw gracefully (returns false rather than crashing the planner; this fixes a latent issue where my earlier STATE_EXPANDING addition would have crashed on truly unknown aggs).
- Javadoc refreshed to describe the STATE_EXPANDING category rather than calling out percentile_approx as a one-off.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
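A rough Java sketch of the "return false instead of crashing" shape this commit describes. Everything here is a simplified assumption: the string-keyed `REGISTRY`, the `lookup` method, and the `IllegalStateException` it throws are hypothetical stand-ins for the real `AggregateFunction` registry and whatever exception it raises for unregistered operators.

```java
import java.util.Map;

/** Illustrative sketch; registry contents and exception type are assumptions. */
final class StateExpandingSketch {

    enum Type { SIMPLE, STATE_EXPANDING }

    /** Hypothetical registry keyed by aggregate name. */
    private static final Map<String, Type> REGISTRY = Map.of(
        "SUM", Type.SIMPLE,
        "PERCENTILE_CONT", Type.STATE_EXPANDING,
        "PERCENTILE_DISC", Type.STATE_EXPANDING,
        "PERCENTILE_APPROX", Type.STATE_EXPANDING);

    /** Strict lookup: throws for names that are not registered. */
    static Type lookup(String name) {
        Type type = REGISTRY.get(name);
        if (type == null) {
            throw new IllegalStateException("Unregistered aggregate: " + name);
        }
        return type;
    }

    /**
     * Single helper used by the split decision: unknown aggregates are
     * treated as not state-expanding rather than failing planning.
     */
    static boolean isStateExpanding(String name) {
        try {
            return lookup(name) == Type.STATE_EXPANDING;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

With this shape, `isStateExpanding("PERCENTILE_APPROX")` is true through the registry alone, and an aggregate the registry has never seen simply falls through to the normal split path.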
Persistent review updated to latest commit 4c093f4
Drop AI-generated explanatory blocks; keep terse WHY-only notes where context isn't obvious from the code itself. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Persistent review updated to latest commit 6058798
Persistent review updated to latest commit 111578e
Persistent review updated to latest commit 6653bde
…olver Required by sandbox-check's missingJavadoc task on public types. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Persistent review updated to latest commit 87a268c
Persistent review updated to latest commit 4c623c3
Persistent review updated to latest commit 216ae28
What

Wires PPL `patterns` (BRAIN + SIMPLE methods, label + aggregation modes) through the analytics-engine route, including the OpenSearch Dashboards BRAIN-label panel query.

Companion to opensearch-project/sql#5467 (`UnifiedQueryContext` PATTERN_* defaults + dashboard IT + planner unit test). Both PRs are required.

Changes

Rust / DataFusion (`sandbox/plugins/analytics-backend-datafusion`)
- `PATTERN_PARSER` UDF wiring).
- `internal_pattern` window UDF (BRAIN label mode) + aggregate UDF (BRAIN aggregation mode).
- `pattern_parser_get_pattern` / `pattern_parser_get_tokens`: workaround for DataFusion's substrait consumer's "Direct reference StructField with child not supported".
- `SAFE_CAST` / `ITEM` overloads for ARRAY + MAP returns; `FORCE_NULLABLE` on `LOCAL_TAKE_OP` / `LOCAL_ARRAY_AGG_OP`.

Analytics-engine planner (`sandbox/plugins/analytics-engine`)
- `PplWindowCallRewriter`: bottom-up RelNode shuttle that retypes `RexInputRef`s before `copy()` so the BRAIN window-UDF result type propagates through enclosing Projects.
- `OpenSearchAggregateSplitRule`: skip PARTIAL/FINAL split when STATE_EXPANDING aggregates (TAKE/FIRST/LAST/LIST/VALUES) are present (avoids argList-shift crash on `take(field, 1)` in stats).
- `ArrowValues.toJavaValue`: recursively unwrap Arrow `Text` so nested `Map<String, List<String>>` returns hand the SQL layer pure Java types.

DataFusion `regexp_replace` adapter (`RegexpReplaceAdapter`)
- Appends the `"g"` flag to every 3-arg `REGEXP_REPLACE` going to DataFusion. DataFusion's `regexp_replace` defaults to first-match-only; Calcite's 3-arg form is already replace-all. The append preserves the PPL contract on both backends, so no SQL-core knowledge of DataFusion semantics is required (the companion change in "feat(ppl): wire patterns command for analytics-engine dashboard route", sql#5467, keeps SQL core on plain 3-arg).

Capabilities + substrait wiring
- `WindowFunction.PATTERN`, `AggregateFunction.PATTERN`.
- `opensearch_window_functions.yaml` registers `internal_pattern` window-UDF variants.
- `DataFusionFragmentConvertor` operator declarations + `ADDITIONAL_*_SIGS` for isthmus binding.

Results
- `CalcitePPLDashboardPatternsIT` (new)
- `CalcitePPLPatternsIT` (analytics-engine route)

Remaining 5 / 15: 3 × `BrainAggregationMode_*` (planner needs `LogicalCorrelate` post-aggregate support) + 2 × `Brain*Mode_ShowNumberedToken` (substrait type wiring for `_ShowNumberedToken`). Tracked separately; none affect the dashboard query.
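For the `ArrowValues.toJavaValue` change listed under the planner changes, the recursive unwrapping can be sketched as below. This is an assumption-laden illustration, not the plugin's code: the nested `Text` class is a hypothetical stand-in for Arrow's text wrapper, and the sketch only handles `List` and `Map` containers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch; not the actual ArrowValues implementation. */
final class UnwrapSketch {

    /** Hypothetical stand-in for Arrow's text wrapper type. */
    static final class Text {
        private final String value;

        Text(String value) {
            this.value = value;
        }

        @Override
        public String toString() {
            return value;
        }
    }

    /** Recursively replace Text wrappers with plain Strings. */
    static Object toJavaValue(Object value) {
        if (value instanceof Text) {
            return value.toString();
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(toJavaValue(item));
            }
            return out;
        }
        if (value instanceof Map) {
            // LinkedHashMap keeps the iteration order of the source map.
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                out.put(toJavaValue(entry.getKey()), toJavaValue(entry.getValue()));
            }
            return out;
        }
        // Anything else is assumed to already be a plain Java value.
        return value;
    }
}
```

Recursing into both keys and values is what lets a nested `Map<String, List<String>>` come out free of wrapper objects at every level.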