feat(ppl): wire patterns command for analytics-engine dashboard route#5467
Conversation
PR Reviewer Guide 🔍(Review updated until commit 42799d4)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 42799d4 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit b830582
Suggestions up to commit 975f8cc
Suggestions up to commit 172916b
Suggestions up to commit f952e2b
Suggestions up to commit edc98f6
|
|
Persistent review updated to latest commit ff989ea |
ff989ea to
12e0524
Compare
|
Persistent review updated to latest commit 12e0524 |
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit 172916b.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
|
Persistent review updated to latest commit 394cb5a |
|
Persistent review updated to latest commit 8720ca7 |
| // PPL `patterns` command defaults — mirror the cluster-side defaults registered in | ||
| // OpenSearchSettings (DEFAULT_PATTERN_METHOD_SETTING etc.). Without these the | ||
| // analytics-engine path's AstBuilder.visitPatternsCommand reads null from | ||
| // `settings.getSettingValue(Key.PATTERN_METHOD)`, fails with | ||
| // `PatternMethod.valueOf("NULL")` IllegalArgumentException, and every query that | ||
| // omits an explicit `method=` / `mode=` argument is rejected. |
There was a problem hiding this comment.
delete all these "AI" comments.
There was a problem hiding this comment.
Updated, trimmed verbose comments
| // Emit `regexp_replace(field, pattern, replacement, "g")` directly so the replacement | ||
| // is global (every match replaced). DataFusion's `regexp_replace` defaults to FIRST | ||
| // match only without the "g" flag — using the 3-arg form via the REPLACE handler | ||
| // produces `<*>@pyrami.com` instead of `<*>@<*>.<*>` on the analytics-engine route. | ||
| // Calcite's REGEXP_REPLACE_PG_4 with "g" matches what `replaceAll` does, so V2 / | ||
| // Calcite-path semantics are preserved. |
There was a problem hiding this comment.
Currently, PPL use regex_replace(field, pattern, replacement) without 'g' , right?
the propsoal is add 'g' paramater? It is does not change PPL behavoiur, right, becuase default is 'g'?
There was a problem hiding this comment.
Currently, PPL use regex_replace(field, pattern, replacement) without 'g' , right?
Yes
Calcite's 3-arg REGEXP_REPLACE is already replace-all, however data fusion's REGEXP_REPLACE replaces only the first match, which causes mismatch. My proposal is to add this g parameter to make the behavior replace-all for both engines.
So our PPL behavior would not change because we will later collapse the 4-arg to 3-arg in Calcite-path in RexStandardizer.java
There was a problem hiding this comment.
After taking another look, this approach can only cover the pushdown path, the no-pushdown path will be broken because of our 4-arg change
Updated the approach, the 'g' flag is now appended in OpenSearch's RegexpReplaceAdapter only when sending to DataFusion
| // PPL's `patterns` lowering emits 4-arg `REGEXP_REPLACE_PG_4(field, pattern, replacement, | ||
| // 'g')` so the analytics-engine/DataFusion path gets global replacement (DataFusion's | ||
| // `regexp_replace` defaults to first-match-only without an explicit flag). Calcite's | ||
| // 3-arg `REGEXP_REPLACE_3` is already replace-all in its enumerable runtime, but the PG_4 | ||
| // form has no matching `SqlFunctions.regexpReplace(String, String, String, String)` impl | ||
| // — Calcite's runtime only ships the `(String, String, String, int[, ...])` shapes (the | ||
| // 4-arg variant treats the 4th arg as start-position, not a flags string). The script | ||
| // pushdown path codegen would fail with `No applicable constructor/method found`. Collapse | ||
| // the 4-arg call to the 3-arg form whenever the flags literal is exactly `"g"`, which | ||
| // preserves replace-all semantics on the V2 / Calcite-pushdown side. | ||
| if (call.op == SqlLibraryOperators.REGEXP_REPLACE_PG_4 | ||
| && call.operands.size() == 4 | ||
| && call.operands.get(3) instanceof RexLiteral flagsLit | ||
| && "g".equals(flagsLit.getValueAs(String.class))) { | ||
| return helper.rexBuilder | ||
| .makeCall( | ||
| call.getType(), | ||
| SqlLibraryOperators.REGEXP_REPLACE_3, | ||
| call.operands.subList(0, 3)) | ||
| .accept(this, helper); | ||
| } |
There was a problem hiding this comment.
Why only this function need sepecial handling?
There was a problem hiding this comment.
The purpose is specifically to let the 4-arg REGEXP_REPLACE survive the pushdown path.
3-arg has a matching SqlFunctions.regexpReplace(String, String, String) Java method — Janino binds, pushdown works. 4-arg-with-flags doesn't: Calcite's SqlFunctions ships regexpReplace(String, String, String, int) (position) but no (String, String, String, String) (flags-as-string) variant. The operator is registered, the Java impl is missing — that's the gap we rewrite around. We collapse the 4-arg to 3-arg in this case
There was a problem hiding this comment.
After taking another look, this approach can only cover the pushdown path, the no-pushdown path will be broken because of our 4-arg change
Updated the approach, the g flag is now appended in OpenSearch's RegexpReplaceAdapter only when sending to DataFusion
|
Persistent review updated to latest commit c42fb7c |
|
Persistent review updated to latest commit 26b2d69 |
|
Persistent review updated to latest commit 172916b |
…ter for 3-arg calls DataFusion's regexp_replace defaults to first-match-only without an explicit flag; Calcite's REGEXP_REPLACE_3 is already replace-all. PPL relies on the Calcite contract (every match replaced — used by SIMPLE patterns, regex_replace, rex mode=sed), so on the DataFusion path the adapter now rewrites every 3-arg REGEXP_REPLACE_3 to 4-arg REGEXP_REPLACE_PG_4(..., "g") preserving the same end-user semantics across backends. Companion change in opensearch-project/sql#5467: the SQL core no longer emits the 'g' flag itself — that DataFusion-specific concern now lives only here. Two existing unit tests updated to expect the new always-global behavior (testAdaptPassesThroughWhenNoQuoteBlock → testAdaptAppendsGlobalFlagFor3Arg, testAdaptPassesThroughNonLiteralPattern → testAdaptAppendsGlobalFlagForNonLiteralPattern). The \Q rewrite, $N backreference rewrite, and 4-arg-with-flags pass-through are unchanged. Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Persistent review updated to latest commit 975f8cc |
|
Persistent review updated to latest commit 03ed5e9 |
PPL `patterns` command's AstBuilder reads cluster settings for method/mode/
max_sample_count/buffer_limit/show_numbered_token defaults when the query
omits them. Without these in the analytics-engine path's settings map, the
parser reads null, falls into `PatternMethod.valueOf("NULL")`, and every
`patterns` query without an explicit `method=` or `mode=` argument fails at
parse time with `No enum constant PatternMethod.NULL`.
Mirrors the OpenSearchSettings defaults (SIMPLE_PATTERN / LABEL / 10 /
100000 / false). Part of the analytics-engine route support for the
`patterns` command.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
`buildParseRelNode` for `ParseMethod.PATTERNS` lowered through PPL's REPLACE handler, which always emits Calcite's 3-arg `REGEXP_REPLACE_3`. That works on the V2 / Calcite path (Calcite's default is replace-all), but the analytics- engine route converts the call to substrait + DataFusion, and DataFusion's `regexp_replace` defaults to first-match-only without an explicit "g" flag. The dashboard test for `source = bank | patterns email mode=label` returned `<*>@pyrami.com` instead of `<*>@<*>.<*>` because only the first `[a-zA-Z0-9]+` run was replaced. Bypass the REPLACE handler for the PATTERNS branch and emit `REGEXP_REPLACE_PG_4` directly with a constant "g" flag. Same semantics on V2 / Calcite (Calcite's REGEXP_REPLACE_PG_4 with "g" = replace-all); fixes the analytics-engine path. CalcitePPLPatternsTest plan-string expectations updated to match the 4-arg form. 17/17 unit tests pass. IT result on analytics-engine route: testSimplePatternLabelMode_NotShowNumberedToken now passes. Signed-off-by: Kai Huang <ahkcs@amazon.com>
…el dashboard query
OpenSearch Dashboards renders BRAIN-pattern panels with the shape:
patterns ... method=BRAIN mode=label
| stats count() as pattern_count, take(message, 1) as sample_logs
by patterns_field
| sort -pattern_count
| fields patterns_field, pattern_count, sample_logs
This integration test pins that shape on the analytics-engine route so
regressions surface immediately. Schema-only assertions because BRAIN's
clustering output is dataset-version-sensitive — the contract we care about
is "the query plans, executes, and returns three columns in the right order".
Currently red end-to-end pending the BRAIN label window-UDF type-cascade
fix (see the OpenSearch-side WIP commit "BRAIN window UDF + dashboard
query path scaffolding" — the {@code PplWindowCallRewriter} stub
documents the remaining gap).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Spotless drift from cherry-picking the analytics-engine patterns work across upstream's recent formatting touch-ups. No behavior change. Signed-off-by: Kai Huang <huangkaics@gmail.com> Signed-off-by: Kai Huang <ahkcs@amazon.com>
…p_replace
CalciteExplainIT's `testPatternsSimplePatternMethodWith{out,AggPushDown}Explain`
expected the old 3-arg `REGEXP_REPLACE(...)` form, but after the `feat(core)`
commit emits 4-arg `REGEXP_REPLACE(..., 'g':VARCHAR)` the plan output now
includes the extra operand both in the logical line and in the base64-encoded
compounded script of the physical/pushdown plan.
Regenerate both YAML expectations against the live planner.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
…cript pushdown The `feat(core)` commit on this branch lowered PPL `patterns` to a 4-arg `REGEXP_REPLACE_PG_4(field, pattern, replacement, 'g')` so DataFusion (which defaults to first-match-only) does global replacement on the analytics-engine route. Calcite's enumerable runtime — which the V2 / Calcite-pushdown path uses to compile the serialized RexCall into Janino bytecode — has no matching `SqlFunctions.regexpReplace(String, String, String, String)` impl (only `(String, String, String, int[, ...])` variants where the 4th arg is start position, not a flags string). Janino codegen failed with `No applicable constructor/method found` for the 4-arg-with-flags call shape, breaking the patterns.md doctest (`source=apache | patterns message method=simple_pattern mode=aggregation`). Two complementary fixes: 1. `RexStandardizer.visitCall`: before serializing for pushdown, collapse `REGEXP_REPLACE_PG_4(field, pattern, replacement, 'g')` to the 3-arg `REGEXP_REPLACE_3` form. Safe because Calcite's 3-arg variant is already replace-all (same semantics as PG_4 with `g`). Only fires when the flags literal is exactly `"g"` so any future `i`/`m`/etc. use cases pass through untouched. 2. `ExtendedRelJson.toOp`: pass operand count when looking up an operator on the deserialization side so multi-arity SQL names (REGEXP_REPLACE_3 vs REGEXP_REPLACE_PG_4 vs REGEXP_REPLACE_5 all share `name="REGEXP_REPLACE"`) resolve to the right overload. Defensive — the standardizer fix above is what actually unblocks the doctest, but the resolver was picking by name alone and would have surfaced the same bug for any other overloaded builtin. Verified locally: - doctest queries (`patterns ... method=simple_pattern mode=aggregation [...]`) now return fully-tokenized output; - `CalcitePPLDashboardPatternsIT` still 1/1 PASS; - `CalcitePPLPatternsIT` still 10/15 with the same five known-pending failures (LogicalCorrelate + `_ShowNumberedToken` BRAIN cases). Signed-off-by: Kai Huang <ahkcs@amazon.com>
…doctest The arity filter added to ExtendedRelJson.toOp in the previous commit broke SAFE_CAST → JSON_EXTRACT deserialization (used by `spath` lowering): the PPL JSON_EXTRACT UDF, registered as an anonymous UserDefinedFunctionBuilder subclass, doesn't expose a meaningful getOperandCountRange(), so my filter fell through to the firstKindMatch path and skipped the AvaticaUtils.instantiatePlugin "class" path that previously resolved the UDF. spath.md doctest started returning RuntimeException on `source=structured | spath input=doc_n n | eval n=cast(n as int) | stats sum(n)`. The RexStandardizer collapse (4-arg `REGEXP_REPLACE_PG_4(..., 'g')` → 3-arg `REGEXP_REPLACE_3`) already fixes the patterns.md doctest at the source side — by the time pushdown serialization runs, no 4-arg call exists for toOp to disambiguate. The arity filter was defensive only and no longer carries its weight; revert toOp to the original first-kind-match behavior, plus a spotless re-flow that came in with the same change. Verified locally on a fresh cluster: - spath.md doctest query → returns sum(n)=6 (was 500). - patterns.md doctest query → returns fully-tokenized aggregation rows. - CalcitePPLDashboardPatternsIT → 1/1 PASS. - CalcitePPLPatternsIT → 10/15 PASS (same baseline; same five known-pending BRAIN failures tracked separately). Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @penghuo: drop the verbose multi-line explanatory comments and tighten the class/method javadoc on the new dashboard IT. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @dai-chen: schema-only verification doesn't catch "query succeeds but returns 0/wrong rows". Pin the 4 BRAIN clusters with their exact patterns, counts, and sample logs against the HDFS_LOGS fixture. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @dai-chen: the two consecutive `if (PATTERNS)` branches in buildParseRelNode share a condition; merge into a single if/else with each branch fully co-located. Pure refactor — CalcitePPLPatternsTest (logical-plan unit test) passes. Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ushdownIT Per CLAUDE.md guidance, new Calcite IT classes should be added to the no-pushdown suite. Verified locally that the dashboard query also passes with pushdown disabled (Dashboard 1/1, Patterns 10/15 — same baseline). Signed-off-by: Kai Huang <ahkcs@amazon.com>
…EPLACE The previous YAML capture pre-dated the RexStandardizer 4-arg → 3-arg collapse landing. With the collapse, the pushed-down compounded script serializes the 3-arg form (SOURCES has 7 entries, no trailing 'g'). Signed-off-by: Kai Huang <ahkcs@amazon.com>
… adapter Per @penghuo's review: DataFusion-specific concerns shouldn't live in SQL core. The 'g' flag is needed only because DataFusion's regexp_replace defaults to first-match-only — Calcite's 3-arg form is already replace-all on both pushdown and no-pushdown paths. Restores SQL core, RexStandardizer, the patterns unit test, and the SIMPLE- patterns explain YAMLs to their upstream/main shape. The 'g' flag is appended in opensearch-project/OpenSearch#21797's RegexpReplaceAdapter when converting 3-arg REGEXP_REPLACE to DataFusion. Same end-user behavior, smaller SQL diff, and the Calcite no-pushdown path no longer diverges from the pushdown YAML. Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @dai-chen: verify the RelNode produced when `patterns <field>` is run without explicit method=/mode= args — exercises that the PATTERN_METHOD and PATTERN_MODE defaults flow through to AstBuilder.visitPatternsCommand and produce a valid SIMPLE/LABEL lowering with a `patterns_field` projection. Signed-off-by: Kai Huang <ahkcs@amazon.com>
03ed5e9 to
b830582
Compare
…ter for 3-arg calls DataFusion's regexp_replace defaults to first-match-only without an explicit flag; Calcite's REGEXP_REPLACE_3 is already replace-all. PPL relies on the Calcite contract (every match replaced — used by SIMPLE patterns, regex_replace, rex mode=sed), so on the DataFusion path the adapter now rewrites every 3-arg REGEXP_REPLACE_3 to 4-arg REGEXP_REPLACE_PG_4(..., "g") preserving the same end-user semantics across backends. Companion change in opensearch-project/sql#5467: the SQL core no longer emits the 'g' flag itself — that DataFusion-specific concern now lives only here. Two existing unit tests updated to expect the new always-global behavior (testAdaptPassesThroughWhenNoQuoteBlock → testAdaptAppendsGlobalFlagFor3Arg, testAdaptPassesThroughNonLiteralPattern → testAdaptAppendsGlobalFlagForNonLiteralPattern). The \Q rewrite, $N backreference rewrite, and 4-arg-with-flags pass-through are unchanged. Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Persistent review updated to latest commit b830582 |
Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Persistent review updated to latest commit 42799d4 |
…21797) * [Analytics Backend / DataFusion] Patterns milestone 1: pure-logic Rust scaffold Starts the port of PPL `patterns` command's algorithm layer to native Rust so the BRAIN method and `PATTERN_PARSER` scalar can execute on the analytics-engine route. The end goal is `CalcitePPLPatternsIT` passing against parquet-backed indices. A new `patterns/` module containing the pure-logic layer (no DataFusion dependency, fully unit-testable in isolation): - `preprocess.rs` — Java `BrainLogParser.preprocess(...)` port. Regex-based variable detection (IP, ISO datetime, UUID, hex/letter-digit/floats), delimiter normalization, whitespace splitting. The "generic number surrounded by non-alphanumeric" rule is implemented manually because Rust's `regex` crate doesn't support lookaround. - `utils.rs` — Java `PatternUtils` port. `parse_pattern`, `extract_variables`, `ParseResult` with `to_token_order_string`. `WILDCARD_PATTERN` and `TOKEN_PATTERN` regexes are equivalent to Java's `Pattern.compile` strings. - `brain.rs` — `BrainLogParser` struct skeleton, `collapse_continuous_wildcards`, `PatternEntry` / `BrainParseStats` typed result types. The classifier internals (histogram, group token set, `parse_log_pattern`) are deliberate stubs in this milestone — they land in milestone 2. - `tokens.rs` — `PatternResult` typed view of the per-row / per-group result the UDF layer will materialize into Arrow. 14 unit tests pass (1 `#[ignore]` placeholder for the full BRAIN classification once the algorithm port lands): - `preprocess_simple_log_line_splits_on_whitespace` - `preprocess_substitutes_ip_then_blk_number` — matches the expected preprocessed shape from `testBrainLabelMode_NotShowNumberedToken` - `preprocess_substitutes_uuid` — matches `testBrainParseWithUUID_*` - `preprocess_collapses_consecutive_wildcards_via_number_runs` - `parse_wildcard_pattern_splits_on_email_separators` — `<*>@<*>.<*>` from `testSimplePatternLabelMode_*` - `to_token_order_string_rewrites_wildcards_to_numbered_tokens` — produces `<token1>@<token2>.<token3>` - `extract_variables_extracts_email_parts` — matches `testSimplePatternLabelMode_ShowNumberedToken` ImmutableMap expectation - `extract_variables_handles_multi_sample_aggregation` — matches `testBrainAggregationMode_ShowNumberedToken` multi-sample tokens.list - `extract_variables_returns_empty_when_static_mismatch` - `token_pattern_matches_numbered_placeholders` - `collapse_three_consecutive_wildcards` / variants 1. Full BRAIN classifier port — histogram + group-token-set + parse_log_pattern. 2. DataFusion ScalarUDF wrapper for PATTERN_PARSER (`udf/pattern_parser.rs`). 3. DataFusion AggregateUDF wrapper for INTERNAL_PATTERN (aggregation mode). 4. DataFusion WindowUDF wrapper for INTERNAL_PATTERN (label mode). 5. Substrait YAML signatures for `pattern_parser` and `internal_pattern`. 6. Java adapters in analytics-backend-datafusion + ScalarFunction enum + capability registration. 7. `PatternsCommandIT` mirroring `CalcitePPLPatternsIT` against parquet-backed indices. 8. Verification via `:integTestRemote -Dtests.analytics.force_routing=true`. This milestone is a no-op at runtime — `patterns/` is unwired. Lands as the algorithmic foundation for the work above. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Register ITEM for ARRAY and MAP returns PPL `patterns` lowers its result-struct flatten via `INTERNAL_ITEM(struct, "sample_logs")` and `INTERNAL_ITEM(struct, "tokens")`, where `sample_logs` returns `ARRAY<VARCHAR>` and `tokens` returns `MAP<VARCHAR, ARRAY<VARCHAR>>`. The scalar form of ITEM was already in STANDARD_PROJECT_OPS (with SUPPORTED_FIELD_TYPES — covers VARCHAR / numeric returns), but the ARRAY- and MAP-returning shapes weren't registered, so OpenSearchProjectRule rejected the call with `No backend supports scalar function [ITEM] among [datafusion]`. Adds ITEM to both ARRAY_RETURNING_PROJECT_OPS and MAP_RETURNING_PROJECT_OPS. Part of the PPL `patterns` command analytics-engine support stack. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Patterns milestone 2: full BRAIN classifier port Replaces the stub in patterns/brain.rs with a full port of BrainLogParser.java from the SQL plugin's common/src/main/java/org/opensearch/sql/common/patterns/. ## What lands - preprocess_all_logs — tokenizes each line with default filters (IP/datetime/UUID/numbers), appends a synthetic logId token, and updates token_freq_map. - process_token_histogram — positional token-frequency counter. - calculate_group_token_freq — picks the representative WordCombination (sorted by same_freq_count desc, then word_freq desc) per row and populates the per-(tokens_len,candidate,position) group token set. - parse_log_pattern — per-row classifier. Tokens whose frequency > repFreq are kept ONLY if unique in their group; tokens with frequency < repFreq are kept ONLY if the group has fewer than variable_count_threshold variants. Everything else becomes <*>. - parse_all_log_patterns — full pipeline. Group equal pattern strings together, count occurrences, collect samples up to max_sample_count. - WordCombination — typed pair with the Java compareTo ordering. - collapse_continuous_wildcards — adjacent-<*> collapse, unchanged from milestone 1. ## Tests (18/18 pass) Existing 14 unit tests from milestone 1 still pass. New tests cover the BRAIN classifier directly with fixtures from CalcitePPLPatternsIT: - brain_groups_verification_succeeded_lines — two 'Verification succeeded' lines collapse to one pattern with 2 samples. - brain_aggregates_hdfs_fixtures_into_four_groups — the 8-line HDFS fixture matches the IT's testBrainAggregationMode_NotShowNumberedToken expectation: exactly 4 patterns, every group has count == 2. - brain_aggregates_brain_label_mode_blockstar_into_expected_pattern — spot-checks the exact pattern string the IT asserts on row 1 of testBrainLabelMode_NotShowNumberedToken: 'BLOCK* NameSystem.addStoredBlock: blockMap updated: <*IP*> is added to blk_<*> size <*>'. The fixtures match the IT 1:1 — equivalence with the Java implementation on the test surface that motivated this work is enforced at unit test time. Subsequent milestones wire the algorithm into DataFusion's ScalarUDF / AggregateUDF / WindowUDF APIs and registers the opensearch_scalar_functions.yaml signatures so the analytics-engine route can dispatch to it. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Patterns milestone 3a: scalar eval functions Adds patterns/eval.rs with the three entry points the PATTERN_PARSER scalar UDF dispatches between. Mirrors PatternParserFunctionImpl's evalField / evalAgg / evalSamples (197-line Java class). - eval_field(pattern, field) — SIMPLE label mode + show_numbered_token. Parses the wildcard pattern, extracts field substrings into a token map, returns numbered-token rewrite. - eval_samples(pattern, sample_logs) — SIMPLE aggregation mode + show_numbered_token. Token map accumulates across all sample logs. - eval_agg(field, agg_object, show_numbered_token) — BRAIN label mode. Scores each candidate pattern against the preprocessed input tokens, picks the highest-similarity candidate, optionally rewrites to numbered tokens. 26/26 patterns module tests pass. New cases pin equivalence with the CalcitePPLPatternsIT expectations directly: - eval_field_renames_email_wildcards_to_numbered_tokens — matches testSimplePatternLabelMode_ShowNumberedToken's <token1>@<token2>.<token3> on "amberduke@pyrami.com". - eval_field_handles_custom_pattern — testSimplePatternLabelModeWithCustomPattern_* with the "amberduke<*>" prefix-anchored template. - eval_samples_accumulates_tokens_across_samples — matches testSimplePatternAggregationMode_ShowNumberedToken's 3-sample case. - eval_agg_picks_best_matching_candidate — best-fit similarity scoring against two BRAIN-aggregate candidates. Next milestone: ScalarUDF wrapper in udf/pattern_parser.rs + opensearch_scalar_functions.yaml entry + ScalarFunction enum + Java adapter so the analytics-engine route can dispatch into these eval functions. Currently unused at runtime. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Patterns milestone 3b: PATTERN_PARSER scalar UDF wiring Wires the eval functions from milestone 3a into a DataFusion ScalarUDF and registers all the cross-component plumbing the analytics-engine route needs to dispatch to it. After this milestone the PPL SIMPLE patterns + show_numbered_token call shape (evalField / evalSamples) is reachable from a Calcite RelNode on the analytics-engine path. - udf/pattern_parser.rs — DataFusion ScalarUDF wrapper. Accepts two operand shapes: pattern_parser(VARCHAR, VARCHAR) — evalField pattern_parser(VARCHAR, List<VARCHAR>) — evalSamples Return type is Struct<pattern: VARCHAR, tokens: Map<VARCHAR, List<VARCHAR>>>. The 3-arg evalAgg shape used by BRAIN label mode goes through a separate path (next milestone). - opensearch_scalar_functions.yaml — substrait entry, return type declared as any1 (same convention json_extract_all uses for its concrete Map return). - ScalarFunction.PATTERN_PARSER — new enum constant in analytics-framework. - PatternParserAdapter — rename adapter (AbstractNameMappingAdapter) that routes PPL's INTERNAL_PATTERN_PARSER calls to the locally- declared SqlFunction. The locally-declared operator is the referent of the FunctionMappings.s entry that gives isthmus the substrait extension name. - DataFusionAnalyticsBackendPlugin — * adapter map: PATTERN_PARSER → new PatternParserAdapter() * MAP_RETURNING_PROJECT_OPS: + PATTERN_PARSER (its return type is MAP<VARCHAR, ANY> per UserDefinedFunctionUtils.patternStruct) - DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS — FunctionMappings.s(LOCAL_PATTERN_PARSER_OP, "pattern_parser") Existing 26 patterns module unit tests still pass; new 2 unit tests in udf::pattern_parser pin the StructArray construction shape: - struct_data_type_has_pattern_and_tokens_fields - build_struct_array_populates_pattern_and_tokens_for_email_evalfield Total Rust crate test pass: 28/28. Native lib rebuild + IT run-through is the next step in this branch — to verify the IT pass count moves from 3/14 to ≥ 5/14 (the 2 SIMPLE patterns+show_numbered_token tests should pass once the runtime substrait binding succeeds). - INTERNAL_PATTERN aggregate UDF (3 tests — BRAIN aggregation mode) - INTERNAL_PATTERN window UDF (3 tests — BRAIN label mode; this UDF is the input to PATTERN_PARSER's 3-arg evalAgg shape) - TAKE aggregate nullability fix (1 test) - 3-arg evalAgg shape in this UDF (depends on window UDF landing) Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Register SAFE_CAST for ARRAY and MAP returns PPL's flattenParsedPattern wraps INTERNAL_ITEM(struct, key) lookups in SAFE_CAST whenever the keyed field needs an explicit declared type. The flatten step targets: - SAFE_CAST(ITEM(struct, "pattern"), VARCHAR) — scalar (covered) - SAFE_CAST(ITEM(struct, "pattern_count"), BIGINT) — scalar (covered) - SAFE_CAST(ITEM(struct, "tokens"), MAP<VARCHAR,...>) — needs MAP entry - SAFE_CAST(ITEM(struct, "sample_logs"), ARRAY<VAR>) — needs ARRAY entry SAFE_CAST is already in STANDARD_PROJECT_OPS for SUPPORTED_FIELD_TYPES (scalar) returns. Add it to ARRAY_RETURNING_PROJECT_OPS and MAP_RETURNING_PROJECT_OPS so the OpenSearchProjectRule planner check admits the call when its inferred return type is array- or map-shaped. This is part of the PPL patterns command analytics-engine support stack. Effect on CalcitePPLPatternsIT: the 6 SAFE_CAST 'No backend supports' failures from milestone 3b shift to substrait-binding errors that need the INTERNAL_PATTERN window / aggregate UDFs to fully clear. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Add INTERNAL_PATTERN aggregate UDF (BRAIN aggregation mode) Implements the BRAIN aggregate side of PPL's `patterns ... method=BRAIN` on the analytics-engine route. Mirrors `LogPatternAggFunction` on the SQL plugin side: collects per-group log lines and runs BRAIN over the corpus at finalize time, emitting List<Struct<pattern, pattern_count, tokens, sample_logs>> so the downstream UNNEST + ITEM projections (PPL's `flattenParsedPattern`) resolve as named struct-field access. Wiring (top-down): - `AggregateFunction` (analytics-framework): adds `PATTERN` enum constant. Makes `fromNameOrError` case-insensitive — the PPL operator is registered lower-case ("pattern") whereas the enum constants are upper-case, and the raw `valueOf("pattern")` lookup was failing. - `DataFusionAnalyticsBackendPlugin`: declares `AggregateFunction.PATTERN` in `AGG_FUNCTIONS` so the capability registry advertises it as supported on the datafusion backend. - `DataFusionFragmentConvertor`: adds `LOCAL_INTERNAL_PATTERN_OP` (substrait-bound to `internal_pattern`) and an `ADDITIONAL_AGGREGATE_SIGS` entry — same pattern as `LOCAL_TAKE_OP` / `LOCAL_FIRST_OP`. - `PplAggregateCallRewriter`: rewrites PPL `pattern(...)` calls onto `LOCAL_INTERNAL_PATTERN_OP` and substitutes the call's return type with the concrete struct shape (PPL's declared `ARRAY<MAP<VARCHAR, ANY>>` has an embedded `ANY` that isthmus cannot serialize to Substrait). - `opensearch_aggregate_functions.yaml`: registers `internal_pattern` with 4-arg, 5-arg, 6-arg, and 1-arg (FINAL) overloads matching the PPL emitted call shapes (max_sample_count, buffer_limit, show_numbered_token, plus optional frequency_threshold_percentage and variable_count_threshold). - `udaf/internal_pattern.rs`: the actual UDAF — variadic_any signature, an Accumulator that collects log lines and runs `BrainLogParser` at evaluate() time, with per-shard state shaped as `List<Utf8>` so the coordinator's FINAL accumulator can concatenate via `merge_batch`. 3 unit tests pin behaviour (empty corpus, repeated-pattern grouping, cross-shard merge). Test impact (`CalcitePPLPatternsIT` via analytics-engine route): the three BRAIN aggregation-mode tests advance past the previous `AggregateFunction.pattern` enum lookup; they now surface a downstream "Project rule encountered unmarked child [LogicalCorrelate]" from the PPL Calcite path's UNNEST after the aggregate (separate planner work to add Correlate support, not covered by this commit). Window UDF for BRAIN label mode and the SAFE_CAST nullability fix for SIMPLE aggregation mode are also pending. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] FORCE_NULLABLE on LOCAL_TAKE_OP and LOCAL_ARRAY_AGG_OP PPL declares TAKE's return type via {@code PPLReturnTypes.ARG0_ARRAY} which passes {@code nullable=true} to {@code SqlTypeUtil.createArrayType}. The rewriter clones that nullable type onto the rewritten LOCAL_TAKE_OP call, but the operator's default {@code ReturnTypes.TO_ARRAY} infers NOT NULL (because aggregates over a non-empty group can't be null in standard SQL semantics). The mismatch trips {@code AggregateCall.create}'s validation with type mismatch: aggCall type: VARCHAR ARRAY inferred type: VARCHAR ARRAY NOT NULL Surface: CalcitePPLPatternsIT's SIMPLE aggregation tests — the Parse-then- Aggregate path applies TAKE to the source field, and the field's nullable typing flows into the aggregate's declared return type. The bug is general though — any nullable-input TAKE rewritten through the analytics-engine backend would have hit it as soon as the rewriter's explicit-type path fired. Fix: andThen FORCE_NULLABLE on both LOCAL_TAKE_OP and LOCAL_ARRAY_AGG_OP so the operator's inferred type matches what PPL emits. Mirror of the PPL side's ARG0_ARRAY. Test impact: CalcitePPLPatternsIT 3/15 → 4/15 (one SIMPLE aggregation test unblocked; the show_numbered_token variants still hit the separate "Unable to convert the type ANY" issue from PATTERN_PARSER's MAP<VARCHAR, ANY> return type). Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] PATTERN_PARSER struct return type + list<string> overload (partial) Wires up the PATTERN_PARSER ANY-type fixes that the SIMPLE pattern show_numbered_token tests need, but does NOT yet bring them green — the wrapping {@code map_extract} + {@code array_element} chain that ArrayElementAdapter created (when ITEM was lowered against the original PPL MAP<VARCHAR, ANY> declared type) keeps its frozen ANY return type even after PATTERN_PARSER is rewritten to STRUCT. Substrait's TypeConverter still rejects with "Unable to convert the type ANY" when it walks the operand types of the wrappers. Captures the necessary framework: - `opensearch_scalar_functions.yaml`: adds a second `pattern_parser(pattern: string, sample_logs: list<string>)` overload — the SIMPLE aggregation mode emits this call shape from the PPL Calcite visitor's `showNumberedToken=true` branch, and without this overload isthmus's ScalarFunctionConverter rejects the call earlier with "Unable to convert call pattern_parser(string?, list<string?>?)" before even getting to operand-type validation. - `PatternParserAdapter`: overrides `adapt` to substitute the PPL declared MAP<VARCHAR, ANY> return type with the concrete struct shape `STRUCT<pattern: VARCHAR, tokens: MAP<VARCHAR, ARRAY<VARCHAR>>>` that matches the Rust UDF's Arrow output. Same pattern as the PATTERN aggregate's PplAggregateCallRewriter case. - `ItemTypeRebuilder` (new): pre-isthmus shuttle that walks every Project / Filter expression tree. Rebuilds ITEM calls so their return type is re-derived from operand 0 (handles the legacy ITEM-on-STRUCT case for non-adapted plans), and substitutes any pattern_parser call whose declared type is still MAP<VARCHAR, ANY> with the concrete struct. Wired into both `convertToSubstrait` and `convertStandalone` so attached-wrapper paths (PARTIAL agg + FINAL agg conversion) get the same rewrite. Remaining gap (for follow-up): ArrayElementAdapter has already converted ITEM on MAP into `array_element(map_extract(map, key), 1)` with ANY-derived return types BEFORE PatternParserAdapter substitutes the inner struct type. The wrappers stay typed as ANY ARRAY / ANY, and isthmus rejects them. Two paths to close this: - SQL plugin: change PATTERN_PARSER's declared return type from MAP<VARCHAR, ANY> to a concrete struct (touches v2 path's result materialisation). - Backend: extend ItemTypeRebuilder to detect the `array_element(map_extract(STRUCT, key), 1)` anti-pattern and rewrite to a direct STRUCT field access. Test impact (CalcitePPLPatternsIT via analytics-engine route): no change from previous 4/15 pass count — the SIMPLE pattern show_numbered_token tests still hit the ANY-typed wrapper chain. The framework is in place; a follow-up commit closes the remaining gap. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Hoist struct-returning calls into child Project for FieldAccess Builds on the previous PATTERN_PARSER struct return-type rewrite. Two additions: 1. `tryUnwindArrayElementMapExtract`: detect the {@code array_element(map_extract(<STRUCT_call>, <literal_key>), 1)} chain that {@link ArrayElementAdapter} produces for ITEM-on-MAP and replace it with a direct {@code RexFieldAccess(STRUCT_call, field)}. Isthmus serialises field access natively via a Substrait {@code FieldReference} with {@code StructField} segment, bypassing the missing ITEM-on-struct extension binding. 2. `maybeHoistStructAccess`: isthmus' {@code RexExpressionConverter.visitFieldAccess} only handles refs whose reference expression is {@code RexInputRef}, {@code RexFieldAccess}, or {@code RexCorrelVariable} — function-call references fail with {@code "RexFieldAccess for SqlKind OTHER_FUNCTION not supported"}. Hoist each unique struct-returning call into a child Project so the outer project's FieldAccess references an input ref instead. Mirrors {@code OpenSearchProject#liftNestedRexOver} which hoists nested {@code RexOver} for the same isthmus compatibility reason. Test impact (CalcitePPLPatternsIT via analytics-engine route): 4/15 pass count holds. SIMPLE pattern show_numbered_token tests now make it past substrait conversion (no more {@code "Unable to convert the type ANY"} or {@code "RexFieldAccess for SqlKind OTHER_FUNCTION"}) but DataFusion's substrait consumer rejects nested field access with {@code "Direct reference StructField with child is not supported"}. Closing that gap requires either per-field scalar UDFs (e.g. {@code pattern_parser_get_pattern}, {@code pattern_parser_get_tokens}) that return scalar values directly, or a substrait emission tweak that splits struct field access across two relations. The hoist framework added here remains useful for both paths — it's the pre-requisite for any approach that needs to express "compute struct once, project its fields multiple times". Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Per-field scalar UDFs for PATTERN_PARSER (SIMPLE patterns end-to-end) Closes the SIMPLE pattern show_numbered_token path on the analytics-engine route. Replaces the previous "PATTERN_PARSER returns struct, downstream unwraps via FieldAccess + hoist" pipeline with two scalar UDFs that each return one field directly, eliminating struct-field access from the substrait plan entirely. Why the previous approach didn't work: - PPL declares PATTERN_PARSER's return type as MAP<VARCHAR, ANY> (the v2 path returns a heterogeneous Java map), and the embedded ANY breaks substrait serialization. - Rewriting the return type to a concrete STRUCT lets isthmus serialize pattern_parser itself, but downstream ITEM(STRUCT, key) has no substrait extension binding, and a RexFieldAccess workaround fails isthmus' "RexFieldAccess for SqlKind OTHER_FUNCTION not supported". - Hoisting the struct call into a child Project + a top-level FieldAccess(InputRef, field) gets past isthmus, but DataFusion's substrait consumer then rejects with "Direct reference StructField with child is not supported". Per-field UDF design: - `pattern_parser_get_pattern(pattern, field|sample_logs)` → Utf8 - `pattern_parser_get_tokens(pattern, field|sample_logs)` → Map<Utf8, List<Utf8>> Both share the same parse logic as the base `pattern_parser` UDF (which is kept registered for the BRAIN aggregate path that emits a struct-of- -fields via UNNEST). The new wrappers delegate to the base UDF and extract the requested field column from its StructArray output. Wiring: - `udf/pattern_parser.rs`: adds `PatternParserGetFieldUdf` parameterised by `PatternField::Pattern | Tokens`; registers both during `register_all`. Each wrapper returns a flat scalar Arrow array — no StructArray in the result. - `opensearch_scalar_functions.yaml`: signatures for both new UDFs in both call shapes (string+string for label mode, string+list<string> for aggregation-with-samples). - `DataFusionFragmentConvertor`: `LOCAL_PATTERN_PARSER_GET_*_OP` locally-declared operators + `ADDITIONAL_SCALAR_SIGS` entries so isthmus binds them by identity to the substrait extension names. - `ItemTypeRebuilder`: rewritten to detect the `array_element(map_extract(pattern_parser(args), 'pattern' | 'tokens'), 1)` chain (the legacy ITEM-on-MAP lowering ArrayElementAdapter produces) and replace it with a direct call to the appropriate per-field operator, preserving the original pattern_parser operand list. Struct field access never reaches substrait. - `PatternParserAdapter`: reverts the earlier MAP→STRUCT return-type substitution (no longer needed) and updates the class doc to point to ItemTypeRebuilder as the actual ANY-elimination step. Test impact (CalcitePPLPatternsIT via analytics-engine route): 4/15 → 8/15. Newly green: - testSimplePatternAggregationMode_ShowNumberedToken - testSimplePatternAggregationMode_WithGroupBy_ShowNumberedToken - testSimplePatternLabelMode_ShowNumberedToken - testSimplePatternLabelModeWithCustomPattern_ShowNumberedToken All SIMPLE pattern variants now pass. Remaining failures are all BRAIN (3 aggregation + 4 label/UUID), each blocked by separate larger items: LogicalCorrelate support in the analytics-engine planner for the aggregation UNNEST, and an INTERNAL_PATTERN window UDF for label mode. The companion SQL-plugin commit (Arrow Text → String via class-name check in {@code ExprValueUtils.fromObjectValue}) is required for the new green tests to deserialize the runtime result rows — see the sql repo's {@code verify/datetime-udt-singleton-fix} branch. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] WIP: BRAIN window UDF + dashboard query path scaffolding Lays down the structural pieces needed for PPL `patterns ... method=BRAIN mode=label` on the analytics-engine route, but stops short of crossing the type-cascade boundary required to ship the end-to-end test. Documented gap below. What's in place: Java planner: - `WindowFunction.PATTERN` enum + case-insensitive `fromName` fallback; `OpenSearchProjectRule.collectWindowFunctions` accepts OTHER-kind operators via name lookup so PPL's INTERNAL_PATTERN no longer trips the planner's "Window function [pattern] is not supported" guard at the capability check. - `DataFusionAnalyticsBackendPlugin` advertises `WindowFunction.PATTERN` in its window capability set. Backend plugin: - `LOCAL_INTERNAL_PATTERN_WINDOW_OP` locally-declared SqlAggFunction + `ADDITIONAL_WINDOW_SIGS` so the WindowFunctionConverter binds the operator by identity to the substrait extension URN. - `opensearch_window_functions.yaml` declares the substrait variant for `internal_pattern` with `window_type: PARTITION` and `return: string?`. - `DataFusionPlugin.loadSubstraitExtensions` merges the new YAML into the extension catalog. - `PplWindowCallRewriter` skeleton (no-op for now — see gap below). Rust: - New `udwf/internal_pattern.rs` module + registration through `session_context::create_session_context`. PartitionEvaluator buffers the field column for the partition, runs BrainLogParser once over the corpus, then per-row matches via eval_agg and emits the wildcard pattern string as Utf8 per row. Two unit tests pin the shape contract. - `udwf/mod.rs` aggregator + `lib.rs` `pub mod udwf;`. ItemTypeRebuilder: - Detects the raw `ITEM(pattern_parser(...), 'key')` shape in addition to the post-ArrayElementAdapter `array_element(map_extract(...), 1)` form. The label-mode chain never goes through ArrayElementAdapter because the pattern_parser's container type still reads as MAP<VARCHAR, ANY> at adapter time, so the raw ITEM shape is the one that reaches the convertor. - Handles the 3-arg `pattern_parser(source, window_result, show_token)` shape PPL emits for BRAIN label mode: 'pattern' key bypasses the parser entirely (returns the window result directly); 'tokens' key routes to the existing 2-arg `pattern_parser_get_tokens(window_result, source)`. Test scaffold: - `CalcitePPLDashboardPatternsIT` pins the canonical Dashboards query shape: `patterns ... method=BRAIN mode=label | stats count(), take(message, 1) by patterns_field | sort -pattern_count`. Currently red. The gap — RexInputRef type cascade after RexOver substitution: When the BRAIN label expression compiles, `OpenSearchProject#liftNestedRexOver` hoists the `pattern(...) OVER ()` call into a child Project so DataFusion's substrait consumer auto-lifts it into a `LogicalWindow`. The parent Project's {@code PATTERN_PARSER(field, $hoistedCol, showToken)} call captures a {@code RexInputRef($hoistedCol)} whose Calcite-frozen type is the child's row-type entry for that column — currently {@code MAP<VARCHAR, ANY> ARRAY} (PPL's declared INTERNAL_PATTERN return). Rewriting the inner RexOver to {@code LOCAL_INTERNAL_PATTERN_WINDOW_OP} with VARCHAR return type cascades cleanly through the child Project's expressions, but the parent's frozen {@code RexInputRef} type doesn't update — Calcite flags it with {@code "type mismatch: ref: MAP NOT NULL ARRAY, input: VARCHAR"} at the next Aggregate/Sort construction. Closing the gap requires RelNode-level coordination: rewrite the inner Project's row type, then walk every downstream Project / Aggregate that references the hoisted column and rebuild their RexInputRefs with the new VARCHAR type. That's a multi-RelNode rewriter (similar in spirit to `maybeHoistStructAccess` in ItemTypeRebuilder but spanning a wider rel window). Left for the next milestone — the no-op stub in {@link PplWindowCallRewriter#WindowRewriteShuttle#visitOver} preserves the existing planner-level capability check pass-through so the failure surfaces as the substrait-side "Unable to convert the type ANY" rather than the earlier "Window function [pattern] is not supported". Test impact (`CalcitePPLDashboardPatternsIT`, `CalcitePPLPatternsIT` via analytics-engine route): no change from the 8/15 baseline established by the per-field PATTERN_PARSER UDF work. BRAIN tests (3 agg + 4 label/UUID) remain red. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Engine] BRAIN label-mode window-UDF cascade rewrite + STATE_EXPANDING split skip Lands the BRAIN label-mode path end-to-end through the analytics-engine route. Two coordinated changes: 1. `PplWindowCallRewriter` — full RelNode-level rewriter Replaces the prior no-op stub with a recursive bottom-up rewriter that swaps PPL's INTERNAL_PATTERN window operator onto our local `LOCAL_INTERNAL_PATTERN_WINDOW_OP` stub (returning {@code VARCHAR}) and cascades the resulting column type change through every downstream RelNode's {@link RexInputRef} and wrapping {@link RexCall}. The cascade was the non-obvious piece. `OpenSearchProject#liftNestedRexOver` hoists the {@code pattern(...) OVER ()} call into its own child Project before fragment conversion, so the parent Project's {@code PATTERN_PARSER} call references the hoisted column via a {@link RexInputRef} whose declared type is frozen at construction to the original {@code MAP<VARCHAR, ANY> ARRAY}. Substituting the inner RexOver's return type to VARCHAR makes the child Project's row type at that slot mismatch the parent's frozen RexInputRef type — Calcite's {@code Aggregate.typeMatchesInferred} (and the analogous check on {@code Project.<init>}) rejects with {@code "type mismatch: ref: MAP NOT NULL ARRAY, input: VARCHAR"} on the next RelNode construction. Closing the cascade required: - A custom bottom-up shuttle (`descendAndRewrite`) that visits children first then retypes the current RelNode's expressions *before* calling {@code copy()}, avoiding {@code RelShuttleImpl.visitChildren}'s eager copy that would otherwise validate against stale RexInputRefs. - {@code rewriteProject} early-returns the unchanged original Project only when both {@code exprChanged=false} AND {@code childChanged=false} — not just {@code inputRowTypeChanged=false}. Discovered via a per-recursion hash trace that showed the top Project returning the original (hash 441987200 → 441987200) and silently discarding all descendant rebuilds because its own expressions hadn't changed. - {@code visitCall} uses the 3-arg {@link RexBuilder#makeCall} form (preserving the call's declared return type) rather than the 2-arg form. The 2-arg form re-invokes operator-specific inference like {@code SqlCastFunction}'s {@code lambda$returnTypeInference$0} which expects an operand at index 1 that doesn't exist for SAFE_CAST as constructed via {@code RexBuilder.makeCast}. - {@code ExprRewriter#visitInputRef} retypes references whose declared type diverges from the current input row type — propagates the VARCHAR substitution upward through every dependent RelNode. 2. `OpenSearchAggregateSplitRule` — skip PARTIAL/FINAL split for STATE_EXPANDING Surfaced by the dashboard query (`patterns ... method=BRAIN mode=label | stats count() as pattern_count, take(message, 1) as sample_logs by patterns_field | sort -pattern_count`). The split rule constructs both {@code SINGLE-on-SINGLETON} and {@code PARTIAL+FINAL} alternatives. The FINAL alternative reuses the original aggregate's argList — pointing into original input column indices ({@code TAKE($1=content, $2=N_literal)}). But FINAL's input is the PARTIAL output where those positions hold STATE columns of different types ({@code count_state BIGINT}, {@code take_state ARRAY<VARCHAR>}). The original TAKE operator's {@code ARG0_ARRAY} inference on the shifted arg 0 gives {@code BIGINT ARRAY} while the declared output is {@code VARCHAR ARRAY} — {@code Aggregate.typeMatchesInferred} throws before {@code DistributedAggregateRewriter} (which DOES remap correctly) can run. Extend {@code shouldSkipPartialFinalSplit} to skip when any aggregate is STATE_EXPANDING. On a 1-shard cluster the SINGLE-on-SINGLETON alternative wins anyway, so we lose no parallelism in single-shard tests; on multi-shard the trade-off is correctness over potential parallelism for those aggs until per-phase operator overloads or split-time remapping ships. Test impact (`CalcitePPLPatternsIT` via analytics-engine route): Before this commit: 8/15 After this commit: 10/15 Newly green: - testBrainLabelMode_NotShowNumberedToken - testBrainParseWithUUID_NotShowNumberedToken Also newly green: - `CalcitePPLDashboardPatternsIT.testDashboardBrainLabelStatsByPatternsField` (the canonical Dashboards BRAIN-label pattern panel query) Remaining 5/15 failures: - testBrainLabelMode_ShowNumberedToken — needs the tokens-map extraction path through the per-field {@code pattern_parser_get_tokens} call (uses the 2-arg shape; should be reachable but currently surfaces a separate substrait type issue). - testBrainParseWithUUID_ShowNumberedToken — same. - 3 testBrainAggregationMode_* — separate planner gap (`LogicalCorrelate` after UNNEST). Not blocked by this commit. Signed-off-by: Kai Huang <ahkcs@amazon.com> * ci: retrigger gradle-check (previous run returned null result after 2h timeout) Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Engine] Recursively normalize nested Arrow Text in ArrowValues ArrowValues.toJavaValue was the boundary that converts Arrow vectors to plain Java values handed to ExprValueUtils, but it only stripped Arrow's `Text` wrapper at one level — top-level VarCharVector cells, MapVector entry values, and ListVector elements. Nested shapes (e.g. the BRAIN patterns `tokens` column, which is Map<String, List<String>>) escaped with raw `Text` wrappers inside the inner list, because MapVector's value-extraction path didn't recurse into the list's elements. ExprValueUtils.fromObjectValue then rejected those raw Text objects with "unsupported object class". Introduce a small private `normalize(Object)` helper that recurses through Lists, Maps, and Text wrappers, so containers of any depth land on the SQL side as pure Java types. This keeps the Arrow→Java conversion responsibility where it belongs (the analytics-engine result materialization) instead of leaking Arrow type knowledge into the SQL plugin's generic value converter. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] Append "g" flag in RegexpReplaceAdapter for 3-arg calls DataFusion's regexp_replace defaults to first-match-only without an explicit flag; Calcite's REGEXP_REPLACE_3 is already replace-all. PPL relies on the Calcite contract (every match replaced — used by SIMPLE patterns, regex_replace, rex mode=sed), so on the DataFusion path the adapter now rewrites every 3-arg REGEXP_REPLACE_3 to 4-arg REGEXP_REPLACE_PG_4(..., "g") preserving the same end-user semantics across backends. Companion change in opensearch-project/sql#5467: the SQL core no longer emits the 'g' flag itself — that DataFusion-specific concern now lives only here. Two existing unit tests updated to expect the new always-global behavior (testAdaptPassesThroughWhenNoQuoteBlock → testAdaptAppendsGlobalFlagFor3Arg, testAdaptPassesThroughNonLiteralPattern → testAdaptAppendsGlobalFlagForNonLiteralPattern). The \Q rewrite, $N backreference rewrite, and 4-arg-with-flags pass-through are unchanged. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Backend / DataFusion] spotlessApply Signed-off-by: Kai Huang <ahkcs@amazon.com> * [QA] Add PatternsCommandIT — multi-shard coverage for STATE_EXPANDING aggregates Per @marc's review: the OpenSearchAggregateSplitRule SINGLE-on-SINGLETON fallback (when STATE_EXPANDING aggs like take/list/values are present) only meaningfully fires on a multi-shard index — on single-shard, no split ever happens. CoordinatorReduceIT covers no-group-by take/list/values on 2 shards, but the dashboard query shape (take(field) by group-key) on multi-shard had no coverage. Adds three tests against the existing app_logs dataset: - testSimplePatternLabelMode — single-shard, asserts every row gets tokenized (every patterns_field contains "<*>"), validates the RegexpReplaceAdapter "g" flag append. - testSimplePatternAggregationModeMultiShard — 3-shard, exercises the OpenSearchAggregateSplitRule SINGLE-on-SINGLETON fallback. Asserts pattern_count totals sum to the dataset row count (200) and every group's sample_logs array is non-empty. - testSimplePatternAggregationGroupByServiceMultiShard — 3-shard with explicit `by patterns_field, service_name` group key (cardinality > 1), drives the same fallback path with `take(message, 1)` as the aggregate. BRAIN coverage stays on CalcitePPLDashboardPatternsIT (SQL side); BRAIN aggregation has open LogicalCorrelate post-aggregate work tracked separately. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [QA] PatternsCommandIT: pin number_of_shards=3 in multi-shard fixture Self-check inside ensureMultiShardProvisioned() reads GET /<index>/_settings and asserts the index settings report number_of_shards=3. Makes the multi-shard nature of the aggregation tests provable from the test itself rather than implicit in the DatasetProvisioner.provision() call. Per @marc's review request to verify the multi-shard claim. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Framework] Add WindowFunction.resolveFunction(SqlOperator) Per @mch2's nit: pull the kind→name fallback out of OpenSearchProjectRule into a single WindowFunction.resolveFunction(SqlOperator) so future backends pick up the OTHER-kind name-lookup logic without copy-paste. Three unit tests for the new method cover sql-kind hit, OTHER-kind name fallback (PATTERN), and unknown-operator returning null. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Engine] Correct shouldSkipPartialFinalSplit comment Per @mch2's review: the prior wording implied SINGLE-on-SINGLETON was "the only viable choice on a single shard anyway", which is wrong now that PatternsCommandIT exercises this path on a 3-shard index. Restate: SINGLE-on-SINGLETON also runs correctly on multi-shard (gather to the coordinator, then aggregate), it just trades distributed parallelism for the type-mismatch workaround. Distributed parallelism is still the follow-up. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Engine] Register PERCENTILE_APPROX in AggregateFunction, drop ad-hoc check Per @mch2's review: PERCENTILE_APPROX is semantically a STATE_EXPANDING aggregate (per-key t-digest state grows with input cardinality), exactly like PERCENTILE_CONT and PERCENTILE_DISC which are already in the enum. The "right spot" for it is the AggregateFunction registry, not an ad-hoc string check in OpenSearchAggregateSplitRule. Changes: - AggregateFunction.PERCENTILE_APPROX(Type.STATE_EXPANDING, SqlKind.OTHER). - OpenSearchAggregateSplitRule: replace the isPercentileApprox + separate STATE_EXPANDING checks with a single isStateExpanding(SqlAggFunction) helper that handles the unregistered-op throw gracefully (returns false rather than crashing the planner — fixes a latent issue where my earlier STATE_EXPANDING addition would have crashed on truly unknown aggs). - Javadoc refreshed to describe the STATE_EXPANDING category rather than calling out percentile_approx as a one-off. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Engine] Trim verbose comments Drop AI-generated explanatory blocks; keep terse WHY-only notes where context isn't obvious from the code itself. Signed-off-by: Kai Huang <ahkcs@amazon.com> * [Analytics Framework] Restore javadocs for Type / IntermediateTypeResolver Required by sandbox-check's missingJavadoc task on public types. Signed-off-by: Kai Huang <ahkcs@amazon.com> --------- Signed-off-by: Kai Huang <ahkcs@amazon.com>
What
Wires PPL
patterns(BRAIN + SIMPLE methods, label + aggregation modes) through the analytics-engine route, including this OpenSearch Dashboards query:Companion to opensearch-project/OpenSearch#21797 (DataFusion UDFs + analytics-engine planner work + DataFusion
regexp_replace'g'flag inRegexpReplaceAdapter). Both PRs are required.Changes
feat(api): addPATTERN_*defaults toUnifiedQueryContextpatternsqueries that omit explicitmethod=/mode=died onPatternMethod.valueOf("NULL")inAstBuilder.visitPatternsCommandbefore reaching any capability check.test(api):UnifiedQueryPlannerTest.testPPLPatternsPicksUpDefaultspatterns <field>produces a SIMPLE / LABEL RelNode withREGEXP_REPLACE+ apatterns_fieldprojection.test(integ-test):CalcitePPLDashboardPatternsIT+ suite inclusionverifyDataRowsagainstHDFS_LOGS. Added toCalciteNoPushdownITso the no-pushdown variant runs too.Results
CalcitePPLDashboardPatternsIT(new)CalcitePPLPatternsIT(analytics-engine)CalcitePPLPatternsIT(Calcite/V2)Verified on all four configurations: Calcite/V2 (pushdown on + off) and analytics-engine route (pushdown on + off).
Remaining 5 / 15 patterns failures — analytics-engine-route-only, tracked separately:
BrainAggregationMode_*→ planner needsLogicalCorrelatepost-aggregate.Brain*Mode_ShowNumberedToken→ substrait type wiring for_ShowNumberedToken.