Skip to content

feat(ppl): wire patterns command for analytics-engine dashboard route#5467

Merged
ahkcs merged 15 commits into
opensearch-project:mainfrom
ahkcs:feat/ppl-patterns-analytics-engine
May 27, 2026
Merged

feat(ppl): wire patterns command for analytics-engine dashboard route#5467
ahkcs merged 15 commits into
opensearch-project:mainfrom
ahkcs:feat/ppl-patterns-analytics-engine

Conversation

@ahkcs
Copy link
Copy Markdown
Collaborator

@ahkcs ahkcs commented May 26, 2026

What

Wires PPL patterns (BRAIN + SIMPLE methods, label + aggregation modes) through the analytics-engine route, including this OpenSearch Dashboards query:

source = test_logs_parquet
| where `@timestamp` >= '...' and `@timestamp` <= '...'
| patterns message method=brain mode=label
| stats count() as pattern_count, take(message, 1) as sample_logs by patterns_field
| sort - pattern_count
| fields patterns_field, pattern_count, sample_logs

Companion to opensearch-project/OpenSearch#21797 (DataFusion UDFs + analytics-engine planner work + DataFusion regexp_replace 'g' flag in RegexpReplaceAdapter). Both PRs are required.

Changes

Commit Why
feat(api): add PATTERN_* defaults to UnifiedQueryContext Without these, patterns queries that omit explicit method=/mode= died on PatternMethod.valueOf("NULL") in AstBuilder.visitPatternsCommand before reaching any capability check.
test(api): UnifiedQueryPlannerTest.testPPLPatternsPicksUpDefaults Pins that the PATTERN_* defaults flow through AstBuilder so a bare patterns <field> produces a SIMPLE / LABEL RelNode with REGEXP_REPLACE + a patterns_field projection.
test(integ-test): CalcitePPLDashboardPatternsIT + suite inclusion Pins the Dashboards BRAIN-label panel query end-to-end with verifyDataRows against HDFS_LOGS. Added to CalciteNoPushdownIT so the no-pushdown variant runs too.

Results

Suite Before After
CalcitePPLDashboardPatternsIT (new) 1 / 1
CalcitePPLPatternsIT (analytics-engine) 3 / 15 10 / 15
CalcitePPLPatternsIT (Calcite/V2) 15 / 15 15 / 15

Verified on all four configurations: Calcite/V2 (pushdown on + off) and analytics-engine route (pushdown on + off).

Remaining 5 / 15 patterns failures — analytics-engine-route-only, tracked separately:

  • 3 × BrainAggregationMode_* → planner needs LogicalCorrelate post-aggregate.
  • 2 × Brain*Mode_ShowNumberedToken → substrait type wiring for _ShowNumberedToken.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 26, 2026

PR Reviewer Guide 🔍

(Review updated until commit 42799d4)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The settings map is initialized with a HashMap wrapping Map.ofEntries(), but Map.ofEntries() already returns an immutable map. If code later attempts to modify this map via setting(String, Object) or similar methods, it will fail with UnsupportedOperationException. The old code used Map.of() directly in the HashMap constructor, which also creates an immutable map, so this issue likely existed before. However, if the intent is to allow runtime modifications (as suggested by the setting(String, Object) method mentioned in the comment), the map should be initialized as new HashMap<>() followed by putAll() or individual put() calls.

new HashMap<Settings.Key, Object>(
    Map.ofEntries(
        Map.entry(QUERY_SIZE_LIMIT, SysLimit.DEFAULT.querySizeLimit()),
        Map.entry(PPL_SUBSEARCH_MAXOUT, SysLimit.UNLIMITED_SUBSEARCH.subsearchLimit()),
        Map.entry(
            PPL_JOIN_SUBSEARCH_MAXOUT, SysLimit.UNLIMITED_SUBSEARCH.joinSubsearchLimit()),
        Map.entry(CALCITE_ENGINE_ENABLED, true),
        Map.entry(PPL_REX_MAX_MATCH_LIMIT, 10),
        Map.entry(PATTERN_METHOD, "SIMPLE_PATTERN"),
        Map.entry(PATTERN_MODE, "LABEL"),
        Map.entry(PATTERN_MAX_SAMPLE_COUNT, 10),
        Map.entry(PATTERN_BUFFER_LIMIT, 100000),
        Map.entry(PATTERN_SHOW_NUMBERED_TOKEN, false)));

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 26, 2026

PR Code Suggestions ✨

Latest suggestions up to 42799d4

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Verify pattern method default value

The PATTERN_METHOD default value "SIMPLE_PATTERN" appears inconsistent with the test
expectation. The test testDashboardBrainLabelStatsByPatternsField uses method=BRAIN
mode=label, and the comment in testPPLPatternsPicksUpDefaults mentions "SIMPLE /
LABEL mode". Consider verifying the correct default value matches the expected
pattern method enum.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [160]

 private final Map<Settings.Key, Object> settings =
     new HashMap<Settings.Key, Object>(
         Map.ofEntries(
             Map.entry(QUERY_SIZE_LIMIT, SysLimit.DEFAULT.querySizeLimit()),
             Map.entry(PPL_SUBSEARCH_MAXOUT, SysLimit.UNLIMITED_SUBSEARCH.subsearchLimit()),
             Map.entry(
                 PPL_JOIN_SUBSEARCH_MAXOUT, SysLimit.UNLIMITED_SUBSEARCH.joinSubsearchLimit()),
             Map.entry(CALCITE_ENGINE_ENABLED, true),
             Map.entry(PPL_REX_MAX_MATCH_LIMIT, 10),
-            Map.entry(PATTERN_METHOD, "SIMPLE_PATTERN"),
+            Map.entry(PATTERN_METHOD, "SIMPLE"),
             Map.entry(PATTERN_MODE, "LABEL"),
             Map.entry(PATTERN_MAX_SAMPLE_COUNT, 10),
             Map.entry(PATTERN_BUFFER_LIMIT, 100000),
             Map.entry(PATTERN_SHOW_NUMBERED_TOKEN, false)));
Suggestion importance[1-10]: 3

__

Why: The suggestion raises a valid concern about the PATTERN_METHOD default value being "SIMPLE_PATTERN" vs "SIMPLE". However, the test testPPLPatternsPicksUpDefaults doesn't explicitly verify the method value, and testDashboardBrainLabelStatsByPatternsIT explicitly sets method=BRAIN, so there's no direct contradiction. The comment mentions "SIMPLE / LABEL mode" which could refer to "SIMPLE_PATTERN". Without seeing the enum definition or more context, this is a minor verification request rather than a clear bug.

Low

Previous suggestions

Suggestions up to commit b830582
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix pattern method enum value

The PATTERN_METHOD default value "SIMPLE_PATTERN" doesn't match the enum constant
name used in the codebase. Based on the test case expecting "SIMPLE" mode and
typical enum naming conventions, this should be "SIMPLE" instead of "SIMPLE_PATTERN"
to avoid IllegalArgumentException when PatternMethod.valueOf() is called.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [160]

 private final Map<Settings.Key, Object> settings =
     new HashMap<Settings.Key, Object>(
         Map.ofEntries(
             Map.entry(QUERY_SIZE_LIMIT, SysLimit.DEFAULT.querySizeLimit()),
             Map.entry(PPL_SUBSEARCH_MAXOUT, SysLimit.UNLIMITED_SUBSEARCH.subsearchLimit()),
             Map.entry(
                 PPL_JOIN_SUBSEARCH_MAXOUT, SysLimit.UNLIMITED_SUBSEARCH.joinSubsearchLimit()),
             Map.entry(CALCITE_ENGINE_ENABLED, true),
             Map.entry(PPL_REX_MAX_MATCH_LIMIT, 10),
-            Map.entry(PATTERN_METHOD, "SIMPLE_PATTERN"),
+            Map.entry(PATTERN_METHOD, "SIMPLE"),
             Map.entry(PATTERN_MODE, "LABEL"),
             Map.entry(PATTERN_MAX_SAMPLE_COUNT, 10),
             Map.entry(PATTERN_BUFFER_LIMIT, 100000),
             Map.entry(PATTERN_SHOW_NUMBERED_TOKEN, false)));
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a critical bug where PATTERN_METHOD is set to "SIMPLE_PATTERN" instead of "SIMPLE". This would cause PatternMethod.valueOf() to throw an IllegalArgumentException at runtime. The test in UnifiedQueryPlannerTest.java confirms that patterns should work with defaults, making this a high-priority fix.

High
Suggestions up to commit 975f8cc
CategorySuggestion                                                                                                                                    Impact
General
Align default pattern method with usage

The default value for PATTERN_METHOD is set to "SIMPLE_PATTERN", but the integration
test uses "BRAIN" mode. This mismatch could cause unexpected behavior when users
rely on defaults. Consider aligning the default with the most commonly used pattern
method or documenting this discrepancy clearly.

api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java [160-164]

-Map.entry(PATTERN_METHOD, "SIMPLE_PATTERN"),
+Map.entry(PATTERN_METHOD, "BRAIN"),
 Map.entry(PATTERN_MODE, "LABEL"),
 Map.entry(PATTERN_MAX_SAMPLE_COUNT, 10),
 Map.entry(PATTERN_BUFFER_LIMIT, 100000),
 Map.entry(PATTERN_SHOW_NUMBERED_TOKEN, false)
Suggestion importance[1-10]: 3

__

Why: While the suggestion identifies a discrepancy between the default PATTERN_METHOD value ("SIMPLE_PATTERN") and the test usage ("BRAIN"), this is not necessarily a bug. Integration tests often test non-default configurations. The default value may be intentionally set to "SIMPLE_PATTERN" for performance or other reasons. Without evidence that this causes actual issues, this is a minor observation about potential documentation needs rather than a critical fix.

Low
Suggestions up to commit 172916b
CategorySuggestion                                                                                                                                    Impact
General
Avoid unnecessary 4-arg to 3-arg conversion

The code creates a 4-arg REGEXP_REPLACE_PG_4 call but RexStandardizer later
collapses it back to 3-arg when the flag is "g". This round-trip is unnecessary.
Consider directly using REGEXP_REPLACE_3 here to avoid the extra transformation
step.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [4304-4312]

 if (ParseMethod.PATTERNS.equals(parseMethod)) {
-    // Emit 4-arg REGEXP_REPLACE_PG_4 with "g" so DataFusion's regexp_replace
-    // (first-match-only by default) replaces every match.
-    RexNode globalFlag =
-        context.rexBuilder.makeLiteral(
-            "g", context.rexBuilder.getTypeFactory().createSqlType(SqlTypeName.VARCHAR), true);
+    // Use 3-arg REGEXP_REPLACE_3 directly since it already does replace-all by default
     RexNode innerRex =
         context.rexBuilder.makeCall(
-            SqlLibraryOperators.REGEXP_REPLACE_PG_4, ArrayUtils.add(rexNodeList, globalFlag));
+            SqlLibraryOperators.REGEXP_REPLACE_3, rexNodeList);
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that creating REGEXP_REPLACE_PG_4 only to collapse it back to REGEXP_REPLACE_3 in RexStandardizer is inefficient. However, the architectural decision to use 4-arg may be intentional for consistency with DataFusion's API or for clarity in expressing the "global replace" intent. The optimization is valid but represents a moderate improvement rather than fixing a critical issue.

Medium
Suggestions up to commit f952e2b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Avoid mutating shared list in loop

The code uses ArrayUtils.add() to append the global flag, but rexNodeList is reused
in the loop for each groupCandidate. This creates a shared mutable list that grows
with each iteration, causing incorrect behavior. Create a new list for each
iteration instead.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [4304-4312]

 if (ParseMethod.PATTERNS.equals(parseMethod)) {
     // Emit 4-arg REGEXP_REPLACE_PG_4 with "g" so DataFusion's regexp_replace
     // (first-match-only by default) replaces every match.
     RexNode globalFlag =
         context.rexBuilder.makeLiteral(
             "g", context.rexBuilder.getTypeFactory().createSqlType(SqlTypeName.VARCHAR), true);
+    List<RexNode> rexNodesWithFlag = new ArrayList<>(rexNodeList);
+    rexNodesWithFlag.add(globalFlag);
     RexNode innerRex =
         context.rexBuilder.makeCall(
-            SqlLibraryOperators.REGEXP_REPLACE_PG_4, ArrayUtils.add(rexNodeList, globalFlag));
+            SqlLibraryOperators.REGEXP_REPLACE_PG_4, rexNodesWithFlag);
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a critical bug where ArrayUtils.add(rexNodeList, globalFlag) is called inside a loop over groupCandidates. Since ArrayUtils.add() returns a new array but rexNodeList is reused across iterations, this could lead to incorrect behavior if multiple group candidates exist. Creating a new list per iteration prevents this mutation issue.

High
Suggestions up to commit edc98f6
CategorySuggestion                                                                                                                                    Impact
General
Prevent potential list mutation issues

The code uses ArrayUtils.add(rexNodeList, globalFlag) which modifies or creates a
new list. Verify that rexNodeList is not reused elsewhere in the loop, as this could
lead to unintended side effects or incorrect behavior when processing multiple
groupCandidates.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [4304-4312]

 if (ParseMethod.PATTERNS.equals(parseMethod)) {
     // Emit 4-arg REGEXP_REPLACE_PG_4 with "g" so DataFusion's regexp_replace
     // (first-match-only by default) replaces every match.
     RexNode globalFlag =
         context.rexBuilder.makeLiteral(
             "g", context.rexBuilder.getTypeFactory().createSqlType(SqlTypeName.VARCHAR), true);
+    List<RexNode> extendedList = new ArrayList<>(rexNodeList);
+    extendedList.add(globalFlag);
     RexNode innerRex =
         context.rexBuilder.makeCall(
-            SqlLibraryOperators.REGEXP_REPLACE_PG_4, ArrayUtils.add(rexNodeList, globalFlag));
+            SqlLibraryOperators.REGEXP_REPLACE_PG_4, extendedList);
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential issue where ArrayUtils.add(rexNodeList, globalFlag) could cause unintended side effects if rexNodeList is reused in the loop. Creating a defensive copy with new ArrayList<>(rexNodeList) is a safer approach that prevents mutation of the original list, though the actual impact depends on whether rexNodeList is indeed reused across iterations.

Medium

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit ff989ea

@ahkcs ahkcs force-pushed the feat/ppl-patterns-analytics-engine branch from ff989ea to 12e0524 Compare May 26, 2026 17:31
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 12e0524

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 26, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 172916b.

PathLineSeverityDescription
integ-test/src/test/resources/expectedOutput/calcite/explain_patterns_simple_pattern_agg_push.yaml1mediumThe updated expected-output YAML embeds a hardcoded numeric timestamp value (utcTimestamp:1779836363879052000) inside a serialized Base64/JSON script payload sent to OpenSearch. The old value was 0; the new value is a specific epoch nanosecond. While this may be a snapshot of a real test run, embedding a real timestamp in a static test fixture is anomalous — test fixtures should be deterministic. If this timestamp is derived from runtime state it could indicate the fixture was generated by running against a live cluster in a way that leaks temporal metadata. Maintainers should verify how this value was generated and whether it could vary across environments.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 394cb5a

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 8720ca7

Comment on lines +160 to +165
// PPL `patterns` command defaults — mirror the cluster-side defaults registered in
// OpenSearchSettings (DEFAULT_PATTERN_METHOD_SETTING etc.). Without these the
// analytics-engine path's AstBuilder.visitPatternsCommand reads null from
// `settings.getSettingValue(Key.PATTERN_METHOD)`, fails with
// `PatternMethod.valueOf("NULL")` IllegalArgumentException, and every query that
// omits an explicit `method=` / `mode=` argument is rejected.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete all these "AI" comments.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, trimmed verbose comments

Comment on lines +4306 to +4311
// Emit `regexp_replace(field, pattern, replacement, "g")` directly so the replacement
// is global (every match replaced). DataFusion's `regexp_replace` defaults to FIRST
// match only without the "g" flag — using the 3-arg form via the REPLACE handler
// produces `<*>@pyrami.com` instead of `<*>@<*>.<*>` on the analytics-engine route.
// Calcite's REGEXP_REPLACE_PG_4 with "g" matches what `replaceAll` does, so V2 /
// Calcite-path semantics are preserved.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, PPL use regex_replace(field, pattern, replacement) without 'g' , right?
the propsoal is add 'g' paramater? It is does not change PPL behavoiur, right, becuase default is 'g'?

Copy link
Copy Markdown
Collaborator Author

@ahkcs ahkcs May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, PPL use regex_replace(field, pattern, replacement) without 'g' , right?

Yes

Calcite's 3-arg REGEXP_REPLACE is already replace-all, however data fusion's REGEXP_REPLACE replaces only the first match, which causes mismatch. My proposal is to add this g parameter to make the behavior replace-all for both engines.

So our PPL behavior would not change because we will later collapse the 4-arg to 3-arg in Calcite-path in RexStandardizer.java

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After taking another look, this approach can only cover the pushdown path, the no-pushdown path will be broken because of our 4-arg change

Updated the approach, the 'g' flag is now appended in OpenSearch's RegexpReplaceAdapter only when sending to DataFusion

Comment on lines +77 to +97
// PPL's `patterns` lowering emits 4-arg `REGEXP_REPLACE_PG_4(field, pattern, replacement,
// 'g')` so the analytics-engine/DataFusion path gets global replacement (DataFusion's
// `regexp_replace` defaults to first-match-only without an explicit flag). Calcite's
// 3-arg `REGEXP_REPLACE_3` is already replace-all in its enumerable runtime, but the PG_4
// form has no matching `SqlFunctions.regexpReplace(String, String, String, String)` impl
// — Calcite's runtime only ships the `(String, String, String, int[, ...])` shapes (the
// 4-arg variant treats the 4th arg as start-position, not a flags string). The script
// pushdown path codegen would fail with `No applicable constructor/method found`. Collapse
// the 4-arg call to the 3-arg form whenever the flags literal is exactly `"g"`, which
// preserves replace-all semantics on the V2 / Calcite-pushdown side.
if (call.op == SqlLibraryOperators.REGEXP_REPLACE_PG_4
&& call.operands.size() == 4
&& call.operands.get(3) instanceof RexLiteral flagsLit
&& "g".equals(flagsLit.getValueAs(String.class))) {
return helper.rexBuilder
.makeCall(
call.getType(),
SqlLibraryOperators.REGEXP_REPLACE_3,
call.operands.subList(0, 3))
.accept(this, helper);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only this function need sepecial handling?

Copy link
Copy Markdown
Collaborator Author

@ahkcs ahkcs May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose is specifically to let the 4-arg REGEXP_REPLACE survive the pushdown path.

3-arg has a matching SqlFunctions.regexpReplace(String, String, String) Java method — Janino binds, pushdown works. 4-arg-with-flags doesn't: Calcite's SqlFunctions ships regexpReplace(String, String, String, int) (position) but no (String, String, String, String) (flags-as-string) variant. The operator is registered, the Java impl is missing — that's the gap we rewrite around. We collapse the 4-arg to 3-arg in this case

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After taking another look, this approach can only cover the pushdown path, the no-pushdown path will be broken because of our 4-arg change

Updated the approach, the g flag is now appended in OpenSearch's RegexpReplaceAdapter only when sending to DataFusion

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit c42fb7c

@ahkcs ahkcs added the enhancement New feature or request label May 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 26b2d69

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 172916b

ahkcs added a commit to ahkcs/OpenSearch that referenced this pull request May 26, 2026
…ter for 3-arg calls

DataFusion's regexp_replace defaults to first-match-only without an explicit
flag; Calcite's REGEXP_REPLACE_3 is already replace-all. PPL relies on the
Calcite contract (every match replaced — used by SIMPLE patterns, regex_replace,
rex mode=sed), so on the DataFusion path the adapter now rewrites every 3-arg
REGEXP_REPLACE_3 to 4-arg REGEXP_REPLACE_PG_4(..., "g") preserving the same
end-user semantics across backends.

Companion change in opensearch-project/sql#5467: the SQL core no longer emits
the 'g' flag itself — that DataFusion-specific concern now lives only here.

Two existing unit tests updated to expect the new always-global behavior
(testAdaptPassesThroughWhenNoQuoteBlock → testAdaptAppendsGlobalFlagFor3Arg,
testAdaptPassesThroughNonLiteralPattern → testAdaptAppendsGlobalFlagForNonLiteralPattern).
The \Q rewrite, $N backreference rewrite, and 4-arg-with-flags pass-through
are unchanged.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 975f8cc

dai-chen
dai-chen previously approved these changes May 27, 2026
Comment thread api/src/main/java/org/opensearch/sql/api/UnifiedQueryContext.java
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 03ed5e9

ahkcs added 14 commits May 27, 2026 09:53
PPL `patterns` command's AstBuilder reads cluster settings for method/mode/
max_sample_count/buffer_limit/show_numbered_token defaults when the query
omits them. Without these in the analytics-engine path's settings map, the
parser reads null, falls into `PatternMethod.valueOf("NULL")`, and every
`patterns` query without an explicit `method=` or `mode=` argument fails at
parse time with `No enum constant PatternMethod.NULL`.

Mirrors the OpenSearchSettings defaults (SIMPLE_PATTERN / LABEL / 10 /
100000 / false). Part of the analytics-engine route support for the
`patterns` command.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
`buildParseRelNode` for `ParseMethod.PATTERNS` lowered through PPL's REPLACE
handler, which always emits Calcite's 3-arg `REGEXP_REPLACE_3`. That works on
the V2 / Calcite path (Calcite's default is replace-all), but the analytics-
engine route converts the call to substrait + DataFusion, and DataFusion's
`regexp_replace` defaults to first-match-only without an explicit "g" flag.

The dashboard test for `source = bank | patterns email mode=label` returned
`<*>@pyrami.com` instead of `<*>@<*>.<*>` because only the first
`[a-zA-Z0-9]+` run was replaced.

Bypass the REPLACE handler for the PATTERNS branch and emit
`REGEXP_REPLACE_PG_4` directly with a constant "g" flag. Same semantics on V2 /
Calcite (Calcite's REGEXP_REPLACE_PG_4 with "g" = replace-all); fixes the
analytics-engine path.

CalcitePPLPatternsTest plan-string expectations updated to match the 4-arg
form. 17/17 unit tests pass. IT result on analytics-engine route:
testSimplePatternLabelMode_NotShowNumberedToken now passes.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…el dashboard query

OpenSearch Dashboards renders BRAIN-pattern panels with the shape:

  patterns ... method=BRAIN mode=label
  | stats count() as pattern_count, take(message, 1) as sample_logs
    by patterns_field
  | sort -pattern_count
  | fields patterns_field, pattern_count, sample_logs

This integration test pins that shape on the analytics-engine route so
regressions surface immediately. Schema-only assertions because BRAIN's
clustering output is dataset-version-sensitive — the contract we care about
is "the query plans, executes, and returns three columns in the right order".

Currently red end-to-end pending the BRAIN label window-UDF type-cascade
fix (see the OpenSearch-side WIP commit "BRAIN window UDF + dashboard
query path scaffolding" — the {@code PplWindowCallRewriter} stub
documents the remaining gap).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Spotless drift from cherry-picking the analytics-engine patterns work
across upstream's recent formatting touch-ups. No behavior change.

Signed-off-by: Kai Huang <huangkaics@gmail.com>
Signed-off-by: Kai Huang <ahkcs@amazon.com>
…p_replace

CalciteExplainIT's `testPatternsSimplePatternMethodWith{out,AggPushDown}Explain`
expected the old 3-arg `REGEXP_REPLACE(...)` form, but after the `feat(core)`
commit emits 4-arg `REGEXP_REPLACE(..., 'g':VARCHAR)` the plan output now
includes the extra operand both in the logical line and in the base64-encoded
compounded script of the physical/pushdown plan.

Regenerate both YAML expectations against the live planner.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…cript pushdown

The `feat(core)` commit on this branch lowered PPL `patterns` to a 4-arg
`REGEXP_REPLACE_PG_4(field, pattern, replacement, 'g')` so DataFusion (which
defaults to first-match-only) does global replacement on the analytics-engine
route. Calcite's enumerable runtime — which the V2 / Calcite-pushdown path uses
to compile the serialized RexCall into Janino bytecode — has no matching
`SqlFunctions.regexpReplace(String, String, String, String)` impl (only
`(String, String, String, int[, ...])` variants where the 4th arg is start
position, not a flags string). Janino codegen failed with
`No applicable constructor/method found` for the 4-arg-with-flags call shape,
breaking the patterns.md doctest (`source=apache | patterns message
method=simple_pattern mode=aggregation`).

Two complementary fixes:

1. `RexStandardizer.visitCall`: before serializing for pushdown, collapse
   `REGEXP_REPLACE_PG_4(field, pattern, replacement, 'g')` to the 3-arg
   `REGEXP_REPLACE_3` form. Safe because Calcite's 3-arg variant is already
   replace-all (same semantics as PG_4 with `g`). Only fires when the flags
   literal is exactly `"g"` so any future `i`/`m`/etc. use cases pass through
   untouched.

2. `ExtendedRelJson.toOp`: pass operand count when looking up an operator on
   the deserialization side so multi-arity SQL names (REGEXP_REPLACE_3 vs
   REGEXP_REPLACE_PG_4 vs REGEXP_REPLACE_5 all share `name="REGEXP_REPLACE"`)
   resolve to the right overload. Defensive — the standardizer fix above is
   what actually unblocks the doctest, but the resolver was picking by name
   alone and would have surfaced the same bug for any other overloaded
   builtin.

Verified locally:
- doctest queries (`patterns ... method=simple_pattern mode=aggregation [...]`)
  now return fully-tokenized output;
- `CalcitePPLDashboardPatternsIT` still 1/1 PASS;
- `CalcitePPLPatternsIT` still 10/15 with the same five known-pending failures
  (LogicalCorrelate + `_ShowNumberedToken` BRAIN cases).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…doctest

The arity filter added to ExtendedRelJson.toOp in the previous commit broke
SAFE_CAST → JSON_EXTRACT deserialization (used by `spath` lowering): the
PPL JSON_EXTRACT UDF, registered as an anonymous UserDefinedFunctionBuilder
subclass, doesn't expose a meaningful getOperandCountRange(), so my filter
fell through to the firstKindMatch path and skipped the
AvaticaUtils.instantiatePlugin "class" path that previously resolved the
UDF. spath.md doctest started returning RuntimeException on
`source=structured | spath input=doc_n n | eval n=cast(n as int) | stats sum(n)`.

The RexStandardizer collapse (4-arg `REGEXP_REPLACE_PG_4(..., 'g')` → 3-arg
`REGEXP_REPLACE_3`) already fixes the patterns.md doctest at the source side
— by the time pushdown serialization runs, no 4-arg call exists for toOp to
disambiguate. The arity filter was defensive only and no longer carries its
weight; revert toOp to the original first-kind-match behavior, plus a spotless
re-flow that came in with the same change.

Verified locally on a fresh cluster:
- spath.md doctest query → returns sum(n)=6 (was 500).
- patterns.md doctest query → returns fully-tokenized aggregation rows.
- CalcitePPLDashboardPatternsIT → 1/1 PASS.
- CalcitePPLPatternsIT → 10/15 PASS (same baseline; same five known-pending
  BRAIN failures tracked separately).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @penghuo: drop the verbose multi-line explanatory comments and tighten
the class/method javadoc on the new dashboard IT.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @dai-chen: schema-only verification doesn't catch "query succeeds but
returns 0/wrong rows". Pin the 4 BRAIN clusters with their exact patterns,
counts, and sample logs against the HDFS_LOGS fixture.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @dai-chen: the two consecutive `if (PATTERNS)` branches in
buildParseRelNode share a condition; merge into a single if/else with
each branch fully co-located. Pure refactor — CalcitePPLPatternsTest
(logical-plan unit test) passes.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ushdownIT

Per CLAUDE.md guidance, new Calcite IT classes should be added to the
no-pushdown suite. Verified locally that the dashboard query also passes
with pushdown disabled (Dashboard 1/1, Patterns 10/15 — same baseline).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…EPLACE

The previous YAML capture pre-dated the RexStandardizer 4-arg → 3-arg
collapse landing. With the collapse, the pushed-down compounded script
serializes the 3-arg form (SOURCES has 7 entries, no trailing 'g').

Signed-off-by: Kai Huang <ahkcs@amazon.com>
… adapter

Per @penghuo's review: DataFusion-specific concerns shouldn't live in SQL core.
The 'g' flag is needed only because DataFusion's regexp_replace defaults to
first-match-only — Calcite's 3-arg form is already replace-all on both pushdown
and no-pushdown paths.

Restores SQL core, RexStandardizer, the patterns unit test, and the SIMPLE-
patterns explain YAMLs to their upstream/main shape. The 'g' flag is appended
in opensearch-project/OpenSearch#21797's RegexpReplaceAdapter when converting
3-arg REGEXP_REPLACE to DataFusion. Same end-user behavior, smaller SQL diff,
and the Calcite no-pushdown path no longer diverges from the pushdown YAML.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Per @dai-chen: verify the RelNode produced when `patterns <field>` is run
without explicit method=/mode= args — exercises that the PATTERN_METHOD and
PATTERN_MODE defaults flow through to AstBuilder.visitPatternsCommand and
produce a valid SIMPLE/LABEL lowering with a `patterns_field` projection.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the feat/ppl-patterns-analytics-engine branch from 03ed5e9 to b830582 Compare May 27, 2026 16:57
ahkcs added a commit to ahkcs/OpenSearch that referenced this pull request May 27, 2026
…ter for 3-arg calls

DataFusion's regexp_replace defaults to first-match-only without an explicit
flag; Calcite's REGEXP_REPLACE_3 is already replace-all. PPL relies on the
Calcite contract (every match replaced — used by SIMPLE patterns, regex_replace,
rex mode=sed), so on the DataFusion path the adapter now rewrites every 3-arg
REGEXP_REPLACE_3 to 4-arg REGEXP_REPLACE_PG_4(..., "g") preserving the same
end-user semantics across backends.

Companion change in opensearch-project/sql#5467: the SQL core no longer emits
the 'g' flag itself — that DataFusion-specific concern now lives only here.

Two existing unit tests updated to expect the new always-global behavior
(testAdaptPassesThroughWhenNoQuoteBlock → testAdaptAppendsGlobalFlagFor3Arg,
testAdaptPassesThroughNonLiteralPattern → testAdaptAppendsGlobalFlagForNonLiteralPattern).
The \Q rewrite, $N backreference rewrite, and 4-arg-with-flags pass-through
are unchanged.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b830582

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 42799d4

@ahkcs ahkcs requested a review from dai-chen May 27, 2026 17:20
@ahkcs ahkcs merged commit 1183e73 into opensearch-project:main May 27, 2026
61 of 62 checks passed
mch2 pushed a commit to opensearch-project/OpenSearch that referenced this pull request May 28, 2026
…21797)

* [Analytics Backend / DataFusion] Patterns milestone 1: pure-logic Rust scaffold

Starts the port of PPL `patterns` command's algorithm layer to native Rust
so the BRAIN method and `PATTERN_PARSER` scalar can execute on the
analytics-engine route. The end goal is `CalcitePPLPatternsIT` passing
against parquet-backed indices.

A new `patterns/` module containing the pure-logic layer (no DataFusion
dependency, fully unit-testable in isolation):

- `preprocess.rs` — Java `BrainLogParser.preprocess(...)` port. Regex-based
  variable detection (IP, ISO datetime, UUID, hex/letter-digit/floats),
  delimiter normalization, whitespace splitting. The "generic number
  surrounded by non-alphanumeric" rule is implemented manually because
  Rust's `regex` crate doesn't support lookaround.

- `utils.rs` — Java `PatternUtils` port. `parse_pattern`, `extract_variables`,
  `ParseResult` with `to_token_order_string`. `WILDCARD_PATTERN` and
  `TOKEN_PATTERN` regexes are equivalent to Java's `Pattern.compile` strings.

- `brain.rs` — `BrainLogParser` struct skeleton, `collapse_continuous_wildcards`,
  `PatternEntry` / `BrainParseStats` typed result types. The classifier
  internals (histogram, group token set, `parse_log_pattern`) are deliberate
  stubs in this milestone — they land in milestone 2.

- `tokens.rs` — `PatternResult` typed view of the per-row / per-group result
  the UDF layer will materialize into Arrow.

14 unit tests pass (1 `#[ignore]` placeholder for the full BRAIN
classification once the algorithm port lands):

- `preprocess_simple_log_line_splits_on_whitespace`
- `preprocess_substitutes_ip_then_blk_number` — matches the expected
  preprocessed shape from `testBrainLabelMode_NotShowNumberedToken`
- `preprocess_substitutes_uuid` — matches `testBrainParseWithUUID_*`
- `preprocess_collapses_consecutive_wildcards_via_number_runs`
- `parse_wildcard_pattern_splits_on_email_separators` —
  `<*>@<*>.<*>` from `testSimplePatternLabelMode_*`
- `to_token_order_string_rewrites_wildcards_to_numbered_tokens` —
  produces `<token1>@<token2>.<token3>`
- `extract_variables_extracts_email_parts` — matches
  `testSimplePatternLabelMode_ShowNumberedToken` ImmutableMap expectation
- `extract_variables_handles_multi_sample_aggregation` — matches
  `testBrainAggregationMode_ShowNumberedToken` multi-sample tokens.list
- `extract_variables_returns_empty_when_static_mismatch`
- `token_pattern_matches_numbered_placeholders`
- `collapse_three_consecutive_wildcards` / variants

1. Full BRAIN classifier port — histogram + group-token-set + parse_log_pattern.
2. DataFusion ScalarUDF wrapper for PATTERN_PARSER (`udf/pattern_parser.rs`).
3. DataFusion AggregateUDF wrapper for INTERNAL_PATTERN (aggregation mode).
4. DataFusion WindowUDF wrapper for INTERNAL_PATTERN (label mode).
5. Substrait YAML signatures for `pattern_parser` and `internal_pattern`.
6. Java adapters in analytics-backend-datafusion + ScalarFunction enum + capability registration.
7. `PatternsCommandIT` mirroring `CalcitePPLPatternsIT` against parquet-backed indices.
8. Verification via `:integTestRemote -Dtests.analytics.force_routing=true`.

This milestone is a no-op at runtime — `patterns/` is unwired. Lands as
the algorithmic foundation for the work above.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Register ITEM for ARRAY and MAP returns

PPL `patterns` lowers its result-struct flatten via `INTERNAL_ITEM(struct, "sample_logs")`
and `INTERNAL_ITEM(struct, "tokens")`, where `sample_logs` returns `ARRAY<VARCHAR>`
and `tokens` returns `MAP<VARCHAR, ARRAY<VARCHAR>>`. The scalar form of ITEM was
already in STANDARD_PROJECT_OPS (with SUPPORTED_FIELD_TYPES — covers VARCHAR /
numeric returns), but the ARRAY- and MAP-returning shapes weren't registered, so
OpenSearchProjectRule rejected the call with
`No backend supports scalar function [ITEM] among [datafusion]`.

Adds ITEM to both ARRAY_RETURNING_PROJECT_OPS and MAP_RETURNING_PROJECT_OPS.
Part of the PPL `patterns` command analytics-engine support stack.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Patterns milestone 2: full BRAIN classifier port

Replaces the stub in patterns/brain.rs with a full port of
BrainLogParser.java from the SQL plugin's
common/src/main/java/org/opensearch/sql/common/patterns/.

## What lands

- preprocess_all_logs — tokenizes each line with default filters
  (IP/datetime/UUID/numbers), appends a synthetic logId token, and
  updates token_freq_map.
- process_token_histogram — positional token-frequency counter.
- calculate_group_token_freq — picks the representative WordCombination
  (sorted by same_freq_count desc, then word_freq desc) per row and
  populates the per-(tokens_len,candidate,position) group token set.
- parse_log_pattern — per-row classifier. Tokens whose frequency >
  repFreq are kept ONLY if unique in their group; tokens with frequency
  < repFreq are kept ONLY if the group has fewer than
  variable_count_threshold variants. Everything else becomes <*>.
- parse_all_log_patterns — full pipeline. Group equal pattern strings
  together, count occurrences, collect samples up to max_sample_count.
- WordCombination — typed pair with the Java compareTo ordering.
- collapse_continuous_wildcards — adjacent-<*> collapse, unchanged from
  milestone 1.

## Tests (18/18 pass)

Existing 14 unit tests from milestone 1 still pass. New tests cover
the BRAIN classifier directly with fixtures from CalcitePPLPatternsIT:

- brain_groups_verification_succeeded_lines — two
  'Verification succeeded' lines collapse to one pattern with 2 samples.
- brain_aggregates_hdfs_fixtures_into_four_groups — the 8-line HDFS
  fixture matches the IT's testBrainAggregationMode_NotShowNumberedToken
  expectation: exactly 4 patterns, every group has count == 2.
- brain_aggregates_brain_label_mode_blockstar_into_expected_pattern —
  spot-checks the exact pattern string the IT asserts on row 1 of
  testBrainLabelMode_NotShowNumberedToken:
  'BLOCK* NameSystem.addStoredBlock: blockMap updated: <*IP*> is added
  to blk_<*> size <*>'.

The fixtures match the IT 1:1 — equivalence with the Java implementation
on the test surface that motivated this work is enforced at unit test
time. Subsequent milestones wire the algorithm into DataFusion's
ScalarUDF / AggregateUDF / WindowUDF APIs and registers the
opensearch_scalar_functions.yaml signatures so the analytics-engine
route can dispatch to it.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Patterns milestone 3a: scalar eval functions

Adds patterns/eval.rs with the three entry points the PATTERN_PARSER
scalar UDF dispatches between. Mirrors PatternParserFunctionImpl's
evalField / evalAgg / evalSamples (197-line Java class).

- eval_field(pattern, field) — SIMPLE label mode + show_numbered_token.
  Parses the wildcard pattern, extracts field substrings into a token
  map, returns numbered-token rewrite.
- eval_samples(pattern, sample_logs) — SIMPLE aggregation mode +
  show_numbered_token. Token map accumulates across all sample logs.
- eval_agg(field, agg_object, show_numbered_token) — BRAIN label mode.
  Scores each candidate pattern against the preprocessed input tokens,
  picks the highest-similarity candidate, optionally rewrites to
  numbered tokens.

26/26 patterns module tests pass. New cases pin equivalence with the
CalcitePPLPatternsIT expectations directly:

- eval_field_renames_email_wildcards_to_numbered_tokens — matches
  testSimplePatternLabelMode_ShowNumberedToken's <token1>@<token2>.<token3>
  on "amberduke@pyrami.com".
- eval_field_handles_custom_pattern — testSimplePatternLabelModeWithCustomPattern_*
  with the "amberduke<*>" prefix-anchored template.
- eval_samples_accumulates_tokens_across_samples — matches
  testSimplePatternAggregationMode_ShowNumberedToken's 3-sample case.
- eval_agg_picks_best_matching_candidate — best-fit similarity
  scoring against two BRAIN-aggregate candidates.

Next milestone: ScalarUDF wrapper in udf/pattern_parser.rs +
opensearch_scalar_functions.yaml entry + ScalarFunction enum +
Java adapter so the analytics-engine route can dispatch into these
eval functions. Currently unused at runtime.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Patterns milestone 3b: PATTERN_PARSER scalar UDF wiring

Wires the eval functions from milestone 3a into a DataFusion ScalarUDF
and registers all the cross-component plumbing the analytics-engine
route needs to dispatch to it. After this milestone the PPL SIMPLE
patterns + show_numbered_token call shape (evalField / evalSamples) is
reachable from a Calcite RelNode on the analytics-engine path.

- udf/pattern_parser.rs — DataFusion ScalarUDF wrapper. Accepts two
  operand shapes:
    pattern_parser(VARCHAR, VARCHAR)         — evalField
    pattern_parser(VARCHAR, List<VARCHAR>)   — evalSamples
  Return type is Struct<pattern: VARCHAR, tokens: Map<VARCHAR,
  List<VARCHAR>>>. The 3-arg evalAgg shape used by BRAIN label mode
  goes through a separate path (next milestone).

- opensearch_scalar_functions.yaml — substrait  entry,
  return type declared as any1 (same convention json_extract_all uses
  for its concrete Map return).

- ScalarFunction.PATTERN_PARSER — new enum constant in
  analytics-framework.

- PatternParserAdapter — rename adapter (AbstractNameMappingAdapter)
  that routes PPL's INTERNAL_PATTERN_PARSER calls to the locally-
  declared  SqlFunction. The locally-declared operator
  is the referent of the FunctionMappings.s entry that gives isthmus
  the substrait extension name.

- DataFusionAnalyticsBackendPlugin —
    * adapter map: PATTERN_PARSER → new PatternParserAdapter()
    * MAP_RETURNING_PROJECT_OPS: + PATTERN_PARSER (its return type is
      MAP<VARCHAR, ANY> per UserDefinedFunctionUtils.patternStruct)

- DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS —
    FunctionMappings.s(LOCAL_PATTERN_PARSER_OP, "pattern_parser")

Existing 26 patterns module unit tests still pass; new 2 unit tests in
udf::pattern_parser pin the StructArray construction shape:
- struct_data_type_has_pattern_and_tokens_fields
- build_struct_array_populates_pattern_and_tokens_for_email_evalfield

Total Rust crate test pass: 28/28.

Native lib rebuild + IT run-through is the next step in this branch —
to verify the IT pass count moves from 3/14 to ≥ 5/14 (the 2 SIMPLE
patterns+show_numbered_token tests should pass once the runtime
substrait binding succeeds).

- INTERNAL_PATTERN aggregate UDF (3 tests — BRAIN aggregation mode)
- INTERNAL_PATTERN window UDF (3 tests — BRAIN label mode; this UDF is
  the input to PATTERN_PARSER's 3-arg evalAgg shape)
- TAKE aggregate nullability fix (1 test)
- 3-arg evalAgg shape in this UDF (depends on window UDF landing)

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Register SAFE_CAST for ARRAY and MAP returns

PPL's flattenParsedPattern wraps INTERNAL_ITEM(struct, key) lookups in
SAFE_CAST whenever the keyed field needs an explicit declared type. The
flatten step targets:
 - SAFE_CAST(ITEM(struct, "pattern"), VARCHAR)        — scalar (covered)
 - SAFE_CAST(ITEM(struct, "pattern_count"), BIGINT)   — scalar (covered)
 - SAFE_CAST(ITEM(struct, "tokens"), MAP<VARCHAR,...>) — needs MAP entry
 - SAFE_CAST(ITEM(struct, "sample_logs"), ARRAY<VAR>) — needs ARRAY entry

SAFE_CAST is already in STANDARD_PROJECT_OPS for SUPPORTED_FIELD_TYPES
(scalar) returns. Add it to ARRAY_RETURNING_PROJECT_OPS and
MAP_RETURNING_PROJECT_OPS so the OpenSearchProjectRule planner check
admits the call when its inferred return type is array- or map-shaped.

This is part of the PPL patterns command analytics-engine support
stack. Effect on CalcitePPLPatternsIT: the 6 SAFE_CAST 'No backend
supports' failures from milestone 3b shift to substrait-binding
errors that need the INTERNAL_PATTERN window / aggregate UDFs to
fully clear.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Add INTERNAL_PATTERN aggregate UDF (BRAIN aggregation mode)

Implements the BRAIN aggregate side of PPL's `patterns ... method=BRAIN` on
the analytics-engine route. Mirrors `LogPatternAggFunction` on the SQL plugin
side: collects per-group log lines and runs BRAIN over the corpus at
finalize time, emitting

  List<Struct<pattern, pattern_count, tokens, sample_logs>>

so the downstream UNNEST + ITEM projections (PPL's `flattenParsedPattern`)
resolve as named struct-field access.

Wiring (top-down):

- `AggregateFunction` (analytics-framework): adds `PATTERN` enum constant.
  Makes `fromNameOrError` case-insensitive — the PPL operator is registered
  lower-case ("pattern") whereas the enum constants are upper-case, and the
  raw `valueOf("pattern")` lookup was failing.

- `DataFusionAnalyticsBackendPlugin`: declares `AggregateFunction.PATTERN`
  in `AGG_FUNCTIONS` so the capability registry advertises it as supported
  on the datafusion backend.

- `DataFusionFragmentConvertor`: adds `LOCAL_INTERNAL_PATTERN_OP`
  (substrait-bound to `internal_pattern`) and an `ADDITIONAL_AGGREGATE_SIGS`
  entry — same pattern as `LOCAL_TAKE_OP` / `LOCAL_FIRST_OP`.

- `PplAggregateCallRewriter`: rewrites PPL `pattern(...)` calls onto
  `LOCAL_INTERNAL_PATTERN_OP` and substitutes the call's return type with
  the concrete struct shape (PPL's declared `ARRAY<MAP<VARCHAR, ANY>>` has
  an embedded `ANY` that isthmus cannot serialize to Substrait).

- `opensearch_aggregate_functions.yaml`: registers `internal_pattern`
  with 4-arg, 5-arg, 6-arg, and 1-arg (FINAL) overloads matching the PPL
  emitted call shapes (max_sample_count, buffer_limit, show_numbered_token,
  plus optional frequency_threshold_percentage and variable_count_threshold).

- `udaf/internal_pattern.rs`: the actual UDAF — variadic_any signature, an
  Accumulator that collects log lines and runs `BrainLogParser` at
  evaluate() time, with per-shard state shaped as `List<Utf8>` so the
  coordinator's FINAL accumulator can concatenate via `merge_batch`. 3
  unit tests pin behaviour (empty corpus, repeated-pattern grouping,
  cross-shard merge).

Test impact (`CalcitePPLPatternsIT` via analytics-engine route): the three
BRAIN aggregation-mode tests advance past the previous
`AggregateFunction.pattern` enum lookup; they now surface a downstream
"Project rule encountered unmarked child [LogicalCorrelate]" from the
PPL Calcite path's UNNEST after the aggregate (separate planner work to
add Correlate support, not covered by this commit). Window UDF for BRAIN
label mode and the SAFE_CAST nullability fix for SIMPLE aggregation mode
are also pending.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] FORCE_NULLABLE on LOCAL_TAKE_OP and LOCAL_ARRAY_AGG_OP

PPL declares TAKE's return type via {@code PPLReturnTypes.ARG0_ARRAY} which
passes {@code nullable=true} to {@code SqlTypeUtil.createArrayType}. The
rewriter clones that nullable type onto the rewritten LOCAL_TAKE_OP call,
but the operator's default {@code ReturnTypes.TO_ARRAY} infers NOT NULL
(because aggregates over a non-empty group can't be null in standard SQL
semantics). The mismatch trips {@code AggregateCall.create}'s validation
with

  type mismatch:
    aggCall type:   VARCHAR ARRAY
    inferred type:  VARCHAR ARRAY NOT NULL

Surface: CalcitePPLPatternsIT's SIMPLE aggregation tests — the Parse-then-
Aggregate path applies TAKE to the source field, and the field's nullable
typing flows into the aggregate's declared return type. The bug is general
though — any nullable-input TAKE rewritten through the analytics-engine
backend would have hit it as soon as the rewriter's explicit-type path
fired.

Fix: andThen FORCE_NULLABLE on both LOCAL_TAKE_OP and LOCAL_ARRAY_AGG_OP so
the operator's inferred type matches what PPL emits. Mirror of the PPL
side's ARG0_ARRAY.

Test impact: CalcitePPLPatternsIT 3/15 → 4/15 (one SIMPLE aggregation test
unblocked; the show_numbered_token variants still hit the separate
"Unable to convert the type ANY" issue from PATTERN_PARSER's MAP<VARCHAR,
ANY> return type).

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] PATTERN_PARSER struct return type + list<string> overload (partial)

Wires up the PATTERN_PARSER ANY-type fixes that the SIMPLE pattern
show_numbered_token tests need, but does NOT yet bring them green —
the wrapping {@code map_extract} + {@code array_element} chain that
ArrayElementAdapter created (when ITEM was lowered against the
original PPL MAP<VARCHAR, ANY> declared type) keeps its frozen
ANY return type even after PATTERN_PARSER is rewritten to STRUCT.
Substrait's TypeConverter still rejects with "Unable to convert the
type ANY" when it walks the operand types of the wrappers.

Captures the necessary framework:

- `opensearch_scalar_functions.yaml`: adds a second
  `pattern_parser(pattern: string, sample_logs: list<string>)`
  overload — the SIMPLE aggregation mode emits this call shape from
  the PPL Calcite visitor's `showNumberedToken=true` branch, and
  without this overload isthmus's ScalarFunctionConverter rejects the
  call earlier with "Unable to convert call pattern_parser(string?,
  list<string?>?)" before even getting to operand-type validation.

- `PatternParserAdapter`: overrides `adapt` to substitute the PPL
  declared MAP<VARCHAR, ANY> return type with the concrete struct
  shape `STRUCT<pattern: VARCHAR, tokens: MAP<VARCHAR, ARRAY<VARCHAR>>>`
  that matches the Rust UDF's Arrow output. Same pattern as the
  PATTERN aggregate's PplAggregateCallRewriter case.

- `ItemTypeRebuilder` (new): pre-isthmus shuttle that walks every
  Project / Filter expression tree. Rebuilds ITEM calls so their
  return type is re-derived from operand 0 (handles the legacy
  ITEM-on-STRUCT case for non-adapted plans), and substitutes any
  pattern_parser call whose declared type is still MAP<VARCHAR, ANY>
  with the concrete struct. Wired into both
  `convertToSubstrait` and `convertStandalone` so attached-wrapper
  paths (PARTIAL agg + FINAL agg conversion) get the same rewrite.

Remaining gap (for follow-up): ArrayElementAdapter has already
converted ITEM on MAP into `array_element(map_extract(map, key), 1)`
with ANY-derived return types BEFORE PatternParserAdapter substitutes
the inner struct type. The wrappers stay typed as ANY ARRAY / ANY,
and isthmus rejects them. Two paths to close this:
  - SQL plugin: change PATTERN_PARSER's declared return type from
    MAP<VARCHAR, ANY> to a concrete struct (touches v2 path's result
    materialisation).
  - Backend: extend ItemTypeRebuilder to detect the
    `array_element(map_extract(STRUCT, key), 1)` anti-pattern and
    rewrite to a direct STRUCT field access.

Test impact (CalcitePPLPatternsIT via analytics-engine route): no
change from previous 4/15 pass count — the SIMPLE pattern
show_numbered_token tests still hit the ANY-typed wrapper chain. The
framework is in place; a follow-up commit closes the remaining gap.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Hoist struct-returning calls into child Project for FieldAccess

Builds on the previous PATTERN_PARSER struct return-type rewrite. Two
additions:

1. `tryUnwindArrayElementMapExtract`: detect the
   {@code array_element(map_extract(<STRUCT_call>, <literal_key>), 1)}
   chain that {@link ArrayElementAdapter} produces for ITEM-on-MAP and
   replace it with a direct {@code RexFieldAccess(STRUCT_call, field)}.
   Isthmus serialises field access natively via a Substrait
   {@code FieldReference} with {@code StructField} segment, bypassing
   the missing ITEM-on-struct extension binding.

2. `maybeHoistStructAccess`: isthmus'
   {@code RexExpressionConverter.visitFieldAccess} only handles refs
   whose reference expression is {@code RexInputRef}, {@code RexFieldAccess},
   or {@code RexCorrelVariable} — function-call references fail with
   {@code "RexFieldAccess for SqlKind OTHER_FUNCTION not supported"}.
   Hoist each unique struct-returning call into a child Project so the
   outer project's FieldAccess references an input ref instead.
   Mirrors {@code OpenSearchProject#liftNestedRexOver} which hoists
   nested {@code RexOver} for the same isthmus compatibility reason.

Test impact (CalcitePPLPatternsIT via analytics-engine route): 4/15 pass
count holds. SIMPLE pattern show_numbered_token tests now make it past
substrait conversion (no more {@code "Unable to convert the type ANY"}
or {@code "RexFieldAccess for SqlKind OTHER_FUNCTION"}) but DataFusion's
substrait consumer rejects nested field access with {@code "Direct
reference StructField with child is not supported"}. Closing that gap
requires either per-field scalar UDFs (e.g. {@code pattern_parser_get_pattern},
{@code pattern_parser_get_tokens}) that return scalar values directly,
or a substrait emission tweak that splits struct field access across
two relations. The hoist framework added here remains useful for both
paths — it's the pre-requisite for any approach that needs to express
"compute struct once, project its fields multiple times".

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Per-field scalar UDFs for PATTERN_PARSER (SIMPLE patterns end-to-end)

Closes the SIMPLE pattern show_numbered_token path on the analytics-engine
route. Replaces the previous "PATTERN_PARSER returns struct, downstream
unwraps via FieldAccess + hoist" pipeline with two scalar UDFs that each
return one field directly, eliminating struct-field access from the
substrait plan entirely.

Why the previous approach didn't work:
  - PPL declares PATTERN_PARSER's return type as MAP<VARCHAR, ANY> (the v2
    path returns a heterogeneous Java map), and the embedded ANY breaks
    substrait serialization.
  - Rewriting the return type to a concrete STRUCT lets isthmus serialize
    pattern_parser itself, but downstream ITEM(STRUCT, key) has no
    substrait extension binding, and a RexFieldAccess workaround fails
    isthmus' "RexFieldAccess for SqlKind OTHER_FUNCTION not supported".
  - Hoisting the struct call into a child Project + a top-level
    FieldAccess(InputRef, field) gets past isthmus, but DataFusion's
    substrait consumer then rejects with "Direct reference StructField
    with child is not supported".

Per-field UDF design:
  - `pattern_parser_get_pattern(pattern, field|sample_logs)` → Utf8
  - `pattern_parser_get_tokens(pattern, field|sample_logs)` → Map<Utf8, List<Utf8>>
  Both share the same parse logic as the base `pattern_parser` UDF (which
  is kept registered for the BRAIN aggregate path that emits a struct-of-
  -fields via UNNEST). The new wrappers delegate to the base UDF and
  extract the requested field column from its StructArray output.

Wiring:
  - `udf/pattern_parser.rs`: adds `PatternParserGetFieldUdf` parameterised
    by `PatternField::Pattern | Tokens`; registers both during
    `register_all`. Each wrapper returns a flat scalar Arrow array — no
    StructArray in the result.
  - `opensearch_scalar_functions.yaml`: signatures for both new UDFs in
    both call shapes (string+string for label mode, string+list<string>
    for aggregation-with-samples).
  - `DataFusionFragmentConvertor`: `LOCAL_PATTERN_PARSER_GET_*_OP`
    locally-declared operators + `ADDITIONAL_SCALAR_SIGS` entries so
    isthmus binds them by identity to the substrait extension names.
  - `ItemTypeRebuilder`: rewritten to detect the
    `array_element(map_extract(pattern_parser(args), 'pattern' | 'tokens'),
    1)` chain (the legacy ITEM-on-MAP lowering ArrayElementAdapter
    produces) and replace it with a direct call to the appropriate
    per-field operator, preserving the original pattern_parser operand
    list. Struct field access never reaches substrait.
  - `PatternParserAdapter`: reverts the earlier MAP→STRUCT return-type
    substitution (no longer needed) and updates the class doc to point
    to ItemTypeRebuilder as the actual ANY-elimination step.

Test impact (CalcitePPLPatternsIT via analytics-engine route): 4/15 → 8/15.
Newly green:
  - testSimplePatternAggregationMode_ShowNumberedToken
  - testSimplePatternAggregationMode_WithGroupBy_ShowNumberedToken
  - testSimplePatternLabelMode_ShowNumberedToken
  - testSimplePatternLabelModeWithCustomPattern_ShowNumberedToken
All SIMPLE pattern variants now pass. Remaining failures are all BRAIN
(3 aggregation + 4 label/UUID), each blocked by separate larger items:
LogicalCorrelate support in the analytics-engine planner for the
aggregation UNNEST, and an INTERNAL_PATTERN window UDF for label mode.

The companion SQL-plugin commit (Arrow Text → String via class-name check
in {@code ExprValueUtils.fromObjectValue}) is required for the new green
tests to deserialize the runtime result rows — see the sql repo's
{@code verify/datetime-udt-singleton-fix} branch.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] WIP: BRAIN window UDF + dashboard query path scaffolding

Lays down the structural pieces needed for PPL `patterns ... method=BRAIN mode=label`
on the analytics-engine route, but stops short of crossing the type-cascade
boundary required to ship the end-to-end test. Documented gap below.

What's in place:

Java planner:
  - `WindowFunction.PATTERN` enum + case-insensitive `fromName` fallback;
    `OpenSearchProjectRule.collectWindowFunctions` accepts OTHER-kind operators
    via name lookup so PPL's INTERNAL_PATTERN no longer trips the planner's
    "Window function [pattern] is not supported" guard at the capability check.
  - `DataFusionAnalyticsBackendPlugin` advertises `WindowFunction.PATTERN` in
    its window capability set.

Backend plugin:
  - `LOCAL_INTERNAL_PATTERN_WINDOW_OP` locally-declared SqlAggFunction +
    `ADDITIONAL_WINDOW_SIGS` so the WindowFunctionConverter binds the operator
    by identity to the substrait extension URN.
  - `opensearch_window_functions.yaml` declares the substrait variant for
    `internal_pattern` with `window_type: PARTITION` and `return: string?`.
  - `DataFusionPlugin.loadSubstraitExtensions` merges the new YAML into the
    extension catalog.
  - `PplWindowCallRewriter` skeleton (no-op for now — see gap below).

Rust:
  - New `udwf/internal_pattern.rs` module + registration through
    `session_context::create_session_context`. PartitionEvaluator buffers the
    field column for the partition, runs BrainLogParser once over the corpus,
    then per-row matches via eval_agg and emits the wildcard pattern string
    as Utf8 per row. Two unit tests pin the shape contract.
  - `udwf/mod.rs` aggregator + `lib.rs` `pub mod udwf;`.

ItemTypeRebuilder:
  - Detects the raw `ITEM(pattern_parser(...), 'key')` shape in addition to
    the post-ArrayElementAdapter `array_element(map_extract(...), 1)` form.
    The label-mode chain never goes through ArrayElementAdapter because the
    pattern_parser's container type still reads as MAP<VARCHAR, ANY> at adapter
    time, so the raw ITEM shape is the one that reaches the convertor.
  - Handles the 3-arg `pattern_parser(source, window_result, show_token)`
    shape PPL emits for BRAIN label mode: 'pattern' key bypasses the parser
    entirely (returns the window result directly); 'tokens' key routes to
    the existing 2-arg `pattern_parser_get_tokens(window_result, source)`.

Test scaffold:
  - `CalcitePPLDashboardPatternsIT` pins the canonical Dashboards query shape:
    `patterns ... method=BRAIN mode=label | stats count(), take(message, 1)
    by patterns_field | sort -pattern_count`. Currently red.

The gap — RexInputRef type cascade after RexOver substitution:

When the BRAIN label expression compiles, `OpenSearchProject#liftNestedRexOver`
hoists the `pattern(...) OVER ()` call into a child Project so DataFusion's
substrait consumer auto-lifts it into a `LogicalWindow`. The parent Project's
{@code PATTERN_PARSER(field, $hoistedCol, showToken)} call captures a
{@code RexInputRef($hoistedCol)} whose Calcite-frozen type is the child's
row-type entry for that column — currently {@code MAP<VARCHAR, ANY> ARRAY}
(PPL's declared INTERNAL_PATTERN return).

Rewriting the inner RexOver to {@code LOCAL_INTERNAL_PATTERN_WINDOW_OP} with
VARCHAR return type cascades cleanly through the child Project's expressions,
but the parent's frozen {@code RexInputRef} type doesn't update — Calcite
flags it with {@code "type mismatch: ref: MAP NOT NULL ARRAY, input: VARCHAR"}
at the next Aggregate/Sort construction.

Closing the gap requires RelNode-level coordination: rewrite the inner
Project's row type, then walk every downstream Project / Aggregate that
references the hoisted column and rebuild their RexInputRefs with the new
VARCHAR type. That's a multi-RelNode rewriter (similar in spirit to
`maybeHoistStructAccess` in ItemTypeRebuilder but spanning a wider rel
window). Left for the next milestone — the no-op stub in
{@link PplWindowCallRewriter#WindowRewriteShuttle#visitOver} preserves the
existing planner-level capability check pass-through so the failure surfaces
as the substrait-side "Unable to convert the type ANY" rather than the
earlier "Window function [pattern] is not supported".

Test impact (`CalcitePPLDashboardPatternsIT`, `CalcitePPLPatternsIT` via
analytics-engine route): no change from the 8/15 baseline established by
the per-field PATTERN_PARSER UDF work. BRAIN tests (3 agg + 4 label/UUID)
remain red.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Engine] BRAIN label-mode window-UDF cascade rewrite + STATE_EXPANDING split skip

Lands the BRAIN label-mode path end-to-end through the analytics-engine
route. Two coordinated changes:

1. `PplWindowCallRewriter` — full RelNode-level rewriter

Replaces the prior no-op stub with a recursive bottom-up rewriter that
swaps PPL's INTERNAL_PATTERN window operator onto our local
`LOCAL_INTERNAL_PATTERN_WINDOW_OP` stub (returning {@code VARCHAR}) and
cascades the resulting column type change through every downstream
RelNode's {@link RexInputRef} and wrapping {@link RexCall}.

The cascade was the non-obvious piece. `OpenSearchProject#liftNestedRexOver`
hoists the {@code pattern(...) OVER ()} call into its own child Project
before fragment conversion, so the parent Project's {@code PATTERN_PARSER}
call references the hoisted column via a {@link RexInputRef} whose
declared type is frozen at construction to the original
{@code MAP<VARCHAR, ANY> ARRAY}. Substituting the inner RexOver's return
type to VARCHAR makes the child Project's row type at that slot mismatch
the parent's frozen RexInputRef type — Calcite's
{@code Aggregate.typeMatchesInferred} (and the analogous check on
{@code Project.<init>}) rejects with {@code "type mismatch: ref:
MAP NOT NULL ARRAY, input: VARCHAR"} on the next RelNode construction.

Closing the cascade required:
  - A custom bottom-up shuttle (`descendAndRewrite`) that visits children
    first then retypes the current RelNode's expressions *before* calling
    {@code copy()}, avoiding {@code RelShuttleImpl.visitChildren}'s eager
    copy that would otherwise validate against stale RexInputRefs.
  - {@code rewriteProject} early-returns the unchanged original Project
    only when both {@code exprChanged=false} AND {@code childChanged=false}
    — not just {@code inputRowTypeChanged=false}. Discovered via a
    per-recursion hash trace that showed the top Project returning the
    original (hash 441987200 → 441987200) and silently discarding all
    descendant rebuilds because its own expressions hadn't changed.
  - {@code visitCall} uses the 3-arg {@link RexBuilder#makeCall} form
    (preserving the call's declared return type) rather than the 2-arg
    form. The 2-arg form re-invokes operator-specific inference like
    {@code SqlCastFunction}'s {@code lambda$returnTypeInference$0} which
    expects an operand at index 1 that doesn't exist for SAFE_CAST as
    constructed via {@code RexBuilder.makeCast}.
  - {@code ExprRewriter#visitInputRef} retypes references whose declared
    type diverges from the current input row type — propagates the
    VARCHAR substitution upward through every dependent RelNode.

2. `OpenSearchAggregateSplitRule` — skip PARTIAL/FINAL split for STATE_EXPANDING

Surfaced by the dashboard query (`patterns ... method=BRAIN mode=label |
stats count() as pattern_count, take(message, 1) as sample_logs by
patterns_field | sort -pattern_count`). The split rule constructs both
{@code SINGLE-on-SINGLETON} and {@code PARTIAL+FINAL} alternatives. The
FINAL alternative reuses the original aggregate's argList — pointing into
original input column indices ({@code TAKE($1=content, $2=N_literal)}).
But FINAL's input is the PARTIAL output where those positions hold STATE
columns of different types ({@code count_state BIGINT}, {@code take_state
ARRAY<VARCHAR>}). The original TAKE operator's
{@code ARG0_ARRAY} inference on the shifted arg 0 gives
{@code BIGINT ARRAY} while the declared output is {@code VARCHAR ARRAY}
— {@code Aggregate.typeMatchesInferred} throws before
{@code DistributedAggregateRewriter} (which DOES remap correctly) can run.

Extend {@code shouldSkipPartialFinalSplit} to skip when any aggregate is
STATE_EXPANDING. On a 1-shard cluster the SINGLE-on-SINGLETON alternative
wins anyway, so we lose no parallelism in single-shard tests; on
multi-shard the trade-off is correctness over potential parallelism for
those aggs until per-phase operator overloads or split-time remapping
ships.

Test impact (`CalcitePPLPatternsIT` via analytics-engine route):

  Before this commit:  8/15
  After this commit:  10/15

Newly green:
  - testBrainLabelMode_NotShowNumberedToken
  - testBrainParseWithUUID_NotShowNumberedToken

Also newly green:
  - `CalcitePPLDashboardPatternsIT.testDashboardBrainLabelStatsByPatternsField`
    (the canonical Dashboards BRAIN-label pattern panel query)

Remaining 5/15 failures:
  - testBrainLabelMode_ShowNumberedToken — needs the tokens-map extraction
    path through the per-field {@code pattern_parser_get_tokens} call
    (uses the 2-arg shape; should be reachable but currently surfaces a
    separate substrait type issue).
  - testBrainParseWithUUID_ShowNumberedToken — same.
  - 3 testBrainAggregationMode_* — separate planner gap
    (`LogicalCorrelate` after UNNEST). Not blocked by this commit.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* ci: retrigger gradle-check (previous run returned null result after 2h timeout)

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Engine] Recursively normalize nested Arrow Text in ArrowValues

ArrowValues.toJavaValue was the boundary that converts Arrow vectors to
plain Java values handed to ExprValueUtils, but it only stripped Arrow's
`Text` wrapper at one level — top-level VarCharVector cells, MapVector
entry values, and ListVector elements.

Nested shapes (e.g. the BRAIN patterns `tokens` column, which is
Map<String, List<String>>) escaped with raw `Text` wrappers inside the
inner list, because MapVector's value-extraction path didn't recurse
into the list's elements. ExprValueUtils.fromObjectValue then rejected
those raw Text objects with "unsupported object class".

Introduce a small private `normalize(Object)` helper that recurses
through Lists, Maps, and Text wrappers, so containers of any depth land
on the SQL side as pure Java types. This keeps the Arrow→Java
conversion responsibility where it belongs (the analytics-engine result
materialization) instead of leaking Arrow type knowledge into the SQL
plugin's generic value converter.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] Append "g" flag in RegexpReplaceAdapter for 3-arg calls

DataFusion's regexp_replace defaults to first-match-only without an explicit
flag; Calcite's REGEXP_REPLACE_3 is already replace-all. PPL relies on the
Calcite contract (every match replaced — used by SIMPLE patterns, regex_replace,
rex mode=sed), so on the DataFusion path the adapter now rewrites every 3-arg
REGEXP_REPLACE_3 to 4-arg REGEXP_REPLACE_PG_4(..., "g") preserving the same
end-user semantics across backends.

Companion change in opensearch-project/sql#5467: the SQL core no longer emits
the 'g' flag itself — that DataFusion-specific concern now lives only here.

Two existing unit tests updated to expect the new always-global behavior
(testAdaptPassesThroughWhenNoQuoteBlock → testAdaptAppendsGlobalFlagFor3Arg,
testAdaptPassesThroughNonLiteralPattern → testAdaptAppendsGlobalFlagForNonLiteralPattern).
The \Q rewrite, $N backreference rewrite, and 4-arg-with-flags pass-through
are unchanged.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Backend / DataFusion] spotlessApply

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [QA] Add PatternsCommandIT — multi-shard coverage for STATE_EXPANDING aggregates

Per @marc's review: the OpenSearchAggregateSplitRule SINGLE-on-SINGLETON
fallback (when STATE_EXPANDING aggs like take/list/values are present)
only meaningfully fires on a multi-shard index — on single-shard, no split
ever happens. CoordinatorReduceIT covers no-group-by take/list/values
on 2 shards, but the dashboard query shape (take(field) by group-key) on
multi-shard had no coverage.

Adds three tests against the existing app_logs dataset:
- testSimplePatternLabelMode — single-shard, asserts every row gets
  tokenized (every patterns_field contains "<*>"), validates the
  RegexpReplaceAdapter "g" flag append.
- testSimplePatternAggregationModeMultiShard — 3-shard, exercises the
  OpenSearchAggregateSplitRule SINGLE-on-SINGLETON fallback. Asserts
  pattern_count totals sum to the dataset row count (200) and every
  group's sample_logs array is non-empty.
- testSimplePatternAggregationGroupByServiceMultiShard — 3-shard with
  explicit `by patterns_field, service_name` group key (cardinality > 1),
  drives the same fallback path with `take(message, 1)` as the aggregate.

BRAIN coverage stays on CalcitePPLDashboardPatternsIT (SQL side); BRAIN
aggregation has open LogicalCorrelate post-aggregate work tracked
separately.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [QA] PatternsCommandIT: pin number_of_shards=3 in multi-shard fixture

Self-check inside ensureMultiShardProvisioned() reads
GET /<index>/_settings and asserts the index settings report
number_of_shards=3. Makes the multi-shard nature of the
aggregation tests provable from the test itself rather than
implicit in the DatasetProvisioner.provision() call.

Per @marc's review request to verify the multi-shard claim.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Framework] Add WindowFunction.resolveFunction(SqlOperator)

Per @mch2's nit: pull the kind→name fallback out of OpenSearchProjectRule
into a single WindowFunction.resolveFunction(SqlOperator) so future
backends pick up the OTHER-kind name-lookup logic without copy-paste.

Three unit tests for the new method cover sql-kind hit, OTHER-kind name
fallback (PATTERN), and unknown-operator returning null.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Engine] Correct shouldSkipPartialFinalSplit comment

Per @mch2's review: the prior wording implied SINGLE-on-SINGLETON was
"the only viable choice on a single shard anyway", which is wrong now
that PatternsCommandIT exercises this path on a 3-shard index. Restate:
SINGLE-on-SINGLETON also runs correctly on multi-shard (gather to the
coordinator, then aggregate), it just trades distributed parallelism for
the type-mismatch workaround. Distributed parallelism is still the
follow-up.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Engine] Register PERCENTILE_APPROX in AggregateFunction, drop ad-hoc check

Per @mch2's review: PERCENTILE_APPROX is semantically a STATE_EXPANDING
aggregate (per-key t-digest state grows with input cardinality), exactly like
PERCENTILE_CONT and PERCENTILE_DISC which are already in the enum. The "right
spot" for it is the AggregateFunction registry, not an ad-hoc string check in
OpenSearchAggregateSplitRule.

Changes:
- AggregateFunction.PERCENTILE_APPROX(Type.STATE_EXPANDING, SqlKind.OTHER).
- OpenSearchAggregateSplitRule: replace the isPercentileApprox + separate
  STATE_EXPANDING checks with a single isStateExpanding(SqlAggFunction) helper
  that handles the unregistered-op throw gracefully (returns false rather
  than crashing the planner — fixes a latent issue where my earlier STATE_EXPANDING
  addition would have crashed on truly unknown aggs).
- Javadoc refreshed to describe the STATE_EXPANDING category rather than
  calling out percentile_approx as a one-off.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Engine] Trim verbose comments

Drop AI-generated explanatory blocks; keep terse WHY-only notes
where context isn't obvious from the code itself.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* [Analytics Framework] Restore javadocs for Type / IntermediateTypeResolver

Required by sandbox-check's missingJavadoc task on public types.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants