
[AURON #2335] Support more literal patterns in native split #2336

Open
weimingdiit wants to merge 1 commit into apache:master from weimingdiit:feat/native-split-literal-patterns


@weimingdiit

Contributor

Which issue does this PR close?

Closes #2335

Rationale for this change

Spark split conversion currently supports only a fixed allowlist of literal patterns. As a result, safe literal patterns such as /, \\+, and :: fall back to Spark even though the native Spark_StringSplit already splits on a literal string pattern.

This PR expands native coverage for common non-regex split patterns without adding full regex support.

What changes are included in this PR?

This PR updates Spark StringSplit conversion in ShimsImpl to recognize patterns that can be safely treated as literal strings.

The new logic supports:

  • plain literal patterns without regex metacharacters
  • escaped regex metacharacters that represent literal characters

Patterns requiring regex semantics continue to fall back.
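
The two supported cases and the fallback rule can be sketched as a small parser. This is a minimal illustration, not the code added to ShimsImpl: the object name LiteralSplitPattern, the method parse, and the exact metacharacter set are hypothetical, and rejecting the empty pattern is an assumption made here to stay conservative.

```scala
// Hypothetical helper (not the ShimsImpl implementation): returns the literal
// separator for a split pattern, or None when regex semantics are required.
object LiteralSplitPattern {
  // Characters treated as regex metacharacters outside a character class.
  // The set is deliberately conservative; it is an assumption of this sketch.
  private val RegexMeta: Set[Char] = "\\^$.|?*+()[]{}".toSet

  def parse(pattern: String): Option[String] = {
    // Assumption: leave the empty pattern to Spark rather than guess its semantics.
    if (pattern.isEmpty) return None
    val out = new StringBuilder
    var i = 0
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c == '\\') {
        // A backslash is accepted only when it escapes a metacharacter.
        if (i + 1 < pattern.length && RegexMeta.contains(pattern.charAt(i + 1))) {
          out.append(pattern.charAt(i + 1))
          i += 2
        } else {
          return None // e.g. \d, \s, or a trailing backslash
        }
      } else if (RegexMeta.contains(c)) {
        return None // unescaped metacharacter: fall back
      } else {
        out.append(c)
        i += 1
      }
    }
    Some(out.toString)
  }
}
```

With this sketch, a Some result would be passed to the native function as the separator, and a None result would leave the expression on the existing fallback path.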

This PR also adds SQL coverage in AuronFunctionSuite for:

  • split(c1, '/')
  • split(c2, '\\+')
  • split(c3, '::')
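
One way to see why these three patterns are safe is to compare a regex split against a plain literal split on the same inputs. The sketch below uses java.util.regex directly rather than Spark, and assumes the split limit is -1 (trailing empty strings kept); SplitEquivalence, literalSplit, and regexSplit are hypothetical names introduced only for this comparison.

```scala
import java.util.regex.Pattern

// Hypothetical comparison harness, independent of Spark and Auron.
object SplitEquivalence {
  // Literal split that keeps trailing empty strings, mirroring a regex split
  // with limit -1.
  def literalSplit(s: String, sep: String): Seq[String] = {
    require(sep.nonEmpty, "separator must be non-empty")
    val parts = Seq.newBuilder[String]
    var start = 0
    var idx = s.indexOf(sep, start)
    while (idx >= 0) {
      parts += s.substring(start, idx)
      start = idx + sep.length
      idx = s.indexOf(sep, start)
    }
    parts += s.substring(start)
    parts.result()
  }

  // Regex split with limit -1, so trailing empty strings are not dropped.
  def regexSplit(s: String, regex: String): Seq[String] =
    Pattern.compile(regex).split(s, -1).toSeq
}
```

For each pair (regex, literal) of ("/", "/"), ("\\+", "+"), and ("::", "::"), both functions should return the same parts on the test rows, which is the property the native path relies on.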

Are there any user-facing changes?

Yes. More Spark split expressions with literal patterns can now be executed natively instead of falling back to Spark.
There is no behavior change for regex patterns that are not safe to treat as literals.

How was this patch tested?

Unit tests: a new SQL test case in AuronFunctionSuite.

Signed-off-by: weimingdiit <weimingdiit@gmail.com>
@github-actions github-actions Bot added the spark label Jun 16, 2026
@weimingdiit weimingdiit marked this pull request as ready for review June 16, 2026 06:36
@cxzl25 cxzl25 requested a review from Copilot June 16, 2026 12:30

Copilot AI left a comment

Contributor


Pull request overview

This PR expands native conversion coverage for Spark split expressions when the pattern can be safely treated as a literal string (including escaped regex metacharacters like \\+), reducing unnecessary fallbacks while still rejecting patterns requiring true regex semantics.

Changes:

  • Replace the hard-coded split pattern allowlist with a small literal-pattern parser in ShimsImpl.
  • Route eligible StringSplit expressions to native Spark_StringSplit using the parsed literal separator.
  • Add SQL test coverage for additional literal separators (/, \\+, ::).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| spark-extension-shims-spark/src/main/scala/org/apache/spark/sql/auron/ShimsImpl.scala | Adds literal-pattern parsing and expands native conversion eligibility for StringSplit. |
| spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronFunctionSuite.scala | Adds a test exercising the newly supported literal split patterns. |


Comment on lines +192 to +199
  test("split function with literal regex patterns") {
    withTable("t1") {
      sql("create table t1(c1 string, c2 string, c3 string) using parquet")
      sql("insert into t1 values('a/b/c', 'a+b+c', 'a::b::c'), (null, null, null)")
      checkSparkAnswerAndOperator(
        "select split(c1, '/'), split(c2, '\\\\+'), split(c3, '::') from t1")
    }
  }


Development

Successfully merging this pull request may close these issues.

Support more non-regex literal patterns in native Spark split

2 participants