[AURON #2257] Avoid URI reparsing in JNI Hadoop paths by zhtttylz · Pull Request #2264 · apache/auron

zhtttylz · 2026-05-12T03:56:00Z

Which issue does this PR close?

Closes #2257.

Rationale for this change

Auron's JNI Hadoop file wrappers currently reconstruct Hadoop paths with new Path(new URI(path)).
Because the string is reparsed as a URI before the path is passed back to FileSystem, Hadoop Path(String) semantics are not preserved.

When a raw Hadoop path string contains a literal #, Java URI parsing treats the suffix after # as a fragment, so the actual Hadoop path is truncated.

For example, the intended path:

hdfs://mycluster/auron-it-hdfs-rbf-repro/raw#mini.txt

is opened as:

/auron-it-hdfs-rbf-repro/raw
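
The truncation can be reproduced with java.net.URI alone, without Hadoop on the classpath. The following is a minimal sketch; the class and method names are made up for illustration, and URI.create is used only because it wraps new URI(String) without a checked exception.

```java
import java.net.URI;

final class UriFragmentDemo {
    // Path component as java.net.URI parses it from the raw string.
    static String parsedPath(String raw) {
        return URI.create(raw).getPath();
    }

    // Fragment component: everything after the first literal '#'.
    static String parsedFragment(String raw) {
        return URI.create(raw).getFragment();
    }

    public static void main(String[] args) {
        String raw = "hdfs://mycluster/auron-it-hdfs-rbf-repro/raw#mini.txt";
        System.out.println(parsedPath(raw));     // /auron-it-hdfs-rbf-repro/raw
        System.out.println(parsedFragment(raw)); // mini.txt
    }
}
```

The suffix after # ends up in the fragment, so anything that later reads only the path component sees the shortened name.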

What changes are included in this PR?

This PR stops reparsing Hadoop path strings through java.net.URI in JniBridge.
The path reconstruction is changed from:

- new Path(new URI(path))
+ new Path(path)

This preserves Hadoop Path(String) semantics.
The PR also adds a regression test for JNI Hadoop file wrapper path handling when the path contains a literal #.
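
Why new Path(path) keeps the # can be illustrated with java.net.URI alone. This sketch assumes that Hadoop's Path(String) splits the string into scheme, authority, and path itself and then builds its URI from those components; that detail is an assumption here and is not shown in this PR. The class and helper names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class QuotedPathSketch {
    // Builds a URI from already-split components. The multi-argument constructor
    // quotes characters that are illegal in a path (such as '#') instead of
    // interpreting them as delimiters.
    static URI fromComponents(String scheme, String authority, String path) {
        try {
            return new URI(scheme, authority, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI u = fromComponents("hdfs", "mycluster", "/auron-it-hdfs-rbf-repro/raw#mini.txt");
        System.out.println(u.getPath());     // /auron-it-hdfs-rbf-repro/raw#mini.txt
        System.out.println(u.getRawPath());  // /auron-it-hdfs-rbf-repro/raw%23mini.txt
        System.out.println(u.getFragment()); // null
    }
}
```

The decoded path still contains the literal # and the fragment stays null, which is the property the regression test is described as asserting.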

Are there any user-facing changes?

This fixes a bug where Hadoop paths containing a literal # could be truncated.

No new APIs, configs, or migration steps are required.

How was this patch tested?

Ran the focused Java regression test:

mvn -pl auron-core -am -Pspark-3.5 -Pscala-2.12 -Ppre \
  -DskipBuildNative \
  -Dtest=org.apache.auron.jni.JniBridgeTest \
  test

Result:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS

Copilot

Pull request overview

This PR fixes a bug in Auron’s JNI Hadoop file wrappers where paths containing a literal # could be truncated due to java.net.URI fragment parsing, and adds Java regression coverage to prevent recurrence.

Changes:

Adjusted JNI bridge path handling to avoid fragment truncation when # appears in the path string.
Added JniBridgeTest regression tests covering literal # handling and percent-encoding behavior.
Added a test-scoped Hadoop runtime dependency to support the new unit test.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `auron-core/src/main/java/org/apache/auron/jni/JniBridge.java` | Changes how input/output paths are converted to Hadoop `Path` objects to avoid `#` fragment truncation. |
| `auron-core/src/test/java/org/apache/auron/jni/JniBridgeTest.java` | Adds regression tests asserting `#` is preserved and that read/write path encoding behavior is stable. |
| `auron-core/pom.xml` | Adds `hadoop-client-runtime` as a test dependency to compile/run the new test. |


    public static FSDataInputWrapper openFileAsDataInputWrapper(FileSystem fs, String path) throws Exception {
-        // the path is a URI string, so we need to convert it to a URI object
-        return FSDataInputWrapper.wrap(fs.open(new Path(new URI(path))));
+        return FSDataInputWrapper.wrap(fs.open(toInputPath(path)));
    }


    public static FSDataOutputWrapper createFileAsDataOutputWrapper(FileSystem fs, String path) throws Exception {
-        return FSDataOutputWrapper.wrap(fs.create(new Path(new URI(path))));
+        return FSDataOutputWrapper.wrap(fs.create(new Path(path)));
+    }
+
+    private static Path toInputPath(String path) throws URISyntaxException {
+        String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path;
+        return new Path(new URI(safePath));
    }


ShreyeshArangath · 2026-05-15T22:47:18Z

+    }
+
+    private static Path toInputPath(String path) throws URISyntaxException {
+        String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path;


This seems a little brittle; why do we need this? Can we do something like this instead?

public static FSDataInputWrapper openFileAsDataInputWrapper(FileSystem fs, String path) throws Exception {
    return FSDataInputWrapper.wrap(fs.open(new Path(path)));
}

Thanks, I went with this approach.

yew1eb

The fix for the read path is inconsistent with the write path, and the toInputPath workaround has an edge-case bug.

yew1eb · 2026-05-16T07:29:33Z

+
+    private static Path toInputPath(String path) throws URISyntaxException {
+        String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path;
+        return new Path(new URI(safePath));


Two issues here:

1. Inconsistent fix: createFileAsDataOutputWrapper was changed to new Path(path) (the correct simple fix), but openFileAsDataInputWrapper still goes through new URI(...) after escaping #. If new Path(path) is correct for writes, the same change should work for reads. The PR description itself says the fix is "change from new Path(new URI(path)) to new Path(path)". Why does the read path need a different approach?

2. Escaping bug: path.replace("#", "%23") mishandles a path that already contains a literal %23. The replace leaves that %23 untouched, and new URI(...) then decodes it to #, so the path resolves to a different name than the one given. If the simpler new Path(path) works for writes, applying it uniformly to reads as well would fix both issues at once.
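
The escaping step can be checked in isolation with java.net.URI. This is a minimal sketch that mirrors the toInputPath logic from the diff but drops the Hadoop Path wrapper; the class name, method name, and the nn host are hypothetical. In this sketch, String.replace only touches #, and the single-argument URI parser then decodes every %23, so a literal # and a literal %23 in the input become indistinguishable.

```java
import java.net.URI;

final class HashEscapeAmbiguity {
    // Same escaping as toInputPath in the diff, followed by URI parsing.
    // Returns the decoded path component that a consumer of the URI would see.
    static String decodedPathAfterEscape(String path) {
        String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path;
        return URI.create(safePath).getPath();
    }

    public static void main(String[] args) {
        // A file whose name contains a literal '#'.
        System.out.println(decodedPathAfterEscape("hdfs://nn/dir/a#b.txt"));   // /dir/a#b.txt
        // A different file whose name contains the three characters "%23".
        System.out.println(decodedPathAfterEscape("hdfs://nn/dir/a%23b.txt")); // /dir/a#b.txt
    }
}
```

Two distinct inputs map to the same decoded path, which is one concrete way the read-side workaround could open the wrong file.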

Thanks, I moved the normalization earlier and removed the read-side workaround.

zhtttylz · 2026-05-26T13:06:39Z

I tried to address this by normalizing the scan path before the native/JNI boundary, so JniBridge can use new Path(path) for both reads and writes. Happy to adjust if this still looks off, or if there’s a better way to handle it :)

weiqingy

Thanks for iterating on this — moving the decode up to the Spark scan side so JniBridge collapses to a symmetric new Path(path) is a clean answer to the earlier read/write asymmetry, and testFileWrappersPreserveLiteralHashInHdfsPath is a solid red/green guard (it asserts both the path is preserved and the URI fragment is null, so it fails on the pre-fix impl). One question inline.

weiqingy · 2026-06-14T06:29:53Z

    PartitionedFile(partitionValues, filePath, offset, size)

+  @sparkver("3.0 / 3.1 / 3.2 / 3.3")
+  override def getPartitionedFilePathString(file: PartitionedFile): String = {


The new 3.0–3.3 getPartitionedFilePathString (split + unescapePathName + rejoin) is the substantive new logic, and grepping the repo it has no test coverage anywhere — JniBridgeTest only exercises the Java new Path(path) half, which is the already-symmetric part. A shim-level test (Spark-3.x profile) asserting this method's output against the old new Path(new URI(filePath)) result for a # path, a %20 path, and — the case that matters most — a non-ASCII multi-byte path would lock the decode fidelity down. Worth adding? That third case is what would surface the encoding question on line 982 directly rather than leaving it to inspection.
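
To show why the non-ASCII case is the interesting one, here is a small stdlib-only sketch of the comparison such a test would make. The naiveDecode helper is hypothetical and is not Spark's unescapePathName; it stands in for any decoder that maps each %XX escape to a single char, and is compared against the path that java.net.URI produces, which decodes escaped octets as UTF-8. The sample paths are made up.

```java
import java.net.URI;

final class DecodeFidelitySketch {
    // Reference behavior: URI decodes runs of %XX octets as UTF-8.
    static String uriDecode(String raw) {
        return URI.create(raw).getPath();
    }

    // Hypothetical naive decoder: every %XX becomes exactly one char.
    // Illustration only; it is NOT the shim's actual implementation.
    static String naiveDecode(String raw) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '%' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    sb.append((char) (hi * 16 + lo));
                    i += 3;
                    continue;
                }
            }
            sb.append(c);
            i++;
        }
        return sb.toString();
    }

    static boolean agree(String raw) {
        return uriDecode(raw).equals(naiveDecode(raw));
    }

    public static void main(String[] args) {
        System.out.println(agree("/data/a%23b.txt"));      // true
        System.out.println(agree("/data/a%20b.txt"));      // true
        System.out.println(agree("/data/%E4%B8%AD.txt"));  // false
    }
}
```

The two decoders agree on single-byte escapes and diverge only on the multi-byte sequence, so a test limited to # and %20 would not distinguish them.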

[AURON apache#2257] Avoid URI reparsing in JNI Hadoop paths

a507d0e

github-actions Bot added build core labels May 12, 2026

zhtttylz marked this pull request as draft May 12, 2026 12:04

Fix JNI Hadoop path decoding

0c4fc47

zhtttylz marked this pull request as ready for review May 14, 2026 08:38

cxzl25 requested a review from Copilot May 14, 2026 08:58

Copilot started reviewing on behalf of cxzl25 May 14, 2026 08:58

Copilot AI reviewed May 14, 2026


ShreyeshArangath reviewed May 15, 2026


yew1eb reviewed May 16, 2026


github-actions Bot added the spark label May 26, 2026

zhtttylz force-pushed the fix-hadoop-fs-path-hash branch 2 times, most recently from f18f358 to 0fbcabd Compare May 28, 2026 11:45

Fix JNI Hadoop path handling

a759cdb

zhtttylz force-pushed the fix-hadoop-fs-path-hash branch from 0fbcabd to a759cdb Compare May 29, 2026 07:12

weiqingy reviewed Jun 14, 2026


SteNicholas assigned cxzl25 and ShreyeshArangath Jun 15, 2026

zhtttylz wants to merge 3 commits into apache:master from zhtttylz:fix-hadoop-fs-path-hash

6 participants
