[AURON #2257] Avoid URI reparsing in JNI Hadoop paths#2264
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug in Auron’s JNI Hadoop file wrappers where paths containing a literal # could be truncated due to java.net.URI fragment parsing, and adds Java regression coverage to prevent recurrence.
Changes:
- Adjusted JNI bridge path handling to avoid fragment truncation when
#appears in the path string. - Added
JniBridgeTestregression tests covering literal#handling and percent-encoding behavior. - Added a test-scoped Hadoop runtime dependency to support the new unit test.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
auron-core/src/main/java/org/apache/auron/jni/JniBridge.java |
Changes how input/output paths are converted to Hadoop Path objects to avoid # fragment truncation. |
auron-core/src/test/java/org/apache/auron/jni/JniBridgeTest.java |
Adds regression tests asserting # is preserved and that read/write path encoding behavior is stable. |
auron-core/pom.xml |
Adds hadoop-client-runtime as a test dependency to compile/run the new test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public static FSDataInputWrapper openFileAsDataInputWrapper(FileSystem fs, String path) throws Exception { | ||
| // the path is a URI string, so we need to convert it to a URI object | ||
| return FSDataInputWrapper.wrap(fs.open(new Path(new URI(path)))); | ||
| return FSDataInputWrapper.wrap(fs.open(toInputPath(path))); | ||
| } |
| public static FSDataOutputWrapper createFileAsDataOutputWrapper(FileSystem fs, String path) throws Exception { | ||
| return FSDataOutputWrapper.wrap(fs.create(new Path(new URI(path)))); | ||
| return FSDataOutputWrapper.wrap(fs.create(new Path(path))); | ||
| } | ||
|
|
||
| private static Path toInputPath(String path) throws URISyntaxException { | ||
| String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path; | ||
| return new Path(new URI(safePath)); | ||
| } |
| } | ||
|
|
||
| private static Path toInputPath(String path) throws URISyntaxException { | ||
| String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path; |
There was a problem hiding this comment.
This seems a little brittle; why do we need this? can we do something like so?
public static FSDataInputWrapper openFileAsDataInputWrapper(FileSystem fs, String path) throws Exception {
return FSDataInputWrapper.wrap(fs.open(new Path(path)));
}
There was a problem hiding this comment.
Thanks, I went with this approach.
yew1eb
left a comment
There was a problem hiding this comment.
The fix for the read path is inconsistent with the write path, and the toInputPath workaround has an edge-case bug.
|
|
||
| private static Path toInputPath(String path) throws URISyntaxException { | ||
| String safePath = path.indexOf('#') >= 0 ? path.replace("#", "%23") : path; | ||
| return new Path(new URI(safePath)); |
There was a problem hiding this comment.
Two issues here:
-
Inconsistent fix:
createFileAsDataOutputWrapperwas changed tonew Path(path)(the correct simple fix), butopenFileAsDataInputWrapperstill goes throughnew URI(...)after escaping#. Ifnew Path(path)is correct for writes, the same change should work for reads — the PR description itself says the fix is "change fromnew Path(new URI(path))tonew Path(path)". Why does the read path need a different approach? -
Double-encoding bug:
path.replace("#", "%23")will corrupt a path that already contains a literal%23(i.e. a percent-encoded#) by turning it into%2523. If the simplernew Path(path)works for writes, applying it uniformly to reads as well would fix both issues at once.
There was a problem hiding this comment.
Thanks, I moved the normalization earlier and removed the read-side workaround.
|
I tried to address this by normalizing the scan path before the native/JNI boundary, so JniBridge can use new Path(path) for both reads and writes. Happy to adjust if this still looks off, or if there’s a better way to handle it :) |
f18f358 to
0fbcabd
Compare
0fbcabd to
a759cdb
Compare
There was a problem hiding this comment.
Thanks for iterating on this — moving the decode up to the Spark scan side so JniBridge collapses to a symmetric new Path(path) is a clean answer to the earlier read/write asymmetry, and testFileWrappersPreserveLiteralHashInHdfsPath is a solid red/green guard (it asserts both the path is preserved and the URI fragment is null, so it fails on the pre-fix impl). One question inline.
| PartitionedFile(partitionValues, filePath, offset, size) | ||
|
|
||
| @sparkver("3.0 / 3.1 / 3.2 / 3.3") | ||
| override def getPartitionedFilePathString(file: PartitionedFile): String = { |
There was a problem hiding this comment.
The new 3.0–3.3 getPartitionedFilePathString (split + unescapePathName + rejoin) is the substantive new logic, and grepping the repo it has no test coverage anywhere — JniBridgeTest only exercises the Java new Path(path) half, which is the already-symmetric part. A shim-level test (Spark-3.x profile) asserting this method's output against the old new Path(new URI(filePath)) result for a # path, a %20 path, and — the case that matters most — a non-ASCII multi-byte path would lock the decode fidelity down. Worth adding? That third case is what would surface the encoding question on line 982 directly rather than leaving it to inspection.
Which issue does this PR close?
Closes #2257
Rationale for this change
Auron's JNI Hadoop file wrappers currently reconstruct Hadoop paths with
new Path(new URI(path)).This does not preserve Hadoop
Path(String)semantics before the path is passed back toFileSystem.When a raw Hadoop path string contains a literal
#, Java URI parsing treats the suffix after#as a fragment, so the actual Hadoop path is truncated.For example, the intended path:
is opened as:
What changes are included in this PR?
This PR stops reparsing Hadoop path strings through java.net.URI in JniBridge.
The path reconstruction is changed from:
This preserves
Hadoop Path(String)semantics.Add a regression test for JNI Hadoop file wrapper path handling when the path contains a literal #.
Are there any user-facing changes?
This fixes a bug where Hadoop paths containing a literal
#could be truncated.No new APIs, configs, or migration steps are required.
How was this patch tested?
Ran the focused Java regression test:
Result: