feat: skip completed files in resumable dataset uploads#5929
Conversation
Automated Reviewer SuggestionsBased on the
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5929 +/- ##
============================================
- Coverage 56.93% 56.73% -0.21%
+ Complexity 3026 3005 -21
============================================
Files 1129 1129
Lines 43794 43854 +60
Branches 4743 4771 +28
============================================
- Hits 24936 24880 -56
- Misses 17384 17488 +104
- Partials 1474 1486 +12
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
94ce43f to
ce97bb4
Compare
ce97bb4 to
a586457
Compare
|
Hi @carloea2, thanks for the PR. Could you open a corresponding issue for this PR? The existing discussion may not include the implementation design details for this PR. Also, I was wondering whether the notification window might overlap with the resumable/restart flow for incomplete uploads. Thanks! |
|
What do you mean with overlap? Issue: #5938 |
There was a problem hiding this comment.
Pull request overview
This PR improves resumable dataset uploads by distinguishing “resumable multipart session exists” from “file already exists in the dataset (same path + size)”, enabling users to skip already-completed files during a retry batch.
Changes:
- Backend: add
POST /dataset/{did}/existing-upload-filesto detect already-present committed/staged files by normalized path + size. - Frontend: extend the upload retry flow to first handle active multipart sessions, then prompt Upload/Skip for completed matching files.
- Tests: add/extend Scala + Angular unit tests to cover the new endpoint and the mixed retry-batch UX.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| frontend/src/app/dashboard/service/user/dataset/dataset.service.ts | Adds client method to query existing upload files by path + size. |
| frontend/src/app/dashboard/service/user/dataset/dataset.service.spec.ts | Adds coverage for existing-upload-files call and multipart resume behavior. |
| frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/dataset-detail.component.html | Wires dataset id (did) into the uploader component. |
| frontend/src/app/dashboard/component/user/files-uploader/files-uploader.component.ts | Adds Upload/Skip prompt for already-existing files and integrates both checks in retry flow. |
| frontend/src/app/dashboard/component/user/files-uploader/files-uploader.component.spec.ts | Adds new tests for mixed retry batches and Upload/Skip For All behavior. |
| frontend/src/app/dashboard/component/user/files-uploader/conflicting-file-modal-content/conflicting-file-modal-content.component.ts | Extends modal data with optional hint. |
| frontend/src/app/dashboard/component/user/files-uploader/conflicting-file-modal-content/conflicting-file-modal-content.component.html | Renders a custom hint when provided. |
| file-service/src/test/scala/org/apache/texera/service/resource/DatasetResourceSpec.scala | Adds backend tests for matching by path+size and validation/auth checks. |
| file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala | Implements the new existing-upload-files endpoint. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| val requested = Option(request) | ||
| .map(_.files) | ||
| .getOrElse(List.empty) | ||
| .map { file => |
| val committed = getLatestDatasetVersion(ctx, did) | ||
| .map(v => | ||
| LakeFSStorageClient | ||
| .retrieveObjectsOfVersion(dataset.getRepositoryName, v.getVersionHash) | ||
| .map(obj => obj.getPath -> obj.getSizeBytes.longValue()) | ||
| ) | ||
| .getOrElse(List.empty) |
| val staged = LakeFSStorageClient | ||
| .retrieveUncommittedObjects(dataset.getRepositoryName) | ||
| .filterNot(diff => Option(diff.getType).exists(_.getValue.equalsIgnoreCase("removed"))) | ||
| .flatMap(diff => Option(diff.getSizeBytes).map(size => diff.getPath -> size.longValue())) |
xuang7
left a comment
There was a problem hiding this comment.
Overall LGTM!
One small question: if a file appears in both activePaths and existingPaths, would the user first see the Resume/Restart dialog and then the Upload/Skip dialog for the same file? Having two dialogs back to back could be confusing. Would it be better to exclude active uploads from the existingPaths conflict flow?
|
|
||
| val existing = (committed ++ staged).toMap | ||
| val matches = requested | ||
| .collect { case (path, size) if existing.get(path).contains(size) => path } |
There was a problem hiding this comment.
The backend matches on the normalized path and returns that normalized key, but the frontend compares against the original item.name. Suggest either echoing back the original request path from the backend (use the original path as the returned value), or normalizing symmetrically on the frontend.
| } | ||
|
|
||
| private async resolveExistingFiles(items: FileUploadItem[], existingPaths: string[]): Promise<FileUploadItem[]> { | ||
| const existing = new Set(existingPaths ?? []); |
There was a problem hiding this comment.
frontend compares against the original item.name.
|
Thanks @xuang7. I added tests for both cases. For the overlap case, I kept the current behavior intentionally: if the same file appears in both For the path mismatch, I updated the backend to normalize for lookup but return the original requested path in |
What changes were proposed in this PR?
This PR improves dataset upload retry behavior by distinguishing between two cases that currently look similar to users:
The backend adds a dataset endpoint that checks candidate upload files against committed and staged dataset files by path and size. The frontend uses that endpoint after checking active multipart sessions, so failed uploads can still be resumed while completed matching files can be skipped from the retry batch.
The user-facing copy intentionally says a file with the same path and size exists, instead of implying byte-for-byte certainty.
Any related issues, documentation, discussions?
Related to discussion #5744: Improve resumable upload: track completion at the batch/session level.
How was this PR tested?
Added backend and frontend tests covering:
DatasetResourceSpecDatasetServicespecFilesUploaderComponentspecCommands run: