Core: Add ManifestFile adapter for v4 tracked files#16867
Conversation
b50410f to
718ae34
Compare
Add TrackedFileAdapters.asManifestFile to wrap a DATA_MANIFEST or DELETE_MANIFEST TrackedFile as a ManifestFile. Closes apache#16227
718ae34 to
93368cb
Compare
wombatu-kun
left a comment
There was a problem hiding this comment.
Code looks correct: field delegation, equals/hashCode (consistent with GenericManifestFile), constructor preconditions, and the test coverage are all solid.
Non-blocking discussion point: partitionSpecId() throws UnsupportedOperationException. That is defensible since a v4 manifest is not bound to a single spec. But @rdblue pushed back on #16100 against these adapters exposing an invalid or arbitrary spec ID, so it may be worth confirming with reviewers that a throw is the intended contract for any future caller that reads partitionSpecId().
| NO_PARTITION, | ||
| 8L, | ||
| 2048L); | ||
| file.set(MANIFEST_INFO_ORDINAL, manifestInfo); |
There was a problem hiding this comment.
we can use the builder from @gaborkaszab 's PR: #16769
There was a problem hiding this comment.
Sounds good. I can make the change if #16769 gets merged first or do this as a followup PR if this gets merged first.
There was a problem hiding this comment.
FYI, in the meantime the builder got merged
gaborkaszab
left a comment
There was a problem hiding this comment.
Thank you for the PR, @anoopj !
I took a Quick Look and left some minor comment.
|
|
||
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); |
There was a problem hiding this comment.
nit: other adapters don't check if tracking is null. Not sure we have to here. It's a required field anyway
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
As per the spec proposal data seq num is 'Optional for leaf manifests, required for root.'. I think it's valid to have it as null here. Seems also optional in ManifestFile
There was a problem hiding this comment.
This PR is for root-level manifest tracking where null is truly an invalid state. We can either validate or just have it fail with an NPE.
Seems also optional in ManifestFile
It is required for a manifest file: ManifestFile.sequenceNumber() returns a primitive
There was a problem hiding this comment.
just to add on top of what Anoop said. this is for leaf manifest file entry (TrackedFile) in the root manifest file. Hence dataSequenceNumber should be required
There was a problem hiding this comment.
Thanks for the explanation @anoopj and @stevenzwu !
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object other) { |
There was a problem hiding this comment.
The other adapters don't override equals and hashCode. Is there a particular reason it's needed here?
There was a problem hiding this comment.
I'm trying to be consistent with the semantics of GenericManifestFile.
There was a problem hiding this comment.
Do we actually use equals and hashCode? We try to avoid implementing those methods where there are different interpretations of what "equality" might mean.
There was a problem hiding this comment.
I didn't add any code that uses it. I was trying to be consistent with the behavior of the GenericManifestFile which defines equality with the location. I have removed it.
stevenzwu
left a comment
There was a problem hiding this comment.
LGTM.
I will merge Gabor's TrackedFileBuilder once CI passed. then let's rebase this one and use the builder as applicable.
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
just to add on top of what Anoop said. this is for leaf manifest file entry (TrackedFile) in the root manifest file. Hence dataSequenceNumber should be required
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking().dataSequenceNumber() != null, | ||
| "Cannot create manifest file: no data sequence number"); |
There was a problem hiding this comment.
Nit: I think we prefer the slightly more specific "Invalid data sequence number: null" because "no X" doesn't necessarily mean "x is null".
Also, even though the attempt is to create a manifest file, that's not the actual problem. It's debatable whether to include what is being attempted, "Cannot create manifest file", but I would opt to leave it out because that's clear from other context (that the check is in asManifestFile and the TrackedManifestFile constructor).
| @Override | ||
| public int partitionSpecId() { | ||
| throw new UnsupportedOperationException( | ||
| "Tracked manifests are not bound to a single partition spec"); |
There was a problem hiding this comment.
I don't think that it is correct to say "Tracked manifests": what is a "tracked" manifest and why is one from a manifest list not "tracked"?
I think it is better to say "v4 manifests are not bound to a single partition spec".
There was a problem hiding this comment.
Fixed. I was trying to avoid having too many v4 references.
| public ManifestContent content() { | ||
| return file.contentType() == FileContent.DATA_MANIFEST | ||
| ? ManifestContent.DATA | ||
| : ManifestContent.DELETES; |
There was a problem hiding this comment.
Minor: I think it is safer to avoid a dependency between this and the constructor's verification. I would recommend using a switch statement:
switch (file.contentType()) {
case DATA_MANIFEST:
return ManifestContent.DATA;
case DELETE_MANIFEST:
return ManifestContent.DELETES;
default:
throw new UnsupportedOperationException("Unsupported content type for manifests: " + file.contentType());
}| public boolean equals(Object other) { | ||
| if (this == other) { | ||
| return true; | ||
| } else if (!(other instanceof TrackedManifestFile)) { |
There was a problem hiding this comment.
I don't think that the implementation of GenericManifestFile is very helpful. It considers two files equal if they have the same path (which I think is reasonable) but it also excludes any implementation other than GenericManifestFile, which means that they aren't actually interchangeable as the interface suggests.
In addition, this creates a situation where a.equals(b) may be true, but b.equals(a) can be false for reasonable implementations.
I'd prefer to remove this and fix places where it is assumed or used. Good catch, @gaborkaszab!
| "Cannot create manifest file: no data sequence number"); | ||
| Preconditions.checkArgument( | ||
| file.tracking().snapshotId() != null, "Cannot create manifest file: no snapshot ID"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
This also seems an internal invariant for TrackedFile. The new TrackedFileBuilder guarantees that manifestInfo is set for manifest entries.
There was a problem hiding this comment.
That makes sense. Removed.
| NO_PARTITION, | ||
| 8L, | ||
| 2048L); | ||
| file.set(MANIFEST_INFO_ORDINAL, manifestInfo); |
There was a problem hiding this comment.
FYI, in the meantime the builder got merged
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Thanks for the explanation @anoopj and @stevenzwu !
| void testManifestFileAdapterDelegation() { | ||
| Tracking tracking = createTracking(); | ||
| ManifestInfo manifestInfo = createManifestInfo(); | ||
| TrackedFileStruct file = |
There was a problem hiding this comment.
nit: Can we hide the construction of the input TrackedFile into a separate function? Like dataManifestFile() it could get ManifestInfo, FileContent, and/or Tracking parameters if that matters for the test.
There was a problem hiding this comment.
Done. extracted manifestBuilder(FileContent) / manifestFile(FileContent, ManifestInfo) and now tests build through them.
| } | ||
|
|
||
| @Test | ||
| void testManifestFileAdapterDelegation() { |
There was a problem hiding this comment.
nit: maybe parameterize the test for data + delete manifest. The testDeleteManifestContentMapsToDeletes test could be dropped then
# Conflicts: # core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java
gaborkaszab
left a comment
There was a problem hiding this comment.
just some minor comments. Other than them, LGTM!
| .manifestInfo(manifestInfo) | ||
| .keyMetadata(ByteBuffer.wrap(new byte[] {7, 8, 9})) | ||
| .build(); | ||
| populateTrackingFields(file); |
There was a problem hiding this comment.
Just FYI: in the follow-up for the TrackedFileBuilder PR, we decided to drop populateTrackingFields and went to call the constructor of TrackedFile instead using the createTracking (revived in the follow-up PR) method.
There was a problem hiding this comment.
Ack. I can rebase when the followup PR is merged.
| manifestBuilder(FileContent.DATA_MANIFEST).manifestInfo(createManifestInfo()).build(); | ||
| // Inheritance fills the data sequence number, which the adapter requires; first row ID and | ||
| // key metadata stay unset. | ||
| ((TrackingStruct) file.tracking()).set(DATA_SEQUENCE_NUMBER_ORDINAL, DATA_SEQUENCE_NUMBER); |
There was a problem hiding this comment.
instead of using the ordinal, we can use the inheritFrom(some_manifest) method to do that.
There was a problem hiding this comment.
Routing through inheritFrom() will require building a parent Tzracking with several nulls. I prefer to keep it since this is simpler.
| /** Adapts {@link TrackedFile} entries to the {@link DataFile} and {@link DeleteFile} APIs. */ | ||
| /** | ||
| * Adapts {@link TrackedFile} entries to the {@link DataFile}, {@link DeleteFile}, and {@link | ||
| * ManifestFile} APIs. |
There was a problem hiding this comment.
If comments require regular maintenance when touching the file, I prefer to rewrite them to minimize churn. This doesn't need to list everything that can be produced. Maybe edit this so that it say for example, {@link DataFile}?
| return new TrackedManifestFile(file); | ||
| } | ||
|
|
||
| /** Shared base for all tracked file adapters. */ |
There was a problem hiding this comment.
This is no longer true. How about "Shared base for data and delete file adapters"?
| tracking.set(MANIFEST_POS_ORDINAL, MANIFEST_POS); | ||
| } | ||
|
|
||
| private static ManifestInfo createManifestInfo() { |
There was a problem hiding this comment.
I think this should be a constant, not a method. We don't need to create a manifest info struct every time, and a method gives no guarantees about the values produced. It could create random values every time.
If we use a constant, then you can write assertions like assertThat(manifest.existingFilesCount()).isEqualTo(manifestInfo.existingFilesCount()).
|
|
||
| // Builder for a manifest entry with the required non-tracking fields set. Callers add the | ||
| // manifest info and any optional fields before building. | ||
| private static TrackedFileBuilder manifestBuilder(FileContent contentType) { |
There was a problem hiding this comment.
I don't think that the complexity here to use the builder is a good thing.
This creates indirection and makes tests harder to read, while gaining almost nothing. There is hardly any readability benefit to using the builder when you're already setting constants like FORMAT_VERSION_V4. All this does is move the initialization here, so people have to find this method to validate that the assertions above are correct, like assertThat(manifest.path()).isEqualTo(MANIFEST_FILE_LOCATION). I would rather see MANIFEST_FILE_LOCATION passed in and then validated.
I think this should be removed and the constructor should be used instead. All these tests need to verify is that these values are passed through correctly.
| .manifestInfo(manifestInfo) | ||
| .keyMetadata(ByteBuffer.wrap(new byte[] {7, 8, 9})) | ||
| .build(); | ||
| populateTrackingFields(file); |
There was a problem hiding this comment.
Please remove this and either use a Tracking constant or create one inline. This indirection makes the tests unreadable.
| file.tracking().snapshotId() != null, "Invalid snapshot ID: null"); | ||
| Preconditions.checkArgument( | ||
| file.manifestInfo().dv() == null, | ||
| "Cannot adapt manifest with a deletion vector to ManifestFile: %s", |
There was a problem hiding this comment.
This is correct and calls into question whether this adapter is going to be useful. We expect manifest DVs to be common and if we can't adapt a manifest with a DV back to ManifestFile, then what is the purpose of this adapter?
ManifestFile is mostly below the API level. It is exposed in some places (accessed primarily from Snapshot), but reading and writing for correctness (scan planning and commit operations) are all done within Table APIs and even snapshot change access is moving to SnapshotChanges. You can read manifests directly using utils in core, but I don't think that is very common.
I think we may want to avoid adding this for now, even though it seems like a reasonable addition. In order for this to be valuable, we would need to continue using ManifestFile and expose a DV through that interface. Since we don't know if that's going to be helpful, I think we should keep this as a draft for now.
There was a problem hiding this comment.
OK. I will convert this into a draft and we can revisit this if it's needed as we build the reader and writer. cc @stevenzwu
There was a problem hiding this comment.
In the write path, I also proposed new additions to the ManifestFile interface. I added 3 new methods in my next PR #16936 . We can add more (including manifest DV) as we go.
There was a problem hiding this comment.
If we add the DV, let me know and I can revive this PR.
Co-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: Ryan Blue <blue@apache.org>
Add TrackedFileAdapters.asManifestFile to wrap a DATA_MANIFEST or DELETE_MANIFEST TrackedFile as a ManifestFile.
The ManifestFile adapter will only be used by tests. Note that ManifestFile doesn't support manifest DVs. So in the actual manifest read path, the reader resolves the MDV first and produces the resulting TrackedFiles, which are then adapted to data/delete files.
Closes #16227