Core: Add v4 manifest reader by anoopj · Pull Request #16958 · apache/iceberg

anoopj · 2026-06-24T22:07:25Z

The reader supports:

column projection
partition pruning via a row filter, including multi-spec manifests

Content stats reading and metadata inheritance are not yet implemented.

supports: - column projection - partition pruning via a row filter, including multi-spec manifests Content stats reading and metadata inheritance are not yet implemented.

anoopj · 2026-06-24T22:12:23Z

+        .specId(0);
+  }
+
+  private static TrackedFile fileWithStatus(EntryStatus status, String location) {


Not using the builder here so that we can set any status here.

gaborkaszab

I'm in the process of getting familiar with the details here. Asked early questions for my benefit.
Thanks @anoopj !

gaborkaszab · 2026-06-26T13:56:46Z

+   * Builds the tracking field with {@link MetadataColumns#ROW_POSITION} appended so the reader
+   * populates the manifest position of each entry.
+   */
+  private static Types.NestedField trackingWithRowPosition() {


These Tracking schemas are kind of confusing. Let me know if I get it wrong!
So Tracking.schema() is the physical representation of the data, what we expect to be written, or in other words, this is the write schema.
TrackingStruct.BASE_TYPE is the logical representation including field(s) that are physically not persisted, or in other words this is the read schema.
Naively asking, since we are on the read path, can't we simply use the read schema from TrackingStruct instead of constructing it here from the physical schema?

Your understanding correct! Tracking.schema() is the on-disk schema. TrackingStruct.BASE_TYPE is the same schema plus row_position. We can reuse it. I just needed to change the visibility of TrackingStruct.BASE_TYPE from private to package private.

I think the reason this is still needed is that we use the write schema (TrackedFile.schemaWithContentStats()) on the read path and as a result we have to iterate the fields and replace Tracking to hold 'logical' field(s) too. I'm not sure we have other occasions where we do the same for other types, but seems more complicated than necessary. Can't we simply have a method that produces the read schema for us including logical fields in Tracking`.

I believe we could then get rid of the whole for loop on fields to replace/omit some of them. It would be as simple as:

private Schema readSchema() { if (fileProjection != null) { // construct and return a schema using the projection + the mandatory fields } else { return someMethodToGetReadSchema(); } }

I'd lean toward keeping the loop. It derives the read schema from the canonical schemaWithContentStats(...), so field IDs and ordering stay in one place. A separately hand-built read schema would have to be kept in lockstep ), which is duplication we should try to avoid.

gaborkaszab · 2026-06-26T14:04:34Z

+    // manifestLocation is not stored in the manifest; the reader fills it from the file location.
+    // manifestPos is filled from ROW_POSITION while reading the tracking struct.
+    if (tracking instanceof TrackingStruct) {
+      ((TrackingStruct) tracking).setManifestLocation(file.location());


We populate Tracking.manifestPos using a metadata column ROW_POSITION, while we populate Tracking.manifestLocation manually here. For the latter, is there a reason we can't use MetadataColumns.FILE_PATH to be consistent?
We could get rid of the custom setter TrackingStruct.setManifestlocation method too and let it flow through internalSet getByPos methods.

They look symmetric but the underlying mechanism differs. ROW_POSITION is synthesized by the reader itself. FILE_PATH us only populated when the caller injects it. I'd lean toward keeping the manual set for now

I guess the difficulty here is that for InternalData API we don't have a way to pass the constants map to the underlying reader, right?
Would it worth considering as an improvement? For me both manifestLocation and manifestPos seems regular metadata columns, we just don't have the plumbing currently to make the former act like a metadata column.

I wonder what others say. Probably is an overkill at this point.

gaborkaszab · 2026-06-26T14:05:36Z

+
+  private TrackedFile prepare(TrackedFile trackedFile) {
+    Tracking tracking = trackedFile.tracking();
+    Preconditions.checkState(


Instead of checking here, should we rely on the builders enforcing this invariant?

On the read path, we don't use builders. Tracking here comes from deserializing the manifest and not from a builder.

But i think the check here is not needed for a different reason. We have made Tracking a required field, so the internal reader should throw before prepare() runs. So this check is likely redundant.

stevenzwu · 2026-06-28T06:31:34Z

+    return entries;
+  }
+
+  private boolean matchesPartition(TrackedFile trackedFile) {


I thought we want to drop tuple-based pruning entirely and push predicates against partition-transform-expression columns in content_stats. Under that model the partition tuple is only needed to match data files to equality-delete files.

I am planning to bring up the discussion in Monday's sync.

As a followup of today's discussion, we will keep the partition tuple for planning/pruning for now and we can move to content stats based pruning later. This implementation in this PR is purely a pruning optimization, so swapping/removing it has no correctness implications.

stevenzwu · 2026-06-28T06:41:29Z

+import org.apache.iceberg.util.StructProjection;
+
+/** Reader that reads a v4 manifest file as {@link TrackedFile}s. */
+class V4ManifestReader extends CloseableGroup implements CloseableIterable<TrackedFile> {


this name V4ManifestReader would be come stale when V5 rolls in. maybe TrackedFileManifestReader or just TrackedFilereader.

TrackedFile is an abstraction that we want to keep internal for now. V4ManifestReader is the best name I could come up with. Open to other suggestions here. cc @rdblue for his thoughts.

This class implements and returns a CloseableIterable<TrackedFile> so we don't really keep TrackedFile internal in my opinion. Following this design we can name this TrackedFileReader.

That is actually a good point. Leaving it open to hear from others.

TrackedFile is internal in the sense that it is not a public interface (at least not yet). This class TrackedFileReader would probably only be package private.

gaborkaszab · 2026-06-30T08:55:58Z

+import org.apache.iceberg.util.StructProjection;
+
+/** Reader that reads a v4 manifest file as {@link TrackedFile}s. */
+class V4ManifestReader extends CloseableGroup implements CloseableIterable<TrackedFile> {


This class implements and returns a CloseableIterable<TrackedFile> so we don't really keep TrackedFile internal in my opinion. Following this design we can name this TrackedFileReader.

gaborkaszab · 2026-06-30T09:09:28Z

+    }
+
+    Builder project(Schema newFileProjection) {
+      this.fileProjection = newFileProjection;


nit: precondition for != null?

This is intentional. Null projection means read all columns, and it's a supported case. This is the current semantics of the existing manifest reader as well.

gaborkaszab · 2026-06-30T13:00:59Z

+    InputFile manifest = writeManifest(EMPTY_PARTITION, ImmutableList.of(file));
+
+    TrackedFile actual = read(manifest, UNPARTITIONED_SPECS).get(0);
+    assertThat(actual.partition()).isNotNull();


For unpartitioned entry ,the partition should be null, and also spec_id I believe, once the relevant PR is merged.

Agreed on partition: once #17000 makes the partition field optional, I can update the test to expect null. For now, leaving the current assertions.

On spec_id: in this test the file is written with specId(0) and UNPARTITIONED_SPECS includes the unpartitioned spec (id 0), so it reads back as 0, not null. spec_id is null only for a file that has no spec at all.

Core: Add v4 manifest reader

08c7f59

supports: - column projection - partition pruning via a row filter, including multi-spec manifests Content stats reading and metadata inheritance are not yet implemented.

anoopj moved this to In review in V4: metadata tree Jun 24, 2026

anoopj added this to V4: metadata tree Jun 24, 2026

github-actions Bot added the core label Jun 24, 2026

anoopj commented Jun 24, 2026

View reviewed changes

gaborkaszab reviewed Jun 26, 2026

View reviewed changes

anoopj added 3 commits June 26, 2026 17:38

PR feedback

ef3f3a1

Merge remote-tracking branch 'apache/main' into v4-manifest-reader

f4f1f0c

Rename format version

ed345b8

stevenzwu reviewed Jun 28, 2026

View reviewed changes

Comment update

719bc9a

gaborkaszab reviewed Jun 30, 2026

View reviewed changes

PR feedback

48aa494

Uh oh!

Conversation

anoopj commented Jun 24, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gaborkaszab left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stevenzwu Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

stevenzwu Jun 28, 2026 •

edited

Loading