[AURON #2321] Support Iceberg column rename and drop-then-add in the native scan by lyne7-sc · Pull Request #2322 · apache/auron

lyne7-sc · 2026-06-10T07:39:05Z

Which issue does this PR close?

Rationale for this change

The native Iceberg scan matches data-file columns by name, but Iceberg tracks them by field-id. After a column rename, old files read as all-NULL; after a drop-then-add of the same name, the new column reads the old column's data.

What changes are included in this PR?

Resolve columns by Iceberg field-id instead of by name:

proto: add field_id to Field.
JVM (AuronIcebergSourceUtil, IcebergScanSupport, NativeConverters): extract top-level name → field-id from the scan's expectedSchema() and serialize it into the plan.
native (auron-planner, scan/mod.rs): stamp the id into Arrow field metadata (PARQUET:field_id); fields_match matches by id when present, else falls back to case-insensitive name matching (non-Iceberg scans unchanged).

Nested-struct evolution and ORC rename/drop fall back to Spark; additive evolution stays native.
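To make the difference between name-based and id-based resolution concrete, here is a minimal sketch in plain Rust. It is not the PR's code: `Column`, `resolve_by_name`, and `resolve_by_id` are hypothetical stand-ins that use only the standard library, and the field id is modelled as an `Option<i32>` rather than as Arrow metadata.

```rust
/// Hypothetical, simplified column descriptor (not Arrow's `Field`).
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub name: String,
    /// Iceberg field id, if one is known for this column.
    pub field_id: Option<i32>,
}

impl Column {
    pub fn new(name: &str, field_id: Option<i32>) -> Self {
        Column { name: name.to_string(), field_id }
    }
}

/// Name-based resolution: case-insensitive match on the column name only.
pub fn resolve_by_name(table_col: &Column, file_cols: &[Column]) -> Option<usize> {
    file_cols
        .iter()
        .position(|f| f.name.eq_ignore_ascii_case(&table_col.name))
}

/// Id-based resolution: use the field id when the table column has one,
/// otherwise fall back to the case-insensitive name match.
pub fn resolve_by_id(table_col: &Column, file_cols: &[Column]) -> Option<usize> {
    match table_col.field_id {
        Some(id) => file_cols.iter().position(|f| f.field_id == Some(id)),
        None => resolve_by_name(table_col, file_cols),
    }
}
```

Suppose an old data file was written with columns id (field id 1) and price (field id 2). After a rename, a table column amount with field id 2 resolves to file index 1 by id but to nothing by name. After a drop-then-add, a new price column with field id 3 resolves to nothing by id, so old files yield NULL for it, whereas name matching would incorrectly pick up index 1.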

Are there any user-facing changes?

Yes. Iceberg queries on renamed or drop-then-added columns now return correct results under the native scan. Unsupported cases fall back to Spark. No API change.

How was this patch tested?

Added cases to AuronIcebergIntegrationSuite.


Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

SteNicholas

@lyne7-sc, thanks for the contribution. I left some comments on this pull request. PTAL.

SteNicholas · 2026-06-15T12:34:19Z

+            .metadata()
+            .get(PARQUET_FIELD_ID_META_KEY)
+            .is_some_and(|file_field_id| file_field_id == table_field_id),
+        None => table_field.name().eq_ignore_ascii_case(file_field.name()),


When table_field has a PARQUET:field_id but file_field does not, is_some_and returns false and there is no name-based fallback — the column simply doesn't match.

For spec-compliant Iceberg Parquet files this is fine (the Iceberg spec mandates field IDs, and arrow-rs populates them into Arrow metadata). But if an older Parquet writer omitted the field_id in the Thrift SchemaElement, or if a non-Iceberg Parquet file happens to be served through this path, every column would fail to match and the scan would produce all-NULL rows.

Consider falling back to name matching when file_field lacks a field ID:

fn fields_match(table_field: &Field, file_field: &Field) -> bool {
    match table_field.metadata().get(PARQUET_FIELD_ID_META_KEY) {
        Some(table_field_id) => match file_field.metadata().get(PARQUET_FIELD_ID_META_KEY) {
            Some(file_field_id) => file_field_id == table_field_id,
            None => table_field.name().eq_ignore_ascii_case(file_field.name()),
        },
        None => table_field.name().eq_ignore_ascii_case(file_field.name()),
    }
}

This preserves field-id matching when both sides have IDs, but degrades gracefully to name matching otherwise.

lyne7-sc: Updated fields_match to use a nested match. When file_field lacks a field ID, it now falls back to case-insensitive name matching.
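The behavioural difference discussed in this thread can be checked with a small self-contained sketch. It assumes a hypothetical `Field` stand-in (a name plus a string metadata map) instead of the real Arrow type, and the two function names are illustrative only; `fields_match_strict` mirrors the originally reviewed diff, and `fields_match_with_fallback` mirrors the reviewer's suggestion.

```rust
use std::collections::HashMap;

/// Metadata key named in the PR description.
pub const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Hypothetical stand-in for Arrow's `Field`: a name plus string metadata.
pub struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

impl Field {
    pub fn new(name: &str, field_id: Option<i32>) -> Self {
        let mut metadata = HashMap::new();
        if let Some(id) = field_id {
            metadata.insert(PARQUET_FIELD_ID_META_KEY.to_string(), id.to_string());
        }
        Field { name: name.to_string(), metadata }
    }

    pub fn name(&self) -> &str {
        &self.name
    }

    pub fn field_id(&self) -> Option<&str> {
        self.metadata.get(PARQUET_FIELD_ID_META_KEY).map(String::as_str)
    }
}

/// Originally reviewed behaviour: if the table field has an id,
/// a file field without an id never matches.
pub fn fields_match_strict(table_field: &Field, file_field: &Field) -> bool {
    match table_field.field_id() {
        Some(table_id) => file_field.field_id().is_some_and(|file_id| file_id == table_id),
        None => table_field.name().eq_ignore_ascii_case(file_field.name()),
    }
}

/// Suggested behaviour: compare ids only when both sides have one,
/// otherwise fall back to case-insensitive name matching.
pub fn fields_match_with_fallback(table_field: &Field, file_field: &Field) -> bool {
    match (table_field.field_id(), file_field.field_id()) {
        (Some(table_id), Some(file_id)) => table_id == file_id,
        _ => table_field.name().eq_ignore_ascii_case(file_field.name()),
    }
}
```

The two functions differ only when the table field carries an id and the file field does not: the strict version rejects the pair, while the fallback version compares names. When both sides carry ids, both versions ignore the names entirely.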

SteNicholas · 2026-06-15T12:34:20Z

@@ -75,6 +76,27 @@ object IcebergScanSupport extends Logging {
      partitionSchema.fields.forall(field => NativeConverters.isTypeSupported(field.dataType)),
      "Has unsupported schema type.")



The inspected block bundles detectRenameOrDrop and expectedFieldIds in a single try/catch. If detectRenameOrDrop throws (e.g., a catalog timeout in table.schemas()), expectedFieldIds — which is independent and likely to succeed since expectedSchema() is a local field access — is also discarded. The scan falls back to Spark entirely.

Consider separating them:

val fieldIdsByName =
  try {
    AuronIcebergSourceUtil.expectedFieldIds(scan.asInstanceOf[AnyRef])
  } catch {
    case NonFatal(t) =>
      logWarning(...)
      return None
  }
val renameOrDrop =
  try {
    AuronIcebergSourceUtil.detectRenameOrDrop(scan.asInstanceOf[AnyRef])
  } catch {
    case NonFatal(t) =>
      logWarning(...)
      AuronIcebergSourceUtil.RenameOrDrop(topLevel = true, nested = true) // conservative
  }

This way a transient schema-history failure can still fall back on the ORC/nested guards while preserving field-id matching for Parquet.

lyne7-sc: Split this into two independent inspection steps.

If expectedFieldIds fails, the result is None, because the field-id mapping is required for safe native planning.

If detectRenameOrDrop fails, the result is also None, because rename/drop safety cannot be determined reliably.

This avoids reporting a misleading nested rename/drop fallback reason when the actual problem is a schema-history inspection failure.

SteNicholas · 2026-06-15T12:34:20Z

@@ -75,6 +76,27 @@ object IcebergScanSupport extends Logging {
      partitionSchema.fields.forall(field => NativeConverters.isTypeSupported(field.dataType)),
      "Has unsupported schema type.")



Minor: plan() is called twice per query (once in isSupported, once in convert — pattern from IcebergConvertProvider). This PR adds detectRenameOrDrop inside plan(), which iterates all historical schema versions via table.schemas().values(). For long-lived tables this loading + comparison now happens twice per scan node. Consider caching the plan result (e.g., via a TreeNodeTag on the BatchScanExec) to avoid this.

lyne7-sc: Added TreeNodeTag caching on the BatchScanExec. plan(exec) now reuses the cached result on the second call.

SteNicholas · 2026-06-15T12:34:20Z

+
+  def detectRenameOrDrop(scan: AnyRef): RenameOrDrop = {
+    val table = asBatchQueryScan(scan).table()
+    val currentFields = collectFieldIdToName(table.schema())


Two observations on detectRenameOrDrop:

table.schemas().values() includes the current schema. When compared against currentFields (built from table.schema()), every field matches itself — the entire iteration is a no-op. Consider filtering it out: table.schemas().asScala.filterNot(_._1 == table.schema().schemaId()).

collectFieldIdToName hand-rolls a recursive field-ID collector. Iceberg provides TypeUtil.indexById(schema.asStruct()) which returns Map<Integer, NestedField> covering all nested fields. The only extra piece is the topLevel flag, which can be derived trivially from schema.columns(). Using the Iceberg utility would reduce maintenance surface and benefit from upstream fixes for new type variants.

lyne7-sc: Addressed both points:

Skipped the current schema.

Replaced the local recursive collector with TypeUtil.indexById(schema.asStruct()).
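The detection idea in this thread can be sketched independently of Iceberg. The PR implements it in Scala on the JVM side; the version below uses Rust with hypothetical types (`SchemaVersion`, a field-id map holding a name and a top-level flag) only to illustrate the comparison. The exact rule (a historical field id whose name changed or which no longer exists counts as a rename or drop) is an assumption, not taken from the PR's source.

```rust
use std::collections::HashMap;

/// Mirrors the two flags of the RenameOrDrop value mentioned in the review.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct RenameOrDrop {
    pub top_level: bool,
    pub nested: bool,
}

/// Hypothetical schema snapshot: field id -> (name, is_top_level).
pub struct SchemaVersion {
    pub schema_id: i32,
    pub fields: HashMap<i32, (String, bool)>,
}

impl SchemaVersion {
    pub fn new(schema_id: i32, fields: &[(i32, &str, bool)]) -> Self {
        let fields = fields
            .iter()
            .map(|&(id, name, top_level)| (id, (name.to_string(), top_level)))
            .collect();
        SchemaVersion { schema_id, fields }
    }
}

/// Compare every historical schema except the current one against the
/// current schema. Fields that exist only in the current schema
/// (additive evolution) do not set either flag.
pub fn detect_rename_or_drop(current: &SchemaVersion, history: &[SchemaVersion]) -> RenameOrDrop {
    let mut result = RenameOrDrop::default();
    for old in history.iter().filter(|s| s.schema_id != current.schema_id) {
        for (id, (old_name, top_level)) in &old.fields {
            let changed = match current.fields.get(id) {
                Some((name, _)) => name != old_name, // same id, different name: rename
                None => true,                        // id no longer present: drop
            };
            if changed {
                if *top_level {
                    result.top_level = true;
                } else {
                    result.nested = true;
                }
            }
        }
    }
    result
}
```

Filtering out the current schema id avoids comparing the current schema with itself, which is the redundant work the first observation points out.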

SteNicholas · 2026-06-15T12:34:20Z

+      fileSchema.fields.forall(field => fieldIdsByName.contains(field.name)),
+      "Failed to find field ids for all Iceberg data columns.")
+
    val partitions = inputPartitions(exec)


Nit: the assertion message "Failed to find field ids for all Iceberg data columns." doesn't include which columns are missing, making debugging harder. Consider:

val missing = fileSchema.fields.filterNot(f => fieldIdsByName.contains(f.name)).map(_.name)
assert(missing.isEmpty, s"Missing Iceberg field ids for columns: ${missing.mkString(", ")}")

lyne7-sc: Updated the assertion to be more explicit.


lyne7-sc

@SteNicholas Thanks for the detailed review. I’ve addressed the comments and updated the PR. Please take another look when you get a chance.


lyne7-sc and others added 3 commits June 9, 2026 22:14

fix iceberg rename

6c72cff

simplify

70e6956

simplify

44261cb

github-actions Bot added spark native thirdparty-iceberg labels Jun 10, 2026

lint

b2a6740

cxzl25 requested a review from Copilot June 10, 2026 09:08

Copilot started reviewing on behalf of cxzl25 June 10, 2026 09:08

Copilot AI reviewed Jun 10, 2026

SteNicholas requested a review from Copilot June 15, 2026 12:18

Copilot started reviewing on behalf of SteNicholas June 15, 2026 12:18

Copilot AI reviewed Jun 15, 2026


Comment thread native-engine/datafusion-ext-plans/src/scan/mod.rs

SteNicholas reviewed Jun 15, 2026


SteNicholas assigned SteNicholas, merrily01 and richox Jun 15, 2026

lyne7-sc added 2 commits June 16, 2026 20:29

apply suggestions per reviews

41cf5ba

Merge remote-tracking branch 'origin/fix/iceberg_rename' into fix/ice…

854aa04

…berg_rename # Conflicts: # thirdparty/auron-iceberg/src/main/scala/org/apache/spark/sql/auron/iceberg/IcebergScanSupport.scala

lyne7-sc commented Jun 16, 2026


fix ci

c04a037

