Skip to content

Core: Rebase replace transaction onto refreshed metadata on concurrent commit#16975

Open
Kurtiscwright wants to merge 3 commits into
apache:mainfrom
Kurtiscwright:replace-seq-reuse-repro
Open

Core: Rebase replace transaction onto refreshed metadata on concurrent commit#16975
Kurtiscwright wants to merge 3 commits into
apache:mainfrom
Kurtiscwright:replace-seq-reuse-repro

Conversation

@Kurtiscwright

@Kurtiscwright Kurtiscwright commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Fixes #16232
Rebased on PR #16975

Replacing a table replaces its schema, partitioning, and data, but it keeps the
table's history instead of deleting the table and making a new one. The old code
didn't do this when another commit landed first. While a replace was being
prepared, a second commit could update the table. The replace then reloaded the
latest table state to pass the catalog's conflict check, but still wrote out the
changes it had prepared at the start. The second commit's snapshots disappeared
from history, and if that commit had changed the schema, the replace could end
up pointing columns at IDs that no longer matched the data files.

Now, when the table has changed, the replace is re-applied on top of the latest
table state: the new schema, partitioning, sort order, location, and properties
are kept, but they sit on top of the other commit's snapshots, and the replace's
own new data is added after it. If the table no longer exists when the replace
commits (for example, it was dropped in the meantime), there is nothing to build
on, so the replace just creates the table fresh.

Add TableMetadata.buildReplacementPreservingIds. It keeps the field IDs the
replace schema already has, instead of matching columns up by name and handing
out new IDs. Matching by name can move a column to a different ID than its data
files were written with, which makes that data unreadable.

Two design notes:

  1. The replace gets its own re-apply path rather than reusing
    applyUpdates, because a replace's schema and partitioning live in the prepared
    metadata, not in the list of pending updates, so replaying only those updates
    would drop the replacement and leave the old schema in place.

  2. It uses buildReplacementPreservingIds rather than buildReplacement
    so a concurrent schema change can't shift the replace's field IDs away
    from its data files.

Add tests for the rebuilt replace, kept history, stable field IDs, and the
dropped-table case. Two REST-only tests used to expect a replace with an
unchanged schema or partitioning to fail on a spurious ID check; update them to
expect it to succeed.

Kurtis Wright and others added 2 commits June 26, 2026 17:10
…t commit

When a concurrent commit lands after a replace transaction is staged,
commitReplaceTransaction advanced base to the refreshed metadata to satisfy the
optimistic lock but committed the stale staged metadata. This dropped the
concurrent commit's snapshots from history and reused its sequence number,
violating sequence-number monotonicity. A replace is last-writer-wins on the
current schema, spec, and data, but it is not a drop-and-recreate: the table
must keep its history.

Rebuild the replacement on the refreshed base and replay the staged updates so
the concurrent commit's history is preserved and the replacement snapshot's
sequence number is re-derived from the refreshed base. Add
TableMetadata.buildReplacementPreservingIds, which keeps the staged schema's
field IDs instead of re-deriving them by name, so a concurrent schema change
cannot shift the IDs the staged data files were written against.

Add tests covering the rebase, history preservation, and field-id stability.
Update the two server-side conflict tests in CatalogTests, which previously
asserted a spurious last-assigned-id requirement failure, to assert the replace
succeeds as last-writer-wins.
Co-authored-by: Sreesh Maheshwar <maheshwarsreesh@gmail.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a correctness bug in Core replace-table transactions where a concurrent commit could be silently overwritten. The new behavior rebases a staged replace onto refreshed table metadata when the base has advanced, preserving concurrent history (snapshots/schema/spec history) while still applying the replacement’s intended schema/spec/sort/location/properties and adding the replacement’s new data after the concurrent commit.

Changes:

  • Rebase BaseTransaction replace commits onto refreshed TableMetadata when concurrent commits advance the base, and replay staged updates on the rebuilt replacement.
  • Add TableMetadata.buildReplacementPreservingIds to rebuild replacements without reassigning schema field IDs by name.
  • Add/adjust tests to validate preserved history, stable field IDs, dropped-table handling, and updated REST/HMS expectations.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveCreateReplaceTable.java Updates HMS tests to expect concurrent property preservation and relaxes spec equality to content equivalence.
core/src/test/java/org/apache/iceberg/TestReplaceTransaction.java Adds new replace-transaction concurrency tests (rebase behavior, schema history preservation, field ID stability, dropped-table case).
core/src/test/java/org/apache/iceberg/catalog/CatalogTests.java Updates REST/catalog concurrency tests to expect rebased replace to succeed instead of failing on spurious ID requirements.
core/src/main/java/org/apache/iceberg/TableMetadata.java Adds buildReplacementPreservingIds to rebuild replacements while keeping pre-assigned schema field IDs.
core/src/main/java/org/apache/iceberg/BaseTransaction.java Introduces staged replacement tracking and the rebase-and-replay path for replace commits when the base metadata changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +332 to 336
if (base != null) {
rebaseReplaceOnto(base);
}
}

@manuzhang

Copy link
Copy Markdown
Member

@Kurtiscwright What do you mean by rebased on PR 16975 which is the current PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Core: REPLACE TABLE transaction silently overwrites concurrent committed changes

3 participants