Skip to content

fix: align duplicate-key replacement semantics with Java Paimon#389

Open
lucasfang wants to merge 2 commits into
alibaba:mainfrom
lucasfang:macro
Open

fix: align duplicate-key replacement semantics with Java Paimon#389
lucasfang wants to merge 2 commits into
alibaba:mainfrom
lucasfang:macro

Conversation

@lucasfang

Copy link
Copy Markdown
Collaborator

Title: fix: align duplicate-key replacement semantics with Java Paimon

Purpose

Linked issue: close #xxx

This PR aligns several duplicate-key map update paths with Java Paimon semantics.

Java Map.put / LinkedHashMap.put replaces the existing value when the key already exists, while C++ insert / emplace keeps the old value. This caused C++ behavior to diverge from Java in duplicate key scenarios.

The affected paths are:

  • Index manifest combining
    • Use replacement semantics when combining index entries by bucket identifier or file name.
    • Keep BUCKET_UNAWARE deletion vector mode unsupported.
    • Preserve Java-like removed/added list processing in BucketedCombiner.
  • Snapshot deletion file mapping
    • Align SnapshotReader::GetDeletionFiles with Java SnapshotReaderImpl.toDeletionFiles, where later deletion file metadata for the same data file wins.
  • Deletion file factory mapping
    • Align DeletionVector::CreateDeletionFileMap with Java DeletionFile.factory, where later DeletionFile values overwrite earlier values for the same data file name.
  • Deletion file writer DV metadata
    • Align DeletionFileWriter::Write with Java DeletionFileWriter.write, where later DV metadata overwrites earlier metadata for the same data file name.

Tests

Unit tests:

  • DeletionFileWriterTest.WriteCloseAndReadBack
  • DeletionFileWriterTest.GetResultWithoutCloseShouldFail
  • DeletionFileWriterTest.WriteOverwritesDuplicateDataFileName
  • DeletionFileWriterTest.ExternalPathInResult
  • DeletionVectorTest.CreateDeletionFileMap
  • IndexManifestFileHandlerTest.GlobalCombinerDeletesThenAddsByFileName
  • IndexManifestFileHandlerTest.BucketedCombinerUsesPartitionBucketAndIndexType
  • IndexManifestFileHandlerTest.BucketedCombinerOverwritesDuplicateAddedEntries
  • IndexManifestFileHandlerTest.GlobalCombinerOverwritesDuplicateAddedEntries
  • IndexManifestFileHandlerTest.DvWithBucketUnawareModeReturnsNotImplemented
  • SnapshotReaderTest.GetDeletionFilesOverwritesDuplicateDataFileName

API and Format

No public API change.

No storage format or protocol change.

This only changes runtime duplicate-key handling to match Java Paimon replacement semantics.

Documentation

No new feature is introduced.

Generative AI tooling

Generated-by: GitHub Copilot

Copilot AI review requested due to automatic review settings June 29, 2026 10:10

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Aligns C++ duplicate-key map update behavior with Java Paimon semantics (“last write wins”) across snapshot deletion file mapping, deletion-vector metadata, and index manifest combining.

Changes:

  • Switch several map update paths from insert/emplace to replacement semantics via insert_or_assign.
  • Adjust BucketedCombiner processing to preserve Java-like removed/added handling and allow overwriting duplicate added entries.
  • Add targeted unit tests for duplicate-key overwrite behavior in snapshot reader, deletion vectors, deletion file writer, and index manifest combining.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/paimon/core/table/source/snapshot/snapshot_reader_test.cpp Adds regression test ensuring later DV index metadata overwrites earlier for same data file.
src/paimon/core/table/source/snapshot/snapshot_reader.cpp Updates DV meta → index meta mapping to overwrite on duplicate keys.
src/paimon/core/manifest/index_manifest_file_handler_test.cpp Adds tests around combiner overwrite behavior; adds DV entry helper.
src/paimon/core/manifest/index_manifest_file_handler.h Updates combiner docstrings to reflect combining keys used.
src/paimon/core/manifest/index_manifest_file_handler.cpp Changes combiner maps to overwrite duplicates (insert_or_assign) and adjusts removed/added handling.
src/paimon/core/deletionvectors/deletion_vector_test.cpp Adds coverage for duplicate data file name overwrite in deletion-file map creation.
src/paimon/core/deletionvectors/deletion_vector.cpp Uses overwrite semantics when building deletion-file map.
src/paimon/core/deletionvectors/deletion_file_writer_test.cpp Adds test verifying duplicate writes overwrite DV metadata for same data file name.
src/paimon/core/deletionvectors/deletion_file_writer.cpp Overwrites DV metadata on duplicate key (insert_or_assign).
src/paimon/CMakeLists.txt Includes new snapshot reader unit test in test build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +125 to +129
auto iter = dv_ranges->find("data-file-1");
ASSERT_NE(iter, dv_ranges->end());
ASSERT_GT(iter->second.GetOffset(), 1);
ASSERT_EQ(iter->second.GetCardinality(), std::optional<int64_t>(2));
}
Comment on lines +45 to +46
dv_metas_.insert_or_assign(key, DeletionVectorMeta(key, static_cast<int32_t>(start), length,
deletion_vector->GetCardinality()));

ASSERT_EQ(deletion_files.size(), 1);
ASSERT_TRUE(deletion_files[0].has_value());
EXPECT_EQ(deletion_files[0]->path, dir_->Str() + "/index/index-second");
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants