[Feature] Support snapshot-based sequence ordering for primary-key tables #7806

@JunRuiLee

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

We run a dataset management platform on top of Paimon for LLM training data. Multiple teams and pipelines (cleaning, annotation, deduplication, etc.) write to the same primary-key tables concurrently.

Each writer assigns sequence numbers independently, so sequence numbers from different writers are not comparable. The only reliable cross-writer ordering signal is the commit order (snapshot id). Today Paimon resolves primary-key conflicts solely by sequence number (or sequence.field), which cannot express commit order.

What we need: records committed in a later snapshot always win, regardless of per-record sequence numbers.

Solution

Add a table option sequence.snapshot-ordering. When enabled, merge uses the commit snapshot id as the primary tiebreaker for primary-key conflicts, with sequence number as the secondary tiebreaker within the same snapshot. This reuses DataFileMeta.minSequenceNumber to carry the snapshot id at file level, the same pattern row-tracking tables use when they repurpose minSequenceNumber for firstRowId.
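
To make the intended ordering concrete, here is a minimal sketch of the two comparators, assuming a simplified record model. The `Version` class and the `winner` helper are hypothetical names used for illustration only; they are not Paimon's KeyValue or merge classes.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SnapshotOrderingSketch {

    /** Hypothetical stand-in for one version of a primary key; not a Paimon class. */
    static final class Version {
        final long snapshotId;      // snapshot id of the commit that added this version
        final long sequenceNumber;  // writer-assigned, comparable only within one writer
        final String value;

        Version(long snapshotId, long sequenceNumber, String value) {
            this.snapshotId = snapshotId;
            this.sequenceNumber = sequenceNumber;
            this.value = value;
        }
    }

    /** Ordering as described for today's behavior: sequence number only. */
    static final Comparator<Version> BY_SEQUENCE =
            Comparator.comparingLong((Version v) -> v.sequenceNumber);

    /** Proposed ordering: snapshot id first, sequence number within the same snapshot. */
    static final Comparator<Version> BY_SNAPSHOT_THEN_SEQUENCE =
            Comparator.comparingLong((Version v) -> v.snapshotId)
                    .thenComparingLong(v -> v.sequenceNumber);

    /** Returns the version that survives the merge under the chosen ordering. */
    static Version winner(List<Version> versions, boolean snapshotOrdering) {
        Comparator<Version> order = snapshotOrdering ? BY_SNAPSHOT_THEN_SEQUENCE : BY_SEQUENCE;
        return versions.stream().max(order).orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        // Two writers update the same key; the later commit carries the smaller sequence number.
        List<Version> versions = Arrays.asList(
                new Version(10L, 900L, "from-cleaning"),
                new Version(11L, 5L, "from-annotation"));

        System.out.println(winner(versions, false).value); // from-cleaning
        System.out.println(winner(versions, true).value);  // from-annotation
    }
}
```

With sequence-only ordering, the earlier commit wins because its writer happened to use a larger sequence number; with snapshot ordering enabled, the later commit wins.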

Design Details

  1. File-level stamp at commit time. Repurpose minSequenceNumber to carry the commit snapshot id, same injection point as row-id assignment.
  2. Propagate through compaction. Compaction output inherits max(input snapshot ids) rather than the snapshot id of the compaction's own commit, so it never shadows concurrent writes.
  3. Inflate to record at read time. KeyValueFileReaderFactory stamps the file-level snapshot id onto each KeyValue.
  4. Conditional comparator injection. Sort-merge readers use snapshot id as primary tiebreaker only when enabled.
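
Steps 1 to 3 can be sketched as follows, under the assumption of a simplified file model. `FileStamp`, `stampAtCommit`, `compact`, and `inflate` are hypothetical names for illustration; the real change would live in DataFileMeta, the commit path, and KeyValueFileReaderFactory.

```java
import java.util.Arrays;
import java.util.List;

public class SnapshotStampSketch {

    /** Hypothetical file-level metadata; stands in for the stamp carried in minSequenceNumber. */
    static final class FileStamp {
        final String fileName;
        final long snapshotId;

        FileStamp(String fileName, long snapshotId) {
            this.fileName = fileName;
            this.snapshotId = snapshotId;
        }
    }

    /** Step 1: a newly written file is stamped with the snapshot id of its commit. */
    static FileStamp stampAtCommit(String fileName, long commitSnapshotId) {
        return new FileStamp(fileName, commitSnapshotId);
    }

    /** Step 2: compaction output inherits the max stamp of its inputs. */
    static FileStamp compact(String outputName, List<FileStamp> inputs) {
        long inherited = inputs.stream()
                .mapToLong(f -> f.snapshotId)
                .max()
                .orElseThrow(() -> new IllegalArgumentException("no compaction inputs"));
        return new FileStamp(outputName, inherited);
    }

    /** Step 3: at read time, every record in a file takes the file's stamp. */
    static long[] inflate(FileStamp file, int recordCount) {
        long[] perRecord = new long[recordCount];
        Arrays.fill(perRecord, file.snapshotId);
        return perRecord;
    }

    public static void main(String[] args) {
        FileStamp a = stampAtCommit("a", 3L);
        FileStamp b = stampAtCommit("b", 5L);
        // A concurrent writer commits snapshot 6 while a and b are being compacted.
        FileStamp concurrent = stampAtCommit("c", 6L);
        // The compaction itself commits later (say as snapshot 7) but keeps stamp 5.
        FileStamp compacted = compact("ab", Arrays.asList(a, b));

        System.out.println(compacted.snapshotId);                         // 5
        System.out.println(concurrent.snapshotId > compacted.snapshotId); // true
        System.out.println(Arrays.toString(inflate(compacted, 2)));       // [5, 5]
    }
}
```

Because the compacted file keeps stamp 5 instead of taking the id of its own commit, the concurrent write at snapshot 6 still wins the merge.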

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Labels: enhancement (New feature or request)
