Search before asking
Motivation
We run a dataset management platform on top of Paimon for LLM training data. Multiple teams and pipelines (cleaning, annotation, deduplication, etc.) write to the same primary-key tables concurrently.
Each writer assigns sequence numbers independently, so sequence numbers across writers are incomparable. The only reliable cross-writer ordering signal is the commit order (snapshot id). Today Paimon resolves primary-key conflicts solely by sequence number (or sequence.field), which cannot express this.
What we need: records committed in a later snapshot always win, regardless of per-record sequence numbers.
Solution
Add a table option sequence.snapshot-ordering. When enabled, merge uses the commit snapshot id as the primary tiebreaker for primary-key conflicts, with sequence number as the secondary tiebreaker within the same snapshot. This reuses DataFileMeta.minSequenceNumber to carry the snapshot id at file level — same pattern as row-tracking tables repurposing minSequenceNumber for firstRowId.
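To make the proposed ordering concrete, here is a minimal Java sketch of the two tiebreak rules. It is not Paimon code: `Version`, `winner`, and both comparators are hypothetical names introduced only for illustration, and the real merge path works on KeyValue records rather than this simplified class.

```java
import java.util.Comparator;

class SnapshotOrderingSketch {

    /** Hypothetical stand-in for one version of a row; not Paimon's KeyValue. */
    static final class Version {
        final long snapshotId;      // snapshot that committed this version
        final long sequenceNumber;  // assigned independently by each writer
        final String value;

        Version(long snapshotId, long sequenceNumber, String value) {
            this.snapshotId = snapshotId;
            this.sequenceNumber = sequenceNumber;
            this.value = value;
        }
    }

    /** Ordering by sequence number only (the behavior described as today's). */
    static final Comparator<Version> BY_SEQUENCE =
            Comparator.comparingLong((Version v) -> v.sequenceNumber);

    /** Proposed ordering: snapshot id first, sequence number within a snapshot. */
    static final Comparator<Version> BY_SNAPSHOT_THEN_SEQUENCE =
            Comparator.comparingLong((Version v) -> v.snapshotId)
                    .thenComparingLong(v -> v.sequenceNumber);

    /** Returns the version that ranks highest under the given ordering. */
    static Version winner(Version a, Version b, Comparator<Version> order) {
        return order.compare(a, b) >= 0 ? a : b;
    }

    public static void main(String[] args) {
        // Two writers update the same primary key with unrelated sequence numbers.
        Version cleaned = new Version(10, 500, "cleaned");     // committed first
        Version annotated = new Version(11, 3, "annotated");   // committed later

        System.out.println(winner(cleaned, annotated, BY_SEQUENCE).value);               // cleaned
        System.out.println(winner(cleaned, annotated, BY_SNAPSHOT_THEN_SEQUENCE).value); // annotated
    }
}
```

With sequence-only ordering the earlier commit wins because its writer happened to use larger sequence numbers. With the snapshot id as the primary tiebreaker, the later commit wins regardless.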
Design Details
- File-level stamp at commit time. Repurpose minSequenceNumber to carry the commit snapshot id, same injection point as row-id assignment.
- Propagate through compaction. Compaction output inherits max(input snapshot ids), so it never shadows concurrent writes.
- Inflate to record at read time. KeyValueFileReaderFactory stamps the file-level snapshot id onto each KeyValue.
- Conditional comparator injection. Sort-merge readers use snapshot id as primary tiebreaker only when enabled.
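The compaction and read-time steps can be sketched as below. This is a simplified illustration under stated assumptions, not the proposed patch: `FileMeta`, `Row`, `compactedSnapshotId`, and `stamp` are hypothetical names, and the sketch keeps the snapshot id in its own field rather than in DataFileMeta.minSequenceNumber.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FileLevelSnapshotSketch {

    /** Hypothetical file metadata carrying the snapshot id stamped at commit time. */
    static final class FileMeta {
        final String name;
        final long snapshotId;

        FileMeta(String name, long snapshotId) {
            this.name = name;
            this.snapshotId = snapshotId;
        }
    }

    /** Hypothetical record; snapshotId is only known after read-time stamping. */
    static final class Row {
        final String key;
        final long sequenceNumber;
        final long snapshotId;

        Row(String key, long sequenceNumber, long snapshotId) {
            this.key = key;
            this.sequenceNumber = sequenceNumber;
            this.snapshotId = snapshotId;
        }
    }

    /** Compaction output inherits the largest snapshot id among its inputs. */
    static long compactedSnapshotId(List<FileMeta> inputs) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("compaction needs at least one input file");
        }
        long max = Long.MIN_VALUE;
        for (FileMeta file : inputs) {
            max = Math.max(max, file.snapshotId);
        }
        return max;
    }

    /** Read time: copy the file-level snapshot id onto every record of that file. */
    static List<Row> stamp(FileMeta file, List<Row> rows) {
        List<Row> stamped = new ArrayList<>();
        for (Row row : rows) {
            stamped.add(new Row(row.key, row.sequenceNumber, file.snapshotId));
        }
        return stamped;
    }

    public static void main(String[] args) {
        List<FileMeta> inputs = Arrays.asList(
                new FileMeta("f1", 7), new FileMeta("f2", 12), new FileMeta("f3", 9));
        FileMeta compacted = new FileMeta("compacted", compactedSnapshotId(inputs));
        System.out.println(compacted.snapshotId); // 12

        List<Row> rows = stamp(compacted, Arrays.asList(new Row("k1", 40, -1)));
        System.out.println(rows.get(0).snapshotId); // 12
    }
}
```

Because the compacted file keeps the id of its newest input instead of taking the id of the snapshot that commits the compaction, a write committed in snapshot 13 while the compaction was running still outranks the compacted data.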
Anything else?
No response
Are you willing to submit a PR?