Search before asking
Motivation
Companion to the read-path effort (#413).
PyPaimon's write path is still pure Python, while the Rust core already
implements the full append-write pipeline:
WriteBuilder → TableWrite.write_arrow() → prepare_commit() → TableCommit.commit()
This pipeline is already used by the SQL INSERT INTO path. It is simply not
exposed through bindings/python.
Goal: expose the existing Rust write API to Python so PyPaimon's batch write
path can run on Rust, with the public Python API unchanged.
Benefits:
- Faster batch write — use the Rust write pipeline instead of the pure
Python implementation.
- Consistent semantics — SQL insert and DataFrame batch write can share the
same Rust core path.
- Distributed write foundation — once commit messages are serializable,
workers can write data in parallel and a driver can commit once.
Out of scope: streaming write and merge/update/delete DML (already reachable via SQL).
Scope
This can be implemented incrementally:
- PR 1 - Expose local append write + commit:
new_write_builder(), new_write(), write_arrow(batch),
prepare_commit(), new_commit().commit(messages).
This gives a minimal end-to-end verifiable loop:
write PyArrow data → prepare commit messages → commit snapshot → read data back.
- PR 2 - Make commit messages serializable:
opaque / pickle-friendly PyCommitMessage, so commit messages can cross
process boundaries for distributed writes such as Ray.
- PR 3 - Add dynamic overwrite support:
with_overwrite() and TableCommit.overwrite(messages) for dynamic partition
overwrite, where the partitions touched by the new commit messages are
replaced.
- PR 4 - Add static partition overwrite support:
extend TableCommit.overwrite(...) with a static_partitions argument, e.g.
overwrite(messages, {"dt": "2026-06-26"}).
- PR 5 (in apache/paimon, [python]) - Switch PyPaimon's batch write path
to delegate to the Rust pipeline internally.
PR 1–4 land here; PR 5 lands in the main repo once the bindings are released.
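
The PR 1 loop can be sketched as follows. This only illustrates the intended call shape: the method names come from the list above, but the bindings do not exist yet, so `FakeTable`, `FakeWriteBuilder`, `FakeTableWrite`, `FakeTableCommit`, `read_all()` and `append_rows()` are hypothetical in-memory stand-ins, and plain lists of dicts stand in for PyArrow batches.

```python
# Hypothetical in-memory stand-ins; the real bindings are not released yet.
# Only the method names (new_write_builder, new_write, write_arrow,
# prepare_commit, new_commit, commit) come from the proposal.


class FakeTable:
    def __init__(self):
        self.snapshots = []  # each committed snapshot is a list of rows

    def new_write_builder(self):
        return FakeWriteBuilder(self)

    def read_all(self):
        # Hypothetical read-back helper, not part of the proposal.
        return [row for snapshot in self.snapshots for row in snapshot]


class FakeWriteBuilder:
    def __init__(self, table):
        self._table = table

    def new_write(self):
        return FakeTableWrite()

    def new_commit(self):
        return FakeTableCommit(self._table)


class FakeTableWrite:
    def __init__(self):
        self._buffered = []

    def write_arrow(self, batch):
        # The real method would accept a PyArrow record batch.
        self._buffered.extend(batch)

    def prepare_commit(self):
        # Hand back the buffered rows as one "commit message" and reset.
        messages, self._buffered = [list(self._buffered)], []
        return messages


class FakeTableCommit:
    def __init__(self, table):
        self._table = table

    def commit(self, messages):
        for message in messages:
            self._table.snapshots.append(message)


def append_rows(table, batches):
    """Write batches, prepare commit messages, then commit once."""
    builder = table.new_write_builder()
    writer = builder.new_write()
    for batch in batches:
        writer.write_arrow(batch)
    messages = writer.prepare_commit()
    builder.new_commit().commit(messages)
    return messages


table = FakeTable()
append_rows(table, [[{"id": 1}], [{"id": 2}, {"id": 3}]])
rows_read_back = table.read_all()
```

In this sketch the writer and the committer are separate objects obtained from the same builder, which is the property that later lets the commit step move to a different process.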
Notes
with_overwrite() is intentionally split out from the first write PR because
writer overwrite mode and commit overwrite semantics must be exposed together.
Dynamic overwrite is added first because it does not require Python literal →
Rust Datum conversion. Static partition overwrite is handled separately since
it needs a typed conversion layer for static_partitions.
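
To illustrate why static overwrite needs a typed conversion layer: the value in `{"dt": "2026-06-26"}` arrives as a Python literal and has to be checked against the partition column's type before it can become a Rust `Datum`. A rough Python-side sketch of that check is below. `PARTITION_TYPES`, `to_typed_partition()` and the type names are hypothetical, chosen only for illustration; the actual conversion would live in the bindings and follow the table's real partition schema.

```python
from datetime import date

# Hypothetical partition schema: column name -> logical type name.
PARTITION_TYPES = {"dt": "date", "region": "string", "hour": "int"}


def to_typed_partition(static_partitions, partition_types=PARTITION_TYPES):
    """Validate and convert static partition literals by column type."""
    typed = {}
    for name, value in static_partitions.items():
        if name not in partition_types:
            raise KeyError(f"unknown partition column: {name}")
        kind = partition_types[name]
        if kind == "date":
            # Accept a date object or an ISO-8601 string such as "2026-06-26".
            if isinstance(value, date):
                typed[name] = value
            else:
                typed[name] = date.fromisoformat(value)
        elif kind == "int":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{name} expects int, got {type(value).__name__}")
            typed[name] = value
        elif kind == "string":
            if not isinstance(value, str):
                raise TypeError(f"{name} expects str, got {type(value).__name__}")
            typed[name] = value
        else:
            raise NotImplementedError(f"unsupported partition type: {kind}")
    return typed


typed = to_typed_partition({"dt": "2026-06-26"})
```

Dynamic overwrite avoids this step entirely because the affected partitions are derived from the commit messages rather than from user-supplied literals.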
The first PR focuses on append writes, which are the smallest end-to-end
verifiable write path.
Commit message serialization is also split out: PR 1 only passes commit messages
within the same Python process, while PR 2 makes them suitable for Ray-style
worker-to-driver transfer.
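
The worker-to-driver transfer targeted by PR 2 could look roughly like the following. `OpaqueCommitMessage` is a hypothetical Python stand-in for `PyCommitMessage`: it wraps an opaque byte payload (the real payload format would be defined by the Rust core), and `worker_write()` / `driver_collect()` are illustrative helpers that simulate the process boundary with the standard `pickle` module.

```python
import pickle


class OpaqueCommitMessage:
    """Hypothetical stand-in for PyCommitMessage: an opaque byte payload."""

    def __init__(self, payload: bytes):
        self._payload = bytes(payload)

    def __reduce__(self):
        # Tell pickle to rebuild the object from its payload bytes only.
        return (OpaqueCommitMessage, (self._payload,))

    def __eq__(self, other):
        return (
            isinstance(other, OpaqueCommitMessage)
            and self._payload == other._payload
        )


def worker_write(partition_id: int) -> bytes:
    """Simulate a worker: produce messages and ship them as pickled bytes."""
    messages = [OpaqueCommitMessage(f"file-{partition_id}".encode())]
    return pickle.dumps(messages)


def driver_collect(shipped):
    """Simulate the driver: unpickle every worker's messages into one list."""
    collected = []
    for blob in shipped:
        collected.extend(pickle.loads(blob))
    return collected


# Three simulated workers each ship one message; the driver gathers them
# and would then pass the combined list to a single commit call.
all_messages = driver_collect([worker_write(i) for i in range(3)])
```

Keeping the message opaque on the Python side means callers only move it around and never depend on its internal layout.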
Willingness to contribute