
Expose DataFrame-style write API (WriteBuilder / TableWrite / TableCommit) to Python #414

Description

@JunRuiLee

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Companion to the read-path effort (#413).

PyPaimon's write path is still pure Python, while the Rust core already
implements the full append-write pipeline:

WriteBuilder → TableWrite.write_arrow() → prepare_commit() → TableCommit.commit()

This pipeline is already used by the SQL INSERT INTO path. It is simply not
exposed through bindings/python.

Goal: expose the existing Rust write API to Python so PyPaimon's batch write
path can run on Rust, with the public Python API unchanged.

Benefits:

  • Faster batch write — use the Rust write pipeline instead of the pure
    Python implementation.
  • Consistent semantics — SQL insert and DataFrame batch write can share the
    same Rust core path.
  • Distributed write foundation — once commit messages are serializable,
    workers can write data in parallel and a driver can commit once.

Out of scope: streaming write and merge/update/delete DML (already reachable via SQL).

Scope

This can be implemented incrementally:

  • PR 1 — Expose local append write + commit:
    new_write_builder(), new_write(), write_arrow(batch),
    prepare_commit(), new_commit().commit(messages).

    This gives a minimal end-to-end verifiable loop:
    write PyArrow data → prepare commit messages → commit snapshot → read data back.

  • PR 2 — Make commit messages serializable:
    opaque / pickle-friendly PyCommitMessage, so commit messages can cross
    process boundaries for distributed writes such as Ray.

  • PR 3 — Add dynamic overwrite support:
    with_overwrite() and TableCommit.overwrite(messages) for dynamic partition
    overwrite, where the partitions touched by the new commit messages are
    replaced.

  • PR 4 — Add static partition overwrite support:
    extend TableCommit.overwrite(...) with a static_partitions argument, e.g.
    overwrite(messages, {"dt": "2026-06-26"}).

  • PR 5 (in apache/paimon, [python]) — Switch PyPaimon's batch write path
    to delegate to the Rust pipeline internally.
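
As a rough illustration of the PR 1 loop, the sketch below drives the call sequence named above (new_write_builder(), new_write(), write_arrow(batch), prepare_commit(), new_commit().commit(messages)). The bindings do not exist yet, so which object each method lives on is an assumption, and the Fake* classes are hypothetical in-memory stand-ins that only make the sketch runnable. Real batches would be PyArrow record batches; plain lists are used here to avoid the dependency.

```python
from typing import Any, Iterable, List


def append_batches(table: Any, batches: Iterable[Any]) -> int:
    """Write all batches, then commit once. Returns the number of commit messages."""
    builder = table.new_write_builder()      # assumed to live on the table object
    writer = builder.new_write()             # assumed to live on the builder
    for batch in batches:
        writer.write_arrow(batch)
    messages = writer.prepare_commit()
    builder.new_commit().commit(messages)    # one commit for everything written
    return len(messages)


# --- Hypothetical stand-ins, NOT the real bindings ---------------------------
class FakeCommit:
    def __init__(self, table: "FakeTable") -> None:
        self._table = table

    def commit(self, messages: List[List[Any]]) -> None:
        # Rows only become visible to readers at commit time.
        for message in messages:
            self._table.committed.extend(message)


class FakeWrite:
    def __init__(self) -> None:
        self._pending: List[List[Any]] = []

    def write_arrow(self, batch: Iterable[Any]) -> None:
        self._pending.append(list(batch))

    def prepare_commit(self) -> List[List[Any]]:
        # Hand over everything buffered so far and reset the buffer.
        messages, self._pending = self._pending, []
        return messages


class FakeWriteBuilder:
    def __init__(self, table: "FakeTable") -> None:
        self._table = table

    def new_write(self) -> FakeWrite:
        return FakeWrite()

    def new_commit(self) -> FakeCommit:
        return FakeCommit(self._table)


class FakeTable:
    def __init__(self) -> None:
        self.committed: List[Any] = []

    def new_write_builder(self) -> FakeWriteBuilder:
        return FakeWriteBuilder(self)
```

With the real bindings, the read-back step would go through the read path from #413; here it is just FakeTable.committed.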

PRs 1–4 land in this repository; PR 5 lands in the main repo (apache/paimon) once the bindings are released.
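
To make the difference between PR 3 and PR 4 concrete, here is a toy model of the two overwrite modes. It is not the Rust implementation: the table is modelled as a dict from partition to rows, a commit message as a (partition, rows) pair, and the function name and signature are hypothetical. The static case assumes that every existing partition matching static_partitions is replaced and that all messages fall inside that spec.

```python
from typing import Dict, List, Mapping, Optional, Tuple

Partition = Tuple[Tuple[str, str], ...]   # e.g. (("dt", "2026-06-26"),)
Message = Tuple[Partition, List[str]]     # (target partition, new rows)
Table = Dict[Partition, List[str]]


def overwrite(
    table: Table,
    messages: List[Message],
    static_partitions: Optional[Mapping[str, str]] = None,
) -> Table:
    """Return a new table state after an overwrite commit (toy model)."""
    if static_partitions is None:
        # Dynamic: only partitions touched by the new messages are replaced.
        replaced = {partition for partition, _ in messages}
    else:
        # Static: every existing partition matching the spec is replaced.
        replaced = {
            partition
            for partition in table
            if all(dict(partition).get(k) == v for k, v in static_partitions.items())
        }
    # Keep untouched partitions (copying the row lists), then add the new data.
    result = {p: list(rows) for p, rows in table.items() if p not in replaced}
    for partition, rows in messages:
        result.setdefault(partition, []).extend(rows)
    return result
```

In this model, a static overwrite with {"dt": "a"} drops every dt=a partition even if the new messages only write to one of them, whereas a dynamic overwrite leaves the untouched dt=a partitions alone.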

Notes

with_overwrite() is intentionally split out from the first write PR because
writer overwrite mode and commit overwrite semantics must be exposed together.
Dynamic overwrite is added first because it does not require Python literal →
Rust Datum conversion. Static partition overwrite is handled separately since
it needs a typed conversion layer for static_partitions.
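
A minimal sketch of what such a typed conversion layer might look like on the Python side, assuming the partition column types are known up front. The type names ("string", "int", "date"), the function name, and the partition_types argument are all hypothetical; the real layer would map onto Rust Datum values rather than Python objects.

```python
from datetime import date
from typing import Any, Callable, Dict, Mapping

# Hypothetical type names; the real set would follow the table schema.
_CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "int": int,
    "date": lambda v: v if isinstance(v, date) else date.fromisoformat(v),
}


def convert_static_partitions(
    spec: Mapping[str, Any],
    partition_types: Mapping[str, str],
) -> Dict[str, Any]:
    """Convert Python literals in a static partition spec to typed values."""
    typed: Dict[str, Any] = {}
    for column, literal in spec.items():
        if column not in partition_types:
            raise KeyError(f"{column!r} is not a partition column")
        type_name = partition_types[column]
        if type_name not in _CONVERTERS:
            raise TypeError(f"unsupported partition type {type_name!r}")
        typed[column] = _CONVERTERS[type_name](literal)
    return typed
```

For example, convert_static_partitions({"dt": "2026-06-26"}, {"dt": "date"}) yields a datetime.date for dt, while the same literal against a "string" column stays a string. Dynamic overwrite avoids this step because the partitions come from the commit messages themselves.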

The first PR focuses on append writes, which are the smallest end-to-end
verifiable write path.

Commit message serialization is also split out: PR 1 only passes commit messages
within the same Python process, while PR 2 makes them suitable for Ray-style
worker-to-driver transfer.
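
One possible shape for the pickle-friendly message in PR 2, sketched in pure Python. The class name, the payload field, and the idea that the Rust side hands back an already-serialized byte string are assumptions; the point is only that the object treats its content as opaque and tells pickle explicitly how to rebuild it, which a native extension class would typically need to do as well.

```python
import pickle
from dataclasses import dataclass


@dataclass(frozen=True)
class OpaqueCommitMessage:
    """Hypothetical stand-in for PyCommitMessage: an opaque, immutable blob."""

    payload: bytes  # assumed: serialized commit message produced by the Rust core

    def __reduce__(self):
        # Rebuild from the raw bytes only; Python never inspects the payload.
        return (OpaqueCommitMessage, (self.payload,))


def to_wire(message: OpaqueCommitMessage) -> bytes:
    """What a worker would send to the driver."""
    return pickle.dumps(message)


def from_wire(data: bytes) -> OpaqueCommitMessage:
    """What the driver would do before calling commit(messages)."""
    return pickle.loads(data)
```

A worker would call to_wire() on each message returned by prepare_commit(), and the driver would collect the results, call from_wire() on each, and commit them once. Frameworks such as Ray pickle task results themselves, so in practice defining the hook may be all that is required.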

Willingness to contribute

  • I'm willing to submit a PR!
