[python] Expose DataFrame-style batch write + commit loop by JunRuiLee · Pull Request #420 · apache/paimon-rust

JunRuiLee · 2026-06-29T03:24:55Z

Purpose

First in a series of PRs exposing PyPaimon's DataFrame write path to Rust. Refs #414.

Exposes the Rust core's batch write loop through bindings/python:

wb = table.new_write_builder()
write = wb.new_write()
write.write_arrow(batch)            # PyArrow RecordBatch, may be called repeatedly
messages = write.prepare_commit()
wb.new_commit().commit(messages)    # persists a snapshot; readable via SQL

CommitMessage (from prepare_commit()) is an opaque object passed back to commit(). It is same-process only for now; cross-process support (pickle, for Ray worker→driver) is left to a later PR.

Out of scope (later PRs): serializable CommitMessage, overwrite, pypaimon-side wiring.

Notes

write_arrow validates the incoming batch schema against the table schema (field names and types; binary/fixed_size_binary tolerated) and raises ValueError on mismatch, mirroring pypaimon's _validate_pyarrow_schema. There is no implicit casting: callers supply correctly typed batches (e.g. INT → Arrow int32).
A single commit_user is fixed per WriteBuilder and shared by new_write() and new_commit(), so the writer and committer agree (Paimon uses it for duplicate-commit detection).
commit([]) (empty messages) is a no-op.
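
The commit_user and empty-commit notes can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: every Toy* class, write_rows, the snapshots list, and the commit_user mismatch check are hypothetical stand-ins, not the bindings' actual implementation. It shows one commit_user being fixed per builder and shared by the writer and committer, commit([]) leaving the snapshot list untouched, and a non-message argument raising TypeError.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCommitMessage:
    """Opaque token handed from the writer to the committer (hypothetical)."""
    commit_user: str
    row_count: int


class ToyWrite:
    def __init__(self, commit_user):
        self._commit_user = commit_user
        self._pending_rows = 0

    def write_rows(self, rows):
        # Stand-in for write_arrow(batch); may be called repeatedly.
        self._pending_rows += len(rows)

    def prepare_commit(self):
        # Assumption of this toy model: nothing written means no messages.
        if self._pending_rows == 0:
            return []
        message = ToyCommitMessage(self._commit_user, self._pending_rows)
        self._pending_rows = 0
        return [message]


class ToyCommit:
    def __init__(self, commit_user, snapshots):
        self._commit_user = commit_user
        self._snapshots = snapshots

    def commit(self, messages):
        if not messages:
            return  # empty commit is a no-op: no snapshot is recorded
        for message in messages:
            if not isinstance(message, ToyCommitMessage):
                raise TypeError("expected messages from prepare_commit()")
            if message.commit_user != self._commit_user:
                # Toy-only check, to show why writer and committer must agree.
                raise ValueError("commit_user mismatch")
        self._snapshots.append(sum(m.row_count for m in messages))


class ToyWriteBuilder:
    def __init__(self, snapshots):
        # One commit_user per builder, shared by new_write() and new_commit().
        self.commit_user = uuid.uuid4().hex
        self._snapshots = snapshots

    def new_write(self):
        return ToyWrite(self.commit_user)

    def new_commit(self):
        return ToyCommit(self.commit_user, self._snapshots)
```

In this model, a writer and a committer taken from the same ToyWriteBuilder always agree on commit_user, whereas messages produced under one builder would be rejected by a committer from another.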

Tests

bindings/python/tests/test_write.py — 6 tests: write→commit→SQL read-back roundtrip, multi-batch, empty-commit no-op, schema-mismatch ValueError, non-message TypeError.

JingsongLi · 2026-06-29T11:12:42Z

+            return Err(mismatch());
+        }
+        if i.data_type() != t.data_type()
+            && !(is_binary_family(i.data_type()) && is_binary_family(t.data_type()))


This accepts LargeBinary/FixedSizeBinary as equivalent to Binary, but the batch is then passed through unchanged. The lower write path still uses the original Arrow arrays/schema: for example extract_datum_from_arrow only downcasts binary Paimon fields to arrow_array::BinaryArray, and append writes would also see the incoming schema rather than the table schema. So a batch that this check accepts can later fail with a type-mismatch (or write files whose Arrow schema differs from the table schema). Please either reject non-exact binary-family types here or normalize/cast the batch to the target Arrow schema before calling write_arrow_batch.

Good catch — confirmed the lower write path downcasts binary fields to arrow_array::BinaryArray only, so tolerating LargeBinary/FixedSizeBinary here was a false positive. Made type matching strict (reject rather than cast, consistent with write_arrow not casting) and added a regression test. Fixed in 2fe6b39.
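
The false positive discussed in this thread can be reproduced with a small pure-Python model. This is a sketch, not the PR's Rust code: schemas are represented as hypothetical lists of (field_name, type_name) pairs, and validate_schema, downcast_binary, and the tolerate_binary_family flag are names invented for illustration. With tolerance enabled, a large_binary field passes validation but then fails in a lower layer that only handles plain binary; with strict matching, the batch is rejected up front.

```python
# Hypothetical stand-in: a schema is a list of (field_name, type_name) pairs.
BINARY_FAMILY = {"binary", "large_binary", "fixed_size_binary"}


def validate_schema(incoming, expected, tolerate_binary_family=False):
    """Raise ValueError unless every field matches by name and type."""
    if len(incoming) != len(expected):
        raise ValueError(
            f"field count mismatch: got {len(incoming)}, expected {len(expected)}"
        )
    for (in_name, in_type), (ex_name, ex_type) in zip(incoming, expected):
        if in_name != ex_name:
            raise ValueError(f"field name mismatch: {in_name!r} != {ex_name!r}")
        if in_type == ex_type:
            continue
        both_binary = in_type in BINARY_FAMILY and ex_type in BINARY_FAMILY
        if tolerate_binary_family and both_binary:
            continue  # models the behaviour before the review fix
        raise ValueError(
            f"type mismatch for {in_name!r}: {in_type} != {ex_type}"
        )


def downcast_binary(type_name):
    """Model a lower write path that only handles plain binary arrays."""
    if type_name != "binary":
        raise TypeError(f"cannot downcast {type_name} to a binary array")
    return type_name
```

For a table schema of [("id", "int32"), ("payload", "binary")] and a batch whose payload is large_binary, the tolerant call returns normally and the later downcast_binary call raises, while the strict default raises ValueError during validation.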

…tion The lower write path downcasts binary fields to arrow_array::BinaryArray only, so tolerating LargeBinary/FixedSizeBinary in schema validation produced a false positive: such a batch passed validation but failed deeper (type-mismatch or a file whose Arrow schema differs from the table). Make type matching strict and add a regression test. Addresses review feedback on apache#420.

JunRuiLee added 4 commits June 29, 2026 11:16

feat(python): expose batch write + commit loop

d53af0e

test(python): cover write multi-batch, empty commit, error paths

0319ef8

docs(python): add type stubs for write API

86fed34

fix(python): validate write_arrow batch schema against table schema

3e29ad1

JingsongLi reviewed Jun 29, 2026

JunRuiLee wants to merge 5 commits into apache:main from JunRuiLee:feat/py-write-builder-pr1


2 participants
