[python] Expose DataFrame-style batch write + commit loop#420
JunRuiLee wants to merge 5 commits into
        return Err(mismatch());
    }
    if i.data_type() != t.data_type()
        && !(is_binary_family(i.data_type()) && is_binary_family(t.data_type()))
This accepts LargeBinary/FixedSizeBinary as equivalent to Binary, but the batch is then passed through unchanged. The lower write path still uses the original Arrow arrays/schema: for example extract_datum_from_arrow only downcasts binary Paimon fields to arrow_array::BinaryArray, and append writes would also see the incoming schema rather than the table schema. So a batch that this check accepts can later fail with a type-mismatch (or write files whose Arrow schema differs from the table schema). Please either reject non-exact binary-family types here or normalize/cast the batch to the target Arrow schema before calling write_arrow_batch.
Good catch — confirmed the lower write path downcasts binary fields to arrow_array::BinaryArray only, so tolerating LargeBinary/FixedSizeBinary here was a false positive. Made type matching strict (reject rather than cast, consistent with write_arrow not casting) and added a regression test. Fixed in 2fe6b39.
…tion
The lower write path downcasts binary fields to arrow_array::BinaryArray only, so tolerating LargeBinary/FixedSizeBinary in schema validation produced a false positive: such a batch passed validation but failed deeper (type-mismatch or a file whose Arrow schema differs from the table). Make type matching strict and add a regression test.
Addresses review feedback on apache#420.
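The false positive discussed in this thread can be illustrated with a small sketch. It is pure Python with type names as plain strings; `validate_schema`, `BINARY_FAMILY`, and the list-of-pairs schema representation are hypothetical stand-ins chosen for illustration, not the PR's Rust code or the pypaimon API.

```python
# Hypothetical stand-in: a schema is a list of (field_name, type_name) pairs.
BINARY_FAMILY = {"binary", "large_binary", "fixed_size_binary"}


def validate_schema(incoming, target, tolerate_binary_family=False):
    """Raise ValueError unless `incoming` matches `target` field by field."""
    if len(incoming) != len(target):
        raise ValueError("schema mismatch: field count differs")
    for (in_name, in_type), (t_name, t_type) in zip(incoming, target):
        if in_name != t_name:
            raise ValueError(f"schema mismatch: field {in_name!r} != {t_name!r}")
        if in_type == t_type:
            continue
        both_binary = in_type in BINARY_FAMILY and t_type in BINARY_FAMILY
        if tolerate_binary_family and both_binary:
            # Tolerant mode: accepted here, but the batch is NOT cast, so a
            # writer that only handles "binary" would still fail later.
            continue
        raise ValueError(
            f"schema mismatch: field {in_name!r} has type {in_type}, expected {t_type}"
        )


table_schema = [("id", "int32"), ("payload", "binary")]
batch_schema = [("id", "int32"), ("payload", "large_binary")]

# Tolerant check lets the batch through unchanged (the false positive).
validate_schema(batch_schema, table_schema, tolerate_binary_family=True)

# Strict check rejects it up front.
try:
    validate_schema(batch_schema, table_schema)
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

With the strict check, the mismatch surfaces at validation time rather than deeper in the write path. The alternative raised in the review, casting the batch to the target schema before writing, is not shown here.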
Purpose
First PR toward exposing PyPaimon's DataFrame write path to Rust. Refs #414.
Exposes the Rust core's batch write loop through bindings/python.
CommitMessage (from prepare_commit()) is an opaque object passed back to commit(). It is same-process only for now; cross-process use (pickle, for Ray worker→driver) is a later PR.
Out of scope (later PRs): serializable CommitMessage, overwrite, pypaimon-side wiring.
Notes
- write_arrow validates the incoming batch schema against the table schema and raises ValueError on mismatch (names + types; binary/fixed_size_binary tolerated), mirroring pypaimon's _validate_pyarrow_schema. No implicit casting: callers supply correctly-typed batches (e.g. INT → Arrow int32).
- commit_user is fixed per WriteBuilder and shared by new_write() and new_commit(), so the writer and committer agree (Paimon uses it for duplicate-commit detection).
- commit([]) (empty messages) is a no-op.
Tests
bindings/python/tests/test_write.py, 6 tests: write→commit→SQL read-back roundtrip, multi-batch, empty-commit no-op, schema-mismatch ValueError, non-message TypeError.
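The shape of the write and commit loop described under Purpose and Notes can be sketched with in-memory stand-ins. Everything below is a hypothetical mock written for illustration: the class bodies, the list-backed "table", and the assumption that prepare_commit() returns a list of messages are not taken from the actual bindings. Only the method names (new_write, new_commit, write_arrow, prepare_commit, commit) and the behaviours listed in Notes come from the description.

```python
import uuid


class CommitMessage:
    """Opaque token: produced by prepare_commit(), consumed by commit()."""

    def __init__(self, commit_user, rows):
        self._commit_user = commit_user
        self._rows = rows


class TableWrite:
    def __init__(self, commit_user):
        self._commit_user = commit_user
        self._pending = []

    def write_arrow(self, batch):
        # The real binding validates the batch schema here; the mock just buffers.
        self._pending.extend(batch)

    def prepare_commit(self):
        # Assumption: returns a list of messages, empty if nothing was written.
        rows, self._pending = self._pending, []
        return [CommitMessage(self._commit_user, rows)] if rows else []


class TableCommit:
    def __init__(self, commit_user, table):
        self._commit_user = commit_user
        self._table = table

    def commit(self, messages):
        if not messages:
            return  # commit([]) is a no-op
        for msg in messages:
            if not isinstance(msg, CommitMessage):
                raise TypeError("commit() expects CommitMessage objects")
        for msg in messages:
            self._table.extend(msg._rows)


class WriteBuilder:
    def __init__(self, table):
        self._table = table
        # One commit_user per builder, shared by the writer and the committer.
        self._commit_user = str(uuid.uuid4())

    def new_write(self):
        return TableWrite(self._commit_user)

    def new_commit(self):
        return TableCommit(self._commit_user, self._table)


table = []  # stand-in for the Paimon table
builder = WriteBuilder(table)
write = builder.new_write()
commit = builder.new_commit()

for batch in ([{"id": 1}, {"id": 2}], [{"id": 3}]):
    write.write_arrow(batch)
commit.commit(write.prepare_commit())
```

In this mock the messages never leave the process, which matches the same-process limitation stated above; serializing them for a worker-to-driver handoff is the part deferred to a later PR.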