[python] Expose DataFrame-style batch write + commit loop#420

Open
JunRuiLee wants to merge 5 commits into apache:main from JunRuiLee:feat/py-write-builder-pr1
@JunRuiLee

Contributor

Purpose

First PR in a series exposing PyPaimon's DataFrame write path to Rust. Refs #414.

Exposes the Rust core's batch write loop through bindings/python:

wb = table.new_write_builder()
write = wb.new_write()
write.write_arrow(batch)            # PyArrow RecordBatch, may be called repeatedly
messages = write.prepare_commit()
wb.new_commit().commit(messages)    # persists a snapshot; readable via SQL

CommitMessage (from prepare_commit()) is an opaque object passed back to commit() — same-process only for now; cross-process (pickle, for Ray worker→driver) is a later PR.

Out of scope (later PRs): serializable CommitMessage, overwrite, pypaimon-side wiring.

Notes

  • write_arrow validates the incoming batch schema against the table schema and raises ValueError on mismatch (names + types; binary/fixed_size_binary tolerated), mirroring pypaimon's _validate_pyarrow_schema. No implicit casting — callers supply correctly-typed batches (e.g. INT → Arrow int32).
  • A single commit_user is fixed per WriteBuilder and shared by new_write() and new_commit(), so the writer and committer agree (Paimon uses it for duplicate-commit detection).
  • commit([]) (empty messages) is a no-op.
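
The shape of the loop described in these notes can be sketched with a toy, in-memory model. This is not the binding's implementation: the classes below are stand-ins that only borrow the method names from the PR description, `write_rows` is a hypothetical substitute for `write_arrow`, generating `commit_user` with `uuid` is an assumption, and returning an empty list from `prepare_commit()` when nothing was written is also an assumption.

```python
import uuid


class CommitMessage:
    """Opaque token handed from prepare_commit() to commit() (toy stand-in)."""

    def __init__(self, commit_user, rows):
        self.commit_user = commit_user
        self.rows = rows


class Write:
    def __init__(self, commit_user):
        self._commit_user = commit_user
        self._pending = []

    def write_rows(self, rows):
        # Hypothetical stand-in for write_arrow(batch); may be called repeatedly.
        self._pending.extend(rows)

    def prepare_commit(self):
        # Assumption: nothing written means nothing to commit.
        if not self._pending:
            return []
        message = CommitMessage(self._commit_user, list(self._pending))
        self._pending.clear()
        return [message]


class Commit:
    def __init__(self, commit_user, snapshots):
        self._commit_user = commit_user
        self._snapshots = snapshots

    def commit(self, messages):
        # Anything that is not a CommitMessage is rejected with TypeError.
        for message in messages:
            if not isinstance(message, CommitMessage):
                raise TypeError("commit() expects CommitMessage objects")
        if not messages:
            return  # empty commit is a no-op: no snapshot is recorded
        rows = [row for message in messages for row in message.rows]
        self._snapshots.append({"commit_user": self._commit_user, "rows": rows})


class WriteBuilder:
    def __init__(self):
        # One commit_user per builder, shared by every writer and committer.
        self.commit_user = uuid.uuid4().hex  # assumption: how the id is made
        self.snapshots = []

    def new_write(self):
        return Write(self.commit_user)

    def new_commit(self):
        return Commit(self.commit_user, self.snapshots)
```

In this model, fixing `commit_user` on the builder is what guarantees that the writer and the committer agree; creating it separately inside `new_write()` and `new_commit()` would let the two drift apart.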

Tests

bindings/python/tests/test_write.py — 6 tests: write→commit→SQL read-back roundtrip, multi-batch, empty-commit no-op, schema-mismatch ValueError, non-message TypeError.

Comment thread bindings/python/src/write.rs Outdated
return Err(mismatch());
}
if i.data_type() != t.data_type()
&& !(is_binary_family(i.data_type()) && is_binary_family(t.data_type()))

Contributor

This accepts LargeBinary/FixedSizeBinary as equivalent to Binary, but the batch is then passed through unchanged. The lower write path still uses the original Arrow arrays/schema: for example extract_datum_from_arrow only downcasts binary Paimon fields to arrow_array::BinaryArray, and append writes would also see the incoming schema rather than the table schema. So a batch that this check accepts can later fail with a type-mismatch (or write files whose Arrow schema differs from the table schema). Please either reject non-exact binary-family types here or normalize/cast the batch to the target Arrow schema before calling write_arrow_batch.

Contributor Author

Good catch — confirmed the lower write path downcasts binary fields to arrow_array::BinaryArray only, so tolerating LargeBinary/FixedSizeBinary here was a false positive. Made type matching strict (reject rather than cast, consistent with write_arrow not casting) and added a regression test. Fixed in 2fe6b39.

…tion

The lower write path downcasts binary fields to arrow_array::BinaryArray only,
so tolerating LargeBinary/FixedSizeBinary in schema validation produced a false
positive: such a batch passed validation but failed deeper (type-mismatch or a
file whose Arrow schema differs from the table). Make type matching strict and
add a regression test. Addresses review feedback on apache#420.
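
For readers following the thread, the difference between the tolerant and the strict check can be illustrated with a small pure-Python sketch. It is only a model: schemas are represented as lists of `(name, type_name)` pairs rather than Arrow schemas, the function name `validate_schema` is hypothetical, and comparing fields by position is an assumption about how the real check pairs them.

```python
def validate_schema(incoming, target):
    """Raise ValueError unless incoming matches target exactly.

    Both arguments are lists of (field_name, type_name) pairs; this is a
    simplified stand-in for comparing Arrow fields and data types.
    """
    if len(incoming) != len(target):
        raise ValueError(
            f"schema mismatch: expected {len(target)} fields, got {len(incoming)}"
        )
    for (in_name, in_type), (t_name, t_type) in zip(incoming, target):
        if in_name != t_name:
            raise ValueError(
                f"schema mismatch: expected field {t_name!r}, got {in_name!r}"
            )
        # Strict comparison: no binary-family tolerance and no implicit cast.
        if in_type != t_type:
            raise ValueError(
                f"schema mismatch on {t_name!r}: expected {t_type}, got {in_type}"
            )


table_schema = [("id", "int32"), ("payload", "binary")]

# An exactly matching batch schema passes silently.
validate_schema([("id", "int32"), ("payload", "binary")], table_schema)

# A large_binary column is rejected up front instead of failing deeper
# in the write path.
try:
    validate_schema([("id", "int32"), ("payload", "large_binary")], table_schema)
except ValueError as err:
    rejected = str(err)
```

Rejecting at validation time keeps the failure next to its cause, which matches the reject-rather-than-cast choice stated in the reply above.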