fix(format): reconcile inline blob descriptor BINARY -> LARGE_BINARY on parquet read by duanyyyyyyy · Pull Request #388 · alibaba/paimon-cpp

duanyyyyyyy · 2026-06-29T09:18:55Z

Purpose

Reading a column declared with blob-descriptor-field fails with:

Invalid: src type binary and target type large_binary mismatch

Inline blob descriptors are stored in the main parquet file as BINARY, while a
BLOB column's read type is LARGE_BINARY (a BLOB field is large_binary in the
table schema; the writer narrows it to binary for inline storage). In
ParquetFileBatchReader::NextBatch, the per-batch read-schema reconciliation
required the parquet array type to equal the read type, so it rejected this legal
binary/large_binary difference before any blob handling ran — including before
BlobViewResolvingBatchReader, which itself expects a LargeBinaryArray input.

That reconciliation already walks the whole read schema every batch (for timestamp
timezones), so this PR folds the blob widening into it rather than adding a second
pass. ParquetTimestampConverter is renamed to ParquetTimestampBinaryConverter
and, in the same traversal:

- NeedCastArrayForTimestamp treats BINARY -> LARGE_BINARY as "needs cast"
  instead of a mismatch;
- CastArrayForTimestamp widens BINARY arrays to LARGE_BINARY (rebuilding the
  32-bit offsets as int64 and reusing the value/null buffers, so no value data
  is copied).
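The offset widening can be illustrated without Arrow. The sketch below is not the PR's code: `BinaryColumn` and `WidenToLarge` are hypothetical stand-ins for an Arrow binary array and the cast step, and the standard-library containers only mimic Arrow buffers. It shows the idea that only the offsets are rebuilt while the value and validity buffers are shared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a variable-length binary array: `offsets` holds
// length + 1 entries, and value i spans data[offsets[i], offsets[i + 1]).
template <typename OffsetT>
struct BinaryColumn {
    std::vector<OffsetT> offsets;
    std::shared_ptr<const std::string> data;         // value bytes (shared)
    std::shared_ptr<const std::vector<bool>> valid;  // validity flags (shared)

    int64_t length() const { return static_cast<int64_t>(offsets.size()) - 1; }
    bool IsNull(int64_t i) const {
        return valid && !(*valid)[static_cast<std::size_t>(i)];
    }
    std::string Value(int64_t i) const {
        const auto idx = static_cast<std::size_t>(i);
        const auto begin = static_cast<std::size_t>(offsets[idx]);
        const auto end = static_cast<std::size_t>(offsets[idx + 1]);
        return data->substr(begin, end - begin);
    }
};

// Widen 32-bit offsets to 64-bit. The value and validity buffers are shared
// with the source, so no value bytes are copied.
BinaryColumn<int64_t> WidenToLarge(const BinaryColumn<int32_t>& src) {
    assert(!src.offsets.empty());  // a valid array has at least one offset
    BinaryColumn<int64_t> out;
    out.offsets.assign(src.offsets.begin(), src.offsets.end());
    out.data = src.data;
    out.valid = src.valid;
    return out;
}
```

The cost of the cast is proportional to the number of rows (one offset per row plus one), not to the total size of the blob descriptors.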

Every other column (including timestamps awaiting timezone adjustment) is
unchanged, and non-blob reads do no extra work. This is the read-side inverse of
the large_binary -> binary narrowing the writer already applies to inline blob
fields.
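The per-column decision described above can be sketched as a small function. This is an illustration under assumptions, not the converter's actual code: `TypeId`, `Name`, and `NeedCast` are hypothetical names, timestamp handling is left out, and an exception stands in for whatever status type the real reader returns.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical, simplified type tags; the real code compares Arrow data types.
enum class TypeId { kInt32, kBinary, kLargeBinary };

std::string Name(TypeId t) {
    switch (t) {
        case TypeId::kInt32: return "int32";
        case TypeId::kBinary: return "binary";
        case TypeId::kLargeBinary: return "large_binary";
    }
    assert(false && "unhandled TypeId");
    return "unknown";
}

// Returns false when the column already matches the read type, true when it
// must be widened, and throws when the two types cannot be reconciled.
bool NeedCast(TypeId src, TypeId target) {
    if (src == target) return false;  // unchanged columns do no extra work
    if (src == TypeId::kBinary && target == TypeId::kLargeBinary) {
        return true;  // legal inline-blob difference: widen on read
    }
    throw std::invalid_argument("src type " + Name(src) + " and target type " +
                                Name(target) + " mismatch");
}
```

Only the binary to large_binary direction is accepted here; the reverse still reports a mismatch, which matches the description of this change as the read-side inverse of the writer's narrowing.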

Tests

- parquet_file_batch_reader_test.cpp: adds TestReadBinaryColumnWithLargeBinaryReadSchema,
  which writes a {int32, binary} parquet file, reads it through a
  {int32, large_binary} schema, and asserts that the blob column is widened to
  large_binary with correct values (including null / empty) while the int32
  column is untouched.
- Existing timestamp converter tests (renamed to
  parquet_timestamp_binary_converter_test.cpp) continue to pass.

API and Format

No file-format change. Internal only: ParquetTimestampConverter is renamed to
ParquetTimestampBinaryConverter; no public API is affected.

Documentation

No documentation change needed.

Generative AI tooling

Claude Code 4.8

…on parquet read

Reading a column declared with `blob-descriptor-field` failed with:

Invalid: src type binary and target type large_binary mismatch

Inline blob descriptors are stored in the main parquet file as BINARY, while the blob column's read type is LARGE_BINARY (a BLOB field is large_binary in the table schema; the writer narrows it to binary for inline storage). In ParquetFileBatchReader::NextBatch the per-batch read-schema reconciliation (NeedCastArrayForTimestamp / CastArrayForTimestamp) required the parquet array type to equal the read type, so it rejected this legal binary/large_binary difference before any blob handling ran.

That reconciliation already walks the whole read schema each batch (for timestamp timezones), so fold the blob widening into it rather than adding a second pass: rename ParquetTimestampConverter to ParquetTimestampBinaryConverter and, in the same traversal,

- NeedCastArrayForTimestamp: treat BINARY -> LARGE_BINARY as "needs cast" instead of a mismatch;
- CastArrayForTimestamp: widen BINARY arrays to LARGE_BINARY (rebuild 32-bit offsets as int64, reuse the value/null buffers -- no data copy).

Every other column, including timestamps awaiting timezone adjustment, is unchanged, and non-blob reads do no extra work (one traversal, as before). ParquetFileBatchReader now just calls the renamed converter; the standalone ParquetBinaryConverter and the need_binary_cast_ gate are removed.

TestReadBinaryColumnWithLargeBinaryReadSchema (parquet_file_batch_reader_test) covers the read end to end: write a {int32, binary} parquet, read it through a {int32, large_binary} schema, assert the blob column is widened with correct values while the int column is untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CLAassistant · 2026-06-29T09:19:04Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

duanyan.duan does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

duanyyyyyyy closed this Jun 29, 2026

duanyyyyyyy wants to merge 1 commit into alibaba:main from duanyyyyyyy:duanyan/blob_descriptor_binary_widen
