
fix(format): reconcile inline blob descriptor BINARY -> LARGE_BINARY on parquet read #388

Closed
duanyyyyyyy wants to merge 1 commit into alibaba:main from duanyyyyyyy:duanyan/blob_descriptor_binary_widen


@duanyyyyyyy duanyyyyyyy commented Jun 29, 2026

Contributor

Purpose

Reading a column declared with blob-descriptor-field fails with:

Invalid: src type binary and target type large_binary mismatch

Inline blob descriptors are stored in the main parquet file as BINARY, while a
BLOB column's read type is LARGE_BINARY (a BLOB field is large_binary in the
table schema; the writer narrows it to binary for inline storage). In
ParquetFileBatchReader::NextBatch, the per-batch read-schema reconciliation
required the parquet array type to equal the read type, so it rejected this legal
binary/large_binary difference before any blob handling ran — including before
BlobViewResolvingBatchReader, which itself expects a LargeBinaryArray input.
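
As a rough illustration of the check described above, the following sketch classifies a column as unchanged, castable, or mismatched. Everything in it is hypothetical: `TypeId`, `Reconcile`, and `ClassifyColumn` are stand-ins invented for this example, not the project's or Arrow's actual types, and the timestamp timezone handling that the real reconciliation performs is left out.

```cpp
#include <cassert>

// Hypothetical stand-in for the type ids that matter in this example.
enum class TypeId { kInt32, kBinary, kLargeBinary };

// Outcome of comparing the parquet array type with the read-schema type.
enum class Reconcile { kSame, kNeedsCast, kMismatch };

// Classify one column: `src` is the type found in the parquet file,
// `target` is the type requested by the read schema.
inline Reconcile ClassifyColumn(TypeId src, TypeId target) {
    // Inline blob descriptors are stored as BINARY but read as LARGE_BINARY,
    // so this direction is a legal widening rather than an error.
    if (src == TypeId::kBinary && target == TypeId::kLargeBinary) {
        return Reconcile::kNeedsCast;
    }
    if (src == target) {
        return Reconcile::kSame;
    }
    // Any other difference, including LARGE_BINARY -> BINARY, stays a mismatch.
    return Reconcile::kMismatch;
}
```

In this sketch only the BINARY -> LARGE_BINARY direction is accepted; the reverse narrowing is the writer's job and is still reported as a mismatch on read.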

That reconciliation already walks the whole read schema every batch (for timestamp
timezones), so this PR folds the blob widening into it rather than adding a second
pass. ParquetTimestampConverter is renamed to ParquetTimestampBinaryConverter
and, in the same traversal:

  • NeedCastArrayForTimestamp treats BINARY -> LARGE_BINARY as "needs cast"
    instead of a mismatch;
  • CastArrayForTimestamp widens BINARY arrays to LARGE_BINARY by rebuilding the
    32-bit offsets as int64 and reusing the value/null buffers (no value-data copy).

Every other column (including timestamps awaiting timezone adjustment) is
unchanged, and non-blob reads do no extra work. This is the read-side inverse of
the large_binary -> binary narrowing the writer already applies to inline blob
fields.
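
The widening step can be pictured with a simplified model of a variable-length binary column. This is a minimal sketch under stated assumptions, not the PR's code: `BinaryColumn`, `WidenToLargeBinary`, and `ValueAt` are hypothetical names, and plain standard-library containers stand in for Arrow buffers. It only shows the idea that the offsets are rewritten as 64-bit values while the value bytes and validity flags are shared with the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical simplified layout of a variable-length binary column:
// offsets[i] .. offsets[i + 1] delimit row i inside the shared `data` bytes.
template <typename Offset>
struct BinaryColumn {
    std::vector<Offset> offsets;                     // row count + 1 entries
    std::shared_ptr<const std::string> data;         // concatenated value bytes
    std::shared_ptr<const std::vector<bool>> valid;  // one validity flag per row
};

using Binary = BinaryColumn<int32_t>;       // 32-bit offsets
using LargeBinary = BinaryColumn<int64_t>;  // 64-bit offsets

// Widen the offsets to 64 bits; the data and validity buffers are shared
// with the source column, so no value bytes are copied.
inline LargeBinary WidenToLargeBinary(const Binary& src) {
    assert(!src.offsets.empty());  // even an empty column has one offset
    LargeBinary out;
    out.offsets.assign(src.offsets.begin(), src.offsets.end());
    out.data = src.data;
    out.valid = src.valid;
    return out;
}

// Return the bytes of row i; callers consult `valid` to tell nulls apart
// from empty values.
inline std::string ValueAt(const LargeBinary& col, std::size_t i) {
    const auto begin = static_cast<std::size_t>(col.offsets[i]);
    const auto length = static_cast<std::size_t>(col.offsets[i + 1] - col.offsets[i]);
    return col.data->substr(begin, length);
}
```

The only per-row work in this model is copying one offset per row, which is why the cost is independent of how large the blob descriptors themselves are.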

Tests

  • parquet_file_batch_reader_test.cpp: adds TestReadBinaryColumnWithLargeBinaryReadSchema,
    which writes a {int32, binary} parquet file, reads it through a {int32, large_binary}
    schema, and asserts that the blob column is widened to large_binary with correct
    values (including null / empty) while the int32 column is untouched.
  • Existing timestamp converter tests (renamed to
    parquet_timestamp_binary_converter_test.cpp) continue to pass.

API and Format

No file-format change. Internal only: ParquetTimestampConverter is renamed to
ParquetTimestampBinaryConverter; no public API is affected.

Documentation

No documentation change needed.

Generative AI tooling

Claude Code 4.8

…on parquet read

Reading a column declared with `blob-descriptor-field` failed with:

  Invalid: src type binary and target type large_binary mismatch

Inline blob descriptors are stored in the main parquet file as BINARY, while
the blob column's read type is LARGE_BINARY (a BLOB field is large_binary in
the table schema; the writer narrows it to binary for inline storage). In
ParquetFileBatchReader::NextBatch the per-batch read-schema reconciliation
(NeedCastArrayForTimestamp / CastArrayForTimestamp) required the parquet array
type to equal the read type, so it rejected this legal binary/large_binary
difference before any blob handling ran.

That reconciliation already walks the whole read schema each batch (for
timestamp timezones), so fold the blob widening into it rather than adding a
second pass: rename ParquetTimestampConverter to ParquetTimestampBinaryConverter
and, in the same traversal,
  - NeedCastArrayForTimestamp: treat BINARY -> LARGE_BINARY as "needs cast"
    instead of a mismatch;
  - CastArrayForTimestamp: widen BINARY arrays to LARGE_BINARY (rebuild 32-bit
    offsets as int64, reuse the value/null buffers -- no data copy).
Every other column, including timestamps awaiting timezone adjustment, is
unchanged, and non-blob reads do no extra work (one traversal, as before).

ParquetFileBatchReader now just calls the renamed converter; the standalone
ParquetBinaryConverter and the need_binary_cast_ gate are removed.

TestReadBinaryColumnWithLargeBinaryReadSchema (parquet_file_batch_reader_test)
covers the read end to end: write a {int32, binary} parquet, read it through a
{int32, large_binary} schema, assert the blob column is widened with correct
values while the int column is untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@CLAassistant


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


duanyan.duan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.


2 participants