fix(format): reconcile inline blob descriptor BINARY -> LARGE_BINARY on parquet read#388
Closed
duanyyyyyyy wants to merge 1 commit into
Closed
Conversation
…on parquet read
Reading a column declared with `blob-descriptor-field` failed with:
Invalid: src type binary and target type large_binary mismatch
Inline blob descriptors are stored in the main parquet file as BINARY, while
the blob column's read type is LARGE_BINARY (a BLOB field is large_binary in
the table schema; the writer narrows it to binary for inline storage). In
ParquetFileBatchReader::NextBatch the per-batch read-schema reconciliation
(NeedCastArrayForTimestamp / CastArrayForTimestamp) required the parquet array
type to equal the read type, so it rejected this legal binary/large_binary
difference before any blob handling ran.
That reconciliation already walks the whole read schema each batch (for
timestamp timezones), so fold the blob widening into it rather than adding a
second pass: rename ParquetTimestampConverter to ParquetTimestampBinaryConverter
and, in the same traversal,
- NeedCastArrayForTimestamp: treat BINARY -> LARGE_BINARY as "needs cast"
instead of a mismatch;
- CastArrayForTimestamp: widen BINARY arrays to LARGE_BINARY (rebuild 32-bit
offsets as int64, reuse the value/null buffers -- no data copy).
Every other column, including timestamps awaiting timezone adjustment, is
unchanged, and non-blob reads do no extra work (one traversal, as before).
ParquetFileBatchReader now just calls the renamed converter; the standalone
ParquetBinaryConverter and the need_binary_cast_ gate are removed.
TestReadBinaryColumnWithLargeBinaryReadSchema (parquet_file_batch_reader_test)
covers the read end to end: write a {int32, binary} parquet, read it through a
{int32, large_binary} schema, assert the blob column is widened with correct
values while the int column is untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
duanyan.duan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Reading a column declared with
blob-descriptor-fieldfails with:Invalid: src type binary and target type large_binary mismatch
Inline blob descriptors are stored in the main parquet file as
BINARY, while aBLOB column's read type is
LARGE_BINARY(a BLOB field islarge_binaryin thetable schema; the writer narrows it to
binaryfor inline storage). InParquetFileBatchReader::NextBatch, the per-batch read-schema reconciliationrequired the parquet array type to equal the read type, so it rejected this legal
binary/large_binarydifference before any blob handling ran — including beforeBlobViewResolvingBatchReader, which itself expects aLargeBinaryArrayinput.That reconciliation already walks the whole read schema every batch (for timestamp
timezones), so this PR folds the blob widening into it rather than adding a second
pass.
ParquetTimestampConverteris renamed toParquetTimestampBinaryConverterand, in the same traversal:
NeedCastArrayForTimestamptreatsBINARY -> LARGE_BINARYas "needs cast"instead of a mismatch;
CastArrayForTimestampwidensBINARYarrays toLARGE_BINARY(rebuild the32-bit offsets as int64, reuse the value/null buffers — no value-data copy).
Every other column (including timestamps awaiting timezone adjustment) is
unchanged, and non-blob reads do no extra work. This is the read-side inverse of
the
large_binary -> binarynarrowing the writer already applies to inline blobfields.
Tests
parquet_file_batch_reader_test.cpp: addsTestReadBinaryColumnWithLargeBinaryReadSchema— writes a
{int32, binary}parquet, reads it through a{int32, large_binary}schema, and asserts the blob column is widened to
large_binarywith correctvalues (including null / empty) while the
int32column is untouched.parquet_timestamp_binary_converter_test.cpp) continue to pass.API and Format
No file-format change. Internal only:
ParquetTimestampConverteris renamed toParquetTimestampBinaryConverter; no public API is affected.Documentation
No documentation change needed.
Generative AI tooling
Claude Code 4.8