Skip to content

Parquet: Fix variant shredding of large decimals (precision > 18)#17002

Open
xiaoxuandev wants to merge 1 commit into
apache:mainfrom
xiaoxuandev:fix-variant-shredding-of-large-decimals
Open

Parquet: Fix variant shredding of large decimals (precision > 18)#17002
xiaoxuandev wants to merge 1 commit into
apache:mainfrom
xiaoxuandev:fix-variant-shredding-of-large-decimals

Conversation

@xiaoxuandev

@xiaoxuandev xiaoxuandev commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

VariantShreddingAnalyzer shredded a decimal with precision > 18 into a FIXED_LEN_BYTE_ARRAY typed_value with a hardcoded length of 16. The writer (FixedDecimalWriter) instead emits the minimum number of bytes for the precision, TypeUtil.decimalRequiredBytes (e.g. 9 bytes for precision 20), so the write fails with IllegalArgumentException: 'Fixed Binary size 9 does not match field type length 16'. This breaks write/read of any variant holding an int128-range value (stored as DECIMAL16), such as an integer with 19-34 digits.

The hardcoded 16 was the only fixed-decimal width in the codebase not derived from decimalRequiredBytes; TypeToMessageType and every Parquet decimal writer already use it. decimalRequiredBytes is 9-15 bytes for precision 19-35 and 16 bytes only for precision 36-38, so the constant matched the writer just at 36-38; for precision 19-35 the declared length exceeded the bytes the writer emits. Set the length to TypeUtil.decimalRequiredBytes(maxPrecision) to realign this outlier. Precision <= 18 is unaffected (INT32/INT64).

@xiaoxuandev xiaoxuandev force-pushed the fix-variant-shredding-of-large-decimals branch from 3fa6899 to 79ac9a8 Compare June 29, 2026 18:12
@github-actions github-actions Bot added the flink label Jun 29, 2026
@nssalian

Copy link
Copy Markdown
Collaborator

Thanks for raising this @xiaoxuandev. Will take a look.

@xiaoxuandev

Copy link
Copy Markdown
Contributor Author

Hi @huaxingao @nssalian, small Parquet fix, mind taking a look? Corrects the shredded fixed-decimal length in VariantShreddingAnalyzer (16 → decimalRequiredBytes) that broke DECIMAL16 variant writes.
cc: @Guosmilesmile @pvary for the Flink test bit. Thanks!

@nssalian nssalian left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a look and the fix is right. The reader side maps any FIXED_LEN_BYTE_ARRAY to DECIMAL16 so there's no file-format compatibility concern.

One thing for the description: per TypeUtil.REQUIRED_LENGTH, precision 35 maps to 15 bytes, not 16. So the constant only happened to match the writer at precision 36-38, and the bug fires across precision 19 through 35. Worth correcting since this usually lands in the commit message.

@TestTemplate
public void testShreddedLargeIntegerVariantReadBack() {
// A 20-digit integer (DECIMAL16, above Long.MAX_VALUE) shreds to a FIXED_LEN_BYTE_ARRAY
// typed_value. Before the fix its declared length (16) exceeded the bytes written, failing the

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rephrase this so it describes the invariant instead of the history? Something like:

// DECIMAL16 (20-digit integer, above Long.MAX_VALUE) shreds to a FIXED_LEN_BYTE_ARRAY
// typed_value whose declared length must equal what FixedDecimalWriter emits

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

"value",
optional(PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
.length(16)
.length(9) // decimalRequiredBytes(21) = 9, not a hardcoded 16

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we clean up this comment as well along with the other similar ones?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks!

.isEqualTo(PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY);
assertThat(valPrimitive.getTypeLength())
.isEqualTo(TypeUtil.decimalRequiredBytes(expectedPrecision))
.isLessThan(16);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back to #14297 to see what was done.

The only test it had was for the FIXED_LEN_BYTE_ARRAY branch happened to land on precision 38, where decimalRequiredBytes(38) = 16 and the hardcoded 16 looked correct. Here we are adding coverage at precision 20 (9 bytes) and keeps the precision-38 case, but those are both boundaries of the affected range.

Could we add one more case at a middle precision, e.g. 28 (12 bytes) or 33 (14 bytes)? Easiest as a @ParameterizedTest taking precision + expected byte length, or just one extra @test. One non-boundary case can help prove the fix is doing the arithmetic correctly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the precision test using precision → expected length pairs, covering non-boundary cases that exercise the arithmetic, thanks!

VariantShreddingAnalyzer shredded a decimal with precision > 18 into a
FIXED_LEN_BYTE_ARRAY typed_value with a hardcoded length of 16. The writer
(FixedDecimalWriter) instead emits the minimum number of bytes for the
precision, TypeUtil.decimalRequiredBytes (e.g. 9 bytes for precision 20), so
the write fails with IllegalArgumentException: 'Fixed Binary size 9 does not
match field type length 16'. This breaks write/read of any variant holding an
int128-range value (stored as DECIMAL16), such as an integer with 19-35 digits.

The hardcoded 16 was the only fixed-decimal width in the codebase not derived
from decimalRequiredBytes; TypeToMessageType and every Parquet decimal writer
already use it. decimalRequiredBytes is 9-15 bytes for precision 19-35 and 16
bytes only for precision 36-38, so the constant matched the writer just at
36-38; for precision 19-35 the declared length exceeded the bytes the writer
emits. Set the length to TypeUtil.decimalRequiredBytes(maxPrecision) to realign
this outlier. Precision <= 18 is unaffected (INT32/INT64).
@xiaoxuandev xiaoxuandev force-pushed the fix-variant-shredding-of-large-decimals branch from 79ac9a8 to f63c329 Compare June 29, 2026 21:46
@xiaoxuandev

Copy link
Copy Markdown
Contributor Author

Took a look and the fix is right. The reader side maps any FIXED_LEN_BYTE_ARRAY to DECIMAL16 so there's no file-format compatibility concern.

Thanks for the review! agreed, BinaryAsDecimalReader decodes via new BigInteger(bytes), which is width-agnostic so the length change is physical-only with no read-side impact.

One thing for the description: per TypeUtil.REQUIRED_LENGTH, precision 35 maps to 15 bytes, not 16. So the constant only happened to match the writer at precision 36-38, and the bug fires across precision 19 through 35. Worth correcting since this usually lands in the commit message.

Good catch, fixed the commit message and PR description.

@soumilshah199500

Copy link
Copy Markdown

Thanks @xiaoxuandev for filing the issue

@nssalian nssalian requested review from huaxingao and pvary June 30, 2026 20:17
@nssalian

Copy link
Copy Markdown
Collaborator

Thanks for making the fixes. Added Peter and Huaxin to review this before merging.

@Guosmilesmile Guosmilesmile left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.Thanks for the pr!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants