Parquet: Fix variant shredding of large decimals (precision > 18) by xiaoxuandev · Pull Request #17002 · apache/iceberg

xiaoxuandev · 2026-06-29T18:01:34Z

VariantShreddingAnalyzer shredded a decimal with precision > 18 into a FIXED_LEN_BYTE_ARRAY typed_value with a hardcoded length of 16. The writer (FixedDecimalWriter) instead emits the minimum number of bytes for the precision, TypeUtil.decimalRequiredBytes (e.g. 9 bytes for precision 20), so the write fails with IllegalArgumentException: 'Fixed Binary size 9 does not match field type length 16'. This breaks write/read of any variant holding an int128-range value (stored as DECIMAL16), such as an integer with 19-34 digits.

The hardcoded 16 was the only fixed-decimal width in the codebase not derived from decimalRequiredBytes; TypeToMessageType and every Parquet decimal writer already use it. decimalRequiredBytes is 9-15 bytes for precision 19-35 and 16 bytes only for precision 36-38, so the constant matched the writer just at 36-38; for precision 19-35 the declared length exceeded the bytes the writer emits. Set the length to TypeUtil.decimalRequiredBytes(maxPrecision) to realign this outlier. Precision <= 18 is unaffected (INT32/INT64).

nssalian · 2026-06-29T18:27:41Z

Thanks for raising this @xiaoxuandev. Will take a look.

xiaoxuandev · 2026-06-29T18:34:35Z

Hi @huaxingao @nssalian, small Parquet fix, mind taking a look? Corrects the shredded fixed-decimal length in VariantShreddingAnalyzer (16 → decimalRequiredBytes) that broke DECIMAL16 variant writes.
cc: @Guosmilesmile @pvary for the Flink test bit. Thanks!

nssalian

Took a look and the fix is right. The reader side maps any FIXED_LEN_BYTE_ARRAY to DECIMAL16 so there's no file-format compatibility concern.

One thing for the description: per TypeUtil.REQUIRED_LENGTH, precision 35 maps to 15 bytes, not 16. So the constant only happened to match the writer at precision 36-38, and the bug fires across precision 19 through 35. Worth correcting since this usually lands in the commit message.

nssalian · 2026-06-29T20:11:53Z

+  @TestTemplate
+  public void testShreddedLargeIntegerVariantReadBack() {
+    // A 20-digit integer (DECIMAL16, above Long.MAX_VALUE) shreds to a FIXED_LEN_BYTE_ARRAY
+    // typed_value. Before the fix its declared length (16) exceeded the bytes written, failing the


Could we rephrase this so it describes the invariant instead of the history? Something like:

// DECIMAL16 (20-digit integer, above Long.MAX_VALUE) shreds to a FIXED_LEN_BYTE_ARRAY // typed_value whose declared length must equal what FixedDecimalWriter emits

Fixed, thanks!

nssalian · 2026-06-29T20:12:29Z

            "value",
            optional(PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
-                .length(16)
+                .length(9) // decimalRequiredBytes(21) = 9, not a hardcoded 16


Can we clean up this comment as well along with the other similar ones?

Updated, thanks!

nssalian · 2026-06-29T20:16:07Z

+        .isEqualTo(PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY);
+    assertThat(valPrimitive.getTypeLength())
+        .isEqualTo(TypeUtil.decimalRequiredBytes(expectedPrecision))
+        .isLessThan(16);


I went back to #14297 to see what was done.

The only test it had was for the FIXED_LEN_BYTE_ARRAY branch happened to land on precision 38, where decimalRequiredBytes(38) = 16 and the hardcoded 16 looked correct. Here we are adding coverage at precision 20 (9 bytes) and keeps the precision-38 case, but those are both boundaries of the affected range.

Could we add one more case at a middle precision, e.g. 28 (12 bytes) or 33 (14 bytes)? Easiest as a @ParameterizedTest taking precision + expected byte length, or just one extra @test. One non-boundary case can help prove the fix is doing the arithmetic correctly.

Added the precision test using precision → expected length pairs, covering non-boundary cases that exercise the arithmetic, thanks!

VariantShreddingAnalyzer shredded a decimal with precision > 18 into a FIXED_LEN_BYTE_ARRAY typed_value with a hardcoded length of 16. The writer (FixedDecimalWriter) instead emits the minimum number of bytes for the precision, TypeUtil.decimalRequiredBytes (e.g. 9 bytes for precision 20), so the write fails with IllegalArgumentException: 'Fixed Binary size 9 does not match field type length 16'. This breaks write/read of any variant holding an int128-range value (stored as DECIMAL16), such as an integer with 19-35 digits. The hardcoded 16 was the only fixed-decimal width in the codebase not derived from decimalRequiredBytes; TypeToMessageType and every Parquet decimal writer already use it. decimalRequiredBytes is 9-15 bytes for precision 19-35 and 16 bytes only for precision 36-38, so the constant matched the writer just at 36-38; for precision 19-35 the declared length exceeded the bytes the writer emits. Set the length to TypeUtil.decimalRequiredBytes(maxPrecision) to realign this outlier. Precision <= 18 is unaffected (INT32/INT64).

xiaoxuandev · 2026-06-29T22:01:18Z

Took a look and the fix is right. The reader side maps any FIXED_LEN_BYTE_ARRAY to DECIMAL16 so there's no file-format compatibility concern.

Thanks for the review! agreed, BinaryAsDecimalReader decodes via new BigInteger(bytes), which is width-agnostic so the length change is physical-only with no read-side impact.

One thing for the description: per TypeUtil.REQUIRED_LENGTH, precision 35 maps to 15 bytes, not 16. So the constant only happened to match the writer at precision 36-38, and the bug fires across precision 19 through 35. Worth correcting since this usually lands in the commit message.

Good catch, fixed the commit message and PR description.

soumilshah199500 · 2026-06-30T18:05:34Z

Thanks @xiaoxuandev for filing the issue

nssalian · 2026-06-30T20:17:47Z

Thanks for making the fixes. Added Peter and Huaxin to review this before merging.

Guosmilesmile

LGTM.Thanks for the pr！

github-actions Bot added spark parquet labels Jun 29, 2026

xiaoxuandev force-pushed the fix-variant-shredding-of-large-decimals branch from 3fa6899 to 79ac9a8 Compare June 29, 2026 18:12

github-actions Bot added the flink label Jun 29, 2026

nssalian suggested changes Jun 29, 2026

View reviewed changes

xiaoxuandev force-pushed the fix-variant-shredding-of-large-decimals branch from 79ac9a8 to f63c329 Compare June 29, 2026 21:46

wombatu-kun approved these changes Jun 30, 2026

View reviewed changes

nssalian requested review from huaxingao and pvary June 30, 2026 20:17

Guosmilesmile approved these changes Jul 1, 2026

View reviewed changes

Uh oh!

Conversation

xiaoxuandev commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nssalian commented Jun 29, 2026

Uh oh!

xiaoxuandev commented Jun 29, 2026

Uh oh!

nssalian left a comment

Choose a reason for hiding this comment

Uh oh!

nssalian Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

xiaoxuandev Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

nssalian Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

xiaoxuandev Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

nssalian Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

xiaoxuandev Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

xiaoxuandev commented Jun 29, 2026

Uh oh!

soumilshah199500 commented Jun 30, 2026

Uh oh!

nssalian commented Jun 30, 2026

Uh oh!

Guosmilesmile left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

xiaoxuandev commented Jun 29, 2026 •

edited

Loading