Parquet: Fix variant shredding of large decimals (precision > 18)#17002
Parquet: Fix variant shredding of large decimals (precision > 18)#17002xiaoxuandev wants to merge 1 commit into
Conversation
3fa6899 to
79ac9a8
Compare
|
Thanks for raising this @xiaoxuandev. Will take a look. |
|
Hi @huaxingao @nssalian, small Parquet fix, mind taking a look? Corrects the shredded fixed-decimal length in VariantShreddingAnalyzer (16 → decimalRequiredBytes) that broke DECIMAL16 variant writes. |
nssalian
left a comment
There was a problem hiding this comment.
Took a look and the fix is right. The reader side maps any FIXED_LEN_BYTE_ARRAY to DECIMAL16 so there's no file-format compatibility concern.
One thing for the description: per TypeUtil.REQUIRED_LENGTH, precision 35 maps to 15 bytes, not 16. So the constant only happened to match the writer at precision 36-38, and the bug fires across precision 19 through 35. Worth correcting since this usually lands in the commit message.
| @TestTemplate | ||
| public void testShreddedLargeIntegerVariantReadBack() { | ||
| // A 20-digit integer (DECIMAL16, above Long.MAX_VALUE) shreds to a FIXED_LEN_BYTE_ARRAY | ||
| // typed_value. Before the fix its declared length (16) exceeded the bytes written, failing the |
There was a problem hiding this comment.
Could we rephrase this so it describes the invariant instead of the history? Something like:
// DECIMAL16 (20-digit integer, above Long.MAX_VALUE) shreds to a FIXED_LEN_BYTE_ARRAY
// typed_value whose declared length must equal what FixedDecimalWriter emits
| "value", | ||
| optional(PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY) | ||
| .length(16) | ||
| .length(9) // decimalRequiredBytes(21) = 9, not a hardcoded 16 |
There was a problem hiding this comment.
Can we clean up this comment as well along with the other similar ones?
There was a problem hiding this comment.
Updated, thanks!
| .isEqualTo(PrimitiveType.PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY); | ||
| assertThat(valPrimitive.getTypeLength()) | ||
| .isEqualTo(TypeUtil.decimalRequiredBytes(expectedPrecision)) | ||
| .isLessThan(16); |
There was a problem hiding this comment.
I went back to #14297 to see what was done.
The only test it had was for the FIXED_LEN_BYTE_ARRAY branch happened to land on precision 38, where decimalRequiredBytes(38) = 16 and the hardcoded 16 looked correct. Here we are adding coverage at precision 20 (9 bytes) and keeps the precision-38 case, but those are both boundaries of the affected range.
Could we add one more case at a middle precision, e.g. 28 (12 bytes) or 33 (14 bytes)? Easiest as a @ParameterizedTest taking precision + expected byte length, or just one extra @test. One non-boundary case can help prove the fix is doing the arithmetic correctly.
There was a problem hiding this comment.
Added the precision test using precision → expected length pairs, covering non-boundary cases that exercise the arithmetic, thanks!
VariantShreddingAnalyzer shredded a decimal with precision > 18 into a FIXED_LEN_BYTE_ARRAY typed_value with a hardcoded length of 16. The writer (FixedDecimalWriter) instead emits the minimum number of bytes for the precision, TypeUtil.decimalRequiredBytes (e.g. 9 bytes for precision 20), so the write fails with IllegalArgumentException: 'Fixed Binary size 9 does not match field type length 16'. This breaks write/read of any variant holding an int128-range value (stored as DECIMAL16), such as an integer with 19-35 digits. The hardcoded 16 was the only fixed-decimal width in the codebase not derived from decimalRequiredBytes; TypeToMessageType and every Parquet decimal writer already use it. decimalRequiredBytes is 9-15 bytes for precision 19-35 and 16 bytes only for precision 36-38, so the constant matched the writer just at 36-38; for precision 19-35 the declared length exceeded the bytes the writer emits. Set the length to TypeUtil.decimalRequiredBytes(maxPrecision) to realign this outlier. Precision <= 18 is unaffected (INT32/INT64).
79ac9a8 to
f63c329
Compare
Thanks for the review! agreed, BinaryAsDecimalReader decodes via new BigInteger(bytes), which is width-agnostic so the length change is physical-only with no read-side impact.
Good catch, fixed the commit message and PR description. |
|
Thanks @xiaoxuandev for filing the issue |
|
Thanks for making the fixes. Added Peter and Huaxin to review this before merging. |
Guosmilesmile
left a comment
There was a problem hiding this comment.
LGTM.Thanks for the pr!
VariantShreddingAnalyzershredded a decimal with precision > 18 into aFIXED_LEN_BYTE_ARRAYtyped_value with a hardcoded length of 16. The writer (FixedDecimalWriter) instead emits the minimum number of bytes for the precision,TypeUtil.decimalRequiredBytes(e.g. 9 bytes for precision 20), so the write fails withIllegalArgumentException: 'Fixed Binary size 9 does not match field type length 16'. This breaks write/read of any variant holding an int128-range value (stored as DECIMAL16), such as an integer with 19-34 digits.The hardcoded 16 was the only fixed-decimal width in the codebase not derived from decimalRequiredBytes; TypeToMessageType and every Parquet decimal writer already use it. decimalRequiredBytes is 9-15 bytes for precision 19-35 and 16 bytes only for precision 36-38, so the constant matched the writer just at 36-38; for precision 19-35 the declared length exceeded the bytes the writer emits. Set the length to TypeUtil.decimalRequiredBytes(maxPrecision) to realign this outlier. Precision <= 18 is unaffected (INT32/INT64).