Arrow: Fix vectorized read of all-null DELTA-encoded Parquet pages #17017

Open
raunaqmorarka wants to merge 1 commit into apache:main from raunaqmorarka:fix-delta-all-null-page

@raunaqmorarka

Contributor

Problem

Vectorized reads of Parquet V2 files crash with ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0 when a string/binary column page using DELTA_BYTE_ARRAY / DELTA_LENGTH_BYTE_ARRAY has zero non-null values (an all-null page).

at VectorizedDeltaEncodedValuesReader.lambda$readIntegers$3(VectorizedDeltaEncodedValuesReader.java:127)
at VectorizedDeltaEncodedValuesReader.readValues(VectorizedDeltaEncodedValuesReader.java:153)
at VectorizedDeltaEncodedValuesReader.readIntegers(VectorizedDeltaEncodedValuesReader.java:122)

Cause

An all-null page decodes to a length stream with totalValueCount == 0. readIntegers(0, …) allocates a zero-length array, but readValues always wrote firstValue into result[0].

Fix

Skip the first-value write when total == 0.
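
For illustration, here is a minimal standalone sketch of the failure mode and the guard. It is not the actual `VectorizedDeltaEncodedValuesReader` code: the class `DeltaFirstValueSketch`, the methods `decodeUnguarded` and `decodeGuarded`, and the plain `int[]` delta input are hypothetical simplifications that only mirror the shape of the bug (an unconditional first-value write into an array sized by the value count).

```java
import java.util.Arrays;

// Hypothetical, simplified model of a delta decoder; names are illustrative only.
final class DeltaFirstValueSketch {

  // Pre-fix shape: the first value is always stored, even when total == 0.
  static int[] decodeUnguarded(int total, int firstValue, int[] deltas) {
    int[] result = new int[total];
    result[0] = firstValue; // throws ArrayIndexOutOfBoundsException when total == 0
    for (int i = 1; i < total; i++) {
      result[i] = result[i - 1] + deltas[i - 1];
    }
    return result;
  }

  // Post-fix shape: skip the first-value write when there are no values.
  static int[] decodeGuarded(int total, int firstValue, int[] deltas) {
    int[] result = new int[total];
    if (total == 0) {
      return result; // all-null page: nothing to decode
    }
    result[0] = firstValue;
    for (int i = 1; i < total; i++) {
      result[i] = result[i - 1] + deltas[i - 1];
    }
    return result;
  }

  public static void main(String[] args) {
    // Normal page: first value 10, deltas +2 and -1.
    System.out.println(Arrays.toString(decodeGuarded(3, 10, new int[] {2, -1})));
    // All-null page: zero values, guarded decode returns an empty array.
    System.out.println(decodeGuarded(0, 0, new int[0]).length);
    try {
      decodeUnguarded(0, 0, new int[0]);
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("unguarded decode failed: " + e.getMessage());
    }
  }
}
```

The guard returns early rather than wrapping only the assignment, which keeps the empty-page case visibly separate from the decode loop; either form avoids the out-of-bounds write.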

Test

Added a Spark vectorized-read test with all-null string and binary columns in a V2, dictionary-disabled file. It reproduces the crash before the fix and passes after.

@robinsinghstudios

Eagerly looking forward to this getting merged :)

An all-null page decodes to a length stream with totalValueCount 0, so
readValues wrote firstValue into a zero-length array.
@raunaqmorarka raunaqmorarka force-pushed the fix-delta-all-null-page branch from ae79239 to b330599 Compare June 30, 2026 09:28