Is your feature request related to a problem or challenge?
The Parquet opener loads the page index (ColumnIndex plus OffsetIndex) for any file whose scan has a page-pruning predicate, before it knows whether the page index can prune anything. For predicates that row-group statistics already resolve, this is pure I/O and parsing overhead that prunes zero pages.
The clearest case is IS NOT NULL on a column that has no nulls. In datafusion/pruning, IS NOT NULL pruning rewrites to null_count != row_count, so a container is pruned only when it is entirely null. On a non-null column no page is ever all-null, so the page index is loaded and prunes nothing. On a wide fact table scanned with IS NOT NULL filters on non-null join keys, this adds roughly 280 KB of page index per file. Across tens of thousands of files that is gigabytes of wasted reads.
This surfaced downstream in DataFusion Comet (apache/datafusion-comet#3978): a TPC-DS q88 scan loads about 2.8 GB of page index for IS NOT NULL filters on non-null foreign keys, pruning nothing.
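To make the IS NOT NULL rewrite concrete, here is a minimal sketch of the keep/prune decision for one container (a row group or a page). The function name is hypothetical and is not the actual code in datafusion/pruning; it only illustrates the null_count != row_count condition described above.

```rust
/// Hypothetical helper, for illustration only (not a DataFusion API).
/// Evaluates the rewritten `IS NOT NULL` pruning expression for one container.
/// Returns true when the container must be kept and false when it can be pruned.
fn is_not_null_keeps_container(null_count: u64, row_count: u64) -> bool {
    // The container is pruned only when every row is null,
    // i.e. when null_count == row_count.
    null_count != row_count
}
```

On a column with no nulls, null_count is 0 for every non-empty page, so this returns true for every page and nothing is pruned.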
Describe the solution you'd like
Gate the page index load on whether row-group statistics leave any work for it to do.
Row-group pruning sorts each row group into one of three buckets:
1. Pruned: RG statistics prove no row matches. The whole row group is dropped and the page index is irrelevant.
2. Fully matched: RG statistics prove every row matches. The page index cannot prune anything (justified below).
3. Inconclusive: RG statistics prove neither. Some rows might match and some might not.
The page index can only prune in bucket 3. Page-index pruning removes a page if and only if the predicate is provably false for every row on that page. A page is a subset of the row group's rows. In bucket 2 the predicate is provably true for every row in the row group, so it is true for every row on every page, so no page can be all-non-matching and no page is prunable. There is nothing left to refine. In bucket 3 there exist possibly-non-matching rows that may be concentrated on some pages the page index can isolate, so the page index does refine and must be loaded.
So the rule is: skip the page index load only when every surviving row group is in bucket 2 (fully matched). A single bucket 3 row group forces the load. Note that "row group could not be pruned" is the wrong condition, because it merges buckets 2 and 3.
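The rule above can be sketched as follows. The enum and function names are assumptions made for illustration and do not correspond to existing DataFusion types; the sketch only shows that the gate depends on whether any surviving row group is inconclusive.

```rust
/// Hypothetical classification of one row group after evaluating the
/// page-pruning predicate against row-group statistics only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowGroupVerdict {
    /// Bucket 1: statistics prove no row matches.
    Pruned,
    /// Bucket 2: statistics prove every row matches.
    FullyMatched,
    /// Bucket 3: statistics prove neither.
    Inconclusive,
}

/// Returns true when the page index could still prune something, which is
/// the case only if at least one row group is inconclusive.
fn should_load_page_index(verdicts: &[RowGroupVerdict]) -> bool {
    verdicts
        .iter()
        .any(|v| *v == RowGroupVerdict::Inconclusive)
}
```

Note that a check such as "any row group is not Pruned" would return true for a file whose row groups are all fully matched, which is exactly the wasted load this issue describes.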
DataFusion already computes the relevant signal. PR #21637 added "fully matched" detection and uses it to skip page-index pruning work for fully-matched row groups. For IS NOT NULL, a row group with null_count == 0 is fully matched.
The gap is ordering. The opener state machine (datafusion/datasource-parquet/src/opener/mod.rs) runs:
LoadMetadata (footer, PageIndexPolicy::Skip)
-> PrepareFilters
-> LoadPageIndex // page index I/O happens here
-> PruneWithStatistics // row-group stats pruning / fully-matched decided here
-> ...
LoadPageIndex runs before PruneWithStatistics, so the fully-matched determination that would prove the page index useless happens after the bytes are already fetched. The existing optimization saves CPU (skips page-index pruning work) but not I/O.
Proposed change: make the fully-matched determination available before the page index load, and skip load_page_index when every surviving row group is fully matched by the page-pruning predicate using row-group statistics alone. Row-group statistics are present in the footer already loaded under PageIndexPolicy::Skip, so no extra I/O is required to make this decision.
Concretely for the IS NOT NULL case: skip the load when, for every referenced column, the row-group statistics report null_count == Some(0).
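A minimal sketch of that IS NOT NULL check, assuming the null counts have already been extracted from the footer into plain Option<u64> values (one inner Vec per surviving row group, one entry per referenced column). The function name and the input layout are hypothetical; the real implementation would read these values from the Parquet row-group metadata.

```rust
/// Hypothetical gate for the `IS NOT NULL` case (not a DataFusion API).
/// `null_counts[rg][col]` is the row-group null count for a referenced column,
/// or None when the statistics or the null count are missing.
/// Returns true only when every value is provably zero.
fn can_skip_page_index_for_is_not_null(null_counts: &[Vec<Option<u64>>]) -> bool {
    null_counts
        .iter()
        .all(|row_group| row_group.iter().all(|nc| *nc == Some(0)))
}
```

Comparing against Some(0) makes a missing null count fail the check, so the gate falls back to loading the page index whenever statistics are absent.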
Describe alternatives you've considered
Classify the page-pruning predicate by which statistics it uses (StatisticsType in the pruning predicate's RequiredColumns) and skip the load when it references only NullCount / RowCount and never Min / Max. This is narrower than the fully-matched approach and still needs the row-group null-count gate, so the fully-matched route is preferred because it already exists and covers more predicate shapes.
Cache the full metadata including the page index so repeated opens of the same file pay the load only once. This helps when the page index is actually useful but does not help the non-selective case, where the cheapest fix is to not load it at all.
Additional context
Correctness notes for the gate:
Fully matched must be null-aware. For a predicate that rejects nulls, such as x > 50, fully matched requires min_value > 50 and null_count == 0. If the null count is positive, an all-null page would be pruned by x > 50, so the page index still has value and the load must not be skipped. The gate is only as correct as the underlying fully-matched computation's null handling, so it must depend on the null-aware definition. This should be verified in the #21637 logic (Skip RowFilter and page pruning for fully matched row groups) before relying on it.
Missing statistics fall back to loading. Statistics.null_count is optional in the Parquet thrift spec, and a column chunk may carry no Statistics at all. Treat a missing null_count (or missing statistics) as "not provably zero" and load the page index. The IS NOT NULL skip condition is therefore "statistics present and null_count == Some(0) for all referenced columns," conservatively false otherwise. Modern writers emit row-group null_count in practice, so the common case still benefits.
The fully-matched determination must use row-group statistics only, never the page index, since the whole point is to decide whether to load the page index.
The change is a reorder of the opener state machine so that row-group-stats pruning / fully-matched runs before the page index load. The staged structs (FiltersPreparedParquetOpen, RowGroupsPrunedParquetOpen, and related) need rewiring, and the bloom-filter stage should be checked for any dependence on the current ordering.
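The null-aware requirement from the first note can be illustrated with a small sketch for a predicate of the form x > threshold on an integer column. The struct and function are hypothetical and simplified (a single i64 column, statistics already decoded); they are not the #21637 implementation, which still needs to be checked against this definition.

```rust
/// Hypothetical, simplified row-group statistics for one i64 column.
struct ColumnChunkStats {
    /// Minimum value, or None when statistics are missing.
    min: Option<i64>,
    /// Null count, or None when it was not written.
    null_count: Option<u64>,
}

/// Returns true only when `x > threshold` is provably true for every row:
/// the minimum exceeds the threshold AND there are provably no nulls.
/// Missing statistics are treated as "not provable".
fn fully_matched_gt(stats: &ColumnChunkStats, threshold: i64) -> bool {
    matches!(
        (stats.min, stats.null_count),
        (Some(min), Some(0)) if min > threshold
    )
}
```

With min = 60 and null_count = 5, this returns false for x > 50: an all-null page could still be pruned, so the page index load must not be skipped.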
Relevant code:
Opener state machine and stages: datafusion/datasource-parquet/src/opener/mod.rs
Page index load helper (the missing_column_index || missing_offset_index guard): load_page_index in the same file
Fully-matched page pruning: PagePruningAccessPlanFilter in datafusion/datasource-parquet/src/page_filter.rs
IS NOT NULL rewrite to null_count != row_count: datafusion/pruning/src/pruning_predicate.rs