Skip loading the Parquet page index when row-group statistics already prove it cannot prune

## Is your feature request related to a problem or challenge?

The Parquet opener loads the page index (ColumnIndex plus OffsetIndex) for any file whose scan has a page-pruning predicate, before it knows whether the page index can prune anything. For predicates that row-group statistics already resolve, this is pure I/O and parsing overhead that prunes zero pages.

The clearest case is `IS NOT NULL` on a column that has no nulls. In `datafusion/pruning`, `IS NOT NULL` pruning rewrites to `null_count != row_count`, so a container is pruned only when it is entirely null. On a non-null column no page is ever all-null, so the page index is loaded and prunes nothing. On a wide fact table scanned with `IS NOT NULL` filters on non-null join keys, this adds roughly 280 KB of page index per file. Across tens of thousands of files that is gigabytes of wasted reads.

This surfaced downstream in DataFusion Comet (apache/datafusion-comet#3978): a TPC-DS q88 scan loads about 2.8 GB of page index for `IS NOT NULL` filters on non-null foreign keys, pruning nothing.

## Describe the solution you'd like

Gate the page index load on whether row-group statistics leave any work for it to do.

Row-group pruning sorts each row group into one of three buckets:

1. **Pruned**: RG statistics prove no row matches. The whole row group is dropped and the page index is irrelevant.
2. **Fully matched**: RG statistics prove every row matches. The page index cannot prune anything (justified below).
3. **Inconclusive**: RG statistics prove neither. Some rows might match and some might not.

The page index can only prune in bucket 3. Page-index pruning removes a page if and only if the predicate is provably false for every row on that page, and a page is a subset of the row group's rows. In bucket 2 the predicate is provably true for every row in the row group, and therefore for every row on every page, so no page can consist entirely of non-matching rows and none is prunable. There is nothing left to refine. In bucket 3 some rows may not match, and those rows may be concentrated on pages the page index can isolate, so the page index can still refine the scan and must be loaded.

So the rule is: **skip the page index load only when every surviving row group is in bucket 2 (fully matched). A single bucket 3 row group forces the load.** Note that "row group could not be pruned" is the wrong condition, because it merges buckets 2 and 3.
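The rule above can be sketched as a small predicate over per-row-group outcomes. This is a minimal illustration, not DataFusion code: `RowGroupOutcome` and `page_index_may_prune` are hypothetical names introduced here, and the real opener tracks this state differently.

```rust
/// Hypothetical classification of one row group after row-group
/// statistics pruning (the three buckets described above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowGroupOutcome {
    /// Bucket 1: statistics prove no row matches.
    Pruned,
    /// Bucket 2: statistics prove every row matches.
    FullyMatched,
    /// Bucket 3: statistics prove neither.
    Inconclusive,
}

/// Returns true when the page index could still prune at least one page,
/// i.e. when at least one row group is inconclusive.
/// Pruned row groups are dropped and fully matched ones cannot be refined,
/// so neither kind forces the load.
fn page_index_may_prune(outcomes: &[RowGroupOutcome]) -> bool {
    outcomes
        .iter()
        .any(|o| matches!(o, RowGroupOutcome::Inconclusive))
}
```

Under this sketch a file whose row groups are all pruned or fully matched skips the load, while a single inconclusive row group forces it.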

DataFusion already computes the relevant signal. PR #21637 added "fully matched" detection and uses it to skip page-index pruning work for fully-matched row groups. For `IS NOT NULL`, a row group with `null_count == 0` is fully matched.

The gap is ordering. The opener state machine (`datafusion/datasource-parquet/src/opener/mod.rs`) runs:

```
LoadMetadata (footer, PageIndexPolicy::Skip)
  -> PrepareFilters
  -> LoadPageIndex            // page index I/O happens here
  -> PruneWithStatistics      // row-group stats pruning / fully-matched decided here
  -> ...
```

`LoadPageIndex` runs before `PruneWithStatistics`, so the fully-matched determination that would prove the page index useless happens after the bytes are already fetched. The existing optimization saves CPU (skips page-index pruning work) but not I/O.

Proposed change: make the fully-matched determination available before the page index load, and skip `load_page_index` when every surviving row group is fully matched by the page-pruning predicate using row-group statistics alone. Row-group statistics are present in the footer already loaded under `PageIndexPolicy::Skip`, so no extra I/O is required to make this decision.

Concretely for the `IS NOT NULL` case: skip the load when, for every referenced column, the row-group statistics report `null_count == Some(0)`.
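As a rough sketch of that condition, assuming a simplified view of the footer statistics: `ChunkStats` and `can_skip_for_is_not_null` are hypothetical names for illustration, not types from the `parquet` crate or DataFusion. The input is assumed to contain only surviving row groups and only the columns the predicate references.

```rust
/// Hypothetical, simplified view of one column chunk's footer statistics.
struct ChunkStats {
    /// `None` models a Statistics struct that omits `null_count`.
    null_count: Option<u64>,
}

/// `row_groups[i][j]` holds the statistics for referenced column `j` in
/// surviving row group `i`; `None` models a chunk with no statistics.
/// Returns true only when every entry proves `null_count == 0`.
/// Anything missing is treated as "not provably zero".
fn can_skip_for_is_not_null(row_groups: &[Vec<Option<ChunkStats>>]) -> bool {
    row_groups.iter().all(|columns| {
        columns
            .iter()
            .all(|c| matches!(c, Some(ChunkStats { null_count: Some(0) })))
    })
}
```

The check is deliberately conservative: a positive, missing, or unknown null count in any referenced column of any surviving row group keeps the page index load.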

## Describe alternatives you've considered

- Classify the page-pruning predicate by which statistics it uses (`StatisticsType` in the pruning predicate's `RequiredColumns`) and skip the load when it references only `NullCount` / `RowCount` and never `Min` / `Max`. This is narrower than the fully-matched approach and still needs the row-group null-count gate. The fully-matched route is preferred because it already exists and covers more predicate shapes.

- Cache the full metadata including the page index so repeated opens of the same file pay the load only once. This helps when the page index is actually useful but does not help the non-selective case, where the cheapest fix is to not load it at all.

## Additional context

Correctness notes for the gate:

- **Fully matched must be null-aware.** For a predicate that rejects nulls, such as `x > 50`, fully matched requires `min_value > 50` and `null_count == 0`. If the null count is positive, an all-null page would be pruned by `x > 50`, so the page index still has value and the load must not be skipped. The gate is only as correct as the underlying fully-matched computation's null handling, so it must depend on the null-aware definition. This should be verified in the #21637 logic before relying on it.

- **Missing statistics fall back to loading.** `Statistics.null_count` is `optional` in the Parquet thrift spec, and a column chunk may carry no `Statistics` at all. Treat a missing `null_count` (or missing statistics) as "not provably zero" and load the page index. The `IS NOT NULL` skip condition is therefore "statistics present and `null_count == Some(0)` for all referenced columns," conservatively false otherwise. Modern writers emit row-group `null_count` in practice, so the common case still benefits.

- **The fully-matched determination must use row-group statistics only**, never the page index, since the whole point is to decide whether to load the page index.

- **The change is a reorder of the opener state machine** so that row-group-stats pruning / fully-matched runs before the page index load. The staged structs (`FiltersPreparedParquetOpen`, `RowGroupsPrunedParquetOpen`, and related) need rewiring, and the bloom-filter stage should be checked for any dependence on the current ordering.
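To illustrate the null-aware requirement in the first note, here is a minimal sketch for a predicate of the form `x > threshold` on an integer column. `Int64Stats` and `fully_matched_gt` are hypothetical names, and this is not the #21637 implementation, which still needs to be checked against this definition.

```rust
/// Hypothetical row-group statistics for a single Int64 column.
struct Int64Stats {
    min: Option<i64>,
    null_count: Option<u64>,
}

/// Returns true only when the statistics prove that every row satisfies
/// `x > threshold`: the minimum exceeds the threshold AND there are
/// provably no nulls. Missing statistics are never treated as a proof.
fn fully_matched_gt(stats: &Int64Stats, threshold: i64) -> bool {
    match (stats.min, stats.null_count) {
        (Some(min), Some(0)) => min > threshold,
        // Nulls present, or min / null_count unknown: not fully matched.
        _ => false,
    }
}
```

With `min = 51` and `null_count = 2`, the row group is not fully matched for `x > 50`, because an all-null page could still be pruned by the page index.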

Relevant code:

- Opener state machine and stages: `datafusion/datasource-parquet/src/opener/mod.rs`
- Page index load helper (the `missing_column_index || missing_offset_index` guard): `load_page_index` in the same file
- Fully-matched page pruning: `PagePruningAccessPlanFilter` in `datafusion/datasource-parquet/src/page_filter.rs`
- `IS NOT NULL` rewrite to `null_count != row_count`: `datafusion/pruning/src/pruning_predicate.rs`

