[core] Support OR/nested partition predicate pruning in format table scan #8367

Open
Zouxxyy wants to merge 2 commits into apache:master from Zouxxyy:xinyu/format-or-partition-prune

Zouxxyy (Contributor) commented Jun 27, 2026

Purpose

Support generic partition-directory pruning for format table scans when partition predicates contain OR/nested expressions.

Previously, pruning used a per-field `Map<String, Predicate>` built from `splitAnd`. Any top-level conjunct that referenced more than one partition field (for example `(dt='20260625' AND hour<'16') OR (dt='20260624' AND hour>='16')`) could not be assigned to a single field, so it was dropped and the scan fell back to listing every partition directory.
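
To illustrate why the per-field model loses cross-field predicates, here is a minimal sketch. It is not Paimon's code: `PerFieldPruningSketch`, `Pred`, `leaf`, `and`, `or` and `perField` are hypothetical names, and only the general idea (split on top-level AND, then keep conjuncts that touch exactly one field) is taken from the description above.

```java
import java.util.*;

final class PerFieldPruningSketch {
    /** Hypothetical predicate node: a leaf on one field, or an AND/OR over children. */
    static final class Pred {
        final String kind;          // "LEAF", "AND" or "OR"
        final String field;         // set for leaves only
        final String text;          // human-readable condition, for leaves only
        final List<Pred> children;  // empty for leaves

        Pred(String kind, String field, String text, List<Pred> children) {
            this.kind = kind;
            this.field = field;
            this.text = text;
            this.children = children;
        }

        /** All partition fields referenced anywhere below this node. */
        Set<String> fields() {
            Set<String> out = new TreeSet<>();
            if (field != null) out.add(field);
            for (Pred c : children) out.addAll(c.fields());
            return out;
        }
    }

    static Pred leaf(String field, String text) {
        return new Pred("LEAF", field, text, Collections.<Pred>emptyList());
    }
    static Pred and(Pred... ps) { return new Pred("AND", null, null, Arrays.asList(ps)); }
    static Pred or(Pred... ps) { return new Pred("OR", null, null, Arrays.asList(ps)); }

    /** Flattens nested top-level ANDs into a list of conjuncts. */
    static List<Pred> splitAnd(Pred p) {
        List<Pred> out = new ArrayList<>();
        if (p.kind.equals("AND")) {
            for (Pred c : p.children) out.addAll(splitAnd(c));
        } else {
            out.add(p);
        }
        return out;
    }

    /** Groups conjuncts by field; conjuncts touching several fields are lost. */
    static Map<String, List<Pred>> perField(Pred p) {
        Map<String, List<Pred>> byField = new TreeMap<>();
        for (Pred conjunct : splitAnd(p)) {
            Set<String> fields = conjunct.fields();
            if (fields.size() == 1) {
                byField.computeIfAbsent(fields.iterator().next(), k -> new ArrayList<>())
                        .add(conjunct);
            }
            // A conjunct with more than one field has no single key, so it is dropped.
        }
        return byField;
    }
}
```

Under these assumptions, the OR example is a single conjunct that references both `dt` and `hour`, so `perField` returns an empty map and nothing is left to prune with, while a plain `dt=... AND hour<...` yields one entry per field.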

This change:

  • replaces the per-field pruning model with incremental partial evaluation of the whole partition predicate during directory descent, so AND/OR/nested/cross-field partition predicates can prune format-table directories correctly;
  • keeps scan-path prefix optimization unchanged for leading equality conjuncts;
  • fixes `format-table.partition-path-only-value=true` with default/null partitions by allowing the configured default partition directory name through hidden-path filtering and by passing `defaultPartName` into `listPartitionEntries`.
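
The incremental partial evaluation in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `PartialEvalSketch`, `Pred`, `leaf`, `and`, `or` and `survivingPartitions` are hypothetical names, partition values are compared as plain strings, the directory tree is simulated by a fixed list of values per level, and null/default partitions are not modeled. Only the rule itself (a leaf is decided once its field is bound on the current path, otherwise it counts as possibly matching) comes from the commit message.

```java
import java.util.*;

final class PartialEvalSketch {
    /** Hypothetical predicate: false means "no partition below this path can match". */
    interface Pred {
        boolean mightMatch(Map<String, String> bound);
    }

    static Pred leaf(String field, String op, String literal) {
        return bound -> {
            String value = bound.get(field);
            if (value == null) return true;        // field not bound yet: cannot decide
            int c = value.compareTo(literal);      // assumption: string comparison
            switch (op) {
                case "=":  return c == 0;
                case "<":  return c < 0;
                case "<=": return c <= 0;
                case ">":  return c > 0;
                case ">=": return c >= 0;
                default:   return true;            // unknown operator: never prune
            }
        };
    }

    static Pred and(Pred... ps) {
        return bound -> {
            for (Pred p : ps) if (!p.mightMatch(bound)) return false;
            return true;
        };
    }

    static Pred or(Pred... ps) {
        return bound -> {
            for (Pred p : ps) if (p.mightMatch(bound)) return true;
            return false;
        };
    }

    /** Simulated descent: keys.get(i) takes each of values.get(i) at depth i. */
    static List<String> survivingPartitions(Pred p, List<String> keys, List<List<String>> values) {
        List<String> out = new ArrayList<>();
        descend(p, keys, values, 0, new HashMap<String, String>(), "", out);
        return out;
    }

    private static void descend(Pred p, List<String> keys, List<List<String>> values, int depth,
                                Map<String, String> bound, String path, List<String> out) {
        if (depth == keys.size()) {
            out.add(path);
            return;
        }
        String key = keys.get(depth);
        for (String v : values.get(depth)) {
            bound.put(key, v);
            // Re-evaluate the whole predicate with one more field bound; skip dead subtrees.
            if (p.mightMatch(bound)) {
                descend(p, keys, values, depth + 1, bound, path + key + "=" + v + "/", out);
            }
            bound.remove(key);
        }
    }
}
```

With the predicate from the description, `dt` values `20260623`, `20260624`, `20260625` and `hour` values `15`, `16`, this sketch skips the `dt=20260623` subtree without listing it and keeps only `dt=20260624/hour=16/` and `dt=20260625/hour=15/`. Because unbound leaves evaluate to true and the tree contains only AND/OR, a directory is pruned only when no completion of the path could satisfy the predicate.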

Tests

  • `mvn -pl paimon-core -Pfast-build -Dtest='FormatTableScanTest,PartitionPathUtilsTest' -DfailIfNoTests=false test`
  • `mvn -pl paimon-spark/paimon-spark-ut -Pspark3,fast-build -DskipTests install`
  • `mvn -pl paimon-spark/paimon-spark-3.5 -Pspark3,fast-build -DfailIfNoTests=false -DwildcardSuites=org.apache.paimon.spark.sql.FormatTableTest -Dtest=none test`

Zouxxyy and others added 2 commits June 26, 2026 16:02
…scan

Format table scan pruned partition directories via a per-field
Map<String,Predicate> built with splitAnd, which dropped any predicate
referencing more than one partition field. A cross-field OR such as
(dt='a' AND hour<'16') OR (dt='b' AND hour>='16') was therefore dropped
entirely, falling back to listing every partition directory.

Replace the per-field model with incremental partial evaluation of the
whole partition predicate during directory descent (mightMatch): a leaf
is decided only once all the fields it references are bound along the
current path, otherwise it is treated as possibly-matching. This prunes
AND/OR/nested/cross-field predicates uniformly and never prunes a
directory that could contain a match. Works for both the default
key=value layout and format-table.partition-path-only-value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In format-table.partition-path-only-value mode, the default partition
name (__DEFAULT_PARTITION__ by default) starts with '_' and was treated
as a hidden path by PartitionPathUtils. This caused null/default
partitions to be skipped entirely during partition discovery.

Allow the configured default partition directory name through the hidden
path check in only-value mode, and pass defaultPartName into
FormatTableScan.listPartitionEntries so that listPartitionEntries uses
the same semantics as findPartitions.

Also add regression tests for null/default partitions in OR pruning and
listPartitionEntries, and keep the prefix+OR edge case covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
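
The hidden-path fix described in the second commit can be sketched as below. The names `HiddenPathSketch`, `isHidden` and `shouldSkip` are hypothetical and do not reflect the actual `PartitionPathUtils` signatures; treating a leading `.` as hidden is also an assumption, since the commit message only mentions the leading `_`.

```java
final class HiddenPathSketch {
    /** Assumed hidden-path rule: names starting with '_' or '.' are skipped. */
    static boolean isHidden(String dirName) {
        return dirName.startsWith("_") || dirName.startsWith(".");
    }

    /**
     * In only-value mode the directory name is the partition value itself, so the
     * configured default partition name must be let through even though it starts with '_'.
     */
    static boolean shouldSkip(String dirName, boolean onlyValue, String defaultPartName) {
        if (onlyValue && dirName.equals(defaultPartName)) {
            return false;
        }
        return isHidden(dirName);
    }
}
```

In the default `key=value` layout the directory would be named like `dt=__DEFAULT_PARTITION__`, which does not start with `_`, so the exemption is only needed when `format-table.partition-path-only-value` is enabled.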