Search before asking
Motivation
Paimon already has table and column statistics exposed through Statistics and ColStats, and Spark can use these statistics through PaimonStatistics for CBO. However, the current
NDV information is stored as a scalar distinctCount, which is not naturally mergeable or incrementally maintainable.
For large append-heavy tables, recomputing full-table column statistics can be expensive. At the same time, simply combining scalar NDV values from partitions or snapshots is not
correct: summing them overestimates when values overlap, while taking the maximum underestimates in many cases. As a result, NDV statistics can easily become stale or unavailable for
large tables, which affects Spark's cardinality estimation, join reordering, broadcast decisions, and aggregation planning.
For example, a table may append new data every day and queries may frequently join or aggregate by a high-cardinality column such as user_id or device_id. Each day's data can have
its own distinct value count, but the global distinct count cannot be derived correctly from those scalar counts because the same values may appear across multiple days. A mergeable
sketch, such as a Theta sketch, can represent each partition or snapshot as a compact summary and produce an approximate global NDV by unioning those sketches.
This proposal focuses on adding the missing statistics infrastructure in a backward-compatible way. Existing scalar statistics can continue to be exposed to engines, while richer
statistics can be stored in optional sidecar files and consumed when available.
Solution
No response
Anything else?
No response
Are you willing to submit a PR?
Search before asking
Motivation
Paimon already has table and column statistics exposed through
StatisticsandColStats, and Spark can use these statistics throughPaimonStatisticsfor CBO. However, the currentNDV information is stored as a scalar
distinctCount, which is not naturally mergeable or incrementally maintainable.For large append-heavy tables, recomputing full-table column statistics can be expensive. At the same time, simply combining scalar NDV values from partitions or snapshots is not
correct: summing them overestimates when values overlap, while taking the maximum underestimates in many cases. As a result, NDV statistics can easily become stale or unavailable for
large tables, which affects Spark's cardinality estimation, join reordering, broadcast decisions, and aggregation planning.
For example, a table may append new data every day and queries may frequently join or aggregate by a high-cardinality column such as
user_idordevice_id. Each day's data can haveits own distinct value count, but the global distinct count cannot be derived correctly from those scalar counts because the same values may appear across multiple days. A mergeable
sketch, such as a Theta sketch, can represent each partition or snapshot as a compact summary and produce an approximate global NDV by unioning those sketches.
This proposal focuses on adding the missing statistics infrastructure in a backward-compatible way. Existing scalar statistics can continue to be exposed to engines, while richer
statistics can be stored in optional sidecar files and consumed when available.
Solution
No response
Anything else?
No response
Are you willing to submit a PR?