Skip to content

[Feature] Add statistics sidecar with mergeable NDV sketches #8381

Description

@kerwin-zk

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Paimon already has table and column statistics exposed through Statistics and ColStats, and Spark can use these statistics through PaimonStatistics for CBO. However, the current
NDV information is stored as a scalar distinctCount, which is not naturally mergeable or incrementally maintainable.

For large append-heavy tables, recomputing full-table column statistics can be expensive. At the same time, simply combining scalar NDV values from partitions or snapshots is not
correct: summing them overestimates when values overlap, while taking the maximum underestimates in many cases. As a result, NDV statistics can easily become stale or unavailable for
large tables, which affects Spark's cardinality estimation, join reordering, broadcast decisions, and aggregation planning.

For example, a table may append new data every day and queries may frequently join or aggregate by a high-cardinality column such as user_id or device_id. Each day's data can have
its own distinct value count, but the global distinct count cannot be derived correctly from those scalar counts because the same values may appear across multiple days. A mergeable
sketch, such as a Theta sketch, can represent each partition or snapshot as a compact summary and produce an approximate global NDV by unioning those sketches.

This proposal focuses on adding the missing statistics infrastructure in a backward-compatible way. Existing scalar statistics can continue to be exposed to engines, while richer
statistics can be stored in optional sidecar files and consumed when available.

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions