Skip to content

refactor(physical-plan): externalize statistics traversal into StatisticsContext#23051

Open
asolimando wants to merge 1 commit into
apache:mainfrom
asolimando:asolimando/22958-statistics-from-inputs
Open

refactor(physical-plan): externalize statistics traversal into StatisticsContext#23051
asolimando wants to merge 1 commit into
apache:mainfrom
asolimando:asolimando/22958-statistics-from-inputs

Conversation

@asolimando

Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

Follow-up to #21815. Per @2010YOUY01's suggestion there, this decouples statistics
traversal/caching from each operator's computation: operators now express only local
propagation, while the walk and cache live in an external StatisticsContext.

What changes are included in this PR?

  • StatisticsContext owns the bottom-up walk + per-walk cache; compute(plan, args)
    resolves children, then calls the operator.
  • ExecutionPlan::statistics_from_inputs(input_stats, args) — stateless local
    propagation from pre-resolved child stats; default delegates to deprecated
    partition_statistics.
  • ExecutionPlan::child_stats_requests(partition) -> Vec<ChildStats> — per-child
    directive (At(Option<usize>) / Skip) for which partition each child is computed
    at (e.g. broadcast joins request the build side overall; UnionExec skips
    non-owning inputs).
  • Removes the unreleased statistics_with_args; partition_statistics stays
    deprecated.
  • Migrates all operators/callers; updates the 55.0.0 upgrade guide.

Are these changes tested?

Yes — existing statistics tests now run through StatisticsContext::compute, plus new
unit tests for the cache and the compute_statistics benchmark.

Are there any user-facing changes?

Yes (public ExecutionPlan API): partition_statistics deprecated in favor of
statistics_from_inputs; new StatisticsContext / ChildStats; unreleased
statistics_with_args removed. Upgrade guide updated.


Reviewer note: statistics_from_inputs takes args: &StatisticsArgs (currently just
partition) as a non-breaking extension seam rather than a bare partition — happy
to change if preferred (see #22958).


Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

…ticsContext

Decouple operator statistics propagation from tree traversal and caching,
per apache#22958.

- Add `StatisticsContext`, which owns the bottom-up walk and per-walk
  memoization cache. `compute(plan, args)` resolves each child before
  invoking the operator.
- Add `ExecutionPlan::statistics_from_inputs(&self, input_stats, args)`:
  a stateless per-operator method that computes a node's statistics from
  pre-resolved child statistics. Default delegates to the deprecated
  `partition_statistics`.
- Add `ExecutionPlan::child_stats_requests(&self, partition) -> Vec<ChildStats>`
  letting a node declare, per child, which partition to request
  (`ChildStats::At`) or to skip entirely (`ChildStats::Skip`). Union skips
  non-owning inputs on partition-specific requests.

Operators no longer call back into the framework to fetch child statistics;
caching is shared across a single walk via the context.
@github-actions github-actions Bot added documentation Improvements or additions to documentation optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate labels Jun 19, 2026
@github-actions

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 104.445s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.033s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [ 103.566s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.035s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.658s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 210.162s] datafusion
    Building datafusion-datasource v54.0.0 (current)
       Built [  38.544s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.032s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  38.573s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.032s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.239s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  78.429s] datafusion-datasource
    Building datafusion-ffi v54.0.0 (current)
       Built [  60.358s] (current)
     Parsing datafusion-ffi v54.0.0 (current)
      Parsed [   0.060s] (current)
    Building datafusion-ffi v54.0.0 (baseline)
       Built [  62.099s] (baseline)
     Parsing datafusion-ffi v54.0.0 (baseline)
      Parsed [   0.061s] (baseline)
    Checking datafusion-ffi v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.266s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 124.066s] datafusion-ffi
    Building datafusion-physical-optimizer v54.0.0 (current)
       Built [  39.476s] (current)
     Parsing datafusion-physical-optimizer v54.0.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-physical-optimizer v54.0.0 (baseline)
       Built [  39.557s] (baseline)
     Parsing datafusion-physical-optimizer v54.0.0 (baseline)
      Parsed [   0.021s] (baseline)
    Checking datafusion-physical-optimizer v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.117s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  80.197s] datafusion-physical-optimizer
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  37.501s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.130s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  37.368s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.131s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.601s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  StatisticsArgs::compute_child_statistics, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/229699db0dd05312c530e37c144be4e87a6c9d34/datafusion/physical-plan/src/statistics.rs:119
  StatisticsArgs::compute_child_statistics, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/229699db0dd05312c530e37c144be4e87a6c9d34/datafusion/physical-plan/src/statistics.rs:119

--- failure trait_method_missing: pub trait method removed or renamed ---

Description:
A trait method is no longer callable, and may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-item-signature
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/trait_method_missing.ron

Failed in:
  method statistics_with_args of trait ExecutionPlan, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/229699db0dd05312c530e37c144be4e87a6c9d34/datafusion/physical-plan/src/execution_plan.rs:529
  method statistics_with_args of trait ExecutionPlan, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/229699db0dd05312c530e37c144be4e87a6c9d34/datafusion/physical-plan/src/execution_plan.rs:529

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  76.814s] datafusion-physical-plan

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 19, 2026
@asolimando

Copy link
Copy Markdown
Member Author

@2010YOUY01, @alamb: I have prepared this follow-up PR after #21815, it would be great if you could take a look at your convenience, I am in particular looking for guidance on the point raised in #22958 (comment).

cc: @xudong963 in case you are interested in taking a look at this PR as well.

Thank you folks for all the design and code feedback so far!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation ffi Changes to the ffi crate optimizer Optimizer rules physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Decouple operator statistics propagation from traversal/caching via statistics_from_inputs

1 participant