Skip to content

Fix projection functional dependency remapping#23028

Open
hhhizzz wants to merge 5 commits into
apache:mainfrom
hhhizzz:codex/datafusion-fd-groupby-prune-20260618
Open

Fix projection functional dependency remapping#23028
hhhizzz wants to merge 5 commits into
apache:mainfrom
hhhizzz:codex/datafusion-fd-groupby-prune-20260618

Conversation

@hhhizzz

@hhhizzz hhhizzz commented Jun 18, 2026

Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

Projection functional dependency propagation builds a mapping from projected output expressions back to input field indices.

Before this change, projection expressions that are not direct input fields, such as planner-generated computed expressions, were omitted from that mapping. This shifted the projected positions of later passthrough columns. If a primary key column appears after such a computed expression, the projected schema can record the primary key functional dependency against the wrong output index.

This showed up in the TPC-DS q39 planning path after primary key constraints were added to the schemas: downstream aggregate planning was reasoning from incorrect functional dependency metadata.

What changes are included in this PR?

  • Preserves one output slot per projection expression when remapping functional dependencies.
  • Uses a sentinel for computed / non-input projection expressions so they do not match input functional dependencies, while still keeping later passthrough column positions aligned.
  • Applies the same positional behavior to aliases, wildcards, and direct projection expressions.
  • Adds a focused regression test for a leading computed projection before a primary key column.

Are these changes tested?

Yes.

cargo fmt --all -- --check
cargo test -p datafusion-expr projection_with_leading_computed_column_preserves_pk
cargo test -p datafusion --test tpcds_planning q39

I also ran a debug SF10 TPC-DS all-query comparison for the fix path separately; it completed with 0 failures. I am treating that as diagnostic evidence, not a formal benchmark claim.

Are there any user-facing changes?

No SQL syntax or public API changes.

Users may see corrected optimized plans and restored performance for queries affected by projection functional dependency remapping.

@github-actions github-actions Bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jun 18, 2026
@github-actions

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-sql v54.0.0 (current)
       Built [  43.223s] (current)
     Parsing datafusion-sql v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-sql v54.0.0 (baseline)
       Built [  42.253s] (baseline)
     Parsing datafusion-sql v54.0.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion-sql v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.237s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  SqlToRel::take_warnings, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/229699db0dd05312c530e37c144be4e87a6c9d34/datafusion/sql/src/planner.rs:493

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  86.891s] datafusion-sql
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 178.647s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 178.664s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.024s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.093s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 359.795s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 19, 2026

@neilconway neilconway left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reporting this!

I believe optimize_projections should already do this (which is why the newly added SLTs pass without the changes in this PR). From digging around a little bit, it looks like the TPC-DS query is running into a different bug: if there's an intermediate Projection, it seems like the FDs aren't being propagated correctly due to a bug in calc_func_dependencies_for_project when a projection list contains a mix of computed and passthrough columns. Claude found this test case:

#[test]
fn projection_with_leading_computed_column_preserves_pk() {
    // input: [id (PK), name, amount]
    let input_fds = FunctionalDependencies::new_from_constraints(
        Some(&Constraints::new_unverified(vec![Constraint::PrimaryKey(vec![0])])),
        3,
    );
    // output of a CSE-style projection: [__common_expr_1 (computed), id, name, amount]
    // -> proj_indices must be [MAX, 0, 1, 2], NOT [0, 1, 2]
    let proj_indices = vec![usize::MAX, 0, 1, 2];
    let projected = input_fds.project_functional_dependencies(&proj_indices, 4);
    // determinant must remap to output position 1 (`id`), not 0 (`__common_expr_1`)
    assert_eq!(projected[0].source_indices, vec![1]);
}

Would you like to take a look at fixing that bug? I think if we do that then the existing projection optimization rewrite should apply here to the TPC-DS query.

Comment on lines +18 to +20
# Functional-dependency targets should only be added to aggregate
# GROUP BY outputs when the SQL query actually needs them after
# aggregation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests pass without the code changes in this PR.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the code and these tests are removed.

@github-actions github-actions Bot added logical-expr Logical plan and expressions and removed sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jun 20, 2026
@hhhizzz hhhizzz changed the title Prune implicit FD group keys in SQL aggregates Fix projection functional dependency remapping Jun 20, 2026
@hhhizzz

hhhizzz commented Jun 20, 2026

Copy link
Copy Markdown
Author

@neilconway
Thanks for taking a look and for the concrete test case. You were correct: the original aggregate-pruning approach was not the right fix, and the added SLTs passed because optimize_projections already handles that part.

I reworked the PR to fix the FD propagation issue in calc_func_dependencies_for_project instead. The projection remapping now preserves one output slot per projection expression, using a sentinel for computed / non-input expressions, so later passthrough columns are remapped to the correct output positions.

I also dropped the previous aggregate planning changes and added a focused regression test for the computed-column-before-PK case:
projection_with_leading_computed_column_preserves_pk.

I also ran a debug SF10 TPC-DS all-query run for this path and it completed with 0 failures and Q39 recovered, but I am treating that only as diagnostic evidence rather than a formal benchmark claim.

Thanks again for pointing me at the right root cause.

@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Jun 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TPC-DS q39 regression after adding primary key constraints: aggregate GROUP BY includes many unreferenced FD columns

2 participants