fix: avoid extraneous casts for equivalent nested types#20945
fix: avoid extraneous casts for equivalent nested types#20945feichai0017 wants to merge 7 commits intoapache:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates DataFusion’s function argument coercion to treat structurally equivalent nested DataTypes as matching (ignoring nested field names / metadata), preventing unnecessary CASTs from being inserted during analysis for UDF calls.
Changes:
- Use
DataType::equals_datatype(via a helper) when checking whether current argument types already satisfy a function/UDF signature. - Update coercion paths to avoid rewriting arguments when nested types are equivalent.
- Add regression tests in both
datafusion-expranddatafusion-optimizerto ensure equivalent nested types don’t trigger casts.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
datafusion/expr/src/type_coercion/functions.rs |
Switches type matching from strict equality to equals_datatype for function/UDF argument matching and related coercion logic; adds a regression test. |
datafusion/optimizer/src/analyzer/type_coercion.rs |
Adds an analyzer regression test ensuring scalar UDF argument coercion doesn’t insert casts for equivalent nested list/struct types. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| ) | ||
| } | ||
|
|
||
| fn data_types_match(valid_types: &[DataType], current_types: &[DataType]) -> bool { |
There was a problem hiding this comment.
Seems like we aren't handling Map, Struct, or ListView -- is there a reason for that? In fact, the original bug report uses Map.
I wonder if we can simplify this to use equals_datatype from Arrow, as suggested by the bug reporter?
There was a problem hiding this comment.
Thanks for mention.
The original reproducer does go through a Map, but the actual mismatch at the coercion point is on the extracted map value (List<Struct<...>>), not on the Map type itself. That’s why I kept the fast-path relaxation narrow.
I did try a broader equals_datatype approach first, but it was too permissive in this path and regressed existing cases where runtime kernels still require exact type identity, especially around Struct. I agree ListView / LargeListView should be handled consistently with List / LargeList, and I’ve updated the matcher for that.
There was a problem hiding this comment.
Ah, got it.
Just to help me understand, can you point at an SLT test (e.g., involving structs) that would regress if we used equals_datatype? Or if such an SLT test doesn't already exist, it would probably be a good idea to add one as a sanity check.
There was a problem hiding this comment.
Yes, I locally verified that a broader equals_datatype-style matcher regresses existing SLTs.
In particular:
/datafusion/datafusion/sqllogictest/test_files/struct.slt
select [{a: 1, b: 2}, {b: 3, a: 4}];/datafusion/datafusion/sqllogictest/test_files/spark/array/array.slt
SELECT array(arrow_cast(array(1,2), 'LargeList(Int64)'), array(3));
With the broader matching, both end up failing in array construction (MutableArrayData) because those paths still require exact runtime type identity. That was the main reason I kept this matcher narrower than equals_datatype, especially around Struct.
I agree it would be useful to make that boundary explicit, so I can also add a focused sanity-check regression test in this PR.
There was a problem hiding this comment.
Thank you! That makes sense: the key point is that some Arrow kernels depend on struct field ordering, but the "field name" of a list has no influence on the representation of the data. Can we add a brief comment to data_type_matches to explain the rationale for the kinda-structural-equality we are implementing?
It seems like Map has the same behavior as the List variants: the "field name" does not impact the representation of the data. Should we handle that as well, for completeness?
| ) | ||
| } | ||
|
|
||
| fn data_types_match(valid_types: &[DataType], current_types: &[DataType]) -> bool { |
There was a problem hiding this comment.
Thank you! That makes sense: the key point is that some Arrow kernels depend on struct field ordering, but the "field name" of a list has no influence on the representation of the data. Can we add a brief comment to data_type_matches to explain the rationale for the kinda-structural-equality we are implementing?
It seems like Map has the same behavior as the List variants: the "field name" does not impact the representation of the data. Should we handle that as well, for completeness?
Thanks, I updated the matcher to make that boundary explicit. The fast-path match now:
That means cases like I also added:
|
Summary
This PR avoids inserting extraneous casts during function argument coercion when two nested types are structurally equivalent but differ only in nested field names or metadata.
Specifically, it:
DataTypes as matching during UDF argument coercionCASTsdatafusion-expranddatafusion-optimizerCloses #19943.
Tests
cargo test -p datafusion-optimizer./dev/rust_lint.shcargo test -p datafusion-exprcurrently fails on an existing snapshot mismatch inlogical_plan::plan::tests::test_display_pg_jsonon the currentmainbaseline, unrelated to this change