[SPARK-57294][PS] Support DataFrame.combine in fallback mode #56359
Closed
tonghuaroot wants to merge 1 commit into
### What changes were proposed in this pull request?

This PR enables `DataFrame.combine` on pandas-on-Spark via the `compute.pandas_fallback` path by adding a `_combine_fallback` method to `pyspark.pandas.frame.DataFrame`, mirroring the existing `_asof_fallback` / `_set_axis_fallback` sibling methods. New tests (`test_combine` and its Spark Connect parity counterpart) are added and registered in `dev/sparktestsupport/modules.py`.

### Why are the changes needed?

`combine` was previously declared via `_unsupported_function`, so calling it always raised `PandasNotImplementedError` even when `compute.pandas_fallback` was enabled. It is a good fallback candidate for the same reasons as the asof / set_axis siblings: its result is an ordinary single-level-index frame whose column dtypes (e.g. int64) map cleanly onto Spark types, so the generic `_build_fallback_method` round-trip succeeds. Wiring it through fallback closes a gap in the pandas-on-Spark fallback coverage.

Signed-off-by: tonghuaroot <23011166+tonghuaroot@users.noreply.github.com>
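The dispatch described above can be pictured with a small, self-contained model. This is a sketch under stated assumptions, not the PySpark source: `ToyFrame`, `OPTIONS`, and `ToyNotImplementedError` are hypothetical stand-ins, and whether the real code raises at attribute lookup or at call time is not stated in this PR. Only the `_<name>_fallback` naming pattern, the `__getattr__` dispatch, and the option name `compute.pandas_fallback` come from the description.

```python
import warnings

# Hypothetical stand-in for the pandas-on-Spark option registry.
OPTIONS = {"compute.pandas_fallback": False}


class ToyNotImplementedError(NotImplementedError):
    """Hypothetical stand-in for PandasNotImplementedError."""


class ToyFrame:
    """Toy model mapping column name -> list of values (not the real DataFrame)."""

    def __init__(self, data):
        self.data = {name: list(values) for name, values in data.items()}

    def _combine_fallback(self, other, func):
        # Stand-in for "run the pandas implementation locally":
        # apply func column by column over this frame's columns.
        return ToyFrame(
            {name: func(self.data[name], other.data[name]) for name in self.data}
        )

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails, i.e. for
        # public methods that have no native implementation.
        if name.startswith("_"):
            raise AttributeError(name)
        fallback = getattr(type(self), f"_{name}_fallback", None)
        if fallback is None:
            raise AttributeError(name)

        def method(*args, **kwargs):
            # Assumption: the option is checked when the method is called.
            if not OPTIONS["compute.pandas_fallback"]:
                raise ToyNotImplementedError(f"{name} is not implemented")
            warnings.warn(f"{name} is executed in fallback mode")
            return fallback(self, *args, **kwargs)

        return method


def take_smaller(s1, s2):
    # Keep whichever column has the smaller sum.
    return s1 if sum(s1) < sum(s2) else s2


left = ToyFrame({"A": [0, 0], "B": [4, 4]})
right = ToyFrame({"A": [1, 1], "B": [3, 3]})

# Option disabled (the default): the call raises.
disabled_raised = False
try:
    left.combine(right, take_smaller)
except ToyNotImplementedError:
    disabled_raised = True

# Option enabled: the call runs through the fallback and warns.
OPTIONS["compute.pandas_fallback"] = True
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = left.combine(right, take_smaller)
```

The point of the model is that adding one `_<name>_fallback` method is enough to opt a previously unsupported name into the generic path; the option check and the warning live in one shared place.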
0658644 to
008e425
zhengruifeng (Member) approved these changes on Jun 8, 2026
zhengruifeng pushed a commit that referenced this pull request on Jun 8, 2026
### What changes were proposed in this pull request?

This PR enables `DataFrame.combine` for pandas-on-Spark through the `compute.pandas_fallback` path. Previously `combine` was declared via `_unsupported_function`, so it always raised `PandasNotImplementedError`.

This PR adds a `_combine_fallback` method to `pyspark.pandas.frame.DataFrame`, mirroring the existing `_asof_fallback` / `_set_axis_fallback` sibling methods, so that `__getattr__` dispatches `combine` through the generic `_build_fallback_method` when the fallback option is enabled.

It also adds tests covering both the disabled (raises `PandasNotImplementedError`) and the fallback-enabled behavior, plus the Spark Connect parity test, and registers them in `dev/sparktestsupport/modules.py`.

JIRA: https://issues.apache.org/jira/browse/SPARK-57294

### Why are the changes needed?

`combine` is a useful pandas DataFrame API that was unsupported on pandas-on-Spark even when users opted into `compute.pandas_fallback`.

It is a sound fallback candidate for the same reasons as the existing asof / set_axis fallbacks: its result is an ordinary single-level-index DataFrame whose column dtypes (for example int64) map cleanly onto Spark types, so the generic fallback round-trip through `ps.from_pandas` / `as_spark_type` succeeds. Wiring it through fallback closes a gap in the pandas-on-Spark fallback coverage and gives users an explicit, opt-in way to run `combine`.

### Does this PR introduce _any_ user-facing change?

Yes. With `compute.pandas_fallback` enabled, calling `DataFrame.combine` on a pandas-on-Spark DataFrame now executes via the pandas fallback path and returns a result instead of raising `PandasNotImplementedError`. A `PandasAPIOnSparkAdviceWarning` is emitted to indicate the call ran in fallback mode. When the option is disabled (the default), the behavior is unchanged and `PandasNotImplementedError` is still raised.

### How was this patch tested?

Added `pyspark.pandas.tests.frame.test_combine` and the Spark Connect parity test `pyspark.pandas.tests.connect.frame.test_parity_combine`, both registered in `dev/sparktestsupport/modules.py`. The classic test covers two cases:

- `test_disabled`: without `compute.pandas_fallback`, `combine` raises `PandasNotImplementedError`.
- `test_fallback`: with the option enabled, `combine` (including the `overwrite=False` case) produces results equal to pandas, asserted with `assert_eq` (values and dtypes).

Ran `test_combine` against a real local SparkSession:

```
$ python -m pytest python/pyspark/pandas/tests/frame/test_combine.py -v
collected 4 items
... test_assert_classic_mode PASSED
... CombineTests::test_assert_classic_mode PASSED
... CombineTests::test_disabled PASSED
... CombineTests::test_fallback PASSED
4 passed in 11.32s
```

Environment: PySpark master (based on commit c082f82), pandas 2.2.3, PyArrow as bundled, OpenJDK 17.0.18, Python 3.11. The `PandasAPIOnSparkAdviceWarning: combine is executed in fallback mode` message confirms the call exercised the fallback path.

### Was this patch authored or co-authored using generative AI tooling?

Yes, this patch was co-authored with generative AI tooling (Claude, Anthropic Opus 4.8). The contributor directed the change: choosing `combine` as the target, requiring that any fallback candidate first be verified to round-trip through the Spark type system before being proposed (which ruled out candidates such as `to_period` and `tz_localize`, whose result dtypes have no Spark mapping), and reviewing the implementation and the test results. The AI tooling assisted with drafting the implementation and the tests.

Closes #56359 from tonghuaroot/pyspark-combine-fallback.

Authored-by: tonghuaroot (童话) <tonghuaroot@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 3e02257)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
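For reference, the pandas behavior that `test_fallback` is described as comparing against can be reproduced without Spark. This is a minimal sketch using plain pandas only; the two frames and the `take_smaller` helper are illustrative choices and are not taken from the PR's test file.

```python
import pandas as pd


def take_smaller(s1: pd.Series, s2: pd.Series) -> pd.Series:
    # Column-wise combiner: keep whichever column has the smaller sum.
    return s1 if s1.sum() < s2.sum() else s2


df1 = pd.DataFrame({"A": [0, 0], "B": [4, 4]})
df2 = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# combine aligns both frames, then calls the function once per column.
result = df1.combine(df2, take_smaller)

print(result)
print(result.dtypes)
```

Column `A` comes from `df1` (sum 0 vs 2) and column `B` from `df2` (sum 8 vs 6). Per the description above, running the same call on pandas-on-Spark frames with `compute.pandas_fallback` enabled is expected to yield an equal result.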
Contributor
thanks, merged into master/4.x