[SPARK-57294][PS] Support DataFrame.combine in fallback mode by tonghuaroot · Pull Request #56359 · apache/spark

tonghuaroot · 2026-06-07T01:38:12Z

What changes were proposed in this pull request?

This PR enables DataFrame.combine for pandas-on-Spark through the
compute.pandas_fallback path. Previously combine was declared via
_unsupported_function, so it always raised PandasNotImplementedError.
This PR adds a _combine_fallback method to
pyspark.pandas.frame.DataFrame, mirroring the existing
_asof_fallback / _set_axis_fallback sibling methods, so that
__getattr__ dispatches combine through the generic
_build_fallback_method when the fallback option is enabled.
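The dispatch described above can be illustrated with a toy model. This is a sketch of the general pattern only, not the Spark source: `ToyFrame`, `OPTIONS`, `UNSUPPORTED`, and `AdviceWarning` are hypothetical stand-ins, and the real `_build_fallback_method` additionally converts to pandas and back, which is omitted here.

```python
import warnings


class PandasNotImplementedError(NotImplementedError):
    """Toy stand-in for the pandas-on-Spark exception of the same name."""


class AdviceWarning(Warning):
    """Toy stand-in for PandasAPIOnSparkAdviceWarning."""


# Hypothetical option store; the real option is read through the
# pandas-on-Spark options API, not a module-level dict.
OPTIONS = {"compute.pandas_fallback": False}

# Names that, in this toy, play the role of methods declared
# via _unsupported_function.
UNSUPPORTED = {"combine"}


class ToyFrame:
    def __init__(self, rows):
        self._rows = list(rows)

    def _combine_fallback(self, other, func):
        # Placeholder for "run the pandas implementation and convert back".
        return ToyFrame(func(a, b) for a, b in zip(self._rows, other._rows))

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_") or name not in UNSUPPORTED:
            raise AttributeError(name)

        fallback = getattr(type(self), f"_{name}_fallback", None)
        if OPTIONS["compute.pandas_fallback"] and fallback is not None:

            def run_in_fallback(*args, **kwargs):
                warnings.warn(
                    f"{name} is executed in fallback mode", AdviceWarning
                )
                return fallback(self, *args, **kwargs)

            return run_in_fallback

        def unsupported(*args, **kwargs):
            # Option disabled (or no fallback defined): calling still raises.
            raise PandasNotImplementedError(f"{name} is not implemented")

        return unsupported
```

With the option left at its default, `ToyFrame([1, 5]).combine(ToyFrame([3, 2]), min)` raises `PandasNotImplementedError`. After setting `OPTIONS["compute.pandas_fallback"] = True`, the same call emits the warning and returns a new `ToyFrame`.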

It also adds tests covering both the disabled (raises
PandasNotImplementedError) and the fallback-enabled behavior, plus the
Spark Connect parity test, and registers them in
dev/sparktestsupport/modules.py.

JIRA: https://issues.apache.org/jira/browse/SPARK-57294

Why are the changes needed?

combine is a useful pandas DataFrame API that was unsupported on
pandas-on-Spark even when users opted into compute.pandas_fallback.
It is a sound fallback candidate for the same reasons as the existing
asof / set_axis fallbacks: its result is an ordinary single-level-index
DataFrame whose column dtypes (for example int64) map cleanly onto Spark
types, so the generic fallback round-trip through
ps.from_pandas / as_spark_type succeeds. Wiring it through fallback
closes a gap in the pandas-on-Spark fallback coverage and gives users an
explicit, opt-in way to run combine.
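The dtype argument above can be checked with plain pandas, without involving Spark. The sketch below is illustrative only: the frames and the `take_smaller` helper are made-up inputs, not taken from the PR's tests, and it does not exercise `ps.from_pandas` or `as_spark_type`.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0], "B": [4, 4]})
df2 = pd.DataFrame({"A": [1, 1], "B": [3, 3]})


def take_smaller(s1: pd.Series, s2: pd.Series) -> pd.Series:
    # Column-wise choice: keep whichever column has the smaller sum.
    return s1 if s1.sum() < s2.sum() else s2


result = df1.combine(df2, take_smaller)

# The two properties the fallback round-trip relies on:
# a single-level index and plain numeric column dtypes.
index_levels = result.index.nlevels
dtypes = {col: str(dtype) for col, dtype in result.dtypes.items()}
```

For these inputs the result keeps a single-level index and `int64` columns, which is the kind of frame the description says converts back cleanly. Inputs with misaligned indexes can introduce missing values and change the result dtypes, so this is one favorable case rather than a general guarantee.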

Does this PR introduce any user-facing change?

Yes. With compute.pandas_fallback enabled, calling
DataFrame.combine on a pandas-on-Spark DataFrame now executes via the
pandas fallback path and returns a result instead of raising
PandasNotImplementedError. A PandasAPIOnSparkAdviceWarning is emitted
to indicate the call ran in fallback mode. When the option is disabled
(the default), the behavior is unchanged and PandasNotImplementedError
is still raised.

How was this patch tested?

Added pyspark.pandas.tests.frame.test_combine and the Spark Connect
parity test pyspark.pandas.tests.connect.frame.test_parity_combine,
both registered in dev/sparktestsupport/modules.py. The classic test
covers two cases:

- test_disabled: without compute.pandas_fallback, combine raises
  PandasNotImplementedError.
- test_fallback: with the option enabled, combine (including the
  overwrite=False case) produces results equal to pandas, asserted
  with assert_eq (values and dtypes).

Ran test_combine against a real local SparkSession:

$ python -m pytest python/pyspark/pandas/tests/frame/test_combine.py -v
collected 4 items
... test_assert_classic_mode PASSED
... CombineTests::test_assert_classic_mode PASSED
... CombineTests::test_disabled PASSED
... CombineTests::test_fallback PASSED
4 passed in 11.32s

Environment: PySpark master (based on commit c082f82), pandas 2.2.3,
PyArrow as bundled, OpenJDK 17.0.18, Python 3.11. The
PandasAPIOnSparkAdviceWarning: combine is executed in fallback mode
message confirms the call exercised the fallback path.

Was this patch authored or co-authored using generative AI tooling?

Yes, this patch was co-authored with generative AI tooling (Claude,
Anthropic Opus 4.8). The contributor directed the change: choosing
combine as the target, requiring that any fallback candidate first be
verified to round-trip through the Spark type system before being
proposed (which ruled out candidates such as to_period and
tz_localize, whose result dtypes have no Spark mapping), and reviewing
the implementation and the test results. The AI tooling assisted with
drafting the implementation and the tests.

### What changes were proposed in this pull request?

This PR enables `DataFrame.combine` on pandas-on-Spark via the `compute.pandas_fallback` path by adding a `_combine_fallback` method to `pyspark.pandas.frame.DataFrame`, mirroring the existing `_asof_fallback` / `_set_axis_fallback` sibling methods. New tests (`test_combine` and its Spark Connect parity counterpart) are added and registered in `dev/sparktestsupport/modules.py`.

### Why are the changes needed?

`combine` was previously declared via `_unsupported_function`, so calling it always raised `PandasNotImplementedError` even when `compute.pandas_fallback` was enabled. It is a good fallback candidate for the same reasons as the asof/set_axis siblings: its result is an ordinary single-level-index frame whose column dtypes (e.g. int64) map cleanly onto Spark types, so the generic `_build_fallback_method` round-trip succeeds. Wiring it through fallback closes a gap in the pandas-on-Spark fallback coverage.

Signed-off-by: tonghuaroot <23011166+tonghuaroot@users.noreply.github.com>

HyukjinKwon · 2026-06-08T06:30:04Z

cc @zhengruifeng

Closes #56359 from tonghuaroot/pyspark-combine-fallback.

Authored-by: tonghuaroot (童话) <tonghuaroot@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 3e02257)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>

zhengruifeng · 2026-06-08T11:01:10Z

thanks, merged into master/4.x

tonghuaroot force-pushed the pyspark-combine-fallback branch from 0658644 to 008e425 on June 7, 2026 07:59

zhengruifeng approved these changes Jun 8, 2026


zhengruifeng closed this in 3e02257 Jun 8, 2026

tonghuaroot wants to merge 1 commit into apache:master from tonghuaroot:pyspark-combine-fallback
