[SPARK-57294][PS] Support DataFrame.combine in fallback mode #56359

Closed
tonghuaroot wants to merge 1 commit into apache:master from tonghuaroot:pyspark-combine-fallback
@tonghuaroot tonghuaroot commented Jun 7, 2026

What changes were proposed in this pull request?

This PR enables DataFrame.combine for pandas-on-Spark through the
compute.pandas_fallback path. Previously combine was declared via
_unsupported_function, so it always raised PandasNotImplementedError.
This PR adds a _combine_fallback method to
pyspark.pandas.frame.DataFrame, mirroring the existing
_asof_fallback / _set_axis_fallback sibling methods, so that
__getattr__ dispatches combine through the generic
_build_fallback_method when the fallback option is enabled.
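
The dispatch idea described above can be illustrated with a small pure-Python toy. This is only a sketch of the general pattern (a `__getattr__` hook that looks for a `_<name>_fallback` method and consults an option), not the actual pyspark.pandas code. `ToyFrame`, `ToyNotImplementedError`, and the `OPTIONS` dict are hypothetical stand-ins, and the real `_build_fallback_method` may differ in how it converts data and emits warnings.

```python
import warnings

# Hypothetical stand-in for the pandas-on-Spark option registry.
OPTIONS = {"compute.pandas_fallback": False}


class ToyNotImplementedError(NotImplementedError):
    """Hypothetical stand-in for PandasNotImplementedError."""


class ToyFrame:
    def __init__(self, rows):
        self.rows = list(rows)

    def _combine_fallback(self, other, func):
        # Plays the role of a per-method fallback hook: compute eagerly on
        # local data, then wrap the result back into a ToyFrame.
        return ToyFrame(func(a, b) for a, b in zip(self.rows, other.rows))

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so it sees exactly
        # the public methods this class does not define.
        if name.startswith("_"):
            raise AttributeError(name)
        hook = getattr(type(self), f"_{name}_fallback", None)
        if hook is None:
            raise AttributeError(name)

        if not OPTIONS["compute.pandas_fallback"]:
            def unsupported(*args, **kwargs):
                raise ToyNotImplementedError(f"{name} is not implemented")
            return unsupported

        def method(*args, **kwargs):
            # Signal that the call took the fallback path.
            warnings.warn(f"{name} is executed in fallback mode")
            return hook(self, *args, **kwargs)
        return method
```

With the option left at `False`, `ToyFrame([1, 5]).combine(ToyFrame([3, 2]), min)` raises `ToyNotImplementedError`. After setting `OPTIONS["compute.pandas_fallback"] = True`, the same call warns and returns a `ToyFrame` built by the hook.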

It also adds tests covering both the disabled (raises
PandasNotImplementedError) and the fallback-enabled behavior, plus the
Spark Connect parity test, and registers them in
dev/sparktestsupport/modules.py.

JIRA: https://issues.apache.org/jira/browse/SPARK-57294

Why are the changes needed?

combine is a useful pandas DataFrame API that was unsupported on
pandas-on-Spark even when users opted into compute.pandas_fallback.
It is a sound fallback candidate for the same reasons as the existing
asof / set_axis fallbacks: its result is an ordinary single-level-index
DataFrame whose column dtypes (for example int64) map cleanly onto Spark
types, so the generic fallback round-trip through
ps.from_pandas / as_spark_type succeeds. Wiring it through fallback
closes a gap in the pandas-on-Spark fallback coverage and gives users an
explicit, opt-in way to run combine.

Does this PR introduce any user-facing change?

Yes. With compute.pandas_fallback enabled, calling
DataFrame.combine on a pandas-on-Spark DataFrame now executes via the
pandas fallback path and returns a result instead of raising
PandasNotImplementedError. A PandasAPIOnSparkAdviceWarning is emitted
to indicate the call ran in fallback mode. When the option is disabled
(the default), the behavior is unchanged and PandasNotImplementedError
is still raised.

How was this patch tested?

Added pyspark.pandas.tests.frame.test_combine and the Spark Connect
parity test pyspark.pandas.tests.connect.frame.test_parity_combine,
both registered in dev/sparktestsupport/modules.py. The classic test
covers two cases:

  • test_disabled: without compute.pandas_fallback, combine raises
    PandasNotImplementedError.
  • test_fallback: with the option enabled, combine (including the
    overwrite=False case) produces results equal to pandas, asserted
    with assert_eq (values and dtypes).
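
For readers unfamiliar with the API, the pandas behavior that `test_fallback` compares against can be sketched in plain pandas. This is not the PR's test code. The frames and the `take_smaller` helper are illustrative choices that follow the example in the pandas documentation, and no Spark session is involved.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 0], "B": [4, 4]})
df2 = pd.DataFrame({"A": [1, 1], "B": [3, 3]})


def take_smaller(s1, s2):
    # combine calls this once per column with the two aligned Series.
    return s1 if s1.sum() < s2.sum() else s2


# Column-wise combine: A comes from df1 (sum 0 < 2), B from df2 (sum 6 < 8).
result = df1.combine(df2, take_smaller)

# overwrite=False keeps columns of df1 that are absent from the other frame
# instead of replacing them with NaN.
partial = df1.combine(df2[["B"]], take_smaller, overwrite=False)
```

The first call returns integer columns, which is the kind of result the description says maps cleanly onto Spark types.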

Ran test_combine against a real local SparkSession:

$ python -m pytest python/pyspark/pandas/tests/frame/test_combine.py -v
collected 4 items
... test_assert_classic_mode PASSED
... CombineTests::test_assert_classic_mode PASSED
... CombineTests::test_disabled PASSED
... CombineTests::test_fallback PASSED
4 passed in 11.32s

Environment: PySpark master (based on commit c082f82), pandas 2.2.3,
PyArrow as bundled, OpenJDK 17.0.18, Python 3.11. The
PandasAPIOnSparkAdviceWarning: combine is executed in fallback mode
message confirms the call exercised the fallback path.

Was this patch authored or co-authored using generative AI tooling?

Yes, this patch was co-authored with generative AI tooling (Claude,
Anthropic Opus 4.8). The contributor directed the change: choosing
combine as the target, requiring that any fallback candidate first be
verified to round-trip through the Spark type system before being
proposed (which ruled out candidates such as to_period and
tz_localize, whose result dtypes have no Spark mapping), and reviewing
the implementation and the test results. The AI tooling assisted with
drafting the implementation and the tests.

### What changes were proposed in this pull request?

This PR enables `DataFrame.combine` on pandas-on-Spark via the
`compute.pandas_fallback` path by adding a `_combine_fallback` method to
`pyspark.pandas.frame.DataFrame`, mirroring the existing
`_asof_fallback` / `_set_axis_fallback` sibling methods. New tests
(`test_combine` and its Spark Connect parity counterpart) are added and
registered in `dev/sparktestsupport/modules.py`.

### Why are the changes needed?

`combine` was previously declared via `_unsupported_function`, so calling
it always raised `PandasNotImplementedError` even when
`compute.pandas_fallback` was enabled. It is a good fallback candidate
for the same reasons as the asof/set_axis siblings: its result is an
ordinary single-level-index frame whose column dtypes (e.g. int64) map
cleanly onto Spark types, so the generic `_build_fallback_method`
round-trip succeeds. Wiring it through fallback closes a gap in the
pandas-on-Spark fallback coverage.

Signed-off-by: tonghuaroot <23011166+tonghuaroot@users.noreply.github.com>
@tonghuaroot tonghuaroot force-pushed the pyspark-combine-fallback branch from 0658644 to 008e425 on June 7, 2026 07:59
@HyukjinKwon
Member

cc @zhengruifeng

zhengruifeng pushed a commit that referenced this pull request Jun 8, 2026

Closes #56359 from tonghuaroot/pyspark-combine-fallback.

Authored-by: tonghuaroot (童话) <tonghuaroot@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit 3e02257)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
@zhengruifeng
Contributor

thanks, merged into master/4.x
