Skip to content

refactor: name build-row and matchable-map presence checks in hash join#23024

Open
Phoenix500526 wants to merge 2 commits into
apache:mainfrom
Phoenix500526:issue/23008
Open

refactor: name build-row and matchable-map presence checks in hash join#23024
Phoenix500526 wants to merge 2 commits into
apache:mainfrom
Phoenix500526:issue/23008

Conversation

@Phoenix500526

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

HashJoinStream::state_after_build_ready relies on two different facts about the collected build side:

  • whether the build side physically has rows (batch().num_rows())
  • whether the hash map has any matchable entries (map().is_empty())

These diverge under NullEquality::NullEqualsNothing: build rows whose join key is NULL are omitted from the map, so the map can be empty while the build side still has rows. Conflating the two was the source of the bug fixed in #22893.

Today both checks are written inline as raw batch().num_rows() == 0 / map().is_empty(). Giving each a name on JoinLeftData makes the distinction explicit at the API surface, so future call sites are harder to get wrong.

What changes are included in this PR?

  • Add two helpers on JoinLeftData:
    • has_build_rows() — build-side row presence
    • has_matchable_build_rows() — matchable hash-map entry presence
  • Update state_after_build_ready to use them instead of the raw checks.

Pure readability refactor — no behavior change.

Are these changes tested?

Covered by the existing hash join tests; cargo test -p datafusion-physical-plan hash_join passes (804 tests). Since this is a readability-only refactor with no behavior change, no new tests are added.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added physical-plan Changes to the physical-plan crate auto detected api change Auto detected API change labels Jun 18, 2026
@Phoenix500526

Copy link
Copy Markdown
Contributor Author

The failing cargo test hash collisions check is unrelated to this PR. It fails on test_create_hashes_with_quality_hash_state in datafusion-common — that test (added in #22921) is not gated behind not(feature = "force_hash_collisions"), so it fails under the forced-collision build. This PR only touches physical-plan hash-join code.

I have fixed it separately in #23044; once that merges, rebasing this branch should turn the check green.

@neilconway neilconway left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Useful cleanup.

Comment thread datafusion/physical-plan/src/joins/hash_join/exec.rs Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using has_matchable_build_rows here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@github-actions github-actions Bot removed the auto detected api change Auto detected API change label Jun 20, 2026
@neilconway neilconway enabled auto-merge June 20, 2026 02:27
@kosiew

kosiew commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

hi @Phoenix500526
Can you merge with latest main and try to resolve all CI issues?

@Phoenix500526

Copy link
Copy Markdown
Contributor Author

hi @Phoenix500526 Can you merge with latest main and try to resolve all CI issues?

Sure, I opened a new PR #23053 to fix it.

`state_after_build_ready` distinguishes two facts about the collected build
side: whether it physically has rows, and whether its hash map has any
matchable entries. These differ under `NullEquality::NullEqualsNothing`, where
build rows with a NULL join key are omitted from the map -- the distinction
that was the source of the bug fixed in apache#22893.

Expose it as named methods on `JoinLeftData` (`has_build_rows` /
`has_matchable_build_rows`) instead of repeating raw `batch().num_rows()` /
`map().is_empty()` checks at the call site, so the invariant is visible at the
API surface and future uses are harder to get wrong. No behavior change.

Closes apache#23008
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Drop the explicit intra-doc link reference on `has_matchable_build_rows`: the
inline `[`NullEquality::NullEqualsNothing`]` link resolves on its own since the
type is in scope, so spelling out the path again is redundant.

Also use `has_matchable_build_rows()` for the empty-build-side check in
`HashJoinStream` instead of a raw `map().is_empty()`, keeping the presence
checks consistent. No behavior change.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
auto-merge was automatically disabled June 20, 2026 08:22

Head branch was pushed to by a user without write access

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cleanup: Name build-row and matchable-map presence checks in hash join

3 participants