Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799
Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799kongfanshen-0801 wants to merge 12 commits into
Conversation
|
Thanks for the back porting. Could we keep the original commit history, thanks |
Hash joins can support semijoin with the LHS input on the right, using the existing logic for inner join, combined with the assurance that only the first match for each inner tuple is considered, which can be achieved by leveraging the HEAP_TUPLE_HAS_MATCH flag. This can be very useful in some cases since we may now have the option to hash the smaller table instead of the larger. Merge join could likely support "Right Semi Join" too. However, the benefit of swapping inputs tends to be small here, so we do not address that in this patch. Note that this patch also modifies a test query in join.sql to ensure it continues testing as intended. With this patch the original query would result in a right-semi-join rather than semi-join, compromising its original purpose of testing the fix for neqjoinsel's behavior for semi-joins. Author: Richard Guo Reviewed-by: wenhui qiu, Alena Rybakina, Japin Li Discussion: https://postgr.es/m/CAMbWs4_X1mN=ic+SxcyymUqFx9bB8pqSLTGJ-F=MHy4PW3eRXw@mail.gmail.com (cherry picked from commit aa86129e19d704afb93cb84ab9638f33d266ee9d)
When resetting a HashJoin node for rescans, if it is a single-batch join and there are no parameter changes for the inner subnode, we can just reuse the existing hash table without rebuilding it. However, for join types that depend on the inner-tuple match flags in the hash table, we need to reset these match flags to avoid incorrect results. This applies to right, right-anti, right-semi, and full joins. When I introduced "Right Semi Join" plan shapes in aa86129e1, I failed to reset the match flags in the hash table for right-semi joins in rescans. This oversight has been shown to produce incorrect results. This patch fixes it. Author: Richard Guo Discussion: https://postgr.es/m/CAMbWs4-nQF9io2WL2SkD0eXvfPdyBc9Q=hRwfQHCGV2usa0jyA@mail.gmail.com (cherry picked from commit 5668a857de4f3f12066b2bbc626b77be4fc95ee5)
This commit carries the Cloudberry/Greenplum-specific changes needed on
top of the two cherry-picked upstream commits (aa86129e1, 5668a857d),
which only touch the upstream PostgreSQL planner/executor files.
- nodes.h: move JOIN_RIGHT_SEMI to the END of the JoinType enum. Upstream
places it next to JOIN_RIGHT_ANTI, but in the Cloudberry tree that shifts
the integer values of the GPDB-specific JOIN_DEDUP_SEMI/REVERSE and
JOIN_UNIQUE_* codes. Value-dependent code then corrupts MPP motion
planning, producing a degenerate plan ("Gather Motion 0:1" /
"Redistribute Motion 1:0") that crashes with SIGSEGV in
setupCdbProcessList() during dispatch. Appending keeps every pre-existing
enum value stable.
- cdbpath.c (cdbpath_motion_for_join, both the serial and parallel switch):
handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI. A right-semi join
emits inner (build-side) rows, so the inner side must not be replicated,
otherwise matched inner rows could be emitted more than once. Without this
the new join type would hit the switch default and elog(ERROR,
"unexpected join type") at plan time.
Note: this feature is only exercised by the PostgreSQL planner
(optimizer=off); GPORCA does not generate JOIN_RIGHT_SEMI.
a869ab3 to
4328d2f
Compare
|
Thanks for the review! I've restructured the branch to preserve the original commit history:
The code logic is unchanged from what I verified locally (Hash Right Semi Join chosen and correct via EXPLAIN ANALYZE + cross-validation). PTAL, thanks! |
Thanks, if have a chance, we could do some comparison of before and after for TPC-H Q 21 |
5dc8186 to
8ad6a81
Compare
Done — ran TPC-H Q21 before/after on the Setup: SF=100 (lineitem 600,037,902 rows), GPORCA, 3-segment single-host demo cluster, 3 timed runs each.
Q21's
So the right-semi shape shrinks the build side from ~200M to ~449K rows, cuts the semijoin hash spill from 512 → 2 batches, and drops requested End-to-end wall-clock is ~3% better here because on this small cluster the total is dominated by the three I also added an ORCA regression test ( |
|
Follow-up on the Q21 numbers above — clarifying why the end-to-end gain is only ~3%, and showing the isolated win. Where Q21's time actually goes (row executor)Q21 scans
Right-semi only changes the build side of the outer semijoin (
The 200M Isolated semijoin: ~4× when it's the dominant costRemoving the 3× full scans (small 10k-row LHS vs a unique 150M-row RHS,
→ ~3.9× faster (≈20 s vs ≈77 s) once the build-side choice is the bottleneck. Row vs vectorizedIn the row executor, steps 1 + 3 (scanning/deforming ~452M rows and row-at-a-time hash/probe) dominate and dilute the win. In a vectorized/columnar engine those steps get much cheaper (column-pruned, batched), so the relative weight of avoiding the 200M-row build+spill grows and the Q21 wall-clock gain becomes more pronounced — consistent with the larger Q21 improvement seen on the vectorized build this was ported from. Summary: the right-semi plan is genuinely chosen and the per-operator win is large (~4× isolated, 200M→449K build, 512→2 batches, 8.6 GB→14 MB); it just sits behind scan/deform-bound work in Q21 on the row executor, hence ~3% end-to-end here. |
Teaches GPORCA to produce JOIN_RIGHT_SEMI / JOIN_RIGHT_ANTI hash joins, so
the "Right Semi Join" optimization is available under the GPORCA optimizer
(optimizer=on), not just the PostgreSQL planner. ORCA can now build the hash
table on the smaller LHS of an IN/EXISTS/NOT EXISTS semijoin and emit the
matched (semi) or unmatched (anti) LHS rows.
Pieces:
* Physical operators CPhysicalRightSemiHashJoin / CPhysicalRightAntiSemiHashJoin
extending CPhysicalHashJoin; registered in COperator.h.
* Implementation xforms CXformLeftSemiJoin2RightSemiHashJoin /
CXformLeftAntiSemiJoin2RightAntiSemiHashJoin, registered in CXformFactory
and added to the candidate sets of CLogicalLeftSemiJoin /
CLogicalLeftAntiSemiJoin. New xform IDs are appended at the END of the
CXform enum to keep the CXformSet (CEnumSet<EXformId, ExfSentinel>) ABI
stable (mid-insertion crashes CSearchStage unless every TU is rebuilt;
same lesson as the PG JoinType enum).
* DXL: EdxljtRightSemijoin / EdxljtRightAntiSemijoin join types + tokens,
with Expr->DXL and DXL->PlStmt / CTranslatorUtils mappings to the GPDB
JOIN_RIGHT_SEMI / JOIN_RIGHT_ANTI executor join types.
* Cost model: CostRightSemiHashJoin / CostRightAntiSemiHashJoin reuse the
CostHashJoin formula with the build/probe roles swapped.
Build-side swap in CTranslatorDXLToPlStmt (the key adaptation to the PG
executor): the GPDB executor always builds the hash table on the plan's inner
(right) child, but ORCA places the semantically-preserved LHS as the DXL left
child. For JOIN_RIGHT_SEMI / JOIN_RIGHT_ANTI we therefore (a) translate the
DXL right child as the executor's outer/probe and the DXL left child (LHS) as
the inner/Hash build side, (b) keep the child contexts in outer-then-inner
order so target list, quals and hash clauses get the correct
OUTER_VAR/INNER_VAR, and (c) take the outer/inner hash keys from the opposite
hash-clause operands to match the swapped sides. This produces exactly the
plan shape the PostgreSQL planner emits (inner = LHS), so the existing PG
JOIN_RIGHT_SEMI/ANTI executor support (from the preceding backport commits)
runs it correctly -- this replaces Lightning's vec-executor PostBuildVecPlan
swap, which Apache Cloudberry does not have.
Adapted from hashdata-lightning commit c3be66315ee (by GongXun); the
Arrow/Acero vectorized executor, parallel-hash variants, and vec-engine GUC
cost gating were dropped (not applicable here).
Verified against the PostgreSQL planner (optimizer=off) on a 25-case
cross-optimizer correctness battery -- IN / EXISTS / NOT IN / NOT EXISTS,
multi-column / text / correlated keys, hash / replicated / random
distribution, empty / all-NULL / duplicate-key inputs, spill, self-semijoin,
three-way and projection cases -- every result set matches, and EXPLAIN shows
"Hash Right Semi Join" / "Hash Right Anti Join" building on the smaller LHS.
The cherry-picked upstream tests in join.sql produce optimizer- and MPP-dependent plan shapes, which made the ic-(orca-)parallel CI jobs fail: join_optimizer.out had no entry for the new hash-right-semi (tbl_rs) test, and the modified "semijoin selectivity for <>" EXPLAIN differs from upstream. - Wrap the two plan-sensitive EXPLAINs (the <> semijoin selectivity query and the tbl_rs right-semi rescan query) in -- start_ignore / -- end_ignore so the plan shape is not compared; the deterministic query results are still checked. - join_optimizer.out (GPORCA): add the tbl_rs section -- ORCA produces a Hash Right Semi Join and returns the 10 expected rows -- and refresh the <> section for the new query. - join.out (Postgres planner): the tbl_rs correlated query hits the planner's "skip-level correlations not supported" limitation, so the error is recorded as the expected output; ORCA handles it. Only these two backport-related sections change; the rest of the answer files are left at their canonical content.
8ad6a81 to
34dfda2
Compare
…H Q21 results
Adds a GUC (default on) to enable/disable GPORCA's right semi/anti hash join
xforms (CXformLeftSemiJoin2RightSemiHashJoin /
CXformLeftAntiSemiJoin2RightAntiSemiHashJoin). When turned off, the xforms
are disabled via the standard CConfigParamMapping traceflag mechanism
(GPOPT_DISABLE_XFORM_TF), so ORCA falls back to its regular left-semi / anti
plans.
Serves both as a kill-switch for the new plan shape and as the on/off toggle
for before/after performance comparisons (e.g. TPC-H Q21).
The GUC is registered in unsync_guc_name.h (per-segment, no QD/QE sync
required), matching the other optimizer_enable_* developer GUCs.
SET optimizer_enable_right_semi_join = off; -- ORCA uses plain Hash Semi Join
SET optimizer_enable_right_semi_join = on; -- ORCA may use Hash Right Semi Join
Regression test
---------------
Adds src/test/regress/sql/rightsemijoin.sql (registered in greenplum_schedule)
with both planner (rightsemijoin.out) and GPORCA (rightsemijoin_optimizer.out)
answer files. Under GPORCA it asserts that the GUC flips the plan between
Hash Right Semi/Anti Join (build on the small LHS) and the regular Hash
Semi/Anti Join, and that results are identical either way.
TPC-H Q21 before/after (SF=100, GPORCA, 3-segment demo cluster)
---------------------------------------------------------------
Q21 contains an EXISTS (semijoin) and a NOT EXISTS (anti join) over lineitem
(600,037,902 rows), exercising both new plan shapes. The semijoin
(l1 EXISTS l2) is the node that differs between the two plans:
ON : Hash Right Semi Join - builds the hash on the small LHS
(448,837 rows); streams lineitem l2 (200M rows).
OFF : Hash Semi Join - builds the hash on lineitem l2 (200M rows).
At a production-level statement_mem (7 GB) the build-side difference shows up
as a spill that ON avoids:
enable_right_semi_join semijoin build hash l3 anti-join (common) time
on 449K rows, 1 batch (no 126M rows, 4 batches 354.0 / 355.6 s
spill)
off 200M rows, 8 batches, 126M rows, 4 batches 388.3 / 397.7 s
spills 1.14 GB
So at realistic memory the OFF plan spills the 200M-row semijoin build (8
batches), while ON keeps it fully in memory (1 batch); end-to-end ~10% faster
and total "Memory wanted" drops 60 GB -> 38 GB.
The end-to-end gain is bounded because Q21 is scan-bound in the row executor:
the three lineitem scans (l1/l2/l3, ~452M rows) and the l3 anti-join spill are
common to both plans and dominate runtime. Isolating the semijoin (small LHS
vs a unique 150M-row RHS, so de-duplication does not help the OFF plan) shows
the operator-level win clearly: ~20 s (ON, build on 3K-row LHS, no spill) vs
~77 s (OFF, build/dedup the 150M-row side, spills 1.5 GB) -- about 3.9x. The
relative win grows further in a vectorized executor, where the common scan /
tuple-deform cost is much cheaper.
34dfda2 to
b5aeefd
Compare
|
Re-ran Q21 (SF=100) at a production-level
At realistic memory the off plan spills the 200M-row semijoin build (8 batches), while on keeps it fully in memory (1 batch) → ~10% end-to-end and total Memory wanted 60 GB → 38 GB. (At the tiny default The end-to-end gain is bounded because Q21 is scan-bound in the row executor: the three On bumping the scale further: SF=500 isn't feasible on this box — the loaded SF=100 set is already ~216 GB, so SF=500 would be ~1 TB+ of table data alone and exceeds the disk. More importantly, in the row executor a larger SF mostly scales both plans together (still scan-bound), so it wouldn't move the ratio much; the spill-avoidance is best demonstrated by the build-side-vs- |
…eaders Fix a server crash (SIGSEGV in ExecInitPartitionSelector) when GPORCA chose a Hash Right Semi Join for a semijoin over a partitioned table that also used dynamic partition elimination (e.g. TPC-H-style "ptid IN (select ... )" with a Partition Selector). Root cause: the GPDB executor builds a right-semi/right-anti hash join on the outer (LHS) child (via the build-side swap in CTranslatorDXLToPlStmt), which is the opposite of what PppsRequiredForJoins assumes (Partition Selector on the inner/build child, consumer Dynamic Scan on the outer/probe child). With the swap, the selector landed on the probe side and the consumer on the build side, producing a malformed plan that crashed ExecInitPartitionSelector. Fix: CPhysicalRightSemiHashJoin no longer overrides PppsRequired/PppsDerive, so it inherits CPhysical's base behaviour and does not host a join-driven Partition Selector (matching CPhysicalRightAntiSemiHashJoin, which already did not override them). A partitioned semijoin therefore falls back to the regular Hash Semi Join that carries the selector correctly, while non-partitioned right-semi plans are unaffected. Also add the Apache Software Foundation license header to the eight new GPORCA right-semi/right-anti source files (required by the Apache RAT check) and replace stray non-English text in two doc comments.
…nges Backporting "Right Semi Join" makes both the Postgres planner and GPORCA pick Hash Right Semi / Right Anti Join plan shapes across many existing tests (the build side and child order change relative to the old Hash Semi / Anti Join), which is the same answer-file churn the upstream PostgreSQL commit produced. Regenerate the affected answer files for the 3-segment suites (.out for the planner, _optimizer.out for GPORCA, covering the parallel suites that share them) and the singlenode suite. New _optimizer.out files are added for multirangetypes, partition_join and rpt_joins where GPORCA now diverges from the planner output. dpe_optimizer.out is regenerated after the right-semi/DPE crash fix, so it reflects the corrected plan (right-semi no longer hosts a Partition Selector). Results are unchanged; only EXPLAIN plan shapes and unordered-row presentation differ.
… Join Second round of Right Semi Join plan-shape churn surfaced by CI: - contrib/pax_storage/src/test/regress/expected/: the PAX storage test suite has its own expected/ directory, which also needs the Hash Right Semi/Anti Join plan updates (.out for the planner, _optimizer.out for GPORCA). New multirangetypes/rpt_joins _optimizer.out are added where GPORCA diverges from the planner output. - src/test/regress/expected/partition_append.out: previously missed; the semijoin there now uses a Hash Right Semi Join. Results are unchanged; only EXPLAIN plan shapes differ. dpe_optimizer.out (PAX) reflects the right-semi/DPE crash fix.
partition_append has no _optimizer.out, so GPORCA shared the planner .out. With Right Semi Join, GPORCA's plan for the partitioned semijoin now differs from the planner's, so split out a GPORCA-specific answer file. Results are unchanged; only the EXPLAIN plan shape differs (Hash Semi Join -> Hash Right Semi Join).
multirangetypes has nothing to do with Right Semi Join (it contains no semijoins), so it never needed a GPORCA-specific answer file. The files added earlier only papered over an unrelated, pre-existing GPORCA behavior on a "money" multirange EXCEPT (could not identify a hash function for type money) by baking the error into the expected output. Remove both the core and PAX copies; the unrelated multirange/money issue is not in scope for this PR.
The PAX rpt_joins planner answer file was not regenerated in the previous commit (only rpt_joins_optimizer.out was), so pax-ic-good-opt-off still failed on the Hash Right Semi Join plan change. Update rpt_joins.out from the opt-off CI results; results are unchanged, only the plan shape differs.
Larger-scale performance comparison (SF≈300, GPORCA)Loaded a larger TPC-H data set (lineitem 1.8B rows, orders 450M rows) to show the right-semi win at scale and the spill it avoids. Note on Q21 itself at this scale: with 1.8B-row To isolate the operator where right-semi is the right choice — a small probe-side driving table against a large, unique build key — I used select count(*) from skeys s where s.k in (select o_orderkey from orders); -- 10K vs 450M
→ ~4.6× faster (≈62 s vs ≈285 s). The off plan must de-duplicate the 450M-row side, spilling 4.1 GB across 1281 batches and asking for ~16.9 GB of work_mem; the right-semi plan builds the hash on the 3,385-row LHS, stays in memory (1 batch, 4 MB) and just streams The win grows with scale (≈3.9× at SF=100 → ≈4.6× here) and, more importantly, eliminates the large-side dedup spill entirely — which is exactly the case this backport targets. (SF=500 was attempted but the loaded heap tables exceed the 1 TB test disk before the load completes — physical size is ~2× the logical size — so this SF≈300 run is the largest that fits here.) |
What
Backports two upstream PostgreSQL commits that add Hash Right Semi Join:
aa86129e1— Support "Right Semi Join" plan shapes5668a857d— Fix right-semi-joins in HashJoin rescansThis lets the planner build the hash table on the smaller (LHS) side of an
IN/EXISTSsemijoin instead of always hashing the inner relation.Cloudberry/GPDB-specific adaptations
JOIN_RIGHT_SEMIis appended at the end of theJoinTypeenum rather than in upstream's mid-list position. Inserting it mid-list
shifts the integer values of the GPDB-only
JOIN_DEDUP_SEMI/_REVERSEandJOIN_UNIQUE_*codes, which corrupts MPP motion planning and crashes duringdispatch (SIGSEGV in
setupCdbProcessList). Appending keeps every existingvalue stable.
cdbpath_motion_for_join, serial + parallel switches):handle
JOIN_RIGHT_SEMIlikeJOIN_RIGHT/JOIN_RIGHT_ANTI— the inner(build) side must not be replicated, since a right-semi join emits
build-side rows.
JOIN_RIGHT_SEMIpath alongsideJOIN_SEMIwhilepreserving the existing GPDB
JOIN_DEDUP_SEMIhandling.Testing
Hash Right Semi Joinis chosen for small-build-side semijoins;results verified correct (dedup semantics, rescan correctness, MPP execution
across segments).
jointest expected output is still being reconciledagainst CI's canonical environment — hence this PR is opened as a draft.
Notes
optimizer=off); GPORCA does notgenerate
JOIN_RIGHT_SEMI.