appendcol lowers to a FULL JOIN of two ROW_NUMBER() OVER () windows
(empty PARTITION BY / ORDER BY) on _row_number_main_ = _row_number_subsearch_,
with no trailing sort. That positional zip is only correct on a serial,
order-preserving executor: a bare ROW_NUMBER() OVER () assigns sequence
numbers in input order and the join preserves it. On a parallel/distributed
backend the row-number assignment is arbitrary and the hash join drops
ordering, so columns get zipped onto the wrong rows and downstream `head`
slices a non-deterministic subset.
Fix visitAppendCol to not depend on implicit input-order preservation:
- derive an explicit window ORDER BY from each child's collation
(deriveCollationOrderKeys), so ROW_NUMBER assignment follows the upstream
sort; falls back to the prior bare OVER () when the input has no collation
(positional correspondence is undefined without a sort).
- add a trailing sort by the row-number columns after the join (NULLS LAST,
same pattern as streamstats) so output order is deterministic regardless of
how the backend executes the join.
No behavior change on the serial v2/Calcite engine; makes the lowering correct
on parallel backends. Updates CalcitePPLAppendcolTest expected plans/SparkSQL.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Description
`appendcol` zips a subsearch's columns onto the main search's rows by position. Its lowering (`CalciteRelNodeVisitor.visitAppendCol`) implements this as a `FULL JOIN` of two `ROW_NUMBER() OVER ()` windows (empty `PARTITION BY` / `ORDER BY`) on `_row_number_main_ = _row_number_subsearch_`, with no trailing sort.

That positional zip is only correct on a serial, order-preserving executor: a bare `ROW_NUMBER() OVER ()` assigns sequence numbers in input order, and the join preserves it. On a parallel/distributed backend the row-number assignment is arbitrary and the hash join drops ordering, so columns get zipped onto the wrong rows and a downstream `head` slices a non-deterministic subset.

This is currently masked on the serial v2/Calcite engine, but it is a latent correctness bug for any parallel backend (the analytics engine, and the Spark pushdown path, where the `verifyPPLToSparkSQL` golden output bakes in the same non-deterministic `ROW_NUMBER() OVER ()`).

Root cause (observed)
Running the query below through a parallel backend returned rows out of `sort` order, with `cnt` attached to the wrong rows and `M` rows leaking into the top 10:

A baseline `... | sort gender, state | head 10` (no `appendcol`) returned correctly ordered rows on the same backend, isolating the cause to the row-number join.

Fix
Make `visitAppendCol` independent of implicit input-order preservation:
- Derive an explicit window `ORDER BY` from each child's collation (`deriveCollationOrderKeys`), so `ROW_NUMBER` follows the upstream sort. Falls back to the prior bare `OVER ()` when the input carries no collation (positional correspondence is undefined without a sort).
- Add a trailing sort by the row-number columns after the join (`NULLS LAST`; extra subsearch-only rows sort last), the same pattern `streamstats` already uses, so output order no longer depends on how the backend executes the join.

No behavior change on the serial v2/Calcite engine; the lowering becomes correct on parallel backends.
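
The effect of the two changes can be illustrated with a minimal sketch in plain Java, with no Calcite dependency. It is not the PR's implementation: every name here (`AppendColSketch`, `numberByArrival`, `numberBySortKey`, `zip`) is hypothetical, natural `String` order stands in for each child's collation, and a single merged row-number key stands in for the two row-number columns that the real plan sorts with `NULLS LAST`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from the PR or from Calcite.
class AppendColSketch {

    // Models ROW_NUMBER() OVER (): numbers follow whatever order the rows arrive in.
    static Map<Integer, String> numberByArrival(List<String> arrived) {
        Map<Integer, String> numbered = new HashMap<>();
        for (int i = 0; i < arrived.size(); i++) {
            numbered.put(i + 1, arrived.get(i));
        }
        return numbered;
    }

    // Models ROW_NUMBER() OVER (ORDER BY <collation keys>): numbers follow the sort,
    // so they are the same no matter how the rows arrive.
    // Assumption: natural String order stands in for the child's collation.
    static Map<Integer, String> numberBySortKey(List<String> arrived) {
        List<String> sorted = new ArrayList<>(arrived);
        Collections.sort(sorted);
        return numberByArrival(sorted);
    }

    // Models the FULL JOIN on row number followed by the trailing sort.
    // The TreeSet plays the role of the trailing sort: output order depends only
    // on the row numbers, not on how the join happened to emit its rows.
    static List<String> zip(Map<Integer, String> main, Map<Integer, String> sub) {
        TreeSet<Integer> rowNumbers = new TreeSet<>(main.keySet());
        rowNumbers.addAll(sub.keySet());
        List<String> out = new ArrayList<>();
        for (Integer n : rowNumbers) {
            // A side with no row at this number contributes "null" (FULL JOIN padding).
            out.add(main.get(n) + "|" + sub.get(n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows as a parallel backend might deliver them: upstream sort order is lost.
        List<String> mainArrived = List.of("M/AK", "F/AL", "F/AK");
        List<String> subArrived = List.of("cnt=30", "cnt=10", "cnt=40", "cnt=20");

        // Bare OVER (): pairing depends on arrival order.
        System.out.println(zip(numberByArrival(mainArrived), numberByArrival(subArrived)));
        // Explicit ORDER BY: pairing depends only on the sort keys.
        System.out.println(zip(numberBySortKey(mainArrived), numberBySortKey(subArrived)));
    }
}
```

With arrival-order numbering the pairing changes whenever the delivery order changes; with sort-key numbering both sides are numbered identically on every run, and the extra subsearch-only row lands last.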
Results
`CalcitePPLAppendcolIT` run against the analytics-engine route (force-routed, parquet-backed indices) before/after, and on the v2/Calcite path:
- `testAppendCol`
- `testAppendColOverride`

Testing
- `CalcitePPLAppendcolTest` (5 unit tests): updated expected logical plans + Spark SQL; all pass.
- `CalcitePPLAppendcolIT`: 2/2 on the analytics-engine route and 2/2 on v2/Calcite.
- `NewAddedCommandsIT.testAppendcol`: passes.
- `spotlessCheck` clean on `:core` and `:ppl`.

Check List
- Commits are signed with `--signoff`.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.