[SPARK-57133][SQL] Add BIN BY relation operator parsing and resolution by vranes · Pull Request #56247 · apache/spark

vranes · 2026-06-01T13:17:37Z

What changes were proposed in this pull request?

This is the first PR in a planned series implementing the BIN BY relation operator (SPARK-57133). It adds the parser, analyzer, and error classes. Physical execution is intentionally stubbed and lands in a follow-up PR.

BIN BY is a relation-level operator (same grammar position as PIVOT / UNPIVOT) that aligns range-typed rows to fixed-width bin boundaries: it splits any row whose [range_start, range_end) crosses a boundary and proportionally redistributes selected numeric or day-time-interval values across the resulting sub-ranges. The target use case is telemetry and observability data, where each row carries its own measurement window (OpenTelemetry, Prometheus exports).

Syntax:

SELECT * FROM relation BIN BY (
  RANGE rangeStartCol TO rangeEndCol
  BIN WIDTH widthExpr
  [ALIGN TO originExpr]
  DISTRIBUTE UNIFORM (distributeCol [, distributeCol ...])
  [BIN_START AS aliasName] [BIN_END AS aliasName] [BIN_DISTRIBUTE_RATIO AS aliasName]
) [AS resultAlias];
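To make the role of `BIN WIDTH` and `ALIGN TO` concrete, here is a minimal Python sketch of how a fixed-width grid could be anchored at an origin. It assumes bins sit at `origin + k * width` for integer `k`, which is the natural reading of the syntax but is not confirmed by this PR; `bin_floor` is a hypothetical helper, not a Spark function.

```python
from datetime import datetime, timedelta, timezone


def bin_floor(ts, width, origin):
    """Start of the bin containing ts, assuming bins at origin + k * width."""
    # timedelta // timedelta yields an int and floors toward negative infinity,
    # so timestamps before the origin also land in the correct bin.
    k = (ts - origin) // width
    return origin + k * width


epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
ts = datetime(2026, 6, 1, 13, 17, 37, tzinfo=timezone.utc)
width = timedelta(minutes=5)

# Epoch-anchored grid: boundaries at :00, :05, :10, ...
default_start = bin_floor(ts, width, epoch)
# Grid shifted by two minutes (an explicit origin): boundaries at :02, :07, ...
shifted_start = bin_floor(ts, width, epoch + timedelta(minutes=2))
```

With the epoch as origin the timestamp falls in the bin starting at 13:15:00; shifting the origin by two minutes moves it to the bin starting at 13:17:00.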

What this PR adds:

Grammar (SqlBaseLexer.g4, SqlBaseParser.g4): the binByClause rule and 7 new non-reserved keywords (BIN, WIDTH, ALIGN, UNIFORM, BIN_START, BIN_END, BIN_DISTRIBUTE_RATIO), wired into relationExtension and the pipe operatorPipeRightSide, with an optional trailing table alias.
Logical plans (basicLogicalOperators.scala): UnresolvedBinBy (parser output) and the resolved BinBy, plus the BinByOutputAliases helper. This follows the two-class Unpivot -> UnpivotTransformer precedent.
AST builder (AstBuilder.scala): withBinBy, which wraps the node in a SubqueryAlias when a trailing alias is present.
Analyzer rule (ResolveBinBy.scala, wired into Analyzer.scala): resolves column references against the child output, validates types and foldability, fills the default origin (session-zone-anchored for TIMESTAMP, wall-clock epoch for TIMESTAMP_NTZ), captures the session time zone, and builds the output schema. Registered in RuleIdCollection; the BIN_BY / UNRESOLVED_BIN_BY tree patterns are added in TreePatterns.
Error classes (error-conditions.json, QueryCompilationErrors.scala): the 8 BIN_BY_* conditions, with analysis-time builders for the 7 raised during resolution (the runtime BIN_BY_INVALID_RANGE is defined here and raised in the execution PR).
Execution stub (SparkStrategies.scala): the lowering throws UnsupportedOperationException until the execution PR lands.

The output is the input columns plus three appended columns: bin_start and bin_end (matching the range column type) and bin_distribute_ratio (DOUBLE, the fraction of the original range that fell into the bin). All three are renameable.
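The splitting and proportional redistribution described above can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the operator's implementation (execution is not part of this PR): ranges are plain integers, the original range columns are passed through unchanged, empty or inverted ranges are rejected, and `bin_by` is a hypothetical name.

```python
def bin_by(row, start_col, end_col, width, distribute_cols, origin=0):
    """Toy model: split one row (a dict) at every bin boundary it crosses."""
    start, end = row[start_col], row[end_col]
    if end <= start:
        # Assumption: empty or inverted ranges are an error.
        raise ValueError("range must satisfy start < end")
    out = []
    # First bin boundary at or before the range start.
    bin_start = origin + ((start - origin) // width) * width
    while bin_start < end:
        bin_end = bin_start + width
        # Portion of [start, end) that falls inside [bin_start, bin_end).
        overlap = min(end, bin_end) - max(start, bin_start)
        ratio = overlap / (end - start)
        piece = dict(row)
        for col in distribute_cols:
            piece[col] = row[col] * ratio  # uniform (proportional) split
        piece.update(bin_start=bin_start, bin_end=bin_end,
                     bin_distribute_ratio=ratio)
        out.append(piece)
        bin_start = bin_end
    return out


# A measurement window [50, 130) carrying 800 bytes, binned at width 60.
rows = bin_by({"t0": 50, "t1": 130, "bytes": 800},
              "t0", "t1", width=60, distribute_cols=["bytes"])
```

The window crosses two boundaries, so the model emits three rows for bins [0, 60), [60, 120), and [120, 180) with ratios 0.125, 0.75, and 0.125; the distributed `bytes` values still sum to 800.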

Why are the changes needed?

Telemetry and observability sources emit rows that each carry their own [start, end) measurement window. Re-bucketing such data onto a fixed grid today requires verbose SQL with manual boundary arithmetic, row explosion, and proportional splitting. BIN BY expresses this as a single relation operator.

Does this PR introduce any user-facing change?

No. The operator parses and resolves, but physical execution is intentionally stubbed in this PR (the strategy throws UnsupportedOperationException), so BIN BY is not usable end to end yet; execution arrives in a follow-up PR. The 7 new keywords are non-reserved, so existing queries that use them as identifiers continue to parse unchanged.

How was this patch tested?

New unit tests, all passing:

PlanParserSuite: BIN BY parsing (minimal and maximal clauses, qualified column references, output renames, trailing alias, and the pipe form), parse-error cases, and confirmation that the new keywords remain usable as identifiers.
ResolveBinBySuite: resolution against the child output, session-zone capture, default-origin arithmetic (UTC, non-UTC, NTZ), output schema and renames, multipart disambiguation across a join, and every analysis-time error class.

build/sbt 'catalyst/testOnly *ResolveBinBySuite *PlanParserSuite' reports 107 tests passed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

cloud-fan

1 blocking, 1 non-blocking, 0 nits.
A clean, well-tested incremental PR; the one blocking item is an analysis-time integration gap (DeduplicateRelations).

Correctness (1)

basicLogicalOperators.scala:1803: BinBy produces new attributes but isn't registered in DeduplicateRelations — self-joins over a shared BinBy subtree (e.g. a temp view referenced twice) leave conflicting ExprIds the analyzer can't resolve — see inline

Suggestions (1)

ResolveBinBy.scala:153: BIN_BY_COLUMN_NOT_FOUND is misleading for a nested/computed reference (the column exists) — see inline

Verification

Traced DeduplicateRelations: it renews produced attributes only for explicitly-enumerated node types (Generate/Expand/Window/ScriptTransformation/AttachDistributedSequence/FlatMap*/MapIn*) in both renewDuplicatedRelations and collectConflictPlans; BinBy matches neither and hits the child-only fallbacks, so its appendedAttributes are never renewed. The Unpivot analogue is safe only because it lowers to Expand (a registered producer). The comment at DeduplicateRelations.scala:494-499 confirms an unregistered producer fails analysis.

cloud-fan · 2026-06-02T12:13:16Z

+
+  override def output: Seq[Attribute] = child.output ++ appendedAttributes
+
+  override def producedAttributes: AttributeSet = AttributeSet(appendedAttributes)


BinBy is a new attribute-producing logical node (producedAttributes = AttributeSet(appendedAttributes)), but it isn't registered with DeduplicateRelations.

That rule resolves self-join attribute conflicts by enumerating every attribute-producing node explicitly — Generate, Expand, Window, ScriptTransformation, AttachDistributedSequence, the FlatMap*/MapIn* family — in both renewDuplicatedRelations (DeduplicateRelations.scala:114-220) and collectConflictPlans (:363-487). BinBy is in neither, so it falls to the generic fallbacks (case plan: LogicalPlan at :222, case _ => plan.children.flatMap(...) at :489), which renew/recurse the children but never renew the node's own appendedAttributes. The comment at :494-499 spells out the result: an unhandled producer of new references makes "the analysis ... fail" unless another rule resolves it — and there's none for BinBy.

The Unpivot analogue avoids this only because it lowers to an Expand (UnpivotTransformer.scala), which is a registered producer; BinBy resolves to a bespoke node that skipped registration. So a query that self-joins a shared BinBy subtree (a temp view referenced twice, or a future DataFrame self-join) leaves bin_start/bin_end/bin_distribute_ratio with conflicting ExprIds the analyzer can't resolve — failing at analysis time, before the stubbed strategy, with a confusing ambiguous-reference/conflict error.

Since analysis is fully functional in this PR, this is in scope here. Suggested fix: add a BinBy case to both dedup phases (renew appendedAttributes, exactly as the Expand case does) plus a self-join / duplicated-view regression test. If you'd rather defer the integration to the execution PR, please add a test documenting the current behavior so the gap is tracked.

vranes · 2026-06-02

Good catch, fixed. BinBy is now registered in both dedup phases, and ResolveBinBySuite has a self-join regression test that fails without the fix.
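The dedup gap discussed in this thread can be illustrated with a small Python model. It is a deliberately simplified sketch, not Spark code: `Attr`, `Relation`, `BinBy`, `renew`, and `conflicting_names` are hypothetical stand-ins for attributes with ExprIds, plan nodes, and the renewal step, and the `binby_registered` flag stands in for whether the rule has a case for the node.

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count(1)


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


def fresh(name):
    """Create an attribute with a new, globally unique id."""
    return Attr(name, next(_ids))


@dataclass(frozen=True)
class Relation:
    attrs: tuple

    @property
    def output(self):
        return self.attrs


@dataclass(frozen=True)
class BinBy:
    child: object
    appended: tuple  # bin_start, bin_end, bin_distribute_ratio

    @property
    def output(self):
        return self.child.output + self.appended


def renew(plan, binby_registered):
    """Copy the plan, assigning fresh ids for every producer the rule handles."""
    if isinstance(plan, Relation):
        return Relation(tuple(fresh(a.name) for a in plan.attrs))
    if isinstance(plan, BinBy):
        child = renew(plan.child, binby_registered)
        appended = plan.appended
        if binby_registered:
            appended = tuple(fresh(a.name) for a in appended)
        # Unregistered: children are renewed, the node's own attributes are not.
        return BinBy(child, appended)
    raise TypeError(f"unsupported node: {plan!r}")


def conflicting_names(left, right):
    """Names of right-side attributes whose ids also appear on the left."""
    left_ids = {a.expr_id for a in left.output}
    return sorted(a.name for a in right.output if a.expr_id in left_ids)


view = BinBy(
    Relation((fresh("t0"), fresh("t1"), fresh("bytes"))),
    (fresh("bin_start"), fresh("bin_end"), fresh("bin_distribute_ratio")),
)

# Self-join: the right side is a renewed copy of the same subtree.
before = conflicting_names(view, renew(view, binby_registered=False))
after = conflicting_names(view, renew(view, binby_registered=True))
```

In this model `before` lists the three appended columns as still sharing ids across both join sides, while `after` is empty once the node's own attributes are renewed too.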

cloud-fan · 2026-06-02T12:13:16Z

+        case Some(_) =>
+          // Resolved to a NamedExpression that is not a top-level Attribute (e.g.,
+          // `RANGE struct_col.field TO ...` resolves to an Alias wrapping GetStructField).
+          throw QueryCompilationErrors.binByColumnNotFoundError(u.name)


When a reference resolves to something other than a top-level Attribute — e.g. RANGE struct_col.field TO ... resolving to an Alias(GetStructField) — this throws BIN_BY_COLUMN_NOT_FOUND ("The column outer.field was not found in the input relation"). That's a bit misleading: the column does exist; BIN BY simply requires a plain top-level column. A distinct message (nested/computed columns unsupported in BIN BY) would be clearer for the user.

Non-blocking — the comment here already acknowledges the case.

vranes · 2026-06-02

Added a distinct BIN_BY_REQUIRES_TOP_LEVEL_COLUMN condition for references that resolve to something other than a top-level attribute, mirroring EVENT_TIME_MUST_BE_TOP_LEVEL_COLUMN.
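The distinction between the two conditions in this thread can be sketched as a three-way outcome. This Python model is illustrative only: the exception classes borrow the condition names, the schema is a plain dict, table qualifiers such as `t.col` are ignored, and `resolve_range_column` is a hypothetical helper rather than the analyzer rule.

```python
class BinByColumnNotFound(Exception):
    """The reference does not match anything in the input relation."""


class BinByRequiresTopLevelColumn(Exception):
    """The reference exists but is a nested field, not a top-level column."""


def resolve_range_column(name, schema):
    """schema maps top-level column names to a type name or a dict of nested fields."""
    parts = name.split(".")
    if parts[0] not in schema:
        raise BinByColumnNotFound(name)
    node = schema[parts[0]]
    for field in parts[1:]:
        if not isinstance(node, dict) or field not in node:
            raise BinByColumnNotFound(name)
        node = node[field]
    if len(parts) > 1:
        # The field exists, so "not found" would mislead the user.
        raise BinByRequiresTopLevelColumn(name)
    return parts[0]


def outcome(name, schema):
    """Return the resolved column name, or the name of the error raised."""
    try:
        return resolve_range_column(name, schema)
    except (BinByColumnNotFound, BinByRequiresTopLevelColumn) as e:
        return type(e).__name__


schema = {"ts_start": "timestamp", "span": {"start": "timestamp"}}
```

Here `outcome("ts_start", schema)` resolves, `outcome("span.start", schema)` reports the top-level requirement, and `outcome("missing", schema)` reports a genuinely absent column.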


vranes added 3 commits June 1, 2026 13:02

[SPARK-57133][SQL] Add BIN BY relation operator parsing and resolution

dca8f10

[SPARK-57133][SQL] Fix import ordering in ResolveBinBy

13accbd

[SPARK-57133][SQL] Document new BIN BY keywords and fix CI failures

29ab168

cloud-fan reviewed Jun 2, 2026


vranes added 4 commits June 2, 2026 15:19

[SPARK-57133][SQL] Update remaining keyword lists for new BIN BY keywords

90a2773

[SPARK-57133][SQL] Register BinBy in DeduplicateRelations

a5b8a8d

[SPARK-57133][SQL] Use a distinct error for nested columns in BIN BY

f937b24

[SPARK-57133][SQL] Use SparkException.internalError for the BIN BY execution stub

079b828

vranes wants to merge 7 commits into apache:master from vranes:bin-by-parser
