[SPARK-57285][SQL] Route nanosecond timestamp cast-to-string through the Types Framework #56355
MaxGekk wants to merge 6 commits into
…the Types Framework

### What changes were proposed in this pull request?

This PR makes the Types Framework (`TypeApiOps`) the single integration point for nanosecond timestamp `CAST(... AS STRING)`, for both the interpreted and codegen paths, with no change to the rendered string. Specifically:

- `TypeApiOps` gains a zone-aware formatting hook: `format(v, zoneId)` and `formatUTF8(v, zoneId)`, both defaulting to the existing zone-less `format(v)` so zone-independent framework types (e.g. `TimeType`) are unaffected.
- `TimestampNTZNanosTypeApiOps` / `TimestampLTZNanosTypeApiOps` implement the hook: NTZ is zone-independent (UTC `formatWithoutTimeZoneNanos`); LTZ renders in the session zone via `formatNanos`. The zone-less callers (EXPLAIN, SQL-literal `toSQLValue`) now format NTZ directly, while LTZ without a session zone keeps raising `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING`.
- `ToStringBase` no longer bypasses the framework: the interpreted path threads the session `zoneId` into `formatUTF8(v, zoneId)` for the nanos types, and the codegen path emits a runtime call into the ops reference object instead of inlining `formatNanos` / `formatWithoutTimeZoneNanos`.

### Why are the changes needed?

SPARK-57256 implemented nanosecond cast-to-string inline in `ToStringBase`, deliberately bypassing the framework because the zone-less `TypeApiOps.format(v)` cannot render LTZ in the session time zone. That left nanos cast-to-string as a one-off outside the framework, inconsistent with the SPIP direction (SPARK-56822) of wiring the new types through the centralized `TypeOps` / `TypeApiOps`. This PR closes that gap.

### Does this PR introduce _any_ user-facing change?

No. This is a refactor; the rendered string output is unchanged (zone-aware LTZ, zone-independent NTZ, precision flooring, trailing-zero trimming). NTZ `EXPLAIN` / SQL-literal rendering now succeeds instead of raising, which was previously unreachable zone-less behavior for an internal type.

### How was this patch tested?

- Updated `TimestampNanosTypeOpsSuite` to cover the zone-aware hook (NTZ renders zone-independently, LTZ renders in the session zone and still raises when zone-less).
- Existing `CastWithAnsiOnSuite`, `CastWithAnsiOffSuite`, `ToPrettyStringSuite`, and `TimestampNanosRowSuite` stay green unchanged (275 tests pass).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.8)
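The zone-aware hook described in this commit message can be pictured with a small stand-alone sketch. Everything below is hypothetical: `SketchOps`, `SketchTimeOps`, and `SketchLtzOps` are illustrative names, not Spark's `TypeApiOps` hierarchy, and `java.time` stands in for Spark's own formatters. It only shows the shape of the idea: a two-argument `format` that defaults to the zone-less one, so only zone-dependent types need to override it.

```scala
import java.time.{Instant, LocalDateTime, ZoneId}

// Hypothetical stand-in for the framework trait; not Spark's real TypeApiOps.
trait SketchOps {
  def format(v: Any): String
  // Zone-aware hook: falls back to the zone-less method unless a type overrides it.
  def format(v: Any, zoneId: ZoneId): String = format(v)
}

// Zone-independent type: implements only the zone-less method.
object SketchTimeOps extends SketchOps {
  def format(v: Any): String = v.toString
}

// Zone-dependent type: overrides the hook and rejects zone-less calls.
object SketchLtzOps extends SketchOps {
  def format(v: Any): String =
    throw new UnsupportedOperationException("a time zone is required")

  override def format(v: Any, zoneId: ZoneId): String = v match {
    case i: Instant => LocalDateTime.ofInstant(i, zoneId).toString
    case other => throw new IllegalArgumentException(s"unexpected value: $other")
  }
}
```

With this shape, a caller that always passes a zone works for both kinds of type, while a zone-less call on the zone-dependent type fails loudly rather than rendering in an arbitrary zone.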
The fraction TimestampFormatter has its own internal cache, so the manual per-zone formatter caching in TimestampLTZNanosTypeApiOps was redundant. Build the formatter per call instead.
Drop the dedicated nanosecond-timestamp branch in ToStringBase.castToString: all framework types now flow through the single zone-aware dispatch (ops.formatUTF8(v, zoneId)). Zone-independent types (TimeType, TIMESTAMP_NTZ nanos) ignore the zone; only TIMESTAMP_LTZ nanos renders in it. The TimeType interpreted-vs-codegen consistency test previously left the cast unresolved; since cast-to-string now evaluates the session zone, resolve a time zone in that test like a real cast does, mirroring why the microsecond TimestampType has no such consistency test.
@davidm-db @dejankrak-db @stevomitric Could you review this PR, please.
// LTZ rendering depends on the session time zone. The fraction formatter has its own internal
// cache, so build it per call rather than caching it here.
override def format(v: Any, zoneId: ZoneId): String = {
  val formatter = TimestampFormatter.getFractionFormatter(zoneId)
This builds a fresh formatter per row, while NTZ caches its own. The zoneId is constant across a cast - could we cache it here the same way NTZ does rather than new-ing one each call?
Good catch, fixed. TimestampLTZNanosTypeApiOps now takes the ZoneId as a constructor parameter and holds the fraction formatter in a @transient private lazy val, exactly like NTZ, so it is built once per ops instance (once per cast, per task) rather than per row. The zone is threaded in centrally via TypeApiOps.apply(dt, zoneId) (by-name, defaulting to the session-local time zone), and CAST passes its resolved zone. Done in cb0a03d.
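A minimal sketch of the "zone as constructor parameter, formatter in a `@transient private lazy val`" pattern follows. The class name `SketchLtzNanosOps`, the `FormatterBuilds` counter, and the use of `java.time.format.DateTimeFormatter` with a fixed nine-digit fraction pattern are assumptions made for illustration; Spark's actual fraction formatter and its output format are not reproduced here.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Hypothetical counter, present only to make the caching observable.
object FormatterBuilds { var count = 0 }

// Hypothetical ops class carrying its zone; not Spark's TimestampLTZNanosTypeApiOps.
class SketchLtzNanosOps(zoneId: ZoneId) extends Serializable {
  // Built on first use, then reused for every value this instance formats.
  // @transient keeps the (non-serializable) formatter out of the serialized form.
  @transient private lazy val formatter: DateTimeFormatter = {
    FormatterBuilds.count += 1
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSSSSS").withZone(zoneId)
  }

  def format(v: Instant): String = formatter.format(v)
}
```

Formatting many values through one instance builds the formatter once, which is the per-row allocation the review comment asked to avoid.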
TimestampLTZNanosTypeApiOps now takes a ZoneId constructor param (defaulting to the session-local time zone config) and builds its fraction formatter once per instance via a lazy val, mirroring NTZ, instead of constructing a fresh formatter per row. ToStringBase constructs the LTZ ops directly with the cast's resolved zone (interpreted and codegen), so the per-row formatter allocation is gone on both paths. With LTZ rendering driven by the carried zone, the zone-aware format(v, zoneId)/formatUTF8(v, zoneId) hook on TypeApiOps is no longer needed and is removed, simplifying the trait and the codegen (no ZoneId reference object). The zone-less framework lookup (EXPLAIN, SQL-literal toSQLValue, Row JSON) now renders LTZ in the session zone rather than raising, so the UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING error condition and its helper are removed.
Keep a single unified cast-to-string dispatch: ToStringBase calls TypeApiOps(from, zoneId) for both the interpreted and codegen paths instead of special-casing TIMESTAMP_LTZ nanos to construct the ops directly. TypeApiOps.apply gains a by-name zoneId parameter (defaulting to the session-local time zone config) that it threads into the LTZ ops. By-name is required so the zone is forced only when the LTZ ops is constructed: zone-independent and unsupported types never evaluate it, which matters because a CAST's zone is unresolved (None.get) until a time zone is assigned. TimestampLTZNanosTypeApiOps's zoneId is now a required constructor param (no default); the server-side catalyst TimestampLTZNanosTypeOps, which never renders (cast-to-string flows through TypeApiOps.apply), passes UTC rather than reading the session config on every construction.
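The reason a by-name parameter matters here can be shown with a small stand-alone sketch. The types `SketchType`, `SketchOps`, and `SketchLookup` are hypothetical and only mimic the dispatch shape; the unresolved zone is modelled as `None.get` on an `Option[ZoneId]`, as the commit message describes.

```scala
import java.time.ZoneId

// Hypothetical type tags standing in for Spark data types.
sealed trait SketchType
case object SketchTime extends SketchType
case object SketchLtzNanos extends SketchType
case object SketchUnsupported extends SketchType

final case class SketchOps(name: String, zone: Option[ZoneId])

object SketchLookup {
  // `zoneId` is by-name (=> ZoneId): the argument expression runs only where it is referenced.
  def apply(dt: SketchType, zoneId: => ZoneId): Option[SketchOps] = dt match {
    case SketchTime => Some(SketchOps("time", None))             // zone never evaluated
    case SketchLtzNanos => Some(SketchOps("ltz", Some(zoneId)))  // zone forced here
    case SketchUnsupported => None                               // zone never evaluated
  }
}
```

Passing `unresolved.get` for an empty `Option` is therefore harmless for the zone-independent and unsupported cases and fails only when the LTZ ops is actually built. With a plain by-value parameter the same call would throw before the lookup even started.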
genjavadoc turns the scaladoc `[[TypeApiOps.apply]]` member reference into
`{@link TypeApiOps.apply}`, which javadoc rejects ("reference not found") because
member references need `#`, not `.`. Use monospaced plain text instead of a link.
stevomitric left a comment
LGTM. Thanks @MaxGekk for resolving comments.
@uros-b @cloud-fan Could you review this PR, please.
// Route nanosecond timestamp cast-to-string through the Types Framework: emit a runtime
// call into the ops reference object. The cast's session zone is threaded into the lookup
// so LTZ carries it; NTZ is zone-independent (SPARK-57285).
val ops = TypeApiOps(from, zoneId).get
TypeApiOps(...) returns None whenever typesFrameworkEnabled == false. In that case the new codegen path calls .get on the empty result, and the interpreted path falls through to castToStringDefault, whose nanos cases were deleted, so it now lands on the generic terminal case: case _ => o => UTF8String.fromString(o.toString). The intended invariant ("nanos types imply the framework is on") is only enforced at set-time, and only on one flag:
.checkValue(
  enabled => !enabled || SQLConf.get.typesFrameworkEnabled,
  "REQUIREMENT",
  _ => Map("confRequirement" ->
    (s"'${TYPES_FRAMEWORK_ENABLED.key}' must be true to enable the nanosecond " +
      "timestamp types.")))
TYPES_FRAMEWORK_ENABLED has no symmetric guard, so a session can set timestampNanosTypes.enabled=true, materialize nanos values, then set types.framework.enabled=false. In that (admittedly unusual, internal-flag) state, casting a nanos value to string would:
- interpreted: silently produce TimestampNanosVal.toString (wrong output) instead of a formatted timestamp;
- codegen: throw NoSuchElementException from .get.
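The two failure modes listed above can be reproduced in a stand-alone sketch. `SketchNanosVal` and `SketchDispatch` are hypothetical stand-ins, not Spark classes; the sketch only assumes that the lookup yields `None` when the framework flag is off, as the comment states.

```scala
// Hypothetical value type standing in for a nanosecond timestamp value.
final case class SketchNanosVal(epochSeconds: Long, nanos: Int)

object SketchDispatch {
  // Hypothetical lookup that returns None when the framework flag is off.
  def lookup(frameworkEnabled: Boolean): Option[SketchNanosVal => String] =
    if (frameworkEnabled) {
      Some((v: SketchNanosVal) => s"formatted(${v.epochSeconds}.${v.nanos})")
    } else {
      None
    }

  // Interpreted-style path: an empty lookup silently falls back to toString.
  def interpreted(v: SketchNanosVal, frameworkEnabled: Boolean): String =
    lookup(frameworkEnabled).map(f => f(v)).getOrElse(v.toString)

  // Codegen-style path: calls .get unconditionally, so an empty lookup throws.
  def codegen(v: SketchNanosVal, frameworkEnabled: Boolean): String =
    lookup(frameworkEnabled).get(v)
}
```

With the flag off, `interpreted` returns the raw `toString` of the value and `codegen` throws `NoSuchElementException`, so the two paths disagree in the way the review describes.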
### What changes were proposed in this pull request?

This PR makes the Types Framework (`TypeApiOps`) the single integration point for nanosecond timestamp `CAST(... AS STRING)`, for both the interpreted and codegen paths.

Specifically:

- `TypeApiOps.apply` gains a by-name `zoneId` parameter that defaults to the session-local time zone config (`SqlApiConf.get.sessionLocalTimeZone`) and is threaded into the `TIMESTAMP_LTZ` nanos ops. It is by-name so the zone is forced only when the LTZ ops is actually constructed: zone-independent (`TimeType`, `TIMESTAMP_NTZ` nanos) and unsupported types never evaluate it, which matters because a `CAST`'s zone is unresolved (`None.get`) until a time zone is assigned.
- `TimestampLTZNanosTypeApiOps` now carries its `ZoneId` as a required constructor parameter and holds the fraction formatter in a `@transient private lazy val`, so the formatter is built once per ops instance (once per cast, per task) rather than per row. `TimestampNTZNanosTypeApiOps` stays zone-independent (UTC).
- `ToStringBase` no longer bypasses or special-cases the framework: both the interpreted and codegen paths dispatch uniformly through `TypeApiOps(from, zoneId)`. `CAST` passes its resolved zone; zone-less callers (EXPLAIN, SQL-literal `toSQLValue`, `Row.jsonValue`) accept the session-zone default.
- `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` error condition and its `DataTypeErrors` helper are removed.

The microsecond timestamp types (`TIMESTAMP` / `TIMESTAMP_NTZ`) remain handled inline in `ToStringBase` and are out of scope.

### Why are the changes needed?

SPARK-57256 implemented nanosecond cast-to-string inline in `ToStringBase`, deliberately bypassing the framework because the zone-less `TypeApiOps.format(v)` cannot render LTZ in the session time zone. That left nanos cast-to-string as a one-off integration outside the framework, inconsistent with the SPIP direction (SPARK-56822) of wiring the new types through the centralized `TypeOps` / `TypeApiOps`. This PR closes that gap.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, rendering a `TIMESTAMP_LTZ` nanosecond value to string without an explicit time zone (EXPLAIN of a literal, SQL-literal `toSQLValue`, `Row.jsonValue`) raised `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING`. Now these callers render the value in the session-local time zone (`spark.sql.session.timeZone`), consistent with how `CAST(... AS STRING)` already rendered LTZ nanos. The `UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING` error condition is removed. The `CAST(... AS STRING)` output itself is unchanged.

### How was this patch tested?

- Updated `TimestampNanosTypeOpsSuite` to cover the new behavior: NTZ renders zone-independently, LTZ renders in the zone carried by the ops instance, and zone-less LTZ now renders in the session-local time zone (instead of raising), exercising precision flooring.
- `CastWithAnsiOnSuite`, `CastWithAnsiOffSuite`, `ToPrettyStringSuite`, and `TimestampNanosRowSuite` stay green (275 tests pass), and `SparkThrowableSuite` (33 tests) confirms the removed error condition leaves the error framework consistent.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.8)