fix: std.char() rejects surrogate codepoints with an error #960
Motivation:
std.char() accepted surrogate codepoints (0xD800-0xDFFF) and produced unpaired surrogate characters in strings. The three reference implementations diverge on this: go-jsonnet replaces with U+FFFD, jsonnet-cpp preserves raw surrogates, and jrsonnet rejects with an error. Surrogates are not valid Unicode codepoints per the Unicode spec, so sjsonnet should reject them, matching jrsonnet's approach and sjsonnet's own behavior for out-of-range codepoints like 0x110000.

Modification:
- In Char_.evalRhs, extend the invalid codepoint check to include the surrogate range 0xD800-0xDFFF, raising "Invalid unicode code point"
- Updated UnicodeHandlingTests: renamed stdCharReplacesSurrogates to stdCharRejectsSurrogates, using evalErr for all surrogate inputs
- Replaced char_surrogate_replacement golden test with char_surrogate_boundary (valid boundary values) and error.char_surrogate (surrogate rejection)

Result:
std.char(0xD800) now raises "Invalid unicode code point, got 55296" instead of producing an unpaired surrogate. Valid codepoints adjacent to the surrogate range (0xD7FF, 0xE000) are unaffected. This aligns with jrsonnet and follows the Unicode specification.

References:
- jrsonnet: rejects surrogates with "invalid unicode codepoint" error
- go-jsonnet: replaces surrogates with U+FFFD (different strategy)
- jsonnet-cpp: preserves raw surrogates (different strategy)
- Unicode spec: U+D800-U+DFFF are reserved, not assignable codepoints
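The extended check described under Modification can be sketched roughly as follows. This is an illustration only, not the actual Char_.evalRhs code: the object and method names (SurrogateCheck, isValidScalarValue, charOrError) are hypothetical, and using the same message for negative and too-large inputs is an assumption; only the surrogate message is quoted from this PR.

```scala
object SurrogateCheck {
  // Hypothetical helper: true for 0..0x10FFFF excluding the surrogate block.
  def isValidScalarValue(n: Long): Boolean =
    n >= 0L && n <= 0x10FFFFL && !(n >= 0xD800L && n <= 0xDFFFL)

  // Returns the one-codepoint string, or an error message in the format
  // quoted in this PR ("Invalid unicode code point, got 55296").
  // Assumption: out-of-range inputs reuse the same message.
  def charOrError(n: Long): Either[String, String] =
    if (isValidScalarValue(n)) Right(new String(Character.toChars(n.toInt)))
    else Left(s"Invalid unicode code point, got $n")
}
```

Under this sketch the boundary values 0xD7FF and 0xE000 still produce strings, while every value from 0xD800 through 0xDFFF yields the error.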
cff0f8a to 9b6ad55
@CertainLach wdyt
Closing this PR: the original sjsonnet behavior was already correct per the Jsonnet specification.

Analysis
The Jsonnet spec states that
Verified original behavior (upstream/master):
Cross-implementation comparison:
Conclusion
The original sjsonnet behavior matches jsonnet-cpp and satisfies the spec's inverse property requirement. The go-jsonnet (U+FFFD replacement) and jrsonnet (error) behaviors are artifacts of their respective language runtimes.

The spec does not define surrogate handling; this is undefined behavior where implementations diverge. sjsonnet's original approach (preserve raw, matching jsonnet-cpp) is the only one that satisfies the spec's inverse constraint.

This was a false positive in the bug report. No code change needed.
std.codepoint(str)
Returns the positive integer representing the unicode codepoint of the character in the given single-character string. This function is the inverse of std.char(n).

std.char(n)
Returns a string of length one whose only unicode codepoint has integer id n. This function is the inverse of std.codepoint(str).

@johnbartholomew FYI
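The inverse property quoted above can be checked on the JVM with a small sketch. The helpers charOf and codepointOf are hypothetical stand-ins for std.char and std.codepoint, not sjsonnet's implementation; the sketch relies only on the fact that a JVM String is a sequence of UTF-16 code units and can therefore hold an unpaired surrogate.

```scala
object RoundTrip {
  // Stand-in for std.char: builds a string from a single code point.
  // Character.toChars accepts any value in 0..0x10FFFF, surrogates included.
  def charOf(n: Int): String = new String(Character.toChars(n))

  // Stand-in for std.codepoint: reads the code point at index 0.
  // For an unpaired surrogate, codePointAt returns the surrogate value itself.
  def codepointOf(s: String): Int = s.codePointAt(0)

  // The inverse property: codepoint(char(n)) == n.
  def roundTrips(n: Int): Boolean = codepointOf(charOf(n)) == n
}
```

With the preserve-raw strategy the round trip holds for 0xD800, 0xDC00 and 0xDFFF as well as for the neighbouring values 0xD7FF and 0xE000. Replacing with U+FFFD or raising an error would break it for the surrogate inputs.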
Summary
- std.char() now rejects surrogate codepoints (0xD800-0xDFFF) with "Invalid unicode code point" error instead of producing unpaired surrogates
- Updated UnicodeHandlingTests to expect error behavior for surrogate inputs

Behavior comparison across implementations
Note: The three reference implementations diverge on surrogate handling. This fix aligns sjsonnet with jrsonnet (reject with error), since surrogates are not valid Unicode codepoints per the Unicode spec.
- std.codepoint(std.char(0xD800))
- std.codepoint(std.char(0xDC00))
- std.codepoint(std.char(0xDFFF))
- std.codepoint(std.char(0xD7FF))
- std.codepoint(std.char(0xE000))
- std.codepoint(std.char(0xFFFD))
- std.char(-1)
- std.char(0x110000)

Three implementation strategies:
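The three strategies named in this PR (preserve, as described for jsonnet-cpp; replace with U+FFFD, as described for go-jsonnet; reject, as described for jrsonnet) can be contrasted in one sketch. All names here are hypothetical, none of this is code from those projects, and the sketch assumes the input is already within 0..0x10FFFF.

```scala
object SurrogateStrategies {
  sealed trait Strategy
  case object Preserve extends Strategy // keep the raw surrogate code unit
  case object Replace extends Strategy  // substitute U+FFFD
  case object Reject extends Strategy   // raise an error

  private def isSurrogate(n: Int): Boolean = n >= 0xD800 && n <= 0xDFFF

  // Assumes 0 <= n <= 0x10FFFF; range checking is left out of this sketch.
  def char(n: Int, strategy: Strategy): Either[String, String] =
    if (!isSurrogate(n)) Right(new String(Character.toChars(n)))
    else
      strategy match {
        case Preserve => Right(n.toChar.toString)
        case Replace  => Right("\uFFFD")
        // Message text as quoted for jrsonnet in this PR's references.
        case Reject   => Left("invalid unicode codepoint")
      }
}
```

Only the surrogate block is affected by the choice of strategy; every other input takes the same path in all three cases.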
Why reject instead of replace?
- Matches sjsonnet's own std.char(0x110000) behavior, which already errors on out-of-range codepoints

Modification
- In Char_.evalRhs (StringModule.scala), extended the invalid codepoint check to include the surrogate range 0xD800–0xDFFF
- Updated UnicodeHandlingTests: renamed stdCharReplacesSurrogates → stdCharRejectsSurrogates, using evalErr for all surrogate inputs
- Replaced char_surrogate_replacement golden test with:
  - char_surrogate_boundary: valid codepoints near surrogate range (0xD7FF, 0xE000, 0xFFFD, 0x0000, 0x0041)
  - error.char_surrogate: confirms error on surrogate input

Result
std.char(0xD800) now raises "Invalid unicode code point, got 55296" instead of producing an unpaired surrogate. Valid codepoints adjacent to the surrogate range are unaffected. Full test suite passes.

Test plan
- char_surrogate_boundary.jsonnet: valid boundary codepoints roundtrip correctly
- error.char_surrogate.jsonnet: surrogate input raises error
- UnicodeHandlingTests.stdCharRejectsSurrogates: unit tests for high/low/max surrogates
- Full test suite passes (mill sjsonnet.jvm.3_3_7.test)

References