Skip to content

fix: std.char() rejects surrogate codepoints with an error#960

Closed
He-Pin wants to merge 1 commit into
databricks:masterfrom
He-Pin:fix/std-char-surrogate
Closed

fix: std.char() rejects surrogate codepoints with an error#960
He-Pin wants to merge 1 commit into
databricks:masterfrom
He-Pin:fix/std-char-surrogate

Conversation

@He-Pin

@He-Pin He-Pin commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

  • std.char() now rejects surrogate codepoints (0xD800-0xDFFF) with "Invalid unicode code point" error instead of producing unpaired surrogates
  • Updated UnicodeHandlingTests to expect error behavior for surrogate inputs
  • Added golden tests for valid boundary codepoints and error behavior on surrogates

Behavior comparison across implementations

Note: The three reference implementations diverge on surrogate handling. This fix aligns sjsonnet with jrsonnet (reject with error), since surrogates are not valid Unicode codepoints per the Unicode spec.

Expression go-jsonnet v0.22.0 jsonnet-cpp v0.22.0 jrsonnet v0.5.0-pre99 sjsonnet (before) sjsonnet (after)
std.codepoint(std.char(0xD800)) 65533 (U+FFFD) 55296 (raw) ERROR 55296 ❌ ERROR ✅
std.codepoint(std.char(0xDC00)) 65533 56320 ERROR 56320 ❌ ERROR ✅
std.codepoint(std.char(0xDFFF)) 65533 57343 ERROR 57343 ❌ ERROR ✅
std.codepoint(std.char(0xD7FF)) 55295 55295 55295 55295 ✅ 55295 ✅
std.codepoint(std.char(0xE000)) 57344 57344 57344 57344 ✅ 57344 ✅
std.codepoint(std.char(0xFFFD)) 65533 65533 65533 65533 ✅ 65533 ✅
std.char(-1) ERROR ERROR ERROR ERROR ✅ ERROR ✅
std.char(0x110000) ERROR ERROR ERROR ERROR ✅ ERROR ✅

Three implementation strategies:

  • jrsonnet: rejects surrogates with error — our choice, follows Unicode spec
  • go-jsonnet: replaces surrogates with U+FFFD (silent data transformation)
  • jsonnet-cpp: preserves raw surrogates (produces invalid Unicode output)

Why reject instead of replace?

  1. Surrogates are not valid Unicode codepoints — U+D800–U+DFFF are reserved for UTF-16 encoding
  2. Silent replacement masks bugs — user code that accidentally produces a surrogate gets no feedback
  3. Consistency with sjsonnet's own std.char(0x110000) behavior, which already errors on out-of-range codepoints

Modification

  • In Char_.evalRhs (StringModule.scala), extended the invalid codepoint check to include the surrogate range 0xD800–0xDFFF
  • Updated UnicodeHandlingTests: renamed stdCharReplacesSurrogatesstdCharRejectsSurrogates, using evalErr for all surrogate inputs
  • Replaced char_surrogate_replacement golden test with:
    • char_surrogate_boundary — valid codepoints near surrogate range (0xD7FF, 0xE000, 0xFFFD, 0x0000, 0x0041)
    • error.char_surrogate — confirms error on surrogate input

Result

std.char(0xD800) now raises "Invalid unicode code point, got 55296" instead of producing an unpaired surrogate. Valid codepoints adjacent to the surrogate range are unaffected. Full test suite passes.

Test plan

  • char_surrogate_boundary.jsonnet — valid boundary codepoints roundtrip correctly
  • error.char_surrogate.jsonnet — surrogate input raises error
  • UnicodeHandlingTests.stdCharRejectsSurrogates — unit tests for high/low/max surrogates
  • Full test suite passes (mill sjsonnet.jvm.3_3_7.test)

References

  • Unicode spec: U+D800–U+DFFF are reserved, not assignable codepoints
  • jrsonnet: rejects surrogates with "invalid unicode codepoint" error
  • go-jsonnet: replaces surrogates with U+FFFD (different strategy, not our target)
  • jsonnet-cpp: preserves raw surrogates (different strategy, not our target)

@He-Pin He-Pin marked this pull request as draft June 18, 2026 05:23
Motivation:
std.char() accepted surrogate codepoints (0xD800-0xDFFF) and produced
unpaired surrogate characters in strings. The three reference
implementations diverge on this: go-jsonnet replaces with U+FFFD,
jsonnet-cpp preserves raw surrogates, and jrsonnet rejects with an
error. Surrogates are not valid Unicode codepoints per the Unicode
spec, so sjsonnet should reject them — matching jrsonnet's approach
and sjsonnet's own behavior for out-of-range codepoints like 0x110000.

Modification:
- In Char_.evalRhs, extend the invalid codepoint check to include the
  surrogate range 0xD800-0xDFFF, raising "Invalid unicode code point"
- Updated UnicodeHandlingTests: renamed stdCharReplacesSurrogates to
  stdCharRejectsSurrogates, using evalErr for all surrogate inputs
- Replaced char_surrogate_replacement golden test with
  char_surrogate_boundary (valid boundary values) and
  error.char_surrogate (surrogate rejection)

Result:
std.char(0xD800) now raises "Invalid unicode code point, got 55296"
instead of producing an unpaired surrogate. Valid codepoints adjacent
to the surrogate range (0xD7FF, 0xE000) are unaffected. This aligns
with jrsonnet and follows the Unicode specification.

References:
- jrsonnet: rejects surrogates with "invalid unicode codepoint" error
- go-jsonnet: replaces surrogates with U+FFFD (different strategy)
- jsonnet-cpp: preserves raw surrogates (different strategy)
- Unicode spec: U+D800-U+DFFF are reserved, not assignable codepoints
@He-Pin He-Pin force-pushed the fix/std-char-surrogate branch from cff0f8a to 9b6ad55 Compare June 18, 2026 05:28
@He-Pin He-Pin changed the title fix: std.char() replaces surrogate codepoints with U+FFFD fix: std.char() rejects surrogate codepoints with an error Jun 18, 2026
@He-Pin He-Pin marked this pull request as ready for review June 18, 2026 05:29
@He-Pin

He-Pin commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@CertainLach wdyt

@He-Pin He-Pin marked this pull request as draft June 18, 2026 05:31
@He-Pin

He-Pin commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Closing this PR — the original sjsonnet behavior was already correct per the Jsonnet specification.

Analysis

The Jsonnet spec states that std.char(n) and std.codepoint(str) are inverse functions:

  • std.codepoint(str): "Returns the positive integer representing the unicode codepoint of the character in the given single-character string. This function is the inverse of std.char(n)."
  • std.char(n): "Returns a string of length one whose only unicode codepoint has integer id n. This function is the inverse of std.codepoint(str)."

Verified original behavior (upstream/master):

Expression Result Inverse property
std.codepoint(std.char(55296)) 55296 n == n
std.codepoint(std.char(56320)) 56320 n == n
std.codepoint(std.char(65)) 65 n == n
std.char(std.codepoint("A")) "A" s == s

Cross-implementation comparison:

Implementation std.codepoint(std.char(0xD800)) Inverse property Strategy
sjsonnet (original) 55296 ✅ holds Preserve raw (same as jsonnet-cpp)
jsonnet-cpp v0.22.0 55296 ✅ holds Preserve raw
go-jsonnet v0.22.0 65533 (U+FFFD) ❌ broken Replace (Go runtime artifact)
jrsonnet v0.5.0-pre99 ERROR ❌ N/A Reject (Rust runtime artifact)

Conclusion

The original sjsonnet behavior matches jsonnet-cpp and satisfies the spec's inverse property requirement. The go-jsonnet (U+FFFD replacement) and jrsonnet (error) behaviors are artifacts of their respective language runtimes (string(rune) in Go replaces invalid runes; char::from_u32 in Rust rejects them), not deliberate design decisions.

The spec does not define surrogate handling — this is undefined behavior where implementations diverge. sjsonnet's original approach (preserve raw, matching jsonnet-cpp) is the only one that satisfies the spec's inverse constraint.

This was a false positive in the bug report. No code change needed.

@He-Pin He-Pin closed this Jun 18, 2026
@He-Pin He-Pin deleted the fix/std-char-surrogate branch June 18, 2026 05:40
@He-Pin

He-Pin commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

std.codepoint(str)
Available since version 0.10.0.

Returns the positive integer representing the unicode codepoint of the character in the given single-character string. This function is the inverse of std.char(n).

std.char(n)
Available since version 0.10.0.

Returns a string of length one whose only unicode codepoint has integer id n. This function is the inverse of std.codepoint(str).

@johnbartholomew FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant