feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760
feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760goel-skd wants to merge 5 commits into
Conversation
bec8884 to
69cc006
Compare
Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.
69cc006 to
f42e2da
Compare
wgtmac
left a comment
There was a problem hiding this comment.
I haven't checked the implementation and test yet. Just post some thoughts around APIs.
Thanks much @wgtmac. I responded to your comments. |
ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760.
c692240 to
9ffebc3
Compare
The byte-slice in StartsWithIgnoreCase (str.substr(0, prefix.size()) before lowercasing) is wrong when ToLower changes byte length: "İ" (U+0130) is two bytes but lower-cases to "i", so "İx" should match prefix "i" but does not. This test pins that behavior; it fails against the current implementation and is fixed by the following commit. Relates to apache#760.
Compare the ToLower forms of both inputs instead of byte-slicing str to prefix.size() before lowercasing. ToLower can change a string's byte length (e.g. "İ" U+0130 lower-cases to "i"), so the old slice could split a code point or wrongly reject a valid match. Makes the regression test from the previous commit pass. Relates to apache#760.
| // NOTE: These convert ASCII letters only; all other bytes, including non-ASCII | ||
| // (multibyte UTF-8) bytes, are passed through unchanged. | ||
| // See https://github.com/apache/iceberg-cpp/issues/613. | ||
| static std::string ToLower(std::string_view str) { |
There was a problem hiding this comment.
I think there are quite a few places where ToLower is only used for ASCII strings. Should we switch those to use ToUpper? Or should we leave ToLower as it is and introduce a new ToLowerUTF8 instead?
There was a problem hiding this comment.
Thanks for the review @zhjwpku. A few thoughts:
On switching those to ToUpper: it probably isn't a straight swap. The case-insensitive field maps in schema.cc/type.cc are keyed by lowercase (to match Java's toLowerCase(Locale.ROOT)), so flipping to upper means changing comparison targets everywhere, and the field-name path needs unicode anyway (ToUpper is ASCII-only). It also puts the "is this ASCII?" decision back on every caller.
On ToLowerUTF8: we discussed the Ascii/Utf8 split earlier in this PR with @wgtmac and landed on keeping a single ToUpper/ToLower rather than ToXXXAscii/ToXXXUtf8.
I think we can get the speedup without splitting the API by adding an ASCII fast-path inside ToLower, i.e. scan once for any byte >= 0x80, if the input is pure ASCII, do the cheap ASCII lowercase and skip utf8proc, otherwise fall back. One function, no caller changes, still unicode-correct, and the ASCII majority gets the fast path for free.
Since it's a perf optimization and not correctness, I'd suggest doing it as a small follow-up so it doesn't hold up this PR. WDYT?
There was a problem hiding this comment.
One concern I still have is that the current ASCII-only behavior was a reasonable trade-off for most existing call sites: metrics modes, UUIDs, HTTP header/property names, enum-like strings, etc. are overwhelmingly ASCII. Moving ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase to always decode UTF-8 and allocate lower-cased strings means the common path pays for Unicode support even when it is not needed.
This feels a bit unfortunate for a minor/non-ASCII case. In SQL engines we often keep ASCII-only fast paths for identifier/string handling, because most real-world names are ASCII and these helpers can sit on hot-ish paths.
Could we think about preserving the fast path here? For example:
- keep the existing byte-wise ASCII path when both inputs are ASCII;
- only fall back to utf8proc when non-ASCII bytes are actually present;
- make
EqualsIgnoreCasestreaming/short-circuiting instead of materializing two lower-cased strings; - avoid lower-casing the whole
strinStartsWithIgnoreCasewhen only the prefix needs to be compared.
I’m supportive of improving Unicode behavior, but I’d prefer not to regress the common ASCII path globally if we can avoid it.
| /// above) maps to U+0069 ("i") here, but to U+0069 U+0307 ("i" + combining dot above) | ||
| /// in Java. For ASCII and the large majority of letters the two agree. | ||
| /// | ||
| /// Invalid UTF-8 input is returned unchanged. |
There was a problem hiding this comment.
Why not using Result<std::string> so we can return an error for invalid UTF8 input.
| return false; | ||
| } | ||
| return EqualsIgnoreCase(str.substr(0, prefix.size()), prefix); | ||
| return ToLower(str).starts_with(ToLower(prefix)); |
| static bool EqualsIgnoreCase(std::string_view lhs, std::string_view rhs) { | ||
| return std::ranges::equal( | ||
| lhs, rhs, [](char lc, char rc) { return ToLowerAscii(lc) == ToLowerAscii(rc); }); | ||
| return ToLower(lhs) == ToLower(rhs); |
There was a problem hiding this comment.
Since ToLower returns as is for invalid UTF8 input, this line is a UB instead.
| // NOTE: These convert ASCII letters only; all other bytes, including non-ASCII | ||
| // (multibyte UTF-8) bytes, are passed through unchanged. | ||
| // See https://github.com/apache/iceberg-cpp/issues/613. | ||
| static std::string ToLower(std::string_view str) { |
There was a problem hiding this comment.
One concern I still have is that the current ASCII-only behavior was a reasonable trade-off for most existing call sites: metrics modes, UUIDs, HTTP header/property names, enum-like strings, etc. are overwhelmingly ASCII. Moving ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase to always decode UTF-8 and allocate lower-cased strings means the common path pays for Unicode support even when it is not needed.
This feels a bit unfortunate for a minor/non-ASCII case. In SQL engines we often keep ASCII-only fast paths for identifier/string handling, because most real-world names are ASCII and these helpers can sit on hot-ish paths.
Could we think about preserving the fast path here? For example:
- keep the existing byte-wise ASCII path when both inputs are ASCII;
- only fall back to utf8proc when non-ASCII bytes are actually present;
- make
EqualsIgnoreCasestreaming/short-circuiting instead of materializing two lower-cased strings; - avoid lower-casing the whole
strinStartsWithIgnoreCasewhen only the prefix needs to be compared.
I’m supportive of improving Unicode behavior, but I’d prefer not to regress the common ASCII path globally if we can avoid it.
| /// Intended for case-insensitive name matching, similar to Iceberg Java's | ||
| /// toLowerCase(Locale.ROOT). The mapping is locale-independent, matching the intent | ||
| /// of Locale.ROOT. It uses simple (1:1) case mapping rather than Java's full case | ||
| /// mapping, so results differ for a few code points; e.g. U+0130 (capital I with dot |
There was a problem hiding this comment.
If they don't match, how can iceberg-cpp read identifier path containing such characters written by java?
| /// in Java. For ASCII and the large majority of letters the two agree. | ||
| /// | ||
| /// Invalid UTF-8 input is returned unchanged. | ||
| /// See https://github.com/apache/iceberg-cpp/issues/613. |
There was a problem hiding this comment.
Not sure what this comment means.
|
Thanks for all the review here. Since this is a shared util it's picked up a bunch of good ideas, but some of them expand the scope, so I want to draw a line on what lands here vs. what we track separately. What I'll do in this PR:
On the simple vs. full case mapping question (@manuzhang): reading Java-written tables isn't affected, since field names are stored verbatim in metadata and not lower-cased. The simple/full difference only shows up in case-insensitive name resolution, and only for a few code points like U+0130. For ASCII and the vast majority of letters they're identical. Matching Java bit-for-bit on those edge cases is a bigger change and I'd rather not block this PR on it. Two things I'd punt to follow-up issues:
Does that split sound reasonable? If so I'll push the fast-path + doc changes and open the two follow-ups. |
|
Thanks @goel-skd for your effort! I think 1:1 Java mapping is difficult to achieve so it's fine to be out of this PR's scope. However, it's correctness issue for invalid UTF-8 so it would be better to address them altogether in this PR. WDYT? |
Replaces the ASCII-only
StringUtils::ToLowerwith a Unicode-awareimplementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).ToLowernow lower-cases UTF-8 input using utf8proc simple (1:1) casemapping (e.g.
CAFÉ→café,GROẞE→große). Invalid UTF-8 isreturned unchanged rather than erroring.
EqualsIgnoreCasenow compares the lowercased forms of both inputs, so itis case-insensitive for non-ASCII letters too.
ToUpperis intentionally left ASCII-only — it is not used for namematching.
utf8proc is wired into both the CMake (vendored via FetchContent / system
package) and Meson (
subprojects/utf8proc.wrap) builds.Testing
Added/updated
string_util_test.cc:ToLowerUnicode,ToUpperAsciiOnly,and Unicode cases in
EqualsIgnoreCase(including invalid-UTF-8pass-through).
Closes #613.
Follow-up to #748