feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) by goel-skd · Pull Request #760 · apache/iceberg-cpp

goel-skd · 2026-06-19T01:30:41Z

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware
implementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).

ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case
mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is
returned unchanged rather than erroring.
EqualsIgnoreCase now compares the lowercased forms of both inputs, so it
is case-insensitive for non-ASCII letters too.
ToUpper is intentionally left ASCII-only — it is not used for name
matching.
utf8proc is wired into both the CMake (vendored via FetchContent / system
package) and Meson (subprojects/utf8proc.wrap) builds.

Testing
Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
pass-through).

Closes #613.

Follow-up to #748

Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.

wgtmac

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

goel-skd · 2026-06-24T04:11:00Z

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

Thanks much @wgtmac. I responded to your comments.

ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760.

The byte-slice in StartsWithIgnoreCase (str.substr(0, prefix.size()) before lowercasing) is wrong when ToLower changes byte length: "İ" (U+0130) is two bytes but lower-cases to "i", so "İx" should match prefix "i" but does not. This test pins that behavior; it fails against the current implementation and is fixed by the following commit. Relates to apache#760.

Compare the ToLower forms of both inputs instead of byte-slicing str to prefix.size() before lowercasing. ToLower can change a string's byte length (e.g. "İ" U+0130 lower-cases to "i"), so the old slice could split a code point or wrongly reject a valid match. Makes the regression test from the previous commit pass. Relates to apache#760.

zhjwpku · 2026-06-28T11:19:27Z

-  // NOTE: These convert ASCII letters only; all other bytes, including non-ASCII
-  // (multibyte UTF-8) bytes, are passed through unchanged.
-  // See https://github.com/apache/iceberg-cpp/issues/613.
-  static std::string ToLower(std::string_view str) {


I think there are quite a few places where ToLower is only used for ASCII strings. Should we switch those to use ToUpper? Or should we leave ToLower as it is and introduce a new ToLowerUTF8 instead?

Thanks for the review @zhjwpku. A few thoughts:

On switching those to ToUpper: it probably isn't a straight swap. The case-insensitive field maps in schema.cc/type.cc are keyed by lowercase (to match Java's toLowerCase(Locale.ROOT)), so flipping to upper means changing comparison targets everywhere, and the field-name path needs unicode anyway (ToUpper is ASCII-only). It also puts the "is this ASCII?" decision back on every caller.

On ToLowerUTF8: we discussed the Ascii/Utf8 split earlier in this PR with @wgtmac and landed on keeping a single ToUpper/ToLower rather than ToXXXAscii/ToXXXUtf8.

I think we can get the speedup without splitting the API by adding an ASCII fast-path inside ToLower, i.e. scan once for any byte >= 0x80, if the input is pure ASCII, do the cheap ASCII lowercase and skip utf8proc, otherwise fall back. One function, no caller changes, still unicode-correct, and the ASCII majority gets the fast path for free.

Since it's a perf optimization and not correctness, I'd suggest doing it as a small follow-up so it doesn't hold up this PR. WDYT?

One concern I still have is that the current ASCII-only behavior was a reasonable trade-off for most existing call sites: metrics modes, UUIDs, HTTP header/property names, enum-like strings, etc. are overwhelmingly ASCII. Moving ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase to always decode UTF-8 and allocate lower-cased strings means the common path pays for Unicode support even when it is not needed.

This feels a bit unfortunate for a minor/non-ASCII case. In SQL engines we often keep ASCII-only fast paths for identifier/string handling, because most real-world names are ASCII and these helpers can sit on hot-ish paths.

Could we think about preserving the fast path here? For example:

keep the existing byte-wise ASCII path when both inputs are ASCII;

only fall back to utf8proc when non-ASCII bytes are actually present;

make EqualsIgnoreCase streaming/short-circuiting instead of materializing two lower-cased strings;

avoid lower-casing the whole str in StartsWithIgnoreCase when only the prefix needs to be compared.

I’m supportive of improving Unicode behavior, but I’d prefer not to regress the common ASCII path globally if we can avoid it.

wgtmac · 2026-06-29T08:59:13Z

+  /// above) maps to U+0069 ("i") here, but to U+0069 U+0307 ("i" + combining dot above)
+  /// in Java. For ASCII and the large majority of letters the two agree.
+  ///
+  /// Invalid UTF-8 input is returned unchanged.


Why not using Result<std::string> so we can return an error for invalid UTF8 input.

wgtmac · 2026-06-29T09:47:26Z

-      return false;
-    }
-    return EqualsIgnoreCase(str.substr(0, prefix.size()), prefix);
+    return ToLower(str).starts_with(ToLower(prefix));


wgtmac · 2026-06-29T09:48:03Z

  static bool EqualsIgnoreCase(std::string_view lhs, std::string_view rhs) {
-    return std::ranges::equal(
-        lhs, rhs, [](char lc, char rc) { return ToLowerAscii(lc) == ToLowerAscii(rc); });
+    return ToLower(lhs) == ToLower(rhs);


Since ToLower returns as is for invalid UTF8 input, this line is a UB instead.

wgtmac · 2026-06-29T09:55:59Z

-  // NOTE: These convert ASCII letters only; all other bytes, including non-ASCII
-  // (multibyte UTF-8) bytes, are passed through unchanged.
-  // See https://github.com/apache/iceberg-cpp/issues/613.
-  static std::string ToLower(std::string_view str) {


One concern I still have is that the current ASCII-only behavior was a reasonable trade-off for most existing call sites: metrics modes, UUIDs, HTTP header/property names, enum-like strings, etc. are overwhelmingly ASCII. Moving ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase to always decode UTF-8 and allocate lower-cased strings means the common path pays for Unicode support even when it is not needed.

This feels a bit unfortunate for a minor/non-ASCII case. In SQL engines we often keep ASCII-only fast paths for identifier/string handling, because most real-world names are ASCII and these helpers can sit on hot-ish paths.

Could we think about preserving the fast path here? For example:

keep the existing byte-wise ASCII path when both inputs are ASCII;

only fall back to utf8proc when non-ASCII bytes are actually present;

make EqualsIgnoreCase streaming/short-circuiting instead of materializing two lower-cased strings;

avoid lower-casing the whole str in StartsWithIgnoreCase when only the prefix needs to be compared.

I’m supportive of improving Unicode behavior, but I’d prefer not to regress the common ASCII path globally if we can avoid it.

manuzhang · 2026-06-29T11:30:05Z

+  /// Intended for case-insensitive name matching, similar to Iceberg Java's
+  /// toLowerCase(Locale.ROOT). The mapping is locale-independent, matching the intent
+  /// of Locale.ROOT. It uses simple (1:1) case mapping rather than Java's full case
+  /// mapping, so results differ for a few code points; e.g. U+0130 (capital I with dot


If they don't match, how can iceberg-cpp read identifier path containing such characters written by java?

manuzhang · 2026-06-29T11:31:13Z

+  /// in Java. For ASCII and the large majority of letters the two agree.
+  ///
+  /// Invalid UTF-8 input is returned unchanged.
+  /// See https://github.com/apache/iceberg-cpp/issues/613.


Not sure what this comment means.

goel-skd · 2026-06-29T23:16:49Z

Thanks for all the review here. Since this is a shared util it's picked up a bunch of good ideas, but some of them expand the scope, so I want to draw a line on what lands here vs. what we track separately.

What I'll do in this PR:

Add ASCII fast-paths to ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase. If the inputs are pure ASCII (no byte >= 0x80) we just do the cheap byte-wise thing with an early-out and no allocation, and only fall back to utf8proc when there's actually a non-ASCII byte. That keeps the common case (modes, UUIDs, header/property names, enum-ish strings) as fast as it was before, which is the regression @wgtmac and @zhjwpku registered concerns about.
Reword the ToLower doc comment so the behavior and the small divergence from Java are actually clear (@manuzhang's point).

On the simple vs. full case mapping question (@manuzhang): reading Java-written tables isn't affected, since field names are stored verbatim in metadata and not lower-cased. The simple/full difference only shows up in case-insensitive name resolution, and only for a few code points like U+0130. For ASCII and the vast majority of letters they're identical. Matching Java bit-for-bit on those edge cases is a bigger change and I'd rather not block this PR on it.

Two things I'd punt to follow-up issues:

Having ToLower return Result<std::string> so it can reject invalid UTF-8 instead of passing it through. That changes the contract for every current caller, so it deserves its own PR.
Bit-for-bit parity with Java's toLowerCase(Locale.ROOT) on the code points where Java does full case mapping (e.g. U+0130 İ -> i̇). We're using simple 1:1 lowercasing here, which was a deliberate non-goal in the design doc: it's what makes ß/GROẞE match Java, whereas casefold would diverge there. So the remaining full-mapping cases are worth tracking on their own.

Does that split sound reasonable? If so I'll push the fast-path + doc changes and open the two follow-ups.

manuzhang · 2026-06-30T02:04:35Z

@goel-skd Please update the PR description (supposing this no longer fully fixes #613 now) and open a new issue with your design and plan, and link all your follow-up PRs there.

wgtmac · 2026-06-30T03:10:31Z

Thanks @goel-skd for your effort! I think 1:1 Java mapping is difficult to achieve so it's fine to be out of this PR's scope. However, it's correctness issue for invalid UTF-8 so it would be better to address them altogether in this PR. WDYT?

goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40

goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13

wgtmac reviewed Jun 19, 2026

View reviewed changes

Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

Add license info to LICENSE

b8639d6

wgtmac requested changes Jun 21, 2026

View reviewed changes

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h

goel-skd force-pushed the feat-613-unicode-lowercase branch from c692240 to 9ffebc3 Compare June 26, 2026 00:56

goel-skd requested a review from wgtmac June 26, 2026 01:11

wgtmac reviewed Jun 26, 2026

View reviewed changes

Comment thread src/iceberg/util/string_util.h

goel-skd added 2 commits June 27, 2026 11:27

goel-skd requested a review from wgtmac June 27, 2026 15:33

zhjwpku reviewed Jun 28, 2026

View reviewed changes

wgtmac reviewed Jun 29, 2026

View reviewed changes

manuzhang reviewed Jun 29, 2026

View reviewed changes

Uh oh!

Conversation

goel-skd commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing

Uh oh!

Uh oh!

wgtmac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

goel-skd commented Jun 24, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

goel-skd commented Jun 29, 2026

Uh oh!

manuzhang commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wgtmac commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

goel-skd commented Jun 19, 2026 •

edited

Loading

manuzhang commented Jun 30, 2026 •

edited

Loading