Skip to content

feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760

Open
goel-skd wants to merge 5 commits into
apache:mainfrom
goel-skd:feat-613-unicode-lowercase
Open

feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760
goel-skd wants to merge 5 commits into
apache:mainfrom
goel-skd:feat-613-unicode-lowercase

Conversation

@goel-skd

@goel-skd goel-skd commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware
implementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).

  • ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case
    mapping (e.g. CAFÉcafé, GROẞEgroße). Invalid UTF-8 is
    returned unchanged rather than erroring.

  • EqualsIgnoreCase now compares the lowercased forms of both inputs, so it
    is case-insensitive for non-ASCII letters too.

  • ToUpper is intentionally left ASCII-only — it is not used for name
    matching.

  • utf8proc is wired into both the CMake (vendored via FetchContent / system
    package) and Meson (subprojects/utf8proc.wrap) builds.

    Testing

  • Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
    and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
    pass-through).

Closes #613.

Follow-up to #748

@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40
Replace the ASCII-only ToLower with utf8proc simple case mapping so
case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used
for name matching. EqualsIgnoreCase now compares lowercased forms.

Wire utf8proc into both the CMake (vendored/system) and Meson builds.

See apache#613.
@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13
Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

Comment thread src/iceberg/util/string_util.h Outdated
Comment thread src/iceberg/util/string_util.h Outdated
Comment thread src/iceberg/util/string_util.h
@goel-skd

Copy link
Copy Markdown
Contributor Author

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

Thanks much @wgtmac. I responded to your comments.

goel-skd added a commit to goel-skd/iceberg-cpp that referenced this pull request Jun 26, 2026
ToLower: note it uses Unicode simple (1:1) case mapping and document where
it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper:
spell out the ASCII-only behavior and why no Unicode variant is provided.
Also document EqualsIgnoreCase inheriting ToLower's mapping.

Addresses API review comments on apache#760.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ToLower: note it uses Unicode simple (1:1) case mapping and document where
it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper:
spell out the ASCII-only behavior and why no Unicode variant is provided.
Also document EqualsIgnoreCase inheriting ToLower's mapping.

Addresses API review comments on apache#760.
@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from c692240 to 9ffebc3 Compare June 26, 2026 00:56
@goel-skd goel-skd requested a review from wgtmac June 26, 2026 01:11
Comment thread src/iceberg/util/string_util.h
goel-skd added 2 commits June 27, 2026 11:27
The byte-slice in StartsWithIgnoreCase (str.substr(0, prefix.size()) before
lowercasing) is wrong when ToLower changes byte length: "İ" (U+0130) is two
bytes but lower-cases to "i", so "İx" should match prefix "i" but does not.
This test pins that behavior; it fails against the current implementation and
is fixed by the following commit.

Relates to apache#760.
Compare the ToLower forms of both inputs instead of byte-slicing str to
prefix.size() before lowercasing. ToLower can change a string's byte length
(e.g. "İ" U+0130 lower-cases to "i"), so the old slice could split a code
point or wrongly reject a valid match. Makes the regression test from the
previous commit pass.

Relates to apache#760.
@goel-skd goel-skd requested a review from wgtmac June 27, 2026 15:33
// NOTE: These convert ASCII letters only; all other bytes, including non-ASCII
// (multibyte UTF-8) bytes, are passed through unchanged.
// See https://github.com/apache/iceberg-cpp/issues/613.
static std::string ToLower(std::string_view str) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are quite a few places where ToLower is only used for ASCII strings. Should we switch those to use ToUpper? Or should we leave ToLower as it is and introduce a new ToLowerUTF8 instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @zhjwpku. A few thoughts:

On switching those to ToUpper: it probably isn't a straight swap. The case-insensitive field maps in schema.cc/type.cc are keyed by lowercase (to match Java's toLowerCase(Locale.ROOT)), so flipping to upper means changing comparison targets everywhere, and the field-name path needs unicode anyway (ToUpper is ASCII-only). It also puts the "is this ASCII?" decision back on every caller.

On ToLowerUTF8: we discussed the Ascii/Utf8 split earlier in this PR with @wgtmac and landed on keeping a single ToUpper/ToLower rather than ToXXXAscii/ToXXXUtf8.

I think we can get the speedup without splitting the API by adding an ASCII fast-path inside ToLower, i.e. scan once for any byte >= 0x80, if the input is pure ASCII, do the cheap ASCII lowercase and skip utf8proc, otherwise fall back. One function, no caller changes, still unicode-correct, and the ASCII majority gets the fast path for free.

Since it's a perf optimization and not correctness, I'd suggest doing it as a small follow-up so it doesn't hold up this PR. WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern I still have is that the current ASCII-only behavior was a reasonable trade-off for most existing call sites: metrics modes, UUIDs, HTTP header/property names, enum-like strings, etc. are overwhelmingly ASCII. Moving ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase to always decode UTF-8 and allocate lower-cased strings means the common path pays for Unicode support even when it is not needed.

This feels a bit unfortunate for a minor/non-ASCII case. In SQL engines we often keep ASCII-only fast paths for identifier/string handling, because most real-world names are ASCII and these helpers can sit on hot-ish paths.

Could we think about preserving the fast path here? For example:

  • keep the existing byte-wise ASCII path when both inputs are ASCII;
  • only fall back to utf8proc when non-ASCII bytes are actually present;
  • make EqualsIgnoreCase streaming/short-circuiting instead of materializing two lower-cased strings;
  • avoid lower-casing the whole str in StartsWithIgnoreCase when only the prefix needs to be compared.

I’m supportive of improving Unicode behavior, but I’d prefer not to regress the common ASCII path globally if we can avoid it.

/// above) maps to U+0069 ("i") here, but to U+0069 U+0307 ("i" + combining dot above)
/// in Java. For ASCII and the large majority of letters the two agree.
///
/// Invalid UTF-8 input is returned unchanged.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using Result<std::string> so we can return an error for invalid UTF8 input.

return false;
}
return EqualsIgnoreCase(str.substr(0, prefix.size()), prefix);
return ToLower(str).starts_with(ToLower(prefix));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

static bool EqualsIgnoreCase(std::string_view lhs, std::string_view rhs) {
return std::ranges::equal(
lhs, rhs, [](char lc, char rc) { return ToLowerAscii(lc) == ToLowerAscii(rc); });
return ToLower(lhs) == ToLower(rhs);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since ToLower returns as is for invalid UTF8 input, this line is a UB instead.

// NOTE: These convert ASCII letters only; all other bytes, including non-ASCII
// (multibyte UTF-8) bytes, are passed through unchanged.
// See https://github.com/apache/iceberg-cpp/issues/613.
static std::string ToLower(std::string_view str) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern I still have is that the current ASCII-only behavior was a reasonable trade-off for most existing call sites: metrics modes, UUIDs, HTTP header/property names, enum-like strings, etc. are overwhelmingly ASCII. Moving ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase to always decode UTF-8 and allocate lower-cased strings means the common path pays for Unicode support even when it is not needed.

This feels a bit unfortunate for a minor/non-ASCII case. In SQL engines we often keep ASCII-only fast paths for identifier/string handling, because most real-world names are ASCII and these helpers can sit on hot-ish paths.

Could we think about preserving the fast path here? For example:

  • keep the existing byte-wise ASCII path when both inputs are ASCII;
  • only fall back to utf8proc when non-ASCII bytes are actually present;
  • make EqualsIgnoreCase streaming/short-circuiting instead of materializing two lower-cased strings;
  • avoid lower-casing the whole str in StartsWithIgnoreCase when only the prefix needs to be compared.

I’m supportive of improving Unicode behavior, but I’d prefer not to regress the common ASCII path globally if we can avoid it.

/// Intended for case-insensitive name matching, similar to Iceberg Java's
/// toLowerCase(Locale.ROOT). The mapping is locale-independent, matching the intent
/// of Locale.ROOT. It uses simple (1:1) case mapping rather than Java's full case
/// mapping, so results differ for a few code points; e.g. U+0130 (capital I with dot

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they don't match, how can iceberg-cpp read identifier path containing such characters written by java?

/// in Java. For ASCII and the large majority of letters the two agree.
///
/// Invalid UTF-8 input is returned unchanged.
/// See https://github.com/apache/iceberg-cpp/issues/613.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this comment means.

@goel-skd

Copy link
Copy Markdown
Contributor Author

Thanks for all the review here. Since this is a shared util it's picked up a bunch of good ideas, but some of them expand the scope, so I want to draw a line on what lands here vs. what we track separately.

What I'll do in this PR:

  • Add ASCII fast-paths to ToLower, EqualsIgnoreCase, and StartsWithIgnoreCase. If the inputs are pure ASCII (no byte >= 0x80) we just do the cheap byte-wise thing with an early-out and no allocation, and only fall back to utf8proc when there's actually a non-ASCII byte. That keeps the common case (modes, UUIDs, header/property names, enum-ish strings) as fast as it was before, which is the regression @wgtmac and @zhjwpku registered concerns about.
  • Reword the ToLower doc comment so the behavior and the small divergence from Java are actually clear (@manuzhang's point).

On the simple vs. full case mapping question (@manuzhang): reading Java-written tables isn't affected, since field names are stored verbatim in metadata and not lower-cased. The simple/full difference only shows up in case-insensitive name resolution, and only for a few code points like U+0130. For ASCII and the vast majority of letters they're identical. Matching Java bit-for-bit on those edge cases is a bigger change and I'd rather not block this PR on it.

Two things I'd punt to follow-up issues:

  • Having ToLower return Result<std::string> so it can reject invalid UTF-8 instead of passing it through. That changes the contract for every current caller, so it deserves its own PR.
  • Bit-for-bit parity with Java's toLowerCase(Locale.ROOT) on the code points where Java does full case mapping (e.g. U+0130 İ -> ). We're using simple 1:1 lowercasing here, which was a deliberate non-goal in the design doc: it's what makes ß/GROẞE match Java, whereas casefold would diverge there. So the remaining full-mapping cases are worth tracking on their own.

Does that split sound reasonable? If so I'll push the fast-path + doc changes and open the two follow-ups.

@manuzhang

manuzhang commented Jun 30, 2026

Copy link
Copy Markdown
Member

@goel-skd Please update the PR description (supposing this no longer fully fixes #613 now) and open a new issue with your design and plan, and link all your follow-up PRs there.

@wgtmac

wgtmac commented Jun 30, 2026

Copy link
Copy Markdown
Member

Thanks @goel-skd for your effort! I think 1:1 Java mapping is difficult to achieve so it's fine to be out of this PR's scope. However, it's correctness issue for invalid UTF-8 so it would be better to address them altogether in this PR. WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

case insensitive field matching behavior different from iceberg-python and iceberg-java

4 participants