feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) by goel-skd · Pull Request #760 · apache/iceberg-cpp

goel-skd · 2026-06-19T01:30:41Z

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware implementation backed by utf8proc, so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT).

- ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is returned unchanged rather than erroring.
- EqualsIgnoreCase now compares the lowercased forms of both inputs, so it is case-insensitive for non-ASCII letters too.
- ToUpper is intentionally left ASCII-only; it is not used for name matching.
- utf8proc is wired into both the CMake (vendored via FetchContent / system package) and Meson (subprojects/utf8proc.wrap) builds.
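
For readers without the diff at hand, the behavior described above can be sketched roughly as follows. This is not the PR's implementation: it does not call utf8proc, and `SimpleLower`, `DecodeUtf8`, `AppendUtf8`, `ToLowerSketch`, and `EqualsIgnoreCaseSketch` are hypothetical names, with a tiny hand-written table standing in for the library's simple case mapping. It only illustrates the decode, map 1:1, re-encode shape and the invalid-UTF-8 pass-through.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for utf8proc's simple lowercase mapping. It covers
// ASCII plus the three code points mentioned in this PR, nothing more.
inline char32_t SimpleLower(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
  switch (cp) {
    case 0x00C9: return 0x00E9;  // É -> é
    case 0x1E9E: return 0x00DF;  // ẞ -> ß
    case 0x0130: return 0x0069;  // İ -> i
    default: return cp;
  }
}

// Decodes UTF-8 into code points; returns nullopt if the input is invalid
// (bad lead byte, truncated sequence, overlong form, surrogate, out of range).
inline std::optional<std::u32string> DecodeUtf8(std::string_view s) {
  static constexpr char32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000};
  std::u32string out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b < 0x80) { len = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else return std::nullopt;
    if (i + len > s.size()) return std::nullopt;
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return std::nullopt;
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < kMin[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return std::nullopt;
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Appends the UTF-8 encoding of a valid code point to `out`.
inline void AppendUtf8(char32_t cp, std::string& out) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Lowercases code point by code point; invalid UTF-8 is returned unchanged.
inline std::string ToLowerSketch(std::string_view s) {
  const auto cps = DecodeUtf8(s);
  if (!cps) return std::string(s);
  std::string out;
  for (char32_t cp : *cps) AppendUtf8(SimpleLower(cp), out);
  return out;
}

// Compares the lowercased forms of both inputs.
inline bool EqualsIgnoreCaseSketch(std::string_view a, std::string_view b) {
  return ToLowerSketch(a) == ToLowerSketch(b);
}

// Usage: the examples from the description, written as UTF-8 byte escapes.
inline void CheckToLowerExamples() {
  assert(ToLowerSketch("CAF\xC3\x89") == "caf\xC3\xA9");                // CAFÉ -> café
  assert(ToLowerSketch("GRO\xE1\xBA\x9E" "E") == "gro\xC3\x9F" "e");    // GROẞE -> große
  assert(ToLowerSketch("AB\xC3") == "AB\xC3");  // truncated sequence: unchanged
}
```

Note that the output can be shorter in bytes than the input (ẞ is three bytes, ß is two), which matters for any caller that slices by byte length before lowercasing.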

Testing
Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
pass-through).

Closes #613.

Follow-up to #748

Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.

wgtmac

I haven't checked the implementation and tests yet; just posting some thoughts on the APIs.

goel-skd · 2026-06-24T04:11:00Z

> I haven't checked the implementation and tests yet; just posting some thoughts on the APIs.

Thanks very much, @wgtmac. I've responded to your comments.

ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wgtmac · 2026-06-26T09:24:25Z

+    return ToLower(lhs) == ToLower(rhs);
  }

  static bool StartsWithIgnoreCase(std::string_view str, std::string_view prefix) {


Now that EqualsIgnoreCase inherits ToLower's Unicode simple-mapping behavior, this helper should not slice str by prefix.size() before lowercasing. ToLower can change UTF-8 byte length; for example, İ is two bytes but maps to i here. So StartsWithIgnoreCase("\xC4\xB0x", "i") currently slices an invalid one-byte prefix and returns false, even though the lowercased string starts with the lowercased prefix.

Could we compare ToLower(str).starts_with(ToLower(prefix)) or otherwise compare by decoded code point, and add a test for this case?
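
A minimal sketch of the failure mode and the first suggested fix, for illustration only. `LowerSketch` is a hypothetical stand-in for ToLower that handles just ASCII and the byte pair C4 B0 (İ), and `StartsWithSliced` is modeled on the slicing behavior described in this comment rather than copied from the actual helper, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for ToLower: lowercases ASCII and maps the two-byte
// sequence C4 B0 (İ, U+0130) to 'i'. A lone C4 byte is left unchanged, which
// mimics the invalid-UTF-8 pass-through.
inline std::string LowerSketch(std::string_view s) {
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (i + 1 < s.size() && s[i] == '\xC4' && s[i + 1] == '\xB0') {
      out.push_back('i');
      ++i;  // consume the continuation byte as well
    } else if (s[i] >= 'A' && s[i] <= 'Z') {
      out.push_back(static_cast<char>(s[i] + 0x20));
    } else {
      out.push_back(s[i]);
    }
  }
  return out;
}

// Assumed shape of the problematic helper: slice by prefix byte length first,
// then compare lowercased forms.
inline bool StartsWithSliced(std::string_view str, std::string_view prefix) {
  return str.size() >= prefix.size() &&
         LowerSketch(str.substr(0, prefix.size())) == LowerSketch(prefix);
}

// Suggested alternative: lowercase both sides first, then test the prefix.
// With C++20 this is lowered.starts_with(lowered_prefix).
inline bool StartsWithLowered(std::string_view str, std::string_view prefix) {
  const std::string lowered = LowerSketch(str);
  const std::string lowered_prefix = LowerSketch(prefix);
  return lowered.substr(0, lowered_prefix.size()) == lowered_prefix;
}

// Usage: the case from the review comment.
inline void CheckPrefixExamples() {
  assert(!StartsWithSliced("\xC4\xB0x", "i"));   // slices a lone C4 byte
  assert(StartsWithLowered("\xC4\xB0x", "i"));   // "ix" starts with "i"
}
```

Lowercasing the whole of `str` costs an allocation proportional to its length; comparing decoded code points one at a time would avoid that, at the price of a slightly longer helper.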

goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40

goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13

wgtmac reviewed Jun 19, 2026

Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

Add license info to LICENSE

b8639d6

wgtmac requested changes Jun 21, 2026

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h Outdated

Comment thread src/iceberg/util/string_util.h

goel-skd force-pushed the feat-613-unicode-lowercase branch from c692240 to 9ffebc3 Compare June 26, 2026 00:56

goel-skd requested a review from wgtmac June 26, 2026 01:11

wgtmac reviewed Jun 26, 2026

goel-skd wants to merge 3 commits into apache:main from goel-skd:feat-613-unicode-lowercase

2 participants
