feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) #760

Open
goel-skd wants to merge 3 commits into apache:main from goel-skd:feat-613-unicode-lowercase

Conversation

goel-skd commented Jun 19, 2026

Contributor

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware
implementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).

  • ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case
    mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is
    returned unchanged rather than erroring.

  • EqualsIgnoreCase now compares the lowercased forms of both inputs, so it
    is case-insensitive for non-ASCII letters too.

  • ToUpper is intentionally left ASCII-only — it is not used for name
    matching.

  • utf8proc is wired into both the CMake (vendored via FetchContent / system
    package) and Meson (subprojects/utf8proc.wrap) builds.
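
The simple-mapping and pass-through behavior listed above can be sketched without utf8proc. This is a minimal illustration under stated assumptions, not the PR's implementation: `SimpleLowerSketch` is a hypothetical three-entry stand-in for utf8proc's lowercase lookup, and `DecodeOne`, `AppendUtf8`, and `ToLowerSketch` are names invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for a simple (1:1) lowercase table.
// Covers ASCII plus three illustrative code points only.
inline char32_t SimpleLowerSketch(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
  switch (cp) {
    case 0x00C9: return 0x00E9;  // É -> é
    case 0x1E9E: return 0x00DF;  // ẞ -> ß
    case 0x0130: return 0x0069;  // İ -> i (simple mapping, one code point out)
    default: return cp;
  }
}

// Decodes one code point at s[i] and advances i. Returns nullopt on
// malformed input (bad lead byte, truncation, overlong form, surrogate,
// or value above U+10FFFF).
inline std::optional<char32_t> DecodeOne(std::string_view s, std::size_t& i) {
  const auto b0 = static_cast<unsigned char>(s[i]);
  int extra = 0;
  char32_t cp = 0;
  char32_t min = 0;
  if (b0 < 0x80) { cp = b0; }
  else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
  else return std::nullopt;
  if (s.size() - i <= static_cast<std::size_t>(extra)) return std::nullopt;
  for (int k = 1; k <= extra; ++k) {
    const auto b = static_cast<unsigned char>(s[i + k]);
    if ((b & 0xC0) != 0x80) return std::nullopt;
    cp = (cp << 6) | (b & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    return std::nullopt;
  }
  i += static_cast<std::size_t>(extra) + 1;
  return cp;
}

// Appends the UTF-8 encoding of cp to out.
inline void AppendUtf8(char32_t cp, std::string& out) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Lower-cases code point by code point. If any byte sequence is malformed,
// the original input is returned unchanged instead of reporting an error.
inline std::string ToLowerSketch(std::string_view input) {
  std::string out;
  out.reserve(input.size());
  for (std::size_t i = 0; i < input.size();) {
    const auto cp = DecodeOne(input, i);
    if (!cp) return std::string(input);
    AppendUtf8(SimpleLowerSketch(*cp), out);
  }
  return out;
}
```

Under these assumptions, `ToLowerSketch("CAF\xC3\x89")` yields `"caf\xC3\xA9"`, while input containing a stray `0xFF` byte comes back byte-for-byte unchanged, including any ASCII letters that precede the bad byte.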

Testing

  • Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
    and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
    pass-through).

Closes #613.

Follow-up to #748

goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 on June 19, 2026 01:40
Replace the ASCII-only ToLower with utf8proc simple case mapping so
case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used
for name matching. EqualsIgnoreCase now compares lowercased forms.

Wire utf8proc into both the CMake (vendored/system) and Meson builds.

See apache#613.
goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da on June 19, 2026 02:13
Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

wgtmac left a comment

Member

I haven't checked the implementation and tests yet; just posting some thoughts on the APIs.

Comment thread src/iceberg/util/string_util.h Outdated
Comment thread src/iceberg/util/string_util.h Outdated
Comment thread src/iceberg/util/string_util.h
@goel-skd

Contributor Author

I haven't checked the implementation and tests yet; just posting some thoughts on the APIs.

Thanks very much, @wgtmac. I've responded to your comments.

goel-skd added a commit to goel-skd/iceberg-cpp that referenced this pull request Jun 26, 2026
ToLower: note it uses Unicode simple (1:1) case mapping and document where
it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper:
spell out the ASCII-only behavior and why no Unicode variant is provided.
Also document EqualsIgnoreCase inheriting ToLower's mapping.

Addresses API review comments on apache#760.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
goel-skd force-pushed the feat-613-unicode-lowercase branch from c692240 to 9ffebc3 on June 26, 2026 00:56
goel-skd requested a review from wgtmac on June 26, 2026 01:11
return ToLower(lhs) == ToLower(rhs);
}

static bool StartsWithIgnoreCase(std::string_view str, std::string_view prefix) {

Member

Now that EqualsIgnoreCase inherits ToLower's Unicode simple-mapping behavior, this helper should not slice str by prefix.size() before lowercasing. ToLower can change UTF-8 byte length; for example, İ is two bytes but maps to i here. So StartsWithIgnoreCase("\xC4\xB0x", "i") currently slices an invalid one-byte prefix and returns false, even though the lowercased string starts with the lowercased prefix.

Could we compare ToLower(str).starts_with(ToLower(prefix)) or otherwise compare by decoded code point, and add a test for this case?
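
The byte-length problem described above can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the code under review: `ToyToLower` is a hypothetical stand-in that handles only ASCII and U+0130 (bytes `C4 B0`), `StartsWithIgnoreCaseSliced` models the slice-then-lowercase pattern the comment describes, and `StartsWithIgnoreCaseLowered` models the suggested lowercase-then-compare alternative.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Toy stand-in for a Unicode-aware ToLower: ASCII letters plus
// U+0130 (UTF-8 bytes C4 B0) -> 'i'. All other bytes are copied as-is.
inline std::string ToyToLower(std::string_view s) {
  std::string out;
  out.reserve(s.size());
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] == '\xC4' && i + 1 < s.size() && s[i + 1] == '\xB0') {
      out += 'i';  // two input bytes become one output byte
      ++i;
    } else if (s[i] >= 'A' && s[i] <= 'Z') {
      out += static_cast<char>(s[i] - 'A' + 'a');
    } else {
      out += s[i];
    }
  }
  return out;
}

// Slices str to prefix.size() bytes before lowercasing. The slice can cut
// a multi-byte character in half, so the comparison sees broken input.
inline bool StartsWithIgnoreCaseSliced(std::string_view str,
                                       std::string_view prefix) {
  if (str.size() < prefix.size()) return false;
  return ToyToLower(str.substr(0, prefix.size())) == ToyToLower(prefix);
}

// Lowercases both sides first, then checks the prefix on the results.
inline bool StartsWithIgnoreCaseLowered(std::string_view str,
                                        std::string_view prefix) {
  const std::string lowered_str = ToyToLower(str);
  const std::string lowered_prefix = ToyToLower(prefix);
  return lowered_str.compare(0, lowered_prefix.size(), lowered_prefix) == 0;
}
```

With the input from the comment, `StartsWithIgnoreCaseSliced("\xC4\xB0x", "i")` returns false because the one-byte slice is just `0xC4`, whereas `StartsWithIgnoreCaseLowered` returns true. The lowered variant allocates a lowercased copy of the whole string; comparing by decoded code point would avoid that at the cost of more code.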

Development

Successfully merging this pull request may close these issues.

case insensitive field matching behavior different from iceberg-python and iceberg-java
