feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) #760
goel-skd wants to merge 3 commits into
Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.
wgtmac left a comment
I haven't checked the implementation and tests yet. Just posting some thoughts on the APIs.
Thanks much @wgtmac. I responded to your comments.
ToLower: note it uses Unicode simple (1:1) case mapping and document where it diverges from Java's full toLowerCase(Locale.ROOT) (e.g. U+0130). ToUpper: spell out the ASCII-only behavior and why no Unicode variant is provided. Also document EqualsIgnoreCase inheriting ToLower's mapping. Addresses API review comments on apache#760. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
  return ToLower(lhs) == ToLower(rhs);
}

static bool StartsWithIgnoreCase(std::string_view str, std::string_view prefix) {
Now that EqualsIgnoreCase inherits ToLower's Unicode simple-mapping behavior, this helper should not slice str by prefix.size() before lowercasing. ToLower can change UTF-8 byte length; for example, İ is two bytes but maps to i here. So StartsWithIgnoreCase("\xC4\xB0x", "i") currently slices an invalid one-byte prefix and returns false, even though the lowercased string starts with the lowercased prefix.
Could we compare ToLower(str).starts_with(ToLower(prefix)) or otherwise compare by decoded code point, and add a test for this case?
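To make the failure mode concrete, here is a minimal sketch of the two approaches. `ToLowerStandIn` is a hypothetical stand-in for the PR's utf8proc-backed `ToLower` (it only knows ASCII and U+0130), and `StartsWithIgnoreCaseSliced` is reconstructed from the description above rather than copied from the source, so treat both as assumptions.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for the PR's ToLower. It handles only ASCII letters
// and U+0130 (UTF-8 bytes C4 B0), which maps to 'i' under Unicode simple case
// mapping. All other bytes are copied through unchanged.
std::string ToLowerStandIn(std::string_view s) {
  std::string out;
  out.reserve(s.size());
  for (std::size_t i = 0; i < s.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c == 0xC4 && i + 1 < s.size() &&
        static_cast<unsigned char>(s[i + 1]) == 0xB0) {
      out.push_back('i');  // two input bytes become one output byte
      ++i;
    } else if (c >= 'A' && c <= 'Z') {
      out.push_back(static_cast<char>(c - 'A' + 'a'));
    } else {
      out.push_back(static_cast<char>(c));
    }
  }
  return out;
}

// Assumed shape of the current helper: slice by prefix.size() bytes first,
// then lowercase. The slice can cut a multi-byte sequence in half.
bool StartsWithIgnoreCaseSliced(std::string_view str, std::string_view prefix) {
  if (str.size() < prefix.size()) return false;
  return ToLowerStandIn(str.substr(0, prefix.size())) == ToLowerStandIn(prefix);
}

// Suggested shape: lowercase both sides in full, then compare the leading
// bytes of the lowercased string against the lowercased prefix.
bool StartsWithIgnoreCaseLowered(std::string_view str, std::string_view prefix) {
  const std::string lowered_str = ToLowerStandIn(str);
  const std::string lowered_prefix = ToLowerStandIn(prefix);
  return std::string_view(lowered_str).substr(0, lowered_prefix.size()) ==
         lowered_prefix;
}
```

With these definitions, `StartsWithIgnoreCaseSliced("\xC4\xB0x", "i")` is false because the slice is the lone byte `\xC4`, while `StartsWithIgnoreCaseLowered("\xC4\xB0x", "i")` is true because the full string lowercases to `ix`. The lowered variant allocates two temporary strings per call; comparing by decoded code point would avoid that at the cost of more code.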
Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware implementation backed by utf8proc, so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT).

- ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is returned unchanged rather than erroring.
- EqualsIgnoreCase now compares the lowercased forms of both inputs, so it is case-insensitive for non-ASCII letters too.
- ToUpper is intentionally left ASCII-only; it is not used for name matching.
- utf8proc is wired into both the CMake (vendored via FetchContent / system package) and Meson (subprojects/utf8proc.wrap) builds.

Testing

Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly, and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8 pass-through).
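
The ToLower behavior described above (per-code-point simple mapping, with invalid UTF-8 passed through) can be sketched without any dependency as a decode, map, encode loop. This is not the PR's implementation: `SimpleLowerSketch` is a hypothetical stand-in for utf8proc's case mapping that covers only ASCII and three code points, and the sketch assumes "returned unchanged" means the whole input is returned as-is when any byte sequence fails to decode.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a simple (1:1) lowercase mapping. A real
// implementation would delegate to utf8proc instead of this tiny table.
char32_t SimpleLowerSketch(char32_t cp) {
  if (cp >= U'A' && cp <= U'Z') return cp + (U'a' - U'A');
  switch (cp) {
    case 0x00C9: return 0x00E9;  // É -> é
    case 0x1E9E: return 0x00DF;  // ẞ -> ß
    case 0x0130: return 0x0069;  // İ -> i (the divergence case noted in the PR)
    default: return cp;
  }
}

// Decodes one code point starting at s[i]. Returns its byte length, or 0 if
// the sequence is truncated, overlong, a surrogate, or out of range.
std::size_t DecodeOne(std::string_view s, std::size_t i, char32_t& cp) {
  const auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
  const unsigned char b0 = byte(i);
  std::size_t len = 0;
  char32_t min = 0;
  if (b0 < 0x80) { cp = b0; return 1; }
  if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
  else return 0;
  if (i + len > s.size()) return 0;
  for (std::size_t k = 1; k < len; ++k) {
    if ((byte(i + k) & 0xC0) != 0x80) return 0;
    cp = (cp << 6) | (byte(i + k) & 0x3F);
  }
  if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
  return len;
}

// Appends the UTF-8 encoding of a valid code point to out.
void EncodeOne(char32_t cp, std::string& out) {
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Lowercases valid UTF-8 one code point at a time. On the first invalid
// sequence, returns the original input unchanged (assumed policy).
std::string ToLowerSketch(std::string_view s) {
  std::string out;
  out.reserve(s.size());
  for (std::size_t i = 0; i < s.size();) {
    char32_t cp = 0;
    const std::size_t len = DecodeOne(s, i, cp);
    if (len == 0) return std::string(s);
    EncodeOne(SimpleLowerSketch(cp), out);
    i += len;
  }
  return out;
}
```

Under these assumptions, `ToLowerSketch("CAF\xC3\x89")` yields `"caf\xC3\xA9"`, and `ToLowerSketch("AB\xFF")` returns `"AB\xFF"` untouched, including the uppercase ASCII prefix. Because the output can be shorter than the input (İ shrinks from two bytes to one), callers should not assume byte offsets survive lowercasing.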
Closes #613.
Follow-up to #748