Subject: locale/LC_CTYPE vs strcasecmp?
From: Winston
Newsgroups: comp.unix.bsd.freebsd.misc
Organization: A noiseless patient Spider
Date: Tue, 26 Mar 2024 10:24 UTC
Message-ID: <ydfrwdgujk.fsf@UBEblock.psr.com>

In FreeBSD 14.0-RELEASE:

The man page says strcasecmp_l() takes an explicit locale.
The implication is that strcasecmp() uses the current locale
(presumably as set by setlocale()).

After calling setlocale(LC_ALL, "uk_UA.UTF-8"), I'm seeing that
strcasecmp() is not, in fact, case-independently matching non-ASCII
UTF-8 strings: it's case sensitive (the ASCII equivalent in this
case being that "Abc" isn't matching "abc").

Is that a bug, does strcasecmp not, in fact, use the current
locale, or am I missing something?

TIA,
-WBE
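
A minimal sketch for reproducing the behavior described above. The helper name show_strcasecmp() and the two byte-string constants are made up for illustration; the strings spell the Cyrillic letter Ka in upper and lower case as raw UTF-8 bytes. Call setlocale() before the helper to try a particular locale. In a UTF-8 locale the Cyrillic pair would be expected to report "no match", since the two encodings differ in their second byte and a per-byte case fold has nothing to map there.

```c
#include <stdio.h>
#include <strings.h>

/* "К" (U+041A) and "к" (U+043A), written out as UTF-8 bytes. */
static const char UPPER_KA[] = "\xD0\x9A";
static const char LOWER_KA[] = "\xD0\xBA";

/*
 * Hypothetical demo helper (not from the thread): prints whether
 * strcasecmp() considers each pair equal in the current locale.
 */
void
show_strcasecmp(void)
{
	printf("ASCII:    %s\n",
	    strcasecmp("Abc", "abc") == 0 ? "match" : "no match");
	printf("Cyrillic: %s\n",
	    strcasecmp(UPPER_KA, LOWER_KA) == 0 ? "match" : "no match");
}
```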

Subject: Re: locale/LC_CTYPE vs strcasecmp?
From: Christian Weisgerber
Newsgroups: comp.unix.bsd.freebsd.misc
Date: Tue, 26 Mar 2024 19:47 UTC
References: 1
Message-ID: <slrnv069hn.tsn.naddy@lorvorc.mips.inka.de>
References: <ydfrwdgujk.fsf@UBEblock.psr.com>

On 2024-03-26, Winston <wbe@UBEBLOCK.psr.com.invalid> wrote:

> The man page says strcasecmp_l() takes an explicit locale.
> The implication is that strcasecmp() uses the current locale
> (presumably as set by setlocale()).

Yes.
src/lib/libc/string/strcasecmp.c:

int
strcasecmp(const char *s1, const char *s2)
{
    return strcasecmp_l(s1, s2, __get_locale());
}

> After calling setlocale(LC_ALL, "uk_UA.UTF-8"), I'm seeing that
> strcasecmp() is not, in fact, case-independently matching non-ASCII
> UTF-8 strings: it's case sensitive (the ASCII equivalent in this
> case being that "Abc" isn't matching "abc").

UTF-8 characters are multibyte. You need to convert the strings
to wide characters and use wcscasecmp().

--
Christian "naddy" Weisgerber naddy@mips.inka.de

Subject: Re: locale/LC_CTYPE vs strcasecmp?
From: Winston
Newsgroups: comp.unix.bsd.freebsd.misc
Organization: A noiseless patient Spider
Date: Wed, 27 Mar 2024 15:16 UTC
References: 1 2
Message-ID: <ydbk6zhfhj.fsf@UBEblock.psr.com>
References: <ydfrwdgujk.fsf@UBEblock.psr.com> <slrnv069hn.tsn.naddy@lorvorc.mips.inka.de>

I originally posted:
>> The man page says strcasecmp_l() takes an explicit locale.
>> The implication is that strcasecmp() uses the current locale
>> (presumably as set by setlocale()).

to which Christian Weisgerber <naddy@mips.inka.de> kindly replied:
> Yes.
> src/lib/libc/string/strcasecmp.c:
>
> int
> strcasecmp(const char *s1, const char *s2)
> {
>     return strcasecmp_l(s1, s2, __get_locale());
> }

:-)

>> After calling setlocale(LC_ALL, "uk_UA.UTF-8"), I'm seeing that
>> strcasecmp() is not, in fact, case-independently matching non-ASCII
>> UTF-8 strings: it's case sensitive (the ASCII equivalent in this
>> case being that "Abc" isn't matching "abc").

> UTF-8 characters are multibyte. You need to convert the strings
> to wide characters and use wcscasecmp().

That's what one would expect, and it's perfectly reasonable, but
something (I forget what now) led me to think that if strcasecmp()
accepted UTF-8 locales, maybe it *would* be willing to match multibyte
characters too, rather than just operating one byte at a time.

Thanks for confirming that, Christian. Onward to upgrading this
code that should have been doing that already ...
-WBE
