Subject                    Author
* UTF_16 question          jak
+* Re: UTF_16 question     Stefan Ram
|`- Re: UTF_16 question    jak
`* Re: UTF_16 question     Richard Damon
 `- Re: UTF_16 question    jak

Subject: UTF_16 question
From: jak
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Sat, 27 Apr 2024 18:45 UTC

Hi everyone,
there is something I do not understand. I have some text files in
different encodings: utf_32_le, utf_32_be, utf_16_le and utf_16_be, all
of them without a BOM. The utf_32_xx files give me no problem, but the
utf_16_xx files do. If I read a utf_16_le file with
encoding='utf_16_le', it reads correctly; if I read it with
encoding='utf_16_be', it still reads without any error, even though the
data I receive has its bytes inverted. The same happens with a
utf_16_be file: I can read it with both encoding='utf_16_be' and
encoding='utf_16_le' without errors, but in the latter case the bytes
are inverted. What have I misunderstood? What am I doing wrong?

Thanks in advance.

Subject: Re: UTF_16 question
From: Stefan Ram
Newsgroups: comp.lang.python
Organization: Stefan Ram
Date: Sat, 27 Apr 2024 19:13 UTC
References: 1

jak <nospam@please.ty> wrote or quoted:
> I read it, both with encoding='utf_16_be' and
> with 'utf_16_le' without errors but in the last case the bytes are
> inverted.

I think the order of the octets (bytes) is exactly the difference
between these two encodings, so your observation isn't really
surprising. The computer can't report an error here since it
can't infer the correct encoding from the file data. It's like
that koan: "A bit has the value 1. What does that mean?".
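
A small illustration of the point above, as a sketch rather than anything
from the original post: the same bytes decode cleanly under either byte
order, and only the resulting characters differ. The sample string is
made up for the example.

```python
# The same byte string decodes without error under either byte order.
data = "hello".encode("utf-16-le")   # two bytes per character, low byte first

as_le = data.decode("utf-16-le")     # the intended text
as_be = data.decode("utf-16-be")     # valid, but different, characters

print(repr(as_le))
print([hex(ord(c)) for c in as_be])
```

Neither call raises an exception, because every 16-bit value produced by
the swapped reading is itself a legal code unit.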

Subject: Re: UTF_16 question
From: jak
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Sun, 28 Apr 2024 00:50 UTC
References: 1 2

Stefan Ram wrote:
> jak <nospam@please.ty> wrote or quoted:
>> I read it, both with encoding='utf_16_be' and
>> with 'utf_16_le' without errors but in the last case the bytes are
>> inverted.
>
> I think the order of the octets (bytes) is exactly the difference
> between these two encodings, so your observation isn't really
> surprising. The computer can't report an error here since it
> can't infer the correct encoding from the file data. It's like
> that koan: "A bit has the value 1. What does that mean?".
>

Understood. They are just two bytes, and the data alone cannot tell
which order is the right one.

Thank you.

Subject: Re: UTF_16 question
From: Richard Damon
Newsgroups: comp.lang.python
Date: Mon, 29 Apr 2024 16:41 UTC
References: 1 2

That is why the BOM was created. A lot of files can be “correctly” read as either UTF-16-LE or UTF-16-BE, because most 16-bit code units are valid in both byte orders. Unless the wrong encoding happens to hit something invalid (most likely something that looks like a surrogate pair without its match), reading the file produces no error. The BOM character was specifically designed to be an invalid code when read with the wrong byte order: byte-swapped, U+FEFF becomes U+FFFE, which Unicode reserves as a noncharacter (if you ignore the possibility of the file having a NUL right after the BOM).
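
A minimal sketch of both cases in Python, assuming the standard codecs behave as documented; the sample strings are invented for the illustration. With a BOM the generic "utf-16" codec picks the byte order itself, and without one the wrong order fails only if it happens to produce an unpaired surrogate.

```python
import codecs

text = "log entry"

# With a BOM, the generic "utf-16" codec works out the byte order itself.
with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
assert with_bom.decode("utf-16") == text

# Read with the wrong explicit byte order, the BOM shows up as U+FFFE,
# a noncharacter, which is a clear sign that the order is wrong.
wrong = with_bom.decode("utf-16-le")
assert wrong[0] == "\ufffe"

# Without a BOM, the wrong byte order only fails when it happens to
# produce something invalid, such as an unpaired surrogate.
# "Ø" is U+00D8: its little-endian bytes D8 00 read big-endian as 0xD800.
no_bom = "Øa".encode("utf-16-le")
try:
    no_bom.decode("utf-16-be")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)
```

Note that Python does not raise an error for U+FFFE on its own; the caller has to check for it.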

If you know the files likely contain a lot of “ASCII” characters, then you might be able to detect that you got it wrong by seeing a lot of 0xXX00 characters and few 0x00XX characters, but that doesn’t normally create an “error”.
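
One way to turn that observation into a check, as a rough sketch: count where the zero bytes fall in each two-byte unit. The function name `guess_utf16_byte_order` is hypothetical, and the heuristic assumes BOM-less text that is mostly ASCII.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess 'utf-16-le' or 'utf-16-be' for BOM-less, mostly-ASCII text."""
    # In UTF-16, an ASCII character has one zero byte and one non-zero byte.
    # Big-endian puts the zero byte first; little-endian puts it second.
    zero_first = sum(1 for b in data[0::2] if b == 0)   # suggests big-endian
    zero_second = sum(1 for b in data[1::2] if b == 0)  # suggests little-endian
    # A tie falls back to little-endian, which is an arbitrary choice here.
    return "utf-16-be" if zero_first > zero_second else "utf-16-le"


print(guess_utf16_byte_order("error: disk full".encode("utf-16-be")))
print(guess_utf16_byte_order("error: disk full".encode("utf-16-le")))
```

This is only a guess based on statistics; text with few ASCII characters can defeat it.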

Subject: Re: UTF_16 question
From: jak
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Wed, 1 May 2024 17:07 UTC
References: 1 2 3

Thanks to you too for the reply. I was actually looking for a way to
distinguish "utf16le" texts from "utf16be" ones. Unfortunately, whoever
created this log file archive thought that the BOM was not important and
omitted it. Now they want to switch to "utf8" and to bring the existing
files along as well. Fortunately I can be sure that the text of the log
files is in some European language, so after converting a file to "utf8"
I check that most of the bytes are below 0x80; if they are not, I
convert it again, swapping "le" for "be" in the "utf16" codec name or
vice versa. The strategy seems to be working. In the future, writing the
files in "utf8" will spare them problems like this.
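
The strategy described above might look roughly like this. It is a sketch
only, not the code actually used: the function name `utf16_to_utf8` and
the "more than half of the bytes" threshold are assumptions made for the
illustration.

```python
def utf16_to_utf8(data: bytes, first_guess: str = "utf-16-le") -> bytes:
    """Convert BOM-less UTF-16 bytes to UTF-8, picking the byte order
    whose result is mostly ASCII."""
    other = "utf-16-be" if first_guess == "utf-16-le" else "utf-16-le"
    for encoding in (first_guess, other):
        try:
            utf8 = data.decode(encoding).encode("utf-8")
        except UnicodeError:
            continue  # this byte order produced invalid text
        ascii_bytes = sum(1 for b in utf8 if b < 0x80)
        # Assumed threshold: accept when more than half the bytes are ASCII.
        if utf8 and ascii_bytes * 2 > len(utf8):
            return utf8
    raise ValueError("neither byte order gives mostly-ASCII text")


sample = "Fehler: Gerät nicht bereit\n"
print(utf16_to_utf8(sample.encode("utf-16-be")).decode("utf-8"), end="")
```

The check relies on the text being mostly Latin letters, digits and
punctuation; for other scripts a different test would be needed.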
