Subject: Chardet oddity
From: Albert-Jan Roskam
Newsgroups: comp.lang.python
Date: Wed, 23 Oct 2024 17:07 UTC

Today I used chardet.detect in the REPL and it returned windows-1252
(incorrect, because it later resulted in a UnicodeDecodeError). When I ran
chardet as a script (which uses UniversalDetector), it returned
MacRoman. Isn't chardet.detect the correct way? I've used this method many
times.
# Interpreter
>>> contents = open(FILENAME, "rb").read()
>>> chardet.detect(content)
{'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language':
''}
# Terminal
$ python -m chardet FILENAME
FILENAME: MacRoman with confidence 0.7167379080370483
Thanks!
Albert-Jan

Subject: Re: Chardet oddity
From: Stefan Ram
Newsgroups: comp.lang.python
Organization: Stefan Ram
Date: Wed, 23 Oct 2024 17:43 UTC
X-Copyright: (C) Copyright 2024 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.

Albert-Jan Roskam <sjeik_appie@hotmail.com> wrote or quoted:
>Today I used chardet.detect in the REPL and it returned windows-1252
>(incorrect, because it later resulted in a UnicodeDecodeError). When I ran
>chardet as a script (which uses UniversalDetector), it returned
>MacRoman. Isn't chardet.detect the correct way? I've used this method many
>times.

Oof, that's a head-scratcher! Looks like chardet's throwing
you a curveball. Usually, chardet.detect() is the go-to method,
but it seems to be off its game here.

The script version's using UniversalDetector under the hood
(as you noted), which might be giving it an edge in this case.

It's weird that the confidence levels are so close, though.
Maybe the file's got some quirks that are tripping up the
simpler detect() method.

I'd say stick with the script version for now if it's giving
you better results.

Here's how you can use it in your code:

    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    with open(FILENAME, 'rb') as file:
        for line in file:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)
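
Whichever detector produced the guess, it can be checked by attempting a strict decode, since the original complaint was that the windows-1252 guess later raised a UnicodeDecodeError. The following is a minimal sketch using only the standard library; the helper name `check_guess` and the sample bytes are made up for illustration and are not part of chardet.

```python
from typing import Optional


def check_guess(data: bytes, encoding: str) -> Optional[str]:
    """Return None if `data` decodes strictly as `encoding`,
    otherwise a short description of the first offending byte."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return f"byte 0x{data[exc.start]:02x} at offset {exc.start}: {exc.reason}"
    return None


# Hypothetical sample: byte 0x81 is unassigned in Python's cp1252 codec,
# whereas the mac_roman codec maps it to a character.
sample = b"caf\x8e \x81ngstr\x9am"

for guess in ("windows-1252", "mac_roman"):
    problem = check_guess(sample, guess)
    print(guess, "->", problem or "decodes cleanly")
```

A clean decode does not prove the guess is right, only that it is not impossible; a failed decode does rule the guess out for that data.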

Subject: Re: Chardet oddity
From: Mark Bourne
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Wed, 23 Oct 2024 19:42 UTC

Albert-Jan Roskam wrote:
> Today I used chardet.detect in the REPL and it returned windows-1252
> (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
> chardet as a script (which uses UniversalDetector), it returned
> MacRoman. Isn't chardet.detect the correct way? I've used this method many
> times.
> # Interpreter
> >>> contents = open(FILENAME, "rb").read()
> >>> chardet.detect(content)

Is that copied and pasted from the terminal, or retyped with possible
transcription errors? As written, you've assigned the bytes read from the
file to `contents`, but passed `content` (with no "s") to `chardet.detect` -
so the result would depend on whatever was previously assigned to `content`.

> {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language':
> ''}
> # Terminal
> $ python -m chardet FILENAME
> FILENAME: MacRoman with confidence 0.7167379080370483
> Thanks!
> Albert-Jan

--
Mark.
