Subject: Re: Chardet oddity
From: Albert-Jan Roskam
Newsgroups: comp.lang.python
Date: Fri, 25 Oct 2024 10:31 UTC

On Oct 24, 2024 17:51, Roland Mueller via Python-list <python-list@python.org> wrote:

On Wed, 23 Oct 2024 at 20:11, Albert-Jan Roskam via Python-list
(python-list@python.org) wrote:

>    Today I used chardet.detect in the REPL and it returned Windows-1252
>    (incorrect, because it later resulted in a UnicodeDecodeError). When I
>    ran chardet as a script (which uses UniversalDetector), it returned
>    MacRoman. Isn't chardet.detect the correct way? I've used this method
>    many times.
>    # Interpreter
>    >>> import chardet
>    >>> contents = open(FILENAME, "rb").read()
>    >>> chardet.detect(contents)
>    {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
>    # Terminal
>    $ python -m chardet FILENAME
>    FILENAME: MacRoman with confidence 0.7167379080370483
>    Thanks!
>    Albert-Jan
>

The entry point for the chardet module is chardet.cli.chardetect:main, and
main() calls the function description_of(lines, name).
'lines' is a file opened in mode 'rb' and 'name' holds the filename.

I tried this in interactive mode (see the session further down). I think
the crucial difference is that description_of(lines, name) reads the opened
file line by line and stops as soon as the detector has reached a
conclusion on some line.

Reading the whole file into the variable contents and passing it to
detect() in one go may give a different result, depending on the input.
I was not able to reproduce this behaviour.
I am assuming that you used the same Python for both tests.

>>> from chardet.cli import chardetect
>>> chardetect.description_of(open('/tmp/DATE', 'rb'), 'some file')
'some file: ascii with confidence 1.0'
>>>

Your approach:
>>> from chardet import detect
>>> detect(open('/tmp/DATE','rb').read())
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

From /usr/lib/python3/dist-packages/chardet/cli/chardetect.py:

def description_of(lines, name='stdin'):
    u = UniversalDetector()
    for line in lines:
        line = bytearray(line)
        u.feed(line)
        # shortcut out of the loop to save reading further -
        # particularly useful if we read a BOM.
        if u.done:
            break
    u.close()
    result = u.result

=============
Hi Mark, Roland,
Thanks for your replies. I experimented a bit with both methods and the
derived encoding still differed, even after I removed the "if u.done:
break" (I removed that because I've seen cp1252 files with a UTF-8 BOM in
the past. I kid you not!). But the next day, on closer inspection, I saw
that the file was quite a mess: it contained mojibake. So I don't blame
chardet for not being able to figure out the encoding.
Albert-Jan
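
For readers unfamiliar with the two oddities mentioned in this reply, here is
a small sketch using only the Python standard library. The sample text and the
helper name has_utf8_bom are assumptions for illustration; nothing here is
taken from the actual file.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


# Mojibake: UTF-8 bytes that were decoded with the wrong codec (here cp1252).
original = "café naïve"
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)  # cafÃ© naÃ¯ve

# If nothing else went wrong, the damage can be undone by reversing the steps.
print(mojibake.encode("cp1252").decode("utf-8"))  # café naïve

# A cp1252 file that carries a UTF-8 BOM anyway: the BOM is misleading.
mixed = codecs.BOM_UTF8 + "café".encode("cp1252")
print(has_utf8_bom(mixed))  # True
try:
    mixed.decode("utf-8-sig")
except UnicodeDecodeError as exc:
    print("not valid UTF-8 despite the BOM:", exc.reason)

# Skipping the BOM and decoding the rest as cp1252 recovers the text.
print(mixed[len(codecs.BOM_UTF8):].decode("cp1252"))  # café
```

Once such mixtures occur within a single file, no single encoding decodes all
of it correctly, which is consistent with a detector returning low-confidence
or differing guesses.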
