Subject: Re: Chardet oddity
From: Roland Mueller
Newsgroups: comp.lang.python
Date: Thu, 24 Oct 2024 15:51 UTC
Message-ID: <mailman.36.1729785122.4695.python-list@python.org>

On Wed, 23 Oct 2024 at 20:11, Albert-Jan Roskam via Python-list
(python-list@python.org) wrote:

> Today I used chardet.detect in the REPL and it returned windows-1252
> (incorrect, because it later resulted in a UnicodeDecodeError). When I ran
> chardet as a script (which uses UniversalDetector) it returned
> MacRoman. Isn't chardet.detect the correct way? I've used this method many
> times.
> # Interpreter
> >>> contents = open(FILENAME, "rb").read()
> >>> chardet.detect(contents)
> {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
> # Terminal
> $ python -m chardet FILENAME
> FILENAME: MacRoman with confidence 0.7167379080370483
> Thanks!
> Albert-Jan
>

The entry point for the chardet command line is chardet.cli.chardetect:main,
and main() calls the function description_of(lines, name).
Here 'lines' is a file opened in mode 'rb' and 'name' holds the filename.

I think the crucial difference is that description_of(lines, name) reads
the opened file line by line and stops as soon as the detector reports that
it is done, which can happen before the end of the file. I tried both
approaches in interactive mode; the sessions are shown further down.

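Reading the whole file into the variable contents and passing it to detect()
in one call probably gives a different result, depending on the input.
I was not able to reproduce this behaviour, though: with my test file both
approaches return the same encoding.
I am assuming that you used the same Python for both tests.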

>>> from chardet.cli import chardetect
>>> chardetect.description_of(open('/tmp/DATE', 'rb'), 'some file')
'some file: ascii with confidence 1.0'
>>>

Your approach:
>>> from chardet import detect
>>> detect(open('/tmp/DATE','rb').read())
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
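
Since the windows-1252 guess in the original question later failed with a UnicodeDecodeError, a quick cross-check for any guess is simply to try decoding the bytes with it. The sketch below uses only the standard library; first_bad_offset() is a helper name made up for this post, and the sample bytes are invented for illustration, not taken from the file in question.

```python
def first_bad_offset(data, encoding):
    """Return the offset of the first undecodable byte, or None if all decode.

    Hypothetical helper for this thread, not part of chardet.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Invented sample: byte 0x81 has no mapping in Python's cp1252 codec,
# while the mac_roman codec assigns a character to every byte value.
sample = b"caf\x81"
for enc in ("cp1252", "mac_roman"):
    print(enc, first_bad_offset(sample, enc))
```

Note that a clean decode does not prove the guess is right: a codec that maps every byte value can never raise, so it only rules candidates out.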

From /usr/lib/python3/dist-packages/chardet/cli/chardetect.py:

def description_of(lines, name='stdin'):
    u = UniversalDetector()
    for line in lines:
        line = bytearray(line)
        u.feed(line)
        # shortcut out of the loop to save reading further - particularly
        # useful if we read a BOM.
        if u.done:
            break
    u.close()
    result = u.result
    ...
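
The loop in description_of() can be replayed by hand to see how many lines were fed before the detector declared itself done, and to put that result next to a single detect() call on the whole file. This is only a sketch: detect_line_by_line() and compare() are names made up for this post, not part of chardet, and compare() assumes chardet is installed.

```python
from pathlib import Path


def detect_line_by_line(path, detector):
    """Feed `path` to `detector` one line at a time, like description_of().

    `detector` is any object with feed(), close(), done and result,
    for example chardet's UniversalDetector.
    Returns (result, lines_fed) so the early stop becomes visible.
    """
    lines_fed = 0
    with open(path, "rb") as fh:
        for line in fh:
            detector.feed(bytearray(line))
            lines_fed += 1
            if detector.done:  # detector has made up its mind; stop reading
                break
    detector.close()
    return detector.result, lines_fed


def compare(path):
    """Run both approaches from this thread on the same file."""
    # Third-party imports kept local so the helper above works without chardet.
    import chardet
    from chardet.universaldetector import UniversalDetector

    whole = chardet.detect(Path(path).read_bytes())
    by_line, lines_fed = detect_line_by_line(path, UniversalDetector())
    return whole, by_line, lines_fed
```

Calling compare(FILENAME) on the problematic file would show whether the two results differ and after how many lines the line-by-line run stopped.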
