Subject: From JoyceUlysses.txt -- words occurring exactly once
From: HenHanna
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Thu, 30 May 2024 20:03 UTC

Given a text file of a novel (JoyceUlysses.txt) ...

could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?

-- Also, a list of words occurring once, twice or 3 times

re: hyphenated words (you can treat it anyway you like)

but ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: dn
Newsgroups: comp.lang.python
Organization: DWM
Date: Thu, 30 May 2024 21:18 UTC

On 31/05/24 08:03, HenHanna via Python-list wrote:
>
> Given a text file of a novel (JoyceUlysses.txt) ...
>
> could someone give me a pretty fast (and simple) Python program that'd
> give me a list of all words occurring exactly once?
>
>               -- Also, a list of words occurring once, twice or 3 times
>
>
>
> re: hyphenated words        (you can treat it anyway you like)
>
>        but ideally, i'd treat  [editor-in-chief]
>                                [go-ahead]  [pen-knife]
>                                [know-how]  [far-fetched] ...
>        as one unit.

Did you mention the pay-rate for this work?

Split into words - defined as you will.
Use Counter.

Show some (of your) code and we'll be happy to critique...
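
A minimal sketch of the "split, then use Counter" suggestion, assuming plain whitespace splitting as the definition of a word. The function name `words_by_frequency` and the sample sentence are made up for illustration and do not come from the thread.

```python
from collections import Counter

def words_by_frequency(words, lowest=1, highest=1):
    """Return the sorted words whose count lies in [lowest, highest]."""
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if lowest <= n <= highest)

# "Split into words - defined as you will": here, plain whitespace splitting.
sample = "the cat saw the dog and the dog saw a bird"
words = sample.lower().split()

print(words_by_frequency(words))        # words occurring exactly once
print(words_by_frequency(words, 1, 3))  # words occurring once, twice or 3 times
```

For the novel, `sample` would be replaced by the contents of the file, and the splitting step by whatever definition of "word" is chosen.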
--
Regards,
=dn

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: HenHanna
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Fri, 31 May 2024 02:26 UTC

On 5/30/2024 2:18 PM, dn wrote:
> On 31/05/24 08:03, HenHanna via Python-list wrote:
>>
>> Given a text file of a novel (JoyceUlysses.txt) ...
>>
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
>>
>>                -- Also, a list of words occurring once, twice or 3 times
>>
>>
>>
>> re: hyphenated words        (you can treat it anyway you like)
>>
>>         but ideally, i'd treat  [editor-in-chief]
>>                                 [go-ahead]  [pen-knife]
>>                                 [know-how]  [far-fetched] ...
>>         as one unit.

>
> Split into words - defined as you will.
> Use Counter.
>
> Show some (of your) code and we'll be happy to critique...

hard to decide what to do with hyphens
and apostrophes
(I'd, he's, can't, haven't, A's and B's)

Two-step process:

1. make a file listing all words (one word per line)

2. then do the counting, using
from collections import Counter

Related code (for step 1) that I'd used before:

# Step 1: write every word of the novel to Out.txt, one word per line.
PUNCT = ";:,'\"[]()*&^%$#@!,./<>?_-+="

with open("JoyceUlysses.txt", "r") as rfile, open("Out.txt", "w") as fo:
    for line in rfile:
        for w in line.split():
            # Strip punctuation from both ends; hyphens inside a word survive.
            w = w.strip(PUNCT).lower()
            if w:  # skip tokens that were nothing but punctuation
                fo.write(w + "\n")
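
Step 2 could look roughly like the sketch below. It assumes the one-word-per-line layout described above; the helper names `count_word_lines` and `group_by_count` and the demo lines are hypothetical. The counting function accepts any iterable of lines, so an open file object works as well as a list.

```python
from collections import Counter

def count_word_lines(lines):
    """Count words from an iterable of lines holding one word each."""
    # Blank lines, if any, are skipped.
    return Counter(word for word in (line.strip() for line in lines) if word)

def group_by_count(counts, max_count=3):
    """Map each count 1..max_count to the sorted words having that count."""
    groups = {n: [] for n in range(1, max_count + 1)}
    for word, n in counts.items():
        if n in groups:
            groups[n].append(word)
    return {n: sorted(ws) for n, ws in groups.items()}

# Stand-in for the lines of Out.txt.
demo = ["stately\n", "plump\n", "buck\n", "\n", "plump\n", "buck\n", "buck\n"]
groups = group_by_count(count_word_lines(demo))
print(groups[1])             # words occurring exactly once
print(groups[2], groups[3])  # words occurring twice, and 3 times
```

With the real file, the call would be something like `with open("Out.txt") as f: counts = count_word_lines(f)`.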

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Pieter van Oostrum
Newsgroups: comp.lang.python
Date: Fri, 31 May 2024 12:39 UTC

HenHanna <HenHanna@devnull.tb> writes:

> Given a text file of a novel (JoyceUlysses.txt) ...
>
> could someone give me a pretty fast (and simple) Python program that'd
> give me a list of all words occurring exactly once?
>
> -- Also, a list of words occurring once, twice or 3 times
>
>
>
> re: hyphenated words (you can treat it anyway you like)
>
> but ideally, i'd treat [editor-in-chief]
> [go-ahead] [pen-knife]
> [know-how] [far-fetched] ...
> as one unit.
>

That is a famous Unix task (sorry, no Python):

grep -o '\w*' JoyceUlysses.txt | sort | uniq -c | sort -n
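
For comparison, a rough Python analogue of that pipeline, offered as a sketch rather than an exact equivalent. Like the grep version it is case-sensitive and treats `\w` runs as words, so hyphenated words and contractions are split apart. Python's `\w` is Unicode-aware for `str` patterns, while grep's behaviour depends on the locale. The name `count_report` and the sample text are illustrative.

```python
import re
from collections import Counter

def count_report(text):
    """Return (count, word) pairs sorted by ascending count, like `sort -n`."""
    words = re.findall(r"\w+", text)                   # roughly: grep -o '\w*'
    counts = Counter(words)                            # roughly: sort | uniq -c
    return sorted((n, w) for w, n in counts.items())   # roughly: sort -n

report = count_report("yes I said yes I will Yes")
for n, w in report:
    print(n, w)

# The pipeline lists every word with its count; picking out the words
# that occur exactly once is one more filter.
once = [w for n, w in report if n == 1]
```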

--
Pieter van Oostrum <pieter@vanoostrum.org>
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Grant Edwards
Newsgroups: comp.lang.python
Date: Fri, 31 May 2024 18:58 UTC

On 2024-05-31, Pieter van Oostrum via Python-list <python-list@python.org> wrote:
> HenHanna <HenHanna@devnull.tb> writes:
>
>> Given a text file of a novel (JoyceUlysses.txt) ...
>>
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
>>
>> -- Also, a list of words occurring once, twice or 3 times
>>
>> re: hyphenated words (you can treat it anyway you like)
>>
>> but ideally, i'd treat [editor-in-chief]
>> [go-ahead] [pen-knife]
>> [know-how] [far-fetched] ...
>> as one unit.
>>
>
> That is a famous Unix task : (Sorry, no Python)
>
> grep -o '\w*' JoyceUlysses.txt | sort | uniq -c | sort -n

Yep, that's what came to my mind (though I couldn't remember the exact
grep option without looking it up). However, I assume that doesn't
get you very many points on a homework assignment from an "Introduction
to Python" class.

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: dieter.maurer@online.de
Newsgroups: comp.lang.python
Date: Fri, 31 May 2024 17:59 UTC

HenHanna wrote at 2024-5-30 13:03 -0700:
>
>Given a text file of a novel (JoyceUlysses.txt) ...
>
>could someone give me a pretty fast (and simple) Python program that'd
>give me a list of all words occurring exactly once?

Your task can be split into several subtasks:
* parse the text into words

This depends on your notion of "word".
In the simplest case, a word is any maximal sequence of non-whitespace
characters. In this case, you can use `split` for this task

* Make a list unique -- you can use `set` for this

> -- Also, a list of words occurring once, twice or 3 times

For this you count the number of occurrences in a `list`.
You can use the `count` method of lists for this.

All individual subtasks are simple. I am confident that
you will be able to solve them by yourself (if you are willing
to invest a bit of time).
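
Put together, the subtasks above might be sketched as follows, using only `split`, `set` and `list.count`. The function name `words_occurring` and the sample text are hypothetical. Note that `set` only yields each distinct word once; the filtering by count is a separate step.

```python
def words_occurring(text, allowed_counts=(1,)):
    """Return the sorted words whose number of occurrences is in allowed_counts."""
    words = text.split()   # simplest notion of "word": non-whitespace runs
    unique = set(words)    # each distinct word once
    # list.count rescans the whole list for every distinct word.
    return sorted(w for w in unique if words.count(w) in allowed_counts)

text = "a b a c b a d"
print(words_occurring(text))             # words occurring exactly once
print(words_occurring(text, (1, 2, 3)))  # words occurring once, twice or 3 times
```

Because `words.count(w)` scans the full list once per distinct word, the running time grows with the product of the two sizes. That is fine for short texts; for a novel-length file, a single pass with a dict or `collections.Counter` is likely to be much faster.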

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Thomas Passin
Newsgroups: comp.lang.python
Date: Fri, 31 May 2024 21:27 UTC

On 5/30/2024 4:03 PM, HenHanna via Python-list wrote:
>
> Given a text file of a novel (JoyceUlysses.txt) ...
>
> could someone give me a pretty fast (and simple) Python program that'd
> give me a list of all words occurring exactly once?
>
>               -- Also, a list of words occurring once, twice or 3 times
>
>
>
> re: hyphenated words        (you can treat it anyway you like)
>
>        but ideally, i'd treat  [editor-in-chief]
>                                [go-ahead]  [pen-knife]
>                                [know-how]  [far-fetched] ...
>        as one unit.

You will probably get a thousand different suggestions, but here's a
fairly direct and readable one in Python:

s1 = 'Is this word is the only word repeated in this string'

counts = {}
for w in s1.lower().split():
    counts[w] = counts.get(w, 0) + 1
print(sorted(counts.items()))
# [('in', 1), ('is', 2), ('only', 1), ('repeated', 1), ('string', 1),
#  ('the', 1), ('this', 2), ('word', 2)]

Of course you can adjust the definition of what constitutes a word,
handle punctuation and so on, and tinker with the output format to suit
yourself. You would replace s1.lower().split() with, e.g.,
my_custom_word_splitter(s1).
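
One possible shape for such a splitter is sketched below. `my_custom_word_splitter` is only a placeholder name in the post, so the regular expression here is an assumption about what it might do: keep hyphenated words and contractions as one unit, and drop all other punctuation. The sample sentence is made up.

```python
import re

# Hypothetical word pattern: runs of letters, optionally joined by a single
# hyphen or apostrophe (ASCII ' or U+2019) that has letters on both sides.
_WORD = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")

def my_custom_word_splitter(text):
    """Return lower-cased words, keeping hyphenated words as one unit."""
    return _WORD.findall(text.lower())

s1 = "The editor-in-chief said: I'd go-ahead -- can't you? Go-ahead!"

counts = {}
for w in my_custom_word_splitter(s1):
    counts[w] = counts.get(w, 0) + 1
print(sorted(w for w, n in counts.items() if n == 1))  # words occurring once
```

A double hyphen used as a dash separates words instead of joining them, since the pattern allows only a single joining character between letter runs. Words hyphenated across a line break are not rejoined by this sketch.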

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Peter J. Holzer
Newsgroups: comp.lang.python
Date: Sat, 1 Jun 2024 08:04 UTC

On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
> hard to decide what to do with hyphens
> and apostrophes
> (I'd, he's, can't, haven't, A's and B's)

Especially since the same character is used as both an apostrophe and a
closing quotation mark. And while that's pretty unambiguous between two
characters, it isn't at the end of a word:

This is Alex’ house.
This type of building is called an ‘Alex’ house.
The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.

(using proper Unicode quotation marks. It gets worse if you stick to
ASCII.)

Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
single quotation marks[1], but despite the suggestive names, this is not
the common typographical convention, so your texts are unlikely to make
this distinction.
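
To make the ambiguity concrete, here is a minimal sketch of a tokenizer that keeps an apostrophe only when it sits between two word characters. The pattern and the name `tokenize` are illustrative assumptions, not something proposed in the thread. A trailing mark, as in "Alex’ house", is simply dropped, because the code cannot tell a possessive apostrophe from a closing quotation mark.

```python
import re

# Hypothetical pattern: a run of word characters, optionally continued by
# an ASCII apostrophe (U+0027) or U+2019 that is followed by more word
# characters. A trailing apostrophe/quote is never part of the token.
WORD_RE = re.compile(r"\w+(?:['\u2019]\w+)*")

def tokenize(text):
    """Return lowercase tokens, keeping only word-internal apostrophes."""
    return WORD_RE.findall(text.lower())

# Internal apostrophes survive; the one after "Alex" is discarded.
print(tokenize("I\u2019d say he\u2019s at Alex\u2019 house."))
```

Whether dropping the trailing mark is right depends on the text; that is exactly the judgment call described above.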

hp

[1] Which I use rarely, anyway.

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Attachments: signature.asc (application/pgp-signature)
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Thomas Passin
Newsgroups: comp.lang.python
Date: Sat, 1 Jun 2024 13:38 UTC
References: 1 2 3 4 5 6
Path: eternal-september.org!news.eternal-september.org!feeder3.eternal-september.org!fu-berlin.de!uni-berlin.de!not-for-mail
From: list1@tompassin.net (Thomas Passin)
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Sat, 1 Jun 2024 09:38:51 -0400
Lines: 33
Message-ID: <mailman.80.1717253662.2909.python-list@python.org>
References: <v3am2l$1qf6m$3@dont-email.me>
<aef0bc5c-b0b6-4d7d-af05-cc22c165f327@DancesWithMice.info>
<mailman.74.1717103931.2909.python-list@python.org>
<v3bcgu$229eq$1@dont-email.me> <20240601080429.ygyg75jzdoxdofa2@hjp.at>
<f03d6493-88d8-4edb-acec-ce55e75b9e57@tompassin.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Trace: news.uni-berlin.de uh9/OVtf/n7MdMIfAnVS4wc5xPcnk7tu+T2rxC40xwJw==
Cancel-Lock: sha1:AYiuAFz5bz5Hbr2lm54Kg6XnOQA= sha256:zYlh3dQ1wLESW2jcxLCeZNXqUkeinHIZW8hMem/ot50=
Return-Path: <list1@tompassin.net>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
In-Reply-To: <20240601080429.ygyg75jzdoxdofa2@hjp.at>

On 6/1/2024 4:04 AM, Peter J. Holzer via Python-list wrote:
> On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
>> hard to decide what to do with hyphens
>> and apostrophes
>> (I'd, he's, can't, haven't, A's and B's)
>
> Especially since the same character is used as both an apostrophe and a
> closing quotation mark. And while that's pretty unambiguous between two
> characters, it isn't at the end of a word:
>
> This is Alex’ house.
> This type of building is called an ‘Alex’ house.
> The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.
>
> (using proper Unicode quotation marks. It gets worse if you stick to
> ASCII.)
>
> Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
> LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
> single quotation marks[1], but despite the suggestive names, this is not
> the common typographical convention, so your texts are unlikely to make
> this distinction.
>
> hp
>
> [1] Which I use rarely, anyway.

My usual approach is to replace punctuation by spaces and then to
discard anything remaining that is only one character long (or sometimes
two, depending on what I'm working on). Yes, OK, I will miss words like
"I". Usually I don't care about them. Make exceptions to the policy if
you like.
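
A rough sketch of that approach follows. The function name `simple_words` and the `min_len` parameter are made up for illustration, and lowercasing is an added assumption. Note that `string.punctuation` covers ASCII punctuation only, so curly quotes and dashes in a Unicode text would need to be added to the table.

```python
import string

def simple_words(text, min_len=2):
    """Replace ASCII punctuation with spaces, then drop very short tokens."""
    # Translation table mapping each ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).lower().split()
    # Discard leftovers such as the "t" from "can't" (and, yes, "I").
    return [tok for tok in tokens if len(tok) >= min_len]

print(simple_words("I can't stop, he said."))
```

Raising `min_len` to 3 gives the "or sometimes two" variant.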

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Mats Wichmann
Newsgroups: comp.lang.python
Date: Sat, 1 Jun 2024 19:34 UTC
References: 1 2 3
Path: eternal-september.org!news.eternal-september.org!feeder3.eternal-september.org!fu-berlin.de!uni-berlin.de!not-for-mail
From: mats@wichmann.us (Mats Wichmann)
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Sat, 1 Jun 2024 13:34:11 -0600
Lines: 51
Message-ID: <mailman.81.1717270463.2909.python-list@python.org>
References: <v3am2l$1qf6m$3@dont-email.me>
<26202.4083.590062.42312@ixdm.fritz.box>
<32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Trace: news.uni-berlin.de DJmMM0oasuPZ7IFUDrdNnw+2A99v5KhTG+ucGhUbljRQ==
Cancel-Lock: sha1:Ad9o6rLizb0FKFNGpBoiEzJ8zX4= sha256:2JhIKkPIblCrSc4zZeEIOD8/7/ug8d6TJrweQguzH/s=
Return-Path: <mats@wichmann.us>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
In-Reply-To: <26202.4083.590062.42312@ixdm.fritz.box>

On 5/31/24 11:59, Dieter Maurer via Python-list wrote:

hmmm, I "sent" this but there was some problem and it remained unsent.
Just in case it hasn't All Been Said Already, here's the retry:

> HenHanna wrote at 2024-5-30 13:03 -0700:
>>
>> Given a text file of a novel (JoyceUlysses.txt) ...
>>
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
>
> Your task can be split into several subtasks:
> * parse the text into words
>
> This depends on your notion of "word".
> In the simplest case, a word is any maximal sequence of non-whitespace
> characters. In this case, you can use `split` for this task

This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then words followed by
punctuation count as distinct words. What about hyphenation, where there
are both the compound-word forms and the hyphens at the ends of lines if
the source text has been formatted that way? Are all-lowercase words
different from the same word starting with a capital? What about
non-initial capitals, as happens a fair bit in modern usage with acronyms,
trademarks (perhaps not in Ulysses? :-) ), etc.? What about accented letters?

If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone into
what a "word character" is (\w), so it deals with excluding both
punctuation and whitespace.

import re
from collections import Counter

# Count each lowercased run of word characters (\w+) in the novel.
with open("JoyceUlysses.txt", "r") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count of
1, 2, 3, etc.)
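
Fishing out the words that occur exactly once might look like the sketch below. It works on a string rather than a file so that it is self-contained, and the helper name `words_occurring_once` is hypothetical. The notion of "word" is still the `\w+` one from above.

```python
import re
from collections import Counter

def words_occurring_once(text):
    """Return the sorted list of \\w+ tokens that appear exactly once."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)

sample = "Stately, plump Buck Mulligan came; plump Buck came."
print(words_occurring_once(sample))  # ['mulligan', 'stately']
```

Changing `n == 1` to another number answers the "2, 3, etc." variants.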

Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters. nltk is a choice there. But at this point it's not really
"simple" any longer (though nltk experts might end up disagreeing with
that).

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Edward Teach
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Mon, 3 Jun 2024 09:47 UTC
References: 1 2 3 4
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: hackbeard@linuxmail.org (Edward Teach)
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Mon, 3 Jun 2024 10:47:42 +0100
Organization: A noiseless patient Spider
Lines: 64
Message-ID: <20240603104742.1664b37c@fedora>
References: <v3am2l$1qf6m$3@dont-email.me>
<26202.4083.590062.42312@ixdm.fritz.box>
<32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
<mailman.81.1717270463.2909.python-list@python.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 03 Jun 2024 11:47:43 +0200 (CEST)

On Sat, 1 Jun 2024 13:34:11 -0600
Mats Wichmann <mats@wichmann.us> wrote:

> On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
>
> hmmm, I "sent" this but there was some problem and it remained
> unsent. Just in case it hasn't All Been Said Already, here's the
> retry:
>
> > HenHanna wrote at 2024-5-30 13:03 -0700:
> >>
> >> Given a text file of a novel (JoyceUlysses.txt) ...
> >>
> >> could someone give me a pretty fast (and simple) Python program
> >> that'd give me a list of all words occurring exactly once?
> >
> > Your task can be split into several subtasks:
> > * parse the text into words
> >
> > This depends on your notion of "word".
> > In the simplest case, a word is any maximal sequence of
> > non-whitespace characters. In this case, you can use `split` for
> > this task
>
> This piece is by far "the hard part", because of the ambiguity. For
> example, if I just say non-whitespace, then words followed by
> punctuation count as distinct words. What about hyphenation, where
> there are both the compound-word forms and the hyphens at the ends of
> lines if the source text has been formatted that way? Are all-lowercase
> words different from the same word starting with a capital? What about
> non-initial capitals, as happens a fair bit in modern usage with
> acronyms, trademarks (perhaps not in Ulysses? :-) ), etc.? What about
> accented letters?
>
> If you want what's at least a quick starting point to play with, you
> could use a very simple regex - a fair amount of thought has gone
> into what a "word character" is (\w), so it deals with excluding both
> punctuation and whitespace.
>
> import re
> from collections import Counter
>
> with open("JoyceUlysses.txt", "r") as f:
>     wordcount = Counter(re.findall(r'\w+', f.read().lower()))
>
> Now you have a Counter object counting all the "words" with their
> occurrence counts (by this definition) in the document. You can fish
> through that to answer the questions asked (find entries with a count
> of 1, 2, 3, etc.)
>
> Some people Go Big and use something that actually tries to recognize
> the language, as opposed to making assumptions from ranges of
> characters. nltk is a choice there. But at this point it's not
> really "simple" any longer (though nltk experts might end up
> disagreeing with that).
>
>

Project Gutenberg publishes "plain text". That's another problem,
because "plain text" means UTF-8....and that means Unicode...and that
means running some sort of Unicode-to-ASCII conversion in order to get
something like "words". A couple of hours....a couple of hundred lines
of C....problem solved!

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Grant Edwards
Newsgroups: comp.lang.python
Date: Mon, 3 Jun 2024 18:58 UTC
References: 1 2 3 4 5 6
Path: eternal-september.org!news.eternal-september.org!feeder3.eternal-september.org!fu-berlin.de!uni-berlin.de!not-for-mail
From: grant.b.edwards@gmail.com (Grant Edwards)
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
Lines: 14
Message-ID: <mailman.83.1717441107.2909.python-list@python.org>
References: <v3am2l$1qf6m$3@dont-email.me>
<26202.4083.590062.42312@ixdm.fritz.box>
<32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
<mailman.81.1717270463.2909.python-list@python.org>
<20240603104742.1664b37c@fedora> <4VtNKZ70YdznVGW@mail.python.org>
X-Trace: news.uni-berlin.de PGnm2PorituhI1bgi5A0zwjPb8Pud2BnrlsrVxN6DhHQ==
Cancel-Lock: sha1:cDcvjmujbaSBcvnyp1zME3r0peY= sha256:t5ZBfBYKxhugaOUseR4loukCiqnA4gL48zQT7z2RpPs=
Return-Path: <grant.b.edwards@gmail.com>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org

On 2024-06-03, Edward Teach via Python-list <python-list@python.org> wrote:

> Project Gutenberg publishes "plain text". That's another
> problem, because "plain text" means UTF-8....and that means
> Unicode...and that means running some sort of Unicode-to-ASCII
> conversion in order to get something like "words". A couple of
> hours....a couple of hundred lines of C....problem solved!

I'm curious. Why does it need to be converted from Unicode to ASCII?

When you read it into Python, it gets converted right back to Unicode...

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Edward Teach
Newsgroups: comp.lang.python
Organization: A noiseless patient Spider
Date: Tue, 4 Jun 2024 11:21 UTC
References: 1 2 3 4 5 6 7
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: hackbeard@linuxmail.org (Edward Teach)
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Tue, 4 Jun 2024 12:21:34 +0100
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <20240604122134.2696c36d@fedora>
References: <v3am2l$1qf6m$3@dont-email.me>
<26202.4083.590062.42312@ixdm.fritz.box>
<32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
<mailman.81.1717270463.2909.python-list@python.org>
<20240603104742.1664b37c@fedora>
<4VtNKZ70YdznVGW@mail.python.org>
<mailman.83.1717441107.2909.python-list@python.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 04 Jun 2024 13:21:34 +0200 (CEST)

On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
Grant Edwards <grant.b.edwards@gmail.com> wrote:

> On 2024-06-03, Edward Teach via Python-list <python-list@python.org>
> wrote:
>
> > Project Gutenberg publishes "plain text". That's another
> > problem, because "plain text" means UTF-8....and that means
> > Unicode...and that means running some sort of Unicode-to-ASCII
> > conversion in order to get something like "words". A couple of
> > hours....a couple of hundred lines of C....problem solved!
>
> I'm curious. Why does it need to be converted from Unicode to ASCII?
>
> When you read it into Python, it gets converted right back to
> Unicode...
>
>
>

Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........
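
If the goal is to compare against an ASCII-only master list, one possible sketch (an assumption, not the C program mentioned above) is to fold accented letters to their base letters with the standard `unicodedata` module before the lookup. The names `ascii_fold` and `unknown_words` are hypothetical, and the master list is passed in as a plain iterable rather than read from a real linux.words file. Letters with no decomposition, such as "ø" or "æ", pass through unchanged.

```python
import unicodedata

def ascii_fold(word):
    """Strip combining marks so 'café' compares equal to 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def unknown_words(words, master):
    """Return the words whose folded, lowercased form is not in master."""
    master_set = {entry.lower() for entry in master}
    return sorted({w for w in words if ascii_fold(w).lower() not in master_set})

# "café" folds to "cafe" and is found; "zzyzx" is not in the list.
print(unknown_words(["caf\u00e9", "zzyzx"], ["cafe", "house"]))
```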

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: dieter.maurer@online.de
Newsgroups: comp.lang.python
Date: Tue, 4 Jun 2024 16:13 UTC
References: 1 2 3 4 5 6
Path: eternal-september.org!news.eternal-september.org!feeder3.eternal-september.org!weretis.net!feeder8.news.weretis.net!fu-berlin.de!uni-berlin.de!not-for-mail
From: dieter.maurer@online.de
Newsgroups: comp.lang.python
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
Date: Tue, 4 Jun 2024 18:13:47 +0200
Lines: 12
Message-ID: <mailman.84.1717519110.2909.python-list@python.org>
References: <v3am2l$1qf6m$3@dont-email.me>
<26202.4083.590062.42312@ixdm.fritz.box>
<32b20599-1cf1-4aeb-904b-b9afa3dea3a3@wichmann.us>
<mailman.81.1717270463.2909.python-list@python.org>
<20240603104742.1664b37c@fedora>
<26207.15675.710915.692146@ixdm.fritz.box>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: news.uni-berlin.de pJ/a4/9Fzzf6KeDC4HB3Hw85inBWEGeN4Z8kQ57s5dpQ==
Cancel-Lock: sha1:9zNt1SYZr44rqec4N0kAjv2HvPU= sha256:DEmntPTCxS1YTRgeNtsIZDXhWBIVXtoAhZMIqRBQ48Y=
Return-Path: <dieter.maurer@online.de>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
In-Reply-To: <20240603104742.1664b37c@fedora>

Edward Teach wrote at 2024-6-3 10:47 +0100:
> ...
>Project Gutenberg publishes "plain text". That's another problem,
>because "plain text" means UTF-8....and that means Unicode...and that
>means running some sort of Unicode-to-ASCII conversion in order to get
>something like "words". A couple of hours....a couple of hundred lines
>of C....problem solved!

Unicode supports the notion of a "word" even better than ASCII does.
For example, the `\w` (word character) regular expression wildcard
works for Unicode just as it does for ASCII (of course with an enlarged
set of letters, digits, etc.).
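
A minimal sketch of that point, using Python's standard `re` module. The sample string is made up for illustration. For `str` patterns, `\w` matches Unicode word characters by default; the `re.ASCII` flag restricts it to `[a-zA-Z0-9_]`.

```python
import re

sample = "naïve café Dvořák résumé"  # illustrative text, not from the novel

# Default for str patterns: \w matches Unicode letters, digits, and underscore.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented words get split.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

The first list keeps each accented word whole; the second breaks the words apart at every non-ASCII letter.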

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Grant Edwards
Newsgroups: comp.lang.python
Date: Tue, 4 Jun 2024 17:05 UTC
References: 1 2 3 4 5 6 7 8 9

On 2024-06-04, Edward Teach via Python-list <python-list@python.org> wrote:
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards <grant.b.edwards@gmail.com> wrote:
>
>> On 2024-06-03, Edward Teach via Python-list <python-list@python.org>
>> wrote:
>>
>> > The Gutenburg Project publishes "plain text". That's another
>> > problem, because "plain text" means UTF-8....and that means
>> > unicode...and that means running some sort of unicode-to-ascii
>> > conversion in order to get something like "words". A couple of
>> > hours....a couple of hundred lines of C....problem solved!
>>
>> I'm curious. Why does it need to be converted frum Unicode to ASCII?
>>
>> When you read it into Python, it gets converted right back to
>> Unicode...

> Well.....when using the file linux.words as a useful master list of
> "words".....linux.words is strict ASCII........

I guess I missed the part of the problem description where it said to
use linux.words to decide what a word is. :)

--
Grant

Subject: RE: From JoyceUlysses.txt -- words occurring exactly once
From: <avi.e.gross@gmail.com>
Newsgroups: comp.lang.python
Date: Tue, 4 Jun 2024 21:30 UTC
References: 1 2 3 4 5 6 7 8 9

>> Well.....when using the file linux.words as a useful master list of
>> "words".....linux.words is strict ASCII........

The meaning of "words" depends on the context. The contents of the file
mentioned are a modest attempt to capture a common subset of English words,
but they probably are not what you mean by words in other contexts, which may
include words that are also plain ASCII, such as names (especially uncommon
ones) or acronyms like UNESCO. There are other curated lists of words, such
as valid Scrabble words or Wordle words, for specialized purposes; these
exclude words of lengths that cannot be used. The person looking to count
words in a work must determine what words make sense for their purpose.

ASCII is a small subset of Unicode. So when using a concept of word that
includes many characters from many character sets, and in many languages,
things may not be easy to parse uniquely, for example words containing an
apostrophe early on, as in d'eau. Text can flow in different directions.
There can be fairly complex rules, and sometimes things like compound words
may need to be considered either one word or several; they may even occur
both ways in the same work. So is "every body" the same as "everybody"?
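
One way to make such choices explicit is to encode them in the tokenizer itself. The sketch below is only one possible definition, not a standard one: it treats a run of letters as a word and lets a single apostrophe or hyphen join letter runs, so d'eau and editor-in-chief stay whole. The pattern and the name `tokenize` are assumptions for illustration.

```python
import re

# A letter run is one or more word characters other than digits and underscore.
# Letter runs may be joined by one apostrophe (straight or curly) or one hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")

def tokenize(text):
    """Return the list of word tokens in text under the definition above."""
    return WORD_RE.findall(text)

print(tokenize("A glass d'eau for the editor-in-chief -- every body, everybody."))
```

Under this definition "every body" yields two tokens and "everybody" one, so the two spellings are counted separately unless a later step merges them.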

So what is being discussed here may have several components. One is to
tokenize all the text to make a set of categories. Another is to count them.
Perhaps another might even analyze and combine multiple categories, or even
look at words in context to determine whether two uses of the same word are
different enough to keep them apart in two categories. Is "polish" the
same as "Polish"?

Once that is decided, you have a fairly simple exercise in storing the data
in a searchable data structure and doing your searches to get subsets and
counts and so on.
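
Assuming the text has already been split into tokens by whatever definition was chosen, the storing and searching step can be sketched with `collections.Counter` from the standard library. The function name `words_with_count` and the `fold_case` switch are illustrative assumptions; whether "polish" and "Polish" should be merged is exactly the kind of decision that has to be made first.

```python
from collections import Counter

def words_with_count(tokens, wanted=(1,), fold_case=True):
    """Return the sorted words whose number of occurrences is in `wanted`."""
    if fold_case:
        tokens = (t.casefold() for t in tokens)
    counts = Counter(tokens)
    return sorted(w for w, n in counts.items() if n in wanted)

tokens = ["Polish", "the", "polish", "shoes", "the"]
print(words_with_count(tokens))                   # case-folded
print(words_with_count(tokens, fold_case=False))  # case-sensitive
```

Passing `wanted=(1, 2, 3)` would select the words occurring once, twice, or three times.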

As mentioned, the default native string format in Python is Unicode, and
ASCII files being read in may well be Unicode internally unless you carefully
ask otherwise. The conversion from ASCII to Unicode is trivial.
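
A small sketch of why that conversion is trivial, using only built-in `bytes` and `str` methods. The sample strings are made up for illustration.

```python
ascii_bytes = b"plain old ASCII words"

# Every ASCII byte sequence is also valid UTF-8; both decode to the same str.
as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
print(as_ascii == as_utf8, type(as_utf8).__name__)

# Asking "otherwise" means keeping the data as bytes, e.g. open(path, "rb").
# Going the other way fails for non-ASCII text unless told how to handle it.
try:
    "café".encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```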

As for how well regular expressions like \w work in general, I have no
idea. I am fairly sure they are more costly than the simpler ones you can
write that know just enough about what English words in ASCII look like,
and that perhaps get it wrong on some edge cases.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On
Behalf Of Edward Teach via Python-list
Sent: Tuesday, June 4, 2024 7:22 AM
To: python-list@python.org
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once

On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
Grant Edwards <grant.b.edwards@gmail.com> wrote:

> On 2024-06-03, Edward Teach via Python-list <python-list@python.org>
> wrote:
>
> > The Gutenburg Project publishes "plain text". That's another
> > problem, because "plain text" means UTF-8....and that means
> > unicode...and that means running some sort of unicode-to-ascii
> > conversion in order to get something like "words". A couple of
> > hours....a couple of hundred lines of C....problem solved!
>
> I'm curious. Why does it need to be converted frum Unicode to ASCII?
>
> When you read it into Python, it gets converted right back to
> Unicode...
>
>
>

Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........


Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Chris Angelico
Newsgroups: comp.lang.python
Date: Tue, 4 Jun 2024 22:02 UTC
References: 1 2 3 4 5 6 7 8 9

On Wed, 5 Jun 2024 at 02:49, Edward Teach via Python-list
<python-list@python.org> wrote:
>
> On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
> Grant Edwards <grant.b.edwards@gmail.com> wrote:
>
> > On 2024-06-03, Edward Teach via Python-list <python-list@python.org>
> > wrote:
> >
> > > The Gutenburg Project publishes "plain text". That's another
> > > problem, because "plain text" means UTF-8....and that means
> > > unicode...and that means running some sort of unicode-to-ascii
> > > conversion in order to get something like "words". A couple of
> > > hours....a couple of hundred lines of C....problem solved!
> >
> > I'm curious. Why does it need to be converted frum Unicode to ASCII?
> >
> > When you read it into Python, it gets converted right back to
> > Unicode...
> >
>
> Well.....when using the file linux.words as a useful master list of
> "words".....linux.words is strict ASCII........
>

Whatever gave you that idea? I have a large number of dictionaries in
/usr/share/dict, all of them encoded UTF-8 except one (and I don't
know why that is). Even the English ones aren't entirely ASCII.
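
A quick way to check a particular word list is sketched below. The helper name `non_ascii_words` is hypothetical, `str.isascii()` requires Python 3.7 or later, and the path in the comment varies by distribution.

```python
def non_ascii_words(lines):
    """Return the entries that contain at least one non-ASCII character."""
    words = (line.strip() for line in lines)
    return [w for w in words if w and not w.isascii()]

# Hypothetical usage against a system word list (path varies by distribution):
#   with open("/usr/share/dict/words", encoding="utf-8") as f:
#       print(non_ascii_words(f)[:20])

print(non_ascii_words(["apple\n", "café\n", "naïve\n", "zebra\n"]))
```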

There is no need to "convert from Unicode to ASCII", which makes no sense.

ChrisA

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: dn
Newsgroups: comp.lang.python
Organization: DWM
Date: Wed, 5 Jun 2024 04:33 UTC
References: 1 2 3 4 5

On 31/05/24 14:26, HenHanna via Python-list wrote:
> On 5/30/2024 2:18 PM, dn wrote:
>> On 31/05/24 08:03, HenHanna via Python-list wrote:
>>>
>>> Given a text file of a novel (JoyceUlysses.txt) ...
>>>
>>> could someone give me a pretty fast (and simple) Python program
>>> that'd give me a list of all words occurring exactly once?
>>>
>>>                -- Also, a list of words occurring once, twice or 3 times
>>>
>>>
>>>
>>> re: hyphenated words        (you can treat it anyway you like)
>>>
>>>         but ideally, i'd treat  [editor-in-chief]
>>>                                 [go-ahead]  [pen-knife]
>>>                                 [know-how]  [far-fetched] ...
>>>         as one unit.
>
>
>>
>> Split into words - defined as you will.
>> Use Counter.
>>
>> Show some (of your) code and we'll be happy to critique...
>
>
> hard to decide what to do with hyphens
>                and apostrophes
>              (I'd,  he's,  can't, haven't,  A's  and  B's)
>
>
> 2-step-Process
>
>           1. make a file listing all words (one word per line)
>
>           2.  then, doing the counting.  using
>                               from collections import Counter

Apologies for lateness - only just able to come back to this.

This issue is not about Python, and is not solved by code!

If you/your teacher can't define a "word", the code, any code, will
almost certainly be wrong!

One of the interesting aspects of our work is that we can write all
manner of tests to try to ensure that the code is correct: unit tests,
integration tests, system tests, acceptance tests, eye-tests, ...

However, there is no such thing as a test (or proof) that statements of
requirements are complete or correct!
(nor for any other previous stages of the full project life-cycle)

As coders we need to learn to require clear specifications and not
attempt to read-between-the-lines, use our initiative, or otherwise 'not
bother the ...'. When there is ambiguity, we should go back to the
user/client/boss and seek clarification. They are the
domain/subject-matter experts...

I'm reminded of a cartoon, possibly from some IBM source, first seen in
black-and-white but here in living-color:
https://www.monolithic.org/blogs/presidents-sphere/what-the-customer-really-wants

That has been the sad history of programming and development projects,
wherein we are blamed for every shortcoming because no one else understands
the nuances of development work.

If we don't insist on clarity, are we our own worst enemy?

--
Regards,
=dn

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Grant Edwards
Newsgroups: comp.lang.python
Date: Wed, 5 Jun 2024 15:24 UTC
References: 1 2 3 4 5 6

On 2024-06-05, dn via Python-list <python-list@python.org> wrote:

> If you/your teacher can't define a "word", the code, any code, will
> almost-certainly be wrong!

Back when I was a student...

When there was a homework/project assignment with a vague requirement
(and it wasn't practical to get the requirement refined), what always
worked for me was to put in the project report or program comments or
somewhere a statement that the requirement could be interpreted in
different ways and here is the precise interpretation of the
requirement that is being implemented.
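
As a minimal sketch of that approach applied to this thread's question,
the program below states its interpretation of "word" in the module
docstring before doing any counting. The regular expression, the
function name words_occurring_once and the sample sentence are
illustrative assumptions, not anything taken from the assignment.

```python
"""List the words that occur exactly once in a text.

Interpretation implemented here (an assumption, because the
requirement does not define "word"):
  * text is lower-cased, so "The" and "the" count as the same word;
  * a word is a run of ASCII letters, optionally joined by single
    internal hyphens or apostrophes ("pen-knife", "can't");
  * digits, non-ASCII letters and other punctuation separate words.
"""
import re
from collections import Counter

# Letters, then zero or more (hyphen-or-apostrophe + letters) groups.
WORD_RE = re.compile(r"[a-z]+(?:[-'][a-z]+)*")


def words_occurring_once(text):
    """Return the sorted list of words that appear exactly once in text."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)


sample = "The pen-knife can't cut; the go-ahead came. Can't it?"
print(words_occurring_once(sample))
```

For the sample sentence this prints
['came', 'cut', 'go-ahead', 'it', 'pen-knife']. A reader who disagrees
with the stated interpretation can at least see exactly which rule to
change.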

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Thomas Passin
Newsgroups: comp.lang.python
Date: Wed, 5 Jun 2024 11:10 UTC
References: 1 2 3 4 5 6

On 6/5/2024 12:33 AM, dn via Python-list wrote:
> On 31/05/24 14:26, HenHanna via Python-list wrote:
>> On 5/30/2024 2:18 PM, dn wrote:
>>> On 31/05/24 08:03, HenHanna via Python-list wrote:
>>>>
>>>> Given a text file of a novel (JoyceUlysses.txt) ...
>>>>
>>>> could someone give me a pretty fast (and simple) Python program
>>>> that'd give me a list of all words occurring exactly once?
>>>>
>>>>                -- Also, a list of words occurring once, twice or 3
>>>> times
>>>>
>>>>
>>>>
>>>> re: hyphenated words        (you can treat it anyway you like)
>>>>
>>>>         but ideally, i'd treat  [editor-in-chief]
>>>>                                 [go-ahead]  [pen-knife]
>>>>                                 [know-how]  [far-fetched] ...
>>>>         as one unit.
>>
>>
>>>
>>> Split into words - defined as you will.
>>> Use Counter.
>>>
>>> Show some (of your) code and we'll be happy to critique...
>>
>>
>> hard to decide what to do with hyphens
>>                 and apostrophes
>>               (I'd,  he's,  can't, haven't,  A's  and  B's)
>>
>>
>> 2-step-Process
>>
>>            1. make a file listing all words (one word per line)
>>
>>            2.  then, doing the counting.  using
>>                                from collections import Counter
>
>
> Apologies for lateness - only just able to come back to this.
>
> This issue is not Python, and is not solved by code!
>
> If you/your teacher can't define a "word", the code, any code, will
> almost-certainly be wrong!
>
>
> One of the interesting aspects of our work is that we can write all
> manner of tests to try to ensure that the code is correct: unit tests,
> integration tests, system tests, acceptance tests, eye-tests, ...
>
> However, there is no such thing as a test (or proof) that statements of
> requirements are complete or correct!
> (nor for any other previous stages of the full project life-cycle)
>
> As coders we need to learn to require clear specifications and not
> attempt to read-between-the-lines, use our initiative, or otherwise 'not
> bother the ...'. When there is ambiguity, we should go back to the
> user/client/boss and seek clarification. They are the
> domain/subject-matter experts...
>
> I'm reminded of a cartoon, possibly from some IBM source, first seen in
> black-and-white but here in living-color:
> https://www.monolithic.org/blogs/presidents-sphere/what-the-customer-really-wants

That one's been kicking around for years ... good job in finding a link
for it!

> That has been the sad history of programming and dev.projects - wherein
> we are blamed for every short-coming, because no-one else understands
> the nuances of development projects.

Of course, we see this lack of clarity all the time in questions to the
list. I often wonder how these askers can possibly come up with
acceptable code if they don't realize they don't truly know what it's
supposed to do.
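
To make the point concrete, here is a rough sketch that runs the same
count under two plausible definitions of "word" and gets two different
answers. Both patterns, the helper name count_at_most and the sample
sentence are hypothetical choices made for illustration; neither
pattern is "the" correct definition.

```python
import re
from collections import Counter

# Two hypothetical definitions of "word" (assumptions for illustration).
LETTERS_ONLY = re.compile(r"[a-z]+")                 # "pen-knife" -> "pen", "knife"
WITH_JOINERS = re.compile(r"[a-z]+(?:[-'][a-z]+)*")  # "pen-knife" stays one unit


def count_at_most(text, pattern, limit):
    """Return {word: count} for words occurring at most `limit` times."""
    counts = Counter(pattern.findall(text.lower()))
    return {word: n for word, n in counts.items() if n <= limit}


sample = "A pen and a pen-knife; he's here, he said."
for name, pattern in [("letters only", LETTERS_ONLY),
                      ("with joiners", WITH_JOINERS)]:
    # limit=1 gives the words occurring exactly once;
    # limit=3 would give the "once, twice or 3 times" list.
    print(name, sorted(count_at_most(sample, pattern, 1)))
```

Under the first rule "pen" occurs twice and "s" shows up as a word of
its own; under the second, "pen", "pen-knife" and "he's" each occur
once. Same text, same counting code, different list.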

> If we don't insist on clarity, are we our own worst enemy?
>
>

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Mats Wichmann
Newsgroups: comp.lang.python
Date: Fri, 7 Jun 2024 14:37 UTC
References: 1 2 3 4 5 6 7

On 6/5/24 05:10, Thomas Passin via Python-list wrote:

> Of course, we see this lack of clarity all the time in questions to the
> list.  I often wonder how these askers can possibly come up with
> acceptable code if they don't realize they don't truly know what it's
> supposed to do.

Fortunately, having to explain to someone else why something is giving
you trouble can help shed light on the fact that the problem statement
isn't clear, or isn't clearly understood. Sometimes, anyway (sadly,
many times it doesn't).

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Larry Martell
Newsgroups: comp.lang.python
Date: Sat, 8 Jun 2024 15:54 UTC
References: 1 2 3 4 5 6 7 8

On Sat, Jun 8, 2024 at 10:39 AM Mats Wichmann via Python-list <python-list@python.org> wrote:

> On 6/5/24 05:10, Thomas Passin via Python-list wrote:
>
> > Of course, we see this lack of clarity all the time in questions to the
> > list. I often wonder how these askers can possibly come up with
> > acceptable code if they don't realize they don't truly know what it's
> > supposed to do.
>
> Fortunately, having to explain to someone else why something is giving
> you trouble can help shed light on the fact the problem statement isn't
> clear, or isn't clearly understood. Sometimes (sadly, many times it
> doesn't).

The original question struck me as homework or an interview question for a
junior position. But having no clear requirements or specifications is good
training for the real world where that is often the case. When you question
that, you are told to just do something, and then you’re told it’s not what
is wanted. That frustrates people but it’s often part of the process.
People need to see something to help them know what they really want.


Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Stefan Ram
Newsgroups: comp.lang.python
Organization: Stefan Ram
Date: Sat, 8 Jun 2024 16:06 UTC
References: 1 2 3 4 5 6 7 8 9

Larry Martell <larry.martell@gmail.com> wrote or quoted:
>People need to see something to help them know what they really want.

|The hardest single part of building a software system is
|deciding precisely what to build.
Brooks, F.P. Jr., The Mythical Man-Month: Essays on Software
Engineering, Addison Wesley, Reading, MA, 1995, Second Edition.

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Thomas Passin
Newsgroups: comp.lang.python
Date: Sat, 8 Jun 2024 17:10 UTC
References: 1 2 3 4 5 6 7 8 9

On 6/8/2024 11:54 AM, Larry Martell via Python-list wrote:
> On Sat, Jun 8, 2024 at 10:39 AM Mats Wichmann via Python-list <
> python-list@python.org> wrote:
>
>> On 6/5/24 05:10, Thomas Passin via Python-list wrote:
>>
>>> Of course, we see this lack of clarity all the time in questions to the
>>> list. I often wonder how these askers can possibly come up with
>>> acceptable code if they don't realize they don't truly know what it's
>>> supposed to do.
>>
>> Fortunately, having to explain to someone else why something is giving
>> you trouble can help shed light on the fact the problem statement isn't
>> clear, or isn't clearly understood. Sometimes (sadly, many times it
>> doesn't).
>
>
> The original question struck me as homework or an interview question for a
> junior position. But having no clear requirements or specifications is good
> training for the real world where that is often the case. When you question
> that, you are told to just do something, and then you’re told it’s not what
> is wanted. That frustrates people but it’s often part of the process.
> People need to see something to help them know what they really want.

At the extremes, there are two kinds of approaches you are alluding to.
One is what I learned to call "rock management": "Bring me a rock ...
no, that's not the right one, bring me another ... no that's not what
I'm looking for, bring me another...". If this is your situation, so,
so sorry!

At the other end, there is a mutual evolution of the requirements
because you and your client could not have known what they should be
until you have spent effort and time feeling your way along. With the
right client and management, this kind of project can be a joy to work
on. I've been lucky enough to have worked on several projects of this kind.

In truth, there always are requirements. Often (usually?) they are not
thought out, not consistent, not articulated clearly, and not
communicated well. They may live only in the mind of one person.

Subject: RE: From JoyceUlysses.txt -- words occurring exactly once
From: <avi.e.gross@gmail.com>
Newsgroups: comp.lang.python
Date: Sat, 8 Jun 2024 18:46 UTC
References: 1 2 3 4 5 6 7 8 9 10

Agreed, Thomas.

As someone who has spent lots of time writing code OR requirements at various levels, or dealing with the bugs afterwards, I have seen a huge disconnect between the people trying to decide what to do and the people having to do it. It is not necessarily easy to come back later and ask for changes that were not anticipated in the design or implementation.

I recently wrote a program where the original specifications seemed reasonable. In one part, I was asked to make a graph with some random number (or all) of the data shown as a series of connected line segments showing values for the same entity at different measurement periods and then superimpose the mean for all the original data, not just the subsample shown. This needed to be done on multiple subsamples of the original/calculated data so I made it into a function.

One of the datasets contained a column that was either A or B, and the function was called multiple times to show how a random sample of A+B, of just A, and of just B looked when graphed, along with the mean of the specific data each sample was drawn from. But then I got a seemingly innocuous, simple request.

Could we graph A+B and overlay not only the mean for A+B, as was now done, but also the mean for A and the mean for B? Ideally, this would mean three bolder jagged lines superimposed above the plot, and it seemed simple enough.

But was it? To graph the means in the first place, I had built a more complex data structure so that, when graphed, the mean aligned well with what was below it. That was hard coded in my function, and now I needed it three times in one call. Extracting it into a new function was not trivial, as it depended initially on other things within the body of the function. But it was doable, and might have been done that way had I known such a need might arise. It often is like that when there seems no need to write a function for just one use. The main function now needed to be modified to optionally accept one or two more datasets and, if they were supplied, call the new function on each and add layers to the graph with the additional means (dashed and dotted); otherwise, the function worked as before.

But did I do it right? Well, if next time I am asked to have the data extended to have more measurements in more columns at more times, I might have to rewrite quite a bit of the code. My localized change allowed one or two additional means to be plotted. Adding an arbitrary number takes a different approach and, frankly, there are limits on how many kinds of "line" segments can be used to differentiate among them.
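
To make the refactoring concrete, here is a minimal sketch of what "extract the mean calculation and accept any number of extra groups" might look like. It is not the code from the project described above: the record layout (entity, group, period, value) and the names period_means and mean_layers are hypothetical, and the actual plotting is left out so that only the data preparation is shown.

```python
from statistics import mean

# Hypothetical record layout: (entity, group, period, value).
def period_means(records, group=None):
    """Return [(period, mean value)] sorted by period.

    If group is given, only records from that group are averaged.
    """
    by_period = {}
    for _entity, grp, period, value in records:
        if group is None or grp == group:
            by_period.setdefault(period, []).append(value)
    return [(p, mean(vs)) for p, vs in sorted(by_period.items())]

def mean_layers(records, groups=()):
    """Build the overall mean plus one overlay per requested group."""
    layers = {"all": period_means(records)}
    for g in groups:
        layers[g] = period_means(records, g)
    return layers

records = [
    ("e1", "A", 1, 2.0), ("e1", "A", 2, 4.0),
    ("e2", "B", 1, 6.0), ("e2", "B", 2, 8.0),
]
layers = mean_layers(records, groups=("A", "B"))
```

Because groups is just a sequence, the caller can ask for zero, two, or twenty overlays without touching the function again; the practical limit then becomes how many line styles a reader can tell apart, not the code.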

Enough of the example except to make a point. In some projects, it is not enough to tell a programmer what you want NOW. You may get what you want fairly quickly but if you have ideas of possible extensions or future upgrades, it would be wiser to make clear some of the goals so the programmer creates an implementation that can be more easily adjusted to do more. Such code can take longer and be more complex so it may not pay off immediately.

But, having said that, plenty of software may benefit from looking at what is happening and adjusting on the fly. Clearly my client cannot know what feedback they may get when showing an actual result to others who then suggest changes or enhancements. The results may not be anticipated so well in advance and especially not when the client has no idea what is doable and so on.

A related example was a request for how to modify a sort of Venn Diagram chart to change the font size. Why? Because some of the labels were long and the relative sizes of the pie slices were not known till an analysis of the data produced the appropriate numbers and ratios. This was a case where the documentation of the function they used did not say how to do many things, as it called a function that called others to quite some depth. A few simple experiments, plus some guesses and exploration, showed me ways to pass arguments along that were not documented but that were passed properly down the chain, and I could now change the text size and quite a few other things. But I asked myself if this was really the solution the client needed. I then made a guess at how I could get the long text wrapped into multiple lines that fit into the sections of the Venn Diagram without shrinking the text at all, or at least not as much. The client had not considered that as an option, but it suited their display better than what they had asked for. Until people see such output, unless they have lots of experience, they cannot be expected to tell you up-front what they want.
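
The message does not say which language or plotting library that chart used, so the following is only an illustration of the wrapping idea in Python, using the standard-library textwrap module. The helper name wrap_label, the width of 14 characters, and the sample label are all made up for the example.

```python
import textwrap

def wrap_label(label, width=14):
    """Break a long label into lines of at most `width` characters."""
    return "\n".join(textwrap.wrap(label, width=width))

original = "Respondents reporting both conditions"
wrapped = wrap_label(original, width=14)
print(wrapped)
```

Most plotting libraries draw embedded newlines in a label as separate lines, so a wrapped string like this can usually be passed wherever the original label went, with the font size left alone.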

One danger of languages like Python is that often people get the code you supply and modify it themselves or reuse it on some project they consider similar. That can be a good thing but often a mess as you wrote the code to do things in a specific way for a specific purpose ...
