comp / comp.lang.lisp / Re: From JoyceUlysses.txt -- words occurring exactly once

Subject (Author)
* From JoyceUlysses.txt -- words occurring exactly once (HenHanna)
+- Re: From JoyceUlysses.txt -- words occurring exactly once (Jeff Barnett)
+* Re: From JoyceUlysses.txt -- words occurring exactly once (Stefan Monnier)
|`* Re: From JoyceUlysses.txt -- words occurring exactly once (Kaz Kylheku)
| `* Re: From JoyceUlysses.txt -- words occurring exactly once (Madhu)
|  `- Re: From JoyceUlysses.txt -- words occurring exactly once (steve g)
+- Re: From JoyceUlysses.txt -- words occurring exactly once (Paul Rubin)
`- Re: From JoyceUlysses.txt -- words occurring exactly once (B. Pym)

Subject: From JoyceUlysses.txt -- words occurring exactly once
From: HenHanna
Newsgroups: comp.lang.lisp, comp.lang.scheme
Organization: A noiseless patient Spider
Date: Thu, 30 May 2024 20:09 UTC
Message-ID: <v3ame4$1qf6m$5@dont-email.me>

i'd not use Gauche for this, but maybe someone can change my mind.

_______________________
From JoyceUlysses.txt -- words occurring exactly once

Given a text file of a novel (JoyceUlysses.txt) ...

could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?

-- Also, a list of words occurring once, twice or 3 times

re: hyphenated words (you can treat it anyway you like)

ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Jeff Barnett
Newsgroups: comp.lang.lisp, comp.lang.scheme
Organization: A noiseless patient Spider
Date: Thu, 30 May 2024 22:33 UTC
References: 1
Message-ID: <v3aus4$1sknf$1@dont-email.me>

On 5/30/2024 2:09 PM, HenHanna wrote:
>
> i'd not use Gauche for this, but maybe someone can change my mind.
>
>
> _______________________
> From JoyceUlysses.txt -- words occurring exactly once
>
>
> Given a text file of a novel (JoyceUlysses.txt) ...
>
> could someone give me a pretty fast (and simple) program that'd give me
> a list of all words occurring exactly once?
>
>               -- Also, a list of words occurring once, twice or 3 times
>
>
>
> re: hyphenated words        (you can treat it anyway you like)
>
>        ideally, i'd treat  [editor-in-chief]
>                            [go-ahead]  [pen-knife]
>                            [know-how]  [far-fetched] ...
>        as one unit.
Make a list (or array) of the individual words of the original document
(as strings, or as symbols in a special package), then sort it using the
Lisp-supplied sort function. You then write a loop using your favorite
tools and look for runs of identical adjacent words of the required
length: a run of length one is a word occurring exactly once. This gives
you a program that is asymptotically efficient, as the theoretical
run-time will look something like (* c N (log N)), where N is the length
of the list produced by the first step and c is some constant.

Note that any solution resembling this one is not really what you want.
For example, it would think "Snark" and "Snarks" are different words.
Some differences, such as capitalization, can be suppressed by choosing
a sort predicate that is case insensitive. You can, of course, write
your own sort predicate. The thing to note is that the predicate (the
less-than test used by sort) should not modify the words or maintain
state between invocations; otherwise, the complexity can become
arbitrarily large.
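
A minimal sketch of the sort-then-scan idea described above, written in Python rather than Lisp for brevity. The function name words_with_count, the ASCII-only tokenizer, and the choice to keep internal hyphens are assumptions made for illustration, not something specified in this post.

```python
import re
from itertools import groupby


def words_with_count(text, lo=1, hi=1):
    """Return the words whose number of occurrences lies in [lo, hi]."""
    # Assumed tokenizer: runs of ASCII letters, optionally joined by single
    # hyphens, so "editor-in-chief" stays one word. Lowercasing plays the
    # role of the case-insensitive predicate.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    words.sort()  # the O(N log N) step
    result = []
    # After sorting, equal words are adjacent; groupby yields one run per word.
    for word, run in groupby(words):
        n = sum(1 for _ in run)
        if lo <= n <= hi:
            result.append(word)
    return result


if __name__ == "__main__":
    sample = "the snark and the snarks met an editor-in-chief and a snark"
    print(words_with_count(sample))        # words occurring exactly once
    print(words_with_count(sample, 1, 3))  # words occurring 1 to 3 times
```

As the post warns, "snark" and "snarks" are still counted as different words here; merging them would need stemming, which this sketch does not attempt.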
--
Jeff Barnett

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Stefan Monnier
Newsgroups: comp.lang.lisp, comp.lang.scheme
Organization: A noiseless patient Spider
Date: Thu, 30 May 2024 22:45 UTC
References: 1
Message-ID: <jwvzfs6ncq0.fsf-monnier+comp.lang.lisp@gnu.org>

> Given a text file of a novel (JoyceUlysses.txt) ...
> could someone give me a pretty fast (and simple) program that'd give me
> a list of all words occurring exactly once?

tr ' .;:,?!' '\n' | sort | uniq -u

?

- Stefan

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Kaz Kylheku
Newsgroups: comp.lang.lisp, comp.lang.scheme
Organization: A noiseless patient Spider
Date: Thu, 30 May 2024 23:20 UTC
References: 1 2
Message-ID: <20240530161942.627@kylheku.com>

On 2024-05-30, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>> Given a text file of a novel (JoyceUlysses.txt) ...
>> could someone give me a pretty fast (and simple) program that'd give me
>> a list of all words occurring exactly once?
>
> tr ' .;:,?!' '\n' | sort | uniq -u

Yep, that's pretty much how Doug McIlroy famously shut down Knuth.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Paul Rubin
Newsgroups: comp.lang.lisp, comp.lang.scheme
Organization: A noiseless patient Spider
Date: Fri, 31 May 2024 07:40 UTC
References: 1
Message-ID: <878qzqtoms.fsf@nightsong.com>

> could someone give me a pretty fast (and simple) program that'd give
> me a list of all words occurring exactly once?

To first approximation, this works for me (bash command):

tr -c "[a-zA-Z-]" "\n" < ulysses.txt |sort|uniq -c|sort -n
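
The pipeline above prints every word with its count, sorted by count. A rough Python sketch of the same idea that keeps only the words seen at most three times might look like the following. The name rare_words, the max_count parameter, and the decision to split on runs of two or more hyphens (so a dash such as "way--never" separates two words) are assumptions for illustration, not part of the command above.

```python
import re
from collections import Counter


def rare_words(text, max_count=3):
    """Map each word occurring at most max_count times to its count."""
    # Like tr -c: any character that is not an ASCII letter or hyphen
    # separates words. Runs of two or more hyphens are also treated as
    # separators (an assumption, so dashes do not glue words together).
    tokens = re.split(r"[^A-Za-z-]+|-{2,}", text)
    # Drop stray leading/trailing hyphens and fold case before counting.
    counts = Counter(t.strip("-").lower() for t in tokens)
    counts.pop("", None)  # empty tokens come from adjacent separators
    return {w: n for w, n in counts.items() if n <= max_count}


if __name__ == "__main__":
    sample = "Way--never! The go-ahead came; the pen-knife, the pen-knife... the end."
    for word, n in sorted(rare_words(sample).items(), key=lambda kv: kv[1]):
        print(n, word)
```

Passing max_count=1 gives only the words occurring exactly once.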

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: B. Pym
Newsgroups: comp.lang.lisp, comp.lang.scheme
Organization: A noiseless patient Spider
Date: Fri, 31 May 2024 10:13 UTC
References: 1
Message-ID: <v3c7st$26biv$1@dont-email.me>

On 5/30/2024, HenHanna wrote:

>
> i'd not use Gauche for this, but maybe someone can change my mind.
>
>
> _______________________
> From JoyceUlysses.txt -- words occurring exactly once
>
>
> Given a text file of a novel (JoyceUlysses.txt) ...
>
> could someone give me a pretty fast (and simple) program that'd give me a list of all words occurring exactly once?
>
> -- Also, a list of words occurring once, twice or 3 times
>
>
>
> re: hyphenated words (you can treat it anyway you like)
>
> ideally, i'd treat [editor-in-chief]
> [go-ahead] [pen-knife]
> [know-how] [far-fetched] ...
> as one unit.

Gauche Scheme

(use file.util) ;; file->string
(use srfi-13)   ;; string-tokenize
(use srfi-14)   ;; character sets

;; Word counts, keyed by upcased word.
(define h (make-hash-table 'string=?))

(dolist
    (s
     (string-tokenize (file->string "Alice.txt")
                      (char-set-adjoin char-set:letter #\-)))
  ;; Strip leading and trailing hyphens, then bump the word's count.
  (hash-table-update! h
    (regexp-replace* (string-upcase s) #/^-+/ "" #/-+$/ "")
    (pa$ + 1) 0))

;; Keep the words that occur once or twice.
(filter (lambda (kv) (< (cdr kv) 3))
        (hash-table->alist h))

===>

(("LASTED" . 2) ("WAY--NEVER" . 1) ("VISIT" . 1) ("CHANCED" . 1)
("WILDLY" . 2) ("BEHEAD" . 1) ("PROMISE" . 1) ("MEANWHILE" . 1)
("ENGAGED" . 1) ("KNIFE" . 2) ("ROARED" . 1) ("RETIRE" . 1)
("BLACKING" . 1) ("HATED" . 1) ("BRIGHT-EYED" . 1)
("SHEEP-BELLS" . 1) ("PROTECTION" . 1) ("CRIES" . 1) ("ADA" . 1)
("ENJOY" . 1) ("WRITHING" . 1) ("RAW" . 1) ("APPEALED" . 1)
("RELIEVED" . 1) ("CHILDHOOD" . 1) ("WEPT" . 1) ("RACE-COURSE" . 1)
("THEIRS" . 1) ("MAD--AT" . 1) ("SPOKEN" . 1) ("PENCILS" . 1)
("CLEAR" . 2) ("TREADING" . 2) ("RETURNED" . 2) ("CHERRY-TART" . 1)
("UNEASY" . 1) ("LOW-SPIRITED" . 1) ("BONE" . 1) ("PROMISED" . 1)
("HAPPENING" . 1) ("OYSTER" . 1) ("PATIENTLY" . 2) ("NEEDS" . 1)
("LESSON-BOOK" . 1) ("PITIED" . 1) ("UNCOMFORTABLY" . 1)
("ANTIPATHIES" . 1) ("PICTURED" . 1) ("DESPERATE" . 1)
("ENGRAVED" . 1)
...
)

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: Madhu
Newsgroups: comp.lang.lisp
Organization: Motzarella
Date: Sat, 8 Jun 2024 16:47 UTC
References: 1 2 3
Message-ID: <m3zfrv5qll.fsf@leonis4.robolove.meer.net>

* Kaz Kylheku <20240530161942.627@kylheku.com> :
Wrote on Thu, 30 May 2024 23:20:08 -0000 (UTC):

> On 2024-05-30, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
>>> Given a text file of a novel (JoyceUlysses.txt) ...
>>> could someone give me a pretty fast (and simple) program that'd give me
>>> a list of all words occurring exactly once?
>>
>> tr ' .;:,?!' '\n' | sort | uniq -u
>
> Yep, that's pretty much how Doug McIlroy famously shut down Knuth.

https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf

(how do you cite this?)

Knuth didn't invent the "hash trie" data structure for this article; it
was already there in TeX. In the article Knuth credits Frank Liang's
PhD thesis for the data structure.

This was one of the first things I coded up at the time of the
article. The fun was in designing how to best modify the structure
without sacrificing space.

Phil Bagwell's paper "Ideal Hash Trees" described its invention
correctly as Hash Array Mapped Tries. However, at some point (probably
after it was picked up by Clojure developers with "functional"
pretensions?) the term "hash trie" was appropriated to mean something
else, something "immutable" and all that.

At least there isn't a wiki page for it.

Subject: Re: From JoyceUlysses.txt -- words occurring exactly once
From: steve g
Newsgroups: comp.lang.lisp
Date: Sun, 11 Aug 2024 22:34 UTC
References: 1 2 3 4
Message-ID: <87le123cze.fsf@gmail.com>

Madhu <enometh@meer.net> writes:

> * Kaz Kylheku <20240530161942.627@kylheku.com> :
> Wrote on Thu, 30 May 2024 23:20:08 -0000 (UTC):
>
> > On 2024-05-30, Stefan Monnier <monnier@iro.umontreal.ca> wrote:
> >>> Given a text file of a novel (JoyceUlysses.txt) ...
> >>> could someone give me a pretty fast (and simple) program that'd give me
> >>> a list of all words occurring exactly once?
> >>
> >> tr ' .;:,?!' '\n' | sort | uniq -u
> >
> > Yep, that's pretty much how Doug McIlroy famously shut down Knuth.
>
> https://www.cs.tufts.edu/~nr/cs257/archive/don-knuth/pearls-2.pdf
>
> (how do you cite this?)

you would think the university would have figured this out.

http://www.cs.tufts.edu/comp/250NN/neuralfaq.html
https://www-cs-faculty.stanford.edu/~knuth/lp.html

like I said before, an FTP server can be useful. imagine having to do
your assignments with a web browser or even worse: email. poor children.
