Rocksolid Light

News from da outaworlds

mail  files  register  groups  login

Message-ID:  

Increased knowledge will help you now. Have mate's phone bugged.


comp / comp.lang.tcl / Re: Does cheating produce faster searches?

SubjectAuthor
* Does cheating produce faster searches?Luc
+- Re: Does cheating produce faster searches?Rich
`* Re: Does cheating produce faster searches?Shaun Deacon
 `* Re: Does cheating produce faster searches?Luc
  `- Re: Does cheating produce faster searches?Rich

1
Subject: Does cheating produce faster searches?
From: Luc
Newsgroups: comp.lang.tcl
Organization: A noiseless patient Spider
Date: Wed, 25 Sep 2024 20:01 UTC
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: luc@sep.invalid (Luc)
Newsgroups: comp.lang.tcl
Subject: Does cheating produce faster searches?
Date: Wed, 25 Sep 2024 17:01:49 -0300
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <20240925170149.3463bec9@lud1.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 25 Sep 2024 22:01:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="4ff023fcdb8adc874e18eb4a8b17abff";
logging-data="4012684"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1981Sg6joJkg+4G8ui1uMgHOFhHcd4HbDs="
Cancel-Lock: sha1:MqMiJcpTRJKxfBTIiQcRK2/SFXY=
View all headers

Suppose I have a large list. Very large list. Then I want to search
for an item in that string:

% lsearch $list "string"

Now, suppose I have many lists instead. One list contains all the items
that begin with the letter a, another list contains all the items that
begin with the letter b, another list contains all the items that begin
with the letter c, and so on. Then I see what the first character in
"string" is and only search for it in the one corresponding list.
Would that be faster? I somehow suspect the answer is 'no.'

Bonus question: what about sqlite operations? Would they be faster if
I had one separate table for each initial letter/character?

TIA

--
Luc
>>

Subject: Re: Does cheating produce faster searches?
From: Rich
Newsgroups: comp.lang.tcl
Organization: A noiseless patient Spider
Date: Wed, 25 Sep 2024 20:36 UTC
References: 1
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: rich@example.invalid (Rich)
Newsgroups: comp.lang.tcl
Subject: Re: Does cheating produce faster searches?
Date: Wed, 25 Sep 2024 20:36:23 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 62
Message-ID: <vd1s87$3qlog$1@dont-email.me>
References: <20240925170149.3463bec9@lud1.home>
Injection-Date: Wed, 25 Sep 2024 22:36:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="c9bb3a4d1fe083989d2cd97db187f1d3";
logging-data="4019984"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+v+D1FmeMs5xvkPiT9OVeM"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.139 (x86_64))
Cancel-Lock: sha1:F1glGq1VFjhZB5iPM7i6MMM5LvA=
View all headers

Luc <luc@sep.invalid> wrote:
> Suppose I have a large list. Very large list. Then I want to search
> for an item in that string:
>
> % lsearch $list "string"
>
> Now, suppose I have many lists instead. One list contains all the items
> that begin with the letter a, another list contains all the items that
> begin with the letter b, another list contains all the items that begin
> with the letter c, and so on. Then I see what the first character in
> "string" is and only search for it in the one corresponding list.

What you have described is, at a very high level, similar to how B-Tree
indexes work.

https://en.wikipedia.org/wiki/B-tree

And they are the common index type used in databases for faster data
retreival.

> Would that be faster? I somehow suspect the answer is 'no.'

The answer is actually "it depends". You'd have to define "very
large". For some value of "very large" the extra time spent to pick
which sublist to search will most likely be faster than doing a search
across a single list containing all the items.

But, as is always the case, the algorithm matters.

For basic 'lsearch', it does, by default, a linear search. It starts
at index 0, and looks at each item sequentially until it finds a match,
or hits the end of the list. For this algorithm, your "separate based
on first character" method will be faster for a not-so-big value of
"very large".

But, lsearch also has the "-sorted" option. For a "very large" list
that you can maintain in sorted order, using "-sorted" causes lsearch
to perform a binary search upon the list.

https://en.wikipedia.org/wiki/Binary_search

For this alternate algorithm, your "separate lists" method will need a
much larger value for "very large" vs. the linear search without
"-sorted" before the "separate lists" are faster.

However, "-sorted" now leaves you with the need to have "very large
list" be sorted. And the time needed to sort the list, for a "very
large" list, could be substantial. So you'd need, in this case, a way
to either only sort once, but use many times, or if the list is being
modified, you'd want to be careful to modify it such that it is still
sorted vs. needing to "sort" it over and over again.

> Bonus question: what about sqlite operations? Would they be faster if
> I had one separate table for each initial letter/character?

Again, it depends.

If you have no indexes on your tables, then likely yes.

If you have indexes on the tables, that can also be utilized by the
query you are running, then the size of the table will need to be much
much larger before the "split up" version might become faster.

Subject: Re: Does cheating produce faster searches?
From: Shaun Deacon
Newsgroups: comp.lang.tcl
Organization: A noiseless patient Spider
Date: Fri, 27 Sep 2024 01:46 UTC
References: 1
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: sdeacon@us.socionext.com (Shaun Deacon)
Newsgroups: comp.lang.tcl
Subject: Re: Does cheating produce faster searches?
Date: Thu, 26 Sep 2024 18:46:36 -0700
Organization: A noiseless patient Spider
Lines: 43
Message-ID: <vd52pu$ibfo$1@dont-email.me>
References: <20240925170149.3463bec9@lud1.home>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 27 Sep 2024 03:46:38 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fa498b2702092671c79dea8153b466a1";
logging-data="601592"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/t2pQMC2/he1vE/kbXtrzH"
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.19
Cancel-Lock: sha1:UnmQ+sbDyMPYfH+aa2XOgO6EdvU=
In-Reply-To: <20240925170149.3463bec9@lud1.home>
View all headers

Luc wrote:
> Suppose I have a large list. Very large list. Then I want to search
> for an item in that string:
>
> % lsearch $list "string"
>
> Now, suppose I have many lists instead. One list contains all the items
> that begin with the letter a, another list contains all the items that
> begin with the letter b, another list contains all the items that begin
> with the letter c, and so on. Then I see what the first character in
> "string" is and only search for it in the one corresponding list.
> Would that be faster? I somehow suspect the answer is 'no.'
>
> Bonus question: what about sqlite operations? Would they be faster if
> I had one separate table for each initial letter/character?
>
> TIA
>

As you've probably discovered lsearch can be slow for huge lists, and if
you're doing basic searches on a big list a lot, it can be noticeable.

Depending very much on what you want to do with the result from your
search, and speed is your primary concern, there may be another way to
approach this...

If you're just checking for whether a word exists in some word list,
have you considered creating array variables ?

foreach word $bigWordList {
set words($word) 1
} ....
set string "foo"
if { [info exists words($string)] } {
puts "$string is in my list"
} else {
puts "$string is not in my list"
}

The above test would be quick and you just build the initial 'words'
array once (and then add new words by setting new variables). Of course,
"set words($word) xxx" could be set to something other than 1...

Subject: Re: Does cheating produce faster searches?
From: Luc
Newsgroups: comp.lang.tcl
Organization: A noiseless patient Spider
Date: Fri, 27 Sep 2024 02:48 UTC
References: 1 2
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: luc@sep.invalid (Luc)
Newsgroups: comp.lang.tcl
Subject: Re: Does cheating produce faster searches?
Date: Thu, 26 Sep 2024 23:48:49 -0300
Organization: A noiseless patient Spider
Lines: 21
Message-ID: <20240926234849.1ab3ea7c@lud1.home>
References: <20240925170149.3463bec9@lud1.home>
<vd52pu$ibfo$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 27 Sep 2024 04:48:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e7b51491f24ca7095b7116c0e7352966";
logging-data="600728"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19HfO0BcxkE2NyWTXjVMuBAp9CcLLtIZDc="
Cancel-Lock: sha1:7LhL+GAlMYVLPA2SPdfbrOYpSDI=
View all headers

On Thu, 26 Sep 2024 18:46:36 -0700, Shaun Deacon wrote:

>Depending very much on what you want to do with the result from your
>search, and speed is your primary concern, there may be another way to
>approach this...
>
>If you're just checking for whether a word exists in some word list,
>have you considered creating array variables ?

Right now, at this very exact moment, I am toying with a real time
search box, and by "real time" I mean the search output changes with
every new character typed into the user input widget. But I'm always
searching for all kinds of stuff when I code. It's a very common need.

Interesting idea with the array variables. Thank you for your input.

--
Luc
>>

Subject: Re: Does cheating produce faster searches?
From: Rich
Newsgroups: comp.lang.tcl
Organization: A noiseless patient Spider
Date: Fri, 27 Sep 2024 12:44 UTC
References: 1 2 3
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: rich@example.invalid (Rich)
Newsgroups: comp.lang.tcl
Subject: Re: Does cheating produce faster searches?
Date: Fri, 27 Sep 2024 12:44:50 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 37
Message-ID: <vd69c2$nscn$1@dont-email.me>
References: <20240925170149.3463bec9@lud1.home> <vd52pu$ibfo$1@dont-email.me> <20240926234849.1ab3ea7c@lud1.home>
Injection-Date: Fri, 27 Sep 2024 14:44:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="dcce1bb59e982dc772b78d51c33e5b9d";
logging-data="782743"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18FSNWpckIW5jcRUs7tjNxZ"
User-Agent: tin/2.6.1-20211226 ("Convalmore") (Linux/5.15.139 (x86_64))
Cancel-Lock: sha1:PpJCqLLnt014W4nxrAabC56MgcM=
View all headers

Luc <luc@sep.invalid> wrote:
> On Thu, 26 Sep 2024 18:46:36 -0700, Shaun Deacon wrote:
>
>>Depending very much on what you want to do with the result from your
>>search, and speed is your primary concern, there may be another way to
>>approach this...
>>
>>If you're just checking for whether a word exists in some word list,
>>have you considered creating array variables ?
>
>
> Right now, at this very exact moment, I am toying with a real time
> search box, and by "real time" I mean the search output changes with
> every new character typed into the user input widget. But I'm always
> searching for all kinds of stuff when I code. It's a very common need.
>
> Interesting idea with the array variables. Thank you for your input.

Do note that reality is that the method that turns out to be "fastest"
is a "it depends" (and upon several variables, including at least the
size of the data you want to search).

To determine which is actually fastest, you need your data set, and you
need to do some experiments with all the default options using Tcl's
[time] command to determine how long each takes.

And, if you really want to test, you also need to test against
alternate methods of storage and search to see if one of those
alternates is actually faster. Given the 'hint' you've given above,
you might find that something like a prefix trie
(https://en.wikipedia.org/wiki/Trie), esp. one built as a C extension,
is faster than any of the default Tcl built in operators. You are
crossing over the boundary with this post from "general purpose search,
with reasonable performance" into "specialized searching, for fastest
performance". And the realm of specialization in this area can be both
wide and deep.

1

rocksolid light 0.9.8
clearnet tor