Thread overview, by author (every reply has the subject "Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]"):
* Kenny McCormack
+* Kaz Kylheku
|`* Janis Papanagnou
| +* Kenny McCormack
| |`* Janis Papanagnou
| | `* Kenny McCormack
| |  `* Janis Papanagnou
| |   `* Kenny McCormack
| |    `- Janis Papanagnou
| `- Kaz Kylheku
`* Arti F. Idiot
 `* Kenny McCormack
  `* Kaz Kylheku
   +* Ben Bacarisse
   |`* Kaz Kylheku
   | `* Ben Bacarisse
   |  `* Kaz Kylheku
   |   `- Ben Bacarisse
   `- Janis Papanagnou

Subject: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kenny McCormack
Newsgroups: comp.unix.shell
Organization: The official candy of the new Millennium
Date: Mon, 22 Jul 2024 21:59 UTC
Message-ID: <v7mknf$3plab$1@news.xmission.com>

Note: this is just a question of aesthetics. Functionally, it all works as
expected.

Sample bash code:

f="$(fortune)" # Get some multi-line output into "f"
# Look for foo followed by bar on the same line
[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"

The point is you need the "anything other than a newline" or else it might
match foo on one line and bar on a later line. The above is the only way I
could figure out to express a newline in the particular flavor of reg exps
used by the =~ operator.

The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".

Is there any way to get this?
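
For readers following along, here is a minimal sketch of the behaviour the sample code guards against. It assumes bash and uses a made-up two-line string in place of the fortune output; the variable names are invented for the illustration.

```bash
# Hypothetical two-line input standing in for the output of fortune.
text=$'foo on line one\nbar on line two'

# "." in the POSIX ERE used by =~ also matches a newline,
# so this pattern matches across the line break.
if [[ $text =~ foo.*bar ]]; then
    across=yes
else
    across=no
fi

# Excluding the newline keeps the match on a single line.
if [[ $text =~ foo[^$'\n']*bar ]]; then
    same_line=yes
else
    same_line=no
fi

printf 'pattern with ".*" matched:            %s\n' "$across"
printf 'pattern excluding the newline matched: %s\n' "$same_line"
```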

--
Shikata ga nai...

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kaz Kylheku
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Mon, 22 Jul 2024 22:47 UTC
References: 1
Message-ID: <20240722153843.823@kylheku.com>

On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
> The problem is that if the above is in a function, when you list out the
> function with "type funName", the \n has already been digested and
> converted to a hard newline. This makes the listing look strange. I'd
> rather see "\n".

I see what you mean:

$ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ set | grep -A 4 '^test'
test ()
{
    [[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}

> Is there any way to get this?

Patch Bash so that when it's listing code, any items that need '...'
quoting and that contain control characters are printed as $'...'
syntax with escape sequences.

Someone who had their original code as '
' might not want that. It has to be an option.

If Bash stored a bit in the code indicating "this word was produced
using $ syntax", then it could be recovered accordingly.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Arti F. Idiot
Newsgroups: comp.unix.shell
Organization: Anarchists of America
Date: Tue, 23 Jul 2024 04:00 UTC
References: 1
Message-ID: <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>

On 7/22/24 3:59 PM, Kenny McCormack wrote:
> Note: this is just a question of aesthetics. Functionally, it all works as
> expected.
>
> Sample bash code:
>
> f="$(fortune)" # Get some multi-line output into "f"
> # Look for foo followed by bar on the same line
> [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
>
> The point is you need the "anything other than a newline" or else it might
> match foo on one line and bar on a later line. The above is the only way I
> could figure out to express a newline in the particular flavor of reg exps
> used by the =~ operator.
>
> The problem is that if the above is in a function, when you list out the
> function with "type funName", the \n has already been digested and
> converted to a hard newline. This makes the listing look strange. I'd
> rather see "\n".
>
> Is there any way to get this?
>

Not sure this really addresses your 'type funcName' query but maybe
somewhat better output from 'type funcName' ? :

...
regex=$(printf 'foo[^$\n]*bar')
[[ "$f" =~ $regex ]] && echo "foo bar"

Kind of wish the regex string could be bracketed by "/" as in awk.
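
One caveat with the printf-built pattern above, offered as an observation rather than a statement of the poster's intent: inside single quotes the `$` is literal, so it lands in the bracket expression and the class excludes dollar signs as well as newlines. A sketch comparing it with a version that drops the `$` (the variable names and the sample string are made up here):

```bash
with_dollar=$(printf 'foo[^$\n]*bar')   # class excludes "$" and newline
regex=$(printf 'foo[^\n]*bar')          # class excludes newline only

# Hypothetical single-line input containing a dollar sign.
line='foo costs $5 at the bar'

if [[ $line =~ $with_dollar ]]; then r1=match; else r1='no match'; fi
if [[ $line =~ $regex ]]; then r2=match; else r2='no match'; fi

printf 'dollar sign in the class: %s\n' "$r1"
printf 'newline only:             %s\n' "$r2"
```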

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kenny McCormack
Newsgroups: comp.unix.shell
Organization: The official candy of the new Millennium
Date: Tue, 23 Jul 2024 05:33 UTC
References: 1 2
Message-ID: <v7nfbb$3q3of$1@news.xmission.com>

In article <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>,
Arti F. Idiot <addr@is.invalid> wrote:
....
>Not sure this really addresses your 'type funcName' query but maybe
>somewhat better output from 'type funcName' ? :
>
> ...
> regex=$(printf 'foo[^$\n]*bar')
> [[ "$f" =~ $regex ]] && echo "foo bar"

Yes. I think there are actually some other situations like this (i.e.,
issues involving using =~) - where putting the reg exp into a variable and
then using the variable works better. It seems to be a common solution.
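
A sketch of that variable-based approach applied to the original listing problem. It assumes bash; `nl` and `same_line` are names invented for the example. Because the function body refers to `$nl` rather than containing `$'\n'`, the listing should show the variable reference instead of a hard newline.

```bash
nl=$'\n'    # defined outside the function, so the listing only shows $nl

same_line() {
    # Succeeds when foo is followed by bar on the same line of $1.
    [[ $1 =~ foo[^$nl]*bar ]]
}

if same_line $'foo\nbar'; then echo "unexpected match across lines"; fi
if same_line 'foo and bar'; then echo "foo bar"; fi

# Inspect how bash lists the function.
declare -f same_line
```

The trade-off is that the function now depends on a variable defined elsewhere. Moving the assignment inside the function (for example as a `local`) would bring the hard newline back into the listing, just on a different line.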

>Kind of wish the regex string could be bracketed by "/" as in awk.

Yes. I've often wished the same. It is annoying that you have to escape
any spaces in the regexp (since it isn't delimited, like reg exps are in
most other languages [not just AWK]).

Incidentally, here's another situation - involving using $'' but not
involving =~. In another script, I have:

ctrla=$'\001'

Then I use $ctrla thereafter. But when the function is listed, the above
line comes out as:

ctrla=$''

Unless I pipe it into "less", in which case it displays as:

ctrla=$'^A'

(with the ^A in reverse video). The point being that, as before with \n,
there is a hard 001 character in there, not a graphic representation of it
(as there should, IMHO, be).

Agreeing with what Kaz wrote, I'm not objecting to there being hard
characters in the internal representation of the code, but rather, I am
saying that when it is displayed (e.g., by the "type" command), it should be
rendered in a visible way.
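
Until the listing itself does that, one workaround is to render the value on demand. A minimal sketch, assuming bash's `printf %q`, which quotes its argument in a form that can be reused as shell input; the exact spelling it picks for a control character is left to bash. The names `quoted` and `copy` are invented for the example.

```bash
ctrla=$'\001'

# Print the value in a reusable, visible form instead of a raw control byte.
printf 'ctrla=%q\n' "$ctrla"

# The quoted form round-trips: evaluating it yields the original value.
quoted=$(printf '%q' "$ctrla")
eval "copy=$quoted"
if [[ $copy == "$ctrla" ]]; then echo "round trip ok"; fi
```

For a whole listing, piping the output of "type funName" through `cat -v` (where that option is available) shows control characters such as 001 as ^A, though it leaves newlines alone.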

And, yet, changing gears once again, I don't quite understand why you can't
write [\n] with =~. You have to write [$'\n']. It's not like that in most
other languages (e.g., AWK).

Which all kind of echoes back to the other recent thread in this NG about
regular expressions vs. globs. The cold hard fact is that there really is
no such thing as "regular expressions" (*), since every language, every
program, every implementation of them, is quite different.

(*) As an abstract concept, separate from any specific implementation.

--
Trump - the President for the rest of us.

https://www.youtube.com/watch?v=JSkUJKgdcoE

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Janis Papanagnou
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Tue, 23 Jul 2024 09:48 UTC
References: 1 2
Message-ID: <v7nu8t$15bon$1@dont-email.me>

On 23.07.2024 00:47, Kaz Kylheku wrote:
> On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
>> The problem is that if the above is in a function, when you list out the
>> function with "type funName", the \n has already been digested and
>> converted to a hard newline. This makes the listing look strange. I'd
>> rather see "\n".
>
> I see what you mean:
>
> $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
> $ set | grep -A 4 '^test'
> test ()
> {
> [[ "$f" =~ foo[^'
> ']*bar ]] && echo "foo bar"
> }
>
>> Is there any way to get this?

Of course (and out of curiosity) I tried that display detail as well
in Kornshell to see how it behaves, and using a different command to
display it...

With my (old?) bash:

$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f ()
{
    [[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}

The same with ksh:

$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }

And for good measure also in zsh:

% f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
% typeset -f f
f () {
[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
}

Both seem to show "better aesthetics". Too bad it doesn't help for
your bash context.

Janis

> [...]

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kenny McCormack
Newsgroups: comp.unix.shell
Organization: The official candy of the new Millennium
Date: Tue, 23 Jul 2024 11:46 UTC
References: 1 2 3
Message-ID: <v7o56b$3qeeq$1@news.xmission.com>

In article <v7nu8t$15bon$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
....
>Both (ksh & zsh) seem to show "better aesthetics".

Indeed, it does. That is how it should work.

>Too bad it doesn't help for your bash context.

Alas, it doesn't.

--
In American politics, there are two things you just don't f*ck with:

1) Goldman Sachs
2) The military/industrial complex

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Janis Papanagnou
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Tue, 23 Jul 2024 14:44 UTC
References: 1 2 3 4
Message-ID: <v7ofkl$18d66$1@dont-email.me>

On 23.07.2024 13:46, Kenny McCormack wrote:
> In article <v7nu8t$15bon$1@dont-email.me>,
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> ...
>> Both (ksh & zsh) seem to show "better aesthetics".
>
> Indeed, it does. That is how it should work.

BTW, it's interesting that bash and zsh both reformat (sort
of pretty-print) the code (when using 'typeset -f'), only
that zsh keeps that literal '\n'. This may show a way (by
zsh example) how to follow Kaz' suggestion of patching the
bash. (But, frankly, I'm not sure it was meant seriously.)

But ksh displays it as it had been typed in; a raw format.
If you define your function, say, as multi-line code you
also see it that way, there's no processing at that point
(or the original retained as copy). I didn't expect that.

Janis

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kenny McCormack
Newsgroups: comp.unix.shell
Organization: The official candy of the new Millennium
Date: Tue, 23 Jul 2024 16:13 UTC
References: 1 2 3 4
Message-ID: <v7okrk$3qkbf$1@news.xmission.com>

In article <v7ofkl$18d66$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>On 23.07.2024 13:46, Kenny McCormack wrote:
>> In article <v7nu8t$15bon$1@dont-email.me>,
>> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>> ...
>>> Both (ksh & zsh) seem to show "better aesthetics".
>>
>> Indeed, it does. That is how it should work.
>
>BTW, it's interesting that bash and zsh both reformat (sort
>of pretty-print) the code (when using 'typeset -f'), only
>that zsh keeps that literal '\n'. This may show a way (by
>zsh example) how to follow Kaz' suggestion of patching the
>bash. (But, frankly, I'm not sure it was meant seriously. (see ** below))

Yes. ksh seems to dump it out literally as is (as it was typed), but bash
(and, I guess also zsh - I have zero knowledge or experience of zsh) pretty
prints it. But it seems zsh does a prettier print than bash.

One annoying thing that bash does is put semicolons on the end of
(almost) every line. I have, on occasion, had to recover a function from
the bash pretty print (*), and one of the things that needs to be done is
to remove those extraneous semicolons.

(*) BTW, the command I use is "type". I.e., "type funName" displays the
function definition of function funName. That seems to be the same as your
use of "typeset".

>But ksh displays it as it had been typed in; a raw format.
>If you define your function, say, as multi-line code you
>also see it that way, there's no processing at that point
>(or the original retained as copy). I didn't expect that.

Yep. Note also that bash reformats something like:

cmd1 &&
cmd2 &&
cmd3

to:

cmd1 && cmd2 && cmd3

which is annoying.

(**) I've hacked the bash source code for less. So, yeah, it is possible.

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/ThePublicGood

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Janis Papanagnou
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Tue, 23 Jul 2024 16:48 UTC
References: 1 2 3 4 5
Message-ID: <v7omtd$19ng6$1@dont-email.me>

On 23.07.2024 18:13, Kenny McCormack wrote:
> In article <v7ofkl$18d66$1@dont-email.me>,
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>> On 23.07.2024 13:46, Kenny McCormack wrote:
>
> One annoying thing that bash does is put semicolons on the end of
> (almost) every line.

Ouch!

> I have, on occasion, had to recover a function from
> the bash pretty print (*), and one of the things that needs to be done is
> to remove those extraneous semicolons.
>
> (*) BTW, the command I use is "type". I.e., "type funName" displays the
> function definition of function funName. That seems to be the same as your
> use of "typeset".

I started tests with 'type' but the result was something undesirable
(forgot already what it was), so I tried the 'typeset -f' which had
better results (with ksh, zsh, at least).

Actually I was just playing around, since your post made me curious.
(I almost never inspect function definitions using one method or the
other. The interesting functions are non-trivial and already tested,
so interactively looking them up makes no sense for me. And other
functions are part of shell programs, either monolithic or used as
lib.) But as a side-effect of my tries I noticed another bug in the
ksh93u+m shell that I'm using. :-/ (But I'm digressing.)

>
>> But ksh displays it as it had been typed in; a raw format.
>> If you define your function, say, as multi-line code you
>> also see it that way, there's no processing at that point
>> (or the original retained as copy). I didn't expect that.
>
> Yep. Note also that bash reformats something like:
>
> cmd1 &&
> cmd2 &&
> cmd3
>
> to:
>
> cmd1 && cmd2 && cmd3
>
> which is annoying.

Indeed. It reminds me of the philosophy that I often noticed in MS (and
nowadays also in Linux software, sadly) contexts; they seem to think
their auto-changes are better than the intention of the programmer.

>
> (**) I've hacked the bash source code for less. So, yeah, it is possible.

Ah, okay. (Would not be my preferred way. :-)

Janis

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kenny McCormack
Newsgroups: comp.unix.shell
Organization: The official candy of the new Millennium
Date: Tue, 23 Jul 2024 17:15 UTC
References: 1 2 3 4
Message-ID: <v7ooft$3qm6t$1@news.xmission.com>

In article <v7omtd$19ng6$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
....
>Indeed. It reminds me of the philosophy that I often noticed in MS (and
>nowadays also in Linux software, sadly) contexts; they seem to think
>their auto-changes are better than the intention of the programmer.

The overall plan is to turn programming into a minimum wage job. That's
why they are starting to call it "coding" and make it sound like something
anybody can do.

So, they have to take as much as possible of the choice/initiative out of it.
Make it the modern equivalent of a factory job.

--
After Using Gender Slur Against AOC, GOP Rep. Yoyo Won't Apologize 'For Loving God'.

That's so sweet...

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kaz Kylheku
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Tue, 23 Jul 2024 18:20 UTC
References: 1 2 3
Message-ID: <20240723110919.974@kylheku.com>

On 2024-07-23, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> On 23.07.2024 00:47, Kaz Kylheku wrote:
>> On 2024-07-22, Kenny McCormack <gazelle@shell.xmission.com> wrote:
>>> The problem is that if the above is in a function, when you list out the
>>> function with "type funName", the \n has already been digested and
>>> converted to a hard newline. This makes the listing look strange. I'd
>>> rather see "\n".
>>
>> I see what you mean:
>>
>> $ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
>> $ set | grep -A 4 '^test'
>> test ()
>> {
>> [[ "$f" =~ foo[^'
>> ']*bar ]] && echo "foo bar"
>> }
>>
>>> Is there any way to get this?
>
> Of course (and out of curiosity) I tried that display detail as well
> in Kornshell to see how it behaves, and using a different command to
> display it...
>
>
> With my (old?) bash:
>
> $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
> $ typeset -f f
> f ()
> {
> [[ "$f" =~ foo[^'
> ']*bar ]] && echo "foo bar"
> }
>
>
> The same with ksh:
>
> $ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
> $ typeset -f f
> f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
>
>
> And for good measure also in zsh:
>
> % f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
> % typeset -f f
> f () {
> [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
> }

It bolsters the argument that Bash could use a fix to be this
way also.

zsh preserves the original syntax. So it is saving information
in the stored code about how the datum was represented in
the source code:

% f() { [[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar" ; }
sun-go% typeset -f f
f () {
[[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}

I can understand why an implementor wouldn't want to save this.

If the code that we see in "typeset" is the actual code that
executes, it means that in the $'...' case, zsh has to process
the escape sequences, whereas bash has expanded them out upfront.

If the code that we see in "typeset" is not the actual code
that executes, then that requires extra storage. The Bash
project might be reluctant to imitate that strategy.

Oh look, zsh preserves comments:

sun-go% f() { # f function
function> :
function> }
sun-go% typeset -f f
f () {
# f function
:
}

I doubt that when f is called, it's actually dealing with the
lexical details any more like comments; it's storing some
compiled version of the code along with the source, more likely.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kaz Kylheku
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Tue, 23 Jul 2024 18:34 UTC
References: 1 2 3
Message-ID: <20240723112050.105@kylheku.com>

On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
> Which all kind of echoes back to the other recent thread in this NG about
> regular expressions vs. globs. The cold hard fact is that there really is
> no such thing as "regular expressions" (*), since every language, every
> program, every implementation of them, is quite different.
>
> (*) As an abstract concept, separate from any specific implementation.

Yes, there are regular expressions as an abstract concept. They are part
of the theory of automata. Much of the research went on up through the
1960's. The * operator is called the "Kleene star".
https://en.wikipedia.org/wiki/Kleene_star

In the old math/CS papers about regular expressions, regular expressions
are typically represented in terms of some input symbol alphabet
(usually just letters a, b, c ...) and only the operators | and *,
and parentheses (other than when advanced operators are being discussed,
like intersection and complement, which are not easily constructed from
these.)

I think character classes might have been a pragmatic invention in
regex implementations. The theory doesn't require [a-c] because
that can be encoded as (a|b|c).
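
A minimal sketch of that equivalence, assuming Python's re module as the
test bed (an assumption made for illustration; nothing in this thread uses
it). The names CLASS_FORM, ALTERNATION_FORM and same_verdict are
hypothetical. Both patterns are asked whether a whole string belongs to
the language, which is the closest practical analogue of set membership.

```python
import re

# Two spellings of the same three-symbol language {a, b, c}.
CLASS_FORM = re.compile(r"[a-c]")
ALTERNATION_FORM = re.compile(r"(?:a|b|c)")


def same_verdict(text):
    """True when both patterns agree on whether text is in the language."""
    in_class = CLASS_FORM.fullmatch(text) is not None
    in_alternation = ALTERNATION_FORM.fullmatch(text) is not None
    return in_class == in_alternation


# A few members and non-members, chosen arbitrarily for the illustration.
samples = ["a", "b", "c", "d", "", "ab"]
for s in samples:
    print(repr(s), CLASS_FORM.fullmatch(s) is not None, same_verdict(s))
```

The character class is only a shorthand here: the two patterns should
agree on every sample, members and non-members alike.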

The ? operator is not required because (R)? can be written (R)(R)*.

Escaping is not required because the operators and input symbols are
distinct; the idea that ( could be an input symbol is something that
occurs in implementations, not in the theory.

Regex implementors take the theory and adjust it to taste,
and add necessary details such as character escape sequences for
control characters, and escaping to allow the operator characters
themselves to be matched. Plus character classes, with negation
and ranges and all that.

Not all implementations follow solid theory. For instance, the branch
operator | is supposed to be commutative. There is no difference
between R1|R2 and R2|R1. But in many implementations (particularly
backtracking ones like PCRE and similar), there is a difference: these
implementations implement R1|R2|R3 by trying the expressions in left to
right order and stop at the first match.

This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Ben Bacarisse
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Tue, 23 Jul 2024 23:51 UTC
References: 1 2 3 4
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
Date: Wed, 24 Jul 2024 00:51:44 +0100
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <87y15r650v.fsf@bsb.me.uk>
References: <v7mknf$3plab$1@news.xmission.com>
<v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
<v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Date: Wed, 24 Jul 2024 01:51:44 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="cecbc031db9d83bdacabd1935f084c00";
logging-data="1491887"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+c5SUt13BffVvMg4jsBCb1PGo1A6rnrhs="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:f5GPZzSOEjhV5ilsrVf4RYCYVHU=
sha1:gPAepOMQC9xXgYuj6LXscp3o3hs=
X-BSB-Auth: 1.41579f614674024aba7e.20240724005144BST.87y15r650v.fsf@bsb.me.uk

Kaz Kylheku <643-408-1753@kylheku.com> writes:

> On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
>> Which all kind of echoes back to the other recent thread in this NG about
>> regular expressions vs. globs. The cold hard fact is that there really is
>> no such thing as "regular expressions" (*), since every language, every
>> program, every implementation of them, is quite different.
>>
>> (*) As an abstract concept, separate from any specific implementation.
>
> Yes, there are regular expressions as an abstract concept. They are part
> of the theory of automata. Much of the research went on up through the
> 1960's. The * operator is called the "Kleene star".
> https://en.wikipedia.org/wiki/Kleene_star
>
> In the old math/CS papers about regular expressions, regular expressions
> are typically represented in terms of some input symbol alphabet
> (usually just letters a, b, c ...) and only the operators | and *,
> and parentheses (other than when advanced operators are being discussed,
> like intersection and complement, which are not easily constructed from
> these.)
>
> I think character classes might have been a pragmatic invention in
> regex implementations. The theory doesn't require [a-c] because
> that can be encoded as (a|b|c).
>
> The ? operator is not required because (R)? can be written (R)(R)*.

(Aside: the choice is arbitrary but + would be a more "Unixy" choice for
that operator.)

> Escaping is not required because the operators and input symbols are
> distinct; the idea that ( could be an input symbol is something that
> occurs in implementations, not in the theory.
>
> Regex implementors take the theory and adjust it to taste,
> and add necessary details such as character escape sequences for
> control characters, and escaping to allow the operator characters
> themselves to be matched. Plus character classes, with negation
> and ranges and all that.
>
> Not all implementations follow solid theory. For instance, the branch
> operator | is supposed to be commutative. There is no difference
> between R1|R2 and R2|R1. But in many implementations (particularly
> backtracking ones like PCRE and similar), there is a difference: these
> implementations implement R1|R2|R3 by trying the expressions in left to
> right order and stop at the first match.
>
> This matters when regexes are used for matching a prefix of the input;
> if the regex is interpreted according to the theory, it should match
> the longest possible prefix; it cannot ignore R3, which matches
> thousands of symbols, because R2 matched three symbols.

This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.

--
Ben.

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kaz Kylheku
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Wed, 24 Jul 2024 03:25 UTC
References: 1 2 3 4 5
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 643-408-1753@kylheku.com (Kaz Kylheku)
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[
... =~~ ... ]]
Date: Wed, 24 Jul 2024 03:25:14 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 26
Message-ID: <20240723202055.122@kylheku.com>
References: <v7mknf$3plab$1@news.xmission.com>
<v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
<v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
<87y15r650v.fsf@bsb.me.uk>
Injection-Date: Wed, 24 Jul 2024 05:25:15 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="22dbec4c97aef40d7cd38abdf24b02a3";
logging-data="1678613"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/RTAadVYMY9jMFzm8FBh2wGMXl6+QZY1U="
User-Agent: slrn/pre1.0.4-9 (Linux)
Cancel-Lock: sha1:4dmG6yQqjtqTCLIjPaXrM8XVezg=

On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
> Kaz Kylheku <643-408-1753@kylheku.com> writes:
>> This matters when regexes are used for matching a prefix of the input;
>> if the regex is interpreted according to the theory, it should match
>> the longest possible prefix; it cannot ignore R3, which matches
>> thousands of symbols, because R2 matched three symbols.
>
> This is more a consequence of the different views. In the formal
> theory there is no notion of "matching". Regular expressions define
> languages (i.e. sets of sequences of symbols) according to a recursive
> set of rules. The whole idea of an RE matching a string is from their
> use in practical applications.

Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
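
A minimal sketch of that question, assuming Python's re module purely as
a membership oracle (an assumption for illustration; the function name
longest_prefix_in_language and the sample expression cat|catalog are
hypothetical). Each prefix is tested for membership in the language as a
whole string, so the order of the branches cannot influence the answer.

```python
import re


def longest_prefix_in_language(pattern, text):
    """Return the longest prefix of text that is a member of the language
    denoted by pattern, or None if no prefix (not even "") is a member."""
    # Try prefixes from longest to shortest; fullmatch asks only
    # "is this whole string in the set?", never "where did a match stop?".
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]) is not None:
            return text[:end]
    return None


text = "catalogue"
print(longest_prefix_in_language(r"cat|catalog", text))
print(longest_prefix_in_language(r"catalog|cat", text))
```

This is deliberately naive (one membership test per prefix, so quadratic
work at best); it only shows that the set view gives the same answer for
R1|R2 and R2|R1.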

Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Janis Papanagnou
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Wed, 24 Jul 2024 09:41 UTC
References: 1 2 3 4
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_papanagnou+ng@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ...
=~~ ... ]]
Date: Wed, 24 Jul 2024 11:41:48 +0200
Organization: A noiseless patient Spider
Lines: 88
Message-ID: <v7qi8t$1mi8m$1@dont-email.me>
References: <v7mknf$3plab$1@news.xmission.com>
<v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
<v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 24 Jul 2024 11:41:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a525e55a33c812c78de3ae166044610c";
logging-data="1788182"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19dyaTAwJe1yNo9vTN8zKrJ"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:rhhba3/LkXBnzcAdtfxvcSyUHhk=
In-Reply-To: <20240723112050.105@kylheku.com>
X-Enigmail-Draft-Status: N1110

On 23.07.2024 20:34, Kaz Kylheku wrote:
>
> [...]
>
> In the old math/CS papers about regular expressions, regular expressions
> are typically represented in terms of some input symbol alphabet
> (usually just letters a, b, c ...) and only the operators | and *,
> and parentheses (other than when advanced operators are being discussed,
> like intersection and complement, which are not easily constructed from
> these.)
>
> I think character classes might have been a pragmatic invention in
> regex implementations. The theory doesn't require [a-c] because
> that can be encoded as (a|b|c).

While formally we can restrict to some basic elements it's quite
inconvenient in practice. I recall that in Compiler Construction
and Automata Theory we regularly used the 'all-but' operator for
complementing input symbol sets. And also obvious abbreviations
for larger sets of symbols (like 'digits', etc.). Not only in
practical regexp implementations, also in education certainly no
one wants to waste time.

>
> The ? operator is not required because (R)? can be written (R)(R)*.

ITYM: (R)+
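
A minimal sketch of the two identities, assuming Python's re module as
the checker (an assumption for illustration; the helper in_language and
the sample R = "ab" are hypothetical). It compares (R)(R)* against (R)+,
and (R)? against an alternation with an empty branch, which is one way of
writing "R or the empty string" in this particular syntax.

```python
import re


def in_language(pattern, text):
    """Whole-string membership test for the language denoted by pattern."""
    return re.fullmatch(pattern, text) is not None


R = "ab"  # sample sub-expression, chosen arbitrarily
plus_form = f"(?:{R})+"
star_form = f"(?:{R})(?:{R})*"
optional_form = f"(?:{R})?"
empty_branch_form = f"(?:{R}|)"

samples = ["", "ab", "abab", "a", "aba"]
for s in samples:
    print(repr(s),
          in_language(plus_form, s), in_language(star_form, s),
          in_language(optional_form, s), in_language(empty_branch_form, s))
```

On these samples the first two columns should agree with each other, and
so should the last two; the empty string separates the pairs, since it is
accepted by (R)? but not by (R)(R)*.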

>
> Escaping is not required because the operators and input symbols are
> distinct; the idea that ( could be an input symbol is something that
> occurs in implementations, not in the theory.
>
> Regex implementors take the theory and adjust it to taste,
> and add necessary details such as character escape sequences for
> control characters, and escaping to allow the operator characters
> themselves to be matched. Plus character classes, with negation
> and ranges and all that.

There are (at least) two different types of such adjustments. One
type is the convenience enhancements (like '+', the multiplicity
'{m,n}', Perl's '\d' and '\D', etc.) that, from a complexity
perspective, all stay within the same [theoretical] class of the
Regular Languages. (There's other types of extensions that we find
in implementations that leave that language class.)

>
> Not all implementations follow solid theory. For instance, the branch
> operator | is supposed to be commutative. There is no difference
> between R1|R2 and R2|R1. But in many implementations (particularly
> backtracking ones like PCRE and similar), there is a difference: these
> implementations implement R1|R2|R3 by trying the expressions in left to
> right order and stop at the first match.

In my book it's not necessary to follow "solid theory"; if, and only
if, it's documented, correctly/sensibly implemented, and implications
made clear to the programmer.

There's two common examples that come to mind. I think it's okay if
there's support for, e.g., back-references if it's clearly stated
that you should not expect that this code will run in O(N), linear
complexity. But there have been implementations - don't recall if it
was in Java, Perl, or both - where they implemented "generalizations"
of "Regexp-processings" that had the runtime-effect that even for
*true* Regular Expressions you had no O(N) guarantee any more; and
this is IMO a fatal decision! If the programmer uses only RE elements
from the class of Regular Languages there should always be complexity
O(N) guaranteed.
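
A minimal sketch of how a linear-time guarantee can be met for a true
Regular Expression: simulate a nondeterministic automaton by tracking the
set of states it could be in, instead of backtracking. Everything here is
an assumption for illustration: the automaton is hand-built for the
language (a|aa)*b, the names TRANSITIONS, START, ACCEPTING and accepts are
hypothetical, and this is not a description of how any particular regex
library works.

```python
# Hand-built NFA for (a|aa)*b.
# State 0: at the start of a loop iteration.
# State 1: after the first "a" of an "aa" branch.
# State 2: after the final "b" (the only accepting state).
TRANSITIONS = {
    (0, "a"): {0, 1},
    (1, "a"): {0},
    (0, "b"): {2},
}
START = {0}
ACCEPTING = {2}


def accepts(text):
    """Decide membership by advancing a set of possible states per symbol."""
    current = set(START)
    for symbol in text:
        following = set()
        for state in current:
            following |= TRANSITIONS.get((state, symbol), set())
        current = following
        if not current:
            # No state survives, so no continuation can be accepted.
            return False
    return bool(current & ACCEPTING)


print(accepts("aaab"))
print(accepts("aaa"))
print(accepts("a" * 100000 + "b"))
```

The work per input symbol is bounded by the number of states, which is
fixed by the expression, so the total is proportional to the length of
the input regardless of how ambiguous the expression is.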

>
> This matters when regexes are used for matching a prefix of the input;
> if the regex is interpreted according to the theory, it should match
> the longest possible prefix; it cannot ignore R3, which matches
> thousands of symbols, because R2 matched three symbols.

I think we should differentiate the matching process (implementation
specific syntax for formulas of REs, matching implementation methods)
from the Regular Languages theory; the complete strings of text that
we match are in general not part of the Regular Language, the Regular
Expression only specifies the subset of (matching) strings as part of
the respective Regular Language. And, WRT complexity, we should also
be aware that the O(N) in Regular Languages is the complexity of the
"match", not the length of the data that is skimmed for a match. The
various algorithms combine these complexities in supposedly efficient
ways. Some RE parsers failed.

Janis

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Janis Papanagnou
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Wed, 24 Jul 2024 09:53 UTC
References: 1 2 3 4 5
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: janis_papanagnou+ng@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ...
=~~ ... ]]
Date: Wed, 24 Jul 2024 11:53:00 +0200
Organization: A noiseless patient Spider
Lines: 23
Message-ID: <v7qitt$1mlk1$1@dont-email.me>
References: <v7mknf$3plab$1@news.xmission.com> <v7ofkl$18d66$1@dont-email.me>
<v7okrk$3qkbf$1@news.xmission.com> <v7omtd$19ng6$1@dont-email.me>
<v7ooft$3qm6t$1@news.xmission.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 24 Jul 2024 11:53:02 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a525e55a33c812c78de3ae166044610c";
logging-data="1791617"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19ZfQ7vzK3A1KSSbZESmhxR"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:vnJch5bVR+OCmTn0HSpQWtqcarI=
X-Enigmail-Draft-Status: N1110
X-Mozilla-News-Host: news://news.eternal-september.org
In-Reply-To: <v7ooft$3qm6t$1@news.xmission.com>

On 23.07.2024 19:15, Kenny McCormack wrote:
> In article <v7omtd$19ng6$1@dont-email.me>,
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> ...
>> Indeed. It reminds me of the philosophy that I often noticed in MS (and
>> nowadays also in Linux software, sadly) contexts; they seem to think
>> their auto-changes are better than the intention of the programmer.
>
> The overall plan is to turn programming into a minimum wage job. That's
> why they are starting to call it "coding" and make it sound like something
> anybody can do.

And sometimes it doesn't even appear as coding; during a small episode
in Java I could observe they are just clicking together pieces of code
from drop-down menus using Eclipse.

>
> So, they have to take as much as possible of the choice/initiative out of it.
> Make it the modern equivalent of a factory job.

Interesting comparison. But, yes.

Janis

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Ben Bacarisse
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Wed, 24 Jul 2024 13:17 UTC
References: 1 2 3 4 5 6
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
Date: Wed, 24 Jul 2024 14:17:14 +0100
Organization: A noiseless patient Spider
Lines: 31
Message-ID: <87sevz53qd.fsf@bsb.me.uk>
References: <v7mknf$3plab$1@news.xmission.com>
<v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
<v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
<87y15r650v.fsf@bsb.me.uk> <20240723202055.122@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Date: Wed, 24 Jul 2024 15:17:14 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="cecbc031db9d83bdacabd1935f084c00";
logging-data="1854881"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+7kZ5RdQdgGpl+7cMUDLpsfZ+AL2BNzsQ="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:5Iy4an6yJWPYQRu6F4xwM1qRmRI=
sha1:R1dJQyINbyR0n1PpZiOmeRIFOEs=
X-BSB-Auth: 1.802a8bad114d65de04ee.20240724141714BST.87sevz53qd.fsf@bsb.me.uk

Kaz Kylheku <643-408-1753@kylheku.com> writes:

> On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
>> Kaz Kylheku <643-408-1753@kylheku.com> writes:
>>> This matters when regexes are used for matching a prefix of the input;
>>> if the regex is interpreted according to the theory, it should match
>>> the longest possible prefix; it cannot ignore R3, which matches
>>> thousands of symbols, because R2 matched three symbols.
>>
>> This is more a consequence of the different views. In the formal
>> theory there is no notion of "matching". Regular expressions define
>> languages (i.e. sets of sequences of symbols) according to a recursive
>> set of rules. The whole idea of an RE matching a string is from their
>> use in practical applications.
>
> Under the set view, we can ask, what is the longest prefix of
> the input which belongs to the language R1|R2. The answer is the
> same for R2|R1, which denote the same set, since | corresponds
> to set union.

What is "the input" in the set view? The set view is simply a recursive
definition of the language.

> Broken regular expressions identify the longest prefix, except
> when the | operator is used; then they just identify a prefix,
> not necessarily longest.

What is a "broken" RE in the set view?

--
Ben.

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Kaz Kylheku
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Wed, 24 Jul 2024 18:35 UTC
References: 1 2 3 4 5 6 7
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: 643-408-1753@kylheku.com (Kaz Kylheku)
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[
... =~~ ... ]]
Date: Wed, 24 Jul 2024 18:35:51 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 56
Message-ID: <20240724112619.254@kylheku.com>
References: <v7mknf$3plab$1@news.xmission.com>
<v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
<v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
<87y15r650v.fsf@bsb.me.uk> <20240723202055.122@kylheku.com>
<87sevz53qd.fsf@bsb.me.uk>
Injection-Date: Wed, 24 Jul 2024 20:35:51 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="22dbec4c97aef40d7cd38abdf24b02a3";
logging-data="1949461"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+uvBPjY44snRs1zIMCrxKMwN5oeeURQAE="
User-Agent: slrn/pre1.0.4-9 (Linux)
Cancel-Lock: sha1:xhgEIIU6fH3RycG8i55z/vj5q6A=

On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
> Kaz Kylheku <643-408-1753@kylheku.com> writes:
>
>> On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
>>> Kaz Kylheku <643-408-1753@kylheku.com> writes:
>>>> This matters when regexes are used for matching a prefix of the input;
>>>> if the regex is interpreted according to the theory, it should match
>>>> the longest possible prefix; it cannot ignore R3, which matches
>>>> thousands of symbols, because R2 matched three symbols.
>>>
>>> This is more a consequence of the different views. In the formal
>>> theory there is no notion of "matching". Regular expressions define
>>> languages (i.e. sets of sequences of symbols) according to a recursive
>>> set of rules. The whole idea of an RE matching a string is from their
>>> use in practical applications.
>>
>> Under the set view, we can ask, what is the longest prefix of
>> the input which belongs to the language R1|R2. The answer is the
>> same for R2|R1, which denote the same set, since | corresponds
>> to set union.
>
> What is "the input" in the set view? The set view is simply a recursive
> definition of the language.

It is a separate string under consideration.

We have a set, and are asking the question "what is the longest prefix
of the given string which is a member of the set".

>> Broken regular expressions identify the longest prefix, except
>> when the | operator is used; then they just identify a prefix,
>> not necessarily longest.
>
> What is a "broken" RE in the set view?

Inconsistency in being able to answer the question "what is the longest
prefix of the string which is a member of the set".

Broken regexes contain a pitfall: they deliver the right answer
for expressions like ab*. If the input is "abbbbbbbc",
they identify the entire "abbbbbbb" prefix. But if the branch
operator is used, as in "a|ab*", oops, they short-circuit.
The "a" matches a prefix of the input, and so that's done; no need
to match the "ab*" part of the branch.

The "a" prefix is in the language described by the expression; a
set element has been identified. But it's not the longest one.

It is an inconsistency. If the longest match is not required, why
bother finding one for "ab*"; for that expression, the "a" prefix could
also just be returned.
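
A minimal sketch of the behaviour described above, assuming Python's re
module as a representative backtracking implementation (an assumption for
illustration; it is not mentioned elsewhere in this thread). re.match
anchors at the start of the string, so it reports which prefix the engine
settles on.

```python
import re

text = "abbbbbbbc"

# No branch operator: the greedy star consumes every "b".
plain = re.match(r"ab*", text).group()

# Branch operator, short alternative first: the engine stops at "a".
short_first = re.match(r"a|ab*", text).group()

# Same language, alternatives swapped: now the long prefix is reported.
long_first = re.match(r"ab*|a", text).group()

print(plain)        # abbbbbbb
print(short_first)  # a
print(long_first)   # abbbbbbb
```

The two alternations denote the same set, yet the reported prefix depends
on the order in which the branches are written.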

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca

Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
From: Ben Bacarisse
Newsgroups: comp.unix.shell
Organization: A noiseless patient Spider
Date: Wed, 24 Jul 2024 21:28 UTC
References: 1 2 3 4 5 6 7 8
Path: eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: ben@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
Date: Wed, 24 Jul 2024 22:28:32 +0100
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <87msm65vjz.fsf@bsb.me.uk>
References: <v7mknf$3plab$1@news.xmission.com>
<v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
<v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
<87y15r650v.fsf@bsb.me.uk> <20240723202055.122@kylheku.com>
<87sevz53qd.fsf@bsb.me.uk> <20240724112619.254@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Date: Wed, 24 Jul 2024 23:28:32 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="cecbc031db9d83bdacabd1935f084c00";
logging-data="2011422"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18apQeZDw0Yg9JzT2JeL1iMPZBTRZv+j4Q="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:uXbT38DXzsoOsNWngBnJReZXESA=
sha1:HfA35nOzaYo7LgglexX8GLZffr8=
X-BSB-Auth: 1.0864bdfbc636e95ba92f.20240724222832BST.87msm65vjz.fsf@bsb.me.uk

Kaz Kylheku <643-408-1753@kylheku.com> writes:

> On 2024-07-24, Ben Bacarisse <ben@bsb.me.uk> wrote:
>> Kaz Kylheku <643-408-1753@kylheku.com> writes:
>>
>>> On 2024-07-23, Ben Bacarisse <ben@bsb.me.uk> wrote:
>>>> Kaz Kylheku <643-408-1753@kylheku.com> writes:
>>>>> This matters when regexes are used for matching a prefix of the input;
>>>>> if the regex is interpreted according to the theory, it should match
>>>>> the longest possible prefix; it cannot ignore R3, which matches
>>>>> thousands of symbols, because R2 matched three symbols.
>>>>
>>>> This is more a consequence of the different views. In the formal
>>>> theory there is no notion of "matching". Regular expressions define
>>>> languages (i.e. sets of sequences of symbols) according to a recursive
>>>> set of rules. The whole idea of an RE matching a string is from their
>>>> use in practical applications.
>>>
>>> Under the set view, we can ask, what is the longest prefix of
>>> the input which belongs to the language R1|R2. The answer is the
>>> same for R2|R1, which denote the same set, since | corresponds
>>> to set union.
>>
>> What is "the input" in the set view? The set view is simply a recursive
>> definition of the language.
>
> It is a separate string under consideration.
>
> We have a set, and are asking the question "what is the longest prefix
> of the given string which is a member of the set".

It's better, then, (as in the latter wording) not to use a term from the
"implementation" view of REs.

>>> Broken regular expressions identify the longest prefix, except
>>> when the | operator is used; then they just identify a prefix,
>>> not necessarily longest.
>>
>> What is a "broken" RE in the set view?
>
> Inconsistency in being able to answer the question "what is the longest
> prefix of the string which is a member of the set".
>
> Broken regexes contain a pitfall: they deliver the right answer
> for expressions like ab*. If the input is "abbbbbbbc",
> they identify the entire "abbbbbbb" prefix. But if the branch
> operator is used, as in "a|ab*", oops, they short-circuit.
> The "a" matches a prefix of the input, and so that's done; no need
> to match the "ab*" part of the branch.

I don't see any "pitfall". The answer to your question "what is the
longest prefix of the given string which is a member of the set" is not
"a", and nothing in either the formal definition of the language
"a|ab*" or the wording of the question is a pitfall. The longest
prefix of "abbbbbbbc" that is in the language "a|ab*" is, unambiguously,
"abbbbbbb".

> The "a" prefix is in the language described by the expression; a
> set element has been identified. But it's not the longest one.

Yes. But there is no "pitfall" and the RE is not "broken" in any formal
sense at all.

An implementation might be broken and there are pitfalls to look out for
when viewing REs as patterns to match, but that's my whole point. This
is all about the "other" view, not the view of REs as defining formal
languages.

> It is an inconsistency. If the longest match is not required, why
> bother finding one for "ab*"; for that expression, the "a" prefix could
> also just be returned.

We could, of course, ask about other prefixes of "abbbbbbbc" that are in
the language "a|ab*". I don't see anything inconsistent here at all.

--
Ben.
