Subject: Vanilla regex
From: Tuxedo
Newsgroups: comp.unix.questions
Date: Sun, 2 Jul 2023 13:24 UTC

Can anyone assist with a regex using fairly standard and cross-compatible
methods?

It's for files containing wiki markup segments as follows:

[[File:Some File Name 0123.jpg|800px]]

Or maybe:

[[File:Some other file.jpg|250px]]

Or maybe:

[[File:Another file.jpg |600px|thumb]]

etc.

The only certainty for identifying the relevant parts is that they start with
"[[File:", followed by characters and/or numbers making up a file name (no
UTF-8), ending in some suffix such as .jpg, .JPEG, .Jpeg, .PNG, .gif etc.,
and then a "|" pipe or the closing "]]" brackets.

The regex needs to grab the filename portion, e.g. "Another file.jpg", keep
it in a variable and replace any spaces with underscores so the new
variable becomes "Another_file.jpg".

Thereafter, within the existing markup, for example:

[[File:Another file.jpg |600px|thumb]]

Add the following markup after the first pipe:

link=https://example.com/display.pl?Another_file.jpg|

So the final markup becomes:

[[File:Another file.jpg |link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]

The spaces in the original "File: ..." name can remain, as that is valid,
but the link=... strings need underscores instead.

There may be instances where "|link=" already exists between an opening
"[[File:" and its closing "]]" brackets. The regex should skip any such
instances so the procedure can be rerun without conflicting with past
replacements.

Many thanks for any example code snippets and ideas.

Tuxedo

Subject: Re: Vanilla regex
From: Ben Bacarisse
Newsgroups: comp.unix.questions
Organization: A noiseless patient Spider
Date: Sun, 2 Jul 2023 20:42 UTC

Tuxedo <tuxedo@mailinator.net> writes:

> Can anyone assist with a regex using fairly standard and cross compatible
> methods?

What you want can't be done with a regex alone. You need a tool that uses
regexes to drive substitutions, such as sed, AWK, Perl, Python, PHP or Ruby.

> It's for files containing wiki markup segments as follows:
>
> [[File:Some File Name 0123.jpg|800px]]
>
> Or maybe:
>
> [[File:Some other file.jpg|250px]]
>
> Or maybe:
>
> [[File:Another file.jpg |600px|thumb]]
>
> etc.
>
> The only certainty for identifying the relevant parts is that they start with
> "[[File:", followed by characters and/or numbers making up a file name (no
> UTF-8), ending in some suffix such as .jpg, .JPEG, .Jpeg, .PNG, .gif etc.,
> and then a "|" pipe or the closing "]]" brackets.

Is that really the only certainty? If so, it's a hard problem. Can the
file name contain | or ]] or newlines? I suspect not as "characters
and/or numbers" is an odd thing to say. I think you mean [a-zA-Z0-9 ].
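
To make that character class concrete, here is a minimal Python sketch. It assumes the name really is limited to [a-zA-Z0-9 ] and that the suffix is one of a short list; the list and the names SUFFIXES and FILE_RE are hypothetical, since the post only gives examples such as .jpg, .Jpeg, .PNG and .gif.

```python
import re

# Assumed suffix list; the post only says "such as .jpg JPEG, .Jpeg etc. .PNG, .gif".
SUFFIXES = r"(?:jpe?g|png|gif)"

# "[[File:" + a name of letters, digits and spaces + "." + suffix,
# then optional spaces and a lookahead for "|" or "]]".
# IGNORECASE makes the suffix match .JPG/.Png etc. (and also "file:").
FILE_RE = re.compile(
    r"\[\[File:([a-zA-Z0-9 ]+\." + SUFFIXES + r") *(?=\||\]\])",
    re.IGNORECASE,
)

samples = [
    "[[File:Some File Name 0123.jpg|800px]]",
    "[[File:Some other file.jpg|250px]]",
    "[[File:Another file.jpg |600px|thumb]]",
    "[[File:Plain.PNG]]",
]

# Group 1 is the file name without the trailing space before the pipe.
names = [FILE_RE.match(s).group(1) for s in samples]
print(names)
```

If real file names contain other characters (hyphens, dots, parentheses), the class would need widening, which is why a looser "anything up to the first | or ]" pattern may be the safer choice in practice.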

> The regex needs to grab the filename portion, eg. "Another file.jpg", keep
> it in a variable and replace any spaces with underscore(s) so the new
> variable becomes "Another_file.jpg"

Regexes can't do that, but lots of tools that use them can. Do you care
what tool is used?
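
One way to picture that split between regex and tool: the regex only finds the pieces, and ordinary code builds the new text. In this Python sketch `re.sub` is given a function instead of a replacement string. The pattern, the name `add_link` and the handling of a tag without any pipe are assumptions; only the link= URL prefix comes from the original post. It does not yet skip tags that already carry a link= part.

```python
import re

LINK_PREFIX = "link=https://example.com/display.pl?"  # from the original post

# Group 1: "[[File:" plus the file name (anything up to the first "|" or "]").
# Group 2: the file name alone, possibly with a trailing space.
# Group 3: the "|" that follows, or empty when the tag closes straight away.
TAG_RE = re.compile(r"(\[\[File:([^|\]]+))(\|?)")

def add_link(match: re.Match) -> str:
    head, name, pipe = match.groups()
    # The "variable" with underscores, built by plain string code.
    underscored = name.strip().replace(" ", "_")
    # Assumption: a tag with no pipe at all gets "|link=..." before its "]]".
    tail = "|" if pipe else ""
    return f"{head}|{LINK_PREFIX}{underscored}{tail}"

before = "[[File:Another file.jpg |600px|thumb]]"
after = TAG_RE.sub(add_link, before)
print(after)
```

The same shape works in Perl (`s///e`), and with more effort in AWK; sed has no way to run code on a capture, so the space-to-underscore step would need a loop of substitutions there.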

> Thereafter, within the existing markup, for example:
>
> [[File:Another file.jpg |600px|thumb]]
>
> Add the following markup after the first pipe:
>
> link=https://example.com/display.pl?Another_file.jpg|
>
> So the final markup becomes:
> [[File:Another file.jpg |
> link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]
>
> The spaces in the original "File: ..." name parts can remain as it's valid
> but the underscores need to exist in link=... strings.
>
> There may be instances where "|link=" already exists between an opening
> "[[File:" and its closing "]]" brackets. The regex should skip any such
> instances so the procedure can be rerun without conflicting with past
> replacements.

FYI: you want the program to be "idempotent".
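
Idempotent here means that running the program on its own output changes nothing. A sketch of one way to get that in Python: match the whole tag up to "]]" and hand it back untouched when it already contains "|link=". The pattern and the names `add_link_once` and `convert` are hypothetical, and the sketch assumes no nested "[[...]]" inside a tag.

```python
import re

LINK_PREFIX = "link=https://example.com/display.pl?"  # from the original post

# Whole tag: "[[File:", the name, then an optional "|..." part up to "]]".
# Assumes the tag contains no "]" before its closing "]]".
TAG_RE = re.compile(r"\[\[File:([^|\]]+)((?:\|[^\]]*)?)\]\]")

def add_link_once(match: re.Match) -> str:
    name, rest = match.groups()
    # Already processed (or linked by hand): return the tag unchanged.
    if re.search(r"\|\s*link=", rest):
        return match.group(0)
    underscored = name.strip().replace(" ", "_")
    return f"[[File:{name}|{LINK_PREFIX}{underscored}{rest}]]"

def convert(text: str) -> str:
    return TAG_RE.sub(add_link_once, text)

once = convert("[[File:Another file.jpg |600px|thumb]]")
twice = convert(once)
print(once == twice)  # True: the second run is a no-op
```

The `\s*` in the check allows for whitespace or a line break between the pipe and "link=".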

> Many thanks for any example code snippets and ideas.

It's not hard, but then it's not very much fun either, so you may have
to pay someone or learn how to do it yourself.

--
Ben.

Subject: Re: Vanilla regex
From: Tuxedo
Newsgroups: comp.unix.questions
Date: Mon, 3 Jul 2023 09:40 UTC

Ben Bacarisse wrote:

> Tuxedo <tuxedo@mailinator.net> writes:
>
>> Can anyone assist with a regex using fairly standard and cross compatible
>> methods?
>
> What you want can't be done with a regex. You need a tool that uses
> regexes to drive substitutions like sed, AWK, Perl, Python, PHP, ruby...
>
>> It's for files containing wiki markup segments as follows:
>>
>> [[File:Some File Name 0123.jpg|800px]]
>>
>> Or maybe:
>>
>> [[File:Some other file.jpg|250px]]
>>
>> Or maybe:
>>
>> [[File:Another file.jpg |600px|thumb]]
>>
>> etc.
>>
>> The only certainty for identifying the relevant parts is that they start
>> with "[[File:", followed by characters and/or numbers making up a file name
>> (no UTF-8), ending in some suffix such as .jpg, .JPEG, .Jpeg, .PNG, .gif
>> etc., and then a "|" pipe or the closing "]]" brackets.
>
> Is that really the only certainty? If so, it's a hard problem. Can the
> file name contain | or ]] or newlines? I suspect not as "characters
> and/or numbers" is an odd thing to say. I think you mean [a-zA-Z0-9 ].

The filename itself never contains | or ]] in this case. An occasional
newline could occur somewhere in the complete string, although it's unlikely,
and never in the filename part.

>
>> The regex needs to grab the filename portion, eg. "Another file.jpg",
>> keep it in a variable and replace any spaces with underscore(s) so the
>> new variable becomes "Another_file.jpg"
>
> Regexes can't do that, but lots of tools that use them can. Do you care
> what tool is used?

Yes, I care which tool is used in the sense that it works.

>
>> Thereafter, within the existing markup, for example:
>>
>> [[File:Another file.jpg |600px|thumb]]
>>
>> Add the following markup after the first pipe:
>>
>> link=https://example.com/display.pl?Another_file.jpg|
>>
>> So the final markup becomes:
>> [[File:Another file.jpg |
>> link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]
>>
>> The spaces in the original "File: ..." name parts can remain as it's
>> valid but the underscores need to exist in link=... strings.
>>
>> There may be instances where "|link=" already exists between an opening
>> "[[File:" and its closing "]]" brackets. The regex should skip any such
>> instances so the procedure can be rerun without conflicting with past
>> replacements.
>
> FYI: you want the program to be "idempotent".

Thank you for that word :-)

>
>> Many thanks for any example code snippets and ideas.
>
> It's not hard, but then it's not very much fun either, so you may have
> to pay someone or learn how to do it yourself.
>

And for the advice.

Tuxedo

Subject: Re: Vanilla regex
From: Ben Bacarisse
Newsgroups: comp.unix.questions
Organization: A noiseless patient Spider
Date: Mon, 3 Jul 2023 12:56 UTC

Tuxedo <tuxedo@mailinator.net> writes:

> The filename itself never contains | or ]] in this case. The odd new line
> could be part of the complete string although it's unlikely and never in the
> filename part.

That's significant as some tools (AWK and sed for example) are oriented
towards processing lines, though AWK really processes records and it has
ways to re-define what a record is so as to help in situations like
this. Even so, using AWK for multi-line data like this can get fiddly.
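
The line-versus-whole-text difference can be seen in a small Python sketch. The sample text and the pattern are made up for illustration; the point is only that a negated character class such as `[^\]]` also matches a newline, so a tag split across lines is found when the text is searched as one string but missed when it is searched line by line.

```python
import re

# Hypothetical pattern: a whole "[[File:...]]" tag with no "]" inside it.
TAG_RE = re.compile(r"\[\[File:[^\]]*\]\]")

# Made-up sample: the first tag is broken across two lines.
text = (
    "intro\n"
    "[[File:Another file.jpg |600px|\n"
    "thumb]]\n"
    "outro [[File:Some other file.jpg|250px]]\n"
)

# Line by line, roughly how sed or AWK see the input by default.
per_line = [m.group(0) for line in text.splitlines() for m in TAG_RE.finditer(line)]

# The whole text as a single string.
whole = [m.group(0) for m in TAG_RE.finditer(text)]

print(len(per_line), len(whole))  # 1 2
```

Reading each file fully into memory is the simplest fix when the files are small; for AWK the equivalent is changing the record separator, as mentioned above.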

--
Ben.
