
Subject  Author
* pdf grep?  db
+* Re: pdf grep?  Robert Heller
|`* Re: pdf grep?  Stefan Ram
| `* Re: pdf grep?  Stefan Ram
|  `* Re: pdf grep?  db
|   `* Re: pdf grep?  db
|    `* Re: pdf grep?  Peter Flynn
|     `- Re: pdf grep?  db
`- Re: pdf grep?  Tim Landscheidt

Subject: pdf grep?
From: db
Newsgroups: comp.text.pdf
Organization: A noiseless patient Spider
Date: Wed, 3 Apr 2024 12:45 UTC

Under Linux, I can use grep to search a bunch of
files for a character string. Is there an equivalent
command for searching pdf files?

Subject: Re: pdf grep?
From: Robert Heller
Newsgroups: comp.text.pdf
Organization: Deepwoods Software
Date: Wed, 3 Apr 2024 14:03 UTC
References: 1

Grep may sort of work on PDF files, too. You might also want to use the
strings command to get "clean" strings. Note: *some* PDF files are just images
(no actual text). These would be PDFs created by scanning a document (not
using OCR). Also, many typesetting programs (TeX/LaTeX, word processors, etc.)
might do some typesetting "magic" (eg ligitures, etc.) that might make things
hard for grep.

xpdf includes a text search button as part of its UI.

At Wed, 3 Apr 2024 12:45:20 -0000 (UTC) db <dieterhansbritz@gmail.com> wrote:

>
> Under Linux, I can use grep to search a bunch of
> files for a character string. Is there an equivalent
> command for searching pdf files?
>
>

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
heller@deepsoft.com -- Webhosting Services

Subject: Re: pdf grep?
From: Stefan Ram
Newsgroups: comp.text.pdf
Organization: Stefan Ram
Date: Wed, 3 Apr 2024 14:17 UTC
References: 1 2

Robert Heller <heller@deepsoft.com> wrote or quoted:
>might do some typesetting "magic" (eg ligitures, etc.) that might make things

"ligatures"

Text in PDFs is sometimes compressed. So one can either use
programs like "Agent Ransack" to search for text in PDFs or
tools like "pdftotext" to first create a text file for every
PDF file and then grep those text files.
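
A rough way to see why compression gets in the way of a plain grep: the
sketch below checks whether a pattern is visible in a raw byte stream.
gzip is used here only as a stand-in for the Flate compression commonly
applied to PDF content streams (both are DEFLATE-based); the function
name and the sample text are made up for illustration.

```shell
# Report whether a pattern is visible in the raw bytes read from stdin.
# -a tells grep to treat binary input as text; -e protects odd patterns.
visible_in_raw_bytes() {
    if grep -a -q -e "$1"; then
        echo "found"
    else
        echo "not found"
    fi
}

# Hypothetical usage:
#   printf 'BT (some blabla text) Tj ET\n' | visible_in_raw_bytes blabla
#   printf 'BT (some blabla text) Tj ET\n' | gzip -c | visible_in_raw_bytes blabla
```

The first call sees the word; the second normally does not, which is
what happens when grep meets a compressed stream inside a PDF.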

Subject: Re: pdf grep?
From: Tim Landscheidt
Newsgroups: comp.text.pdf
Organization: https://www.tim-landscheidt.de/
Date: Wed, 3 Apr 2024 14:22 UTC
References: 1

db <dieterhansbritz@gmail.com> wrote:

> Under Linux, I can use grep to search a bunch of
> files for a character string. Is there an equivalent
> command for searching pdf files?

You can use pdfgrep (https://pdfgrep.org/) for that. It is
available as a package in Fedora and Debian as well.

Tim
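
For reference, a minimal sketch of how pdfgrep could be used to search a
whole directory of PDFs. The wrapper name find_in_pdfs is made up; -r,
-i and -n are pdfgrep's options for recursing into directories, ignoring
case and printing page numbers, but check "man pdfgrep" for the version
you have installed.

```shell
# Search all PDFs below a directory for a pattern, ignoring case and
# printing the page number of each match.
# $1 = pattern, $2 = directory (defaults to the current one)
find_in_pdfs() {
    pdfgrep -r -i -n "$1" "${2:-.}"
}

# Usage:
#   find_in_pdfs blabla ~/papers
```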

Subject: Re: pdf grep?
From: Stefan Ram
Newsgroups: comp.text.pdf
Organization: Stefan Ram
Date: Wed, 3 Apr 2024 14:29 UTC
References: 1 2 3

ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>Text in PDFs is sometimes compressed. So one can either use
>programs like "Agent Ransack" to search for text in PDFs or
>tools like "pdftotext" to first create a text file for every
>PDF file and then grep those text files.

PS: "Agent Ransack" is Windows software. "pdftotext" is also
available for Linux. Converting all PDFs to text files needs
to be done only once, and then search operations on those
text files are faster than scanning the PDF files for text
on every search!
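
The convert-once idea could look roughly like this. It is a sketch, not
a tested tool: it assumes pdftotext (from poppler-utils or xpdf) is on
the PATH, and the function name pdf_to_text_cache is made up. Each PDF
gets a .txt file next to it, and a PDF is reconverted only when it is
newer than its text file.

```shell
# Convert every PDF in a directory to a sibling .txt file, skipping
# files whose text version is already up to date.
# $1 = directory (defaults to the current one)
pdf_to_text_cache() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # no PDFs: the glob stayed literal
        txt="${pdf%.pdf}.txt"
        # Reconvert only if the text file is missing or older than the PDF.
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext "$pdf" "$txt"
        fi
    done
}

# Usage:
#   pdf_to_text_cache ~/papers
#   grep -il 'blabla' ~/papers/*.txt
```

After the first run, every later search is an ordinary grep over the
.txt files.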

Subject: Re: pdf grep?
From: db
Newsgroups: comp.text.pdf
Organization: A noiseless patient Spider
Date: Wed, 3 Apr 2024 15:19 UTC
References: 1 2 3 4

On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:

> ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>>Text in PDFs is sometimes compressed. So one can either use programs
>>like "Agent Ransack" to search for text in PDFs or tools like
>>"pdftotext" to first create a text file for every PDF file and then grep
>>those text files.
>
> PS: "Agent Ransack" is Windows software. "pdftotext" is also available
> for Linux. Converting all PDFs to text files needs to be done only
> once, and then search operations on those text files are faster than
> scanning the PDF files for text on every search!

I should maybe have elaborated a bit. Sometimes I
remember a certain phrase or word but forget which
pdf it is in. With text files I can do
grep blabla *.txt
and I wanted an equivalent. Using pdftotext would
mean using it for every suspect pdf. Since a lot of
pdf files are searchable, I figured that such a
command might exist.
But if there really is a pdfgrep command, that might
do the job. I will do some googling, thanks.
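
Without pdfgrep, the same effect can be approximated by piping pdftotext
into grep, so that no text files are left behind. A sketch, assuming
pdftotext (poppler-utils or xpdf) is installed; the function name
grep_pdfs is made up. Giving "-" as the output file makes pdftotext
write to standard output.

```shell
# List the PDF files whose extracted text contains a pattern.
# $1 = pattern, remaining arguments = PDF files
grep_pdfs() {
    pattern="$1"
    shift
    for pdf in "$@"; do
        # "-" sends the extracted text to stdout instead of a file;
        # error messages from damaged PDFs are silenced.
        if pdftotext "$pdf" - 2>/dev/null | grep -q -e "$pattern"; then
            printf '%s\n' "$pdf"
        fi
    done
}

# Usage:
#   grep_pdfs blabla *.pdf
```

This re-extracts the text on every search, so it is slower than
grepping text files that were converted once, and it finds nothing in
scanned PDFs that contain only images.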

Subject: Re: pdf grep?
From: db
Newsgroups: comp.text.pdf
Organization: A noiseless patient Spider
Date: Thu, 4 Apr 2024 09:50 UTC
References: 1 2 3 4 5

On Wed, 3 Apr 2024 15:19:24 -0000 (UTC), db wrote:

> On 3 Apr 2024 14:29:40 GMT, Stefan Ram wrote:
>
>> ram@zedat.fu-berlin.de (Stefan Ram) wrote or quoted:
>>>Text in PDFs is sometimes compressed. So one can either use programs
>>>like "Agent Ransack" to search for text in PDFs or tools like
>>>"pdftotext" to first create a text file for every PDF file and then
>>>grep those text files.
>>
>> PS: "Agent Ransack" is Windows software. "pdftotext" is also
>> available for Linux. Converting all PDFs to text files needs to be
>> done only once, and then search operations on those text files are
>> faster than scanning the PDF files for text on every search!
>
> I should maybe have elaborated a bit. Sometimes I remember a certain
> phrase or word but forget which pdf it is in. With text files I can do
> grep blabla *.txt and I wanted an equivalent. Using pdftotext would mean
> using it for every suspect pdf. Since a lot of pdf files are searchable,
> I figured that such a command might exist.
> But if there really is a pdfgrep command, that might do the job. I will
> do some googling, thanks.

I installed pdfgrep in my Kubuntu system, but it is
not happy. Although the man file is there, even help
doesn't work:

> pdfgrep --help
terminate called after throwing an instance of 'std::runtime_error'
what(): locale::facet::_S_create_c_locale name not valid
Aborted (core dumped)

??
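
For what it is worth, this message from the C++ runtime typically
appears when the environment (LANG or one of the LC_* variables) names a
locale that is not installed on the system. Whether that is the cause
here is an assumption; the thread does not confirm it. A sketch of a
check and a workaround, with made-up function names:

```shell
# Succeeds if the locale named by the environment is available.
# glibc's `locale` prints "Cannot set LC_..." on stderr when it is not.
locale_is_valid() {
    ! locale 2>&1 >/dev/null | grep -q 'Cannot set'
}

# Run a command as-is if the locale is fine, otherwise under the
# always-available "C" locale.
with_valid_locale() {
    if locale_is_valid; then
        "$@"
    else
        LC_ALL=C "$@"
    fi
}

# Usage:
#   with_valid_locale pdfgrep --help
```

If running under the C locale makes the crash go away, generating the
missing locale (or correcting LANG) would be the lasting fix.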

Subject: Re: pdf grep?
From: Peter Flynn
Newsgroups: comp.text.pdf
Organization: Usenet Labs Bozon Detector Facility
Date: Thu, 4 Apr 2024 15:57 UTC
References: 1 2 3 4 5 6

On 04/04/2024 10:50, db wrote:
[...]
> I installed pdfgrep in my Kubuntu system, but it is
> not happy. Although the man file is there, even help
> doesn't work:

I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
seems to work OK. What version is the Kubuntu one?

Peter

Subject: Re: pdf grep?
From: db
Newsgroups: comp.text.pdf
Organization: A noiseless patient Spider
Date: Fri, 5 Apr 2024 12:31 UTC
References: 1 2 3 4 5 6 7

On Thu, 4 Apr 2024 16:57:49 +0100, Peter Flynn wrote:

> On 04/04/2024 10:50, db wrote:
> [...]
>> I installed pdfgrep in my Kubuntu system, but it is not happy. Although
>> the man file is there, even help doesn't work:
>
> I just installed pdfgrep_2.1.2-1build1_amd64.deb in my Mint 20.1 and it
> seems to work OK. What version is the Kubuntu one?
>
> Peter

The man file for pdfgrep says V. 2.1.1. My Kubuntu
is 23.04.
