
Subject                       Author
* mmap vs. read               Steve Keller
+* Re: mmap vs. read          Richard Kettlewell
|+* Re: mmap vs. read         Casper H.S. Dik
||`* Re: mmap vs. read        blt_uYh21j
|| `- Re: mmap vs. read       Mikko Rauhala
|`- Re: mmap vs. read         Kaz Kylheku
+- Re: mmap vs. read          Marcel Mueller
`- Re: mmap vs. read          Kaz Kylheku

Subject: mmap vs. read
From: Steve Keller
Newsgroups: comp.unix.questions, comp.unix.programmer
Organization: Aioe.org NNTP Server
Date: Fri, 8 Feb 2019 11:40 UTC

AFAIU, reading files using mmap(2) has some performance benefits
compared to read(2). If a number of processes read the same file and
each process mmap()s the file into its address space to read it, then
only one copy of the file is in memory. OTOH, if the processes malloc
some memory and use read() to fill it with file data, the memory is
not shared, because (1) it will be aligned differently in these
processes and (2) each process writes to the memory causing a private
copy to be created.

So I think one should prefer mmap() to access files, but how can
errors be handled portably, then? On file I/O errors I get an error
return code from read() (e.g. EIO), but with mmap() I typically get a
SIGSEGV. How should I handle this?

Steve

Subject: Re: mmap vs. read
From: Richard Kettlewell
Newsgroups: comp.unix.questions, comp.unix.programmer
Organization: terraraq NNTP server
Date: Fri, 8 Feb 2019 15:38 UTC
References: 1

Steve Keller <keller@no.invalid> writes:
> AFAIU, reading files using mmap(2) has some performance benefits
> compared to read(2). If a number of processes read the same file and
> each process mmap()s the file into its address space to read it, then
> only one copy of the file is in memory. OTOH, if the processes malloc
> some memory and use read() to fill it with file data, the memory is
> not shared, because (1) it will be aligned differently in these
> processes and (2) each process writes to the memory causing a private
> copy to be created.
>
> So I think one should prefer mmap() to access files,

Profile first; historically at least mmap was not reliably faster than
read/write. Fiddling with page tables can be quite expensive.

> but how can errors be handled portably, then? On file I/O errors I
> get an error return code from read() (e.g. EIO), but with mmap() I
> typically get a SIGSEGV. How should I handle this?

Pass.

--
https://www.greenend.org.uk/rjk/

Subject: Re: mmap vs. read
From: Casper H.S. Dik
Newsgroups: comp.unix.questions, comp.unix.programmer
Date: Fri, 8 Feb 2019 16:15 UTC
References: 1 2

Richard Kettlewell <invalid@invalid.invalid> writes:

>Steve Keller <keller@no.invalid> writes:
>> AFAIU, reading files using mmap(2) has some performance benefits
>> compared to read(2). If a number of processes read the same file and
>> each process mmap()s the file into its address space to read it, then
>> only one copy of the file is in memory. OTOH, if the processes malloc
>> some memory and use read() to fill it with file data, the memory is
>> not shared, because (1) it will be aligned differently in these
>> processes and (2) each process writes to the memory causing a private
>> copy to be created.
>>
>> So I think one should prefer mmap() to access files,

>Profile first; historically at least mmap was not reliably faster than
>read/write. Fiddling with page tables can be quite expensive.

Yeah, though over time, memory closer to the CPU (cache, memory, page
tables) has become much faster, and CPUs have sped up even more quickly.
Storage, however, has lagged behind.

>> but how can errors be handled portably, then? On file I/O errors I
>> get an error return code from read() (e.g. EIO), but with mmap() I
>> typically get a SIGSEGV. How should I handle this?

>Pass.

Catch the signal with a siginfo handler and see where the memory fault
is (siginfo may also report why it failed). Returning from such a signal
handler is not possible; you will need to resume somewhere else.

That is, catching errors is pretty hard in that case, especially when
writing.

Casper

Subject: Re: mmap vs. read
From: Marcel Mueller
Newsgroups: comp.unix.questions, comp.unix.programmer
Organization: FreeDYN.net News Server
Date: Fri, 8 Feb 2019 17:02 UTC
References: 1

On 08.02.19 at 12:40, Steve Keller wrote:
> AFAIU, reading files using mmap(2) has some performance benefits
> compared to read(2). If a number of processes read the same file and
> each process mmap()s the file into its address space to read it, then
> only one copy of the file is in memory.

This is significant if and only if
(1) the file is sufficiently large,
(2) the file is opened by multiple processes and
(3) the file is not processed as a stream.

But if the file is large, you probably do not want to load it into
memory completely at all. Most large files are processed as a stream with
a limited buffer size.

> So I think one should prefer mmap() to access files, but how can

I do not agree. Quite the contrary. You should use mmap if you /need/ it.

> errors be handled portably, then?

If you really need mmap, it is likely that any I/O error is fatal for
your application. So the question is less likely to arise.

> On file I/O errors I get an error
> return code from read() (e.g. EIO), but with mmap() I typically get a
> SIGSEGV. How should I handle this?

With a signal handler. Of course you have to examine where the error
occurs and whether it is in your mapped memory area.

Marcel

Subject: Re: mmap vs. read
From: blt_uYh21j@xvjhmg9ueyj23p1690akks_mo.net
Newsgroups: comp.unix.questions, comp.unix.programmer
Organization: Aioe.org NNTP Server
Date: Fri, 8 Feb 2019 17:32 UTC
References: 1 2 3

On 08 Feb 2019 16:15:44 GMT
Casper H.S. Dik <Casper.Dik@OrSPaMcle.COM> wrote:
>Richard Kettlewell <invalid@invalid.invalid> writes:
>
>>Steve Keller <keller@no.invalid> writes:
>>> AFAIU, reading files using mmap(2) has some performance benefits
>>> compared to read(2). If a number of processes read the same file and
>>> each process mmap()s the file into its address space to read it, then
>>> only one copy of the file is in memory. OTOH, if the processes malloc
>>> some memory and use read() to fill it with file data, the memory is
>>> not shared, because (1) it will be aligned differently in these
>>> processes and (2) each process writes to the memory causing a private
>>> copy to be created.
>>>
>>> So I think one should prefer mmap() to access files,
>
>>Profile first; historically at least mmap was not reliably faster than
>>read/write. Fiddling with page tables can be quite expensive.
>
>Yeah, though over time, memory closer to the CPU (cache, memory, page
>tables) has become much faster, and CPUs have sped up even more quickly.
>Storage, however, has lagged behind.

Aren't the higher-level I/O routines, e.g. fread() etc., supposed to be written
to use the best access method on a given architecture?

Subject: Re: mmap vs. read
From: Mikko Rauhala
Newsgroups: comp.unix.questions, comp.unix.programmer
Organization: A noiseless patient Spider
Date: Fri, 8 Feb 2019 19:00 UTC
References: 1 2 3 4

On Fri, 8 Feb 2019 17:32:33 +0000 (UTC),
blt_uYh21j@xvjhmg9ueyj23p1690akks_mo.net
<blt_uYh21j@xvjhmg9ueyj23p1690akks_mo.net> wrote:
> Aren't the higher-level I/O routines, e.g. fread() etc., supposed to be written
> to use the best access method on a given architecture?

The fread() API necessarily makes at least one copy of the data, which is
not (easily) shareable. Internally, of course, it may use whatever method
it wants to get at the data to be copied.

--
Mikko Rauhala - mjr@iki.fi - http://rauhala.org/

Subject: Re: mmap vs. read
From: Kaz Kylheku
Newsgroups: comp.unix.questions, comp.unix.programmer
Followup: comp.unix.programmer
Organization: Aioe.org NNTP Server
Date: Fri, 8 Feb 2019 19:09 UTC
References: 1

On 2019-02-08, Steve Keller <keller@no.invalid> wrote:
> AFAIU, reading files using mmap(2) has some performance benefits
> compared to read(2).

This is not always the case. Basically, the file has to be large enough
to justify the overhead of creating a new mapping.

A program that repeatedly processes files by reading them into buffers
from malloc can perform better, because malloc can efficiently re-use
liberated memory without having to make system calls.

A program that repeatedly processes small files using mmap is constantly
making calls to mmap and munmap. These are expensive, and additionally
so because they manipulate the address space.

Basically the cost of the mmap operation has to be amortized somehow:
the best situation is that very large files are processed, and
infrequently so. Furthermore, random access is required.
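
To illustrate the buffer re-use point, here is a minimal sketch of loading many small files into one recycled heap buffer. struct filebuf and filebuf_load are hypothetical names, and the 4096-byte starting capacity is an arbitrary choice. Once the buffer has grown to fit the largest file seen, later loads need only open, read and close, with no change to the address space.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical growable buffer that is kept across files. */
struct filebuf {
    char  *data;
    size_t len;   /* bytes of the current file */
    size_t cap;   /* bytes allocated */
};

/* Read all of path into fb, reusing fb's existing storage.
 * Returns 0 on success, -1 on failure. */
int filebuf_load(struct filebuf *fb, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    fb->len = 0;                        /* keep data and cap from last time */
    for (;;) {
        if (fb->len == fb->cap) {       /* full (or never allocated): grow */
            size_t ncap = fb->cap ? fb->cap * 2 : 4096;
            char *p = realloc(fb->data, ncap);
            if (p == NULL) {
                close(fd);
                return -1;
            }
            fb->data = p;
            fb->cap = ncap;
        }
        ssize_t r = read(fd, fb->data + fb->len, fb->cap - fb->len);
        if (r < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry the read */
            close(fd);
            return -1;
        }
        if (r == 0)
            break;                      /* end of file */
        fb->len += (size_t)r;
    }
    close(fd);
    return 0;
}
```

Start with a zero-initialized struct filebuf, call filebuf_load() once per file, and free(fb.data) at the very end.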

> If a number of processes read the same file and
> each process mmap()s the file into its address space to read it, then
> only one copy of the file is in memory. OTOH, if the processes malloc
> some memory and use read() to fill it with file data, the memory is
> not shared, because (1) it will be aligned differently in these
> processes and (2) each process writes to the memory causing a private
> copy to be created.

However, often we can process an arbitrarily large file with only a
small buffer of a few kilobytes. Including doing random access, achieved
by seeking around in the file.

Ten processes passing over the same gigabyte file using 4 kilobyte
buffers are allocating only 40 kilobytes in total.
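
A minimal sketch of such a pass, assuming the work is something simple like summing the bytes; stream_sum is a made-up name. The only memory used for file data is the 4 KiB buffer on the stack, whatever the file size.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Sum every byte readable from fd using one fixed 4 KiB buffer.
 * Returns 0 and stores the sum on success, -1 on a read error. */
int stream_sum(int fd, uint64_t *sum_out)
{
    unsigned char buf[4096];
    uint64_t sum = 0;

    for (;;) {
        ssize_t r = read(fd, buf, sizeof buf);
        if (r < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* I/O error, reported via errno */
        }
        if (r == 0)
            break;              /* end of file */
        for (ssize_t i = 0; i < r; i++)
            sum += buf[i];
    }
    *sum_out = sum;
    return 0;
}
```

Random access works the same way: lseek() (or pread()) to the interesting offset, then read into the same small buffer.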

Ten processes mmapping the same gigabyte file means a gigabyte memory
map exists. The madvise system call can help here.

(To present a balanced view, we must observe that mmap doesn't have to
map the entire file at once, either. Also, a mapping can be destroyed
piece-wise, rather than all at once: munmap can be called on portions of
a mapping that we know we are not going to touch.)

> So I think one should prefer mmap() to access files, but how can
> errors be handled portably, then? On file I/O errors I get an error
> return code from read() (e.g. EIO), but with mmap() I typically get a
> SIGSEGV. How should I handle this?

In a utility program that can just bail on errors, you don't have to
bother too much. Fetch the size of the file upfront (for instance,
call stat(file, &stbuf) and take stbuf.st_size). Then map just for that
size. If the file happens to shrink, let the chips land where they may.
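
For the simple utility case, a minimal sketch could look like this. map_whole_file is a made-up name, and the sketch uses fstat() on the already-open descriptor rather than stat() on the path, which is a small variation on the suggestion above so that the size and the mapping refer to the same file.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only.
 * Returns the mapping and stores its length in *len_out, or returns
 * NULL if the file cannot be opened, is empty, or cannot be mapped. */
void *map_whole_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat stbuf;
    if (fstat(fd, &stbuf) != 0 || stbuf.st_size <= 0) {
        close(fd);
        return NULL;            /* stat failed, or nothing to map */
    }

    size_t len = (size_t)stbuf.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = len;
    return p;
}
```

The caller releases the mapping with munmap(p, len). Nothing here protects against the file shrinking after the fstat() call; that is exactly the case left to chance.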

In a robust application, you have to deal with the SIGBUS if you access
the mapping beyond the end of the file.

Signal handling for SIGBUS is about as portable as mmap itself: you're
writing a POSIX application.

Subject: Re: mmap vs. read
From: Kaz Kylheku
Newsgroups: comp.unix.questions, comp.unix.programmer
Organization: Aioe.org NNTP Server
Date: Fri, 8 Feb 2019 19:13 UTC
References: 1 2

On 2019-02-08, Richard Kettlewell <invalid@invalid.invalid> wrote:
> Steve Keller <keller@no.invalid> writes:
>> AFAIU, reading files using mmap(2) has some performance benefits
>> compared to read(2). If a number of processes read the same file and
>> each process mmap()s the file into its address space to read it, then
>> only one copy of the file is in memory. OTOH, if the processes malloc
>> some memory and use read() to fill it with file data, the memory is
>> not shared, because (1) it will be aligned differently in these
>> processes and (2) each process writes to the memory causing a private
>> copy to be created.
>>
>> So I think one should prefer mmap() to access files,
>
> Profile first; historically at least mmap was not reliably faster than
> read/write. Fiddling with page tables can be quite expensive.

I saw this recently on current PC hardware running Ubuntu 18.

There is a Debian patch for bsdiff which converts it from malloced
buffers to use mmap. (The patch has a bug in the unmapping, which I
fixed: it uses the compressed size of the source file to unmap it,
rather than the original size.)

I converted the bsdiff utility into a shared library, to use as a
subroutine in a program which calls it millions of times for small-ish
files.

The original read() version was found to be faster than the mmap()
version, so we dropped the patch instead of fixing its bug.

I hypothesized the poorer performance to be caused by the repeated
mapping and unmapping calls which manipulate the virtual address space
and require trips to the kernel. Whereas the malloced buffers can be
recycled without trips to the kernel or tweaking of the address space.
