diff mbox series

[v2] madvise.2: Document MADV_POPULATE_READ and MADV_POPULATE_WRITE

Message ID 20210816081922.5155-1-david@redhat.com (mailing list archive)
State New
Headers show
Series [v2] madvise.2: Document MADV_POPULATE_READ and MADV_POPULATE_WRITE | expand

Commit Message

David Hildenbrand Aug. 16, 2021, 8:19 a.m. UTC
MADV_POPULATE_READ and MADV_POPULATE_WRITE have been merged into
upstream Linux via commit 4ca9b3859dac ("mm/madvise: introduce
MADV_POPULATE_(READ|WRITE) to prefault page tables"), part of v5.14-rc1.

Further, commit eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for
MADV_POPULATE_(READ|WRITE)"), part of v5.14-rc6, made sure that SIGBUS is
converted to -EFAULT instead of -EINVAL.

Let's document the behavior and error conditions of these new madvise()
options.

Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Cc: Alejandro Colomar <alx.manpages@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---

v1 -> v2:
- Use semantic newlines in all cases
- Add two missing "
- Document -EFAULT handling
- Rephrase some parts to make it more generic: VM_PFNMAP and VM_IO are only
  examples for special mappings

---
 man2/madvise.2 | 107 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

Comments

Michael Kerrisk (man-pages) Aug. 17, 2021, 9:42 p.m. UTC | #1
Hello David,

Thank you for writing this! Could you please take
a look at the comments below and revise?

On 8/16/21 10:19 AM, David Hildenbrand wrote:
> MADV_POPULATE_READ and MADV_POPULATE_WRITE have been merged into
> upstream Linux via commit 4ca9b3859dac ("mm/madvise: introduce
> MADV_POPULATE_(READ|WRITE) to prefault page tables"), part of v5.14-rc1.
> 
> Further, commit eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for
> MADV_POPULATE_(READ|WRITE)"), part of v5.14-rc6, made sure that SIGBUS is
> converted to -EFAULT instead of -EINVAL.
> 
> Let's document the behavior and error conditions of these new madvise()
> options.
> 
> Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com>
> Cc: Alejandro Colomar <alx.manpages@gmail.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Jann Horn <jannh@google.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Linux API <linux-api@vger.kernel.org>
> Cc: linux-mm@kvack.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> 
> v1 -> v2:
> - Use semantic newlines in all cases
> - Add two missing "
> - Document -EFAULT handling
> - Rephrase some parts to make it more generic: VM_PFNMAP and VM_IO are only
>   examples for special mappings
> 
> ---
>  man2/madvise.2 | 107 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 107 insertions(+)
> 
> diff --git a/man2/madvise.2 b/man2/madvise.2
> index f1f384c0c..f6cea9ad2 100644
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -469,6 +469,72 @@ If a page is file-backed and dirty, it will be written back to the backing
>  storage.
>  The advice might be ignored for some pages in the range when it is not
>  applicable.
> +.TP
> +.BR MADV_POPULATE_READ " (since Linux 5.14)"
> +Populate (prefault) page tables readable for the whole range without actually

I have trouble to understand "Populate (prefault) page tables readable".
Does it mean that it is just the page tables are being populated, and the
PTEs are marked to indicate that the pages are readable? If yes, I
think some rewording would help.

> +reading memory.

I don't understand "without actually reading memory"? Do you mean,
"without actually  faulting in the pages"; or something else?

> +Depending on the underlying mapping,
> +map the shared zeropage,
> +preallocate memory or read the underlying file;
> +files with holes might or might not preallocate blocks.
> +Do not generate
> +.B SIGBUS
> +when populating fails,
> +return an error instead.

Better:

[[
If populating fails, a
.B SIGBUS
signal is not generated; instead, an error i returned.
]]

> +.IP
> +If
> +.B MADV_POPULATE_READ
> +succeeds,
> +all page tables have been populated (prefaulted) readable once.
> +If
> +.B MADV_POPULATE_READ
> +fails,
> +some page tables might have been populated.
> +.IP
> +.B MADV_POPULATE_READ
> +cannot be applied to mappings without read permissions
> +and special mappings,
> +for example,
> +marked with the kernel-internal

s/marked/mappings marked/

> +.B VM_PFNMAP
> +and

Just checking: should it be "and" or "or" here"?

Looking at the EINVAL error below, I guess "or", and a better 
wording would be:

[[
...for example, mappings marked with kernel-internal flags such as
.B VMPPFNMAP
or
.BR BR_V_IO.
]]

> +.BR VM_IO .
> +.IP
> +Note that with
> +.BR MADV_POPULATE_READ ,
> +the process can be killed at any moment when the system runs out of memory.
> +.TP
> +.BR MADV_POPULATE_WRITE " (since Linux 5.14)"
> +Populate (prefault) page tables writable for the whole range without actually

I have trouble to understand "Populate (prefault) page tables writable".
Does it mean that it is just the page tables are being populated, and the
PTEs are marked to indicate that the pages are writable? If yes, I
think some rewording would help.

> +writing memory.

I don't understand "without actually writing memory"? Do you mean,
"without actually  faulting in the pages"; or something else?

> +Depending on the underlying mapping,
> +preallocate memory or read the underlying file;
> +files with holes will preallocate blocks.
> +Do not generate
> +.B SIGBUS
> +when populating fails,
> +return an error instead.

Better:

[[
If populating fails, a
.B SIGBUS
signal is not generated; instead, an error i returned.
]]

+.IP
> +If
> +.B MADV_POPULATE_WRITE
> +succeeds,
> +all page tables have been populated (prefaulted) writable once.
> +If
> +.B MADV_POPULATE_WRITE
> +fails, some page tables might have been populated.
> +.IP
> +.B MADV_POPULATE_WRITE
> +cannot be applied to mappings without write permissions
> +and special mappings,
> +for example,
> +marked with the kernel-internal

s/marked/mappings marked/

> +.B VM_PFNMAP
> +and

Just checking: should it be "and" or "or" here"?

Looking at the EINVAL error below, I guess "or", and a better 
wording would be:

[[
...for example, mappings marked with kernel-internal flags such as
.B VMPPFNMAP
or
.BR BR_V_IO.
]]

> +.BR VM_IO .
> +.IP
> +Note that with
> +.BR MADV_POPULATE_WRITE ,
> +the process can be killed at any moment when the system runs out of memory.
>  .SH RETURN VALUE
>  On success,
>  .BR madvise ()
> @@ -490,6 +556,17 @@ A kernel resource was temporarily unavailable.
>  .B EBADF
>  The map exists, but the area maps something that isn't a file.
>  .TP
> +.B EFAULT
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +and populating (prefaulting) page tables failed because a
> +.B SIGBUS
> +would have been generated on actual memory access and the reason is not a
> +HW poisoned page.

Maybe:
s/.$/(see the description of MADV_HWPOISON in this page)./
?

> +.TP
>  .B EINVAL
>  .I addr
>  is not page-aligned or
> @@ -533,6 +610,18 @@ or
>  .BR VM_PFNMAP
>  ranges.
>  .TP
> +.B EINVAL
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +but the specified address range includes ranges with insufficient permissions
> +or incompatible mappings such as
> +.B VM_IO
> +or
> +.BR VM_PFNMAP.

s/.BR VM_PFNMAP./.BR VM_PFNMAP ./

> +.TP
>  .B EIO
>  (for
>  .BR MADV_WILLNEED )
> @@ -548,6 +637,15 @@ Not enough memory: paging in failed.
>  Addresses in the specified range are not currently
>  mapped, or are outside the address space of the process.
>  .TP
> +.B ENOMEM
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +and populating (prefaulting) page tables failed because there was not enough
> +memory.
> +.TP
>  .B EPERM
>  .I advice
>  is
> @@ -555,6 +653,15 @@ is
>  but the caller does not have the
>  .B CAP_SYS_ADMIN
>  capability.
> +.TP
> +.B EHWPOISON
> +.I advice
> +is
> +.B MADV_POPULATE_READ
> +or
> +.BR MADV_POPULATE_WRITE ,
> +and populating (prefaulting) page tables failed because a HW poisoned page
> +was encountered.
>  .SH VERSIONS
>  Since Linux 3.18,
>  .\" commit d3ac21cacc24790eb45d735769f35753f5b56ceb

Thanks,

Michael
David Hildenbrand Aug. 18, 2021, 8:35 a.m. UTC | #2
On 17.08.21 23:42, Michael Kerrisk (man-pages) wrote:
> Hello David,
> 
> Thank you for writing this! Could you please take
> a look at the comments below and revise?

Hi Michael,

thanks for your valuable input. Your feedback will certainly make this 
easier to understand for people that are not heavily involved in MM work :)

[...]

>>   man2/madvise.2 | 107 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 107 insertions(+)
>>
>> diff --git a/man2/madvise.2 b/man2/madvise.2
>> index f1f384c0c..f6cea9ad2 100644
>> --- a/man2/madvise.2
>> +++ b/man2/madvise.2
>> @@ -469,6 +469,72 @@ If a page is file-backed and dirty, it will be written back to the backing
>>   storage.
>>   The advice might be ignored for some pages in the range when it is not
>>   applicable.
>> +.TP
>> +.BR MADV_POPULATE_READ " (since Linux 5.14)"
>> +Populate (prefault) page tables readable for the whole range without actually
> 
> I have trouble to understand "Populate (prefault) page tables readable".
> Does it mean that it is just the page tables are being populated, and the
> PTEs are marked to indicate that the pages are readable? If yes, I
> think some rewording would help.

I actually tried phrasing it similar to our MAP_POPULATE documentation:
	("Populate  (prefault)  page tables for a mapping.")

We will prefault all pages, faulting them in.
> 
>> +reading memory.
> 
> I don't understand "without actually reading memory"? Do you mean,
> "without actually  faulting in the pages"; or something else?

"Populate (prefault) page tables readable, faulting in all pages in the 
range just as if manually reading one byte of each page; however, avoid 
the actual memory access that would have been performed after handling 
the fault."

Does that make it clearer? (avoiding eventually touching the page at all 
can be beneficial, especially when dealing with DAX memory where memory 
access might be expensive)

> 
>> +Depending on the underlying mapping,
>> +map the shared zeropage,
>> +preallocate memory or read the underlying file;
>> +files with holes might or might not preallocate blocks.
>> +Do not generate
>> +.B SIGBUS
>> +when populating fails,
>> +return an error instead.
> 
> Better:
> 
> [[
> If populating fails, a
> .B SIGBUS
> signal is not generated; instead, an error i returned.
> ]]
> 

Sure, thanks.

>> +.IP
>> +If
>> +.B MADV_POPULATE_READ
>> +succeeds,
>> +all page tables have been populated (prefaulted) readable once.
>> +If
>> +.B MADV_POPULATE_READ
>> +fails,
>> +some page tables might have been populated.
>> +.IP
>> +.B MADV_POPULATE_READ
>> +cannot be applied to mappings without read permissions
>> +and special mappings,
>> +for example,
>> +marked with the kernel-internal
> 
> s/marked/mappings marked/
> 
>> +.B VM_PFNMAP
>> +and
> 
> Just checking: should it be "and" or "or" here"?
> 
> Looking at the EINVAL error below, I guess "or", and a better
> wording would be:
> 
> [[
> ...for example, mappings marked with kernel-internal flags such as
> .B VMPPFNMAP
> or
> .BR BR_V_IO.
> ]]

Much better. Note that there might be more types of mappings that won't 
work (e.g., initially also secretmem IIRC).

> 
>> +.BR VM_IO .
>> +.IP
>> +Note that with
>> +.BR MADV_POPULATE_READ ,
>> +the process can be killed at any moment when the system runs out of memory.
>> +.TP
>> +.BR MADV_POPULATE_WRITE " (since Linux 5.14)"
>> +Populate (prefault) page tables writable for the whole range without actually
> 
> I have trouble to understand "Populate (prefault) page tables writable".
> Does it mean that it is just the page tables are being populated, and the
> PTEs are marked to indicate that the pages are writable? If yes, I
> think some rewording would help.
> 
>> +writing memory.
> 
> I don't understand "without actually writing memory"? Do you mean,
> "without actually  faulting in the pages"; or something else?
> 

Similar to the other wording:

"Populate (prefault) page tables writable, faulting in all pages in the 
range just as if manually writing one byte of each page; however, avoid 
the actual memory access that would have been performed after handling 
the fault."

>> +Depending on the underlying mapping,
>> +preallocate memory or read the underlying file;
>> +files with holes will preallocate blocks.
>> +Do not generate
>> +.B SIGBUS
>> +when populating fails,
>> +return an error instead.
> 
> Better:
> 
> [[
> If populating fails, a
> .B SIGBUS
> signal is not generated; instead, an error i returned.
> ]]
> 

Ack.

> +.IP
>> +If
>> +.B MADV_POPULATE_WRITE
>> +succeeds,
>> +all page tables have been populated (prefaulted) writable once.
>> +If
>> +.B MADV_POPULATE_WRITE
>> +fails, some page tables might have been populated.
>> +.IP
>> +.B MADV_POPULATE_WRITE
>> +cannot be applied to mappings without write permissions
>> +and special mappings,
>> +for example,
>> +marked with the kernel-internal
> 
> s/marked/mappings marked/
> 
>> +.B VM_PFNMAP
>> +and
> 
> Just checking: should it be "and" or "or" here"?
> 
> Looking at the EINVAL error below, I guess "or", and a better
> wording would be:
> 
> [[
> ...for example, mappings marked with kernel-internal flags such as
> .B VMPPFNMAP
> or
> .BR BR_V_IO.
> ]]
> 

Ack.

>> +.BR VM_IO .
>> +.IP
>> +Note that with
>> +.BR MADV_POPULATE_WRITE ,
>> +the process can be killed at any moment when the system runs out of memory.
>>   .SH RETURN VALUE
>>   On success,
>>   .BR madvise ()
>> @@ -490,6 +556,17 @@ A kernel resource was temporarily unavailable.
>>   .B EBADF
>>   The map exists, but the area maps something that isn't a file.
>>   .TP
>> +.B EFAULT
>> +.I advice
>> +is
>> +.B MADV_POPULATE_READ
>> +or
>> +.BR MADV_POPULATE_WRITE ,
>> +and populating (prefaulting) page tables failed because a
>> +.B SIGBUS
>> +would have been generated on actual memory access and the reason is not a
>> +HW poisoned page.
> 
> Maybe:
> s/.$/(see the description of MADV_HWPOISON in this page)./
> ?
> 

Sure, we can add that. But note that MADV_HWPOISON is just one of many 
ways to HWpoison a page.

>> +.TP
>>   .B EINVAL
>>   .I addr
>>   is not page-aligned or
>> @@ -533,6 +610,18 @@ or
>>   .BR VM_PFNMAP
>>   ranges.
>>   .TP
>> +.B EINVAL
>> +.I advice
>> +is
>> +.B MADV_POPULATE_READ
>> +or
>> +.BR MADV_POPULATE_WRITE ,
>> +but the specified address range includes ranges with insufficient permissions
>> +or incompatible mappings such as
>> +.B VM_IO
>> +or
>> +.BR VM_PFNMAP.
> 
> s/.BR VM_PFNMAP./.BR VM_PFNMAP ./
> 

Agreed.

>> +.TP
>>   .B EIO
>>   (for
>>   .BR MADV_WILLNEED )
>> @@ -548,6 +637,15 @@ Not enough memory: paging in failed.
>>   Addresses in the specified range are not currently
>>   mapped, or are outside the address space of the process.
>>   .TP
>> +.B ENOMEM
>> +.I advice
>> +is
>> +.B MADV_POPULATE_READ
>> +or
>> +.BR MADV_POPULATE_WRITE ,
>> +and populating (prefaulting) page tables failed because there was not enough
>> +memory.
>> +.TP
>>   .B EPERM
>>   .I advice
>>   is
>> @@ -555,6 +653,15 @@ is
>>   but the caller does not have the
>>   .B CAP_SYS_ADMIN
>>   capability.
>> +.TP
>> +.B EHWPOISON
>> +.I advice
>> +is
>> +.B MADV_POPULATE_READ
>> +or
>> +.BR MADV_POPULATE_WRITE ,
>> +and populating (prefaulting) page tables failed because a HW poisoned page
>> +was encountered.
>>   .SH VERSIONS
>>   Since Linux 3.18,
>>   .\" commit d3ac21cacc24790eb45d735769f35753f5b56ceb
> 
> Thanks,
> 


Thanks a lot Michael!
Michael Kerrisk (man-pages) Aug. 18, 2021, 8:58 p.m. UTC | #3
Hello David,

On 8/18/21 10:35 AM, David Hildenbrand wrote:
> On 17.08.21 23:42, Michael Kerrisk (man-pages) wrote:
>> Hello David,
>>
>> Thank you for writing this! Could you please take
>> a look at the comments below and revise?
> 
> Hi Michael,
> 
> thanks for your valuable input. Your feedback will certainly make this 
> easier to understand for people that are not heavily involved in MM work :)
> 
> [...]
> 
>>>   man2/madvise.2 | 107 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 107 insertions(+)
>>>
>>> diff --git a/man2/madvise.2 b/man2/madvise.2
>>> index f1f384c0c..f6cea9ad2 100644
>>> --- a/man2/madvise.2
>>> +++ b/man2/madvise.2
>>> @@ -469,6 +469,72 @@ If a page is file-backed and dirty, it will be written back to the backing
>>>   storage.
>>>   The advice might be ignored for some pages in the range when it is not
>>>   applicable.
>>> +.TP
>>> +.BR MADV_POPULATE_READ " (since Linux 5.14)"
>>> +Populate (prefault) page tables readable for the whole range without actually
>>
>> I have trouble to understand "Populate (prefault) page tables readable".
>> Does it mean that it is just the page tables are being populated, and the
>> PTEs are marked to indicate that the pages are readable? If yes, I
>> think some rewording would help.
> 
> I actually tried phrasing it similar to our MAP_POPULATE documentation:
> 	("Populate  (prefault)  page tables for a mapping.")

Yeah, well that description is a bit thin too :-}.

> We will prefault all pages, faulting them in.
>>
>>> +reading memory.
>>
>> I don't understand "without actually reading memory"? Do you mean,
>> "without actually  faulting in the pages"; or something else?
> 
> "Populate (prefault) page tables readable, faulting in all pages in the 
> range just as if manually reading one byte of each page; however, avoid 
> the actual memory access that would have been performed after handling 
> the fault."
> 
> Does that make it clearer? (avoiding eventually touching the page at all 
> can be beneficial, especially when dealing with DAX memory where memory 
> access might be expensive)

That text is much better. But, what's still not clear to me then is the
dfference between mmap(2) MAP_POPULATE, and MADV_POPULATE_READ and
MADV_POPULATE_WRITE. What is the differnece, and in what situations 
would one prefer one or the other approach? I think it would be helpful
if the manual page said something about these details.

>>> +Depending on the underlying mapping,
>>> +map the shared zeropage,
>>> +preallocate memory or read the underlying file;
>>> +files with holes might or might not preallocate blocks.
>>> +Do not generate
>>> +.B SIGBUS
>>> +when populating fails,
>>> +return an error instead.
>>
>> Better:
>>
>> [[
>> If populating fails, a
>> .B SIGBUS
>> signal is not generated; instead, an error i returned.
>> ]]
>>
> 
> Sure, thanks.
> 
>>> +.IP
>>> +If
>>> +.B MADV_POPULATE_READ
>>> +succeeds,
>>> +all page tables have been populated (prefaulted) readable once.
>>> +If
>>> +.B MADV_POPULATE_READ
>>> +fails,
>>> +some page tables might have been populated.
>>> +.IP
>>> +.B MADV_POPULATE_READ
>>> +cannot be applied to mappings without read permissions
>>> +and special mappings,
>>> +for example,
>>> +marked with the kernel-internal
>>
>> s/marked/mappings marked/
>>
>>> +.B VM_PFNMAP
>>> +and
>>
>> Just checking: should it be "and" or "or" here"?
>>
>> Looking at the EINVAL error below, I guess "or", and a better
>> wording would be:
>>
>> [[
>> ...for example, mappings marked with kernel-internal flags such as
>> .B VMPPFNMAP
>> or
>> .BR BR_V_IO.
>> ]]
> 
> Much better. Note that there might be more types of mappings that won't 
> work (e.g., initially also secretmem IIRC).

Ahh nice. Since there's about to be a memfd_secret() manual page, 
I suggest adding also "or secret memory regions created using
memfd_secret(2)".

>>> +.BR VM_IO .
>>> +.IP
>>> +Note that with
>>> +.BR MADV_POPULATE_READ ,
>>> +the process can be killed at any moment when the system runs out of memory.
>>> +.TP
>>> +.BR MADV_POPULATE_WRITE " (since Linux 5.14)"
>>> +Populate (prefault) page tables writable for the whole range without actually
>>
>> I have trouble to understand "Populate (prefault) page tables writable".
>> Does it mean that it is just the page tables are being populated, and the
>> PTEs are marked to indicate that the pages are writable? If yes, I
>> think some rewording would help.
>>
>>> +writing memory.
>>
>> I don't understand "without actually writing memory"? Do you mean,
>> "without actually  faulting in the pages"; or something else?
>>
> 
> Similar to the other wording:
> 
> "Populate (prefault) page tables writable, faulting in all pages in the 
> range just as if manually writing one byte of each page; however, avoid 
> the actual memory access that would have been performed after handling 
> the fault."

Much better, but see also my comments above re MADV_POPULATE_READ.

[...]

>>> +.B EFAULT
>>> +.I advice
>>> +is
>>> +.B MADV_POPULATE_READ
>>> +or
>>> +.BR MADV_POPULATE_WRITE ,
>>> +and populating (prefaulting) page tables failed because a
>>> +.B SIGBUS
>>> +would have been generated on actual memory access and the reason is not a
>>> +HW poisoned page.
>>
>> Maybe:
>> s/.$/(see the description of MADV_HWPOISON in this page)./
>> ?
>>
> 
> Sure, we can add that. But note that MADV_HWPOISON is just one of many 
> ways to HWpoison a page.

Then maybe something like:

"(HW poisoned pages can, for example, be created using the
MADV_HWPOISON flag described elsewhere in this page.)"

[...]

Thanks,

Michael
David Hildenbrand Aug. 19, 2021, 6:38 p.m. UTC | #4
>>> I don't understand "without actually reading memory"? Do you mean,
>>> "without actually  faulting in the pages"; or something else?
>>
>> "Populate (prefault) page tables readable, faulting in all pages in the
>> range just as if manually reading one byte of each page; however, avoid
>> the actual memory access that would have been performed after handling
>> the fault."
>>
>> Does that make it clearer? (avoiding eventually touching the page at all
>> can be beneficial, especially when dealing with DAX memory where memory
>> access might be expensive)
> 
> That text is much better. But, what's still not clear to me then is the
> dfference between mmap(2) MAP_POPULATE, and MADV_POPULATE_READ and
> MADV_POPULATE_WRITE. What is the differnece, and in what situations
> would one prefer one or the other approach? I think it would be helpful
> if the manual page said something about these details.

Well, MAP_POPULATE

1. Can only be used on new mappings, not on existing mappings and 
especially not on parts of (sparse) memory mappings.

2. Hides actual population errors, simply returning "success"

3. Cannot specify the actual faultin behavior (readable vs. writable). 
Private mappings are always faulted in writable, shared mappings writable.

MADV_POPULATE_WRITE is the way to go to preallocate memory, especially 
hugetlbfs. MADV_POPULATE_READ can be used to just populate the shared 
zero page, or to fault-in file-backed memory without marking everything 
dirty such that it won't have to be written back to disk.

for MADV_POPULATE_READ

"In contrast to MAP_POPULATE, MADV_POPULATE_READ does not hide errors, 
can be applied to (parts of) existing mappings and will always populate 
(prefault) page tables readable. One example use case is prefaulting a 
file mapping, reading all file content from disk; however, pages won't 
be dirtied and consequently won't have to be written back to disk when 
evicting the pages from memory."

Suggestions welcome :)

for MADV_POPULATE_WRITE

"In contrast to MAP_POPULATE, MADV_POPULATE_WRITE does not hide errors, 
can be applied to (parts of) existing mappings and will always populate 
(prefault) page tables writable. One example use case is preallocating 
memory, breaking any CoW (Copy on Write)."

Again, suggestions welcome :)

[...]

>> Much better. Note that there might be more types of mappings that won't
>> work (e.g., initially also secretmem IIRC).
> 
> Ahh nice. Since there's about to be a memfd_secret() manual page,
> I suggest adding also "or secret memory regions created using
> memfd_secret(2)".

Can do, thanks!

[...]

>>
>> Sure, we can add that. But note that MADV_HWPOISON is just one of many
>> ways to HWpoison a page.
> 
> Then maybe something like:
> 
> "(HW poisoned pages can, for example, be created using the
> MADV_HWPOISON flag described elsewhere in this page.)"

Makes sense, I'll reuse the same description at the other place in this 
file where I mention HW poisoned pages.
diff mbox series

Patch

diff --git a/man2/madvise.2 b/man2/madvise.2
index f1f384c0c..f6cea9ad2 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -469,6 +469,72 @@  If a page is file-backed and dirty, it will be written back to the backing
 storage.
 The advice might be ignored for some pages in the range when it is not
 applicable.
+.TP
+.BR MADV_POPULATE_READ " (since Linux 5.14)"
+Populate (prefault) page tables readable for the whole range without actually
+reading memory.
+Depending on the underlying mapping,
+map the shared zeropage,
+preallocate memory or read the underlying file;
+files with holes might or might not preallocate blocks.
+Do not generate
+.B SIGBUS
+when populating fails,
+return an error instead.
+.IP
+If
+.B MADV_POPULATE_READ
+succeeds,
+all page tables have been populated (prefaulted) readable once.
+If
+.B MADV_POPULATE_READ
+fails,
+some page tables might have been populated.
+.IP
+.B MADV_POPULATE_READ
+cannot be applied to mappings without read permissions
+and special mappings,
+for example,
+marked with the kernel-internal
+.B VM_PFNMAP
+and
+.BR VM_IO .
+.IP
+Note that with
+.BR MADV_POPULATE_READ ,
+the process can be killed at any moment when the system runs out of memory.
+.TP
+.BR MADV_POPULATE_WRITE " (since Linux 5.14)"
+Populate (prefault) page tables writable for the whole range without actually
+writing memory.
+Depending on the underlying mapping,
+preallocate memory or read the underlying file;
+files with holes will preallocate blocks.
+Do not generate
+.B SIGBUS
+when populating fails,
+return an error instead.
+.IP
+If
+.B MADV_POPULATE_WRITE
+succeeds,
+all page tables have been populated (prefaulted) writable once.
+If
+.B MADV_POPULATE_WRITE
+fails, some page tables might have been populated.
+.IP
+.B MADV_POPULATE_WRITE
+cannot be applied to mappings without write permissions
+and special mappings,
+for example,
+marked with the kernel-internal
+.B VM_PFNMAP
+and
+.BR VM_IO .
+.IP
+Note that with
+.BR MADV_POPULATE_WRITE ,
+the process can be killed at any moment when the system runs out of memory.
 .SH RETURN VALUE
 On success,
 .BR madvise ()
@@ -490,6 +556,17 @@  A kernel resource was temporarily unavailable.
 .B EBADF
 The map exists, but the area maps something that isn't a file.
 .TP
+.B EFAULT
+.I advice
+is
+.B MADV_POPULATE_READ
+or
+.BR MADV_POPULATE_WRITE ,
+and populating (prefaulting) page tables failed because a
+.B SIGBUS
+would have been generated on actual memory access and the reason is not a
+HW poisoned page.
+.TP
 .B EINVAL
 .I addr
 is not page-aligned or
@@ -533,6 +610,18 @@  or
 .BR VM_PFNMAP
 ranges.
 .TP
+.B EINVAL
+.I advice
+is
+.B MADV_POPULATE_READ
+or
+.BR MADV_POPULATE_WRITE ,
+but the specified address range includes ranges with insufficient permissions
+or incompatible mappings such as
+.B VM_IO
+or
+.BR VM_PFNMAP.
+.TP
 .B EIO
 (for
 .BR MADV_WILLNEED )
@@ -548,6 +637,15 @@  Not enough memory: paging in failed.
 Addresses in the specified range are not currently
 mapped, or are outside the address space of the process.
 .TP
+.B ENOMEM
+.I advice
+is
+.B MADV_POPULATE_READ
+or
+.BR MADV_POPULATE_WRITE ,
+and populating (prefaulting) page tables failed because there was not enough
+memory.
+.TP
 .B EPERM
 .I advice
 is
@@ -555,6 +653,15 @@  is
 but the caller does not have the
 .B CAP_SYS_ADMIN
 capability.
+.TP
+.B EHWPOISON
+.I advice
+is
+.B MADV_POPULATE_READ
+or
+.BR MADV_POPULATE_WRITE ,
+and populating (prefaulting) page tables failed because a HW poisoned page
+was encountered.
 .SH VERSIONS
 Since Linux 3.18,
 .\" commit d3ac21cacc24790eb45d735769f35753f5b56ceb