diff mbox series

[1/1] process_madvise.2: Add process_madvise man page

Message ID 20210120202337.1481402-1-surenb@google.com (mailing list archive)
State New
Headers show
Series [1/1] process_madvise.2: Add process_madvise man page | expand

Commit Message

Suren Baghdasaryan Jan. 20, 2021, 8:23 p.m. UTC
Initial version of process_madvise(2) manual page. Initial text was
extracted from [1], amended after fix [2] and more details added using
man pages of madvise(2) and process_vm_read(2) as examples. It also
includes the changes to required permission proposed in [3].

[1] https://lore.kernel.org/patchwork/patch/1297933/
[2] https://lkml.org/lkml/2020/12/8/1282
[3] https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-surenb@google.com/#23888311

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---

Adding the plane text version for ease of review:

NAME
    process_madvise - give advice about use of memory to a process

SYNOPSIS
    #include <sys/uio.h>

    ssize_t process_madvise(int pidfd,
                           const struct iovec *iovec,
                           unsigned long vlen,
                           int advice,
                           unsigned int flags);

DESCRIPTION
    The process_madvise() system call is used to give advice or directions to
    the kernel about the address ranges from external process as well as local
    process. It provides the advice to address ranges of process described by
    iovec and vlen. The goal of such advice is to improve system or application
    performance.

    The pidfd selects the process referred to by the PID file descriptor
    specified in pidfd. (see pidofd_open(2) for further information).

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

    struct iovec {
        void  *iov_base;    /* Starting address */
        size_t iov_len;     /* Number of bytes to transfer */
    };

    The iovec describes address ranges beginning at iov_base address and with
    the size of iov_len bytes.

    The vlen represents the number of elements in iovec.

    The advice can be one of the values listed below.

  Linux-specific advice values
    The following Linux-specific advice values have no counterparts in the
    POSIX-specified posix_madvise(3), and may or may not have counterparts in
    the madvise() interface available on other implementations.

    MADV_COLD (since Linux 5.4.1)
        Deactivate a given range of pages by moving them from active to
        inactive LRU list. This is done to accelerate the reclaim of these
        pages. The advice might be ignored for some pages in the range when it
        is not applicable.
    MADV_PAGEOUT (since Linux 5.4.1)
        Reclaim a given range of pages. This is done to free up memory occupied
        by these pages. If a page is anonymous it will be swapped out. If a
        page is file-backed and dirty it will be written back into the backing
        storage. The advice might be ignored for some pages in the range when
        it is not applicable.

    The flags argument is reserved for future use; currently, this argument must
    be specified as 0.

    The value specified in the vlen argument must be less than or equal to
    IOV_MAX (defined in <limits.h> or accessible via the call
    sysconf(_SC_IOV_MAX)).

    The vlen and iovec arguments are checked before applying any hints. If the
    vlen is too big, or iovec is invalid, an error will be returned
    immediately.

    Hint might be applied to a part of iovec if one of its elements points to
    an invalid memory region in the remote process. No further elements will be
    processed beyond that point.

    Permission to provide a hint to external process is governed by a ptrace
    access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2) and
    CAP_SYS_ADMIN capability that caller should have in order to affect
    performance of an external process.

RETURN VALUE
    On success, process_madvise() returns the number of bytes advised. This
    return value may be less than the total number of requested bytes, if an
    error occurred. The caller should check return value to determine whether
    a partial advice occurred.
ERRORS
    EFAULT The memory described by iovec is outside the accessible address
           space of the process pid.
    EINVAL flags is not 0.
    EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
    EINVAL vlen is too large.
    ENOMEM Could not allocate memory for internal copies of the iovec
           structures.
    EPERM The caller does not have permission to access the address space of
          the process pidfd.
    ESRCH No process with ID pidfd exists.

VERSIONS
    Since Linux 5.10, support for this system call is optional, depending on
    the setting of the CONFIG_ADVISE_SYSCALLS configuration option.

SEE ALSO
    madvise(2), pidofd_open(2), process_vm_readv(2), process_vm_write(2)

 man2/process_madvise.2 | 208 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)
 create mode 100644 man2/process_madvise.2

Comments

Michal Hocko Jan. 25, 2021, 1:19 p.m. UTC | #1
On Wed 20-01-21 12:23:37, Suren Baghdasaryan wrote:
[...]
>     MADV_COLD (since Linux 5.4.1)
>         Deactivate a given range of pages by moving them from active to
>         inactive LRU list. This is done to accelerate the reclaim of these
>         pages. The advice might be ignored for some pages in the range when it
>         is not applicable.

I do not think we want to talk about active/inactive LRU lists here.
Wouldn't it be sufficient to say
Deactive a given range of pages which will make them a more probable
reclaim target should there be a memory pressure. This is a
non-destructive operation.

Other than that, looks good to me from the content POV.

Thanks!
Suren Baghdasaryan Jan. 26, 2021, 12:14 a.m. UTC | #2
On Mon, Jan 25, 2021 at 5:19 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Wed 20-01-21 12:23:37, Suren Baghdasaryan wrote:
> [...]
> >     MADV_COLD (since Linux 5.4.1)
> >         Deactivate a given range of pages by moving them from active to
> >         inactive LRU list. This is done to accelerate the reclaim of these
> >         pages. The advice might be ignored for some pages in the range when it
> >         is not applicable.
>
> I do not think we want to talk about active/inactive LRU lists here.
> Wouldn't it be sufficient to say
> Deactive a given range of pages which will make them a more probable
> reclaim target should there be a memory pressure. This is a
> non-destructive operation.

That sounds better. Will update in the next version.

>
> Other than that, looks good to me from the content POV.
>
> Thanks!

Thanks for the review Michal!

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Michael Kerrisk (man-pages) Jan. 28, 2021, 12:24 p.m. UTC | #3
Hello Suren,

Thank you for writing this page! Some comments below.

On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan <surenb@google.com> wrote:
>
> Initial version of process_madvise(2) manual page. Initial text was
> extracted from [1], amended after fix [2] and more details added using
> man pages of madvise(2) and process_vm_read(2) as examples. It also
> includes the changes to required permission proposed in [3].
>
> [1] https://lore.kernel.org/patchwork/patch/1297933/
> [2] https://lkml.org/lkml/2020/12/8/1282
> [3] https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-surenb@google.com/#23888311
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>
> Adding the plane text version for ease of review:

Thanks for adding the rendered version. I will make my comments
against the source, below.

> NAME
>     process_madvise - give advice about use of memory to a process
>
> SYNOPSIS
>     #include <sys/uio.h>
>
>     ssize_t process_madvise(int pidfd,
>                            const struct iovec *iovec,
>                            unsigned long vlen,
>                            int advice,
>                            unsigned int flags);
>
> DESCRIPTION
>     The process_madvise() system call is used to give advice or directions to
>     the kernel about the address ranges from external process as well as local
>     process. It provides the advice to address ranges of process described by
>     iovec and vlen. The goal of such advice is to improve system or application
>     performance.
>
>     The pidfd selects the process referred to by the PID file descriptor
>     specified in pidfd. (see pidofd_open(2) for further information).
>
>     The pointer iovec points to an array of iovec structures, defined in
>     <sys/uio.h> as:
>
>     struct iovec {
>         void  *iov_base;    /* Starting address */
>         size_t iov_len;     /* Number of bytes to transfer */
>     };
>
>     The iovec describes address ranges beginning at iov_base address and with
>     the size of iov_len bytes.
>
>     The vlen represents the number of elements in iovec.
>
>     The advice can be one of the values listed below.
>
>   Linux-specific advice values
>     The following Linux-specific advice values have no counterparts in the
>     POSIX-specified posix_madvise(3), and may or may not have counterparts in
>     the madvise() interface available on other implementations.
>
>     MADV_COLD (since Linux 5.4.1)
>         Deactivate a given range of pages by moving them from active to
>         inactive LRU list. This is done to accelerate the reclaim of these
>         pages. The advice might be ignored for some pages in the range when it
>         is not applicable.
>     MADV_PAGEOUT (since Linux 5.4.1)
>         Reclaim a given range of pages. This is done to free up memory occupied
>         by these pages. If a page is anonymous it will be swapped out. If a
>         page is file-backed and dirty it will be written back into the backing
>         storage. The advice might be ignored for some pages in the range when
>         it is not applicable.
>
>     The flags argument is reserved for future use; currently, this argument must
>     be specified as 0.
>
>     The value specified in the vlen argument must be less than or equal to
>     IOV_MAX (defined in <limits.h> or accessible via the call
>     sysconf(_SC_IOV_MAX)).
>
>     The vlen and iovec arguments are checked before applying any hints. If the
>     vlen is too big, or iovec is invalid, an error will be returned
>     immediately.
>
>     Hint might be applied to a part of iovec if one of its elements points to
>     an invalid memory region in the remote process. No further elements will be
>     processed beyond that point.
>
>     Permission to provide a hint to external process is governed by a ptrace
>     access mode PTRACE_MODE_READ_REALCREDS check; see ptrace(2) and
>     CAP_SYS_ADMIN capability that caller should have in order to affect
>     performance of an external process.
>
> RETURN VALUE
>     On success, process_madvise() returns the number of bytes advised. This
>     return value may be less than the total number of requested bytes, if an
>     error occurred. The caller should check return value to determine whether
>     a partial advice occurred.

So there are three return values possible,
> ERRORS
>     EFAULT The memory described by iovec is outside the accessible address
>            space of the process pid.

s/pid/
of the process referred to by
.IR pidfd .

>     EINVAL flags is not 0.
>     EINVAL The sum of the iov_len values of iovec overflows a ssize_t value.
>     EINVAL vlen is too large.
>     ENOMEM Could not allocate memory for internal copies of the iovec
>            structures.
>     EPERM The caller does not have permission to access the address space of
>           the process pidfd.
>     ESRCH No process with ID pidfd exists.
>
> VERSIONS
>     Since Linux 5.10, support for this system call is optional, depending on
>     the setting of the CONFIG_ADVISE_SYSCALLS configuration option.
>
> SEE ALSO
>     madvise(2), pidofd_open(2), process_vm_readv(2), process_vm_write(2)
>
>  man2/process_madvise.2 | 208 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
>  create mode 100644 man2/process_madvise.2
>
> diff --git a/man2/process_madvise.2 b/man2/process_madvise.2
> new file mode 100644
> index 000000000..9bb5cb5ed
> --- /dev/null
> +++ b/man2/process_madvise.2
> @@ -0,0 +1,208 @@
> +.\" Copyright (C) 2021 Suren Baghdasaryan <surenb@google.com>
> +.\" and Copyright (C) 2021 Minchan Kim <minchan@kernel.org>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.\" Commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
> +.\"
> +.TH PROCESS_MADVISE 2 2021-01-12 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +process_madvise \- give advice about use of memory to a process
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/uio.h>
> +.PP
> +.BI "ssize_t process_madvise(int " pidfd ,
> +.BI "                       const struct iovec *" iovec ,
> +.BI "                       unsigned long " vlen ,
> +.BI "                       int " advice ,
> +.BI "                       unsigned int " flags ");"
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR process_madvise()
> +system call is used to give advice or directions
> +to the kernel about the address ranges from external process as well as

s/from external/of other/

> +local process. It provides the advice to address ranges of process

s/local/of the calling/

Please start new sentence on new lines. (See the discussion of
semantic newlines in man-pages(7).)

> +described by
> +.I iovec
> +and
> +.I vlen\.
> +The goal of such advice is to improve system or application performance.
> +.PP
> +The
> +.I pidfd
> +selects the process referred to by the PID file descriptor
> +specified in pidfd. (see
> +.BR pidofd_open(2)
> +for further information).

Rewrite the previous as:

[[
The
.I pidfd
argument is a PID file descriptor (see
.BR pidofd_open (2))
that specifies the process to which the advice is to be applied.

> +.PP
> +The pointer
> +.I iovec
> +points to an array of iovec structures, defined in

"iovec" should be formatted as

.I iovec

> +.IR <sys/uio.h>
> +as:
> +.PP
> +.in +4n
> +.EX
> +struct iovec {
> +    void  *iov_base;    /* Starting address */
> +    size_t iov_len;     /* Number of bytes to transfer */
> +};
> +.EE
> +.in
> +.PP
> +The
> +.I iovec
> +describes address ranges beginning at

s/describes/structure describes/

> +.I iov_base
> +address and with the size of
> +.I iov_len
> +bytes.
> +.PP
> +The
> +.I vlen
> +represents the number of elements in
> +.I iovec\.

==>
the
.IR iovec
structure.
> +.PP
> +The
> +.I advice
> +can be one of the values listed below.

s/can be/argument is/

> +.\"
> +.\" ======================================================================
> +.\"
> +.SS Linux-specific advice values
> +The following Linux-specific
> +.I advice
> +values have no counterparts in the POSIX-specified
> +.BR posix_madvise (3),
> +and may or may not have counterparts in the
> +.BR madvise ()
> +interface available on other implementations.
> +.TP
> +.BR MADV_COLD " (since Linux 5.4.1)"
> +.\" commit 9c276cc65a58faf98be8e56962745ec99ab87636
> +Deactivate a given range of pages by moving them from active to inactive
> +LRU list. This is done to accelerate the reclaim of these pages. The advice

New sentences on new lines.

> +might be ignored for some pages in the range when it is not applicable.
> +.TP
> +.BR MADV_PAGEOUT " (since Linux 5.4.1)"
> +.\" commit 1a4e58cce84ee88129d5d49c064bd2852b481357
> +Reclaim a given range of pages. This is done to free up memory occupied by
> +these pages. If a page is anonymous it will be swapped out. If a page is
> +file-backed and dirty it will be written back into the backing storage.

s/into/to/

> +The advice might be ignored for some pages in the range when it is not
> +applicable.
> +.PP
> +The
> +.I flags
> +argument is reserved for future use; currently, this argument must be
> +specified as 0.
> +.PP
> +The value specified in the
> +.I vlen
> +argument must be less than or equal to
> +.BR IOV_MAX
> +(defined in
> +.I <limits.h>
> +or accessible via the call
> +.IR sysconf(_SC_IOV_MAX) ).
> +.PP
> +The
> +.I vlen
> +and
> +.I iovec
> +arguments are checked before applying any hints.
> +If the
> +.I vlen
> +is too big, or
> +.I iovec
> +is invalid, an error will be returned immediately.
> +.PP
> +Hint might be applied to a part of

s/Hint/The hint/

> +.I iovec
> +if one of its elements points to an invalid memory
> +region in the remote process. No further elements will be
> +processed beyond that point.
> +.PP
> +Permission to provide a hint to external process is governed by a
> +ptrace access mode
> +.B PTRACE_MODE_READ_REALCREDS
> +check; see
> +.BR ptrace (2)
> +and
> +.B CAP_SYS_ADMIN
> +capability that caller should have in order to affect performance
> +of an external process.

The preceding sentence is garbled. Missing words?

> +.SH RETURN VALUE
> +On success, process_madvise() returns the number of bytes advised.
> +This return value may be less than the total number of requested
> +bytes, if an error occurred. The caller should check return value
> +to determine whether a partial advice occurred.
> +.SH ERRORS
> +.TP
> +.B EFAULT
> +The memory described by
> +.I iovec
> +is outside the accessible address space of the process pid.

s/process pid./
the process referred to by
.IR pidfd .
/

> +.TP
> +.B EINVAL
> +.I flags
> +is not 0.
> +.TP
> +.B EINVAL
> +The sum of the
> +.I iov_len
> +values of
> +.I iovec
> +overflows a ssize_t value.

.I ssize_t

> +.TP
> +.B EINVAL
> +.I vlen
> +is too large.
> +.TP
> +.B ENOMEM
> +Could not allocate memory for internal copies of the
> +.I iovec
> +structures.
> +.TP
> +.B EPERM
> +The caller does not have permission to access the address space of the process
> +.I pidfd.

.IR pidfd .

> +.TP
> +.B ESRCH
> +No process with ID
> +.I pidfd
> +exists.

Should this maybe be:
[[
The target process does not exist (i.e., it has terminated and
been waited on).
]]

See pidfd_send_signal(2).

Also, is an EBADF error possible? Again, see pidfd_send_signal(2).

> +.SH VERSIONS
> +Since Linux 5.10,

Better: This system call first appeared in Linux 5.10.

> +.\" commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
> +support for this system call is optional,

s/support/Support/

> +depending on the setting of the
> +.B CONFIG_ADVISE_SYSCALLS
> +configuration option.
> +.SH SEE ALSO
> +.BR madvise (2),
> +.BR pidofd_open(2),
> +.BR process_vm_readv (2),
> +.BR process_vm_write (2)

Thanks,

Michael
Michael Kerrisk (man-pages) Jan. 28, 2021, 8:31 p.m. UTC | #4
Hello Suren,

On 1/28/21 7:40 PM, Suren Baghdasaryan wrote:
> On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>
>> Hello Suren,
>>
>> Thank you for writing this page! Some comments below.
> 
> Thanks for the review!
> Couple questions below and I'll respin the new version once they are clarified.

Okay. See below.

>> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan <surenb@google.com> wrote:
>>>

[...]

Thanks for all the acks. That let's me know that you saw what I said.

>>> RETURN VALUE
>>>     On success, process_madvise() returns the number of bytes advised. This
>>>     return value may be less than the total number of requested bytes, if an
>>>     error occurred. The caller should check return value to determine whether
>>>     a partial advice occurred.
>>
>> So there are three return values possible,
> 
> Ok, I think I see your point. How about this instead:

Well, I'm glad you saw it, because I forgot to finish it. But yes,
you understood what I forgot to say.

> RETURN VALUE
>      On success, process_madvise() returns the number of bytes advised. This
>      return value may be less than the total number of requested bytes, if an
>      error occurred after some iovec elements were already processed. The caller
>      should check the return value to determine whether a partial
> advice occurred.
> 
>     On error, -1 is returned and errno is set appropriately.

We recently standardized some wording here:
s/appropriately/to indicate the error/.


>>> +.PP
>>> +The pointer
>>> +.I iovec
>>> +points to an array of iovec structures, defined in
>>
>> "iovec" should be formatted as
>>
>> .I iovec
> 
> I think it is formatted that way above. What am I missing?

But also in "an array of iovec structures"...

> BTW, where should I be using .I vs .IR? I was looking for an answer
> but could not find it.

.B / .I == bold/italic this line
.BR / .IR == alternate bold/italic with normal (Roman) font.

So:
.I iovec
.I iovec ,       # so that comma is not italic
.BR process_madvise ()
etc.

[...]

>>> +.I iovec
>>> +if one of its elements points to an invalid memory
>>> +region in the remote process. No further elements will be
>>> +processed beyond that point.
>>> +.PP
>>> +Permission to provide a hint to external process is governed by a
>>> +ptrace access mode
>>> +.B PTRACE_MODE_READ_REALCREDS
>>> +check; see
>>> +.BR ptrace (2)
>>> +and
>>> +.B CAP_SYS_ADMIN
>>> +capability that caller should have in order to affect performance
>>> +of an external process.
>>
>> The preceding sentence is garbled. Missing words?
> 
> Maybe I worded it incorrectly. What I need to say here is that the
> caller should have both PTRACE_MODE_READ_REALCREDS credentials and
> CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted
> from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and
> tried adding the second one to it, obviously unsuccessfully. Any
> advice on how to fix that?

I think you already got pretty close. How about:

[[
Permission to provide a hint to another process is governed by a
ptrace access mode
.B PTRACE_MODE_READ_REALCREDS
check (see
BR ptrace (2));
in addition, the caller must have the
.B CAP_SYS_ADMIN
capability.
]]

[...]

>>> +.TP
>>> +.B ESRCH
>>> +No process with ID
>>> +.I pidfd
>>> +exists.
>>
>> Should this maybe be:
>> [[
>> The target process does not exist (i.e., it has terminated and
>> been waited on).
>> ]]
>>
>> See pidfd_send_signal(2).
> 
> I "borrowed" mine from
> https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but
> either one sounds good to me. Maybe for pidfd_send_signal the wording
> about termination is more important. Anyway, it's up to you. Just let
> me know which one to use.

I think the pidfd_send_signal(2) wording fits better.

[...]

Thanks,

Michael
Suren Baghdasaryan Jan. 28, 2021, 8:37 p.m. UTC | #5
On Thu, Jan 28, 2021 at 12:31 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello Suren,
>
> On 1/28/21 7:40 PM, Suren Baghdasaryan wrote:
> > On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> >>
> >> Hello Suren,
> >>
> >> Thank you for writing this page! Some comments below.
> >
> > Thanks for the review!
> > Couple questions below and I'll respin the new version once they are clarified.
>
> Okay. See below.
>
> >> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan <surenb@google.com> wrote:
> >>>
>
> [...]
>
> Thanks for all the acks. That let's me know that you saw what I said.
>
> >>> RETURN VALUE
> >>>     On success, process_madvise() returns the number of bytes advised. This
> >>>     return value may be less than the total number of requested bytes, if an
> >>>     error occurred. The caller should check return value to determine whether
> >>>     a partial advice occurred.
> >>
> >> So there are three return values possible,
> >
> > Ok, I think I see your point. How about this instead:
>
> Well, I'm glad you saw it, because I forgot to finish it. But yes,
> you understood what I forgot to say.
>
> > RETURN VALUE
> >      On success, process_madvise() returns the number of bytes advised. This
> >      return value may be less than the total number of requested bytes, if an
> >      error occurred after some iovec elements were already processed. The caller
> >      should check the return value to determine whether a partial
> > advice occurred.
> >
> >     On error, -1 is returned and errno is set appropriately.
>
> We recently standardized some wording here:
> s/appropriately/to indicate the error/.

ack

>
>
> >>> +.PP
> >>> +The pointer
> >>> +.I iovec
> >>> +points to an array of iovec structures, defined in
> >>
> >> "iovec" should be formatted as
> >>
> >> .I iovec
> >
> > I think it is formatted that way above. What am I missing?
>
> But also in "an array of iovec structures"...

ack

>
> > BTW, where should I be using .I vs .IR? I was looking for an answer
> > but could not find it.
>
> .B / .I == bold/italic this line
> .BR / .IR == alternate bold/italic with normal (Roman) font.
>
> So:
> .I iovec
> .I iovec ,       # so that comma is not italic
> .BR process_madvise ()
> etc.

Aha! Got it now. It's clear after your example. Thanks!

>
> [...]
>
> >>> +.I iovec
> >>> +if one of its elements points to an invalid memory
> >>> +region in the remote process. No further elements will be
> >>> +processed beyond that point.
> >>> +.PP
> >>> +Permission to provide a hint to external process is governed by a
> >>> +ptrace access mode
> >>> +.B PTRACE_MODE_READ_REALCREDS
> >>> +check; see
> >>> +.BR ptrace (2)
> >>> +and
> >>> +.B CAP_SYS_ADMIN
> >>> +capability that caller should have in order to affect performance
> >>> +of an external process.
> >>
> >> The preceding sentence is garbled. Missing words?
> >
> > Maybe I worded it incorrectly. What I need to say here is that the
> > caller should have both PTRACE_MODE_READ_REALCREDS credentials and
> > CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted
> > from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and
> > tried adding the second one to it, obviously unsuccessfully. Any
> > advice on how to fix that?
>
> I think you already got pretty close. How about:
>
> [[
> Permission to provide a hint to another process is governed by a
> ptrace access mode
> .B PTRACE_MODE_READ_REALCREDS
> check (see
> BR ptrace (2));
> in addition, the caller must have the
> .B CAP_SYS_ADMIN
> capability.
> ]]

Perfect! I'll use that.

>
> [...]
>
> >>> +.TP
> >>> +.B ESRCH
> >>> +No process with ID
> >>> +.I pidfd
> >>> +exists.
> >>
> >> Should this maybe be:
> >> [[
> >> The target process does not exist (i.e., it has terminated and
> >> been waited on).
> >> ]]
> >>
> >> See pidfd_send_signal(2).
> >
> > I "borrowed" mine from
> > https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but
> > either one sounds good to me. Maybe for pidfd_send_signal the wording
> > about termination is more important. Anyway, it's up to you. Just let
> > me know which one to use.
>
> I think the pidfd_send_signal(2) wording fits better.

ack, will use pidfd_send_signal(2) version.

>
> [...]
>
> Thanks,
>
> Michael

I'll include your and Michal's suggestions and will post the next
version later today or tomorrow morning.
Thanks for the guidance!

>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
Suren Baghdasaryan Jan. 29, 2021, 7:15 a.m. UTC | #6
On Thu, Jan 28, 2021 at 12:31 PM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello Suren,
>
> On 1/28/21 7:40 PM, Suren Baghdasaryan wrote:
> > On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> >>
> >> Hello Suren,
> >>
> >> Thank you for writing this page! Some comments below.
> >
> > Thanks for the review!
> > Couple questions below and I'll respin the new version once they are clarified.
>
> Okay. See below.
>
> >> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan <surenb@google.com> wrote:
> >>>
>
> [...]
>
> Thanks for all the acks. That let's me know that you saw what I said.
>
> >>> RETURN VALUE
> >>>     On success, process_madvise() returns the number of bytes advised. This
> >>>     return value may be less than the total number of requested bytes, if an
> >>>     error occurred. The caller should check return value to determine whether
> >>>     a partial advice occurred.
> >>
> >> So there are three return values possible,
> >
> > Ok, I think I see your point. How about this instead:
>
> Well, I'm glad you saw it, because I forgot to finish it. But yes,
> you understood what I forgot to say.
>
> > RETURN VALUE
> >      On success, process_madvise() returns the number of bytes advised. This
> >      return value may be less than the total number of requested bytes, if an
> >      error occurred after some iovec elements were already processed. The caller
> >      should check the return value to determine whether a partial
> > advice occurred.
> >
> >     On error, -1 is returned and errno is set appropriately.
>
> We recently standardized some wording here:
> s/appropriately/to indicate the error/.
>
>
> >>> +.PP
> >>> +The pointer
> >>> +.I iovec
> >>> +points to an array of iovec structures, defined in
> >>
> >> "iovec" should be formatted as
> >>
> >> .I iovec
> >
> > I think it is formatted that way above. What am I missing?
>
> But also in "an array of iovec structures"...
>
> > BTW, where should I be using .I vs .IR? I was looking for an answer
> > but could not find it.
>
> .B / .I == bold/italic this line
> .BR / .IR == alternate bold/italic with normal (Roman) font.
>
> So:
> .I iovec
> .I iovec ,       # so that comma is not italic
> .BR process_madvise ()
> etc.
>
> [...]
>
> >>> +.I iovec
> >>> +if one of its elements points to an invalid memory
> >>> +region in the remote process. No further elements will be
> >>> +processed beyond that point.
> >>> +.PP
> >>> +Permission to provide a hint to external process is governed by a
> >>> +ptrace access mode
> >>> +.B PTRACE_MODE_READ_REALCREDS
> >>> +check; see
> >>> +.BR ptrace (2)
> >>> +and
> >>> +.B CAP_SYS_ADMIN
> >>> +capability that caller should have in order to affect performance
> >>> +of an external process.
> >>
> >> The preceding sentence is garbled. Missing words?
> >
> > Maybe I worded it incorrectly. What I need to say here is that the
> > caller should have both PTRACE_MODE_READ_REALCREDS credentials and
> > CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted
> > from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and
> > tried adding the second one to it, obviously unsuccessfully. Any
> > advice on how to fix that?
>
> I think you already got pretty close. How about:
>
> [[
> Permission to provide a hint to another process is governed by a
> ptrace access mode
> .B PTRACE_MODE_READ_REALCREDS
> check (see
> BR ptrace (2));
> in addition, the caller must have the
> .B CAP_SYS_ADMIN
> capability.

In V2 I explanded a bit this part to explain why CAP_SYS_ADMIN is
needed. There were questions about that during my patch review which
adds this requirement
(https://lore.kernel.org/patchwork/patch/1363605), so I thought a
short explanation would be useful.

> ]]
>
> [...]
>
> >>> +.TP
> >>> +.B ESRCH
> >>> +No process with ID
> >>> +.I pidfd
> >>> +exists.
> >>
> >> Should this maybe be:
> >> [[
> >> The target process does not exist (i.e., it has terminated and
> >> been waited on).
> >> ]]
> >>
> >> See pidfd_send_signal(2).
> >
> > I "borrowed" mine from
> > https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but
> > either one sounds good to me. Maybe for pidfd_send_signal the wording
> > about termination is more important. Anyway, it's up to you. Just let
> > me know which one to use.
>
> I think the pidfd_send_signal(2) wording fits better.
>
> [...]
>
> Thanks,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
diff mbox series

Patch

diff --git a/man2/process_madvise.2 b/man2/process_madvise.2
new file mode 100644
index 000000000..9bb5cb5ed
--- /dev/null
+++ b/man2/process_madvise.2
@@ -0,0 +1,208 @@ 
+.\" Copyright (C) 2021 Suren Baghdasaryan <surenb@google.com>
+.\" and Copyright (C) 2021 Minchan Kim <minchan@kernel.org>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.\" Commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
+.\"
+.TH PROCESS_MADVISE 2 2021-01-12 "Linux" "Linux Programmer's Manual"
+.SH NAME
+process_madvise \- give advice about use of memory to a process
+.SH SYNOPSIS
+.nf
+.B #include <sys/uio.h>
+.PP
+.BI "ssize_t process_madvise(int " pidfd ,
+.BI "                       const struct iovec *" iovec ,
+.BI "                       unsigned long " vlen ,
+.BI "                       int " advice ,
+.BI "                       unsigned int " flags ");"
+.fi
+.SH DESCRIPTION
+The
+.BR process_madvise()
+system call is used to give advice or directions
+to the kernel about the address ranges from external process as well as
+local process. It provides the advice to address ranges of process
+described by
+.I iovec
+and
+.I vlen\.
+The goal of such advice is to improve system or application performance.
+.PP
+The
+.I pidfd
+selects the process referred to by the PID file descriptor
+specified in pidfd. (see
+.BR pidofd_open(2)
+for further information).
+.PP
+The pointer
+.I iovec
+points to an array of iovec structures, defined in
+.IR <sys/uio.h>
+as:
+.PP
+.in +4n
+.EX
+struct iovec {
+    void  *iov_base;    /* Starting address */
+    size_t iov_len;     /* Number of bytes to transfer */
+};
+.EE
+.in
+.PP
+The
+.I iovec
+describes address ranges beginning at
+.I iov_base
+address and with the size of
+.I iov_len
+bytes.
+.PP
+The
+.I vlen
+represents the number of elements in
+.I iovec\.
+.PP
+The
+.I advice
+can be one of the values listed below.
+.\"
+.\" ======================================================================
+.\"
+.SS Linux-specific advice values
+The following Linux-specific
+.I advice
+values have no counterparts in the POSIX-specified
+.BR posix_madvise (3),
+and may or may not have counterparts in the
+.BR madvise ()
+interface available on other implementations.
+.TP
+.BR MADV_COLD " (since Linux 5.4.1)"
+.\" commit 9c276cc65a58faf98be8e56962745ec99ab87636
+Deactivate a given range of pages by moving them from active to inactive
+LRU list. This is done to accelerate the reclaim of these pages. The advice
+might be ignored for some pages in the range when it is not applicable.
+.TP
+.BR MADV_PAGEOUT " (since Linux 5.4.1)"
+.\" commit 1a4e58cce84ee88129d5d49c064bd2852b481357
+Reclaim a given range of pages. This is done to free up memory occupied by
+these pages. If a page is anonymous it will be swapped out. If a page is
+file-backed and dirty it will be written back into the backing storage.
+The advice might be ignored for some pages in the range when it is not
+applicable.
+.PP
+The
+.I flags
+argument is reserved for future use; currently, this argument must be
+specified as 0.
+.PP
+The value specified in the
+.I vlen
+argument must be less than or equal to
+.BR IOV_MAX
+(defined in
+.I <limits.h>
+or accessible via the call
+.IR sysconf(_SC_IOV_MAX) ).
+.PP
+The
+.I vlen
+and
+.I iovec
+arguments are checked before applying any hints.
+If the
+.I vlen
+is too big, or
+.I iovec
+is invalid, an error will be returned immediately.
+.PP
+Hint might be applied to a part of
+.I iovec
+if one of its elements points to an invalid memory
+region in the remote process. No further elements will be
+processed beyond that point.
+.PP
+Permission to provide a hint to external process is governed by a
+ptrace access mode
+.B PTRACE_MODE_READ_REALCREDS
+check; see
+.BR ptrace (2)
+and
+.B CAP_SYS_ADMIN
+capability that caller should have in order to affect performance
+of an external process.
+.SH RETURN VALUE
+On success, process_madvise() returns the number of bytes advised.
+This return value may be less than the total number of requested
+bytes, if an error occurred. The caller should check return value
+to determine whether a partial advice occurred.
+.SH ERRORS
+.TP
+.B EFAULT
+The memory described by
+.I iovec
+is outside the accessible address space of the process pid.
+.TP
+.B EINVAL
+.I flags
+is not 0.
+.TP
+.B EINVAL
+The sum of the
+.I iov_len
+values of
+.I iovec
+overflows a ssize_t value.
+.TP
+.B EINVAL
+.I vlen
+is too large.
+.TP
+.B ENOMEM
+Could not allocate memory for internal copies of the
+.I iovec
+structures.
+.TP
+.B EPERM
+The caller does not have permission to access the address space of the process
+.I pidfd.
+.TP
+.B ESRCH
+No process with ID
+.I pidfd
+exists.
+.SH VERSIONS
+Since Linux 5.10,
+.\" commit ecb8ac8b1f146915aa6b96449b66dd48984caacc
+support for this system call is optional,
+depending on the setting of the
+.B CONFIG_ADVISE_SYSCALLS
+configuration option.
+.SH SEE ALSO
+.BR madvise (2),
+.BR pidofd_open(2),
+.BR process_vm_readv (2),
+.BR process_vm_write (2)