diff mbox series

refpage_create.2: Document refpage_create(2)

Message ID 20210717025951.3946505-1-pcc@google.com (mailing list archive)
State New
Headers show
Series refpage_create.2: Document refpage_create(2) | expand

Commit Message

Peter Collingbourne July 17, 2021, 2:59 a.m. UTC
---
The syscall has not landed in the kernel yet.
Therefore, as usual, the patch should not be taken yet
and I've used 5.x as the introducing kernel version for now.

 man2/refpage_create.2 | 167 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 man2/refpage_create.2

Comments

Alejandro Colomar July 29, 2021, 12:09 p.m. UTC | #1
Hi Peter,

On 7/17/21 4:59 AM, Peter Collingbourne wrote:
> ---
> The syscall has not landed in the kernel yet.
> Therefore, as usual, the patch should not be taken yet
> and I've used 5.x as the introducing kernel version for now.

Thanks!  Please see a few comments below.
Apart from formatting and code issues I noted,
the text looks good to me.

Please, ping us when this is merged in the kernel :)

Regards,

Alex

> 
>   man2/refpage_create.2 | 167 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 167 insertions(+)
>   create mode 100644 man2/refpage_create.2
> 
> diff --git a/man2/refpage_create.2 b/man2/refpage_create.2
> new file mode 100644
> index 000000000..c0b928b92
> --- /dev/null
> +++ b/man2/refpage_create.2
> @@ -0,0 +1,167 @@
> +.\" Copyright (C) 2021 Google LLC
> +.\" Author: Peter Collingbourne <pcc@google.com>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH REFPAGE_CREATE 2 2021-07-16 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +refpage_create \- create a reference page file descriptor
> +.SH SYNOPSIS
> +.nf
> +.BR "#include <unistd.h>"
> +.PP
> +.BI "int syscall(SYS_refpage_create, void *" content ", unsigned int " size ,
> +.BI "            unsigned long " flags ");"
> +.fi
> +.PP
> +.IR Note :
> +glibc provides no wrapper for
> +.BR refpage_create (),
> +necessitating the use of
> +.BR syscall (2).
> +.SH DESCRIPTION
> +The
> +.BR refpage_create ()
> +system call is used to create a file descriptor
> +that conceptually refers to a read-only file
> +whose contents are an infinite repetition of
> +.I size
> +bytes of data read from the
> +.I content
> +argument to the system call,
> +and which may be mapped into memory with
> +.BR mmap (2).
> +The file descriptor is created as if by passing
> +.BR O_RDONLY | O_CLOEXEC
> +to
> +.BR open (2).
> +.PP
> +In reality, any read-only pages in the mapping are backed
> +by a so-called reference page,
> +whose contents are specified using the arguments to
> +.BR refpage_create ().
> +.PP
> +The reference page will consist of repetitions of
> +.I size
> +bytes read
> +from
> +.IR content ,
> +as many as are required to fill the page. The
> +.I size
> +argument must be a power of two less than or equal to the page size, and the
> +.I content
> +argument must have at least
> +.I size
> +alignment. The behavior is as if a copy of this data

s/\. /.\n/

Rationale: semantic newlines.

> +is made while servicing the system call;
> +any updates to the data after the system call has returned
> +will not be reflected in the reference page.
> +.PP
> +If the architecture specifies that // metadata may be associated /J/

Please, use semantic newlines (see man-pages(7))

> +with memory addresses, // that metadata if present is copied
> +into the reference page along with the data itself,
> +but only if the size argument is at least as large
> +as the granularity of the metadata.
> +For example, with the ARMv8.5 Memory Tagging Extension,
> +the memory tags are copied, // but only if the size is greater than /J/
> +or equal to // the architecturally specified tag granule size of 16 bytes.
> +.PP
> +Writable private mappings trigger specific copy-on-write behavior
> +when a page in the mapping is written to.
> +The behavior is as if the reference page is copied,
> +but the kernel may use a more efficient technique such as
> +.BR memset (3)
> +to produce the copy if the
> +.I size
> +argument originally used to create the reference page file descriptor
> +is sufficiently small.
> +For this reason it is recommended to specify as small of a
> +.I size
> +argument as possible
> +in order to activate any such optimizations implemented in the kernel.
> +.PP
> +The advantage of using this system call
> +over creating normal anonymous mappings
> +and manually initializing the pages from userspace
> +is that it is more efficient.
> +If it is not known that all of the pages in the mapping
> +will be faulted (for example, if the system call is used
> +by a general purpose memory allocator
> +where the behavior of the client program is unknown),
> +letting the pages be prepared on fault only if needed
> +is more efficient from both a performance
> +and memory consumption perspective.
> +Even if all of the pages would end up being faulted,
> +it would still be more efficient
> +to have the kernel initialize the pages with the required contents once
> +than to have the kernel zero initialize them on fault
> +and then have userspace initialize them again with different contents.
> +.SH EXAMPLES
> +The following program creates a 128KB memory mapping

The SI mandates that a space shall be inserted between a number and the 
associated unit.

Also, if it really means 128 KiB, which I guess, please use KiB.  See 
units(7).

Use a non-breaking space to make sure that the unit goes with the number.

With all that, it would be:

... creates a 128\ KiB memory ...

> +preinitialized with the pattern byte 0xAA
> +and verifies that the contents of the mapping are correct.
> +.PP
> +.EX
> +#include <linux/unistd.h>
> +#include <stdio.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +int main() {
> +    unsigned char pattern = 0xaa;

Please use capital AA to help visually differentiate x and a.

> +    unsigned long mmap_size = 131072;

Why that magic number?
Maybe a shift to indicate that it's a power of 2...  or 128 * 1024...
I don't know from the top of my head powers of 2 that high :)

Also, why 'unsigned long'?  The SYNOPSIS says it's an 'unsigned int'.

> +
> +    int fd = syscall(SYS_refpage_create, &pattern, 1, 0);

Please use sizeof(pattern) instead of 1 to communicate the relationship 
between them.

> +    if (fd < 0) {
> +        perror("refpage_create");
> +        return 1;

Please use EXIT_FAILURE (<stdlib.h>).  Also use exit(3) instead of 
return, as is common practice in manual pages.

> +    }
> +    unsigned char *p = mmap(0, mmap_size, PROT_READ | PROT_WRITE,

Use NULL instead of 0 for pointers.  The first argument of mmap(2) is 
'void *addr'.

> +                            MAP_PRIVATE, fd, 0);
> +    if (p == MAP_FAILED) {
> +        perror("mmap");
> +        return 1;
> +    }
> +    for (unsigned i = 0; i != mmap_size; ++i) {

s/unsigned/unsigned int/

> +        if (p[i] != pattern) {
> +            fprintf(stderr, "refpage failed contents check @ %u: "
> +                    "0x%x != 0x%x\n",

I prefer 0x%X, which is already in use in some manual pages (seccomp(2)).

Also, 'i' may be more readable in hex, given it's an offset of an 
address (actually the concept of a size_t, even if the kernel doesn't 
use that type) don't you think?

> +                    i, p[i], pattern);
> +            return 1;

exit(3)

> +        }
> +    }
> +}
> +.EE
> +.SH NOTE
> +Reading from a reference page file descriptor, e.g. with
> +.BR read (2),
> +is not supported, nor would this be particularly useful.
> +.SH VERSIONS
> +This system call first appeared in Linux 5.x. > +.SH CONFORMING TO
> +The
> +.BR refpage_create ()
> +system call is Linux-specific.
> +.SH SEE ALSO
> +.BR mmap (2),
> +.BR open (2).
>
diff mbox series

Patch

diff --git a/man2/refpage_create.2 b/man2/refpage_create.2
new file mode 100644
index 000000000..c0b928b92
--- /dev/null
+++ b/man2/refpage_create.2
@@ -0,0 +1,167 @@ 
+.\" Copyright (C) 2021 Google LLC
+.\" Author: Peter Collingbourne <pcc@google.com>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH REFPAGE_CREATE 2 2021-07-16 "Linux" "Linux Programmer's Manual"
+.SH NAME
+refpage_create \- create a reference page file descriptor
+.SH SYNOPSIS
+.nf
+.BR "#include <unistd.h>"
+.PP
+.BI "int syscall(SYS_refpage_create, void *" content ", unsigned int " size ,
+.BI "            unsigned long " flags ");"
+.fi
+.PP
+.IR Note :
+glibc provides no wrapper for
+.BR refpage_create (),
+necessitating the use of
+.BR syscall (2).
+.SH DESCRIPTION
+The
+.BR refpage_create ()
+system call is used to create a file descriptor
+that conceptually refers to a read-only file
+whose contents are an infinite repetition of
+.I size
+bytes of data read from the
+.I content
+argument to the system call,
+and which may be mapped into memory with
+.BR mmap (2).
+The file descriptor is created as if by passing
+.BR O_RDONLY | O_CLOEXEC
+to
+.BR open (2).
+.PP
+In reality, any read-only pages in the mapping are backed
+by a so-called reference page,
+whose contents are specified using the arguments to
+.BR refpage_create ().
+.PP
+The reference page will consist of repetitions of
+.I size
+bytes read
+from
+.IR content ,
+as many as are required to fill the page. The
+.I size
+argument must be a power of two less than or equal to the page size, and the
+.I content
+argument must have at least
+.I size
+alignment. The behavior is as if a copy of this data
+is made while servicing the system call;
+any updates to the data after the system call has returned
+will not be reflected in the reference page.
+.PP
+If the architecture specifies that metadata may be associated
+with memory addresses, that metadata if present is copied
+into the reference page along with the data itself,
+but only if the size argument is at least as large
+as the granularity of the metadata.
+For example, with the ARMv8.5 Memory Tagging Extension,
+the memory tags are copied, but only if the size is greater than
+or equal to the architecturally specified tag granule size of 16 bytes.
+.PP
+Writable private mappings trigger specific copy-on-write behavior
+when a page in the mapping is written to.
+The behavior is as if the reference page is copied,
+but the kernel may use a more efficient technique such as
+.BR memset (3)
+to produce the copy if the
+.I size
+argument originally used to create the reference page file descriptor
+is sufficiently small.
+For this reason it is recommended to specify as small of a
+.I size
+argument as possible
+in order to activate any such optimizations implemented in the kernel.
+.PP
+The advantage of using this system call
+over creating normal anonymous mappings
+and manually initializing the pages from userspace
+is that it is more efficient.
+If it is not known that all of the pages in the mapping
+will be faulted (for example, if the system call is used
+by a general purpose memory allocator
+where the behavior of the client program is unknown),
+letting the pages be prepared on fault only if needed
+is more efficient from both a performance
+and memory consumption perspective.
+Even if all of the pages would end up being faulted,
+it would still be more efficient
+to have the kernel initialize the pages with the required contents once
+than to have the kernel zero initialize them on fault
+and then have userspace initialize them again with different contents.
+.SH EXAMPLES
+The following program creates a 128KB memory mapping
+preinitialized with the pattern byte 0xAA
+and verifies that the contents of the mapping are correct.
+.PP
+.EX
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+int main() {
+    unsigned char pattern = 0xaa;
+    unsigned long mmap_size = 131072;
+
+    int fd = syscall(SYS_refpage_create, &pattern, 1, 0);
+    if (fd < 0) {
+        perror("refpage_create");
+        return 1;
+    }
+    unsigned char *p = mmap(0, mmap_size, PROT_READ | PROT_WRITE,
+                            MAP_PRIVATE, fd, 0);
+    if (p == MAP_FAILED) {
+        perror("mmap");
+        return 1;
+    }
+    for (unsigned i = 0; i != mmap_size; ++i) {
+        if (p[i] != pattern) {
+            fprintf(stderr, "refpage failed contents check @ %u: "
+                    "0x%x != 0x%x\n",
+                    i, p[i], pattern);
+            return 1;
+        }
+    }
+}
+.EE
+.SH NOTE
+Reading from a reference page file descriptor, e.g. with
+.BR read (2),
+is not supported, nor would this be particularly useful.
+.SH VERSIONS
+This system call first appeared in Linux 5.x.
+.SH CONFORMING TO
+The
+.BR refpage_create ()
+system call is Linux-specific.
+.SH SEE ALSO
+.BR mmap (2),
+.BR open (2).