[v4] mm: introduce reference pages

Message ID 20210619092002.1791322-1-pcc@google.com (mailing list archive)
State New
Series [v4] mm: introduce reference pages

Commit Message

Peter Collingbourne June 19, 2021, 9:20 a.m. UTC
Introduce a new syscall, refpage_create, which returns a file
descriptor which may be mapped using mmap. Such a mapping is similar
to an anonymous mapping, but instead of clean pages being backed by the
zero page, they are instead backed by a so-called reference page, whose
contents are specified using an argument to refpage_create. Loads from
the mapping will load directly from the reference page, and initial
stores to the mapping will copy-on-write from the reference page.

Reference pages are useful in circumstances where anonymous mappings
combined with manual stores to memory would impose undesirable costs,
either in terms of performance or RSS. Use cases are focused on heap
allocators and include:

- Pattern initialization for the heap. This is where malloc(3) gives
  you memory whose contents are filled with a non-zero pattern
  byte, in order to help detect and mitigate bugs involving use
  of uninitialized memory. Typically this is implemented by having
  the allocator memset the allocation with the pattern byte before
  returning it to the user, but for large allocations this can result
  in a significant increase in RSS, especially for allocations that
  are used sparsely. Even for dense allocations there is a needless
  impact to startup performance when it may be better to amortize it
  throughout the program. By creating allocations using a reference
  page filled with the pattern byte, we can avoid these costs.

- Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
  feature which allows for memory to be tagged in order to detect
  certain kinds of memory errors with low overhead. In order to set
  up an allocation to allow memory errors to be detected, the entire
  allocation needs to have the same tag. The issue here is similar to
  pattern initialization in the sense that large tagged allocations
  will be expensive if the tagging is done up front. The idea is that
  the allocator would create reference pages with each of the possible
  memory tags, and use those reference pages for the large allocations.

This patch includes specific optimizations for these use cases in
order to reduce memory traffic. If the reference page consists of a
single repeating byte, the page is initialized using memset, and on
arm64 if the reference page consists of a uniformly tagged zero page,
the DC GZVA instruction is used to initialize the page.
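
The single-byte optimization described above can be approximated in userspace as follows. This is only a minimal sketch, not the patch's code: struct refpage_sketch, refpage_sketch_prepare() and refpage_sketch_populate() are hypothetical names, and a 4 KiB page size is assumed, as in the benchmark below. The reference page is scanned once when it is created; each later page fill is then either a memset (which only writes the destination) or a full copy (which also reads the reference page).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumption: 4 KiB pages, as in the benchmark */

/* Hypothetical userspace stand-in for the per-refpage state kept by the patch. */
struct refpage_sketch {
	const unsigned char *page; /* the reference page contents */
	bool uniform;              /* true if every byte equals 'pattern' */
	unsigned char pattern;     /* only meaningful when 'uniform' is set */
};

/* Scan the reference page once, at creation time, and cache the result. */
void refpage_sketch_prepare(struct refpage_sketch *rp, const unsigned char *page)
{
	assert(rp != NULL && page != NULL);
	rp->page = page;
	rp->pattern = page[0];
	rp->uniform = true;
	for (size_t i = 1; i < SKETCH_PAGE_SIZE; i++) {
		if (page[i] != page[0]) {
			rp->uniform = false;
			break;
		}
	}
}

/* Populate a fresh page: memset for a uniform page, full copy otherwise. */
void refpage_sketch_populate(const struct refpage_sketch *rp, unsigned char *dst)
{
	if (rp->uniform)
		memset(dst, rp->pattern, SKETCH_PAGE_SIZE);
	else
		memcpy(dst, rp->page, SKETCH_PAGE_SIZE);
}
```

A caller would run refpage_sketch_prepare() once per reference page and refpage_sketch_populate() on every first store to a page of the mapping.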

In order to measure the performance and RSS impact of reference pages,
I used the following microbenchmark program, which is intended to
compare an implementation of heap pattern initialization that uses
memset to initialize the pages against an implementation that uses
reference pages:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  constexpr unsigned char pattern_byte = 0xaa;

  #define PAGE_SIZE 4096

  _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];

  int main(int argc, char **argv) {
    if (argc < 3)
      return 1;
    bool use_refpage = argc > 3;
    size_t mmap_size = atoi(argv[1]);
    size_t touch_size = atoi(argv[2]);

    int refpage_fd;
    if (use_refpage) {
      memset(pattern, pattern_byte, PAGE_SIZE);
      refpage_fd = syscall(448, pattern, 0);
    }
    for (unsigned i = 0; i != 1000; ++i) {
      char *p;
      if (use_refpage) {
        p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                         refpage_fd, 0);
      } else {
        p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(p, pattern_byte, mmap_size);
      }
      for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
        p[j] = 0;
      munmap(p, mmap_size);
    }
  }

On a DragonBoard 845c with the powersave governor, and taking the
median of 10 runs for each measurement, I measured the following
results for real time (s):

touch_size/mmap_size   memset   refpages     improvement (95% CI)
      4096/4096000    3.962194   0.026726   98.8015% +/- 1.14684%
   2048000/4096000    3.925309   1.48081    61.8271% +/- 1.11911%
   4096000/4096000    3.986275   3.385003   15.1205% +/- 0.227235%

And the following for max RSS (KiB):

touch_size/mmap_size   memset   refpages     improvement (95% CI)
      4096/4096000      6656      3448      49.3815% +/- 1.30339%
   2048000/4096000      6696      4580      31.7053% +/- 1.16411%
   4096000/4096000      6716      6684              none

So we see a large improvement for sparsely used allocations, and even
a modest perf improvement for fully utilized allocations as a result
of touching the pages one fewer time (with memset: once in the kernel
and once in userspace; with refpages: just once in the kernel).

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: [1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
---
v4:
- rebased to linux-next
- added arch hooks to support MTE tagged reference pages
- added optimizations for pages with pattern byte as well as uniformly MTE-tagged pages
- added helper functions to avoid open-coding the reference page detection
- wrote a microbenchmark program and got new perf results for the commit message

As an alternative to introducing this syscall, I considered using
userfaultfd to implement reference pages. However, after having taken
a detailed look at the interface, it does not seem suitable to be
used in the context of a general purpose allocator. For example,
UFFD_FEATURE_FORK support would be required in order to correctly
support fork(2) in a process that uses the allocator (although POSIX
does not guarantee support for allocating after fork, many allocators
including Scudo support it, and nothing stops the forked process from
page faulting pre-existing allocations after forking anyway), but
UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
making it unsuitable for use in an allocator. Furthermore, even if
the interface issues are resolved, I suspect (but have not measured)
that the cost of the multiple context switches between kernel and
userspace would be too high to be used in an allocator anyway.

 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/mman.h               | 15 ++++
 arch/arm64/include/asm/mte.h                |  9 +-
 arch/arm64/include/asm/page.h               |  2 +-
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/arm64/kernel/mte.c                     | 24 +++++
 arch/arm64/lib/mte.S                        |  7 +-
 arch/arm64/mm/fault.c                       | 41 ++++++++-
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/gfp.h                         | 11 ++-
 include/linux/highmem.h                     |  2 +-
 include/linux/huge_mm.h                     |  7 ++
 include/linux/mm.h                          | 39 ++++++++
 include/linux/mman.h                        | 19 ++++
 include/linux/syscalls.h                    |  3 +
 include/uapi/asm-generic/unistd.h           |  5 +-
 kernel/sys_ni.c                             |  1 +
 mm/Makefile                                 |  4 +-
 mm/gup.c                                    |  2 +-
 mm/kasan/hw_tags.c                          |  2 +-
 mm/memory.c                                 | 47 +++++++---
 mm/migrate.c                                |  4 +-
 mm/page_alloc.c                             |  2 +-
 mm/refpage.c                                | 98 +++++++++++++++++++++
 39 files changed, 330 insertions(+), 34 deletions(-)
 create mode 100644 mm/refpage.c

Comments

Kirill A. Shutemov June 28, 2021, 12:24 p.m. UTC | #1
On Sat, Jun 19, 2021 at 02:20:02AM -0700, Peter Collingbourne wrote:
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <string.h>
>   #include <sys/mman.h>
>   #include <unistd.h>
> 
>   constexpr unsigned char pattern_byte = 0xaa;
> 
>   #define PAGE_SIZE 4096
> 
>   _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];
> 
>   int main(int argc, char **argv) {
>     if (argc < 3)
>       return 1;
>     bool use_refpage = argc > 3;
>     size_t mmap_size = atoi(argv[1]);
>     size_t touch_size = atoi(argv[2]);
> 
>     int refpage_fd;
>     if (use_refpage) {
>       memset(pattern, pattern_byte, PAGE_SIZE);
>       refpage_fd = syscall(448, pattern, 0);
>     }
>     for (unsigned i = 0; i != 1000; ++i) {
>       char *p;
>       if (use_refpage) {
>         p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
>                          refpage_fd, 0);
>       } else {
>         p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         memset(p, pattern_byte, mmap_size);
>       }
>       for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
>         p[j] = 0;
>       munmap(p, mmap_size);
>     }
>   }

I don't like the interface. It is tied to PAGE_SIZE and this doesn't seem
to be very forward-looking. How would it work with THPs?

Maybe we should consider passing down a filling pattern to the kernel and
letting the kernel allocate an appropriate page size on a read page fault?
The pattern length has to be a power of 2 and bounded.
Matthew Wilcox June 28, 2021, 1:10 p.m. UTC | #2
On Sat, Jun 19, 2021 at 02:20:02AM -0700, Peter Collingbourne wrote:
> +++ b/include/linux/mm.h
> @@ -32,6 +32,7 @@
>  #include <linux/sched.h>
>  #include <linux/pgtable.h>
>  #include <linux/kasan.h>
> +#include <linux/fs.h>

No.

> +++ b/include/linux/mman.h
> @@ -2,6 +2,7 @@
>  #ifndef _LINUX_MMAN_H
>  #define _LINUX_MMAN_H
>  
> +#include <linux/fs.h>

No.
Matthew Wilcox June 28, 2021, 7:33 p.m. UTC | #3
On Sat, Jun 19, 2021 at 02:20:02AM -0700, Peter Collingbourne wrote:
> +void prep_refpage_private_data(struct refpage_private_data *priv)
> +{
> +	u8 *addr = page_address(priv->refpage);
> +	u8 pattern = addr[0];
> +	int i;
> +
> +	for (i = 1; i != PAGE_SIZE; ++i)
> +		if (addr[i] != pattern)
> +			return;
> +
> +	priv->optzn_kind = REFPAGE_OPTZN_PATTERN;
> +	priv->optzn_info = pattern;
> +}
> +
> +void copy_refpage(struct page *page, unsigned long addr,
> +		  struct vm_area_struct *vma)
> +{
> +	struct refpage_private_data *priv = vma->vm_private_data;
> +
> +	if (priv->optzn_kind == REFPAGE_OPTZN_PATTERN)
> +		memset(page_address(page), priv->optzn_info, PAGE_SIZE);
> +	else
> +		copy_user_highpage(page, priv->refpage, addr, vma);
> +}

I wonder if single-byte captures enough of the useful possibilities.
In the kernel we have memset32() and memset64() [1] so we could support
a larger pattern than just an 8-bit byte.  It all depends what userspace
would find useful.

[1] Along with memset_p(), memset_l() and memset16() that aren't terribly
relevant to this use case.  Although maybe memset_l() would be the right
one to use, since there probably aren't too many 32-bit apps that want
a 64-bit pattern and memset64() might not be the fastest on a 32-bit
kernel.
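
To make the wider-pattern idea concrete, here is a rough userspace sketch of detecting and filling a 64-bit repeating pattern. The names page_is_single_word() and fill_page_word() are made up for illustration; fill_page_word() plays the role that memset64() would play in the kernel, and a 4 KiB page is assumed. A single repeating byte is the special case where all eight bytes of the word are equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096 /* assumption: 4 KiB pages */
#define SKETCH_PAGE_WORDS (SKETCH_PAGE_SIZE / sizeof(uint64_t))

/*
 * Returns true, and stores the word in *pattern, if the whole page is one
 * repeating 64-bit word. Hypothetical helper, not part of the patch.
 */
bool page_is_single_word(const uint64_t *page, uint64_t *pattern)
{
	assert(page != NULL && pattern != NULL);
	for (size_t i = 1; i < SKETCH_PAGE_WORDS; i++)
		if (page[i] != page[0])
			return false;
	*pattern = page[0];
	return true;
}

/* Userspace stand-in for what memset64() would do for one page. */
void fill_page_word(uint64_t *page, uint64_t pattern)
{
	for (size_t i = 0; i < SKETCH_PAGE_WORDS; i++)
		page[i] = pattern;
}
```

Whether the extra pattern width is worth it depends, as noted above, on what userspace would actually find useful.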
John Hubbard June 28, 2021, 7:44 p.m. UTC | #4
On 6/28/21 12:33 PM, Matthew Wilcox wrote:
...
> 
> I wonder if single-byte captures enough of the useful possibilities.
> In the kernel we have memset32() and memset64() [1] so we could support
> a larger pattern than just an 8-bit byte.  It all depends what userspace
> would find useful.
> 
> [1] Along with memset_p(), memset_l() and memset16() that aren't terribly
> relevant to this use case.  Although maybe memset_l() would be the right
> one to use since there probably aren't too many 32-bit apps that want
> a 64-bit pattern and memset64() might not be the fastest on a 32-bit
> kernel).
> 

And in fact, I'm also rather intrigued by doing something like 256 copies
of a 16-byte UUID, per 4KB page. In other words, there are *definitely*
useful patterns that are longer than a single byte, and it seems interesting
to support them here.

Kirill's idea of an API that somehow allows various power of 2 patterns seems
like it would be nice, because then we don't have to pick a value that seems
good in 2021, but less good as time goes by, perhaps.

Another thought is to use an entire 4KB page as the smallest pattern unit.
That would allow the maximum API flexibility, because the caller could
explicitly set every single byte in the page.


thanks,
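
The 16-byte UUID case and the "various power of 2 patterns" idea can be sketched in userspace like this. Everything here is illustrative: fill_page_pattern() and smallest_pow2_period() are hypothetical helpers, not an existing or proposed kernel API, and a 4 KiB page is assumed. The second function shows how a full reference page could be reduced to the shortest power-of-2 pattern that reproduces it, falling back to the page size when no shorter pattern exists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* Fill a page with back-to-back copies of a power-of-2 sized pattern. */
void fill_page_pattern(unsigned char *page, const unsigned char *pat, size_t len)
{
	assert(len != 0 && len <= SKETCH_PAGE_SIZE && (len & (len - 1)) == 0);
	for (size_t off = 0; off < SKETCH_PAGE_SIZE; off += len)
		memcpy(page + off, pat, len);
}

/*
 * Return the smallest power-of-2 length whose repetition reproduces the
 * page, or SKETCH_PAGE_SIZE if the page has no shorter such pattern.
 * page[i] == page[i + len] for every valid i means the page has period len.
 */
size_t smallest_pow2_period(const unsigned char *page)
{
	for (size_t len = 1; len < SKETCH_PAGE_SIZE; len *= 2)
		if (memcmp(page, page + len, SKETCH_PAGE_SIZE - len) == 0)
			return len;
	return SKETCH_PAGE_SIZE;
}
```

With a 16-byte identifier, fill_page_pattern() writes 256 copies into the page, and smallest_pow2_period() recovers a length of 16 as long as the identifier itself does not repeat at a shorter power-of-2 length.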
Matthew Wilcox June 28, 2021, 7:56 p.m. UTC | #5
On Mon, Jun 28, 2021 at 12:44:22PM -0700, John Hubbard wrote:
> On 6/28/21 12:33 PM, Matthew Wilcox wrote:
> ...
> > 
> > I wonder if single-byte captures enough of the useful possibilities.
> > In the kernel we have memset32() and memset64() [1] so we could support
> > a larger pattern than just an 8-bit byte.  It all depends what userspace
> > would find useful.
> > 
> > [1] Along with memset_p(), memset_l() and memset16() that aren't terribly
> > relevant to this use case.  Although maybe memset_l() would be the right
> > one to use since there probably aren't too many 32-bit apps that want
> > a 64-bit pattern and memset64() might not be the fastest on a 32-bit
> > kernel).
> > 
> 
> And in fact, I'm also rather intrigued by doing something like 256 copies
> of a 16-byte UUID, per 4KB page. In other words, there are *definitely*
> useful patterns that are longer than a single byte, and it seems interesting
> to support them here.
> 
> Kirill's idea of an API that somehow allows various power of 2 patterns seems
> like it would be nice, because then we don't have to pick a value that seems
> good in 2021, but less good as time goes by, perhaps.
> 
> Another thought is to use an entire 4KB page as the smallest pattern unit.
> That would allow the maximum API flexibility, because the caller could
> explicitly set every single byte in the page.

That's what this patch does.  If the page can be reduced to a pattern (a
single byte in Peter's patch; I'm proposing expanding that), then the page
is filled with the pattern; otherwise we copy the reference page.
John Hubbard June 29, 2021, 7:19 a.m. UTC | #6
On 6/19/21 2:20 AM, Peter Collingbourne wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar
> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> contents are specified using an argument to refpage_create. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.

Hi Peter,

Now that you have shown that this seems to have some performance
justification, I've taken a closer look at the patch, and have a handful
of small suggestions, most of them very easy to deal with.

First of all: documentation of the new syscall. At the very least,
refpage.c could use a bunch of the wording that is in this patch's
commit description, at the top. I'm sure there are other places for new
syscall documentation (someone else probably knows where), but that would
be a good start.


> 
> Reference pages are useful in circumstances where anonymous mappings
> combined with manual stores to memory would impose undesirable costs,
> either in terms of performance or RSS. Use cases are focused on heap
> allocators and include:
> 
> - Pattern initialization for the heap. This is where malloc(3) gives
>    you memory whose contents are filled with a non-zero pattern
>    byte, in order to help detect and mitigate bugs involving use
>    of uninitialized memory. Typically this is implemented by having
>    the allocator memset the allocation with the pattern byte before
>    returning it to the user, but for large allocations this can result
>    in a significant increase in RSS, especially for allocations that
>    are used sparsely. Even for dense allocations there is a needless
>    impact to startup performance when it may be better to amortize it
>    throughout the program. By creating allocations using a reference
>    page filled with the pattern byte, we can avoid these costs.

As Kirill and Matthew mentioned in the other thread, it would be good
to pass in the pattern as part of the syscall, instead of deducing it
in prep_refpage_private_data(). I'll cover that more in the diffs area.

> 
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>    feature which allows for memory to be tagged in order to detect
>    certain kinds of memory errors with low overhead. In order to set
>    up an allocation to allow memory errors to be detected, the entire
>    allocation needs to have the same tag. The issue here is similar to
>    pattern initialization in the sense that large tagged allocations
>    will be expensive if the tagging is done up front. The idea is that
>    the allocator would create reference pages with each of the possible
>    memory tags, and use those reference pages for the large allocations.
> 
> This patch includes specific optimizations for these use cases in
> order to reduce memory traffic. If the reference page consists of a
> single repeating byte, the page is initialized using memset, and on
> arm64 if the reference page consists of a uniformly tagged zero page,
> the DC GZVA instruction is used to initialize the page.
> 
> In order to measure the performance and RSS impact of reference pages,
> I used the following microbenchmark program, which is intended to
> compare an implementation of heap pattern initialization that uses
> memset to initialize the pages against an implementation that uses
> reference pages:
> 
>    #include <stdio.h>
>    #include <stdlib.h>
>    #include <string.h>
>    #include <sys/mman.h>
>    #include <unistd.h>
> 
>    constexpr unsigned char pattern_byte = 0xaa;
> 
>    #define PAGE_SIZE 4096
> 
>    _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];
> 
>    int main(int argc, char **argv) {
>      if (argc < 3)
>        return 1;
>      bool use_refpage = argc > 3;
>      size_t mmap_size = atoi(argv[1]);
>      size_t touch_size = atoi(argv[2]);
> 
>      int refpage_fd;
>      if (use_refpage) {
>        memset(pattern, pattern_byte, PAGE_SIZE);
>        refpage_fd = syscall(448, pattern, 0);
>      }
>      for (unsigned i = 0; i != 1000; ++i) {
>        char *p;
>        if (use_refpage) {
>          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
>                           refpage_fd, 0);
>        } else {
>          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
>                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>          memset(p, pattern_byte, mmap_size);
>        }
>        for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
>          p[j] = 0;
>        munmap(p, mmap_size);
>      }
>    }
> 


That sample code would also be very nice to include in the documentation,
once we figure out the best place to put it. If no one else recommends
anything, then I'd start with Documentation/mm/reference_pages.rst.


> On a DragonBoard 845c with the powersave governor, and taking the
> median of 10 runs for each measurement, I measured the following
> results for real time (s):
> 
> touch_size/mmap_size   memset   refpages     improvement (95% CI)
>        4096/4096000    3.962194   0.026726   98.8015% +/- 1.14684%
>     2048000/4096000    3.925309   1.48081    61.8271% +/- 1.11911%
>     4096000/4096000    3.986275   3.385003   15.1205% +/- 0.227235%
> 
> And the following for max RSS (KiB):
> 
> touch_size/mmap_size   memset   refpages     improvement (95% CI)
>        4096/4096000      6656      3448      49.3815% +/- 1.30339%
>     2048000/4096000      6696      4580      31.7053% +/- 1.16411%
>     4096000/4096000      6716      6684              none
> 
> So we see a large improvement for sparsely used allocations, and even
> a modest perf improvement for fully utilized allocations as a result
> of touching the pages one fewer time (with memset: once in the kernel
> and once in userspace; with refpages: just once in the kernel).
> 
> Signed-off-by: Peter Collingbourne <pcc@google.com>
> Link: [1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
> ---
> v4:
> - rebased to linux-next
> - added arch hooks to support MTE tagged reference pages
> - added optimizations for pages with pattern byte as well as uniformly MTE-tagged pages
> - added helper functions to avoid open-coding the reference page detection
> - wrote a microbenchmark program and got new perf results for the commit message
> 
> As an alternative to introducing this syscall, I considered using
> userfaultfd to implement reference pages. However, after having taken
> a detailed look at the interface, it does not seem suitable to be
> used in the context of a general purpose allocator. For example,
> UFFD_FEATURE_FORK support would be required in order to correctly
> support fork(2) in a process that uses the allocator (although POSIX
> does not guarantee support for allocating after fork, many allocators
> including Scudo support it, and nothing stops the forked process from
> page faulting pre-existing allocations after forking anyway), but
> UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> making it unsuitable for use in an allocator. Furthermore, even if
> the interface issues are resolved, I suspect (but have not measured)
> that the cost of the multiple context switches between kernel and
> userspace would be too high to be used in an allocator anyway.
> 
>   arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
>   arch/arm/tools/syscall.tbl                  |  1 +
>   arch/arm64/include/asm/mman.h               | 15 ++++
>   arch/arm64/include/asm/mte.h                |  9 +-
>   arch/arm64/include/asm/page.h               |  2 +-
>   arch/arm64/include/asm/unistd.h             |  2 +-
>   arch/arm64/include/asm/unistd32.h           |  2 +
>   arch/arm64/kernel/mte.c                     | 24 +++++
>   arch/arm64/lib/mte.S                        |  7 +-
>   arch/arm64/mm/fault.c                       | 41 ++++++++-
>   arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
>   arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
>   arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>   arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>   arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>   arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
>   arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
>   arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
>   arch/s390/kernel/syscalls/syscall.tbl       |  1 +
>   arch/sh/kernel/syscalls/syscall.tbl         |  1 +
>   arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
>   arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
>   arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
>   arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
>   include/linux/gfp.h                         | 11 ++-
>   include/linux/highmem.h                     |  2 +-
>   include/linux/huge_mm.h                     |  7 ++
>   include/linux/mm.h                          | 39 ++++++++
>   include/linux/mman.h                        | 19 ++++
>   include/linux/syscalls.h                    |  3 +
>   include/uapi/asm-generic/unistd.h           |  5 +-
>   kernel/sys_ni.c                             |  1 +
>   mm/Makefile                                 |  4 +-
>   mm/gup.c                                    |  2 +-
>   mm/kasan/hw_tags.c                          |  2 +-
>   mm/memory.c                                 | 47 +++++++---
>   mm/migrate.c                                |  4 +-
>   mm/page_alloc.c                             |  2 +-
>   mm/refpage.c                                | 98 +++++++++++++++++++++
>   39 files changed, 330 insertions(+), 34 deletions(-)
>   create mode 100644 mm/refpage.c
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index a17687ed4b51..494edc5ca61c 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -486,3 +486,4 @@
>   554	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   555	common	landlock_add_rule		sys_landlock_add_rule
>   556	common	landlock_restrict_self		sys_landlock_restrict_self
> +558	common	refpage_create			sys_refpage_create
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index c5df1179fc5d..8fd7045f46b9 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -460,3 +460,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> index e3e28f7daf62..5c0da3f76ec7 100644
> --- a/arch/arm64/include/asm/mman.h
> +++ b/arch/arm64/include/asm/mman.h
> @@ -84,4 +84,19 @@ static inline bool arch_validate_flags(unsigned long vm_flags)
>   }
>   #define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
>   
> +struct refpage_private_data;
> +
> +void arch_prep_refpage_private_data(struct refpage_private_data *priv);
> +#define arch_prep_refpage_private_data arch_prep_refpage_private_data
> +
> +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> +{
> +	vma->vm_flags |= VM_MTE_ALLOWED;
> +}
> +#define arch_prep_refpage_vma arch_prep_refpage_vma
> +
> +void arch_copy_refpage(struct page *page, unsigned long addr,
> +				     struct vm_area_struct *vma);
> +#define arch_copy_refpage arch_copy_refpage
> +
>   #endif /* ! __ASM_MMAN_H__ */
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 67bf259ae768..b513f83010c7 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
>   /* track which pages have valid allocation tags */
>   #define PG_mte_tagged	PG_arch_2
>   
> -void mte_zero_clear_page_tags(void *addr);
> +void mte_zero_set_page_tags(void *addr);


We should preserve the existing mte_zero_clear_page_tags(), and just
implement it in terms of the new, more general mte_zero_set_page_tags().
This is because: a) it will remove some diffs from this patch, and more
importantly, b) the concept of zeroing is still a distinct and useful
thing to have here.
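
Roughly, the layering suggested here could look like the following userspace mock. This is not the arm64 code: the sketch_* names are hypothetical, the real mte_zero_set_page_tags() is assembly built around DC GZVA, and the tag position (4 bits starting at bit 56) is an assumption made for this sketch. The mock only records which tag the "set" primitive was asked to apply, so that the wrapper relationship is visible.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tag layout for this sketch: 4 tag bits starting at bit 56. */
#define SKETCH_MTE_TAG_SHIFT 56
#define SKETCH_MTE_TAG_MASK (UINT64_C(0xf) << SKETCH_MTE_TAG_SHIFT)

/* Mock state: the tag most recently passed to the "set" primitive. */
uint64_t sketch_last_tag;

/* Mock of the general primitive: zero the page and apply the address's tag. */
void sketch_zero_set_page_tags(uint64_t tagged_addr)
{
	sketch_last_tag = (tagged_addr & SKETCH_MTE_TAG_MASK) >> SKETCH_MTE_TAG_SHIFT;
}

/* The old helper kept as a thin wrapper: strip the tag, so tag 0 is applied. */
void sketch_zero_clear_page_tags(uint64_t addr)
{
	sketch_zero_set_page_tags(addr & ~SKETCH_MTE_TAG_MASK);
}

/* Same address arithmetic as tag_set_highpage() in the quoted patch. */
uint64_t sketch_tag_address(uint64_t addr, uint64_t tag)
{
	assert(tag <= 0xf);
	addr &= ~SKETCH_MTE_TAG_MASK;
	addr |= tag << SKETCH_MTE_TAG_SHIFT;
	return addr;
}
```

Existing callers would keep using the clear variant unchanged, and only the reference page path would need the tagged form.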


>   void mte_sync_tags(pte_t *ptep, pte_t pte);
>   void mte_copy_page_tags(void *kto, const void *kfrom);
>   void mte_thread_init_user(void);
> @@ -48,13 +48,14 @@ long set_mte_ctrl(struct task_struct *task, unsigned long arg);
>   long get_mte_ctrl(struct task_struct *task);
>   int mte_ptrace_copy_tags(struct task_struct *child, long request,
>   			 unsigned long addr, unsigned long data);
> +u8 mte_check_tag_zero_page(struct page *userpage);
>   
>   #else /* CONFIG_ARM64_MTE */
>   
>   /* unused if !CONFIG_ARM64_MTE, silence the compiler */
>   #define PG_mte_tagged	0
>   
> -static inline void mte_zero_clear_page_tags(void *addr)
> +static inline void mte_zero_set_page_tags(void *addr)
>   {
>   }
>   static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
> @@ -89,6 +90,10 @@ static inline int mte_ptrace_copy_tags(struct task_struct *child,
>   {
>   	return -EIO;
>   }
> +static inline u8 mte_check_tag_zero_page(struct page *userpage)
> +{
> +	return 0;
> +}
>   
>   #endif /* CONFIG_ARM64_MTE */
>   
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index 993a27ea6f54..234f48688b1a 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -33,7 +33,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>   						unsigned long vaddr);
>   #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
>   
> -void tag_clear_highpage(struct page *to);
> +void tag_set_highpage(struct page *to, unsigned long tag);


Same reasoning here: let's preserve tag_clear_highpage(), as well.


>   #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
>   
>   #define clear_user_page(page, vaddr, pg)	clear_page(page)
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 727bfc3be99b..3cb206aea3db 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>   #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>   #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>   
> -#define __NR_compat_syscalls		447
> +#define __NR_compat_syscalls		449
>   #endif
>   
>   #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 99ffcafc736c..2a116aa17fe7 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -901,6 +901,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
>   __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
>   #define __NR_landlock_restrict_self 446
>   __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_refpage_create 448
> +__SYSCALL(__NR_refpage_create, sys_refpage_create)
>   
>   /*
>    * Please add new compat syscalls above this comment and update
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 125a10e413e9..6a79240d5a77 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -453,3 +453,27 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
>   
>   	return ret;
>   }
> +
> +u8 mte_check_tag_zero_page(struct page *userpage)
> +{
> +	char *userpage_addr = page_address(userpage);
> +	u8 tag;
> +	int i;
> +
> +	if (!test_bit(PG_mte_tagged, &userpage->flags))
> +		return 0;
> +
> +	tag = mte_get_mem_tag(userpage_addr) & 0xF;
> +	if (tag == 0)
> +		return 0;
> +
> +	for (i = 0; i != PAGE_SIZE; ++i)
> +		if (userpage_addr[i] != 0)
> +			return 0;
> +
> +	for (i = MTE_GRANULE_SIZE; i != PAGE_SIZE; i += MTE_GRANULE_SIZE)
> +		if ((mte_get_mem_tag(userpage_addr + i) & 0xF) != tag)
> +			return 0;
> +
> +	return tag;
> +}
> diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
> index e83643b3995f..45be436c97af 100644
> --- a/arch/arm64/lib/mte.S
> +++ b/arch/arm64/lib/mte.S
> @@ -37,24 +37,23 @@ SYM_FUNC_START(mte_clear_page_tags)
>   SYM_FUNC_END(mte_clear_page_tags)
>   
>   /*
> - * Zero the page and tags at the same time
> + * Zero the page and set tags at the same time
>    *
>    * Parameters:
>    *	x0 - address to the beginning of the page
>    */
> -SYM_FUNC_START(mte_zero_clear_page_tags)
> +SYM_FUNC_START(mte_zero_set_page_tags)
>   	mrs	x1, dczid_el0
>   	and	w1, w1, #0xf
>   	mov	x2, #4
>   	lsl	x1, x2, x1
> -	and	x0, x0, #(1 << MTE_TAG_SHIFT) - 1	// clear the tag
>   
>   1:	dc	gzva, x0
>   	add	x0, x0, x1
>   	tst	x0, #(PAGE_SIZE - 1)
>   	b.ne	1b
>   	ret
> -SYM_FUNC_END(mte_zero_clear_page_tags)
> +SYM_FUNC_END(mte_zero_set_page_tags)
>   
>   /*
>    * Copy the tags from the source page to the destination one
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 349c488765ca..36355758ffc7 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -25,6 +25,7 @@
>   #include <linux/perf_event.h>
>   #include <linux/preempt.h>
>   #include <linux/hugetlb.h>
> +#include <linux/mman.h>
>   
>   #include <asm/acpi.h>
>   #include <asm/bug.h>
> @@ -939,9 +940,45 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>   	return alloc_page_vma(flags, vma, vaddr);
>   }
>   
> -void tag_clear_highpage(struct page *page)
> +void tag_set_highpage(struct page *page, unsigned long tag)
>   {
> -	mte_zero_clear_page_tags(page_address(page));
> +	unsigned long addr = (unsigned long)page_address(page);
> +
> +	addr &= ~MTE_TAG_MASK;
> +	addr |= tag << MTE_TAG_SHIFT;
> +	mte_zero_set_page_tags((void *)addr);
>   	page_kasan_tag_reset(page);
>   	set_bit(PG_mte_tagged, &page->flags);
>   }
> +
> +#define REFPAGE_OPTZN_MTE_TAGGED REFPAGE_OPTZN_ARCH

I see what you're doing with the arch layer here, but there's no need to
accept the minor drawbacks (of having this #define hidden away near the
bottom of a .c file). Instead, let's just put this into the list in
mm.h, and call it what it is, rather than "arch".


> +
> +void arch_prep_refpage_private_data(struct refpage_private_data *priv)
> +{
> +	if (system_supports_mte()) {
> +		u8 tag;
> +
> +		if (!test_and_set_bit(PG_mte_tagged, &priv->refpage->flags))
> +			mte_clear_page_tags(page_address(priv->refpage));
> +
> +		tag = mte_check_tag_zero_page(priv->refpage);
> +		if (tag) {
> +			priv->optzn_kind = REFPAGE_OPTZN_MTE_TAGGED;
> +			priv->optzn_info = tag;
> +			return;
> +		}
> +	}
> +
> +	prep_refpage_private_data(priv);
> +}
> +
> +void arch_copy_refpage(struct page *page, unsigned long addr,
> +		       struct vm_area_struct *vma)
> +{
> +	struct refpage_private_data *priv = vma->vm_private_data;
> +
> +	if (priv->optzn_kind == REFPAGE_OPTZN_MTE_TAGGED)
> +		tag_set_highpage(page, priv->optzn_info);
> +	else
> +		copy_refpage(page, addr, vma);
> +}
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index 6d07742c57b8..c2209d83f3c3 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -367,3 +367,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 541bc1b3a8f9..0360cf474a49 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -446,3 +446,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index a176faca2927..de85d758e564 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -452,3 +452,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index c2d2e19abea8..b07c7293d2a3 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -385,3 +385,4 @@
>   444	n32	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	n32	landlock_add_rule		sys_landlock_add_rule
>   446	n32	landlock_restrict_self		sys_landlock_restrict_self
> +448	n32	refpage_create			sys_refpage_create
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index ac653d08b1ea..7ebabb99dd06 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -361,3 +361,4 @@
>   444	n64	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	n64	landlock_add_rule		sys_landlock_add_rule
>   446	n64	landlock_restrict_self		sys_landlock_restrict_self
> +448	n64	refpage_create			sys_refpage_create
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index 253f2cd70b6b..a51149ac101c 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -434,3 +434,4 @@
>   444	o32	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	o32	landlock_add_rule		sys_landlock_add_rule
>   446	o32	landlock_restrict_self		sys_landlock_restrict_self
> +448	o32	refpage_create			sys_refpage_create
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index e26187b9ab87..385565864861 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -444,3 +444,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index aef2a290e71a..95cdd9f7dc06 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -526,3 +526,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 64d51ab5a8b4..92ed1260ffd9 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>   444  common	landlock_create_ruleset	sys_landlock_create_ruleset	sys_landlock_create_ruleset
>   445  common	landlock_add_rule	sys_landlock_add_rule		sys_landlock_add_rule
>   446  common	landlock_restrict_self	sys_landlock_restrict_self	sys_landlock_restrict_self
> +448  common	refpage_create		sys_refpage_create		sys_refpage_create
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index e0a70be77d84..f9d198cc2541 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index 603f5a821502..83533aa49340 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -492,3 +492,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index ce763a12311c..054c69e395b5 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -452,3 +452,4 @@
>   445	i386	landlock_add_rule	sys_landlock_add_rule
>   446	i386	landlock_restrict_self	sys_landlock_restrict_self
>   447	i386	memfd_secret		sys_memfd_secret
> +448	i386	refpage_create		sys_refpage_create
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f6b57799c1ea..1f24f0b66cbd 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -369,6 +369,7 @@
>   445	common	landlock_add_rule	sys_landlock_add_rule
>   446	common	landlock_restrict_self	sys_landlock_restrict_self
>   447	common	memfd_secret		sys_memfd_secret
> +448	common	refpage_create		sys_refpage_create
>   
>   #
>   # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 235d67d6ceb4..96c27fb404ca 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -417,3 +417,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..ae3c763eb9e9 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -55,8 +55,9 @@ struct vm_area_struct;
>   #define ___GFP_ACCOUNT		0x400000u
>   #define ___GFP_ZEROTAGS		0x800000u
>   #define ___GFP_SKIP_KASAN_POISON	0x1000000u
> +#define ___GFP_NOZERO		0x2000000u
>   #ifdef CONFIG_LOCKDEP
> -#define ___GFP_NOLOCKDEP	0x2000000u
> +#define ___GFP_NOLOCKDEP	0x4000000u
>   #else
>   #define ___GFP_NOLOCKDEP	0
>   #endif
> @@ -238,18 +239,24 @@ struct vm_area_struct;
>    * %__GFP_SKIP_KASAN_POISON returns a page which does not need to be poisoned
>    * on deallocation. Typically used for userspace pages. Currently only has an
>    * effect in HW tags mode.
> + *
> + * %__GFP_NOZERO disables any implicit zeroing of the page (e.g. as a result
> + * of init_on_alloc=on). This flag should only be used to address specific
> + * performance bottlenecks and only if the page is clearly being fully
> + * initialized following the allocation.
>    */
>   #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
>   #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
>   #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
>   #define __GFP_ZEROTAGS	((__force gfp_t)___GFP_ZEROTAGS)
>   #define __GFP_SKIP_KASAN_POISON	((__force gfp_t)___GFP_SKIP_KASAN_POISON)
> +#define __GFP_NOZERO	((__force gfp_t)___GFP_NOZERO)
>   
>   /* Disable lockdep for GFP context tracking */
>   #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
>   
>   /* Room for N __GFP_FOO bits */
> -#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
> +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
>   #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>   
>   /**
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 93ba33c09d12..6c5076dd1e9b 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -187,7 +187,7 @@ static inline void clear_highpage(struct page *page)
>   
>   #ifndef __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
>   
> -static inline void tag_clear_highpage(struct page *page)
> +static inline void tag_set_highpage(struct page *page, unsigned long tag)
>   {
>   }
>   
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f123e15d966e..36ecfc391b46 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -127,6 +127,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>   
>   	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>   		return false;
> +
> +	/*
> +	 * Transparent hugepages not currently supported for anonymous VMAs with
> +	 * reference pages
> +	 */
> +	if (unlikely(is_refpage_vma(vma)))
> +		return false;
>   	return true;
>   }
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a127d93612fa..8cff9e0463b5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -32,6 +32,7 @@
>   #include <linux/sched.h>
>   #include <linux/pgtable.h>
>   #include <linux/kasan.h>
> +#include <linux/fs.h>
>   
>   struct mempolicy;
>   struct anon_vma;
> @@ -722,6 +723,42 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
>   /* flush_tlb_range() takes a vma, not a mm, and can care about flags */
>   #define TLB_FLUSH_VMA(mm,flags) { .vm_mm = (mm), .vm_flags = (flags) }
>   
> +extern const struct file_operations refpage_file_operations;
> +
> +struct refpage_private_data {
> +	struct page *refpage;
> +	u8 optzn_kind;

How about:
	u8 content_type;

> +	u8 optzn_info;

and:
	u8 pattern[16]; // or whatever size the enums go up to, see below

> +};
> +
> +#define REFPAGE_OPTZN_NONE	0

For this next set, how about REFPAGE_CONTENT_TYPE_ for a prefix?
The spelling of OPTZN is tough, and there's no particular need internally
to call these out as optimizations.

So then this one becomes:

#define REFPAGE_CONTENT_TYPE_USER_SET	0

> +#define REFPAGE_OPTZN_PATTERN	1
> +#define REFPAGE_OPTZN_ARCH	2

And for the last one, let's avoid the arch hiding and just call it what it
is, no reason not to:

#define REFPAGE_CONTENT_TYPE_MTE_TAGGED	2
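
Pulling the naming suggestions above together, the result might look roughly
like the sketch below. This is only an illustration that compiles in
userspace: uint8_t stands in for the kernel's u8, struct page is left opaque,
REFPAGE_CONTENT_TYPE_PATTERN is an assumed rename of REFPAGE_OPTZN_PATTERN
(it is not spelled out above), and the 16-byte pattern size is a placeholder.

```c
#include <assert.h>
#include <stdint.h>

struct page;	/* opaque here; the kernel provides the real definition */

#define REFPAGE_CONTENT_TYPE_USER_SET	0
#define REFPAGE_CONTENT_TYPE_PATTERN	1	/* assumed rename of REFPAGE_OPTZN_PATTERN */
#define REFPAGE_CONTENT_TYPE_MTE_TAGGED	2

struct refpage_private_data {
	struct page *refpage;
	uint8_t content_type;	/* one of REFPAGE_CONTENT_TYPE_* */
	uint8_t pattern[16];	/* placeholder size, to be settled with the API */
};
```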


> +
> +static inline bool is_refpage_vma(struct vm_area_struct *vma)
> +{
> +	return vma->vm_file && vma->vm_file->f_op == &refpage_file_operations;
> +}
> +
> +static inline struct page *get_vma_refpage(struct vm_area_struct *vma)
> +{
> +	struct refpage_private_data *priv = vma->vm_private_data;
> +
> +	BUG_ON(!is_refpage_vma(vma));
> +	return priv->refpage;
> +}
> +
> +static inline int is_refpage_pfn(struct vm_area_struct *vma, unsigned long pfn)
> +{
> +	return is_refpage_vma(vma) && pfn == page_to_pfn(get_vma_refpage(vma));
> +}
> +
> +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> +					 unsigned long pfn)
> +{
> +	return is_zero_pfn(pfn) || is_refpage_pfn(vma, pfn);
> +}
> +


I don't think this helper function is helping enough to justify itself,
seeing as how the code is quite clear when the implementation is written out
at the call sites instead. No big deal either way, though.


>   struct mmu_gather;
>   struct inode;
>   
> @@ -2977,6 +3014,8 @@ static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
>   DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
>   static inline bool want_init_on_alloc(gfp_t flags)
>   {
> +	if (flags & __GFP_NOZERO)
> +		return false;
>   	if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
>   				&init_on_alloc))
>   		return true;
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index ebb09a964272..cdf8f8245c78 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -2,6 +2,7 @@
>   #ifndef _LINUX_MMAN_H
>   #define _LINUX_MMAN_H
>   
> +#include <linux/fs.h>
>   #include <linux/mm.h>
>   #include <linux/percpu_counter.h>
>   
> @@ -123,6 +124,24 @@ static inline bool arch_validate_flags(unsigned long flags)
>   #define arch_validate_flags arch_validate_flags
>   #endif
>   
> +void prep_refpage_private_data(struct refpage_private_data *priv);
> +#ifndef arch_prep_refpage_private_data
> +#define arch_prep_refpage_private_data prep_refpage_private_data
> +#endif
> +
> +#ifndef arch_prep_refpage_vma
> +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> +{
> +}
> +#define arch_prep_refpage_vma arch_prep_refpage_vma
> +#endif
> +
> +void copy_refpage(struct page *page, unsigned long addr,
> +		  struct vm_area_struct *vma);
> +#ifndef arch_copy_refpage
> +#define arch_copy_refpage copy_refpage
> +#endif
> +
>   /*
>    * Optimisation macro.  It is equivalent to:
>    *      (x & bit1) ? bit2 : 0
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 69c9a7010081..303a28a86500 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -864,6 +864,9 @@ asmlinkage long sys_mremap(unsigned long addr,
>   			   unsigned long old_len, unsigned long new_len,
>   			   unsigned long flags, unsigned long new_addr);
>   
> +/* mm/refpage.c */
> +asmlinkage long sys_refpage_create(const void __user *content, unsigned long flags);
> +
>   /* security/keys/keyctl.c */
>   asmlinkage long sys_add_key(const char __user *_type,
>   			    const char __user *_description,
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index a9d6fcd95f42..54cede7db5f0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -878,8 +878,11 @@ __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
>   __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
>   #endif
>   
> +#define __NR_refpage_create 448
> +__SYSCALL(__NR_refpage_create, sys_refpage_create)
> +
>   #undef __NR_syscalls
> -#define __NR_syscalls 448
> +#define __NR_syscalls 449
>   
>   /*
>    * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 30971b1dd4a9..bc65a54eb2a4 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -300,6 +300,7 @@ COND_SYSCALL(migrate_pages);
>   COND_SYSCALL_COMPAT(migrate_pages);
>   COND_SYSCALL(move_pages);
>   COND_SYSCALL_COMPAT(move_pages);
> +COND_SYSCALL(refpage_create);
>   
>   COND_SYSCALL(perf_event_open);
>   COND_SYSCALL(accept4);
> diff --git a/mm/Makefile b/mm/Makefile
> index e3436741d539..137adc22bf50 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -35,10 +35,10 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
>   CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
>   
>   mmu-y			:= nommu.o
> -mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
> +mmu-$(CONFIG_MMU)	:= highmem.o ioremap.o memory.o mincore.o \
>   			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>   			   msync.o page_vma_mapped.o pagewalk.o \
> -			   pgtable-generic.o rmap.o vmalloc.o ioremap.o
> +			   pgtable-generic.o refpage.o rmap.o vmalloc.o
>   
>   
>   ifdef CONFIG_CROSS_MEMORY_ATTACH
> diff --git a/mm/gup.c b/mm/gup.c
> index 42b8b1fa6521..ba1b7bd7a0a0 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -548,7 +548,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>   			goto out;
>   		}
>   
> -		if (is_zero_pfn(pte_pfn(pte))) {
> +		if (is_zero_or_refpage_pfn(vma, pte_pfn(pte))) {
>   			page = pte_page(pte);
>   		} else {
>   			ret = follow_pfn_pte(vma, address, ptep, flags);
> diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
> index ed5e5b833d61..3c433e430c80 100644
> --- a/mm/kasan/hw_tags.c
> +++ b/mm/kasan/hw_tags.c
> @@ -253,7 +253,7 @@ void kasan_alloc_pages(struct page *page, unsigned int order, gfp_t flags)
>   		int i;
>   
>   		for (i = 0; i != 1 << order; ++i)
> -			tag_clear_highpage(page + i);
> +			tag_set_highpage(page + i, 0);


Here, we could avoid this diff, by preserving tag_clear_highpage(). And
that's good, because the current diff is making the code just ever so
slightly worse. :)
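
One way to preserve tag_clear_highpage() would be a thin wrapper around the
new function, so existing callers stay untouched. The sketch below mocks
struct page and tag_set_highpage() in userspace purely to show the shape; the
real tag_set_highpage() zeroes the page and sets MTE tags, which the mock does
not attempt.

```c
#include <assert.h>

/* Mock page for illustration only: it just records the tag it was given. */
struct page {
	unsigned long tag;
};

/* Mock of the patch's tag_set_highpage(); the real one zeroes and tags the page. */
static inline void tag_set_highpage(struct page *page, unsigned long tag)
{
	assert(page != 0);
	page->tag = tag;
}

/* Existing name kept as a wrapper, so callers need no change. */
static inline void tag_clear_highpage(struct page *page)
{
	tag_set_highpage(page, 0);
}
```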


>   	} else {
>   		kasan_unpoison_pages(page, order, init);
>   	}
> diff --git a/mm/memory.c b/mm/memory.c
> index db86558791f1..8b32bdd215b7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -614,7 +614,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>   			return vma->vm_ops->find_special_page(vma, addr);
>   		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>   			return NULL;
> -		if (is_zero_pfn(pfn))
> +		if (is_zero_or_refpage_pfn(vma, pfn))
>   			return NULL;
>   		if (pte_devmap(pte))
>   			return NULL;
> @@ -640,7 +640,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>   		}
>   	}
>   
> -	if (is_zero_pfn(pfn))
> +	if (is_zero_or_refpage_pfn(vma, pfn))
>   		return NULL;
>   
>   check_pfn:
> @@ -2166,7 +2166,7 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
>   		return true;
>   	if (pfn_t_special(pfn))
>   		return true;
> -	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
> +	if (is_zero_or_refpage_pfn(vma, pfn_t_to_pfn(pfn)))
>   		return true;
>   	return false;
>   }
> @@ -2990,22 +2990,29 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   	pte_t entry;
>   	int page_copied = 0;
>   	struct mmu_notifier_range range;
> +	unsigned long pfn;
>   
>   	if (unlikely(anon_vma_prepare(vma)))
>   		goto oom;
>   
> -	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> +	pfn = pte_pfn(vmf->orig_pte);
> +	if (is_zero_pfn(pfn)) {
>   		new_page = alloc_zeroed_user_highpage_movable(vma,
>   							      vmf->address);
>   		if (!new_page)
>   			goto oom;
>   	} else {
> -		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> -				vmf->address);
> +		bool refpage = is_refpage_pfn(vma, pfn);
> +
> +		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE |
> +						  (refpage ? __GFP_NOZERO : 0),
> +					  vma, vmf->address);
>   		if (!new_page)
>   			goto oom;
>   
> -		if (!cow_user_page(new_page, old_page, vmf)) {
> +		if (refpage) {
> +			arch_copy_refpage(new_page, vmf->address, vma);
> +		} else if (!cow_user_page(new_page, old_page, vmf)) {
>   			/*
>   			 * COW failed, if the fault was solved by other,
>   			 * it's fine. If not, userspace would re-fault on
> @@ -3739,11 +3746,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>   	if (unlikely(pmd_trans_unstable(vmf->pmd)))
>   		return 0;
>   
> -	/* Use the zero-page for reads */
> +	/* Use the zero-page, or reference page if set, for reads */
>   	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>   			!mm_forbids_zeropage(vma->vm_mm)) {
> -		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> -						vma->vm_page_prot));
> +		unsigned long pfn;
> +
> +		if (unlikely(is_refpage_vma(vma)))
> +			pfn = page_to_pfn(get_vma_refpage(vma));
> +		else
> +			pfn = my_zero_pfn(vmf->address);
> +		entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
>   		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>   				vmf->address, &vmf->ptl);
>   		if (!pte_none(*vmf->pte)) {
> @@ -3764,9 +3776,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>   	/* Allocate our own private page. */
>   	if (unlikely(anon_vma_prepare(vma)))
>   		goto oom;
> -	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
> -	if (!page)
> -		goto oom;
> +
> +	if (unlikely(is_refpage_vma(vma))) {
> +		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_NOZERO, vma,
> +				      vmf->address);
> +		if (!page)
> +			goto oom;
> +		arch_copy_refpage(page, vmf->address, vma);
> +	} else {
> +		page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
> +		if (!page)
> +			goto oom;
> +	}
>   
>   	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
>   		goto oom_free_page;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 23cbd9de030b..9a897676ff95 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2774,8 +2774,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>   	pmd_t *pmdp;
>   	pte_t *ptep;
>   
> -	/* Only allow populating anonymous memory */
> -	if (!vma_is_anonymous(vma))
> +	/* Only allow populating anonymous memory without a reference page */
> +	if (!vma_is_anonymous(vma) || is_refpage_vma(vma))
>   		goto abort;
>   
>   	pgdp = pgd_offset(mm, addr);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8836e54721ae..6ca831c1821f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1283,7 +1283,7 @@ static void kernel_init_free_pages(struct page *page, int numpages, bool zero_ta
>   
>   	if (zero_tags) {
>   		for (i = 0; i < numpages; i++)
> -			tag_clear_highpage(page + i);
> +			tag_set_highpage(page + i, 0);
>   		return;
>   	}
>   
> diff --git a/mm/refpage.c b/mm/refpage.c
> new file mode 100644
> index 000000000000..ee95e281d2d4
> --- /dev/null
> +++ b/mm/refpage.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/anon_inodes.h>
> +#include <linux/fs_context.h>
> +#include <linux/highmem.h>
> +#include <linux/mman.h>
> +#include <linux/mount.h>
> +#include <linux/syscalls.h>
> +
> +void prep_refpage_private_data(struct refpage_private_data *priv)
> +{
> +	u8 *addr = page_address(priv->refpage);
> +	u8 pattern = addr[0];
> +	int i;
> +
> +	for (i = 1; i != PAGE_SIZE; ++i)
> +		if (addr[i] != pattern)
> +			return;
> +
> +	priv->optzn_kind = REFPAGE_OPTZN_PATTERN;
> +	priv->optzn_info = pattern;
> +}
> +

I am hoping that this doesn't remain in its current form, because of
the API discussions. Probably we'll end up with setting a pattern instead
of deducing it.

> +void copy_refpage(struct page *page, unsigned long addr,
> +		  struct vm_area_struct *vma)
> +{
> +	struct refpage_private_data *priv = vma->vm_private_data;
> +
> +	if (priv->optzn_kind == REFPAGE_OPTZN_PATTERN)
> +		memset(page_address(page), priv->optzn_info, PAGE_SIZE);
> +	else
> +		copy_user_highpage(page, priv->refpage, addr, vma);
> +}
> +
> +static void put_refpage_private_data(struct refpage_private_data *priv)

Can you please rename this to free_refpage_private_data()? It's a little more
accurate.

> +{
> +	put_page(priv->refpage);
> +	kfree(priv);
> +}
> +
> +static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	vma_set_anonymous(vma);
> +	vma->vm_private_data = vma->vm_file->private_data;
> +	arch_prep_refpage_vma(vma);
> +	return 0;
> +}
> +
> +static int refpage_release(struct inode *inode, struct file *file)
> +{
> +	put_refpage_private_data(file->private_data);
> +	return 0;
> +}
> +
> +const struct file_operations refpage_file_operations = {
> +	.mmap = refpage_mmap,
> +	.release = refpage_release,
> +};
> +
> +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> +		flags)

From the API discussion (and using a simpler syntax to illustrate this), it
seems like the following would be close:

enum content_type {
	BYTE_PATTERN,
	FOUR_BYTE_PATTERN,
	...
	FULL_4KB_PAGE
};

int refpage_create(const void *__user content, enum content_type, unsigned long flags);

...and if content_type == BYTE_PATTERN, then content is a pointer to just one byte of
data, and so forth for the other enum values.
	

> +{
> +	unsigned long content_addr = (unsigned long)content;
> +	struct page *userpage;
> +	struct refpage_private_data *private_data;
> +	int fd;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> +	    get_user_pages(content_addr, 1, 0, &userpage, 0) != 1)
> +		return -EFAULT;
> +
> +	private_data = kzalloc(sizeof(struct refpage_private_data), GFP_KERNEL);
> +	if (!private_data) {
> +		put_page(userpage);
> +		return -ENOMEM;
> +	}
> +
> +	private_data->refpage = alloc_page(GFP_KERNEL);
> +	if (!private_data->refpage) {
> +		kfree(private_data);
> +		put_page(userpage);
> +		return -ENOMEM;
> +	}
> +
> +	copy_highpage(private_data->refpage, userpage);
> +	arch_prep_refpage_private_data(private_data);
> +	put_page(userpage);
> +
> +	fd = anon_inode_getfd("[refpage]", &refpage_file_operations,
> +			      private_data, O_RDONLY | O_CLOEXEC);
> +	if (fd < 0)
> +		put_refpage_private_data(private_data);

And here, a couple of things:

1) I think there's a bug in the fd < 0 case, because you're only freeing
one of the two pages (there's an alloc_page() call, and a gup call above).

2) It's jarring to have the error handling done in three different ways:
returning -EFAULT directly, coding each error case to undo the growing
set of operations, and finally, jumping out to another routine here for
fd < 0.

Even for a small routine, that's too error-prone. Instead, one of the
following will be cleaner and safer too:

a) use goto and labels to unwind, or

b) use a no-fail cleanup routine to unwind

and either way, do it for all cases (or at least all of them after the first
trivial -EFAULT return).
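
To illustrate option (a), here is a minimal userspace sketch of the
goto-and-labels pattern. Everything in it is hypothetical: struct ref_data
stands in for refpage_private_data, calloc()/malloc() stand in for
kzalloc()/alloc_page(), and the fail_step parameter exists only so that each
error path can be exercised. It is not the syscall implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for illustration */

/* Hypothetical stand-in for struct refpage_private_data. */
struct ref_data {
	void *refpage;
};

/*
 * fail_step forces a failure at step 1, 2 or 3 (0 means no failure).
 * On success, returns 0 and stores the new object in *result.
 * On failure, returns a negative errno and leaves *result untouched.
 */
static int create_ref_data(int fail_step, struct ref_data **result)
{
	struct ref_data *d;
	int ret;

	assert(result != NULL);

	d = (fail_step == 1) ? NULL : calloc(1, sizeof(*d));
	if (!d) {
		ret = -ENOMEM;
		goto out;
	}

	d->refpage = (fail_step == 2) ? NULL : malloc(SKETCH_PAGE_SIZE);
	if (!d->refpage) {
		ret = -ENOMEM;
		goto err_free_data;
	}

	/* Stands in for a late failure, such as no file descriptor available. */
	if (fail_step == 3) {
		ret = -EMFILE;
		goto err_free_page;
	}

	*result = d;
	return 0;

	/* Unwind in reverse order of acquisition. */
err_free_page:
	free(d->refpage);
err_free_data:
	free(d);
out:
	return ret;
}
```

Each label undoes exactly one acquisition, so adding a new step means adding
one label rather than revisiting every earlier error branch.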

> +
> +	return fd;
> +}
> 

thanks,
Matthew Wilcox June 29, 2021, 11:58 a.m. UTC | #7
On Tue, Jun 29, 2021 at 12:19:22AM -0700, John Hubbard wrote:
> > +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> > +		flags)
> 
> From the API discussion (and using a simpler syntax to illustrate this), it
> seems like the following would be close:
> 
> enum content_type {
> 	BYTE_PATTERN,
> 	FOUR_BYTE_PATTERN,
> 	...
> 	FULL_4KB_PAGE
> };
> 
> int refpage_create(const void *__user content, enum content_type, unsigned long flags);
> 
> ...and if content_type == BYTE_PATTERN, then content is a pointer to just one byte of
> data, and so forth for the other enum values.

That seems a little more complicated and non-extensible.

int refpage_create(const void *__user content, unsigned int size,
		unsigned long flags);
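
As a rough illustration of what a (content, size) interface implies, the
sketch below replicates a size-byte pattern across one page in userspace.
The helper name, the fixed 4096-byte page size, and the rule that size must
divide the page size are all assumptions made for the example; the thread
does not settle how the kernel would expand the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for illustration */

/*
 * Hypothetical helper: fill one page by repeating a `size`-byte pattern.
 * Returns 0 on success, or -1 if size is zero or does not evenly divide
 * the page size (an assumed restriction, not something the thread decided).
 */
static int fill_page_from_pattern(unsigned char *page,
				  const unsigned char *pattern,
				  unsigned int size)
{
	unsigned int off;

	assert(page != NULL && pattern != NULL);

	if (size == 0 || SKETCH_PAGE_SIZE % size != 0)
		return -1;

	for (off = 0; off < SKETCH_PAGE_SIZE; off += size)
		memcpy(page + off, pattern, size);

	return 0;
}
```

With size == 1 this reduces to the single-byte pattern case, and with
size == SKETCH_PAGE_SIZE it is a plain page copy, so one parameter covers
both ends of the enum proposal.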
John Hubbard June 29, 2021, 5:48 p.m. UTC | #8
On 6/29/21 4:58 AM, Matthew Wilcox wrote:
> On Tue, Jun 29, 2021 at 12:19:22AM -0700, John Hubbard wrote:
>>> +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
>>> +		flags)
>>
>> From the API discussion (and using a simpler syntax to illustrate this), it
>> seems like the following would be close:
>>
>> enum content_type {
>> 	BYTE_PATTERN,
>> 	FOUR_BYTE_PATTERN,
>> 	...
>> 	FULL_4KB_PAGE
>> };
>>
>> int refpage_create(const void *__user content, enum content_type, unsigned long flags);
>>
>> ...and if content_type == BYTE_PATTERN, then content is a pointer to just one byte of
>> data, and so forth for the other enum values.
> 
> That seems a little more complicated and non-extensible.

This is true. I wasn't sure I liked it all that much, either, when I wrote
it. haha.

> 
> int refpage_create(const void *__user content, unsigned int size,
> 		unsigned long flags);
> 

That does seem better. The key is to have at least one more parameter.

Actually I forgot to include pattern data. In both of the approaches above,
flags is probably used for that, but if we already know that patterns
are being passed, then how about adding a "pattern" arg? I think it's
good to leave a little room for flexibility and future extensions:

int refpage_create(const void *__user content, unsigned int size,
		unsigned long pattern, unsigned long flags);


thanks,
Matthew Wilcox June 29, 2021, 6:21 p.m. UTC | #9
On Tue, Jun 29, 2021 at 10:48:20AM -0700, John Hubbard wrote:
> On 6/29/21 4:58 AM, Matthew Wilcox wrote:
> > int refpage_create(const void *__user content, unsigned int size,
> > 		unsigned long flags);
> > 
> 
> That does seem better. The key is to have at least one more parameter.
> 
> Actually I forgot to include pattern data. In both of the approaches above,
> flags is probably used for that, but if we already know that patterns
> are being passed, then how about add a "pattern" arg? I think it's
> good to leave a little room for flexibility and future extensions:
> 
> int refpage_create(const void *__user content, unsigned int size,
> 		unsigned long pattern, unsigned long flags);

I don't get it.  'size' is the length of the pattern, and it's
pointed to by 'content'.  Why would you pass 'pattern' as well?
John Hubbard June 29, 2021, 6:28 p.m. UTC | #10
On 6/29/21 11:21 AM, Matthew Wilcox wrote:
> On Tue, Jun 29, 2021 at 10:48:20AM -0700, John Hubbard wrote:
>> On 6/29/21 4:58 AM, Matthew Wilcox wrote:
>>> int refpage_create(const void *__user content, unsigned int size,
>>> 		unsigned long flags);
>>>
>>
>> That does seem better. The key is to have at least one more parameter.
>>
>> Actually I forgot to include pattern data. In both of the approaches above,
>> flags is probably used for that, but if we already know that patterns
>> are being passed, then how about add a "pattern" arg? I think it's
>> good to leave a little room for flexibility and future extensions:
>>
>> int refpage_create(const void *__user content, unsigned int size,
>> 		unsigned long pattern, unsigned long flags);
> 
> I don't get it.  'size' is the length of the pattern, and it's
> pointed to by 'content'.  Why would you pass 'pattern' as well?
> 

argghh, I think it's actually best if I avoid this whole "thinking" thing
until after doing the Coffee thing. sigh. :)

Yes, "content" would hold the pattern. So we're back to your exact
function prototype.


thanks,
Peter Collingbourne July 17, 2021, 2:58 a.m. UTC | #11
On Mon, Jun 28, 2021 at 5:24 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Sat, Jun 19, 2021 at 02:20:02AM -0700, Peter Collingbourne wrote:
> >   #include <stdio.h>
> >   #include <stdlib.h>
> >   #include <string.h>
> >   #include <sys/mman.h>
> >   #include <unistd.h>
> >
> >   constexpr unsigned char pattern_byte = 0xaa;
> >
> >   #define PAGE_SIZE 4096
> >
> >   _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];
> >
> >   int main(int argc, char **argv) {
> >     if (argc < 3)
> >       return 1;
> >     bool use_refpage = argc > 3;
> >     size_t mmap_size = atoi(argv[1]);
> >     size_t touch_size = atoi(argv[2]);
> >
> >     int refpage_fd;
> >     if (use_refpage) {
> >       memset(pattern, pattern_byte, PAGE_SIZE);
> >       refpage_fd = syscall(448, pattern, 0);
> >     }
> >     for (unsigned i = 0; i != 1000; ++i) {
> >       char *p;
> >       if (use_refpage) {
> >         p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
> >                          refpage_fd, 0);
> >       } else {
> >         p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
> >                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >         memset(p, pattern_byte, mmap_size);
> >       }
> >       for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
> >         p[j] = 0;
> >       munmap(p, mmap_size);
> >     }
> >   }
>
> I don't like the interface. It is tied to PAGE_SIZE and this doesn't seem
> to be very future looking. How would it work with THPs?

The idea with this interface is that the FD would be passed to mmap,
and anything that uses mmap already needs to be tied to the page size
to some extent.

For THPs I would expect that the kernel would duplicate the contents
of the page as needed.

Another reason that I thought to use a page size based interface was
to allow future optimizations that may reuse the actual page passed to
the syscall. So for example if libc.so contained a page filled with
the required pattern and the allocator passed a pointer to that page
then it could be shared between all of the processes on the system
that link against that libc.

But I suppose that such optimizations would not require passing in a
whole page like that. For pattern based optimizations we could use a
reference counted hash table or something, and for larger patterns we
could activate the optimization only if the size argument were equal
to the page size.

> Maybe we should consider passing down a filling pattern to the kernel and let
> the kernel allocate appropriate page size on read page fault? The pattern has
> to be power of 2 and limited in length.

Okay, so this sounds like my idea for handling THPs except applied to
any size. This seems reasonable enough to me, however in order to
optimize use cases where the page is only ever read, let's have the
kernel prepare the reference page instead of recreating it every time.
In v5 I've adopted Matthew's proposed prototype:

int refpage_create(const void *__user content, unsigned int size,
                unsigned long flags);
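
To make the "pattern of any power-of-2 size" idea concrete, here is a
minimal userspace sketch of how a page (or a THP-sized buffer) could be
prepared from 'content' and 'size'. It is only an illustration: the helper
names refpage_fill() and is_pow2() are hypothetical and do not appear in
the patch, and the power-of-2 restriction is the one Kirill suggested
rather than something the interface is known to enforce.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the patch): 1 if n is a power of two. */
static int is_pow2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical helper (not in the patch): fill dst (dst_size bytes) by
 * tiling the size-byte pattern found at content.
 * Returns 0 on success, -1 if the sizes are not acceptable.
 */
static int refpage_fill(unsigned char *dst, size_t dst_size,
			const void *content, size_t size)
{
	size_t done;

	assert(dst != NULL && content != NULL);

	/* Assumed rule: both sizes are powers of two and the pattern fits. */
	if (!is_pow2(size) || !is_pow2(dst_size) || size > dst_size)
		return -1;

	/* Seed the destination with one copy of the pattern ... */
	memcpy(dst, content, size);

	/* ... then double the filled prefix until dst is fully covered. */
	for (done = size; done < dst_size; done *= 2)
		memcpy(dst + done, dst, done);

	return 0;
}
```

Because the destination size is a parameter, the same routine covers both a
4 KiB page and a larger THP-sized buffer, which is the "duplicate the
contents as needed" behaviour described above.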

Peter
Peter Collingbourne July 17, 2021, 2:58 a.m. UTC | #12
On Mon, Jun 28, 2021 at 6:11 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Jun 19, 2021 at 02:20:02AM -0700, Peter Collingbourne wrote:
> > +++ b/include/linux/mm.h
> > @@ -32,6 +32,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/pgtable.h>
> >  #include <linux/kasan.h>
> > +#include <linux/fs.h>
>
> No.

This was because is_refpage_vma needed to access vm_file->f_op to
check whether the mapping is a reference page. In v5 I've moved that
part of the check into mm/refpage.c.

> > +++ b/include/linux/mman.h
> > @@ -2,6 +2,7 @@
> >  #ifndef _LINUX_MMAN_H
> >  #define _LINUX_MMAN_H
> >
> > +#include <linux/fs.h>
>
> No.

Okay, this one was entirely unused; removed.

Peter
Peter Collingbourne July 17, 2021, 2:58 a.m. UTC | #13
On Mon, Jun 28, 2021 at 12:57 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Jun 28, 2021 at 12:44:22PM -0700, John Hubbard wrote:
> > On 6/28/21 12:33 PM, Matthew Wilcox wrote:
> > ...
> > >
> > > I wonder if single-byte captures enough of the useful possibilities.
> > > In the kernel we have memset32() and memset64() [1] so we could support
> > > a larger pattern than just an 8-bit byte.  It all depends what userspace
> > > would find useful.
> > >
> > > [1] Along with memset_p(), memset_l() and memset16() that aren't terribly
> > > relevant to this use case.  (Although maybe memset_l() would be the right
> > > one to use since there probably aren't too many 32-bit apps that want
> > > a 64-bit pattern and memset64() might not be the fastest on a 32-bit
> > > kernel.)
> > >
> >
> > And in fact, I'm also rather intrigued by doing something like 256 copies
> > of a 16-byte UUID, per 4KB page. In other words, there are *definitely*
> > useful patterns that are longer than a single byte, and it seems interesting
> > to support them here.
> >
> > Kirill's idea of an API that somehow allows various power of 2 patterns seems
> > like it would be nice, because then we don't have to pick a value that seems
> > good in 2021, but less good as time goes by, perhaps.
> >
> > Another thought is to use an entire 4KB page as the smallest pattern unit.
> > That would allow the maximum API flexibility, because the caller could
> > explicitly set every single byte in the page.
>
> That's what this patch does.  If it can be reduced to a pattern (in
> Peter's patch of a single byte; i'm proposing expanding that), then
> the page is filled with the pattern; otherwise we copy the reference
> page.

That sounds good. I propose that for now we only optimize the single
byte pattern and single MTE granule use cases, and allow for
expansion later via the size argument. Programs that use sizes with
optimizations only implemented on newer kernels will still work on
older kernels; they will just be faster on the new kernels.
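
As an illustration of the single byte pattern case, the sketch below
detects the pattern once when the reference page is set up and then picks
memset or a plain copy when a new page is initialised. It is a userspace
approximation only: struct refpage_info, the OPTZN_* names and both
functions are hypothetical stand-ins, loosely modelled on the optzn_kind
and optzn_info fields in the patch, and the MTE granule case is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, loosely modelled on optzn_kind/optzn_info. */
enum optzn_kind { OPTZN_NONE, OPTZN_PATTERN_BYTE };

struct refpage_info {
	const unsigned char *ref;	/* reference page contents */
	size_t len;			/* length of the reference page */
	enum optzn_kind kind;		/* which fast path applies, if any */
	unsigned char byte;		/* pattern byte for OPTZN_PATTERN_BYTE */
};

/* Inspect the reference page once and record whether it is one byte repeated. */
static void refpage_prepare(struct refpage_info *info,
			    const unsigned char *ref, size_t len)
{
	assert(info != NULL && ref != NULL && len > 0);

	info->ref = ref;
	info->len = len;
	info->kind = OPTZN_NONE;
	info->byte = 0;

	/*
	 * Comparing the buffer with itself shifted by one byte checks that
	 * every byte equals its predecessor, i.e. all bytes are identical.
	 */
	if (len == 1 || memcmp(ref, ref + 1, len - 1) == 0) {
		info->kind = OPTZN_PATTERN_BYTE;
		info->byte = ref[0];
	}
}

/* Initialise a fresh page, avoiding a read of the reference page if possible. */
static void refpage_init_page(const struct refpage_info *info,
			      unsigned char *dst)
{
	assert(info != NULL && dst != NULL);

	if (info->kind == OPTZN_PATTERN_BYTE)
		memset(dst, info->byte, info->len);
	else
		memcpy(dst, info->ref, info->len);
}
```

A reference page that does not reduce to a known pattern still works through
the copy path, which matches the compatibility point above: unoptimized
sizes stay correct and are merely slower.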

Peter
Peter Collingbourne July 17, 2021, 2:59 a.m. UTC | #14
On Tue, Jun 29, 2021 at 12:19 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 6/19/21 2:20 AM, Peter Collingbourne wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
>
> Hi Peter,
>
> Now that you have shown that this seems to have some performance
> justification, I've taken a closer look at the patch, and have a handful
> of small suggestions, most of them very easy to deal with.
>
> First of all: documentation of the new syscall. At the very least,
> refpage.c could use a bunch of the wording that is in this patch's
> commit description, at the top. I'm sure there are other places for new
> syscall documentation (someone else probably knows where), but that would
> be a good start.

Okay, I copied some of the text from the commit message into a comment
at the top of refpage.c. I also wrote a man page for the new syscall,
which I'm sending out concurrently.

> >
> > Reference pages are useful in circumstances where anonymous mappings
> > combined with manual stores to memory would impose undesirable costs,
> > either in terms of performance or RSS. Use cases are focused on heap
> > allocators and include:
> >
> > - Pattern initialization for the heap. This is where malloc(3) gives
> >    you memory whose contents are filled with a non-zero pattern
> >    byte, in order to help detect and mitigate bugs involving use
> >    of uninitialized memory. Typically this is implemented by having
> >    the allocator memset the allocation with the pattern byte before
> >    returning it to the user, but for large allocations this can result
> >    in a significant increase in RSS, especially for allocations that
> >    are used sparsely. Even for dense allocations there is a needless
> >    impact to startup performance when it may be better to amortize it
> >    throughout the program. By creating allocations using a reference
> >    page filled with the pattern byte, we can avoid these costs.
>
> As Kirill and Matthew mentioned in the other thread, it would be good
> to pass in the pattern as part of the syscall, instead of deducing it
> in prep_refpage_private_data(). I'll cover that more in the diffs area.
>
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> >    feature which allows for memory to be tagged in order to detect
> >    certain kinds of memory errors with low overhead. In order to set
> >    up an allocation to allow memory errors to be detected, the entire
> >    allocation needs to have the same tag. The issue here is similar to
> >    pattern initialization in the sense that large tagged allocations
> >    will be expensive if the tagging is done up front. The idea is that
> >    the allocator would create reference pages with each of the possible
> >    memory tags, and use those reference pages for the large allocations.
> >
> > This patch includes specific optimizations for these use cases in
> > order to reduce memory traffic. If the reference page consists of a
> > single repeating byte, the page is initialized using memset, and on
> > arm64 if the reference page consists of a uniformly tagged zero page,
> > the DC GZVA instruction is used to initialize the page.
> >
> > In order to measure the performance and RSS impact of reference pages,
> > I used the following microbenchmark program, which is intended to
> > compare an implementation of heap pattern initialization that uses
> > memset to initialize the pages against an implementation that uses
> > reference pages:
> >
> >    #include <stdio.h>
> >    #include <stdlib.h>
> >    #include <string.h>
> >    #include <sys/mman.h>
> >    #include <unistd.h>
> >
> >    constexpr unsigned char pattern_byte = 0xaa;
> >
> >    #define PAGE_SIZE 4096
> >
> >    _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];
> >
> >    int main(int argc, char **argv) {
> >      if (argc < 3)
> >        return 1;
> >      bool use_refpage = argc > 3;
> >      size_t mmap_size = atoi(argv[1]);
> >      size_t touch_size = atoi(argv[2]);
> >
> >      int refpage_fd;
> >      if (use_refpage) {
> >        memset(pattern, pattern_byte, PAGE_SIZE);
> >        refpage_fd = syscall(448, pattern, 0);
> >      }
> >      for (unsigned i = 0; i != 1000; ++i) {
> >        char *p;
> >        if (use_refpage) {
> >          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
> >                           refpage_fd, 0);
> >        } else {
> >          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
> >                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >          memset(p, pattern_byte, mmap_size);
> >        }
> >        for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
> >          p[j] = 0;
> >        munmap(p, mmap_size);
> >      }
> >    }
> >
>
>
> That sample code would be very nice to include in a documentation
> section too, once we figure out the best place to put
> it. If no one else recommends anything, then I'd start with
> Documentation/mm/reference_pages.rst.

I would propose the man page to be the canonical source of
documentation for this syscall, since I would expect it to be the
first place that users will look when trying to understand code that
uses it, as opposed to the kernel's internal documentation.

I added some sample code to the man page, but not exactly the code
above since that code is more of a benchmark than a demonstration of
the feature, and I would expect the latter to be more useful to
readers.

> > On a DragonBoard 845c with the powersave governor, and taking the
> > median of 10 runs for each measurement, I measured the following
> > results for real time (s):
> >
> > touch_size/mmap_size   memset   refpages     improvement (95% CI)
> >        4096/4096000    3.962194   0.026726   98.8015% +/- 1.14684%
> >     2048000/4096000    3.925309   1.48081    61.8271% +/- 1.11911%
> >     4096000/4096000    3.986275   3.385003   15.1205% +/- 0.227235%
> >
> > And the following for max RSS (KiB):
> >
> > touch_size/mmap_size   memset   refpages     improvement (95% CI)
> >        4096/4096000      6656      3448      49.3815% +/- 1.30339%
> >     2048000/4096000      6696      4580      31.7053% +/- 1.16411%
> >     4096000/4096000      6716      6684              none
> >
> > So we see a large improvement for sparsely used allocations, and even
> > a modest perf improvement for fully utilized allocations as a result
> > of touching the pages one fewer time (with memset: once in the kernel
> > and once in userspace; with refpages: just once in the kernel).
> >
> > Signed-off-by: Peter Collingbourne <pcc@google.com>
> > Link: [1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
> > ---
> > v4:
> > - rebased to linux-next
> > - added arch hooks to support MTE tagged reference pages
> > - added optimizations for pages with pattern byte as well as uniformly MTE-tagged pages
> > - added helper functions to avoid open-coding the reference page detection
> > - wrote a microbenchmark program and got new perf results for the commit message
> >
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_EVENT_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_EVENT_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator. Furthermore, even if
> > the interface issues are resolved, I suspect (but have not measured)
> > that the cost of the multiple context switches between kernel and
> > userspace would be too high to be used in an allocator anyway.
> >
> >   arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
> >   arch/arm/tools/syscall.tbl                  |  1 +
> >   arch/arm64/include/asm/mman.h               | 15 ++++
> >   arch/arm64/include/asm/mte.h                |  9 +-
> >   arch/arm64/include/asm/page.h               |  2 +-
> >   arch/arm64/include/asm/unistd.h             |  2 +-
> >   arch/arm64/include/asm/unistd32.h           |  2 +
> >   arch/arm64/kernel/mte.c                     | 24 +++++
> >   arch/arm64/lib/mte.S                        |  7 +-
> >   arch/arm64/mm/fault.c                       | 41 ++++++++-
> >   arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
> >   arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
> >   arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
> >   arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
> >   arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
> >   arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
> >   arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
> >   arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
> >   arch/s390/kernel/syscalls/syscall.tbl       |  1 +
> >   arch/sh/kernel/syscalls/syscall.tbl         |  1 +
> >   arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
> >   arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
> >   arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
> >   arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
> >   include/linux/gfp.h                         | 11 ++-
> >   include/linux/highmem.h                     |  2 +-
> >   include/linux/huge_mm.h                     |  7 ++
> >   include/linux/mm.h                          | 39 ++++++++
> >   include/linux/mman.h                        | 19 ++++
> >   include/linux/syscalls.h                    |  3 +
> >   include/uapi/asm-generic/unistd.h           |  5 +-
> >   kernel/sys_ni.c                             |  1 +
> >   mm/Makefile                                 |  4 +-
> >   mm/gup.c                                    |  2 +-
> >   mm/kasan/hw_tags.c                          |  2 +-
> >   mm/memory.c                                 | 47 +++++++---
> >   mm/migrate.c                                |  4 +-
> >   mm/page_alloc.c                             |  2 +-
> >   mm/refpage.c                                | 98 +++++++++++++++++++++
> >   39 files changed, 330 insertions(+), 34 deletions(-)
> >   create mode 100644 mm/refpage.c
> >
> > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> > index a17687ed4b51..494edc5ca61c 100644
> > --- a/arch/alpha/kernel/syscalls/syscall.tbl
> > +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> > @@ -486,3 +486,4 @@
> >   554 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   555 common  landlock_add_rule               sys_landlock_add_rule
> >   556 common  landlock_restrict_self          sys_landlock_restrict_self
> > +558  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> > index c5df1179fc5d..8fd7045f46b9 100644
> > --- a/arch/arm/tools/syscall.tbl
> > +++ b/arch/arm/tools/syscall.tbl
> > @@ -460,3 +460,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> > index e3e28f7daf62..5c0da3f76ec7 100644
> > --- a/arch/arm64/include/asm/mman.h
> > +++ b/arch/arm64/include/asm/mman.h
> > @@ -84,4 +84,19 @@ static inline bool arch_validate_flags(unsigned long vm_flags)
> >   }
> >   #define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
> >
> > +struct refpage_private_data;
> > +
> > +void arch_prep_refpage_private_data(struct refpage_private_data *priv);
> > +#define arch_prep_refpage_private_data arch_prep_refpage_private_data
> > +
> > +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> > +{
> > +     vma->vm_flags |= VM_MTE_ALLOWED;
> > +}
> > +#define arch_prep_refpage_vma arch_prep_refpage_vma
> > +
> > +void arch_copy_refpage(struct page *page, unsigned long addr,
> > +                                  struct vm_area_struct *vma);
> > +#define arch_copy_refpage arch_copy_refpage
> > +
> >   #endif /* ! __ASM_MMAN_H__ */
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 67bf259ae768..b513f83010c7 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
> >   /* track which pages have valid allocation tags */
> >   #define PG_mte_tagged       PG_arch_2
> >
> > -void mte_zero_clear_page_tags(void *addr);
> > +void mte_zero_set_page_tags(void *addr);
>
>
> We should preserve the existing mte_zero_clear_page_tags(), and just
> implement it in terms of the new, more general mte_zero_set_page_tags().
> This is because: a) it will remove some diffs from this patch, and more
> importantly, b) the concept of zeroing is still a distinct and useful
> thing to have here.

With this patch there is only a single caller of
mte_zero_set_page_tags(), and that caller may pass an arbitrarily
tagged address. That means there would be no callers left for
mte_zero_clear_page_tags().

> >   void mte_sync_tags(pte_t *ptep, pte_t pte);
> >   void mte_copy_page_tags(void *kto, const void *kfrom);
> >   void mte_thread_init_user(void);
> > @@ -48,13 +48,14 @@ long set_mte_ctrl(struct task_struct *task, unsigned long arg);
> >   long get_mte_ctrl(struct task_struct *task);
> >   int mte_ptrace_copy_tags(struct task_struct *child, long request,
> >                        unsigned long addr, unsigned long data);
> > +u8 mte_check_tag_zero_page(struct page *userpage);
> >
> >   #else /* CONFIG_ARM64_MTE */
> >
> >   /* unused if !CONFIG_ARM64_MTE, silence the compiler */
> >   #define PG_mte_tagged       0
> >
> > -static inline void mte_zero_clear_page_tags(void *addr)
> > +static inline void mte_zero_set_page_tags(void *addr)
> >   {
> >   }
> >   static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
> > @@ -89,6 +90,10 @@ static inline int mte_ptrace_copy_tags(struct task_struct *child,
> >   {
> >       return -EIO;
> >   }
> > +static inline u8 mte_check_tag_zero_page(struct page *userpage)
> > +{
> > +     return 0;
> > +}
> >
> >   #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index 993a27ea6f54..234f48688b1a 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -33,7 +33,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
> >                                               unsigned long vaddr);
> >   #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
> >
> > -void tag_clear_highpage(struct page *to);
> > +void tag_set_highpage(struct page *to, unsigned long tag);
>
>
> Same reasoning here: let's preserve tag_clear_highpage(), as well.

Makes sense, done.
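
One way to read "preserve tag_clear_highpage()" is as the tag-0 case of the
new setter. The userspace sketch below shows only the address-tag
arithmetic that tag_set_highpage() performs in the quoted diff. The
constants are assumed to mirror the arm64 definitions (a 4-bit tag in
address bits 59:56), and the helper names addr_set_tag() and
addr_clear_tag() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror arm64: a 4-bit logical tag in address bits 59:56. */
#define MTE_TAG_SHIFT	56
#define MTE_TAG_MASK	(UINT64_C(0xf) << MTE_TAG_SHIFT)

/* Hypothetical helper: return addr with its tag replaced by tag (0..15). */
static uint64_t addr_set_tag(uint64_t addr, uint64_t tag)
{
	assert(tag <= 0xf);

	addr &= ~MTE_TAG_MASK;		/* drop whatever tag was there */
	addr |= tag << MTE_TAG_SHIFT;	/* insert the requested tag */
	return addr;
}

/* Hypothetical helper: clearing is simply setting tag 0. */
static uint64_t addr_clear_tag(uint64_t addr)
{
	return addr_set_tag(addr, 0);
}
```

With this shape the "clear" variant stays available to existing callers
while sharing one implementation with the more general "set" variant.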

> >   #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
> >
> >   #define clear_user_page(page, vaddr, pg)    clear_page(page)
> > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> > index 727bfc3be99b..3cb206aea3db 100644
> > --- a/arch/arm64/include/asm/unistd.h
> > +++ b/arch/arm64/include/asm/unistd.h
> > @@ -38,7 +38,7 @@
> >   #define __ARM_NR_compat_set_tls             (__ARM_NR_COMPAT_BASE + 5)
> >   #define __ARM_NR_COMPAT_END         (__ARM_NR_COMPAT_BASE + 0x800)
> >
> > -#define __NR_compat_syscalls         447
> > +#define __NR_compat_syscalls         449
> >   #endif
> >
> >   #define __ARCH_WANT_SYS_CLONE
> > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> > index 99ffcafc736c..2a116aa17fe7 100644
> > --- a/arch/arm64/include/asm/unistd32.h
> > +++ b/arch/arm64/include/asm/unistd32.h
> > @@ -901,6 +901,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> >   __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> >   #define __NR_landlock_restrict_self 446
> >   __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > +#define __NR_refpage_create 448
> > +__SYSCALL(__NR_refpage_create, sys_refpage_create)
> >
> >   /*
> >    * Please add new compat syscalls above this comment and update
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index 125a10e413e9..6a79240d5a77 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -453,3 +453,27 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
> >
> >       return ret;
> >   }
> > +
> > +u8 mte_check_tag_zero_page(struct page *userpage)
> > +{
> > +     char *userpage_addr = page_address(userpage);
> > +     u8 tag;
> > +     int i;
> > +
> > +     if (!test_bit(PG_mte_tagged, &userpage->flags))
> > +             return 0;
> > +
> > +     tag = mte_get_mem_tag(userpage_addr) & 0xF;
> > +     if (tag == 0)
> > +             return 0;
> > +
> > +     for (i = 0; i != PAGE_SIZE; ++i)
> > +             if (userpage_addr[i] != 0)
> > +                     return 0;
> > +
> > +     for (i = MTE_GRANULE_SIZE; i != PAGE_SIZE; i += MTE_GRANULE_SIZE)
> > +             if ((mte_get_mem_tag(userpage_addr + i) & 0xF) != tag)
> > +                     return 0;
> > +
> > +     return tag;
> > +}
> > diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
> > index e83643b3995f..45be436c97af 100644
> > --- a/arch/arm64/lib/mte.S
> > +++ b/arch/arm64/lib/mte.S
> > @@ -37,24 +37,23 @@ SYM_FUNC_START(mte_clear_page_tags)
> >   SYM_FUNC_END(mte_clear_page_tags)
> >
> >   /*
> > - * Zero the page and tags at the same time
> > + * Zero the page and set tags at the same time
> >    *
> >    * Parameters:
> >    *  x0 - address to the beginning of the page
> >    */
> > -SYM_FUNC_START(mte_zero_clear_page_tags)
> > +SYM_FUNC_START(mte_zero_set_page_tags)
> >       mrs     x1, dczid_el0
> >       and     w1, w1, #0xf
> >       mov     x2, #4
> >       lsl     x1, x2, x1
> > -     and     x0, x0, #(1 << MTE_TAG_SHIFT) - 1       // clear the tag
> >
> >   1:  dc      gzva, x0
> >       add     x0, x0, x1
> >       tst     x0, #(PAGE_SIZE - 1)
> >       b.ne    1b
> >       ret
> > -SYM_FUNC_END(mte_zero_clear_page_tags)
> > +SYM_FUNC_END(mte_zero_set_page_tags)
> >
> >   /*
> >    * Copy the tags from the source page to the destination one
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 349c488765ca..36355758ffc7 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -25,6 +25,7 @@
> >   #include <linux/perf_event.h>
> >   #include <linux/preempt.h>
> >   #include <linux/hugetlb.h>
> > +#include <linux/mman.h>
> >
> >   #include <asm/acpi.h>
> >   #include <asm/bug.h>
> > @@ -939,9 +940,45 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
> >       return alloc_page_vma(flags, vma, vaddr);
> >   }
> >
> > -void tag_clear_highpage(struct page *page)
> > +void tag_set_highpage(struct page *page, unsigned long tag)
> >   {
> > -     mte_zero_clear_page_tags(page_address(page));
> > +     unsigned long addr = (unsigned long)page_address(page);
> > +
> > +     addr &= ~MTE_TAG_MASK;
> > +     addr |= tag << MTE_TAG_SHIFT;
> > +     mte_zero_set_page_tags((void *)addr);
> >       page_kasan_tag_reset(page);
> >       set_bit(PG_mte_tagged, &page->flags);
> >   }
> > +
> > +#define REFPAGE_OPTZN_MTE_TAGGED REFPAGE_OPTZN_ARCH
>
> I see what you're doing with the arch layer here, but there's no need to
> accept the minor drawbacks (of having this #define hidden away near the
> bottom of a .c file). Instead, let's just put this into the list in
> mm.h, and call it what it is, rather than "arch".

Done.

> > +
> > +void arch_prep_refpage_private_data(struct refpage_private_data *priv)
> > +{
> > +     if (system_supports_mte()) {
> > +             u8 tag;
> > +
> > +             if (!test_and_set_bit(PG_mte_tagged, &priv->refpage->flags))
> > +                     mte_clear_page_tags(page_address(priv->refpage));
> > +
> > +             tag = mte_check_tag_zero_page(priv->refpage);
> > +             if (tag) {
> > +                     priv->optzn_kind = REFPAGE_OPTZN_MTE_TAGGED;
> > +                     priv->optzn_info = tag;
> > +                     return;
> > +             }
> > +     }
> > +
> > +     prep_refpage_private_data(priv);
> > +}
> > +
> > +void arch_copy_refpage(struct page *page, unsigned long addr,
> > +                    struct vm_area_struct *vma)
> > +{
> > +     struct refpage_private_data *priv = vma->vm_private_data;
> > +
> > +     if (priv->optzn_kind == REFPAGE_OPTZN_MTE_TAGGED)
> > +             tag_set_highpage(page, priv->optzn_info);
> > +     else
> > +             copy_refpage(page, addr, vma);
> > +}
> > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> > index 6d07742c57b8..c2209d83f3c3 100644
> > --- a/arch/ia64/kernel/syscalls/syscall.tbl
> > +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> > @@ -367,3 +367,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> > index 541bc1b3a8f9..0360cf474a49 100644
> > --- a/arch/m68k/kernel/syscalls/syscall.tbl
> > +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> > @@ -446,3 +446,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> > index a176faca2927..de85d758e564 100644
> > --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> > @@ -452,3 +452,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > index c2d2e19abea8..b07c7293d2a3 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > @@ -385,3 +385,4 @@
> >   444 n32     landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 n32     landlock_add_rule               sys_landlock_add_rule
> >   446 n32     landlock_restrict_self          sys_landlock_restrict_self
> > +448  n32     refpage_create                  sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > index ac653d08b1ea..7ebabb99dd06 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > @@ -361,3 +361,4 @@
> >   444 n64     landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 n64     landlock_add_rule               sys_landlock_add_rule
> >   446 n64     landlock_restrict_self          sys_landlock_restrict_self
> > +448  n64     refpage_create                  sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > index 253f2cd70b6b..a51149ac101c 100644
> > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > @@ -434,3 +434,4 @@
> >   444 o32     landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 o32     landlock_add_rule               sys_landlock_add_rule
> >   446 o32     landlock_restrict_self          sys_landlock_restrict_self
> > +448  o32     refpage_create                  sys_refpage_create
> > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> > index e26187b9ab87..385565864861 100644
> > --- a/arch/parisc/kernel/syscalls/syscall.tbl
> > +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> > @@ -444,3 +444,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> > index aef2a290e71a..95cdd9f7dc06 100644
> > --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> > @@ -526,3 +526,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> > index 64d51ab5a8b4..92ed1260ffd9 100644
> > --- a/arch/s390/kernel/syscalls/syscall.tbl
> > +++ b/arch/s390/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> >   444  common landlock_create_ruleset sys_landlock_create_ruleset     sys_landlock_create_ruleset
> >   445  common landlock_add_rule       sys_landlock_add_rule           sys_landlock_add_rule
> >   446  common landlock_restrict_self  sys_landlock_restrict_self      sys_landlock_restrict_self
> > +448  common  refpage_create          sys_refpage_create              sys_refpage_create
> > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> > index e0a70be77d84..f9d198cc2541 100644
> > --- a/arch/sh/kernel/syscalls/syscall.tbl
> > +++ b/arch/sh/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> > index 603f5a821502..83533aa49340 100644
> > --- a/arch/sparc/kernel/syscalls/syscall.tbl
> > +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> > @@ -492,3 +492,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index ce763a12311c..054c69e395b5 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -452,3 +452,4 @@
> >   445 i386    landlock_add_rule       sys_landlock_add_rule
> >   446 i386    landlock_restrict_self  sys_landlock_restrict_self
> >   447 i386    memfd_secret            sys_memfd_secret
> > +448  i386    refpage_create          sys_refpage_create
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index f6b57799c1ea..1f24f0b66cbd 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -369,6 +369,7 @@
> >   445 common  landlock_add_rule       sys_landlock_add_rule
> >   446 common  landlock_restrict_self  sys_landlock_restrict_self
> >   447 common  memfd_secret            sys_memfd_secret
> > +448  common  refpage_create          sys_refpage_create
> >
> >   #
> >   # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> > index 235d67d6ceb4..96c27fb404ca 100644
> > --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> > @@ -417,3 +417,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 55b2ec1f965a..ae3c763eb9e9 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -55,8 +55,9 @@ struct vm_area_struct;
> >   #define ___GFP_ACCOUNT              0x400000u
> >   #define ___GFP_ZEROTAGS             0x800000u
> >   #define ___GFP_SKIP_KASAN_POISON    0x1000000u
> > +#define ___GFP_NOZERO                0x2000000u
> >   #ifdef CONFIG_LOCKDEP
> > -#define ___GFP_NOLOCKDEP     0x2000000u
> > +#define ___GFP_NOLOCKDEP     0x4000000u
> >   #else
> >   #define ___GFP_NOLOCKDEP    0
> >   #endif
> > @@ -238,18 +239,24 @@ struct vm_area_struct;
> >    * %__GFP_SKIP_KASAN_POISON returns a page which does not need to be poisoned
> >    * on deallocation. Typically used for userspace pages. Currently only has an
> >    * effect in HW tags mode.
> > + *
> > + * %__GFP_NOZERO disables any implicit zeroing of the page (e.g. as a result
> > + * of init_on_alloc=on). This flag should only be used to address specific
> > + * performance bottlenecks and only if the page is clearly being fully
> > + * initialized following the allocation.
> >    */
> >   #define __GFP_NOWARN        ((__force gfp_t)___GFP_NOWARN)
> >   #define __GFP_COMP  ((__force gfp_t)___GFP_COMP)
> >   #define __GFP_ZERO  ((__force gfp_t)___GFP_ZERO)
> >   #define __GFP_ZEROTAGS      ((__force gfp_t)___GFP_ZEROTAGS)
> >   #define __GFP_SKIP_KASAN_POISON     ((__force gfp_t)___GFP_SKIP_KASAN_POISON)
> > +#define __GFP_NOZERO ((__force gfp_t)___GFP_NOZERO)
> >
> >   /* Disable lockdep for GFP context tracking */
> >   #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
> >
> >   /* Room for N __GFP_FOO bits */
> > -#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
> > +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
> >   #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> >
> >   /**
> > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > index 93ba33c09d12..6c5076dd1e9b 100644
> > --- a/include/linux/highmem.h
> > +++ b/include/linux/highmem.h
> > @@ -187,7 +187,7 @@ static inline void clear_highpage(struct page *page)
> >
> >   #ifndef __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
> >
> > -static inline void tag_clear_highpage(struct page *page)
> > +static inline void tag_set_highpage(struct page *page, unsigned long tag)
> >   {
> >   }
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index f123e15d966e..36ecfc391b46 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -127,6 +127,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> >
> >       if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> >               return false;
> > +
> > +     /*
> > +      * Transparent hugepages not currently supported for anonymous VMAs with
> > +      * reference pages
> > +      */
> > +     if (unlikely(is_refpage_vma(vma)))
> > +             return false;
> >       return true;
> >   }
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a127d93612fa..8cff9e0463b5 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -32,6 +32,7 @@
> >   #include <linux/sched.h>
> >   #include <linux/pgtable.h>
> >   #include <linux/kasan.h>
> > +#include <linux/fs.h>
> >
> >   struct mempolicy;
> >   struct anon_vma;
> > @@ -722,6 +723,42 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
> >   /* flush_tlb_range() takes a vma, not a mm, and can care about flags */
> >   #define TLB_FLUSH_VMA(mm,flags) { .vm_mm = (mm), .vm_flags = (flags) }
> >
> > +extern const struct file_operations refpage_file_operations;
> > +
> > +struct refpage_private_data {
> > +     struct page *refpage;
> > +     u8 optzn_kind;
>
> How about:
>         u8 content_type;
>
> > +     u8 optzn_info;
>
> and:
>         u8 pattern[16]; // or whatever size the enums go up to, see below
>
> > +};
> > +
> > +#define REFPAGE_OPTZN_NONE   0
>
> For this next set, how about REFPAGE_CONTENT_TYPE_ for a prefix?
> The spelling of OPTZN is tough, and there's no particular need internally
> to call these out as optimizations.
>
> So then this one becomes:
>
> #define REFPAGE_CONTENT_TYPE_USER_SET   0
>
> > +#define REFPAGE_OPTZN_PATTERN        1
> > +#define REFPAGE_OPTZN_ARCH   2
>
> And for the last one, let's avoid the arch hiding and just call it what it
> is, no reason not to:
>
> #define REFPAGE_CONTENT_TYPE_MTE_TAGGED 2

Done. But I think that MTE_TAGGED's usage of the field formerly known
as "optzn_info" is sufficiently different from PATTERN that "pattern"
is probably not a great name. So let's give that field a more opaque
name -- I chose "content_info".
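
To make the renaming concrete, here is a rough sketch of how the
definitions might look afterwards. It is written so that it compiles in
userspace and is not the actual hunk from the next revision.
REFPAGE_CONTENT_TYPE_PATTERN is an assumed spelling for the middle
constant (only the other two were spelled out above), struct page is
left opaque, and refpage_content_info_meaning() is a hypothetical helper
that exists only to illustrate why the field name is kept opaque.

```c
#include <assert.h>
#include <stdint.h>

struct page;	/* opaque stand-in for the kernel's struct page */

#define REFPAGE_CONTENT_TYPE_USER_SET   0
#define REFPAGE_CONTENT_TYPE_PATTERN    1	/* assumed name */
#define REFPAGE_CONTENT_TYPE_MTE_TAGGED 2

struct refpage_private_data {
	struct page *refpage;
	uint8_t content_type;	/* one of REFPAGE_CONTENT_TYPE_* */
	uint8_t content_info;	/* meaning depends on content_type */
};

/* Every content type has to fit in the one-byte content_type field. */
static_assert(REFPAGE_CONTENT_TYPE_MTE_TAGGED <= UINT8_MAX,
	      "content types must fit in content_type");

/* Hypothetical helper: how content_info is interpreted for each type. */
static const char *refpage_content_info_meaning(uint8_t content_type)
{
	switch (content_type) {
	case REFPAGE_CONTENT_TYPE_PATTERN:
		return "pattern byte";
	case REFPAGE_CONTENT_TYPE_MTE_TAGGED:
		return "MTE tag";
	default:
		return "unused";
	}
}
```

The same byte means something different for PATTERN and for MTE_TAGGED,
which is the reason for not calling it "pattern".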

> > +
> > +static inline bool is_refpage_vma(struct vm_area_struct *vma)
> > +{
> > +     return vma->vm_file && vma->vm_file->f_op == &refpage_file_operations;
> > +}
> > +
> > +static inline struct page *get_vma_refpage(struct vm_area_struct *vma)
> > +{
> > +     struct refpage_private_data *priv = vma->vm_private_data;
> > +
> > +     BUG_ON(!is_refpage_vma(vma));
> > +     return priv->refpage;
> > +}
> > +
> > +static inline int is_refpage_pfn(struct vm_area_struct *vma, unsigned long pfn)
> > +{
> > +     return is_refpage_vma(vma) && pfn == page_to_pfn(get_vma_refpage(vma));
> > +}
> > +
> > +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> > +                                      unsigned long pfn)
> > +{
> > +     return is_zero_pfn(pfn) || is_refpage_pfn(vma, pfn);
> > +}
> > +
>
>
> I don't think this helper function is helping enough to justify itself,
> seeing as how it is quite clear when the implementation is used instead. No
> big deal either way, though.

Fair. That ends up making the code a bit larger, but perhaps clarity
at the call site is more important. I removed it.

> >   struct mmu_gather;
> >   struct inode;
> >
> > @@ -2977,6 +3014,8 @@ static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
> >   DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
> >   static inline bool want_init_on_alloc(gfp_t flags)
> >   {
> > +     if (flags & __GFP_NOZERO)
> > +             return false;
> >       if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
> >                               &init_on_alloc))
> >               return true;
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index ebb09a964272..cdf8f8245c78 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -2,6 +2,7 @@
> >   #ifndef _LINUX_MMAN_H
> >   #define _LINUX_MMAN_H
> >
> > +#include <linux/fs.h>
> >   #include <linux/mm.h>
> >   #include <linux/percpu_counter.h>
> >
> > @@ -123,6 +124,24 @@ static inline bool arch_validate_flags(unsigned long flags)
> >   #define arch_validate_flags arch_validate_flags
> >   #endif
> >
> > +void prep_refpage_private_data(struct refpage_private_data *priv);
> > +#ifndef arch_prep_refpage_private_data
> > +#define arch_prep_refpage_private_data prep_refpage_private_data
> > +#endif
> > +
> > +#ifndef arch_prep_refpage_vma
> > +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> > +{
> > +}
> > +#define arch_prep_refpage_vma arch_prep_refpage_vma
> > +#endif
> > +
> > +void copy_refpage(struct page *page, unsigned long addr,
> > +               struct vm_area_struct *vma);
> > +#ifndef arch_copy_refpage
> > +#define arch_copy_refpage copy_refpage
> > +#endif
> > +
> >   /*
> >    * Optimisation macro.  It is equivalent to:
> >    *      (x & bit1) ? bit2 : 0
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 69c9a7010081..303a28a86500 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -864,6 +864,9 @@ asmlinkage long sys_mremap(unsigned long addr,
> >                          unsigned long old_len, unsigned long new_len,
> >                          unsigned long flags, unsigned long new_addr);
> >
> > +/* mm/refpage.c */
> > +asmlinkage long sys_refpage_create(const void __user *content, unsigned long flags);
> > +
> >   /* security/keys/keyctl.c */
> >   asmlinkage long sys_add_key(const char __user *_type,
> >                           const char __user *_description,
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index a9d6fcd95f42..54cede7db5f0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -878,8 +878,11 @@ __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> >   __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
> >   #endif
> >
> > +#define __NR_refpage_create 448
> > +__SYSCALL(__NR_refpage_create, sys_refpage_create)
> > +
> >   #undef __NR_syscalls
> > -#define __NR_syscalls 448
> > +#define __NR_syscalls 449
> >
> >   /*
> >    * 32 bit systems traditionally used different
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 30971b1dd4a9..bc65a54eb2a4 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -300,6 +300,7 @@ COND_SYSCALL(migrate_pages);
> >   COND_SYSCALL_COMPAT(migrate_pages);
> >   COND_SYSCALL(move_pages);
> >   COND_SYSCALL_COMPAT(move_pages);
> > +COND_SYSCALL(refpage_create);
> >
> >   COND_SYSCALL(perf_event_open);
> >   COND_SYSCALL(accept4);
> > diff --git a/mm/Makefile b/mm/Makefile
> > index e3436741d539..137adc22bf50 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -35,10 +35,10 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
> >   CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
> >
> >   mmu-y                       := nommu.o
> > -mmu-$(CONFIG_MMU)    := highmem.o memory.o mincore.o \
> > +mmu-$(CONFIG_MMU)    := highmem.o ioremap.o memory.o mincore.o \
> >                          mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
> >                          msync.o page_vma_mapped.o pagewalk.o \
> > -                        pgtable-generic.o rmap.o vmalloc.o ioremap.o
> > +                        pgtable-generic.o refpage.o rmap.o vmalloc.o
> >
> >
> >   ifdef CONFIG_CROSS_MEMORY_ATTACH
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 42b8b1fa6521..ba1b7bd7a0a0 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -548,7 +548,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> >                       goto out;
> >               }
> >
> > -             if (is_zero_pfn(pte_pfn(pte))) {
> > +             if (is_zero_or_refpage_pfn(vma, pte_pfn(pte))) {
> >                       page = pte_page(pte);
> >               } else {
> >                       ret = follow_pfn_pte(vma, address, ptep, flags);
> > diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
> > index ed5e5b833d61..3c433e430c80 100644
> > --- a/mm/kasan/hw_tags.c
> > +++ b/mm/kasan/hw_tags.c
> > @@ -253,7 +253,7 @@ void kasan_alloc_pages(struct page *page, unsigned int order, gfp_t flags)
> >               int i;
> >
> >               for (i = 0; i != 1 << order; ++i)
> > -                     tag_clear_highpage(page + i);
> > +                     tag_set_highpage(page + i, 0);
>
>
> Here, we could avoid this diff, by preserving tag_clear_highpage(). And
> that's good, because the current diff is making the code just ever so
> slightly worse. :)

Done.

> >       } else {
> >               kasan_unpoison_pages(page, order, init);
> >       }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index db86558791f1..8b32bdd215b7 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -614,7 +614,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> >                       return vma->vm_ops->find_special_page(vma, addr);
> >               if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> >                       return NULL;
> > -             if (is_zero_pfn(pfn))
> > +             if (is_zero_or_refpage_pfn(vma, pfn))
> >                       return NULL;
> >               if (pte_devmap(pte))
> >                       return NULL;
> > @@ -640,7 +640,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> >               }
> >       }
> >
> > -     if (is_zero_pfn(pfn))
> > +     if (is_zero_or_refpage_pfn(vma, pfn))
> >               return NULL;
> >
> >   check_pfn:
> > @@ -2166,7 +2166,7 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
> >               return true;
> >       if (pfn_t_special(pfn))
> >               return true;
> > -     if (is_zero_pfn(pfn_t_to_pfn(pfn)))
> > +     if (is_zero_or_refpage_pfn(vma, pfn_t_to_pfn(pfn)))
> >               return true;
> >       return false;
> >   }
> > @@ -2990,22 +2990,29 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> >       pte_t entry;
> >       int page_copied = 0;
> >       struct mmu_notifier_range range;
> > +     unsigned long pfn;
> >
> >       if (unlikely(anon_vma_prepare(vma)))
> >               goto oom;
> >
> > -     if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> > +     pfn = pte_pfn(vmf->orig_pte);
> > +     if (is_zero_pfn(pfn)) {
> >               new_page = alloc_zeroed_user_highpage_movable(vma,
> >                                                             vmf->address);
> >               if (!new_page)
> >                       goto oom;
> >       } else {
> > -             new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> > -                             vmf->address);
> > +             bool refpage = is_refpage_pfn(vma, pfn);
> > +
> > +             new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE |
> > +                                               (refpage ? __GFP_NOZERO : 0),
> > +                                       vma, vmf->address);
> >               if (!new_page)
> >                       goto oom;
> >
> > -             if (!cow_user_page(new_page, old_page, vmf)) {
> > +             if (refpage) {
> > +                     arch_copy_refpage(new_page, vmf->address, vma);
> > +             } else if (!cow_user_page(new_page, old_page, vmf)) {
> >                       /*
> >                        * COW failed, if the fault was solved by other,
> >                        * it's fine. If not, userspace would re-fault on
> > @@ -3739,11 +3746,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >       if (unlikely(pmd_trans_unstable(vmf->pmd)))
> >               return 0;
> >
> > -     /* Use the zero-page for reads */
> > +     /* Use the zero-page, or reference page if set, for reads */
> >       if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >                       !mm_forbids_zeropage(vma->vm_mm)) {
> > -             entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> > -                                             vma->vm_page_prot));
> > +             unsigned long pfn;
> > +
> > +             if (unlikely(is_refpage_vma(vma)))
> > +                     pfn = page_to_pfn(get_vma_refpage(vma));
> > +             else
> > +                     pfn = my_zero_pfn(vmf->address);
> > +             entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
> >               vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> >                               vmf->address, &vmf->ptl);
> >               if (!pte_none(*vmf->pte)) {
> > @@ -3764,9 +3776,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >       /* Allocate our own private page. */
> >       if (unlikely(anon_vma_prepare(vma)))
> >               goto oom;
> > -     page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
> > -     if (!page)
> > -             goto oom;
> > +
> > +     if (unlikely(is_refpage_vma(vma))) {
> > +             page = alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_NOZERO, vma,
> > +                                   vmf->address);
> > +             if (!page)
> > +                     goto oom;
> > +             arch_copy_refpage(page, vmf->address, vma);
> > +     } else {
> > +             page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
> > +             if (!page)
> > +                     goto oom;
> > +     }
> >
> >       if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
> >               goto oom_free_page;
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 23cbd9de030b..9a897676ff95 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2774,8 +2774,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >       pmd_t *pmdp;
> >       pte_t *ptep;
> >
> > -     /* Only allow populating anonymous memory */
> > -     if (!vma_is_anonymous(vma))
> > +     /* Only allow populating anonymous memory without a reference page */
> > +     if (!vma_is_anonymous(vma) || is_refpage_vma(vma))
> >               goto abort;
> >
> >       pgdp = pgd_offset(mm, addr);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8836e54721ae..6ca831c1821f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1283,7 +1283,7 @@ static void kernel_init_free_pages(struct page *page, int numpages, bool zero_ta
> >
> >       if (zero_tags) {
> >               for (i = 0; i < numpages; i++)
> > -                     tag_clear_highpage(page + i);
> > +                     tag_set_highpage(page + i, 0);
> >               return;
> >       }
> >
> > diff --git a/mm/refpage.c b/mm/refpage.c
> > new file mode 100644
> > index 000000000000..ee95e281d2d4
> > --- /dev/null
> > +++ b/mm/refpage.c
> > @@ -0,0 +1,98 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <linux/anon_inodes.h>
> > +#include <linux/fs_context.h>
> > +#include <linux/highmem.h>
> > +#include <linux/mman.h>
> > +#include <linux/mount.h>
> > +#include <linux/syscalls.h>
> > +
> > +void prep_refpage_private_data(struct refpage_private_data *priv)
> > +{
> > +     u8 *addr = page_address(priv->refpage);
> > +     u8 pattern = addr[0];
> > +     int i;
> > +
> > +     for (i = 1; i != PAGE_SIZE; ++i)
> > +             if (addr[i] != pattern)
> > +                     return;
> > +
> > +     priv->optzn_kind = REFPAGE_OPTZN_PATTERN;
> > +     priv->optzn_info = pattern;
> > +}
> > +
>
> I am hoping that this doesn't remain in its current form, because of
> the API discussions. Probably we'll end up with setting a pattern instead
> of deducing it.

That's right -- now the code will set up the pattern content type only
if the size is 1, so we don't need to explicitly check every byte.
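
A minimal userspace model of that logic, assuming the new API passes a
content pointer plus a size and that the content is repeated to fill the
page (my reading of the API discussion, not something this message
spells out). All names here (refpage_model, refpage_model_init,
refpage_model_copy, MODEL_PAGE_SIZE) are hypothetical, and plain byte
arrays stand in for kernel pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* stand-in for PAGE_SIZE */

/* The repeat-fill below relies on the page size being a power of two. */
static_assert((MODEL_PAGE_SIZE & (MODEL_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

enum { CONTENT_USER_SET = 0, CONTENT_PATTERN = 1 };

struct refpage_model {
	uint8_t refpage[MODEL_PAGE_SIZE];	/* models the reference page */
	uint8_t content_type;
	uint8_t content_info;
};

/*
 * Fill the reference page by repeating `size` bytes of `content`.
 * Assumes size must be a power of two no larger than the page size, so
 * that it divides the page evenly. Returns 0 on success, -1 otherwise.
 */
static int refpage_model_init(struct refpage_model *rp, const void *content,
			      size_t size)
{
	size_t off;

	if (size == 0 || size > MODEL_PAGE_SIZE || (size & (size - 1)) != 0)
		return -1;

	for (off = 0; off < MODEL_PAGE_SIZE; off += size)
		memcpy(rp->refpage + off, content, size);

	if (size == 1) {
		/* Known to be a byte pattern without scanning the page. */
		rp->content_type = CONTENT_PATTERN;
		rp->content_info = *(const uint8_t *)content;
	} else {
		rp->content_type = CONTENT_USER_SET;
		rp->content_info = 0;
	}
	return 0;
}

/* Model of copy_refpage(): memset for a byte pattern, full copy otherwise. */
static void refpage_model_copy(uint8_t *dst, const struct refpage_model *rp)
{
	if (rp->content_type == CONTENT_PATTERN)
		memset(dst, rp->content_info, MODEL_PAGE_SIZE);
	else
		memcpy(dst, rp->refpage, MODEL_PAGE_SIZE);
}
```

The point of the size check is that the PAGE_SIZE-long comparison loop
from v4 goes away: the caller has already said the content is a single
byte.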

> > +void copy_refpage(struct page *page, unsigned long addr,
> > +               struct vm_area_struct *vma)
> > +{
> > +     struct refpage_private_data *priv = vma->vm_private_data;
> > +
> > +     if (priv->optzn_kind == REFPAGE_OPTZN_PATTERN)
> > +             memset(page_address(page), priv->optzn_info, PAGE_SIZE);
> > +     else
> > +             copy_user_highpage(page, priv->refpage, addr, vma);
> > +}
> > +
> > +static void put_refpage_private_data(struct refpage_private_data *priv)
>
> Can you please rename this to free_refpage_private_data()? It's a little more
> accurate.

Yes, I think that free would be a better name. (I never understood the
distinction between free and put in the kernel. Although now that I
think about it, maybe it's to do with whether it's a refcounted object
or not? In that case, free seems like the right term.)

But with the error handling refactoring that you requested below,
there ends up being only a single caller of this function, so I
decided to move the body into the caller, making the naming here moot.

> > +{
> > +     put_page(priv->refpage);
> > +     kfree(priv);
> > +}
> > +
> > +static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +     vma_set_anonymous(vma);
> > +     vma->vm_private_data = vma->vm_file->private_data;
> > +     arch_prep_refpage_vma(vma);
> > +     return 0;
> > +}
> > +
> > +static int refpage_release(struct inode *inode, struct file *file)
> > +{
> > +     put_refpage_private_data(file->private_data);
> > +     return 0;
> > +}
> > +
> > +const struct file_operations refpage_file_operations = {
> > +     .mmap = refpage_mmap,
> > +     .release = refpage_release,
> > +};
> > +
> > +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> > +             flags)
>
> From the API discussion (and using a simpler syntax to illustrate this), it
> seems like the following would be close:
>
> enum content_type {
>         BYTE_PATTERN,
>         FOUR_BYTE_PATTERN,
>         ...
>         FULL_4KB_PAGE
> };
>
> int refpage_create(const void *__user content, enum content_type, unsigned long flags);
>
> ...and if content_type == BYTE_PATTERN, then content is a pointer to just one byte of
> data, and so forth for the other enum values.

As we discussed later on, let's use Matthew's proposed API instead of
making the content type explicit.

> > +{
> > +     unsigned long content_addr = (unsigned long)content;
> > +     struct page *userpage;
> > +     struct refpage_private_data *private_data;
> > +     int fd;
> > +
> > +     if (flags != 0)
> > +             return -EINVAL;
> > +
> > +     if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> > +         get_user_pages(content_addr, 1, 0, &userpage, 0) != 1)
> > +             return -EFAULT;
> > +
> > +     private_data = kzalloc(sizeof(struct refpage_private_data), GFP_KERNEL);
> > +     if (!private_data) {
> > +             put_page(userpage);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     private_data->refpage = alloc_page(GFP_KERNEL);
> > +     if (!private_data->refpage) {
> > +             kfree(private_data);
> > +             put_page(userpage);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     copy_highpage(private_data->refpage, userpage);
> > +     arch_prep_refpage_private_data(private_data);
> > +     put_page(userpage);
> > +
> > +     fd = anon_inode_getfd("[refpage]", &refpage_file_operations,
> > +                           private_data, O_RDONLY | O_CLOEXEC);
> > +     if (fd < 0)
> > +             put_refpage_private_data(private_data);
>
> And here, a couple of things:
>
> 1) I think there's a bug in the fd < 0 case, because you're only freeing
> one of the two pages (there's an alloc_page() call, and a gup call above).

(FWIW, there was no bug here. The page allocated by alloc_page() is
freed by put_refpage_private_data(), and the userpage is freed by the
put_page(userpage).)

> 2) It's jarring to have the error handling done in three different ways:
> returning -EFAULT directly, coding each error case to undo the growing
> set of operations, and finally, jumping out to another routine here for
> fd < 0.
>
> Even for a small routine, that's too error-prone. Instead, one of the
> following will be cleaner and safer too:
>
> a) use goto and labels to unwind, or
>
> b) use a no-fail cleanup routine to unwind
>
> and either way, do it for all cases (or at least all of them after the first
> trivial -EFAULT return).

Done.
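
For readers following along, option (a) looks roughly like the sketch
below. It is a userspace model under stated assumptions, not the code
from the next revision: calloc()/free() stand in for get_user_pages(),
kzalloc() and alloc_page(), model_getfd() stands in for
anon_inode_getfd(), and fail_at is a made-up failure-injection knob so
that each unwind path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Failure injection: make the Nth fallible step fail (1-based); 0 = never. */
static int fail_at;
static int step_count;
static int live_objects;	/* allocations not yet released */

static int step_fails(void)
{
	return ++step_count == fail_at;
}

static void *model_alloc(size_t size)
{
	void *p = step_fails() ? NULL : calloc(1, size);

	if (p)
		live_objects++;
	return p;
}

static void model_free(void *p)
{
	assert(p && live_objects > 0);
	live_objects--;
	free(p);
}

/* Stand-in for anon_inode_getfd(): a made-up fd, or a negative errno. */
static int model_getfd(void)
{
	return step_fails() ? -EMFILE : 3;
}

struct model_private {
	void *refpage;
};

static int model_refpage_create(struct model_private **out)
{
	struct model_private *priv;
	void *userpage;
	int ret;

	userpage = model_alloc(MODEL_PAGE_SIZE);	/* models get_user_pages() */
	if (!userpage)
		return -EFAULT;		/* nothing to undo yet */

	priv = model_alloc(sizeof(*priv));
	if (!priv) {
		ret = -ENOMEM;
		goto out_put_userpage;
	}

	priv->refpage = model_alloc(MODEL_PAGE_SIZE);
	if (!priv->refpage) {
		ret = -ENOMEM;
		goto out_free_priv;
	}
	memcpy(priv->refpage, userpage, MODEL_PAGE_SIZE);

	ret = model_getfd();
	if (ret < 0)
		goto out_free_refpage;

	*out = priv;	/* success: the file now owns priv and its page */
	goto out_put_userpage;

out_free_refpage:
	model_free(priv->refpage);
out_free_priv:
	model_free(priv);
out_put_userpage:
	model_free(userpage);
	return ret;
}
```

Each label undoes exactly one step and falls through to the labels for
the earlier steps, so a new step only needs one new label rather than a
longer cleanup list in every error branch.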

Peter
David Hildenbrand July 19, 2021, 8:47 p.m. UTC | #15
> +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> +		flags)
> +{
> +	unsigned long content_addr = (unsigned long)content;
> +	struct page *userpage;
> +	struct refpage_private_data *private_data;
> +	int fd;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> +	    get_user_pages(content_addr, 1, 0, &userpage, 0) != 1)
> +		return -EFAULT;
> +
> +	private_data = kzalloc(sizeof(struct refpage_private_data), GFP_KERNEL);
> +	if (!private_data) {
> +		put_page(userpage);
> +		return -ENOMEM;
> +	}
> +
> +	private_data->refpage = alloc_page(GFP_KERNEL);
> +	if (!private_data->refpage) {
> +		kfree(private_data);
> +		put_page(userpage);
> +		return -ENOMEM;
> +	}
> +
> +	copy_highpage(private_data->refpage, userpage);
> +	arch_prep_refpage_private_data(private_data);
> +	put_page(userpage);
> +
> +	fd = anon_inode_getfd("[refpage]", &refpage_file_operations,
> +			      private_data, O_RDONLY | O_CLOEXEC);
> +	if (fd < 0)
> +		put_refpage_private_data(private_data);
> +
> +	return fd;
> +}
> 

Hi,

some questions:

1. Is there any upper limit, or can a simple process effectively flood
the system with alloc_page(GFP_KERNEL)? (Does ulimit -n apply, or is it
/proc/sys/fs/file-max?)

2. Shouldn't there be a GFP_ACCOUNT or am I missing something important?

3. How does this interact with MADV_DONTNEED? I assume we'll be able to 
zap the mapped refpage and on refault, we'll map in the refpage again, 
correct?
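
For reference on question 3, the existing behaviour on a plain private
anonymous mapping can be checked from userspace with something like the
sketch below: after MADV_DONTNEED the dirtied page is dropped and the
next access is zero-filled. Presumably a refpage mapping would read back
the reference contents instead of zeroes, but that is an assumption and
is not confirmed anywhere in this thread. read_after_dontneed() is a
hypothetical helper name, and the zero-fill result is Linux-specific.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one private anonymous page, drop it with MADV_DONTNEED, and
 * return the byte read back afterwards (or -1 on error).
 */
static int read_after_dontneed(unsigned char dirty_value)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int value;

	assert(page_size > 0);

	p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = dirty_value;		/* write fault: private page allocated */
	if (madvise(p, (size_t)page_size, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page_size);
		return -1;
	}

	value = p[0];			/* refault after the page was zapped */
	munmap(p, (size_t)page_size);
	return value;
}
```

The same helper pointed at a mapping of a refpage fd would be a simple
way to answer the question empirically once the patch is applied.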
David Hildenbrand July 19, 2021, 8:50 p.m. UTC | #16
On 19.06.21 11:20, Peter Collingbourne wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar
> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> contents are specified using an argument to refpage_create. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.
> 
> Reference pages are useful in circumstances where anonymous mappings
> combined with manual stores to memory would impose undesirable costs,
> either in terms of performance or RSS. Use cases are focused on heap
> allocators and include:
> 
> - Pattern initialization for the heap. This is where malloc(3) gives
>    you memory whose contents are filled with a non-zero pattern
>    byte, in order to help detect and mitigate bugs involving use
>    of uninitialized memory. Typically this is implemented by having
>    the allocator memset the allocation with the pattern byte before
>    returning it to the user, but for large allocations this can result
>    in a significant increase in RSS, especially for allocations that
>    are used sparsely. Even for dense allocations there is a needless
>    impact to startup performance when it may be better to amortize it
>    throughout the program. By creating allocations using a reference
>    page filled with the pattern byte, we can avoid these costs.
> 
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>    feature which allows for memory to be tagged in order to detect
>    certain kinds of memory errors with low overhead. In order to set
>    up an allocation to allow memory errors to be detected, the entire
>    allocation needs to have the same tag. The issue here is similar to
>    pattern initialization in the sense that large tagged allocations
>    will be expensive if the tagging is done up front. The idea is that
>    the allocator would create reference pages with each of the possible
>    memory tags, and use those reference pages for the large allocations.
> 
> This patch includes specific optimizations for these use cases in
> order to reduce memory traffic. If the reference page consists of a
> single repeating byte, the page is initialized using memset, and on
> arm64 if the reference page consists of a uniformly tagged zero page,
> the DC GZVA instruction is used to initialize the page.
> 
> In order to measure the performance and RSS impact of reference pages,
> I used the following microbenchmark program, which is intended to
> compare an implementation of heap pattern initialization that uses
> memset to initialize the pages against an implementation that uses
> reference pages:
> 
>    #include <stdio.h>
>    #include <stdlib.h>
>    #include <string.h>
>    #include <sys/mman.h>
>    #include <unistd.h>
> 
>    constexpr unsigned char pattern_byte = 0xaa;
> 
>    #define PAGE_SIZE 4096
> 
>    _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];
> 
>    int main(int argc, char **argv) {
>      if (argc < 3)
>        return 1;
>      bool use_refpage = argc > 3;
>      size_t mmap_size = atoi(argv[1]);
>      size_t touch_size = atoi(argv[2]);
> 
>      int refpage_fd;
>      if (use_refpage) {
>        memset(pattern, pattern_byte, PAGE_SIZE);
>        refpage_fd = syscall(448, pattern, 0);
>      }
>      for (unsigned i = 0; i != 1000; ++i) {
>        char *p;
>        if (use_refpage) {
>          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
>                           refpage_fd, 0);
>        } else {
>          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
>                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>          memset(p, pattern_byte, mmap_size);
>        }
>        for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
>          p[j] = 0;
>        munmap(p, mmap_size);
>      }
>    }
> 
> On a DragonBoard 845c with the powersave governor, and taking the
> median of 10 runs for each measurement, I measured the following
> results for real time (s):
> 
> touch_size/mmap_size   memset     refpages   improvement (95% CI)
>        4096/4096000    3.962194   0.026726   98.8015% +/- 1.14684%
>     2048000/4096000    3.925309   1.48081    61.8271% +/- 1.11911%
>     4096000/4096000    3.986275   3.385003   15.1205% +/- 0.227235%
> 
> And the following for max RSS (KiB):
> 
> touch_size/mmap_size   memset  refpages      improvement (95% CI)
>        4096/4096000      6656      3448      49.3815% +/- 1.30339%
>     2048000/4096000      6696      4580      31.7053% +/- 1.16411%
>     4096000/4096000      6716      6684              none
> 
> So we see a large improvement for sparsely used allocations, and even
> a modest perf improvement for fully utilized allocations as a result
> of touching the pages one fewer time (with memset: once in the kernel
> and once in userspace; with refpages: just once in the kernel).
> 
> Signed-off-by: Peter Collingbourne <pcc@google.com>
> Link: [1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
> ---
> v4:
> - rebased to linux-next
> - added arch hooks to support MTE tagged reference pages
> - added optimizations for pages with pattern byte as well as uniformly MTE-tagged pages
> - added helper functions to avoid open-coding the reference page detection
> - wrote a microbenchmark program and got new perf results for the commit message
> 
> As an alternative to introducing this syscall, I considered using
> userfaultfd to implement reference pages. However, after having taken
> a detailed look at the interface, it does not seem suitable to be
> used in the context of a general purpose allocator. For example,
> UFFD_FEATURE_EVENT_FORK support would be required in order to correctly
> support fork(2) in a process that uses the allocator (although POSIX
> does not guarantee support for allocating after fork, many allocators
> including Scudo support it, and nothing stops the forked process from
> page faulting pre-existing allocations after forking anyway), but
> UFFD_FEATURE_EVENT_FORK has been restricted to root by commit 3c1c24d91ffd
> ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> making it unsuitable for use in an allocator. Furthermore, even if
> the interface issues are resolved, I suspect (but have not measured)
> that the cost of the multiple context switches between kernel and
> userspace would be too high to be used in an allocator anyway.
> 
>   arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
>   arch/arm/tools/syscall.tbl                  |  1 +
>   arch/arm64/include/asm/mman.h               | 15 ++++
>   arch/arm64/include/asm/mte.h                |  9 +-
>   arch/arm64/include/asm/page.h               |  2 +-
>   arch/arm64/include/asm/unistd.h             |  2 +-
>   arch/arm64/include/asm/unistd32.h           |  2 +
>   arch/arm64/kernel/mte.c                     | 24 +++++
>   arch/arm64/lib/mte.S                        |  7 +-
>   arch/arm64/mm/fault.c                       | 41 ++++++++-
>   arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
>   arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
>   arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>   arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>   arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>   arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
>   arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
>   arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
>   arch/s390/kernel/syscalls/syscall.tbl       |  1 +
>   arch/sh/kernel/syscalls/syscall.tbl         |  1 +
>   arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
>   arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
>   arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
>   arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
>   include/linux/gfp.h                         | 11 ++-
>   include/linux/highmem.h                     |  2 +-
>   include/linux/huge_mm.h                     |  7 ++
>   include/linux/mm.h                          | 39 ++++++++
>   include/linux/mman.h                        | 19 ++++
>   include/linux/syscalls.h                    |  3 +
>   include/uapi/asm-generic/unistd.h           |  5 +-
>   kernel/sys_ni.c                             |  1 +
>   mm/Makefile                                 |  4 +-
>   mm/gup.c                                    |  2 +-
>   mm/kasan/hw_tags.c                          |  2 +-
>   mm/memory.c                                 | 47 +++++++---
>   mm/migrate.c                                |  4 +-
>   mm/page_alloc.c                             |  2 +-
>   mm/refpage.c                                | 98 +++++++++++++++++++++
>   39 files changed, 330 insertions(+), 34 deletions(-)
>   create mode 100644 mm/refpage.c
> 
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index a17687ed4b51..494edc5ca61c 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -486,3 +486,4 @@
>   554	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   555	common	landlock_add_rule		sys_landlock_add_rule
>   556	common	landlock_restrict_self		sys_landlock_restrict_self
> +558	common	refpage_create			sys_refpage_create
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index c5df1179fc5d..8fd7045f46b9 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -460,3 +460,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> index e3e28f7daf62..5c0da3f76ec7 100644
> --- a/arch/arm64/include/asm/mman.h
> +++ b/arch/arm64/include/asm/mman.h
> @@ -84,4 +84,19 @@ static inline bool arch_validate_flags(unsigned long vm_flags)
>   }
>   #define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
>   
> +struct refpage_private_data;
> +
> +void arch_prep_refpage_private_data(struct refpage_private_data *priv);
> +#define arch_prep_refpage_private_data arch_prep_refpage_private_data
> +
> +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> +{
> +	vma->vm_flags |= VM_MTE_ALLOWED;
> +}
> +#define arch_prep_refpage_vma arch_prep_refpage_vma
> +
> +void arch_copy_refpage(struct page *page, unsigned long addr,
> +				     struct vm_area_struct *vma);
> +#define arch_copy_refpage arch_copy_refpage
> +
>   #endif /* ! __ASM_MMAN_H__ */
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 67bf259ae768..b513f83010c7 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
>   /* track which pages have valid allocation tags */
>   #define PG_mte_tagged	PG_arch_2
>   
> -void mte_zero_clear_page_tags(void *addr);
> +void mte_zero_set_page_tags(void *addr);
>   void mte_sync_tags(pte_t *ptep, pte_t pte);
>   void mte_copy_page_tags(void *kto, const void *kfrom);
>   void mte_thread_init_user(void);
> @@ -48,13 +48,14 @@ long set_mte_ctrl(struct task_struct *task, unsigned long arg);
>   long get_mte_ctrl(struct task_struct *task);
>   int mte_ptrace_copy_tags(struct task_struct *child, long request,
>   			 unsigned long addr, unsigned long data);
> +u8 mte_check_tag_zero_page(struct page *userpage);
>   
>   #else /* CONFIG_ARM64_MTE */
>   
>   /* unused if !CONFIG_ARM64_MTE, silence the compiler */
>   #define PG_mte_tagged	0
>   
> -static inline void mte_zero_clear_page_tags(void *addr)
> +static inline void mte_zero_set_page_tags(void *addr)
>   {
>   }
>   static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
> @@ -89,6 +90,10 @@ static inline int mte_ptrace_copy_tags(struct task_struct *child,
>   {
>   	return -EIO;
>   }
> +static inline u8 mte_check_tag_zero_page(struct page *userpage)
> +{
> +	return 0;
> +}
>   
>   #endif /* CONFIG_ARM64_MTE */
>   
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index 993a27ea6f54..234f48688b1a 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -33,7 +33,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>   						unsigned long vaddr);
>   #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
>   
> -void tag_clear_highpage(struct page *to);
> +void tag_set_highpage(struct page *to, unsigned long tag);
>   #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
>   
>   #define clear_user_page(page, vaddr, pg)	clear_page(page)
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 727bfc3be99b..3cb206aea3db 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -38,7 +38,7 @@
>   #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>   #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>   
> -#define __NR_compat_syscalls		447
> +#define __NR_compat_syscalls		449
>   #endif
>   
>   #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 99ffcafc736c..2a116aa17fe7 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -901,6 +901,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
>   __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
>   #define __NR_landlock_restrict_self 446
>   __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> +#define __NR_refpage_create 448
> +__SYSCALL(__NR_refpage_create, sys_refpage_create)
>   
>   /*
>    * Please add new compat syscalls above this comment and update
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index 125a10e413e9..6a79240d5a77 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -453,3 +453,27 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
>   
>   	return ret;
>   }
> +
> +u8 mte_check_tag_zero_page(struct page *userpage)
> +{
> +	char *userpage_addr = page_address(userpage);
> +	u8 tag;
> +	int i;
> +
> +	if (!test_bit(PG_mte_tagged, &userpage->flags))
> +		return 0;
> +
> +	tag = mte_get_mem_tag(userpage_addr) & 0xF;
> +	if (tag == 0)
> +		return 0;
> +
> +	for (i = 0; i != PAGE_SIZE; ++i)
> +		if (userpage_addr[i] != 0)
> +			return 0;
> +
> +	for (i = MTE_GRANULE_SIZE; i != PAGE_SIZE; i += MTE_GRANULE_SIZE)
> +		if ((mte_get_mem_tag(userpage_addr + i) & 0xF) != tag)
> +			return 0;
> +
> +	return tag;
> +}
> diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
> index e83643b3995f..45be436c97af 100644
> --- a/arch/arm64/lib/mte.S
> +++ b/arch/arm64/lib/mte.S
> @@ -37,24 +37,23 @@ SYM_FUNC_START(mte_clear_page_tags)
>   SYM_FUNC_END(mte_clear_page_tags)
>   
>   /*
> - * Zero the page and tags at the same time
> + * Zero the page and set tags at the same time
>    *
>    * Parameters:
>    *	x0 - address to the beginning of the page
>    */
> -SYM_FUNC_START(mte_zero_clear_page_tags)
> +SYM_FUNC_START(mte_zero_set_page_tags)
>   	mrs	x1, dczid_el0
>   	and	w1, w1, #0xf
>   	mov	x2, #4
>   	lsl	x1, x2, x1
> -	and	x0, x0, #(1 << MTE_TAG_SHIFT) - 1	// clear the tag
>   
>   1:	dc	gzva, x0
>   	add	x0, x0, x1
>   	tst	x0, #(PAGE_SIZE - 1)
>   	b.ne	1b
>   	ret
> -SYM_FUNC_END(mte_zero_clear_page_tags)
> +SYM_FUNC_END(mte_zero_set_page_tags)
>   
>   /*
>    * Copy the tags from the source page to the destination one
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 349c488765ca..36355758ffc7 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -25,6 +25,7 @@
>   #include <linux/perf_event.h>
>   #include <linux/preempt.h>
>   #include <linux/hugetlb.h>
> +#include <linux/mman.h>
>   
>   #include <asm/acpi.h>
>   #include <asm/bug.h>
> @@ -939,9 +940,45 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
>   	return alloc_page_vma(flags, vma, vaddr);
>   }
>   
> -void tag_clear_highpage(struct page *page)
> +void tag_set_highpage(struct page *page, unsigned long tag)
>   {
> -	mte_zero_clear_page_tags(page_address(page));
> +	unsigned long addr = (unsigned long)page_address(page);
> +
> +	addr &= ~MTE_TAG_MASK;
> +	addr |= tag << MTE_TAG_SHIFT;
> +	mte_zero_set_page_tags((void *)addr);
>   	page_kasan_tag_reset(page);
>   	set_bit(PG_mte_tagged, &page->flags);
>   }
> +
> +#define REFPAGE_OPTZN_MTE_TAGGED REFPAGE_OPTZN_ARCH
> +
> +void arch_prep_refpage_private_data(struct refpage_private_data *priv)
> +{
> +	if (system_supports_mte()) {
> +		u8 tag;
> +
> +		if (!test_and_set_bit(PG_mte_tagged, &priv->refpage->flags))
> +			mte_clear_page_tags(page_address(priv->refpage));
> +
> +		tag = mte_check_tag_zero_page(priv->refpage);
> +		if (tag) {
> +			priv->optzn_kind = REFPAGE_OPTZN_MTE_TAGGED;
> +			priv->optzn_info = tag;
> +			return;
> +		}
> +	}
> +
> +	prep_refpage_private_data(priv);
> +}
> +
> +void arch_copy_refpage(struct page *page, unsigned long addr,
> +		       struct vm_area_struct *vma)
> +{
> +	struct refpage_private_data *priv = vma->vm_private_data;
> +
> +	if (priv->optzn_kind == REFPAGE_OPTZN_MTE_TAGGED)
> +		tag_set_highpage(page, priv->optzn_info);
> +	else
> +		copy_refpage(page, addr, vma);
> +}
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index 6d07742c57b8..c2209d83f3c3 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -367,3 +367,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index 541bc1b3a8f9..0360cf474a49 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -446,3 +446,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index a176faca2927..de85d758e564 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -452,3 +452,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index c2d2e19abea8..b07c7293d2a3 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -385,3 +385,4 @@
>   444	n32	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	n32	landlock_add_rule		sys_landlock_add_rule
>   446	n32	landlock_restrict_self		sys_landlock_restrict_self
> +448	n32	refpage_create			sys_refpage_create
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index ac653d08b1ea..7ebabb99dd06 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -361,3 +361,4 @@
>   444	n64	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	n64	landlock_add_rule		sys_landlock_add_rule
>   446	n64	landlock_restrict_self		sys_landlock_restrict_self
> +448	n64	refpage_create			sys_refpage_create
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index 253f2cd70b6b..a51149ac101c 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -434,3 +434,4 @@
>   444	o32	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	o32	landlock_add_rule		sys_landlock_add_rule
>   446	o32	landlock_restrict_self		sys_landlock_restrict_self
> +448	o32	refpage_create			sys_refpage_create
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index e26187b9ab87..385565864861 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -444,3 +444,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index aef2a290e71a..95cdd9f7dc06 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -526,3 +526,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 64d51ab5a8b4..92ed1260ffd9 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>   444  common	landlock_create_ruleset	sys_landlock_create_ruleset	sys_landlock_create_ruleset
>   445  common	landlock_add_rule	sys_landlock_add_rule		sys_landlock_add_rule
>   446  common	landlock_restrict_self	sys_landlock_restrict_self	sys_landlock_restrict_self
> +448  common	refpage_create		sys_refpage_create		sys_refpage_create
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index e0a70be77d84..f9d198cc2541 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -449,3 +449,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index 603f5a821502..83533aa49340 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -492,3 +492,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index ce763a12311c..054c69e395b5 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -452,3 +452,4 @@
>   445	i386	landlock_add_rule	sys_landlock_add_rule
>   446	i386	landlock_restrict_self	sys_landlock_restrict_self
>   447	i386	memfd_secret		sys_memfd_secret
> +448	i386	refpage_create		sys_refpage_create
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index f6b57799c1ea..1f24f0b66cbd 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -369,6 +369,7 @@
>   445	common	landlock_add_rule	sys_landlock_add_rule
>   446	common	landlock_restrict_self	sys_landlock_restrict_self
>   447	common	memfd_secret		sys_memfd_secret
> +448	common	refpage_create		sys_refpage_create
>   
>   #
>   # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 235d67d6ceb4..96c27fb404ca 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -417,3 +417,4 @@
>   444	common	landlock_create_ruleset		sys_landlock_create_ruleset
>   445	common	landlock_add_rule		sys_landlock_add_rule
>   446	common	landlock_restrict_self		sys_landlock_restrict_self
> +448	common	refpage_create			sys_refpage_create
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..ae3c763eb9e9 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -55,8 +55,9 @@ struct vm_area_struct;
>   #define ___GFP_ACCOUNT		0x400000u
>   #define ___GFP_ZEROTAGS		0x800000u
>   #define ___GFP_SKIP_KASAN_POISON	0x1000000u
> +#define ___GFP_NOZERO		0x2000000u

Oh god, please no. We've discussed this a couple of times already: 
whatever leaves the page allocator shall be zeroed. No exceptions, not 
even for other allocators (like hugetlb).

Introducing something like that to the whole system, including random 
drivers, destroys the whole purpose of the security feature. Please 
don't bury something so controversial in your patch.

Peter Collingbourne July 19, 2021, 10:26 p.m. UTC | #17
On Mon, Jul 19, 2021 at 1:50 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.06.21 11:20, Peter Collingbourne wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
> >
> > Reference pages are useful in circumstances where anonymous mappings
> > combined with manual stores to memory would impose undesirable costs,
> > either in terms of performance or RSS. Use cases are focused on heap
> > allocators and include:
> >
> > - Pattern initialization for the heap. This is where malloc(3) gives
> >    you memory whose contents are filled with a non-zero pattern
> >    byte, in order to help detect and mitigate bugs involving use
> >    of uninitialized memory. Typically this is implemented by having
> >    the allocator memset the allocation with the pattern byte before
> >    returning it to the user, but for large allocations this can result
> >    in a significant increase in RSS, especially for allocations that
> >    are used sparsely. Even for dense allocations there is a needless
> >    impact to startup performance when it may be better to amortize it
> >    throughout the program. By creating allocations using a reference
> >    page filled with the pattern byte, we can avoid these costs.
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> >    feature which allows for memory to be tagged in order to detect
> >    certain kinds of memory errors with low overhead. In order to set
> >    up an allocation to allow memory errors to be detected, the entire
> >    allocation needs to have the same tag. The issue here is similar to
> >    pattern initialization in the sense that large tagged allocations
> >    will be expensive if the tagging is done up front. The idea is that
> >    the allocator would create reference pages with each of the possible
> >    memory tags, and use those reference pages for the large allocations.
> >
> > This patch includes specific optimizations for these use cases in
> > order to reduce memory traffic. If the reference page consists of a
> > single repeating byte, the page is initialized using memset, and on
> > arm64 if the reference page consists of a uniformly tagged zero page,
> > the DC GZVA instruction is used to initialize the page.
> >
> > In order to measure the performance and RSS impact of reference pages,
> > I used the following microbenchmark program, which is intended to
> > compare an implementation of heap pattern initialization that uses
> > memset to initialize the pages against an implementation that uses
> > reference pages:
> >
> >    #include <stdio.h>
> >    #include <stdlib.h>
> >    #include <string.h>
> >    #include <sys/mman.h>
> >    #include <unistd.h>
> >
> >    constexpr unsigned char pattern_byte = 0xaa;
> >
> >    #define PAGE_SIZE 4096
> >
> >    _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];
> >
> >    int main(int argc, char **argv) {
> >      if (argc < 3)
> >        return 1;
> >      bool use_refpage = argc > 3;
> >      size_t mmap_size = atoi(argv[1]);
> >      size_t touch_size = atoi(argv[2]);
> >
> >      int refpage_fd;
> >      if (use_refpage) {
> >        memset(pattern, pattern_byte, PAGE_SIZE);
> >        refpage_fd = syscall(448, pattern, 0);
> >      }
> >      for (unsigned i = 0; i != 1000; ++i) {
> >        char *p;
> >        if (use_refpage) {
> >          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
> >                           refpage_fd, 0);
> >        } else {
> >          p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
> >                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >          memset(p, pattern_byte, mmap_size);
> >        }
> >        for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
> >          p[j] = 0;
> >        munmap(p, mmap_size);
> >      }
> >    }
> >
> > On a DragonBoard 845c with the powersave governor, and taking the
> > median of 10 runs for each measurement, I measured the following
> > results for real time (s):
> >
> > touch_size/mmap_size   memset     refpages   improvement (95% CI)
> >        4096/4096000    3.962194   0.026726   98.8015% +/- 1.14684%
> >     2048000/4096000    3.925309   1.48081    61.8271% +/- 1.11911%
> >     4096000/4096000    3.986275   3.385003   15.1205% +/- 0.227235%
> >
> > And the following for max RSS (KiB):
> >
> > touch_size/mmap_size   memset  refpages      improvement (95% CI)
> >        4096/4096000      6656      3448      49.3815% +/- 1.30339%
> >     2048000/4096000      6696      4580      31.7053% +/- 1.16411%
> >     4096000/4096000      6716      6684              none
> >
> > So we see a large improvement for sparsely used allocations, and even
> > a modest perf improvement for fully utilized allocations as a result
> > of touching the pages one fewer time (with memset: once in the kernel
> > and once in userspace; with refpages: just once in the kernel).
> >
> > Signed-off-by: Peter Collingbourne <pcc@google.com>
> > Link: [1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
> > ---
> > v4:
> > - rebased to linux-next
> > - added arch hooks to support MTE tagged reference pages
> > - added optimizations for pages with pattern byte as well as uniformly MTE-tagged pages
> > - added helper functions to avoid open-coding the reference page detection
> > - wrote a microbenchmark program and got new perf results for the commit message
> >
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_EVENT_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_EVENT_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator. Furthermore, even if
> > the interface issues are resolved, I suspect (but have not measured)
> > that the cost of the multiple context switches between kernel and
> > userspace would be too high to be used in an allocator anyway.
> >
> >   arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
> >   arch/arm/tools/syscall.tbl                  |  1 +
> >   arch/arm64/include/asm/mman.h               | 15 ++++
> >   arch/arm64/include/asm/mte.h                |  9 +-
> >   arch/arm64/include/asm/page.h               |  2 +-
> >   arch/arm64/include/asm/unistd.h             |  2 +-
> >   arch/arm64/include/asm/unistd32.h           |  2 +
> >   arch/arm64/kernel/mte.c                     | 24 +++++
> >   arch/arm64/lib/mte.S                        |  7 +-
> >   arch/arm64/mm/fault.c                       | 41 ++++++++-
> >   arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
> >   arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
> >   arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
> >   arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
> >   arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
> >   arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
> >   arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
> >   arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
> >   arch/s390/kernel/syscalls/syscall.tbl       |  1 +
> >   arch/sh/kernel/syscalls/syscall.tbl         |  1 +
> >   arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
> >   arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
> >   arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
> >   arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
> >   include/linux/gfp.h                         | 11 ++-
> >   include/linux/highmem.h                     |  2 +-
> >   include/linux/huge_mm.h                     |  7 ++
> >   include/linux/mm.h                          | 39 ++++++++
> >   include/linux/mman.h                        | 19 ++++
> >   include/linux/syscalls.h                    |  3 +
> >   include/uapi/asm-generic/unistd.h           |  5 +-
> >   kernel/sys_ni.c                             |  1 +
> >   mm/Makefile                                 |  4 +-
> >   mm/gup.c                                    |  2 +-
> >   mm/kasan/hw_tags.c                          |  2 +-
> >   mm/memory.c                                 | 47 +++++++---
> >   mm/migrate.c                                |  4 +-
> >   mm/page_alloc.c                             |  2 +-
> >   mm/refpage.c                                | 98 +++++++++++++++++++++
> >   39 files changed, 330 insertions(+), 34 deletions(-)
> >   create mode 100644 mm/refpage.c
> >
> > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> > index a17687ed4b51..494edc5ca61c 100644
> > --- a/arch/alpha/kernel/syscalls/syscall.tbl
> > +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> > @@ -486,3 +486,4 @@
> >   554 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   555 common  landlock_add_rule               sys_landlock_add_rule
> >   556 common  landlock_restrict_self          sys_landlock_restrict_self
> > +558  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> > index c5df1179fc5d..8fd7045f46b9 100644
> > --- a/arch/arm/tools/syscall.tbl
> > +++ b/arch/arm/tools/syscall.tbl
> > @@ -460,3 +460,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> > index e3e28f7daf62..5c0da3f76ec7 100644
> > --- a/arch/arm64/include/asm/mman.h
> > +++ b/arch/arm64/include/asm/mman.h
> > @@ -84,4 +84,19 @@ static inline bool arch_validate_flags(unsigned long vm_flags)
> >   }
> >   #define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
> >
> > +struct refpage_private_data;
> > +
> > +void arch_prep_refpage_private_data(struct refpage_private_data *priv);
> > +#define arch_prep_refpage_private_data arch_prep_refpage_private_data
> > +
> > +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> > +{
> > +     vma->vm_flags |= VM_MTE_ALLOWED;
> > +}
> > +#define arch_prep_refpage_vma arch_prep_refpage_vma
> > +
> > +void arch_copy_refpage(struct page *page, unsigned long addr,
> > +                                  struct vm_area_struct *vma);
> > +#define arch_copy_refpage arch_copy_refpage
> > +
> >   #endif /* ! __ASM_MMAN_H__ */
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 67bf259ae768..b513f83010c7 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
> >   /* track which pages have valid allocation tags */
> >   #define PG_mte_tagged       PG_arch_2
> >
> > -void mte_zero_clear_page_tags(void *addr);
> > +void mte_zero_set_page_tags(void *addr);
> >   void mte_sync_tags(pte_t *ptep, pte_t pte);
> >   void mte_copy_page_tags(void *kto, const void *kfrom);
> >   void mte_thread_init_user(void);
> > @@ -48,13 +48,14 @@ long set_mte_ctrl(struct task_struct *task, unsigned long arg);
> >   long get_mte_ctrl(struct task_struct *task);
> >   int mte_ptrace_copy_tags(struct task_struct *child, long request,
> >                        unsigned long addr, unsigned long data);
> > +u8 mte_check_tag_zero_page(struct page *userpage);
> >
> >   #else /* CONFIG_ARM64_MTE */
> >
> >   /* unused if !CONFIG_ARM64_MTE, silence the compiler */
> >   #define PG_mte_tagged       0
> >
> > -static inline void mte_zero_clear_page_tags(void *addr)
> > +static inline void mte_zero_set_page_tags(void *addr)
> >   {
> >   }
> >   static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
> > @@ -89,6 +90,10 @@ static inline int mte_ptrace_copy_tags(struct task_struct *child,
> >   {
> >       return -EIO;
> >   }
> > +static inline u8 mte_check_tag_zero_page(struct page *userpage)
> > +{
> > +     return 0;
> > +}
> >
> >   #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index 993a27ea6f54..234f48688b1a 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -33,7 +33,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
> >                                               unsigned long vaddr);
> >   #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
> >
> > -void tag_clear_highpage(struct page *to);
> > +void tag_set_highpage(struct page *to, unsigned long tag);
> >   #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
> >
> >   #define clear_user_page(page, vaddr, pg)    clear_page(page)
> > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> > index 727bfc3be99b..3cb206aea3db 100644
> > --- a/arch/arm64/include/asm/unistd.h
> > +++ b/arch/arm64/include/asm/unistd.h
> > @@ -38,7 +38,7 @@
> >   #define __ARM_NR_compat_set_tls             (__ARM_NR_COMPAT_BASE + 5)
> >   #define __ARM_NR_COMPAT_END         (__ARM_NR_COMPAT_BASE + 0x800)
> >
> > -#define __NR_compat_syscalls         447
> > +#define __NR_compat_syscalls         449
> >   #endif
> >
> >   #define __ARCH_WANT_SYS_CLONE
> > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> > index 99ffcafc736c..2a116aa17fe7 100644
> > --- a/arch/arm64/include/asm/unistd32.h
> > +++ b/arch/arm64/include/asm/unistd32.h
> > @@ -901,6 +901,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> >   __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> >   #define __NR_landlock_restrict_self 446
> >   __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > +#define __NR_refpage_create 448
> > +__SYSCALL(__NR_refpage_create, sys_refpage_create)
> >
> >   /*
> >    * Please add new compat syscalls above this comment and update
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index 125a10e413e9..6a79240d5a77 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -453,3 +453,27 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
> >
> >       return ret;
> >   }
> > +
> > +u8 mte_check_tag_zero_page(struct page *userpage)
> > +{
> > +     char *userpage_addr = page_address(userpage);
> > +     u8 tag;
> > +     int i;
> > +
> > +     if (!test_bit(PG_mte_tagged, &userpage->flags))
> > +             return 0;
> > +
> > +     tag = mte_get_mem_tag(userpage_addr) & 0xF;
> > +     if (tag == 0)
> > +             return 0;
> > +
> > +     for (i = 0; i != PAGE_SIZE; ++i)
> > +             if (userpage_addr[i] != 0)
> > +                     return 0;
> > +
> > +     for (i = MTE_GRANULE_SIZE; i != PAGE_SIZE; i += MTE_GRANULE_SIZE)
> > +             if ((mte_get_mem_tag(userpage_addr + i) & 0xF) != tag)
> > +                     return 0;
> > +
> > +     return tag;
> > +}
> > diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
> > index e83643b3995f..45be436c97af 100644
> > --- a/arch/arm64/lib/mte.S
> > +++ b/arch/arm64/lib/mte.S
> > @@ -37,24 +37,23 @@ SYM_FUNC_START(mte_clear_page_tags)
> >   SYM_FUNC_END(mte_clear_page_tags)
> >
> >   /*
> > - * Zero the page and tags at the same time
> > + * Zero the page and set tags at the same time
> >    *
> >    * Parameters:
> >    *  x0 - address to the beginning of the page
> >    */
> > -SYM_FUNC_START(mte_zero_clear_page_tags)
> > +SYM_FUNC_START(mte_zero_set_page_tags)
> >       mrs     x1, dczid_el0
> >       and     w1, w1, #0xf
> >       mov     x2, #4
> >       lsl     x1, x2, x1
> > -     and     x0, x0, #(1 << MTE_TAG_SHIFT) - 1       // clear the tag
> >
> >   1:  dc      gzva, x0
> >       add     x0, x0, x1
> >       tst     x0, #(PAGE_SIZE - 1)
> >       b.ne    1b
> >       ret
> > -SYM_FUNC_END(mte_zero_clear_page_tags)
> > +SYM_FUNC_END(mte_zero_set_page_tags)
> >
> >   /*
> >    * Copy the tags from the source page to the destination one
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 349c488765ca..36355758ffc7 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -25,6 +25,7 @@
> >   #include <linux/perf_event.h>
> >   #include <linux/preempt.h>
> >   #include <linux/hugetlb.h>
> > +#include <linux/mman.h>
> >
> >   #include <asm/acpi.h>
> >   #include <asm/bug.h>
> > @@ -939,9 +940,45 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
> >       return alloc_page_vma(flags, vma, vaddr);
> >   }
> >
> > -void tag_clear_highpage(struct page *page)
> > +void tag_set_highpage(struct page *page, unsigned long tag)
> >   {
> > -     mte_zero_clear_page_tags(page_address(page));
> > +     unsigned long addr = (unsigned long)page_address(page);
> > +
> > +     addr &= ~MTE_TAG_MASK;
> > +     addr |= tag << MTE_TAG_SHIFT;
> > +     mte_zero_set_page_tags((void *)addr);
> >       page_kasan_tag_reset(page);
> >       set_bit(PG_mte_tagged, &page->flags);
> >   }
> > +
> > +#define REFPAGE_OPTZN_MTE_TAGGED REFPAGE_OPTZN_ARCH
> > +
> > +void arch_prep_refpage_private_data(struct refpage_private_data *priv)
> > +{
> > +     if (system_supports_mte()) {
> > +             u8 tag;
> > +
> > +             if (!test_and_set_bit(PG_mte_tagged, &priv->refpage->flags))
> > +                     mte_clear_page_tags(page_address(priv->refpage));
> > +
> > +             tag = mte_check_tag_zero_page(priv->refpage);
> > +             if (tag) {
> > +                     priv->optzn_kind = REFPAGE_OPTZN_MTE_TAGGED;
> > +                     priv->optzn_info = tag;
> > +                     return;
> > +             }
> > +     }
> > +
> > +     prep_refpage_private_data(priv);
> > +}
> > +
> > +void arch_copy_refpage(struct page *page, unsigned long addr,
> > +                    struct vm_area_struct *vma)
> > +{
> > +     struct refpage_private_data *priv = vma->vm_private_data;
> > +
> > +     if (priv->optzn_kind == REFPAGE_OPTZN_MTE_TAGGED)
> > +             tag_set_highpage(page, priv->optzn_info);
> > +     else
> > +             copy_refpage(page, addr, vma);
> > +}
> > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> > index 6d07742c57b8..c2209d83f3c3 100644
> > --- a/arch/ia64/kernel/syscalls/syscall.tbl
> > +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> > @@ -367,3 +367,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> > index 541bc1b3a8f9..0360cf474a49 100644
> > --- a/arch/m68k/kernel/syscalls/syscall.tbl
> > +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> > @@ -446,3 +446,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> > index a176faca2927..de85d758e564 100644
> > --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> > @@ -452,3 +452,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > index c2d2e19abea8..b07c7293d2a3 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > @@ -385,3 +385,4 @@
> >   444 n32     landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 n32     landlock_add_rule               sys_landlock_add_rule
> >   446 n32     landlock_restrict_self          sys_landlock_restrict_self
> > +448  n32     refpage_create                  sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > index ac653d08b1ea..7ebabb99dd06 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > @@ -361,3 +361,4 @@
> >   444 n64     landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 n64     landlock_add_rule               sys_landlock_add_rule
> >   446 n64     landlock_restrict_self          sys_landlock_restrict_self
> > +448  n64     refpage_create                  sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > index 253f2cd70b6b..a51149ac101c 100644
> > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > @@ -434,3 +434,4 @@
> >   444 o32     landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 o32     landlock_add_rule               sys_landlock_add_rule
> >   446 o32     landlock_restrict_self          sys_landlock_restrict_self
> > +448  o32     refpage_create                  sys_refpage_create
> > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> > index e26187b9ab87..385565864861 100644
> > --- a/arch/parisc/kernel/syscalls/syscall.tbl
> > +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> > @@ -444,3 +444,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> > index aef2a290e71a..95cdd9f7dc06 100644
> > --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> > @@ -526,3 +526,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> > index 64d51ab5a8b4..92ed1260ffd9 100644
> > --- a/arch/s390/kernel/syscalls/syscall.tbl
> > +++ b/arch/s390/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> >   444  common landlock_create_ruleset sys_landlock_create_ruleset     sys_landlock_create_ruleset
> >   445  common landlock_add_rule       sys_landlock_add_rule           sys_landlock_add_rule
> >   446  common landlock_restrict_self  sys_landlock_restrict_self      sys_landlock_restrict_self
> > +448  common  refpage_create          sys_refpage_create              sys_refpage_create
> > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> > index e0a70be77d84..f9d198cc2541 100644
> > --- a/arch/sh/kernel/syscalls/syscall.tbl
> > +++ b/arch/sh/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> > index 603f5a821502..83533aa49340 100644
> > --- a/arch/sparc/kernel/syscalls/syscall.tbl
> > +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> > @@ -492,3 +492,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index ce763a12311c..054c69e395b5 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -452,3 +452,4 @@
> >   445 i386    landlock_add_rule       sys_landlock_add_rule
> >   446 i386    landlock_restrict_self  sys_landlock_restrict_self
> >   447 i386    memfd_secret            sys_memfd_secret
> > +448  i386    refpage_create          sys_refpage_create
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index f6b57799c1ea..1f24f0b66cbd 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -369,6 +369,7 @@
> >   445 common  landlock_add_rule       sys_landlock_add_rule
> >   446 common  landlock_restrict_self  sys_landlock_restrict_self
> >   447 common  memfd_secret            sys_memfd_secret
> > +448  common  refpage_create          sys_refpage_create
> >
> >   #
> >   # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> > index 235d67d6ceb4..96c27fb404ca 100644
> > --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> > @@ -417,3 +417,4 @@
> >   444 common  landlock_create_ruleset         sys_landlock_create_ruleset
> >   445 common  landlock_add_rule               sys_landlock_add_rule
> >   446 common  landlock_restrict_self          sys_landlock_restrict_self
> > +448  common  refpage_create                  sys_refpage_create
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 55b2ec1f965a..ae3c763eb9e9 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -55,8 +55,9 @@ struct vm_area_struct;
> >   #define ___GFP_ACCOUNT              0x400000u
> >   #define ___GFP_ZEROTAGS             0x800000u
> >   #define ___GFP_SKIP_KASAN_POISON    0x1000000u
> > +#define ___GFP_NOZERO                0x2000000u
>
> Oh god, please no. We've discussed this a couple of times already:
> whatever leaves the page allocator shall be zeroed. No exceptions, not
> even for other allocators (like hugetlb).
>
> Introducing something like that to the whole system, including random
> drivers, destroys the whole purpose of the security feature. Please
> don't bury something so controversial in your patch.

Got it -- I was unaware that this was controversial.

Avoiding the double initialization does help a bit in benchmarks, at
least for the fully faulted case. The alternative approach that I was
thinking of was to somehow plumb the required pattern into the page
allocator (which would maintain the invariant that the pages are
initialized, but not necessarily with zeroes), but this would require
touching several layers of the allocator.  I suppose that this doesn't
need to be done immediately though -- we can deal with the double
initialization for now and avoid it somehow in a followup.
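
To make the cost concrete, here is a minimal userspace sketch of the two
paths, assuming 4 KiB pages. Every name in it (sketch_page,
sketch_fault_double_init, sketch_fault_pattern_init) is hypothetical and
exists only for illustration; it is not code from this patch or from the
page allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical userspace model of a page; not a kernel structure. */
struct sketch_page {
	unsigned char data[SKETCH_PAGE_SIZE];
	size_t bytes_written;	/* bytes stored while initializing the page */
};

/* Models an allocator that always hands out zeroed pages. */
static void sketch_alloc_zeroed(struct sketch_page *page)
{
	memset(page->data, 0, SKETCH_PAGE_SIZE);
	page->bytes_written = SKETCH_PAGE_SIZE;
}

/* Double initialization: zero in the allocator, then copy the reference page. */
static void sketch_fault_double_init(struct sketch_page *page,
				     const unsigned char *refpage)
{
	sketch_alloc_zeroed(page);
	memcpy(page->data, refpage, SKETCH_PAGE_SIZE);
	page->bytes_written += SKETCH_PAGE_SIZE;
}

/* Single initialization: the allocator writes the required contents directly. */
static void sketch_fault_pattern_init(struct sketch_page *page,
				      const unsigned char *refpage)
{
	memcpy(page->data, refpage, SKETCH_PAGE_SIZE);
	page->bytes_written = SKETCH_PAGE_SIZE;
}
```

Both paths leave the page with identical contents and never expose
uninitialized memory; the first simply stores to every byte twice, which
is the overhead that shows up in the fully faulted case.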

Peter
John Hubbard July 19, 2021, 10:30 p.m. UTC | #18
On 7/19/21 3:26 PM, Peter Collingbourne wrote:
...
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index 55b2ec1f965a..ae3c763eb9e9 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -55,8 +55,9 @@ struct vm_area_struct;
>>>    #define ___GFP_ACCOUNT              0x400000u
>>>    #define ___GFP_ZEROTAGS             0x800000u
>>>    #define ___GFP_SKIP_KASAN_POISON    0x1000000u
>>> +#define ___GFP_NOZERO                0x2000000u
>>
>> Oh god, please no. We've discussed this a couple of times already:
>> whatever leaves the page allocator shall be zeroed. No exceptions, not
>> even for other allocators (like hugetlb).
>>
>> Introducing something like that to the whole system, including random
>> drivers, destroys the whole purpose of the security feature. Please
>> don't bury something so controversial in your patch.
> 
> Got it -- I was unaware that this was controversial.
> 
> Avoiding the double initialization does help a bit in benchmarks, at
> least for the fully faulted case. The alternative approach that I was
> thinking of was to somehow plumb the required pattern into the page
> allocator (which would maintain the invariant that the pages are
> initialized, but not necessarily with zeroes), but this would require
> touching several layers of the allocator.  I suppose that this doesn't

That sounds right.

> need to be done immediately though -- we can deal with the double
> initialization for now and avoid it somehow in a followup.
> 

Actually, I'd encourage going straight to the final result, in this
case. It's good to see what we are going to end up with, and figure
out if it's worth the trade-offs.


thanks,
David Hildenbrand July 20, 2021, 7:28 a.m. UTC | #19
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -55,8 +55,9 @@ struct vm_area_struct;
>>>    #define ___GFP_ACCOUNT              0x400000u
>>>    #define ___GFP_ZEROTAGS             0x800000u
>>>    #define ___GFP_SKIP_KASAN_POISON    0x1000000u
>>> +#define ___GFP_NOZERO                0x2000000u
>>
>> Oh god, please no. We've discussed this a couple of times already:
>> whatever leaves the page allocator shall be zeroed. No exceptions, not
>> even for other allocators (like hugetlb).
>>
>> Introducing something like that to the whole system, including random
>> drivers, destroys the whole purpose of the security feature. Please
>> don't bury something so controversial in your patch.
> 
> Got it -- I was unaware that this was controversial.
> 
> Avoiding the double initialization does help a bit in benchmarks, at
> least for the fully faulted case. The alternative approach that I was
> thinking of was to somehow plumb the required pattern into the page
> allocator (which would maintain the invariant that the pages are
> initialized, but not necessarily with zeroes), but this would require
> touching several layers of the allocator.  I suppose that this doesn't
> need to be done immediately though -- we can deal with the double
> initialization for now and avoid it somehow in a followup.

As it's a clear optimization, it should definitely be separated from 
this "introduction" patch.

I agree that the logic for optimizing initialization of a page in this 
case should best reside in page_alloc.c, for example providing a helper 
for optimizing the initialization in there. We already pass alloc_flags 
internally, maybe we can then reuse that to minimize changes.
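
A rough userspace sketch of what such a helper could look like, purely as
an illustration: SKETCH_ALLOC_INIT_PATTERN and sketch_init_page are
made-up names, not existing alloc_flags values or page_alloc.c functions,
and the page is modeled as a plain byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Hypothetical allocator-internal flag; not a real ALLOC_* value. */
#define SKETCH_ALLOC_INIT_PATTERN 0x1u

/*
 * Initialize a freshly allocated page exactly once.
 *
 * There is deliberately no "skip initialization" branch: a caller can ask
 * for a pattern instead of zeroes, but a page never leaves this helper
 * with stale contents. A missing pattern falls back to zeroing.
 */
static void sketch_init_page(unsigned char *page, unsigned int alloc_flags,
			     const unsigned char *pattern)
{
	if ((alloc_flags & SKETCH_ALLOC_INIT_PATTERN) && pattern != NULL)
		memcpy(page, pattern, SKETCH_PAGE_SIZE);
	else
		memset(page, 0, SKETCH_PAGE_SIZE);
}
```

Because the flag would stay internal to the allocator in this sketch,
drivers and other external callers would have no way to request an
uninitialized page, which is the property the ___GFP_NOZERO objection is
about.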

Patch

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index a17687ed4b51..494edc5ca61c 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -486,3 +486,4 @@ 
 554	common	landlock_create_ruleset		sys_landlock_create_ruleset
 555	common	landlock_add_rule		sys_landlock_add_rule
 556	common	landlock_restrict_self		sys_landlock_restrict_self
+558	common	refpage_create			sys_refpage_create
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index c5df1179fc5d..8fd7045f46b9 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -460,3 +460,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index e3e28f7daf62..5c0da3f76ec7 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -84,4 +84,19 @@  static inline bool arch_validate_flags(unsigned long vm_flags)
 }
 #define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
 
+struct refpage_private_data;
+
+void arch_prep_refpage_private_data(struct refpage_private_data *priv);
+#define arch_prep_refpage_private_data arch_prep_refpage_private_data
+
+static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
+{
+	vma->vm_flags |= VM_MTE_ALLOWED;
+}
+#define arch_prep_refpage_vma arch_prep_refpage_vma
+
+void arch_copy_refpage(struct page *page, unsigned long addr,
+				     struct vm_area_struct *vma);
+#define arch_copy_refpage arch_copy_refpage
+
 #endif /* ! __ASM_MMAN_H__ */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 67bf259ae768..b513f83010c7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -37,7 +37,7 @@  void mte_free_tag_storage(char *storage);
 /* track which pages have valid allocation tags */
 #define PG_mte_tagged	PG_arch_2
 
-void mte_zero_clear_page_tags(void *addr);
+void mte_zero_set_page_tags(void *addr);
 void mte_sync_tags(pte_t *ptep, pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
@@ -48,13 +48,14 @@  long set_mte_ctrl(struct task_struct *task, unsigned long arg);
 long get_mte_ctrl(struct task_struct *task);
 int mte_ptrace_copy_tags(struct task_struct *child, long request,
 			 unsigned long addr, unsigned long data);
+u8 mte_check_tag_zero_page(struct page *userpage);
 
 #else /* CONFIG_ARM64_MTE */
 
 /* unused if !CONFIG_ARM64_MTE, silence the compiler */
 #define PG_mte_tagged	0
 
-static inline void mte_zero_clear_page_tags(void *addr)
+static inline void mte_zero_set_page_tags(void *addr)
 {
 }
 static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
@@ -89,6 +90,10 @@  static inline int mte_ptrace_copy_tags(struct task_struct *child,
 {
 	return -EIO;
 }
+static inline u8 mte_check_tag_zero_page(struct page *userpage)
+{
+	return 0;
+}
 
 #endif /* CONFIG_ARM64_MTE */
 
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 993a27ea6f54..234f48688b1a 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -33,7 +33,7 @@  struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
 						unsigned long vaddr);
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
 
-void tag_clear_highpage(struct page *to);
+void tag_set_highpage(struct page *to, unsigned long tag);
 #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
 
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 727bfc3be99b..3cb206aea3db 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@ 
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		447
+#define __NR_compat_syscalls		449
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 99ffcafc736c..2a116aa17fe7 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -901,6 +901,8 @@  __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
 __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 #define __NR_landlock_restrict_self 446
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
+#define __NR_refpage_create 448
+__SYSCALL(__NR_refpage_create, sys_refpage_create)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 125a10e413e9..6a79240d5a77 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -453,3 +453,27 @@  int mte_ptrace_copy_tags(struct task_struct *child, long request,
 
 	return ret;
 }
+
+u8 mte_check_tag_zero_page(struct page *userpage)
+{
+	char *userpage_addr = page_address(userpage);
+	u8 tag;
+	int i;
+
+	if (!test_bit(PG_mte_tagged, &userpage->flags))
+		return 0;
+
+	tag = mte_get_mem_tag(userpage_addr) & 0xF;
+	if (tag == 0)
+		return 0;
+
+	for (i = 0; i != PAGE_SIZE; ++i)
+		if (userpage_addr[i] != 0)
+			return 0;
+
+	for (i = MTE_GRANULE_SIZE; i != PAGE_SIZE; i += MTE_GRANULE_SIZE)
+		if ((mte_get_mem_tag(userpage_addr + i) & 0xF) != tag)
+			return 0;
+
+	return tag;
+}
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index e83643b3995f..45be436c97af 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -37,24 +37,23 @@  SYM_FUNC_START(mte_clear_page_tags)
 SYM_FUNC_END(mte_clear_page_tags)
 
 /*
- * Zero the page and tags at the same time
+ * Zero the page and set tags at the same time
  *
  * Parameters:
  *	x0 - address to the beginning of the page
  */
-SYM_FUNC_START(mte_zero_clear_page_tags)
+SYM_FUNC_START(mte_zero_set_page_tags)
 	mrs	x1, dczid_el0
 	and	w1, w1, #0xf
 	mov	x2, #4
 	lsl	x1, x2, x1
-	and	x0, x0, #(1 << MTE_TAG_SHIFT) - 1	// clear the tag
 
 1:	dc	gzva, x0
 	add	x0, x0, x1
 	tst	x0, #(PAGE_SIZE - 1)
 	b.ne	1b
 	ret
-SYM_FUNC_END(mte_zero_clear_page_tags)
+SYM_FUNC_END(mte_zero_set_page_tags)
 
 /*
  * Copy the tags from the source page to the destination one
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 349c488765ca..36355758ffc7 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -25,6 +25,7 @@ 
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/mman.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -939,9 +940,45 @@  struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
 	return alloc_page_vma(flags, vma, vaddr);
 }
 
-void tag_clear_highpage(struct page *page)
+void tag_set_highpage(struct page *page, unsigned long tag)
 {
-	mte_zero_clear_page_tags(page_address(page));
+	unsigned long addr = (unsigned long)page_address(page);
+
+	addr &= ~MTE_TAG_MASK;
+	addr |= tag << MTE_TAG_SHIFT;
+	mte_zero_set_page_tags((void *)addr);
 	page_kasan_tag_reset(page);
 	set_bit(PG_mte_tagged, &page->flags);
 }
+
+#define REFPAGE_OPTZN_MTE_TAGGED REFPAGE_OPTZN_ARCH
+
+void arch_prep_refpage_private_data(struct refpage_private_data *priv)
+{
+	if (system_supports_mte()) {
+		u8 tag;
+
+		if (!test_and_set_bit(PG_mte_tagged, &priv->refpage->flags))
+			mte_clear_page_tags(page_address(priv->refpage));
+
+		tag = mte_check_tag_zero_page(priv->refpage);
+		if (tag) {
+			priv->optzn_kind = REFPAGE_OPTZN_MTE_TAGGED;
+			priv->optzn_info = tag;
+			return;
+		}
+	}
+
+	prep_refpage_private_data(priv);
+}
+
+void arch_copy_refpage(struct page *page, unsigned long addr,
+		       struct vm_area_struct *vma)
+{
+	struct refpage_private_data *priv = vma->vm_private_data;
+
+	if (priv->optzn_kind == REFPAGE_OPTZN_MTE_TAGGED)
+		tag_set_highpage(page, priv->optzn_info);
+	else
+		copy_refpage(page, addr, vma);
+}
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 6d07742c57b8..c2209d83f3c3 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -367,3 +367,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 541bc1b3a8f9..0360cf474a49 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -446,3 +446,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index a176faca2927..de85d758e564 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -452,3 +452,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index c2d2e19abea8..b07c7293d2a3 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -385,3 +385,4 @@ 
 444	n32	landlock_create_ruleset		sys_landlock_create_ruleset
 445	n32	landlock_add_rule		sys_landlock_add_rule
 446	n32	landlock_restrict_self		sys_landlock_restrict_self
+448	n32	refpage_create			sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index ac653d08b1ea..7ebabb99dd06 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -361,3 +361,4 @@ 
 444	n64	landlock_create_ruleset		sys_landlock_create_ruleset
 445	n64	landlock_add_rule		sys_landlock_add_rule
 446	n64	landlock_restrict_self		sys_landlock_restrict_self
+448	n64	refpage_create			sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 253f2cd70b6b..a51149ac101c 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -434,3 +434,4 @@ 
 444	o32	landlock_create_ruleset		sys_landlock_create_ruleset
 445	o32	landlock_add_rule		sys_landlock_add_rule
 446	o32	landlock_restrict_self		sys_landlock_restrict_self
+448	o32	refpage_create			sys_refpage_create
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index e26187b9ab87..385565864861 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -444,3 +444,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index aef2a290e71a..95cdd9f7dc06 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -526,3 +526,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 64d51ab5a8b4..92ed1260ffd9 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -449,3 +449,4 @@ 
 444  common	landlock_create_ruleset	sys_landlock_create_ruleset	sys_landlock_create_ruleset
 445  common	landlock_add_rule	sys_landlock_add_rule		sys_landlock_add_rule
 446  common	landlock_restrict_self	sys_landlock_restrict_self	sys_landlock_restrict_self
+448  common	refpage_create		sys_refpage_create		sys_refpage_create
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index e0a70be77d84..f9d198cc2541 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -449,3 +449,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 603f5a821502..83533aa49340 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -492,3 +492,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ce763a12311c..054c69e395b5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -452,3 +452,4 @@ 
 445	i386	landlock_add_rule	sys_landlock_add_rule
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
 447	i386	memfd_secret		sys_memfd_secret
+448	i386	refpage_create		sys_refpage_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f6b57799c1ea..1f24f0b66cbd 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -369,6 +369,7 @@ 
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
 447	common	memfd_secret		sys_memfd_secret
+448	common	refpage_create		sys_refpage_create
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 235d67d6ceb4..96c27fb404ca 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -417,3 +417,4 @@ 
 444	common	landlock_create_ruleset		sys_landlock_create_ruleset
 445	common	landlock_add_rule		sys_landlock_add_rule
 446	common	landlock_restrict_self		sys_landlock_restrict_self
+448	common	refpage_create			sys_refpage_create
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..ae3c763eb9e9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -55,8 +55,9 @@  struct vm_area_struct;
 #define ___GFP_ACCOUNT		0x400000u
 #define ___GFP_ZEROTAGS		0x800000u
 #define ___GFP_SKIP_KASAN_POISON	0x1000000u
+#define ___GFP_NOZERO		0x2000000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x2000000u
+#define ___GFP_NOLOCKDEP	0x4000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -238,18 +239,24 @@  struct vm_area_struct;
  * %__GFP_SKIP_KASAN_POISON returns a page which does not need to be poisoned
  * on deallocation. Typically used for userspace pages. Currently only has an
  * effect in HW tags mode.
+ *
+ * %__GFP_NOZERO disables any implicit zeroing of the page (e.g. as a result
+ * of init_on_alloc=on). This flag should only be used to address specific
+ * performance bottlenecks and only if the page is clearly being fully
+ * initialized following the allocation.
  */
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
 #define __GFP_ZEROTAGS	((__force gfp_t)___GFP_ZEROTAGS)
 #define __GFP_SKIP_KASAN_POISON	((__force gfp_t)___GFP_SKIP_KASAN_POISON)
+#define __GFP_NOZERO	((__force gfp_t)___GFP_NOZERO)
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 93ba33c09d12..6c5076dd1e9b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -187,7 +187,7 @@  static inline void clear_highpage(struct page *page)
 
 #ifndef __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
 
-static inline void tag_clear_highpage(struct page *page)
+static inline void tag_set_highpage(struct page *page, unsigned long tag)
 {
 }
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f123e15d966e..36ecfc391b46 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -127,6 +127,13 @@  static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return false;
+
+	/*
+	 * Transparent hugepages are not currently supported for anonymous VMAs
+	 * with reference pages.
+	 */
+	if (unlikely(is_refpage_vma(vma)))
+		return false;
 	return true;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a127d93612fa..8cff9e0463b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -32,6 +32,7 @@ 
 #include <linux/sched.h>
 #include <linux/pgtable.h>
 #include <linux/kasan.h>
+#include <linux/fs.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -722,6 +723,42 @@  int vma_is_stack_for_current(struct vm_area_struct *vma);
 /* flush_tlb_range() takes a vma, not a mm, and can care about flags */
 #define TLB_FLUSH_VMA(mm,flags) { .vm_mm = (mm), .vm_flags = (flags) }
 
+extern const struct file_operations refpage_file_operations;
+
+struct refpage_private_data {
+	struct page *refpage;
+	u8 optzn_kind;
+	u8 optzn_info;
+};
+
+#define REFPAGE_OPTZN_NONE	0
+#define REFPAGE_OPTZN_PATTERN	1
+#define REFPAGE_OPTZN_ARCH	2
+
+static inline bool is_refpage_vma(struct vm_area_struct *vma)
+{
+	return vma->vm_file && vma->vm_file->f_op == &refpage_file_operations;
+}
+
+static inline struct page *get_vma_refpage(struct vm_area_struct *vma)
+{
+	struct refpage_private_data *priv = vma->vm_private_data;
+
+	BUG_ON(!is_refpage_vma(vma));
+	return priv->refpage;
+}
+
+static inline int is_refpage_pfn(struct vm_area_struct *vma, unsigned long pfn)
+{
+	return is_refpage_vma(vma) && pfn == page_to_pfn(get_vma_refpage(vma));
+}
+
+static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
+					 unsigned long pfn)
+{
+	return is_zero_pfn(pfn) || is_refpage_pfn(vma, pfn);
+}
+
 struct mmu_gather;
 struct inode;
 
@@ -2977,6 +3014,8 @@  static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
 DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
 static inline bool want_init_on_alloc(gfp_t flags)
 {
+	if (flags & __GFP_NOZERO)
+		return false;
 	if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
 				&init_on_alloc))
 		return true;
diff --git a/include/linux/mman.h b/include/linux/mman.h
index ebb09a964272..cdf8f8245c78 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -2,6 +2,7 @@ 
 #ifndef _LINUX_MMAN_H
 #define _LINUX_MMAN_H
 
+#include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/percpu_counter.h>
 
@@ -123,6 +124,24 @@  static inline bool arch_validate_flags(unsigned long flags)
 #define arch_validate_flags arch_validate_flags
 #endif
 
+void prep_refpage_private_data(struct refpage_private_data *priv);
+#ifndef arch_prep_refpage_private_data
+#define arch_prep_refpage_private_data prep_refpage_private_data
+#endif
+
+#ifndef arch_prep_refpage_vma
+static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
+{
+}
+#define arch_prep_refpage_vma arch_prep_refpage_vma
+#endif
+
+void copy_refpage(struct page *page, unsigned long addr,
+		  struct vm_area_struct *vma);
+#ifndef arch_copy_refpage
+#define arch_copy_refpage copy_refpage
+#endif
+
 /*
  * Optimisation macro.  It is equivalent to:
  *      (x & bit1) ? bit2 : 0
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 69c9a7010081..303a28a86500 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -864,6 +864,9 @@  asmlinkage long sys_mremap(unsigned long addr,
 			   unsigned long old_len, unsigned long new_len,
 			   unsigned long flags, unsigned long new_addr);
 
+/* mm/refpage.c */
+asmlinkage long sys_refpage_create(const void __user *content, unsigned long flags);
+
 /* security/keys/keyctl.c */
 asmlinkage long sys_add_key(const char __user *_type,
 			    const char __user *_description,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a9d6fcd95f42..54cede7db5f0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -878,8 +878,11 @@  __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
 #endif
 
+#define __NR_refpage_create 448
+__SYSCALL(__NR_refpage_create, sys_refpage_create)
+
 #undef __NR_syscalls
-#define __NR_syscalls 448
+#define __NR_syscalls 449
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 30971b1dd4a9..bc65a54eb2a4 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -300,6 +300,7 @@  COND_SYSCALL(migrate_pages);
 COND_SYSCALL_COMPAT(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL_COMPAT(move_pages);
+COND_SYSCALL(refpage_create);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/Makefile b/mm/Makefile
index e3436741d539..137adc22bf50 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,10 +35,10 @@  CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
 CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o ioremap.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o ioremap.o
+			   pgtable-generic.o refpage.o rmap.o vmalloc.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..ba1b7bd7a0a0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -548,7 +548,7 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 			goto out;
 		}
 
-		if (is_zero_pfn(pte_pfn(pte))) {
+		if (is_zero_or_refpage_pfn(vma, pte_pfn(pte))) {
 			page = pte_page(pte);
 		} else {
 			ret = follow_pfn_pte(vma, address, ptep, flags);
diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
index ed5e5b833d61..3c433e430c80 100644
--- a/mm/kasan/hw_tags.c
+++ b/mm/kasan/hw_tags.c
@@ -253,7 +253,7 @@  void kasan_alloc_pages(struct page *page, unsigned int order, gfp_t flags)
 		int i;
 
 		for (i = 0; i != 1 << order; ++i)
-			tag_clear_highpage(page + i);
+			tag_set_highpage(page + i, 0);
 	} else {
 		kasan_unpoison_pages(page, order, init);
 	}
diff --git a/mm/memory.c b/mm/memory.c
index db86558791f1..8b32bdd215b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -614,7 +614,7 @@  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
-		if (is_zero_pfn(pfn))
+		if (is_zero_or_refpage_pfn(vma, pfn))
 			return NULL;
 		if (pte_devmap(pte))
 			return NULL;
@@ -640,7 +640,7 @@  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
-	if (is_zero_pfn(pfn))
+	if (is_zero_or_refpage_pfn(vma, pfn))
 		return NULL;
 
 check_pfn:
@@ -2166,7 +2166,7 @@  static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 		return true;
 	if (pfn_t_special(pfn))
 		return true;
-	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
+	if (is_zero_or_refpage_pfn(vma, pfn_t_to_pfn(pfn)))
 		return true;
 	return false;
 }
@@ -2990,22 +2990,29 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	pte_t entry;
 	int page_copied = 0;
 	struct mmu_notifier_range range;
+	unsigned long pfn;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 
-	if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
+	pfn = pte_pfn(vmf->orig_pte);
+	if (is_zero_pfn(pfn)) {
 		new_page = alloc_zeroed_user_highpage_movable(vma,
 							      vmf->address);
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
-				vmf->address);
+		bool refpage = is_refpage_pfn(vma, pfn);
+
+		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE |
+						  (refpage ? __GFP_NOZERO : 0),
+					  vma, vmf->address);
 		if (!new_page)
 			goto oom;
 
-		if (!cow_user_page(new_page, old_page, vmf)) {
+		if (refpage) {
+			arch_copy_refpage(new_page, vmf->address, vma);
+		} else if (!cow_user_page(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,
 			 * it's fine. If not, userspace would re-fault on
@@ -3739,11 +3746,16 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (unlikely(pmd_trans_unstable(vmf->pmd)))
 		return 0;
 
-	/* Use the zero-page for reads */
+	/* Use the zero-page, or reference page if set, for reads */
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm)) {
-		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
-						vma->vm_page_prot));
+		unsigned long pfn;
+
+		if (unlikely(is_refpage_vma(vma)))
+			pfn = page_to_pfn(get_vma_refpage(vma));
+		else
+			pfn = my_zero_pfn(vmf->address);
+		entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte)) {
@@ -3764,9 +3776,18 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
-	if (!page)
-		goto oom;
+
+	if (unlikely(is_refpage_vma(vma))) {
+		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_NOZERO, vma,
+				      vmf->address);
+		if (!page)
+			goto oom;
+		arch_copy_refpage(page, vmf->address, vma);
+	} else {
+		page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+		if (!page)
+			goto oom;
+	}
 
 	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
diff --git a/mm/migrate.c b/mm/migrate.c
index 23cbd9de030b..9a897676ff95 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2774,8 +2774,8 @@  static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmd_t *pmdp;
 	pte_t *ptep;
 
-	/* Only allow populating anonymous memory */
-	if (!vma_is_anonymous(vma))
+	/* Only allow populating anonymous memory without a reference page */
+	if (!vma_is_anonymous(vma) || is_refpage_vma(vma))
 		goto abort;
 
 	pgdp = pgd_offset(mm, addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8836e54721ae..6ca831c1821f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1283,7 +1283,7 @@  static void kernel_init_free_pages(struct page *page, int numpages, bool zero_ta
 
 	if (zero_tags) {
 		for (i = 0; i < numpages; i++)
-			tag_clear_highpage(page + i);
+			tag_set_highpage(page + i, 0);
 		return;
 	}
 
diff --git a/mm/refpage.c b/mm/refpage.c
new file mode 100644
index 000000000000..ee95e281d2d4
--- /dev/null
+++ b/mm/refpage.c
@@ -0,0 +1,98 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/anon_inodes.h>
+#include <linux/fs_context.h>
+#include <linux/highmem.h>
+#include <linux/mman.h>
+#include <linux/mount.h>
+#include <linux/syscalls.h>
+
+void prep_refpage_private_data(struct refpage_private_data *priv)
+{
+	u8 *addr = page_address(priv->refpage);
+	u8 pattern = addr[0];
+	int i;
+
+	for (i = 1; i != PAGE_SIZE; ++i)
+		if (addr[i] != pattern)
+			return;
+
+	priv->optzn_kind = REFPAGE_OPTZN_PATTERN;
+	priv->optzn_info = pattern;
+}
+
+void copy_refpage(struct page *page, unsigned long addr,
+		  struct vm_area_struct *vma)
+{
+	struct refpage_private_data *priv = vma->vm_private_data;
+
+	if (priv->optzn_kind == REFPAGE_OPTZN_PATTERN)
+		memset(page_address(page), priv->optzn_info, PAGE_SIZE);
+	else
+		copy_user_highpage(page, priv->refpage, addr, vma);
+}
+
+static void put_refpage_private_data(struct refpage_private_data *priv)
+{
+	put_page(priv->refpage);
+	kfree(priv);
+}
+
+static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma_set_anonymous(vma);
+	vma->vm_private_data = vma->vm_file->private_data;
+	arch_prep_refpage_vma(vma);
+	return 0;
+}
+
+static int refpage_release(struct inode *inode, struct file *file)
+{
+	put_refpage_private_data(file->private_data);
+	return 0;
+}
+
+const struct file_operations refpage_file_operations = {
+	.mmap = refpage_mmap,
+	.release = refpage_release,
+};
+
+SYSCALL_DEFINE2(refpage_create, const void __user *, content, unsigned long,
+		flags)
+{
+	unsigned long content_addr = (unsigned long)content;
+	struct page *userpage;
+	struct refpage_private_data *private_data;
+	int fd;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
+	    get_user_pages(content_addr, 1, 0, &userpage, NULL) != 1)
+		return -EFAULT;
+
+	private_data = kzalloc(sizeof(struct refpage_private_data), GFP_KERNEL);
+	if (!private_data) {
+		put_page(userpage);
+		return -ENOMEM;
+	}
+
+	private_data->refpage = alloc_page(GFP_KERNEL);
+	if (!private_data->refpage) {
+		kfree(private_data);
+		put_page(userpage);
+		return -ENOMEM;
+	}
+
+	copy_highpage(private_data->refpage, userpage);
+	arch_prep_refpage_private_data(private_data);
+	put_page(userpage);
+
+	fd = anon_inode_getfd("[refpage]", &refpage_file_operations,
+			      private_data, O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		put_refpage_private_data(private_data);
+
+	return fd;
+}