[v4] mm: Add MREMAP_DONTUNMAP to mremap().

Message ID	20200207201856.46070-1-bgeffon@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=3ecC=33=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 48F3D214AF Date: Fri, 7 Feb 2020 12:18:56 -0800 Message-Id: <20200207201856.46070-1-bgeffon@google.com> Mime-Version: 1.0 Subject: [PATCH v4] mm: Add MREMAP_DONTUNMAP to mremap(). From: Brian Geffon <bgeffon@google.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: "Michael S . Tsirkin" <mst@redhat.com>, Brian Geffon <bgeffon@google.com>, Arnd Bergmann <arnd@arndb.de>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>, Will Deacon <will@kernel.org>, Andrea Arcangeli <aarcange@redhat.com>, Sonny Rao <sonnyrao@google.com>, Minchan Kim <minchan@kernel.org>, Joel Fernandes <joel@joelfernandes.org>, Yu Zhao <yuzhao@google.com>, Jesse Barnes <jsbarnes@google.com>, Nathan Chancellor <natechancellor@gmail.com>, Florian Weimer <fweimer@redhat.com>, "Kirill A . Shutemov" <kirill@shutemov.name> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v4] mm: Add MREMAP_DONTUNMAP to mremap(). \| expand [v4] mm: Add MREMAP_DONTUNMAP to mremap().

Brian Geffon Feb. 7, 2020, 8:18 p.m. UTC

When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
set, the source mapping will not be removed. Instead it will be
cleared as if a brand new anonymous, private mapping had been created
atomically as part of the mremap() call.  If a userfaultfd was watching
the source, it will continue to watch the new mapping.  For a mapping
that is shared or not anonymous, MREMAP_DONTUNMAP will cause the
mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
equally sized VMAs where the destination contains the PTEs of the source.

We hope to use this in Chrome OS where with userfaultfd we could write
an anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.

This feature also has a use case in Android, Lokesh Gidra has said
that "As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location. For this purpose mremap
will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well. Therefore, we'll
require performing mmap immediately after. This is not only time consuming
but also opens a time window where a native thread may call mmap and
reserve the java heap's address range for its own usage. This flag
solves the problem."
           
Signed-off-by: Brian Geffon <bgeffon@google.com>
---
 include/uapi/linux/mman.h |  5 +-
 mm/mremap.c               | 98 ++++++++++++++++++++++++++++++---------
 2 files changed, 80 insertions(+), 23 deletions(-)

Andrew Morton Feb. 10, 2020, 1:21 a.m. UTC | #1

On Fri,  7 Feb 2020 12:18:56 -0800 Brian Geffon <bgeffon@google.com> wrote:

> When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> set, the source mapping will not be removed. Instead it will be
> cleared as if a brand new anonymous, private mapping had been created
> atomically as part of the mremap() call.  If a userfaultfd was watching
> the source, it will continue to watch the new mapping.  For a mapping
> that is shared or not anonymous, MREMAP_DONTUNMAP will cause the
> mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> equally sized VMAs where the destination contains the PTEs of the source.
> 
> We hope to use this in Chrome OS where with userfaultfd we could write
> an anonymous mapping to disk without having to STOP the process or worry
> about VMA permission changes.
> 
> This feature also has a use case in Android, Lokesh Gidra has said
> that "As part of using userfaultfd for GC, We'll have to move the physical
> pages of the java heap to a separate location. For this purpose mremap
> will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> heap, its virtual mapping will be removed as well. Therefore, we'll
> require performing mmap immediately after. This is not only time consuming
> but also opens a time window where a native thread may call mmap and
> reserve the java heap's address range for its own usage. This flag
> solves the problem."

This seems useful and reasonably mature, so I'll queue it for
additional testing and shall await review feedback.

Could we please get some self-test code for this feature in
tools/testing/selftests/vm?  Perhaps in userfaultfd.c?

Kirill A . Shutemov Feb. 10, 2020, 10:45 a.m. UTC | #2

On Fri, Feb 07, 2020 at 12:18:56PM -0800, Brian Geffon wrote:
> When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> set, the source mapping will not be removed. Instead it will be
> cleared as if a brand new anonymous, private mapping had been created
> atomically as part of the mremap() call.  If a userfaultfd was watching
> the source, it will continue to watch the new mapping.  For a mapping
> that is shared or not anonymous, MREMAP_DONTUNMAP will cause the
> mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> equally sized VMAs where the destination contains the PTEs of the source.
> 
> We hope to use this in Chrome OS where with userfaultfd we could write
> an anonymous mapping to disk without having to STOP the process or worry
> about VMA permission changes.
> 
> This feature also has a use case in Android, Lokesh Gidra has said
> that "As part of using userfaultfd for GC, We'll have to move the physical
> pages of the java heap to a separate location. For this purpose mremap
> will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> heap, its virtual mapping will be removed as well. Therefore, we'll
> require performing mmap immediately after. This is not only time consuming
> but also opens a time window where a native thread may call mmap and
> reserve the java heap's address range for its own usage. This flag
> solves the problem."
>            
> Signed-off-by: Brian Geffon <bgeffon@google.com>
> ---
>  include/uapi/linux/mman.h |  5 +-
>  mm/mremap.c               | 98 ++++++++++++++++++++++++++++++---------
>  2 files changed, 80 insertions(+), 23 deletions(-)
> 
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index fc1a64c3447b..923cc162609c 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -5,8 +5,9 @@
>  #include <asm/mman.h>
>  #include <asm-generic/hugetlb_encode.h>
>  
> -#define MREMAP_MAYMOVE	1
> -#define MREMAP_FIXED	2
> +#define MREMAP_MAYMOVE		1
> +#define MREMAP_FIXED		2
> +#define MREMAP_DONTUNMAP	4
>  
>  #define OVERCOMMIT_GUESS		0
>  #define OVERCOMMIT_ALWAYS		1
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 122938dcec15..9f4aa17f178b 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -318,8 +318,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  static unsigned long move_vma(struct vm_area_struct *vma,
>  		unsigned long old_addr, unsigned long old_len,
>  		unsigned long new_len, unsigned long new_addr,
> -		bool *locked, struct vm_userfaultfd_ctx *uf,
> -		struct list_head *uf_unmap)
> +		bool *locked, unsigned long flags,
> +		struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct vm_area_struct *new_vma;
> @@ -408,11 +408,41 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  	if (unlikely(vma->vm_flags & VM_PFNMAP))
>  		untrack_pfn_moved(vma);
>  
> +	if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
> +		if (vm_flags & VM_ACCOUNT) {
> +			/* Always put back VM_ACCOUNT since we won't unmap */
> +			vma->vm_flags |= VM_ACCOUNT;
> +
> +			vm_acct_memory(vma_pages(new_vma));
> +		}
> +
> +		/*
> +		 * locked_vm accounting: if the mapping remained the same size
> +		 * it will have just moved and we don't need to touch locked_vm
> +		 * because we skip the do_unmap. If the mapping shrunk before
> +		 * being moved then the do_unmap on that portion will have
> +		 * adjusted vm_locked. Only if the mapping grows do we need to
> +		 * do something special; the reason is locked_vm only accounts
> +		 * for old_len, but we're now adding new_len - old_len locked
> +		 * bytes to the new mapping.
> +		 */
> +		if (new_len > old_len)
> +			mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;

Hm. How do you enforce that we're not over RLIMIT_MEMLOCK?

Brian Geffon Feb. 10, 2020, 2:12 p.m. UTC | #3

Hi Kirill,
If the old_len == new_len then there is no change in the number of
locked pages they just moved, if the new_len < old_len then the
process of unmapping (new_len - old_len) bytes from the old mapping
will handle the locked page accounting. So in this special case where
we're growing the VMA, vma_to_resize() will enforce that growing the
vma doesn't exceed RLIMIT_MEMLOCK, but vma_to_resize() doesn't handle
incrementing mm->locked_bytes which is why we have that special case
incrementing it here.

Thanks,
Brian

On Mon, Feb 10, 2020 at 2:45 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Fri, Feb 07, 2020 at 12:18:56PM -0800, Brian Geffon wrote:
> > When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> > set, the source mapping will not be removed. Instead it will be
> > cleared as if a brand new anonymous, private mapping had been created
> > atomically as part of the mremap() call.  If a userfaultfd was watching
> > the source, it will continue to watch the new mapping.  For a mapping
> > that is shared or not anonymous, MREMAP_DONTUNMAP will cause the
> > mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> > a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> > equally sized VMAs where the destination contains the PTEs of the source.
> >
> > We hope to use this in Chrome OS where with userfaultfd we could write
> > an anonymous mapping to disk without having to STOP the process or worry
> > about VMA permission changes.
> >
> > This feature also has a use case in Android, Lokesh Gidra has said
> > that "As part of using userfaultfd for GC, We'll have to move the physical
> > pages of the java heap to a separate location. For this purpose mremap
> > will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> > heap, its virtual mapping will be removed as well. Therefore, we'll
> > require performing mmap immediately after. This is not only time consuming
> > but also opens a time window where a native thread may call mmap and
> > reserve the java heap's address range for its own usage. This flag
> > solves the problem."
> >
> > Signed-off-by: Brian Geffon <bgeffon@google.com>
> > ---
> >  include/uapi/linux/mman.h |  5 +-
> >  mm/mremap.c               | 98 ++++++++++++++++++++++++++++++---------
> >  2 files changed, 80 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> > index fc1a64c3447b..923cc162609c 100644
> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -5,8 +5,9 @@
> >  #include <asm/mman.h>
> >  #include <asm-generic/hugetlb_encode.h>
> >
> > -#define MREMAP_MAYMOVE       1
> > -#define MREMAP_FIXED 2
> > +#define MREMAP_MAYMOVE               1
> > +#define MREMAP_FIXED         2
> > +#define MREMAP_DONTUNMAP     4
> >
> >  #define OVERCOMMIT_GUESS             0
> >  #define OVERCOMMIT_ALWAYS            1
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 122938dcec15..9f4aa17f178b 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -318,8 +318,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >  static unsigned long move_vma(struct vm_area_struct *vma,
> >               unsigned long old_addr, unsigned long old_len,
> >               unsigned long new_len, unsigned long new_addr,
> > -             bool *locked, struct vm_userfaultfd_ctx *uf,
> > -             struct list_head *uf_unmap)
> > +             bool *locked, unsigned long flags,
> > +             struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
> >  {
> >       struct mm_struct *mm = vma->vm_mm;
> >       struct vm_area_struct *new_vma;
> > @@ -408,11 +408,41 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> >       if (unlikely(vma->vm_flags & VM_PFNMAP))
> >               untrack_pfn_moved(vma);
> >
> > +     if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
> > +             if (vm_flags & VM_ACCOUNT) {
> > +                     /* Always put back VM_ACCOUNT since we won't unmap */
> > +                     vma->vm_flags |= VM_ACCOUNT;
> > +
> > +                     vm_acct_memory(vma_pages(new_vma));
> > +             }
> > +
> > +             /*
> > +              * locked_vm accounting: if the mapping remained the same size
> > +              * it will have just moved and we don't need to touch locked_vm
> > +              * because we skip the do_unmap. If the mapping shrunk before
> > +              * being moved then the do_unmap on that portion will have
> > +              * adjusted vm_locked. Only if the mapping grows do we need to
> > +              * do something special; the reason is locked_vm only accounts
> > +              * for old_len, but we're now adding new_len - old_len locked
> > +              * bytes to the new mapping.
> > +              */
> > +             if (new_len > old_len)
> > +                     mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;
>
> Hm. How do you enforce that we're not over RLIMIT_MEMLOCK?
>
>
> --
>  Kirill A. Shutemov

Brian Geffon Feb. 10, 2020, 6:38 p.m. UTC | #4

Thank you Andrew. I'll get working on some self-tests.

Brian

On Sun, Feb 9, 2020 at 5:21 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri,  7 Feb 2020 12:18:56 -0800 Brian Geffon <bgeffon@google.com> wrote:
>
> > When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> > set, the source mapping will not be removed. Instead it will be
> > cleared as if a brand new anonymous, private mapping had been created
> > atomically as part of the mremap() call.  If a userfaultfd was watching
> > the source, it will continue to watch the new mapping.  For a mapping
> > that is shared or not anonymous, MREMAP_DONTUNMAP will cause the
> > mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> > a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> > equally sized VMAs where the destination contains the PTEs of the source.
> >
> > We hope to use this in Chrome OS where with userfaultfd we could write
> > an anonymous mapping to disk without having to STOP the process or worry
> > about VMA permission changes.
> >
> > This feature also has a use case in Android, Lokesh Gidra has said
> > that "As part of using userfaultfd for GC, We'll have to move the physical
> > pages of the java heap to a separate location. For this purpose mremap
> > will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> > heap, its virtual mapping will be removed as well. Therefore, we'll
> > require performing mmap immediately after. This is not only time consuming
> > but also opens a time window where a native thread may call mmap and
> > reserve the java heap's address range for its own usage. This flag
> > solves the problem."
>
> This seems useful and reasonably mature, so I'll queue it for
> additional testing and shall await review feedback.
>
> Could we please get some self-test code for this feature in
> tools/testing/selftests/vm?  Perhaps in userfaultfd.c?
>

Daniel Colascione Feb. 11, 2020, 11:13 p.m. UTC | #5

On Fri, Feb 7, 2020 at 12:19 PM Brian Geffon <bgeffon@google.com> wrote:
>
> When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> set, the source mapping will not be removed. Instead it will be
> cleared as if a brand new anonymous, private mapping had been created
> atomically as part of the mremap() call.

The left-behind mapping (the "as if a brand new anonymous, private
mapping" map) is immediately writable, right? If so, we need to
account the additional commit charge. What about making the
left-behind mapping PROT_NONE? This way, we'll still solve the
address-space race in Lokesh's use case (because even a PROT_NONE
mapping reserves address space) but won't incur any additional commit
until someone calls mprotect(PROT_WRITE) on the left-behind mapping.
Imagine having two equal-sized mappings and wanting to use mremap() to
swap them: you can implement this swap by carving off a third region
of address space and making two mremap() calls. But without the
PROT_NONE, you pay additional commit for that third region even if you
don't need it.

Brian Geffon Feb. 11, 2020, 11:32 p.m. UTC | #6

Hi Daniel,

> What about making the
> left-behind mapping PROT_NONE? This way, we'll still solve the
> address-space race in Lokesh's use case (because even a PROT_NONE
> mapping reserves address space) but won't incur any additional commit
> until someone calls mprotect(PROT_WRITE) on the left-behind mapping.

This limits the usefulness of the feature IMO and really is too
specific to that one use case, suppose you want to snapshot a memory
region to disk without having to stop a thread you can
mremap(MREMAP_DONTUNMAP) it to another location and safely write it to
disk knowing the faulting thread will be stopped and you can handle it
later if it was registered with userfaultfd, if we were to also change
it to PROT_NONE that thread would see a SEGV. There are other examples
where you can possibly use this flag instead of VM_UFFD_WP, but
changing the protections of the mapping prevents you from being able
to do this without a funny signal handler dance.

Brian

Daniel Colascione Feb. 11, 2020, 11:53 p.m. UTC | #7

On Tue, Feb 11, 2020 at 3:32 PM Brian Geffon <bgeffon@google.com> wrote:
>
> Hi Daniel,
>
> > What about making the
> > left-behind mapping PROT_NONE? This way, we'll still solve the
> > address-space race in Lokesh's use case (because even a PROT_NONE
> > mapping reserves address space) but won't incur any additional commit
> > until someone calls mprotect(PROT_WRITE) on the left-behind mapping.
>
> This limits the usefulness of the feature IMO and really is too
> specific to that one use case, suppose you want to snapshot a memory
> region to disk without having to stop a thread you can
> mremap(MREMAP_DONTUNMAP) it to another location and safely write it to
> disk knowing the faulting thread will be stopped and you can handle it
> later if it was registered with userfaultfd, if we were to also change
> it to PROT_NONE that thread would see a SEGV. There are other examples
> where you can possibly use this flag instead of VM_UFFD_WP, but
> changing the protections of the mapping prevents you from being able
> to do this without a funny signal handler dance.

We seem to have identified two use cases which require contradictory
things. We can handle this with an option. Maybe the new region's
protection bits should be specified in the flags argument. The most
general API would accept any set of mmap flags to apply to the
left-behind region, but it's probably hard to squeeze that
functionality into the existing API.

Kirill A . Shutemov Feb. 13, 2020, 12:08 p.m. UTC | #8

On Mon, Feb 10, 2020 at 06:12:39AM -0800, Brian Geffon wrote:
> Hi Kirill,
> If the old_len == new_len then there is no change in the number of
> locked pages they just moved, if the new_len < old_len then the
> process of unmapping (new_len - old_len) bytes from the old mapping
> will handle the locked page accounting. So in this special case where
> we're growing the VMA, vma_to_resize() will enforce that growing the
> vma doesn't exceed RLIMIT_MEMLOCK, but vma_to_resize() doesn't handle
> incrementing mm->locked_bytes which is why we have that special case
> incrementing it here.

But if you do the operation for the VM_LOCKED vma, you'll have two locked
VMA's now, right? Where do you account the old locked vma you left behind?

Brian Geffon Feb. 13, 2020, 6:20 p.m. UTC | #9

Hi Kirill,

> But if you do the operation for the VM_LOCKED vma, you'll have two locked
> VMA's now, right? Where do you account the old locked vma you left behind?

You bring up a good point. In a previous iteration of my patch I had
it clearing the locked flags on the old VMA as technically the locked
pages had migrated. I talked myself out of that but the more I think
about it we should probably do that. Something along the lines of:

+    if (vm_flags & VM_LOCKED) {
+      /* Locked pages would have migrated to the new VMA */
+      vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+      if (new_len > old_len)
+              mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;
+   }

I feel that this is correct. The only other possible option would be
to clear only the VM_LOCKED flag on the old vma leaving VM_LOCKONFAULT
to handle the MCL_ONFAULT mlocked situation, thoughts? Regardless I'll
have to mail a new patch because that part where I'm incrementing the
mm->locked_vm lost the check on VM_LOCKED during patch versions.

Thanks again for taking the time to review.

Brian

Kirill A . Shutemov Feb. 14, 2020, 12:36 a.m. UTC | #10

On Thu, Feb 13, 2020 at 10:20:44AM -0800, Brian Geffon wrote:
> Hi Kirill,
> 
> > But if you do the operation for the VM_LOCKED vma, you'll have two locked
> > VMA's now, right? Where do you account the old locked vma you left behind?
> 
> You bring up a good point. In a previous iteration of my patch I had
> it clearing the locked flags on the old VMA as technically the locked
> pages had migrated. I talked myself out of that but the more I think
> about it we should probably do that. Something along the lines of:
> 
> +    if (vm_flags & VM_LOCKED) {
> +      /* Locked pages would have migrated to the new VMA */
> +      vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> +      if (new_len > old_len)
> +              mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;
> +   }
> 
> I feel that this is correct. The only other possible option would be
> to clear only the VM_LOCKED flag on the old vma leaving VM_LOCKONFAULT
> to handle the MCL_ONFAULT mlocked situation, thoughts? Regardless I'll
> have to mail a new patch because that part where I'm incrementing the
> mm->locked_vm lost the check on VM_LOCKED during patch versions.

Note, that we account mlock limit on per-VMA basis, not per page. Even for
VM_LOCKONFAULT.

> Thanks again for taking the time to review.

I believe the right approach is to strip VM_LOCKED[ONFAULT] from the vma
you left behind. Or the new vma. It is a policy decision.

JFYI, we do not inherit VM_LOCKED on fork(), so it's common practice to
strip VM_LOCKED on vma duplication.

Other option is to leave VM_LOCKED on both VMAs and fail the operation if
we are over the limit. But we need to have a good reason to take this
path. It makes the interface less flexible.

[v4] mm: Add MREMAP_DONTUNMAP to mremap().

Commit Message

Comments

Patch