[v6,1/2] mm: Add MREMAP_DONTUNMAP to mremap().

Message ID	20200218173221.237674-1-bgeffon@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=f+6W=4G=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9522824656 Date: Tue, 18 Feb 2020 09:32:20 -0800 Message-Id: <20200218173221.237674-1-bgeffon@google.com> Mime-Version: 1.0 Subject: [PATCH v6 1/2] mm: Add MREMAP_DONTUNMAP to mremap(). From: Brian Geffon <bgeffon@google.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: "Michael S . Tsirkin" <mst@redhat.com>, Brian Geffon <bgeffon@google.com>, Arnd Bergmann <arnd@arndb.de>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-api@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>, Will Deacon <will@kernel.org>, Andrea Arcangeli <aarcange@redhat.com>, Sonny Rao <sonnyrao@google.com>, Minchan Kim <minchan@kernel.org>, Joel Fernandes <joel@joelfernandes.org>, Yu Zhao <yuzhao@google.com>, Jesse Barnes <jsbarnes@google.com>, Florian Weimer <fweimer@redhat.com>, "Kirill A . Shutemov" <kirill@shutemov.name> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v6,1/2] mm: Add MREMAP_DONTUNMAP to mremap(). \| expand [v6,1/2] mm: Add MREMAP_DONTUNMAP to mremap(). [v6,2/2] selftest: Add MREMAP_DONTUNMAP selftest.

Brian Geffon Feb. 18, 2020, 5:32 p.m. UTC

When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
set, the source mapping will not be removed. The remap operation
will be performed as it would have been normally by moving over the
page tables to the new mapping. The old vma will have any locked
flags cleared, have no pagetables, and any userfaultfds that were
watching that range will continue watching it.

For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
equally sized VMAs where the destination contains the PTEs of the source.

We hope to use this in Chrome OS where with userfaultfd we could write
an anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.

This feature also has a use case in Android, Lokesh Gidra has said
that "As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location. For this purpose mremap
will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well. Therefore, we'll
require performing mmap immediately after. This is not only time consuming
but also opens a time window where a native thread may call mmap and
reserve the java heap's address range for its own usage. This flag
solves the problem."

  v5 -> v6:
    - Code cleanup suggested by Kirill.

  v4 -> v5:
    - Correct commit message to more accurately reflect the behavior.
    - Clear VM_LOCKED and VM_LOCKEDONFAULT on the old vma.
           
Signed-off-by: Brian Geffon <bgeffon@google.com>
---
 include/uapi/linux/mman.h |   5 +-
 mm/mremap.c               | 103 ++++++++++++++++++++++++++++++--------
 2 files changed, 85 insertions(+), 23 deletions(-)

Andrew Morton Feb. 19, 2020, 8:39 p.m. UTC | #1

On Tue, 18 Feb 2020 09:32:20 -0800 Brian Geffon <bgeffon@google.com> wrote:

> When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> set, the source mapping will not be removed. The remap operation
> will be performed as it would have been normally by moving over the
> page tables to the new mapping. The old vma will have any locked
> flags cleared, have no pagetables, and any userfaultfds that were
> watching that range will continue watching it.
> 
> For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
> the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> equally sized VMAs where the destination contains the PTEs of the source.
> 
> We hope to use this in Chrome OS where with userfaultfd we could write
> an anonymous mapping to disk without having to STOP the process or worry
> about VMA permission changes.
> 
> This feature also has a use case in Android, Lokesh Gidra has said
> that "As part of using userfaultfd for GC, We'll have to move the physical
> pages of the java heap to a separate location. For this purpose mremap
> will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> heap, its virtual mapping will be removed as well. Therefore, we'll
> require performing mmap immediately after. This is not only time consuming
> but also opens a time window where a native thread may call mmap and
> reserve the java heap's address range for its own usage. This flag
> solves the problem."

Thanks.

We're a bit thin on review activity on this one.  Has Lokesh been able
to review and preferably test the code?  Are you able to identify other
potential users?  perhaps even glibc?

Lokesh Gidra Feb. 19, 2020, 9:38 p.m. UTC | #2

I've validated the change, that it works for me, through manual
testing. The android runtime changes will follow shortly.

Tested-by: Lokesh Gidra <lokeshgidra@google.com>



On Wed, Feb 19, 2020 at 12:39 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 18 Feb 2020 09:32:20 -0800 Brian Geffon <bgeffon@google.com> wrote:
>
> > When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> > set, the source mapping will not be removed. The remap operation
> > will be performed as it would have been normally by moving over the
> > page tables to the new mapping. The old vma will have any locked
> > flags cleared, have no pagetables, and any userfaultfds that were
> > watching that range will continue watching it.
> >
> > For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
> > the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> > a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> > equally sized VMAs where the destination contains the PTEs of the source.
> >
> > We hope to use this in Chrome OS where with userfaultfd we could write
> > an anonymous mapping to disk without having to STOP the process or worry
> > about VMA permission changes.
> >
> > This feature also has a use case in Android, Lokesh Gidra has said
> > that "As part of using userfaultfd for GC, We'll have to move the physical
> > pages of the java heap to a separate location. For this purpose mremap
> > will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> > heap, its virtual mapping will be removed as well. Therefore, we'll
> > require performing mmap immediately after. This is not only time consuming
> > but also opens a time window where a native thread may call mmap and
> > reserve the java heap's address range for its own usage. This flag
> > solves the problem."
>
> Thanks.
>
> We're a bit thin on review activity on this one.  Has Lokesh been able
> to review and preferably test the code?  Are you able to identify other
> potential users?  perhaps even glibc?

Minchan Kim Feb. 19, 2020, 11:23 p.m. UTC | #3

On Tue, Feb 18, 2020 at 09:32:20AM -0800, Brian Geffon wrote:
> When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> set, the source mapping will not be removed. The remap operation
> will be performed as it would have been normally by moving over the
> page tables to the new mapping. The old vma will have any locked
> flags cleared, have no pagetables, and any userfaultfds that were
> watching that range will continue watching it.
> 
> For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
> the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> equally sized VMAs where the destination contains the PTEs of the source.
> 
> We hope to use this in Chrome OS where with userfaultfd we could write
> an anonymous mapping to disk without having to STOP the process or worry
> about VMA permission changes.
> 
> This feature also has a use case in Android, Lokesh Gidra has said
> that "As part of using userfaultfd for GC, We'll have to move the physical
> pages of the java heap to a separate location. For this purpose mremap
> will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> heap, its virtual mapping will be removed as well. Therefore, we'll
> require performing mmap immediately after. This is not only time consuming
> but also opens a time window where a native thread may call mmap and
> reserve the java heap's address range for its own usage. This flag
> solves the problem."
> 
>   v5 -> v6:
>     - Code cleanup suggested by Kirill.
> 
>   v4 -> v5:
>     - Correct commit message to more accurately reflect the behavior.
>     - Clear VM_LOCKED and VM_LOCKEDONFAULT on the old vma.
>            
> Signed-off-by: Brian Geffon <bgeffon@google.com>

Description and implemenation looks clean/sane for me.

Reviewed-by: Minchan Kim <minchan@kernel.org>

When I review the patch, it seems to be broken with lots of "=20", not
sure it's my mail client problem or yours. Anyway, please double check
it.

Thanks.

Kirill A . Shutemov Feb. 20, 2020, 11:57 a.m. UTC | #4

On Tue, Feb 18, 2020 at 09:32:20AM -0800, Brian Geffon wrote:
> When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> set, the source mapping will not be removed. The remap operation
> will be performed as it would have been normally by moving over the
> page tables to the new mapping. The old vma will have any locked
> flags cleared, have no pagetables, and any userfaultfds that were
> watching that range will continue watching it.
> 
> For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
> the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> equally sized VMAs where the destination contains the PTEs of the source.
> 
> We hope to use this in Chrome OS where with userfaultfd we could write
> an anonymous mapping to disk without having to STOP the process or worry
> about VMA permission changes.
> 
> This feature also has a use case in Android, Lokesh Gidra has said
> that "As part of using userfaultfd for GC, We'll have to move the physical
> pages of the java heap to a separate location. For this purpose mremap
> will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> heap, its virtual mapping will be removed as well. Therefore, we'll
> require performing mmap immediately after. This is not only time consuming
> but also opens a time window where a native thread may call mmap and
> reserve the java heap's address range for its own usage. This flag
> solves the problem."
> 
>   v5 -> v6:
>     - Code cleanup suggested by Kirill.
> 
>   v4 -> v5:
>     - Correct commit message to more accurately reflect the behavior.
>     - Clear VM_LOCKED and VM_LOCKEDONFAULT on the old vma.
>            
> Signed-off-by: Brian Geffon <bgeffon@google.com>
> ---
>  include/uapi/linux/mman.h |   5 +-
>  mm/mremap.c               | 103 ++++++++++++++++++++++++++++++--------
>  2 files changed, 85 insertions(+), 23 deletions(-)
> 
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index fc1a64c3447b..923cc162609c 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -5,8 +5,9 @@
>  #include <asm/mman.h>
>  #include <asm-generic/hugetlb_encode.h>
>  
> -#define MREMAP_MAYMOVE	1
> -#define MREMAP_FIXED	2
> +#define MREMAP_MAYMOVE		1
> +#define MREMAP_FIXED		2
> +#define MREMAP_DONTUNMAP	4
>  
>  #define OVERCOMMIT_GUESS		0
>  #define OVERCOMMIT_ALWAYS		1
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1fc8a29fbe3f..fa27103502c5 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -318,8 +318,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  static unsigned long move_vma(struct vm_area_struct *vma,
>  		unsigned long old_addr, unsigned long old_len,
>  		unsigned long new_len, unsigned long new_addr,
> -		bool *locked, struct vm_userfaultfd_ctx *uf,
> -		struct list_head *uf_unmap)
> +		bool *locked, unsigned long flags,
> +		struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct vm_area_struct *new_vma;
> @@ -408,11 +408,46 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  	if (unlikely(vma->vm_flags & VM_PFNMAP))
>  		untrack_pfn_moved(vma);
>  
> +	if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
> +		if (vm_flags & VM_ACCOUNT) {
> +			/* Always put back VM_ACCOUNT since we won't unmap */
> +			vma->vm_flags |= VM_ACCOUNT;
> +
> +			vm_acct_memory(vma_pages(new_vma));
> +		}
> +
> +		/*
> +		 * locked_vm accounting: if the mapping remained the same size
> +		 * it will have just moved and we don't need to touch locked_vm
> +		 * because we skip the do_unmap. If the mapping shrunk before
> +		 * being moved then the do_unmap on that portion will have
> +		 * adjusted vm_locked. Only if the mapping grows do we need to
> +		 * do something special; the reason is locked_vm only accounts
> +		 * for old_len, but we're now adding new_len - old_len locked
> +		 * bytes to the new mapping.
> +		 */
> +		if (vm_flags & VM_LOCKED && new_len > old_len) {
> +			mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;
> +			*locked = true;
> +		}
> +
> +		/* We always clear VM_LOCKED[ONFAULT] on the old vma */
> +		vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> +
> +		goto out;
> +	}
> +
>  	if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
>  		/* OOM: unable to split vma, just get accounts right */
>  		vm_unacct_memory(excess >> PAGE_SHIFT);
>  		excess = 0;
>  	}
> +
> +	if (vm_flags & VM_LOCKED) {
> +		mm->locked_vm += new_len >> PAGE_SHIFT;
> +		*locked = true;
> +	}
> +out:
>  	mm->hiwater_vm = hiwater_vm;
>  
>  	/* Restore VM_ACCOUNT if one or two pieces of vma left */
> @@ -422,16 +457,12 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  			vma->vm_next->vm_flags |= VM_ACCOUNT;
>  	}
>  
> -	if (vm_flags & VM_LOCKED) {
> -		mm->locked_vm += new_len >> PAGE_SHIFT;
> -		*locked = true;
> -	}
> -
>  	return new_addr;
>  }
>  
>  static struct vm_area_struct *vma_to_resize(unsigned long addr,
> -	unsigned long old_len, unsigned long new_len, unsigned long *p)
> +	unsigned long old_len, unsigned long new_len, unsigned long flags,
> +	unsigned long *p)
>  {
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = find_vma(mm, addr);
> @@ -453,6 +484,10 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> +	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> +			vma->vm_flags & VM_SHARED))
> +		return ERR_PTR(-EINVAL);
> +
>  	if (is_vm_hugetlb_page(vma))
>  		return ERR_PTR(-EINVAL);
>  
> @@ -497,7 +532,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
>  
>  static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  		unsigned long new_addr, unsigned long new_len, bool *locked,
> -		struct vm_userfaultfd_ctx *uf,
> +		unsigned long flags, struct vm_userfaultfd_ctx *uf,
>  		struct list_head *uf_unmap_early,
>  		struct list_head *uf_unmap)
>  {
> @@ -505,7 +540,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	struct vm_area_struct *vma;
>  	unsigned long ret = -EINVAL;
>  	unsigned long charged = 0;
> -	unsigned long map_flags;
> +	unsigned long map_flags = 0;
>  
>  	if (offset_in_page(new_addr))
>  		goto out;
> @@ -534,9 +569,11 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
>  		return -ENOMEM;
>  
> -	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
> -	if (ret)
> -		goto out;
> +	if (flags & MREMAP_FIXED) {
> +		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
> +		if (ret)
> +			goto out;
> +	}
>  
>  	if (old_len >= new_len) {
>  		ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
> @@ -545,13 +582,26 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  		old_len = new_len;
>  	}
>  
> -	vma = vma_to_resize(addr, old_len, new_len, &charged);
> +	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);
>  	if (IS_ERR(vma)) {
>  		ret = PTR_ERR(vma);
>  		goto out;
>  	}
>  
> -	map_flags = MAP_FIXED;
> +	/*
> +	 * MREMAP_DONTUNMAP expands by new_len - (new_len - old_len), we will
> +	 * check that we can expand by new_len and vma_to_resize will handle
> +	 * the vma growing which is (new_len - old_len).
> +	 */

< Sorry for delay. >

I have hard time understanding the case when new_len != old_len.

Correct me if I'm wrong, but looks like that you change the size of old
mapping to be the new_len and then create a new of the same new_len.

This doesn't look right to me.

In my opinion, MREMAP_DONTUNMAP has to leave the old mapping intact. And
create the new mapping adjusted to the new_len.

Other option is to force new_len == old_len if MREMAP_DONTUNMAP is
specified. It would simplify the implementation. And I don't see why
anybody would really want anything else.

> +	if (flags & MREMAP_DONTUNMAP &&
> +		!may_expand_vm(mm, vma->vm_flags, new_len >> PAGE_SHIFT)) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	if (flags & MREMAP_FIXED)
> +		map_flags |= MAP_FIXED;
> +
>  	if (vma->vm_flags & VM_MAYSHARE)
>  		map_flags |= MAP_SHARED;
>  
> @@ -561,10 +611,16 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	if (offset_in_page(ret))
>  		goto out1;
>  
> -	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, uf,
> +	/* We got a new mapping */
> +	if (!(flags & MREMAP_FIXED))
> +		new_addr = ret;
> +
> +	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, flags, uf,
>  		       uf_unmap);
> +
>  	if (!(offset_in_page(ret)))
>  		goto out;

Not related to the effort directly, but do we really use offset_in_page()
as a substitute for IS_ERR() here. That's disgusting.

> +
>  out1:
>  	vm_unacct_memory(charged);
>  
> @@ -609,12 +665,16 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	addr = untagged_addr(addr);
>  	new_addr = untagged_addr(new_addr);
>  
> -	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> +	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
>  		return ret;
>  
>  	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
>  		return ret;
>  
> +	/* MREMAP_DONTUNMAP is always a move */
> +	if (flags & MREMAP_DONTUNMAP && !(flags & MREMAP_MAYMOVE))
> +		return ret;
> +
>  	if (offset_in_page(addr))
>  		return ret;
>  
> @@ -632,9 +692,10 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	if (down_write_killable(&current->mm->mmap_sem))
>  		return -EINTR;
>  
> -	if (flags & MREMAP_FIXED) {
> +	if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP)) {
>  		ret = mremap_to(addr, old_len, new_addr, new_len,
> -				&locked, &uf, &uf_unmap_early, &uf_unmap);
> +				&locked, flags, &uf, &uf_unmap_early,
> +				&uf_unmap);
>  		goto out;
>  	}
>  
> @@ -662,7 +723,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	/*
>  	 * Ok, we need to grow..
>  	 */
> -	vma = vma_to_resize(addr, old_len, new_len, &charged);
> +	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);
>  	if (IS_ERR(vma)) {
>  		ret = PTR_ERR(vma);
>  		goto out;
> @@ -712,7 +773,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  		}
>  
>  		ret = move_vma(vma, addr, old_len, new_len, new_addr,
> -			       &locked, &uf, &uf_unmap);
> +			       &locked, flags, &uf, &uf_unmap);
>  	}
>  out:
>  	if (offset_in_page(ret)) {
> -- 
> 2.25.0.265.gbab2e86ba0-goog
>

Minchan Kim Feb. 20, 2020, 5:15 p.m. UTC | #5

On Tue, Feb 18, 2020 at 09:32:20AM -0800, Brian Geffon wrote:
> When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
> set, the source mapping will not be removed. The remap operation
> will be performed as it would have been normally by moving over the
> page tables to the new mapping. The old vma will have any locked
> flags cleared, have no pagetables, and any userfaultfds that were
> watching that range will continue watching it.
> 
> For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
> the mremap() call to fail. Because MREMAP_DONTUNMAP always results in moving
> a VMA you MUST use the MREMAP_MAYMOVE flag. The final result is two
> equally sized VMAs where the destination contains the PTEs of the source.
> 
> We hope to use this in Chrome OS where with userfaultfd we could write
> an anonymous mapping to disk without having to STOP the process or worry
> about VMA permission changes.
> 
> This feature also has a use case in Android, Lokesh Gidra has said
> that "As part of using userfaultfd for GC, We'll have to move the physical
> pages of the java heap to a separate location. For this purpose mremap
> will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
> heap, its virtual mapping will be removed as well. Therefore, we'll
> require performing mmap immediately after. This is not only time consuming
> but also opens a time window where a native thread may call mmap and
> reserve the java heap's address range for its own usage. This flag
> solves the problem."
> 
>   v5 -> v6:
>     - Code cleanup suggested by Kirill.
> 
>   v4 -> v5:
>     - Correct commit message to more accurately reflect the behavior.
>     - Clear VM_LOCKED and VM_LOCKEDONFAULT on the old vma.
>            
> Signed-off-by: Brian Geffon <bgeffon@google.com>
> ---
>  include/uapi/linux/mman.h |   5 +-
>  mm/mremap.c               | 103 ++++++++++++++++++++++++++++++--------
>  2 files changed, 85 insertions(+), 23 deletions(-)
> 
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index fc1a64c3447b..923cc162609c 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -5,8 +5,9 @@
>  #include <asm/mman.h>
>  #include <asm-generic/hugetlb_encode.h>
>  
> -#define MREMAP_MAYMOVE	1
> -#define MREMAP_FIXED	2
> +#define MREMAP_MAYMOVE		1
> +#define MREMAP_FIXED		2
> +#define MREMAP_DONTUNMAP	4
>  
>  #define OVERCOMMIT_GUESS		0
>  #define OVERCOMMIT_ALWAYS		1
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1fc8a29fbe3f..fa27103502c5 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -318,8 +318,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  static unsigned long move_vma(struct vm_area_struct *vma,
>  		unsigned long old_addr, unsigned long old_len,
>  		unsigned long new_len, unsigned long new_addr,
> -		bool *locked, struct vm_userfaultfd_ctx *uf,
> -		struct list_head *uf_unmap)
> +		bool *locked, unsigned long flags,
> +		struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct vm_area_struct *new_vma;
> @@ -408,11 +408,46 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  	if (unlikely(vma->vm_flags & VM_PFNMAP))
>  		untrack_pfn_moved(vma);
>  
> +	if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
> +		if (vm_flags & VM_ACCOUNT) {
> +			/* Always put back VM_ACCOUNT since we won't unmap */
> +			vma->vm_flags |= VM_ACCOUNT;
> +
> +			vm_acct_memory(vma_pages(new_vma));
> +		}
> +
> +		/*
> +		 * locked_vm accounting: if the mapping remained the same size
> +		 * it will have just moved and we don't need to touch locked_vm
> +		 * because we skip the do_unmap. If the mapping shrunk before
> +		 * being moved then the do_unmap on that portion will have
> +		 * adjusted vm_locked. Only if the mapping grows do we need to
> +		 * do something special; the reason is locked_vm only accounts
> +		 * for old_len, but we're now adding new_len - old_len locked
> +		 * bytes to the new mapping.
> +		 */
> +		if (vm_flags & VM_LOCKED && new_len > old_len) {
> +			mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;
> +			*locked = true;
> +		}
> +
> +		/* We always clear VM_LOCKED[ONFAULT] on the old vma */
> +		vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> +
> +		goto out;
> +	}
> +
>  	if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
>  		/* OOM: unable to split vma, just get accounts right */
>  		vm_unacct_memory(excess >> PAGE_SHIFT);
>  		excess = 0;
>  	}
> +
> +	if (vm_flags & VM_LOCKED) {
> +		mm->locked_vm += new_len >> PAGE_SHIFT;
> +		*locked = true;
> +	}
> +out:
>  	mm->hiwater_vm = hiwater_vm;
>  
>  	/* Restore VM_ACCOUNT if one or two pieces of vma left */
> @@ -422,16 +457,12 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  			vma->vm_next->vm_flags |= VM_ACCOUNT;
>  	}
>  
> -	if (vm_flags & VM_LOCKED) {
> -		mm->locked_vm += new_len >> PAGE_SHIFT;
> -		*locked = true;
> -	}
> -
>  	return new_addr;
>  }
>  
>  static struct vm_area_struct *vma_to_resize(unsigned long addr,
> -	unsigned long old_len, unsigned long new_len, unsigned long *p)
> +	unsigned long old_len, unsigned long new_len, unsigned long flags,
> +	unsigned long *p)
>  {
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = find_vma(mm, addr);
> @@ -453,6 +484,10 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> +	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> +			vma->vm_flags & VM_SHARED))
> +		return ERR_PTR(-EINVAL);
> +
>  	if (is_vm_hugetlb_page(vma))
>  		return ERR_PTR(-EINVAL);
>  
> @@ -497,7 +532,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
>  
>  static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  		unsigned long new_addr, unsigned long new_len, bool *locked,
> -		struct vm_userfaultfd_ctx *uf,
> +		unsigned long flags, struct vm_userfaultfd_ctx *uf,
>  		struct list_head *uf_unmap_early,
>  		struct list_head *uf_unmap)
>  {
> @@ -505,7 +540,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	struct vm_area_struct *vma;
>  	unsigned long ret = -EINVAL;
>  	unsigned long charged = 0;
> -	unsigned long map_flags;
> +	unsigned long map_flags = 0;
>  
>  	if (offset_in_page(new_addr))
>  		goto out;
> @@ -534,9 +569,11 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
>  		return -ENOMEM;
>  
> -	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
> -	if (ret)
> -		goto out;
> +	if (flags & MREMAP_FIXED) {
> +		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
> +		if (ret)
> +			goto out;
> +	}
>  
>  	if (old_len >= new_len) {
>  		ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);

So we remove old map here.

> @@ -545,13 +582,26 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  		old_len = new_len;
>  	}
>  
> -	vma = vma_to_resize(addr, old_len, new_len, &charged);
> +	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);

And here we got error if the addr is in non-anonymous-private vma so the
syscall will fail but old vma is gone? I guess it's not your intention?

>  	if (IS_ERR(vma)) {
>  		ret = PTR_ERR(vma);
>  		goto out;
>  	}
>  
> -	map_flags = MAP_FIXED;
> +	/*
> +	 * MREMAP_DONTUNMAP expands by new_len - (new_len - old_len), we will
> +	 * check that we can expand by new_len and vma_to_resize will handle
> +	 * the vma growing which is (new_len - old_len).
> +	 */
> +	if (flags & MREMAP_DONTUNMAP &&
> +		!may_expand_vm(mm, vma->vm_flags, new_len >> PAGE_SHIFT)) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	if (flags & MREMAP_FIXED)
> +		map_flags |= MAP_FIXED;
> +
>  	if (vma->vm_flags & VM_MAYSHARE)
>  		map_flags |= MAP_SHARED;
>  
> @@ -561,10 +611,16 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	if (offset_in_page(ret))
>  		goto out1;
>  
> -	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, uf,
> +	/* We got a new mapping */
> +	if (!(flags & MREMAP_FIXED))
> +		new_addr = ret;
> +
> +	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, flags, uf,
>  		       uf_unmap);
> +
>  	if (!(offset_in_page(ret)))
>  		goto out;
> +
>  out1:
>  	vm_unacct_memory(charged);
>  
> @@ -609,12 +665,16 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	addr = untagged_addr(addr);
>  	new_addr = untagged_addr(new_addr);
>  
> -	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> +	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
>  		return ret;
>  
>  	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
>  		return ret;
>  
> +	/* MREMAP_DONTUNMAP is always a move */
> +	if (flags & MREMAP_DONTUNMAP && !(flags & MREMAP_MAYMOVE))
> +		return ret;
> +
>  	if (offset_in_page(addr))
>  		return ret;
>  
> @@ -632,9 +692,10 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	if (down_write_killable(&current->mm->mmap_sem))
>  		return -EINTR;
>  
> -	if (flags & MREMAP_FIXED) {
> +	if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP)) {
>  		ret = mremap_to(addr, old_len, new_addr, new_len,
> -				&locked, &uf, &uf_unmap_early, &uf_unmap);
> +				&locked, flags, &uf, &uf_unmap_early,
> +				&uf_unmap);
>  		goto out;
>  	}
>  
> @@ -662,7 +723,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  	/*
>  	 * Ok, we need to grow..
>  	 */
> -	vma = vma_to_resize(addr, old_len, new_len, &charged);
> +	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);
>  	if (IS_ERR(vma)) {
>  		ret = PTR_ERR(vma);
>  		goto out;
> @@ -712,7 +773,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
>  		}
>  
>  		ret = move_vma(vma, addr, old_len, new_len, new_addr,
> -			       &locked, &uf, &uf_unmap);
> +			       &locked, flags, &uf, &uf_unmap);
>  	}
>  out:
>  	if (offset_in_page(ret)) {
> -- 
> 2.25.0.265.gbab2e86ba0-goog
>

Brian Geffon Feb. 20, 2020, 6:36 p.m. UTC | #6

Hi Minchan,

> And here we got error if the addr is in non-anonymous-private vma so the
> syscall will fail but old vma is gone? I guess it's not your intention?

This is exactly what happens today in several situations, because
vma_to_resize is called unconditionally. For example if the old vma
has VM_HUGETLB and old_len < new_len it would have unmapped a portion
and then in vma_to_resize returned -EINVAL, similarly when old_len = 0
with a non-sharable mapping it will have called do_munmap only to fail
in vma_to_resize, if the vma has VM_DONTEXPAND set and you shrink the
size with old_len < new_len it would return -EFAULT after having done
the unmap on the decreased portion. So I followed the pattern to keep
the change simple and maintain consistency with existing behavior.

But with that being said, Kirill made the point that resizing a VMA
while also using MREMAP_DONTUNMAP doesn't have any clear use case and
I agree with that, I'm unable to think of a situation where you'd want
to resize a VMA and use MREMAP_DONTUNMAP. So I'm tempted to mail a new
version which returns -EINVAL if old_len != new_len that would resolve
this concern here as nothing would be unmapped ever at the old
position add it would clean up the change to very few lines of code.

What do you think?

Thank you for taking the time to review.

Brian

Brian Geffon Feb. 20, 2020, 6:45 p.m. UTC | #7

Sorry I should clarify that this is the behavior with MREMAP_FIXED is
used, and to expand on that, it would potentially even have unmapped
the region at the destination address and then fail in vma_to_resize
too, so I hope that explains why that check landed there. But should
these situations be considered a bug?

Brian

On Thu, Feb 20, 2020 at 10:36 AM Brian Geffon <bgeffon@google.com> wrote:
>
> Hi Minchan,
>
> > And here we got error if the addr is in non-anonymous-private vma so the
> > syscall will fail but old vma is gone? I guess it's not your intention?
>
> This is exactly what happens today in several situations, because
> vma_to_resize is called unconditionally. For example if the old vma
> has VM_HUGETLB and old_len < new_len it would have unmapped a portion
> and then in vma_to_resize returned -EINVAL, similarly when old_len = 0
> with a non-sharable mapping it will have called do_munmap only to fail
> in vma_to_resize, if the vma has VM_DONTEXPAND set and you shrink the
> size with old_len < new_len it would return -EFAULT after having done
> the unmap on the decreased portion. So I followed the pattern to keep
> the change simple and maintain consistency with existing behavior.
>
> But with that being said, Kirill made the point that resizing a VMA
> while also using MREMAP_DONTUNMAP doesn't have any clear use case and
> I agree with that, I'm unable to think of a situation where you'd want
> to resize a VMA and use MREMAP_DONTUNMAP. So I'm tempted to mail a new
> version which returns -EINVAL if old_len != new_len that would resolve
> this concern here as nothing would be unmapped ever at the old
> position add it would clean up the change to very few lines of code.
>
> What do you think?
>
> Thank you for taking the time to review.
>
> Brian

Minchan Kim Feb. 20, 2020, 7:14 p.m. UTC | #8

On Thu, Feb 20, 2020 at 10:36:38AM -0800, Brian Geffon wrote:
> Hi Minchan,
> 
> > And here we got error if the addr is in non-anonymous-private vma so the
> > syscall will fail but old vma is gone? I guess it's not your intention?
> 
> This is exactly what happens today in several situations, because
> vma_to_resize is called unconditionally. For example if the old vma
> has VM_HUGETLB and old_len < new_len it would have unmapped a portion
> and then in vma_to_resize returned -EINVAL, similarly when old_len = 0
> with a non-sharable mapping it will have called do_munmap only to fail
> in vma_to_resize, if the vma has VM_DONTEXPAND set and you shrink the
> size with old_len < new_len it would return -EFAULT after having done
> the unmap on the decreased portion. So I followed the pattern to keep
> the change simple and maintain consistency with existing behavior.

Fair enough. It seems to be very old existing behavior but man page
never mention about it. :(

> 
> But with that being said, Kirill made the point that resizing a VMA
> while also using MREMAP_DONTUNMAP doesn't have any clear use case and
> I agree with that, I'm unable to think of a situation where you'd want
> to resize a VMA and use MREMAP_DONTUNMAP. So I'm tempted to mail a new
> version which returns -EINVAL if old_len != new_len that would resolve
> this concern here as nothing would be unmapped ever at the old
> position add it would clean up the change to very few lines of code.
> 
> What do you think?

Agreed. That makes code more simple/clean.

Thanks!

Brian Geffon Feb. 20, 2020, 11:55 p.m. UTC | #9

Hi Kirill,

> I have hard time understanding the case when new_len != old_len.
>
> Correct me if I'm wrong, but looks like that you change the size of old
> mapping to be the new_len and then create a new of the same new_len.
>
> This doesn't look right to me.
>
> In my opinion, MREMAP_DONTUNMAP has to leave the old mapping intact. And
> create the new mapping adjusted to the new_len.
>
> Other option is to force new_len == old_len if MREMAP_DONTUNMAP is
> specified. It would simplify the implementation. And I don't see why
> anybody would really want anything else.

I had been approaching this as, "do what mremap would have done in
this situation except skip the last step." Meaning, whatever the final
state of the old mapping was MREMAP_DONTUNMAP meant that you should
just not do the unmap operation on the old mapping at the end. But I
understand why it's confusing, especially when in the case of the VMA
growing you're left with the old vma of size old_len and the new_vma
of size new_len but only containing old_len worth of pages.
Personally, I don't think this is a problem having that behavior
because it can be documented and it just adds a small amount of
flexibility.

Nonetheless, I agree with you and I also cannot come up with a
situation where you'd actually want to do this so I'm willing to
restrict it to old_len == new_len and return -EINVAL if not, it
simplifies it a bit and accounting becomes a easier because the
outcome is always the same two mappings of size old_len and the size
of the locked_vm never changes. We can always allow the resize
operation later if there becomes a need. If everyone is okay with this
restriction I can send a new patch.

Thank you again,
Brian

Kirill A . Shutemov Feb. 21, 2020, 12:23 p.m. UTC | #10

On Thu, Feb 20, 2020 at 03:55:53PM -0800, Brian Geffon wrote:
> Hi Kirill,
> 
> > I have hard time understanding the case when new_len != old_len.
> >
> > Correct me if I'm wrong, but looks like that you change the size of old
> > mapping to be the new_len and then create a new of the same new_len.
> >
> > This doesn't look right to me.
> >
> > In my opinion, MREMAP_DONTUNMAP has to leave the old mapping intact. And
> > create the new mapping adjusted to the new_len.
> >
> > Other option is to force new_len == old_len if MREMAP_DONTUNMAP is
> > specified. It would simplify the implementation. And I don't see why
> > anybody would really want anything else.
> 
> I had been approaching this as, "do what mremap would have done in
> this situation except skip the last step." Meaning, whatever the final
> state of the old mapping was MREMAP_DONTUNMAP meant that you should
> just not do the unmap operation on the old mapping at the end. But I
> understand why it's confusing, especially when in the case of the VMA
> growing you're left with the old vma of size old_len and the new_vma
> of size new_len but only containing old_len worth of pages.
> Personally, I don't think this is a problem having that behavior
> because it can be documented and it just adds a small amount of
> flexibility.
> 
> Nonetheless, I agree with you and I also cannot come up with a
> situation where you'd actually want to do this so I'm willing to
> restrict it to old_len == new_len and return -EINVAL if not, it
> simplifies it a bit and accounting becomes a easier because the
> outcome is always the same two mappings of size old_len and the size
> of the locked_vm never changes. We can always allow the resize
> operation later if there becomes a need. If everyone is okay with this
> restriction I can send a new patch.

If anyone would want to chagne size it can be archived by followup
mremap() operations. There's no need in one-shot operation.

[v6,1/2] mm: Add MREMAP_DONTUNMAP to mremap().

Commit Message

Comments

Patch