
[v4,23/66] mm/mmap: Use advanced maple tree API for mmap_region()

Message ID 20211201142918.921493-24-Liam.Howlett@oracle.com (mailing list archive)
State New
Series: Introducing the Maple Tree

Commit Message

Liam R. Howlett Dec. 1, 2021, 2:30 p.m. UTC
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Changing mmap_region() to use the maple tree state and the advanced
maple tree interface allows for a lot less tree walking.

This change removes the last caller of munmap_vma_range(), so drop this
unused function.

Add vma_expand() to expand a VMA if possible by doing the necessary
hugepage check, uprobe_munmap of files, dcache flush, the modifications, and
then undoing the detaches, etc.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 227 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 175 insertions(+), 52 deletions(-)
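
For orientation, the heart of the new mmap_region() flow, condensed from the
hunk below (a reading sketch only; the can_vma_merge_before()/_after()
eligibility checks, error handling and the non-merge path are elided):

	MA_STATE(mas, &mm->mm_mt, addr, end - 1);

	/* One walk of the maple state finds both neighbours of [addr, end). */
	next = mas_next(&mas, ULONG_MAX);
	prev = mas_prev(&mas, 0);

	/* Widen the target range over next/prev where merging is allowed. */
	if (next && next->vm_start == end /* && mergeable */) {
		merge_end = next->vm_end;
		vma = next;
		vm_pgoff = next->vm_pgoff - pglen;
	}
	if (prev && prev->vm_end == addr /* && mergeable */) {
		merge_start = prev->vm_start;
		vma = prev;
		vm_pgoff = prev->vm_pgoff;
	}

	/* Expand in place instead of going through vma_merge()/__vma_adjust(). */
	if (vma && !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next))
		goto expanded;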

Comments

Vlastimil Babka Jan. 17, 2022, 4:38 p.m. UTC | #1
On 12/1/21 15:30, Liam Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> 
> Changing mmap_region() to use the maple tree state and the advanced
> maple tree interface allows for a lot less tree walking.
> 
> This change removes the last caller of munmap_vma_range(), so drop this
> unused function.
> 
> Add vma_expand() to expand a VMA if possible by doing the necessary
> hugepage check, uprobe_munmap of files, dcache flush, the modifications, and
> then undoing the detaches, etc.
> 
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
>  mm/mmap.c | 227 +++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 175 insertions(+), 52 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c06c5b850e1e..b0b7e327bf8b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -496,29 +496,6 @@ static inline struct vm_area_struct *__vma_next(struct mm_struct *mm,
>  	return vma->vm_next;
>  }
>  
> -/*
> - * munmap_vma_range() - munmap VMAs that overlap a range.
> - * @mm: The mm struct
> - * @start: The start of the range.
> - * @len: The length of the range.
> - * @pprev: pointer to the pointer that will be set to previous vm_area_struct
> - *
> - * Find all the vm_area_struct that overlap from @start to
> - * @end and munmap them.  Set @pprev to the previous vm_area_struct.
> - *
> - * Returns: -ENOMEM on munmap failure or 0 on success.
> - */
> -static inline int
> -munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
> -		 struct vm_area_struct **pprev, struct list_head *uf)
> -{
> -	// Needs optimization.
> -	while (range_has_overlap(mm, start, start + len, pprev))
> -		if (do_munmap(mm, start, len, uf))
> -			return -ENOMEM;
> -	return 0;
> -}
> -
>  static unsigned long count_vma_pages_range(struct mm_struct *mm,
>  		unsigned long addr, unsigned long end)
>  {
> @@ -619,6 +596,101 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
>  	mm->map_count++;
>  }
>  
> +/*
> + * vma_expand - Expand an existing VMA
> + * @mas: The maple state
> + * @vma: The vma to expand
> + * @start: The start of the vma
> + * @end: The exclusive end of the vma
> + */

Looks like this, similarly to the brk case, replaces one path calling
vma_merge->__vma_adjust() with something else. But this one is better
encapsulated and visible, so less likely to be forgotten in case something
changes. Would be even better if the brk case used it too :) seems like it
doesn't, at least as of this patch.

But it would be great to improve the documentation here - some params are
not documented, notably 'next', and I would explicitly state which scenarios
it does cover - i.e. vma_merge() lists 8 scenarios and I assume this can
handle a subset of those?
And scenarios not covered could be VM_WARN_ON'd?
Without such stated assumptions, it's hard/impossible to review both the
implementation against them, and the callers against them.

> +inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
> +		      unsigned long start, unsigned long end, pgoff_t pgoff,
> +		      struct vm_area_struct *next)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct address_space *mapping = NULL;
> +	struct rb_root_cached *root = NULL;
> +	struct anon_vma *anon_vma = vma->anon_vma;
> +	struct file *file = vma->vm_file;
> +	bool remove_next = false;
> +	int error;
> +
> +	if (next && (vma != next) && (end == next->vm_end)) {

For example here this suggests that the 'end > next->vm_end' case is
not covered? How do I know whether it's intended or a bug?
Liam R. Howlett Jan. 21, 2022, 6:11 p.m. UTC | #2
* Vlastimil Babka <vbabka@suse.cz> [220117 11:39]:
> On 12/1/21 15:30, Liam Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > 
> > Changing mmap_region() to use the maple tree state and the advanced
> > maple tree interface allows for a lot less tree walking.
> > 
> > This change removes the last caller of munmap_vma_range(), so drop this
> > unused function.
> > 
> > Add vma_expand() to expand a VMA if possible by doing the necessary
> > hugepage check, uprobe_munmap of files, dcache flush, the modifications, and
> > then undoing the detaches, etc.
> > 
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> >  mm/mmap.c | 227 +++++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 175 insertions(+), 52 deletions(-)
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c06c5b850e1e..b0b7e327bf8b 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -496,29 +496,6 @@ static inline struct vm_area_struct *__vma_next(struct mm_struct *mm,
> >  	return vma->vm_next;
> >  }
> >  
> > -/*
> > - * munmap_vma_range() - munmap VMAs that overlap a range.
> > - * @mm: The mm struct
> > - * @start: The start of the range.
> > - * @len: The length of the range.
> > - * @pprev: pointer to the pointer that will be set to previous vm_area_struct
> > - *
> > - * Find all the vm_area_struct that overlap from @start to
> > - * @end and munmap them.  Set @pprev to the previous vm_area_struct.
> > - *
> > - * Returns: -ENOMEM on munmap failure or 0 on success.
> > - */
> > -static inline int
> > -munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
> > -		 struct vm_area_struct **pprev, struct list_head *uf)
> > -{
> > -	// Needs optimization.
> > -	while (range_has_overlap(mm, start, start + len, pprev))
> > -		if (do_munmap(mm, start, len, uf))
> > -			return -ENOMEM;
> > -	return 0;
> > -}
> > -
> >  static unsigned long count_vma_pages_range(struct mm_struct *mm,
> >  		unsigned long addr, unsigned long end)
> >  {
> > @@ -619,6 +596,101 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
> >  	mm->map_count++;
> >  }
> >  
> > +/*
> > + * vma_expand - Expand an existing VMA
> > + * @mas: The maple state
> > + * @vma: The vma to expand
> > + * @start: The start of the vma
> > + * @end: The exclusive end of the vma
> > + */
> 
> Looks like this, similarly to the brk case, replaces one path calling
> vma_merge->__vma_adjust() with something else. But this one is better
> encapsulated and visible, so less likely to be forgotten in case something
> changes. Would be even better if the brk case used it too :) seems like it
> doesn't, at least as of this patch.
> 
> But it would be great to improve the documentation here - some params are
> not documented, notably 'next', and I would explicitly state which scenarios
> it does cover - i.e. vma_merge() lists 8 scenarios and I assume this can
> handle a subset of those?

Yes, I will add the missing parameters to the documentation.  I will
also add notes about how this handles expanding the vma and possibly merging
next, but does not handle other scenarios.
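
For illustration, that expanded kernel-doc could take roughly the following
shape (a sketch only; the @pgoff/@next wording and the scenario note are
assumptions, not the text that was eventually posted):

/*
 * vma_expand - Expand an existing VMA
 * @mas: The maple state, pointing at the VMA being expanded
 * @vma: The vma to expand
 * @start: The start of the vma
 * @end: The exclusive end of the vma
 * @pgoff: The page offset of @vma
 * @next: The vma after @vma, or NULL
 *
 * Expand @vma to cover @start to @end.  @vma may grow at either end and may
 * absorb @next, but only when @end == @next->vm_end; every other vma_merge()
 * scenario remains the caller's responsibility.
 *
 * Returns: 0 on success, an error code (e.g. -ENOMEM) otherwise.
 */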

> And scenarios not covered could be VM_WARN_ON'd?
> Without such stated assumptions, it's hard/impossible to review both the
> implementation against them, and the callers against them.

Okay, I'll add some VM_WARN_ON's too.

> 
> > +inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
> > +		      unsigned long start, unsigned long end, pgoff_t pgoff,
> > +		      struct vm_area_struct *next)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct address_space *mapping = NULL;
> > +	struct rb_root_cached *root = NULL;
> > +	struct anon_vma *anon_vma = vma->anon_vma;
> > +	struct file *file = vma->vm_file;
> > +	bool remove_next = false;
> > +	int error;
> > +
> > +	if (next && (vma != next) && (end == next->vm_end)) {
> 
> For example here this suggests that the 'end > next->vm_end' case is
> not covered? How do I know whether it's intended or a bug?
> 

Okay, thanks.  I will have a look to see how many I can come up with.
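
For illustration, such scenario assertions might look roughly like this,
placed after the remove_next check at the top of vma_expand() (the placement
and exact conditions here are assumptions, not the follow-up that was
actually sent):

	/* Overwriting any part of @next without removing it is not handled. */
	VM_WARN_ON(next && !remove_next && next != vma && end > next->vm_start);
	/* Only expansion is handled; @vma must not shrink. */
	VM_WARN_ON(vma->vm_start < start || vma->vm_end > end);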

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index c06c5b850e1e..b0b7e327bf8b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -496,29 +496,6 @@  static inline struct vm_area_struct *__vma_next(struct mm_struct *mm,
 	return vma->vm_next;
 }
 
-/*
- * munmap_vma_range() - munmap VMAs that overlap a range.
- * @mm: The mm struct
- * @start: The start of the range.
- * @len: The length of the range.
- * @pprev: pointer to the pointer that will be set to previous vm_area_struct
- *
- * Find all the vm_area_struct that overlap from @start to
- * @end and munmap them.  Set @pprev to the previous vm_area_struct.
- *
- * Returns: -ENOMEM on munmap failure or 0 on success.
- */
-static inline int
-munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
-		 struct vm_area_struct **pprev, struct list_head *uf)
-{
-	// Needs optimization.
-	while (range_has_overlap(mm, start, start + len, pprev))
-		if (do_munmap(mm, start, len, uf))
-			return -ENOMEM;
-	return 0;
-}
-
 static unsigned long count_vma_pages_range(struct mm_struct *mm,
 		unsigned long addr, unsigned long end)
 {
@@ -619,6 +596,101 @@  static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	mm->map_count++;
 }
 
+/*
+ * vma_expand - Expand an existing VMA
+ * @mas: The maple state
+ * @vma: The vma to expand
+ * @start: The start of the vma
+ * @end: The exclusive end of the vma
+ */
+inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
+		      unsigned long start, unsigned long end, pgoff_t pgoff,
+		      struct vm_area_struct *next)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct address_space *mapping = NULL;
+	struct rb_root_cached *root = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
+	struct file *file = vma->vm_file;
+	bool remove_next = false;
+	int error;
+
+	if (next && (vma != next) && (end == next->vm_end)) {
+		remove_next = true;
+		if (next->anon_vma && !vma->anon_vma) {
+			vma->anon_vma = next->anon_vma;
+			error = anon_vma_clone(vma, next);
+			if (error)
+				return error;
+		}
+	}
+
+	vma_adjust_trans_huge(vma, start, end, 0);
+
+	if (file) {
+		mapping = file->f_mapping;
+		root = &mapping->i_mmap;
+		uprobe_munmap(vma, vma->vm_start, vma->vm_end);
+		i_mmap_lock_write(mapping);
+		flush_dcache_mmap_lock(mapping);
+		vma_interval_tree_remove(vma, root);
+	} else if (anon_vma) {
+		anon_vma_lock_write(anon_vma);
+		anon_vma_interval_tree_pre_update_vma(vma);
+	}
+
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_pgoff = pgoff;
+	/* Note: mas must be pointing to the expanding VMA */
+	vma_mas_store(vma, mas);
+
+	if (file) {
+		vma_interval_tree_insert(vma, root);
+		flush_dcache_mmap_unlock(mapping);
+	}
+
+	/* Expanding over the next vma */
+	if (remove_next) {
+		/* Remove from mm linked list - also updates highest_vm_end */
+		__vma_unlink_list(mm, next);
+
+		/* Kill the cache */
+		vmacache_invalidate(mm);
+
+		if (file)
+			__remove_shared_vm_struct(next, file, mapping);
+
+	} else if (!next) {
+		mm->highest_vm_end = vm_end_gap(vma);
+	}
+
+	if (anon_vma) {
+		anon_vma_interval_tree_post_update_vma(vma);
+		anon_vma_unlock_write(anon_vma);
+	}
+
+	if (file) {
+		i_mmap_unlock_write(mapping);
+		uprobe_mmap(vma);
+	}
+
+	if (remove_next) {
+		if (file) {
+			uprobe_munmap(next, next->vm_start, next->vm_end);
+			fput(file);
+		}
+		if (next->anon_vma)
+			anon_vma_merge(vma, next);
+		mm->map_count--;
+		mpol_put(vma_policy(next));
+		vm_area_free(next);
+	}
+
+	validate_mm(mm);
+	return 0;
+}
+
 /*
  * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
  * is already present in an i_mmap tree without adjusting the tree.
@@ -1598,9 +1670,15 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma, *prev, *merge;
-	int error;
+	struct vm_area_struct *vma = NULL;
+	struct vm_area_struct *prev, *next;
+	pgoff_t pglen = len >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned long end = addr + len;
+	unsigned long merge_start = addr, merge_end = end;
+	pgoff_t vm_pgoff;
+	int error;
+	MA_STATE(mas, &mm->mm_mt, addr, end - 1);
 
 	/* Check against address space limit. */
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
@@ -1610,16 +1688,17 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * MAP_FIXED may remove pages of mappings that intersects with
 		 * requested mapping. Account for the pages it would unmap.
 		 */
-		nr_pages = count_vma_pages_range(mm, addr, addr + len);
+		nr_pages = count_vma_pages_range(mm, addr, end);
 
 		if (!may_expand_vm(mm, vm_flags,
 					(len >> PAGE_SHIFT) - nr_pages))
 			return -ENOMEM;
 	}
 
-	/* Clear old maps, set up prev and uf */
-	if (munmap_vma_range(mm, addr, len, &prev, uf))
+	/* Unmap any existing mapping in the area */
+	if (do_munmap(mm, addr, len, uf))
 		return -ENOMEM;
+
 	/*
 	 * Private writable mapping: check memory availability
 	 */
@@ -1630,14 +1709,41 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	/*
-	 * Can we just expand an old mapping?
-	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
-	if (vma)
-		goto out;
+	next = mas_next(&mas, ULONG_MAX);
+	prev = mas_prev(&mas, 0);
+	if (vm_flags & VM_SPECIAL)
+		goto cannot_expand;
+
+	/* Attempt to expand an old mapping */
+	/* Check next */
+	if (next && next->vm_start == end && vma_policy(next) &&
+	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
+				 NULL_VM_UFFD_CTX)) {
+		merge_end = next->vm_end;
+		vma = next;
+		vm_pgoff = next->vm_pgoff - pglen;
+	}
 
+	/* Check prev */
+	if (prev && prev->vm_end == addr && !vma_policy(prev) &&
+	    can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
+				NULL_VM_UFFD_CTX)) {
+		merge_start = prev->vm_start;
+		vma = prev;
+		vm_pgoff = prev->vm_pgoff;
+	}
+
+
+	/* Actually expand, if possible */
+	if (vma &&
+	    !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
+		khugepaged_enter_vma_merge(prev, vm_flags);
+		goto expanded;
+	}
+
+	mas.index = addr;
+	mas.last = end - 1;
+cannot_expand:
 	/*
 	 * Determine the object being mapped and call the appropriate
 	 * specific mapper. the address has already been validated, but
@@ -1650,7 +1756,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	vma->vm_start = addr;
-	vma->vm_end = addr + len;
+	vma->vm_end = end;
 	vma->vm_flags = vm_flags;
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);
 	vma->vm_pgoff = pgoff;
@@ -1671,8 +1777,6 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		 *
 		 * Answer: Yes, several device drivers can do it in their
 		 *         f_op->mmap method. -DaveM
-		 * Bug: If addr is changed, prev, rb_link, rb_parent should
-		 *      be updated for vma_link()
 		 */
 		WARN_ON_ONCE(addr != vma->vm_start);
 
@@ -1681,23 +1785,31 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		/* If vm_flags changed after call_mmap(), we should try merge vma again
 		 * as we may succeed this time.
 		 */
-		if (unlikely(vm_flags != vma->vm_flags && prev)) {
-			merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
-			if (merge) {
+		if (unlikely(vm_flags != vma->vm_flags && prev &&
+			     prev->vm_end == addr && !vma_policy(prev) &&
+			     can_vma_merge_after(prev, vm_flags, NULL, file,
+						 pgoff, NULL_VM_UFFD_CTX))) {
+			merge_start = prev->vm_start;
+			vm_pgoff = prev->vm_pgoff;
+			if (!vma_expand(&mas, prev, merge_start, merge_end,
+					vm_pgoff, next)) {
 				/* ->mmap() can change vma->vm_file and fput the original file. So
 				 * fput the vma->vm_file here or we would add an extra fput for file
 				 * and cause general protection fault ultimately.
 				 */
 				fput(vma->vm_file);
 				vm_area_free(vma);
-				vma = merge;
-				/* Update vm_flags to pick up the change. */
+				vma = prev;
+				/* Update vm_flags and possible addr to pick up the change. We don't
+				 * warn here if addr changed as the vma is not linked by vma_link().
+				 */
+				addr = vma->vm_start;
 				vm_flags = vma->vm_flags;
 				goto unmap_writable;
 			}
 		}
 
+		mas_set(&mas, addr);
 		vm_flags = vma->vm_flags;
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
@@ -1716,20 +1828,35 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 			goto free_vma;
 	}
 
-	vma_link(mm, vma, prev);
+	if (vma->vm_file)
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	vma_mas_store(vma, &mas);
+	__vma_link_list(mm, vma, prev);
+	mm->map_count++;
+	if (vma->vm_file) {
+		if (vma->vm_flags & VM_SHARED)
+			mapping_allow_writable(vma->vm_file->f_mapping);
+
+		flush_dcache_mmap_lock(vma->vm_file->f_mapping);
+		vma_interval_tree_insert(vma, &vma->vm_file->f_mapping->i_mmap);
+		flush_dcache_mmap_unlock(vma->vm_file->f_mapping);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	}
+
 	/* Once vma denies write, undo our temporary denial count */
 unmap_writable:
 	if (file && vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
 	file = vma->vm_file;
-out:
+expanded:
 	perf_event_mmap(vma);
 
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
-					is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm))
+		    is_vm_hugetlb_page(vma) ||
+		    vma == get_gate_vma(current->mm))
 			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 		else
 			mm->locked_vm += (len >> PAGE_SHIFT);
@@ -2557,10 +2684,6 @@  int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	prev = vma->vm_prev;
 	/* we have start < vma->vm_end  */
 
-	/* if it doesn't overlap, we have nothing.. */
-	if (vma->vm_start >= end)
-		return 0;
-
 	/*
 	 * If we need to split any vma, do it now to save pain later.
 	 *