
[v5,15/70] kernel/fork: Use maple tree for dup_mmap() during forking

Message ID 20220202024137.2516438-16-Liam.Howlett@oracle.com (mailing list archive)
State New
Series Introducing the Maple Tree

Commit Message

Liam R. Howlett Feb. 2, 2022, 2:42 a.m. UTC
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The maple tree was already tracking VMAs in this function by an earlier
commit, but the rbtree iterator was being used to iterate the list.
Change the iterator to use a maple tree native iterator and switch to
the maple tree advanced API to avoid multiple walks of the tree during
insert operations.  Unexport the now-unused vma_store() function.

We track whether we need to free the VMAs and tree nodes through RCU
(i.e. whether there have been multiple threads that can see the mm_struct
simultaneously, for example via pthread_create(), ptrace() or reading
/proc/$pid/maps).  This setting is sticky because it's too tricky to
decide when it's safe to exit RCU mode.

For performance reasons we bulk allocate the maple tree nodes.  The node
calculations are done internally by the tree, using the VMA count and
assuming the worst-case node requirements.  The VM_DONTCOPY flag does
not allow for the most efficient copy method of the tree, so a bulk
loading algorithm is used instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h       |  2 --
 include/linux/sched/mm.h | 13 +++++++++++++
 kernel/fork.c            | 35 +++++++++++++++++++++++++++++------
 3 files changed, 42 insertions(+), 8 deletions(-)
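
For readers less familiar with the advanced maple tree API, the pattern
the commit message describes boils down to the sketch below.  It is
condensed from the diff at the end of this page; the helper name
dup_mmap_sketch() is made up for illustration, and the real dup_mmap()
also handles VM_DONTCOPY, accounting, anon_vma and page-table copying,
and the error paths.

/*
 * Condensed sketch only -- not the real dup_mmap().  It shows the
 * bulk-preallocation plus single-walk store pattern described above.
 */
static int dup_mmap_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
	MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);	/* read iterator over the old tree */
	MA_STATE(mas, &mm->mm_mt, 0, 0);	/* advanced-API write state for the new tree */
	struct vm_area_struct *mpnt, *tmp;
	int retval;

	/* Preallocate worst-case nodes for map_count stores up front. */
	retval = mas_expected_entries(&mas, oldmm->map_count);
	if (retval)
		return retval;

	mas_for_each(&old_mas, mpnt, ULONG_MAX) {	/* native maple tree iteration */
		tmp = vm_area_dup(mpnt);	/* the real code does much more per VMA */
		if (!tmp) {
			retval = -ENOMEM;
			break;
		}
		/* A VMA occupies the range [vm_start, vm_end - 1] in the tree. */
		mas.index = tmp->vm_start;
		mas.last = tmp->vm_end - 1;
		mas_store(&mas, tmp);	/* reuses the mas state; no fresh walk from the root */
	}

	mas_destroy(&mas);	/* return any unused preallocated nodes */
	return retval;
}

mas_destroy() is what frees the unused bulk-allocated nodes, which is why
the patch routes the error paths through the new loop_out label rather
than jumping straight to out.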

Comments

Mark Hemment Feb. 3, 2022, 11:59 a.m. UTC | #1
On Wed, 2 Feb 2022 at 03:23, Liam Howlett <liam.howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The maple tree was already tracking VMAs in this function by an earlier
> commit, but the rbtree iterator was being used to iterate the list.
> Change the iterator to use a maple tree native iterator and switch to
> the maple tree advanced API to avoid multiple walks of the tree during
> insert operations.  Unexport the now-unused vma_store() function.
>
> We track whether we need to free the VMAs and tree nodes through RCU
> (i.e. whether there have been multiple threads that can see the mm_struct
> simultaneously, for example via pthread_create(), ptrace() or reading
> /proc/$pid/maps).  This setting is sticky because it's too tricky to
> decide when it's safe to exit RCU mode.
>
> For performance reasons we bulk allocate the maple tree nodes.  The node
> calculations are done internally by the tree, using the VMA count and
> assuming the worst-case node requirements.  The VM_DONTCOPY flag does
> not allow for the most efficient copy method of the tree, so a bulk
> loading algorithm is used instead.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mm.h       |  2 --
>  include/linux/sched/mm.h | 13 +++++++++++++
>  kernel/fork.c            | 35 +++++++++++++++++++++++++++++------
>  3 files changed, 42 insertions(+), 8 deletions(-)
....
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 51a7971651ef..8ea683fcefcd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -377,6 +377,16 @@ void vm_area_free(struct vm_area_struct *vma)
>         kmem_cache_free(vm_area_cachep, vma);
>  }
>
> +void mm_set_in_rcu(struct mm_struct *mm)
> +{
> +       if (!mt_in_rcu(&mm->mm_mt))
> +               return;
> +       mmap_write_lock(mm);
> +       mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
> +       mmap_write_unlock(mm);
> +}
> +EXPORT_SYMBOL(mm_set_in_rcu);

The mt_in_rcu() test looks wrong (inverted).
mt_in_rcu() returns true only when MT_FLAGS_USE_RCU is set, so the
flag will never be set here.
All callers of mm_set_in_rcu() already check mt_in_rcu(), so the test
here could be removed.

Cheers,
Mark
Liam R. Howlett Feb. 3, 2022, 5:20 p.m. UTC | #2
* Mark Hemment <markhemm@googlemail.com> [220203 07:00]:
> On Wed, 2 Feb 2022 at 03:23, Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > The maple tree was already tracking VMAs in this function by an earlier
> > commit, but the rbtree iterator was being used to iterate the list.
> > Change the iterator to use a maple tree native iterator and switch to
> > the maple tree advanced API to avoid multiple walks of the tree during
> > insert operations.  Unexport the now-unused vma_store() function.
> >
> > We track whether we need to free the VMAs and tree nodes through RCU
> > (i.e. whether there have been multiple threads that can see the mm_struct
> > simultaneously, for example via pthread_create(), ptrace() or reading
> > /proc/$pid/maps).  This setting is sticky because it's too tricky to
> > decide when it's safe to exit RCU mode.
> >
> > For performance reasons we bulk allocate the maple tree nodes.  The node
> > calculations are done internally by the tree, using the VMA count and
> > assuming the worst-case node requirements.  The VM_DONTCOPY flag does
> > not allow for the most efficient copy method of the tree, so a bulk
> > loading algorithm is used instead.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/mm.h       |  2 --
> >  include/linux/sched/mm.h | 13 +++++++++++++
> >  kernel/fork.c            | 35 +++++++++++++++++++++++++++++------
> >  3 files changed, 42 insertions(+), 8 deletions(-)
> ....
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 51a7971651ef..8ea683fcefcd 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -377,6 +377,16 @@ void vm_area_free(struct vm_area_struct *vma)
> >         kmem_cache_free(vm_area_cachep, vma);
> >  }
> >
> > +void mm_set_in_rcu(struct mm_struct *mm)
> > +{
> > +       if (!mt_in_rcu(&mm->mm_mt))
> > +               return;
> > +       mmap_write_lock(mm);
> > +       mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
> > +       mmap_write_unlock(mm);
> > +}
> > +EXPORT_SYMBOL(mm_set_in_rcu);
> 
> The mt_in_rcu() test looks wrong (inverted).
> mt_in_rcu() returns true only when MT_FLAGS_USE_RCU is set, so the
> flag will never be set here.
> All callers of mm_set_in_rcu() already check mt_in_rcu(), so the test
> here could be removed.

Yes, you are correct.  Thanks.  I will rerun my benchmarks with this
change.

It's not critical at this point since all the VMA work is still done
under the mmap_lock, but it could affect performance.
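
A minimal sketch of what the agreed fix might look like, with the check
un-inverted (alternatively, the check could simply be dropped, as Mark
notes, since callers already test mt_in_rcu()); this is only an
illustration, not the follow-up that eventually lands in the series:

void mm_set_in_rcu(struct mm_struct *mm)
{
	if (mt_in_rcu(&mm->mm_mt))	/* already in RCU mode, nothing to do */
		return;
	mmap_write_lock(mm);
	mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
	mmap_write_unlock(mm);
}
EXPORT_SYMBOL(mm_set_in_rcu);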

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0353e9a902a8..cc6f72c86f3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2609,8 +2609,6 @@  extern bool arch_has_descending_max_zone_pfns(void);
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
 extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
-/* mmap.c */
-void vma_store(struct mm_struct *mm, struct vm_area_struct *vma);
 
 /* interval_tree.c */
 void vma_interval_tree_insert(struct vm_area_struct *node,
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index aa5f09ca5bcf..f1226b49742e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -8,6 +8,7 @@ 
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
 #include <linux/sync_core.h>
+#include <linux/maple_tree.h>
 
 /*
  * Routines for handling mm_structs
@@ -78,6 +79,8 @@  static inline void mmdrop_sched(struct mm_struct *mm)
 }
 #endif
 
+void mm_set_in_rcu(struct mm_struct *mm);
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
@@ -96,11 +99,21 @@  static inline void mmdrop_sched(struct mm_struct *mm)
  */
 static inline void mmget(struct mm_struct *mm)
 {
+	if (!mt_in_rcu(&mm->mm_mt))
+		mm_set_in_rcu(mm);
 	atomic_inc(&mm->mm_users);
 }
 
 static inline bool mmget_not_zero(struct mm_struct *mm)
 {
+	/*
+	 * There is a race below during task tear down that can cause the maple
+	 * tree to enter rcu mode with only a single user.  If this race
+	 * happens, the result would be that the maple tree nodes would remain
+	 * active for an extra RCU read cycle.
+	 */
+	if (!mt_in_rcu(&mm->mm_mt))
+		mm_set_in_rcu(mm);
 	return atomic_inc_not_zero(&mm->mm_users);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 51a7971651ef..8ea683fcefcd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -377,6 +377,16 @@  void vm_area_free(struct vm_area_struct *vma)
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
+void mm_set_in_rcu(struct mm_struct *mm)
+{
+	if (!mt_in_rcu(&mm->mm_mt))
+		return;
+	mmap_write_lock(mm);
+	mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
+	mmap_write_unlock(mm);
+}
+EXPORT_SYMBOL(mm_set_in_rcu);
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	void *stack = task_stack_page(tsk);
@@ -494,7 +504,9 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
-	unsigned long charge;
+	unsigned long charge = 0;
+	MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	LIST_HEAD(uf);
 
 	uprobe_start_dup_mmap();
@@ -528,7 +540,12 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		goto out;
 
 	prev = NULL;
-	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
+
+	retval = mas_expected_entries(&mas, oldmm->map_count);
+	if (retval)
+		goto out;
+
+	mas_for_each(&old_mas, mpnt, ULONG_MAX) {
 		struct file *file;
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
@@ -542,7 +559,7 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		 */
 		if (fatal_signal_pending(current)) {
 			retval = -EINTR;
-			goto out;
+			goto loop_out;
 		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
@@ -608,7 +625,9 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		/* Link the vma into the MT */
-		vma_store(mm, tmp);
+		mas.index = tmp->vm_start;
+		mas.last = tmp->vm_end - 1;
+		mas_store(&mas, tmp);
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -618,10 +637,13 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			tmp->vm_ops->open(tmp);
 
 		if (retval)
-			goto out;
+			goto loop_out;
+
 	}
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
+loop_out:
+	mas_destroy(&mas);
 out:
 	mmap_write_unlock(mm);
 	flush_tlb_mm(oldmm);
@@ -637,7 +659,7 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 fail_nomem:
 	retval = -ENOMEM;
 	vm_unacct_memory(charge);
-	goto out;
+	goto loop_out;
 }
 
 static inline int mm_alloc_pgd(struct mm_struct *mm)
@@ -1112,6 +1134,7 @@  static inline void __mmput(struct mm_struct *mm)
 {
 	VM_BUG_ON(atomic_read(&mm->mm_users));
 
+	mt_clear_in_rcu(&mm->mm_mt);
 	uprobe_clear_state(mm);
 	exit_aio(mm);
 	ksm_exit(mm);