| Message ID | 20220202024137.2516438-16-Liam.Howlett@oracle.com |
|---|---|
| State | New |
| Series | Introducing the Maple Tree |
On Wed, 2 Feb 2022 at 03:23, Liam Howlett <liam.howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The maple tree was already tracking VMAs in this function by an earlier
> commit, but the rbtree iterator was being used to iterate the list.
> Change the iterator to use a maple tree native iterator and switch to
> the maple tree advanced API to avoid multiple walks of the tree during
> insert operations.  Unexport the now-unused vma_store() function.
>
> We track whether we need to free the VMAs and tree nodes through RCU
> (ie whether there have been multiple threads that can see the mm_struct
> simultaneously; by pthread(), ptrace() or looking at /proc/$pid/maps).
> This setting is sticky because it's too tricky to decide when it's safe
> to exit RCU mode.
>
> For performance reasons we bulk allocate the maple tree nodes.  The node
> calculations are done internally to the tree and use the VMA count and
> assume the worst-case node requirements.  The VM_DONT_COPY flag does
> not allow for the most efficient copy method of the tree and so a bulk
> loading algorithm is used.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/mm.h       |  2 --
>  include/linux/sched/mm.h | 13 +++++++++++++
>  kernel/fork.c            | 35 +++++++++++++++++++++++++++++------
>  3 files changed, 42 insertions(+), 8 deletions(-)
....
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 51a7971651ef..8ea683fcefcd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -377,6 +377,16 @@ void vm_area_free(struct vm_area_struct *vma)
>         kmem_cache_free(vm_area_cachep, vma);
>  }
>
> +void mm_set_in_rcu(struct mm_struct *mm)
> +{
> +       if (!mt_in_rcu(&mm->mm_mt))
> +               return;
> +       mmap_write_lock(mm);
> +       mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
> +       mmap_write_unlock(mm);
> +}
> +EXPORT_SYMBOL(mm_set_in_rcu);

The mt_in_rcu() test looks wrong (inverted).
mt_in_rcu() returns true only when MT_FLAGS_USE_RCU is set, so this
flag will never be set here.
All callers of mm_set_in_rcu() check mt_in_rcu(), so the test here
could be removed.

Cheers,
Mark
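[Editor's note: for readers following along, a minimal sketch of the correction Mark is pointing at — the early return should fire when the tree is *already* in RCU mode, not when it isn't. This is an illustration only, not code posted in the thread:

void mm_set_in_rcu(struct mm_struct *mm)
{
        /* Nothing to do if the VMA tree is already in RCU mode. */
        if (mt_in_rcu(&mm->mm_mt))
                return;
        mmap_write_lock(mm);
        mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
        mmap_write_unlock(mm);
}
EXPORT_SYMBOL(mm_set_in_rcu);

As posted, the inverted test means the flag is never set, so the tree never actually enters RCU mode.]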
* Mark Hemment <markhemm@googlemail.com> [220203 07:00]:
> On Wed, 2 Feb 2022 at 03:23, Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > The maple tree was already tracking VMAs in this function by an earlier
> > commit, but the rbtree iterator was being used to iterate the list.
> > Change the iterator to use a maple tree native iterator and switch to
> > the maple tree advanced API to avoid multiple walks of the tree during
> > insert operations.  Unexport the now-unused vma_store() function.
> >
> > We track whether we need to free the VMAs and tree nodes through RCU
> > (ie whether there have been multiple threads that can see the mm_struct
> > simultaneously; by pthread(), ptrace() or looking at /proc/$pid/maps).
> > This setting is sticky because it's too tricky to decide when it's safe
> > to exit RCU mode.
> >
> > For performance reasons we bulk allocate the maple tree nodes.  The node
> > calculations are done internally to the tree and use the VMA count and
> > assume the worst-case node requirements.  The VM_DONT_COPY flag does
> > not allow for the most efficient copy method of the tree and so a bulk
> > loading algorithm is used.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/mm.h       |  2 --
> >  include/linux/sched/mm.h | 13 +++++++++++++
> >  kernel/fork.c            | 35 +++++++++++++++++++++++++++++------
> >  3 files changed, 42 insertions(+), 8 deletions(-)
> ....
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 51a7971651ef..8ea683fcefcd 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -377,6 +377,16 @@ void vm_area_free(struct vm_area_struct *vma)
> >         kmem_cache_free(vm_area_cachep, vma);
> >  }
> >
> > +void mm_set_in_rcu(struct mm_struct *mm)
> > +{
> > +       if (!mt_in_rcu(&mm->mm_mt))
> > +               return;
> > +       mmap_write_lock(mm);
> > +       mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
> > +       mmap_write_unlock(mm);
> > +}
> > +EXPORT_SYMBOL(mm_set_in_rcu);
>
> The mt_in_rcu() test looks wrong (inverted).
> mt_in_rcu() returns true only when MT_FLAGS_USE_RCU is set, so this
> flag will never be set here.
> All callers of mm_set_in_rcu() check mt_in_rcu(), so the test here
> could be removed.

Yes, you are correct.  Thanks.  I will rerun my benchmarks with this
change.  It's not critical at this point, since all the vma work is
still under the mmap_lock, but it could affect performance.
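[Editor's note: if the redundant test is dropped entirely, as Mark suggests and as the callers in the patch below already allow (mmget() and mmget_not_zero() only call the helper after checking !mt_in_rcu()), the helper would reduce to something like this sketch — illustrative only, not code from the thread:

void mm_set_in_rcu(struct mm_struct *mm)
{
        /* Callers have already checked !mt_in_rcu(&mm->mm_mt). */
        mmap_write_lock(mm);
        mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
        mmap_write_unlock(mm);
}
EXPORT_SYMBOL(mm_set_in_rcu);

The write lock is still needed so the flag change cannot race with concurrent tree modifications.]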
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0353e9a902a8..cc6f72c86f3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2609,8 +2609,6 @@ extern bool arch_has_descending_max_zone_pfns(void);
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
 extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
-/* mmap.c */
-void vma_store(struct mm_struct *mm, struct vm_area_struct *vma);
 
 /* interval_tree.c */
 void vma_interval_tree_insert(struct vm_area_struct *node,
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index aa5f09ca5bcf..f1226b49742e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -8,6 +8,7 @@
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
 #include <linux/sync_core.h>
+#include <linux/maple_tree.h>
 
 /*
  * Routines for handling mm_structs
@@ -78,6 +79,8 @@ static inline void mmdrop_sched(struct mm_struct *mm)
 }
 #endif
 
+void mm_set_in_rcu(struct mm_struct *mm);
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
@@ -96,11 +99,21 @@ static inline void mmdrop_sched(struct mm_struct *mm)
  */
 static inline void mmget(struct mm_struct *mm)
 {
+        if (!mt_in_rcu(&mm->mm_mt))
+                mm_set_in_rcu(mm);
         atomic_inc(&mm->mm_users);
 }
 
 static inline bool mmget_not_zero(struct mm_struct *mm)
 {
+        /*
+         * There is a race below during task tear down that can cause the maple
+         * tree to enter rcu mode with only a single user.  If this race
+         * happens, the result would be that the maple tree nodes would remain
+         * active for an extra RCU read cycle.
+         */
+        if (!mt_in_rcu(&mm->mm_mt))
+                mm_set_in_rcu(mm);
         return atomic_inc_not_zero(&mm->mm_users);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 51a7971651ef..8ea683fcefcd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -377,6 +377,16 @@ void vm_area_free(struct vm_area_struct *vma)
         kmem_cache_free(vm_area_cachep, vma);
 }
 
+void mm_set_in_rcu(struct mm_struct *mm)
+{
+        if (!mt_in_rcu(&mm->mm_mt))
+                return;
+        mmap_write_lock(mm);
+        mm->mm_mt.ma_flags |= MT_FLAGS_USE_RCU;
+        mmap_write_unlock(mm);
+}
+EXPORT_SYMBOL(mm_set_in_rcu);
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
         void *stack = task_stack_page(tsk);
@@ -494,7 +504,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
         struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
         struct rb_node **rb_link, *rb_parent;
         int retval;
-        unsigned long charge;
+        unsigned long charge = 0;
+        MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
+        MA_STATE(mas, &mm->mm_mt, 0, 0);
         LIST_HEAD(uf);
 
         uprobe_start_dup_mmap();
@@ -528,7 +540,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
                 goto out;
 
         prev = NULL;
-        for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
+
+        retval = mas_expected_entries(&mas, oldmm->map_count);
+        if (retval)
+                goto out;
+
+        mas_for_each(&old_mas, mpnt, ULONG_MAX) {
                 struct file *file;
 
                 if (mpnt->vm_flags & VM_DONTCOPY) {
@@ -542,7 +559,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
                  */
                 if (fatal_signal_pending(current)) {
                         retval = -EINTR;
-                        goto out;
+                        goto loop_out;
                 }
                 if (mpnt->vm_flags & VM_ACCOUNT) {
                         unsigned long len = vma_pages(mpnt);
@@ -608,7 +625,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
                 rb_parent = &tmp->vm_rb;
 
                 /* Link the vma into the MT */
-                vma_store(mm, tmp);
+                mas.index = tmp->vm_start;
+                mas.last = tmp->vm_end - 1;
+                mas_store(&mas, tmp);
 
                 mm->map_count++;
                 if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -618,10 +637,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
                         tmp->vm_ops->open(tmp);
 
                 if (retval)
-                        goto out;
+                        goto loop_out;
+        }
         /* a new mm has just been created */
         retval = arch_dup_mmap(oldmm, mm);
+loop_out:
+        mas_destroy(&mas);
 out:
         mmap_write_unlock(mm);
         flush_tlb_mm(oldmm);
@@ -637,7 +659,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 fail_nomem:
         retval = -ENOMEM;
         vm_unacct_memory(charge);
-        goto out;
+        goto loop_out;
 }
 
 static inline int mm_alloc_pgd(struct mm_struct *mm)
@@ -1112,6 +1134,7 @@ static inline void __mmput(struct mm_struct *mm)
 {
         VM_BUG_ON(atomic_read(&mm->mm_users));
 
+        mt_clear_in_rcu(&mm->mm_mt);
         uprobe_clear_state(mm);
         exit_aio(mm);
         ksm_exit(mm);
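[Editor's note: the dup_mmap() changes above boil down to the maple tree advanced-API bulk pattern sketched below. mas_expected_entries() preallocates worst-case nodes for the given number of stores, and mas_destroy() releases whatever was not used. This is a condensed illustration only — the real code also sets up the VMA copy (files, anon_vma chains, page tables) and full error unwinding, and copy_vmas_sketch() is a made-up name:

static int copy_vmas_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
        struct vm_area_struct *mpnt;
        MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);  /* iterator over the old tree */
        MA_STATE(mas, &mm->mm_mt, 0, 0);         /* writer state for the new tree */
        int retval;

        /* Preallocate worst-case nodes for map_count stores (bulk mode). */
        retval = mas_expected_entries(&mas, oldmm->map_count);
        if (retval)
                return retval;

        mas_for_each(&old_mas, mpnt, ULONG_MAX) {
                struct vm_area_struct *tmp = vm_area_dup(mpnt);

                if (!tmp) {
                        retval = -ENOMEM;
                        break;
                }
                /* Store tmp over its range without re-walking the tree. */
                mas.index = tmp->vm_start;
                mas.last = tmp->vm_end - 1;
                mas_store(&mas, tmp);
                mm->map_count++;
        }

        /* Return unused preallocated nodes and finish bulk mode. */
        mas_destroy(&mas);
        return retval;
}

The caller would hold mmap_write_lock() on the new mm and mmap_read/write_lock on the old one, as dup_mmap() does.]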