
[1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

Message ID 20211116215715.645231-1-surenb@google.com (mailing list archive)
State New
Series [1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

Commit Message

Suren Baghdasaryan Nov. 16, 2021, 9:57 p.m. UTC
The oom-reaper and the process_mrelease system call should protect against
races with exit_mmap, which can destroy page tables while they
walk the VMA tree. The oom-reaper protects against that race by setting
MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
before taking and releasing mmap_write_lock. process_mrelease has
to elevate mm->mm_users to prevent such a race. Both the oom-reaper and
process_mrelease hold mmap_read_lock when walking the VMA tree.
The locking rules and mechanisms could be simpler if exit_mmap took
mmap_write_lock while executing destructive operations such as
free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling
free_pgtables. Operations like unmap_vmas() and unlock_range() are not
destructive and could run under mmap_read_lock, but for simplicity we
take one mmap_write_lock during almost the entire operation. Note
also that because the oom-reaper checks the VM_LOCKED flag, unlock_range()
should not be allowed to race with it.
In most cases this lock should be uncontended. Previously, Kirill
reported a ~4% regression caused by a similar change [1]. We reran the
same test and, although the individual results are quite noisy, the
percentiles show a lower regression, with 1.6% being the worst case [2].
The change allows the oom-reaper and process_mrelease to execute safely
under mmap_read_lock without worrying that exit_mmap might destroy page
tables from under them.

[1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)
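
For context, a minimal sketch of the reader-side pattern described in the
commit message, assuming the pre-maple-tree VMA list of this kernel version;
reap_one_vma() is a hypothetical placeholder, not a real kernel function:

#include <linux/mm.h>
#include <linux/sched/mm.h>

/*
 * Simplified sketch (not the actual kernel code): the oom-reaper and
 * process_mrelease pin the mm and walk the VMA list under mmap_read_lock,
 * which is only safe if exit_mmap() does not free page tables without
 * holding mmap_write_lock.
 */
static int reap_mm_sketch(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	/* Elevate mm_users so exit_mmap() cannot complete under us. */
	if (!mmget_not_zero(mm))
		return -ESRCH;

	if (mmap_read_lock_killable(mm)) {
		mmput(mm);
		return -EINTR;
	}

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* mlocked VMAs are skipped, hence the unlock_range() ordering. */
		if (vma->vm_flags & VM_LOCKED)
			continue;
		/* reap_one_vma(vma);  -- hypothetical reclaim step */
	}

	mmap_read_unlock(mm);
	mmput(mm);
	return 0;
}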

Comments

Suren Baghdasaryan Nov. 23, 2021, 1:47 a.m. UTC | #1
On Tue, Nov 16, 2021 at 1:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> oom-reaper and process_mrelease system call should protect against
> races with exit_mmap which can destroy page tables while they
> walk the VMA tree. oom-reaper protects from that race by setting
> MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> before taking and releasing mmap_write_lock. process_mrelease has
> to elevate mm->mm_users to prevent such race. Both oom-reaper and
> process_mrelease hold mmap_read_lock when walking the VMA tree.
> The locking rules and mechanisms could be simpler if exit_mmap takes
> mmap_write_lock while executing destructive operations such as
> free_pgtables.
> Change exit_mmap to hold the mmap_write_lock when calling
> free_pgtables. Operations like unmap_vmas() and unlock_range() are not
> destructive and could run under mmap_read_lock but for simplicity we
> take one mmap_write_lock during almost the entire operation. Note
> also that because oom-reaper checks VM_LOCKED flag, unlock_range()
> should not be allowed to race with it.
> In most cases this lock should be uncontended. Previously, Kirill
> reported ~4% regression caused by a similar change [1]. We reran the
> same test and although the individual results are quite noisy, the
> percentiles show lower regression with 1.6% being the worst case [2].
> The change allows oom-reaper and process_mrelease to execute safely
> under mmap_read_lock without worries that exit_mmap might destroy page
> tables from under them.
>
> [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
> [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/

Friendly nudge.
Michal, Matthew, from our discussion in
https://lore.kernel.org/all/YXKhOKIIngIuJaYi@casper.infradead.org I
was under the impression this change would be of interest to you. Any
feedback?

>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bfb0ea164a90..69b3036c6dee 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
>                  * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>                  * __oom_reap_task_mm() will not block.
>                  *
> -                * This needs to be done before calling munlock_vma_pages_all(),
> +                * This needs to be done before calling unlock_range(),
>                  * which clears VM_LOCKED, otherwise the oom reaper cannot
>                  * reliably test it.
>                  */
>                 (void)__oom_reap_task_mm(mm);
>
>                 set_bit(MMF_OOM_SKIP, &mm->flags);
> -               mmap_write_lock(mm);
> -               mmap_write_unlock(mm);
>         }
>
> +       mmap_write_lock(mm);
>         if (mm->locked_vm)
>                 unlock_range(mm->mmap, ULONG_MAX);
>
>         arch_exit_mmap(mm);
>
>         vma = mm->mmap;
> -       if (!vma)       /* Can happen if dup_mmap() received an OOM */
> +       if (!vma) {
> +               /* Can happen if dup_mmap() received an OOM */
> +               mmap_write_unlock(mm);
>                 return;
> +       }
>
>         lru_add_drain();
>         flush_cache_mm(mm);
> @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
>         unmap_vmas(&tlb, vma, 0, -1);
>         free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>         tlb_finish_mmu(&tlb);
> +       mmap_write_unlock(mm);
>
>         /*
>          * Walk the list again, actually closing and freeing it,
> --
> 2.34.0.rc1.387.gb447b232ab-goog
>
Matthew Wilcox Nov. 23, 2021, 1:19 p.m. UTC | #2
On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
>  	unmap_vmas(&tlb, vma, 0, -1);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb);
> +	mmap_write_unlock(mm);
>  
>  	/*
>  	 * Walk the list again, actually closing and freeing it,

Is there a reason to unlock here instead of after the remove_vma loop?
We'll need the mmap sem held during that loop when VMAs are stored in
the maple tree.
Suren Baghdasaryan Nov. 23, 2021, 5:56 p.m. UTC | #3
On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> >       unmap_vmas(&tlb, vma, 0, -1);
> >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >       tlb_finish_mmu(&tlb);
> > +     mmap_write_unlock(mm);
> >
> >       /*
> >        * Walk the list again, actually closing and freeing it,
>
> Is there a reason to unlock here instead of after the remove_vma loop?
> We'll need the mmap sem held during that loop when VMAs are stored in
> the maple tree.

I didn't realize remove_vma() would need to be protected as well. I
think I can move mmap_write_unlock down to cover the last walk too
with no impact.
Does anyone know if there was any specific reason to perform that last
walk with no locks held (as the comment states)? I can track that
comment back to the Linux-2.6.12-rc2 merge with no earlier history, so I'm
not sure if it's critical not to hold any locks at this point. Seems to me
it's ok to hold mmap_write_lock but maybe I'm missing something?
Michal Hocko Nov. 24, 2021, 12:20 p.m. UTC | #4
On Tue 23-11-21 09:56:41, Suren Baghdasaryan wrote:
> On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> > >       unmap_vmas(&tlb, vma, 0, -1);
> > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >       tlb_finish_mmu(&tlb);
> > > +     mmap_write_unlock(mm);
> > >
> > >       /*
> > >        * Walk the list again, actually closing and freeing it,
> >
> > Is there a reason to unlock here instead of after the remove_vma loop?
> > We'll need the mmap sem held during that loop when VMAs are stored in
> > the maple tree.
> 
> I didn't realize remove_vma() would need to be protected as well. I
> think I can move mmap_write_unlock down to cover the last walk too
> with no impact.
> Does anyone know if there was any specific reason to perform that last
> walk with no locks held (as the comment states)? I can track that
> comment back to Linux-2.6.12-rc2 merge with no earlier history, so not
> sure if it's critical not to hold any locks at this point. Seems to me
> it's ok to hold mmap_write_lock but maybe I'm missing something?

I suspect the primary reason was that neither fput (and callbacks
invoked from it) nor vm_close would need to be very careful about
interacting with mm locks. fput is async these days, so it shouldn't be
problematic. vm_ops->close doesn't have any real contract definition AFAIK,
but taking mmap_sem from those would be really surprising. They should be
mostly destructing internal vma state, and that shouldn't really require
address space protection.
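
For reference, roughly what remove_vma() looks like in mm/mmap.c around this
kernel version (a simplified sketch, not a verbatim copy), showing the
vm_ops->close() and fput() calls in question:

static struct vm_area_struct *remove_vma_sketch(struct vm_area_struct *vma)
{
	struct vm_area_struct *next = vma->vm_next;

	might_sleep();
	if (vma->vm_ops && vma->vm_ops->close)
		vma->vm_ops->close(vma);	/* driver callback; should not take mmap_lock itself */
	if (vma->vm_file)
		fput(vma->vm_file);		/* async these days, so safe under mmap_lock */
	mpol_put(vma_policy(vma));
	vm_area_free(vma);
	return next;
}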
Suren Baghdasaryan Nov. 24, 2021, 3:25 p.m. UTC | #5
On Wed, Nov 24, 2021 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-11-21 09:56:41, Suren Baghdasaryan wrote:
> > On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > > > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> > > >       unmap_vmas(&tlb, vma, 0, -1);
> > > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >       tlb_finish_mmu(&tlb);
> > > > +     mmap_write_unlock(mm);
> > > >
> > > >       /*
> > > >        * Walk the list again, actually closing and freeing it,
> > >
> > > Is there a reason to unlock here instead of after the remove_vma loop?
> > > We'll need the mmap sem held during that loop when VMAs are stored in
> > > the maple tree.
> >
> > I didn't realize remove_vma() would need to be protected as well. I
> > think I can move mmap_write_unlock down to cover the last walk too
> > with no impact.
> > Does anyone know if there was any specific reason to perform that last
> > walk with no locks held (as the comment states)? I can track that
> > comment back to Linux-2.6.12-rc2 merge with no earlier history, so not
> > sure if it's critical not to hold any locks at this point. Seems to me
> > it's ok to hold mmap_write_lock but maybe I'm missing something?
>
> I suspect the primary reason was that neither fput (and callbacks
> invoked from it) nor vm_close would need to be very careful about
> interacting with mm locks. fput is async these days so it shouldn't be
> problematic. vm_ops->close doesn't have any real contract definition AFAIK
> but taking mmap_sem from those would be really surprising. They should be
> mostly destructing internal vma state and that shouldn't really require
> address space protection.

Thanks for the clarification, Michal. I'll post an updated patch with the
remove_vma() loop executed under mmap_write_lock protection.

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Nov. 25, 2021, 12:01 a.m. UTC | #6
On Wed, Nov 24, 2021 at 7:25 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Nov 24, 2021 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 23-11-21 09:56:41, Suren Baghdasaryan wrote:
> > > On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > > > > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> > > > >       unmap_vmas(&tlb, vma, 0, -1);
> > > > >       free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > > >       tlb_finish_mmu(&tlb);
> > > > > +     mmap_write_unlock(mm);
> > > > >
> > > > >       /*
> > > > >        * Walk the list again, actually closing and freeing it,
> > > >
> > > > Is there a reason to unlock here instead of after the remove_vma loop?
> > > > We'll need the mmap sem held during that loop when VMAs are stored in
> > > > the maple tree.
> > >
> > > I didn't realize remove_vma() would need to be protected as well. I
> > > think I can move mmap_write_unlock down to cover the last walk too
> > > with no impact.
> > > Does anyone know if there was any specific reason to perform that last
> > > walk with no locks held (as the comment states)? I can track that
> > > comment back to Linux-2.6.12-rc2 merge with no earlier history, so not
> > > sure if it's critical not to hold any locks at this point. Seems to me
> > > it's ok to hold mmap_write_lock but maybe I'm missing something?
> >
> > I suspect the primary reason was that neither fput (and callbacks
> > invoked from it) nor vm_close would need to be very careful about
> > interacting with mm locks. fput is async these days so it shouldn't be
> > problematic. vm_ops->close doesn't have any real contract definition AFAIK
> > > but taking mmap_sem from those would be really surprising. They should be
> > mostly destructing internal vma state and that shouldn't really require
> > address space protection.
>
> Thanks for clarification, Michal. I'll post an updated patch with
> remove_vma() loop executed under mmap_write_lock protection.

v2 is posted at
https://lore.kernel.org/all/20211124235906.14437-1-surenb@google.com/
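
A rough sketch of the rearranged exit_mmap() tail being discussed, with the
write lock held across the remove_vma() walk; the v2 patch linked above is
authoritative and may differ in detail:

	unmap_vmas(&tlb, vma, 0, -1);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	/* Walk the list again, closing and freeing each VMA under the write lock. */
	while (vma) {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		vma = remove_vma(vma);
		cond_resched();
	}
	/* Drop the lock only once the VMA list is fully torn down. */
	mmap_write_unlock(mm);
	vm_unacct_memory(nr_accounted);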

>
> > --
> > Michal Hocko
> > SUSE Labs

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90..69b3036c6dee 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3142,25 +3142,27 @@  void exit_mmap(struct mm_struct *mm)
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * This needs to be done before calling unlock_range(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
 		 * reliably test it.
 		 */
 		(void)__oom_reap_task_mm(mm);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
-		mmap_write_lock(mm);
-		mmap_write_unlock(mm);
 	}
 
+	mmap_write_lock(mm);
 	if (mm->locked_vm)
 		unlock_range(mm->mmap, ULONG_MAX);
 
 	arch_exit_mmap(mm);
 
 	vma = mm->mmap;
-	if (!vma)	/* Can happen if dup_mmap() received an OOM */
+	if (!vma) {
+		/* Can happen if dup_mmap() received an OOM */
+		mmap_write_unlock(mm);
 		return;
+	}
 
 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -3170,6 +3172,7 @@  void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
+	mmap_write_unlock(mm);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,