
[1/1] mm: prevent a race between process_mrelease and exit_mmap

Message ID 20211022014658.263508-1-surenb@google.com (mailing list archive)
State New
Series [1/1] mm: prevent a race between process_mrelease and exit_mmap

Commit Message

Suren Baghdasaryan Oct. 22, 2021, 1:46 a.m. UTC
Race between process_mrelease and exit_mmap, where free_pgtables is
called while __oom_reap_task_mm is in progress, leads to kernel crash
during pte_offset_map_lock call. oom-reaper avoids this race by setting
MMF_OOM_VICTIM flag and causing exit_mmap to take and release
mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
fix this race, however that would be considered a hack. Fix this race
by elevating mm->mm_users and preventing exit_mmap from executing until
process_mrelease is finished. Patch slightly refactors the code to adapt
for a possible mmget_not_zero failure.
This fix has considerable negative impact on process_mrelease performance
and will likely need later optimization.

Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/oom_kill.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)
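
For context, the exit_mmap() side of the synchronization described above looks roughly like this (paraphrased from the v5.15-era mm/mmap.c with unrelated steps elided, not the exact source):

void exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;

	/* ... mmu_notifier_release(), vma = mm->mmap, munlock, tlb_gather_mmu_fullmm() ... */

	if (unlikely(mm_is_oom_victim(mm))) {
		/*
		 * Reap what we can and mark the mm as done.  Taking and
		 * dropping mmap_lock for write cannot succeed while the
		 * oom-reaper still holds mmap_read_lock(), so by the time
		 * we reach free_pgtables() below the reaper is guaranteed
		 * to have finished.
		 */
		(void)__oom_reap_task_mm(mm);
		set_bit(MMF_OOM_SKIP, &mm->flags);
		mmap_write_lock(mm);
		mmap_write_unlock(mm);
	}

	unmap_vmas(&tlb, vma, 0, -1);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	/* ... remove_vma() loop, vm_unacct_memory() ... */
}

process_mrelease(), before this patch, only pinned the mm_struct with mmgrab() and took mmap_read_lock() around __oom_reap_task_mm(). Since mmgrab() does not delay exit_mmap(), free_pgtables() above could run while the syscall was still walking page tables, hence the crash in pte_offset_map_lock(). Elevating mm->mm_users with mmget_not_zero() instead makes the whole exit_mmap() wait for the syscall's final mmput().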

Comments

Andrew Morton Oct. 22, 2021, 2:24 a.m. UTC | #1
On Thu, 21 Oct 2021 18:46:58 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> Race between process_mrelease and exit_mmap, where free_pgtables is
> called while __oom_reap_task_mm is in progress, leads to kernel crash
> during pte_offset_map_lock call. oom-reaper avoids this race by setting
> MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> fix this race, however that would be considered a hack. Fix this race
> by elevating mm->mm_users and preventing exit_mmap from executing until
> process_mrelease is finished. Patch slightly refactors the code to adapt
> for a possible mmget_not_zero failure.
> This fix has considerable negative impact on process_mrelease performance
> and will likely need later optimization.

Has the impact been quantified?

And where's the added cost happening?  The changes all look quite
lightweight?
Suren Baghdasaryan Oct. 22, 2021, 5:23 a.m. UTC | #2
On Thu, Oct 21, 2021 at 7:25 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 21 Oct 2021 18:46:58 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > Race between process_mrelease and exit_mmap, where free_pgtables is
> > called while __oom_reap_task_mm is in progress, leads to kernel crash
> > during pte_offset_map_lock call. oom-reaper avoids this race by setting
> > MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> > fix this race, however that would be considered a hack. Fix this race
> > by elevating mm->mm_users and preventing exit_mmap from executing until
> > process_mrelease is finished. Patch slightly refactors the code to adapt
> > for a possible mmget_not_zero failure.
> > This fix has considerable negative impact on process_mrelease performance
> > and will likely need later optimization.
>
> Has the impact been quantified?

As a ball-park figure, for a large process (6GB) it takes about 4x longer
for process_mrelease to complete.

>
> And where's the added cost happening?  The changes all look quite
> lightweight?

I think it's caused by the fact that exit_mmap and all the other cleanup
routines that run on the last mmput are postponed until
process_mrelease finishes __oom_reap_task_mm and drops mm->mm_users. I
suspect all that cleanup now happens at the end of process_mrelease,
and that is what contributes to the regression. I haven't had time yet
to fully understand all the reasons for the regression but wanted to
fix the crash first. I will proceed with more investigation and
hopefully a quick fix for the lost performance.
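
For reference, the work that piles up behind that last mmput() is the teardown done by __mmput(); a condensed sketch (paraphrased from the v5.15-era kernel/fork.c, some steps elided) shows why exit_aio() in particular can add latency:

static void __mmput(struct mm_struct *mm)
{
	uprobe_clear_state(mm);
	exit_aio(mm);		/* may sleep waiting for in-flight AIO */
	ksm_exit(mm);
	khugepaged_exit(mm);	/* must run before exit_mmap */
	exit_mmap(mm);		/* unmap, free page tables, close VMAs */
	set_mm_exe_file(mm, NULL);
	/* ... mmlist removal, binfmt module_put() ... */
	mmdrop(mm);
}

With the fix, the process_mrelease() caller can end up dropping the last mm_users reference, so all of the above now runs in the syscall context rather than in the exiting task.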

Michal Hocko Oct. 22, 2021, 8:03 a.m. UTC | #3
On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote:
> Race between process_mrelease and exit_mmap, where free_pgtables is
> called while __oom_reap_task_mm is in progress, leads to kernel crash
> during pte_offset_map_lock call. oom-reaper avoids this race by setting
> MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> fix this race, however that would be considered a hack. Fix this race
> by elevating mm->mm_users and preventing exit_mmap from executing until
> process_mrelease is finished. Patch slightly refactors the code to adapt
> for a possible mmget_not_zero failure.
> This fix has considerable negative impact on process_mrelease performance
> and will likely need later optimization.

I am not sure there is any promise that process_mrelease will run in
parallel with the exiting process. In fact the primary purpose of this
syscall is to provide a reliable way to oom kill from user space. If you
want to optimize process exit, respectively its exit_mmap part, then you
should be using other means. So I would be careful calling this a regression.

I do agree that taking the reference count is the right approach here. I
was wrong previously [1] when saying that pinning the mm struct is
sufficient. I had completely forgotten about the subtle sync in exit_mmap.
One way we can approach that would be to take exclusive mmap_sem
throughout the exit_mmap unconditionally. There was a push back against
that though so arguments would have to be re-evaluated.

[1] http://lkml.kernel.org/r/YQzZqFwDP7eUxwcn@dhcp22.suse.cz

That being said
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
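
To illustrate the distinction Michal draws above, here is a sketch of the usual refcounting rules, echoing the patch below (not a quote from the source):

/*
 * mmgrab()/mmdrop() pin only the mm_struct itself (mm->mm_count): the
 * structure is not freed, but the last mmput() can still run exit_mmap()
 * and tear down VMAs and page tables underneath the caller.
 *
 * mmget_not_zero()/mmput() take a reference on mm->mm_users: while it is
 * held, the final __mmput() -- and therefore exit_mmap() -- cannot start,
 * which is what process_mrelease needs while it walks page tables in
 * __oom_reap_task_mm().
 */
if (mmget_not_zero(p->mm)) {
	mm = p->mm;	/* address space cannot be torn down under us */
	/* ... reap under mmap_read_lock() ... */
	mmput(mm);	/* may end up being the last reference, see __mmput() */
}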

> Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/oom_kill.c | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..989f35a2bbb1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1150,7 +1150,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	struct task_struct *task;
>  	struct task_struct *p;
>  	unsigned int f_flags;
> -	bool reap = true;
> +	bool reap = false;
>  	struct pid *pid;
>  	long ret = 0;
>  
> @@ -1177,15 +1177,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  		goto put_task;
>  	}
>  
> -	mm = p->mm;
> -	mmgrab(mm);
> -
> -	/* If the work has been done already, just exit with success */
> -	if (test_bit(MMF_OOM_SKIP, &mm->flags))
> -		reap = false;
> -	else if (!task_will_free_mem(p)) {
> -		reap = false;
> -		ret = -EINVAL;
> +	if (mmget_not_zero(p->mm)) {
> +		mm = p->mm;
> +		if (task_will_free_mem(p))
> +			reap = true;
> +		else {
> +			/* Error only if the work has not been done already */
> +			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> +				ret = -EINVAL;
> +		}
>  	}
>  	task_unlock(p);
>  
> @@ -1201,7 +1201,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	mmap_read_unlock(mm);
>  
>  drop_mm:
> -	mmdrop(mm);
> +	if (mm)
> +		mmput(mm);
>  put_task:
>  	put_task_struct(task);
>  put_pid:
> -- 
> 2.33.0.1079.g6e70778dc9-goog
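
For completeness, a minimal user-space caller of the syscall being patched could look like the sketch below. It assumes the usual syscall numbers (434 for pidfd_open, 448 for process_mrelease) and ignores the pid-reuse race that a real reaper would close by signalling through the pidfd with pidfd_send_signal():

#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/* Kill a process and immediately reap its memory on its behalf. */
static int kill_and_reap(pid_t pid)
{
	int pidfd = syscall(__NR_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;
	if (kill(pid, SIGKILL) < 0)
		goto out_err;
	/* flags must currently be 0; fails with EINVAL if the task is not dying */
	if (syscall(__NR_process_mrelease, pidfd, 0) < 0)
		goto out_err;
	close(pidfd);
	return 0;
out_err:
	close(pidfd);
	return -1;
}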
Matthew Wilcox Oct. 22, 2021, 11:32 a.m. UTC | #4
On Fri, Oct 22, 2021 at 10:03:29AM +0200, Michal Hocko wrote:
> On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote:
> > Race between process_mrelease and exit_mmap, where free_pgtables is
> > called while __oom_reap_task_mm is in progress, leads to kernel crash
> > during pte_offset_map_lock call. oom-reaper avoids this race by setting
> > MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> > fix this race, however that would be considered a hack. Fix this race
> > by elevating mm->mm_users and preventing exit_mmap from executing until
> > process_mrelease is finished. Patch slightly refactors the code to adapt
> > for a possible mmget_not_zero failure.
> > This fix has considerable negative impact on process_mrelease performance
> > and will likely need later optimization.
> 
> I am not sure there is any promise that process_mrelease will run in
> parallel with the exiting process. In fact the primary purpose of this
> syscall is to provide a reliable way to oom kill from user space. If you
> want to optimize process exit resp. its exit_mmap part then you should
> be using other means. So I would be careful calling this a regression.
> 
> I do agree that taking the reference count is the right approach here. I
> was wrong previously [1] when saying that pinning the mm struct is
> sufficient. I have completely forgot about the subtle sync in exit_mmap.
> One way we can approach that would be to take exclusive mmap_sem
> throughout the exit_mmap unconditionally. There was a push back against
> that though so arguments would have to be re-evaluated.

I have another reason for wanting to take the mmap_sem throughout
exit_mmap.  Liam and I are working on using the Maple tree to replace
the rbtree & vma linked list.  It uses lockdep to check that you haven't
forgotten to take a lock (as of two days ago, that means the mmap_sem
or the RCU read lock) when walking the tree.

So I'd like to hold it over:

 - unlock_range()
 - unmap_vmas()
 - free_pgtables()
 - while (vma) remove_vma()

Which is basically the whole of exit_mmap().  I'd like to know more
about why there was pushback on holding the mmap_lock across this
-- we're exiting, so nobody else should have a reference to the mm?
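
A rough sketch of what that would look like -- exit_mmap() holding the lock for write across the whole teardown -- is given below. This is a hypothetical restructuring for illustration, not a patch from this thread:

void exit_mmap(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;
	unsigned long nr_accounted = 0;

	/* ... mmu_notifier_release(), oom reaping, munlock of locked VMAs ... */

	mmap_write_lock(mm);
	vma = mm->mmap;
	if (!vma) {
		mmap_write_unlock(mm);
		return;
	}

	lru_add_drain();
	flush_cache_mm(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	unmap_vmas(&tlb, vma, 0, -1);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	while (vma) {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		vma = remove_vma(vma);
		cond_resched();
	}
	/* Leave an empty mm behind for any late reader. */
	mm->mmap = NULL;
	mmap_write_unlock(mm);

	vm_unacct_memory(nr_accounted);
}

With the lock held throughout, a lockdep assertion in the maple tree (or any other reader that takes mmap_read_lock) sees a consistent vma tree until it is fully torn down.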
Michal Hocko Oct. 22, 2021, 12:04 p.m. UTC | #5
On Fri 22-10-21 12:32:08, Matthew Wilcox wrote:
> On Fri, Oct 22, 2021 at 10:03:29AM +0200, Michal Hocko wrote:
> > On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote:
> > > Race between process_mrelease and exit_mmap, where free_pgtables is
> > > called while __oom_reap_task_mm is in progress, leads to kernel crash
> > > during pte_offset_map_lock call. oom-reaper avoids this race by setting
> > > MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> > > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> > > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> > > fix this race, however that would be considered a hack. Fix this race
> > > by elevating mm->mm_users and preventing exit_mmap from executing until
> > > process_mrelease is finished. Patch slightly refactors the code to adapt
> > > for a possible mmget_not_zero failure.
> > > This fix has considerable negative impact on process_mrelease performance
> > > and will likely need later optimization.
> > 
> > I am not sure there is any promise that process_mrelease will run in
> > parallel with the exiting process. In fact the primary purpose of this
> > syscall is to provide a reliable way to oom kill from user space. If you
> > want to optimize process exit resp. its exit_mmap part then you should
> > be using other means. So I would be careful calling this a regression.
> > 
> > I do agree that taking the reference count is the right approach here. I
> > was wrong previously [1] when saying that pinning the mm struct is
> > sufficient. I have completely forgot about the subtle sync in exit_mmap.
> > One way we can approach that would be to take exclusive mmap_sem
> > throughout the exit_mmap unconditionally. There was a push back against
> > that though so arguments would have to be re-evaluated.
> 
> I have another reason for wanting to take the mmap_sem throughout
> exit_mmap.  Liam and I are working on using the Maple tree to replace
> the rbtree & vma linked list.  It uses lockdep to check that you haven't
> forgotten to take a lock (as of two days ago, that mean the mmap_sem
> or the RCU read lock) when walking the tree.
> 
> So I'd like to hold it over:
> 
>  - unlock_range()
>  - unmap_vmas()
>  - free_pgtables()
>  - while (vma) remove_vma()
> 
> Which is basically the whole of exit_mmap().  I'd like to know more
> about why there was pushback on holding the mmap_lock across this
> -- we're exiting, so nobody else should have a reference to the mm?

https://lore.kernel.org/all/20170724072332.31903-1-mhocko@kernel.org/
Suren Baghdasaryan Oct. 22, 2021, 5:38 p.m. UTC | #6
On Fri, Oct 22, 2021 at 1:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote:
> > Race between process_mrelease and exit_mmap, where free_pgtables is
> > called while __oom_reap_task_mm is in progress, leads to kernel crash
> > during pte_offset_map_lock call. oom-reaper avoids this race by setting
> > MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> > fix this race, however that would be considered a hack. Fix this race
> > by elevating mm->mm_users and preventing exit_mmap from executing until
> > process_mrelease is finished. Patch slightly refactors the code to adapt
> > for a possible mmget_not_zero failure.
> > This fix has considerable negative impact on process_mrelease performance
> > and will likely need later optimization.
>
> I am not sure there is any promise that process_mrelease will run in
> parallel with the exiting process. In fact the primary purpose of this
> syscall is to provide a reliable way to oom kill from user space. If you
> want to optimize process exit resp. its exit_mmap part then you should
> be using other means. So I would be careful calling this a regression.
>
> I do agree that taking the reference count is the right approach here. I
> was wrong previously [1] when saying that pinning the mm struct is
> sufficient. I have completely forgot about the subtle sync in exit_mmap.
> One way we can approach that would be to take exclusive mmap_sem
> throughout the exit_mmap unconditionally.

I agree, that would probably be the cleanest way.

> There was a push back against
> that though so arguments would have to be re-evaluated.

I'll review that discussion to better understand the reasons for the
push back. Thanks for the link.

>
> [1] http://lkml.kernel.org/r/YQzZqFwDP7eUxwcn@dhcp22.suse.cz
>
> That being said
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

>
> Thanks!
>
> > Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call")
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/oom_kill.c | 23 ++++++++++++-----------
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 831340e7ad8b..989f35a2bbb1 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1150,7 +1150,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >       struct task_struct *task;
> >       struct task_struct *p;
> >       unsigned int f_flags;
> > -     bool reap = true;
> > +     bool reap = false;
> >       struct pid *pid;
> >       long ret = 0;
> >
> > @@ -1177,15 +1177,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >               goto put_task;
> >       }
> >
> > -     mm = p->mm;
> > -     mmgrab(mm);
> > -
> > -     /* If the work has been done already, just exit with success */
> > -     if (test_bit(MMF_OOM_SKIP, &mm->flags))
> > -             reap = false;
> > -     else if (!task_will_free_mem(p)) {
> > -             reap = false;
> > -             ret = -EINVAL;
> > +     if (mmget_not_zero(p->mm)) {
> > +             mm = p->mm;
> > +             if (task_will_free_mem(p))
> > +                     reap = true;
> > +             else {
> > +                     /* Error only if the work has not been done already */
> > +                     if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> > +                             ret = -EINVAL;
> > +             }
> >       }
> >       task_unlock(p);
> >
> > @@ -1201,7 +1201,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >       mmap_read_unlock(mm);
> >
> >  drop_mm:
> > -     mmdrop(mm);
> > +     if (mm)
> > +             mmput(mm);
> >  put_task:
> >       put_task_struct(task);
> >  put_pid:
> > --
> > 2.33.0.1079.g6e70778dc9-goog
>
> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Oct. 27, 2021, 4:08 p.m. UTC | #7
On Fri, Oct 22, 2021 at 10:38 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Oct 22, 2021 at 1:03 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote:
> > > Race between process_mrelease and exit_mmap, where free_pgtables is
> > > called while __oom_reap_task_mm is in progress, leads to kernel crash
> > > during pte_offset_map_lock call. oom-reaper avoids this race by setting
> > > MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> > > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> > > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> > > fix this race, however that would be considered a hack. Fix this race
> > > by elevating mm->mm_users and preventing exit_mmap from executing until
> > > process_mrelease is finished. Patch slightly refactors the code to adapt
> > > for a possible mmget_not_zero failure.
> > > This fix has considerable negative impact on process_mrelease performance
> > > and will likely need later optimization.
> >
> > I am not sure there is any promise that process_mrelease will run in
> > parallel with the exiting process. In fact the primary purpose of this
> > syscall is to provide a reliable way to oom kill from user space. If you
> > want to optimize process exit resp. its exit_mmap part then you should
> > be using other means. So I would be careful calling this a regression.
> >
> > I do agree that taking the reference count is the right approach here. I
> > was wrong previously [1] when saying that pinning the mm struct is
> > sufficient. I have completely forgot about the subtle sync in exit_mmap.
> > One way we can approach that would be to take exclusive mmap_sem
> > throughout the exit_mmap unconditionally.
>
> I agree, that would probably be the cleanest way.
>
> > There was a push back against
> > that though so arguments would have to be re-evaluated.
>
> I'll review that discussion to better understand the reasons for the
> push back. Thanks for the link.

Adding Kirill and Andrea.

I had some time to dig some more. The latency increase definitely comes
from process_mrelease calling the last mmput, and exit_aio is especially
problematic. So, currently process_mrelease not only releases memory but
also does more, including waiting for I/O to finish.

Unconditional mmap_write_lock around free_pgtables in exit_mmap seems
to me the most semantically correct way forward and the pushback is on
the basis of regressing performance of the exit path. I would like to
measure that regression to confirm this. I don't have access to a big
machine but will ask someone in another Google team to try the test
Michal wrote here
https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/ on
a server with and without a custom patch.
If the regression is real, then I think we could keep the "if
(unlikely(mm_is_oom_victim(mm)))" condition but wrap free_pgtables
with a conditional mmap_write_lock. To me this is cleaner because it
clearly shows that we are trying to prevent free_pgtables from racing
with any mm readers (the current mmap_write_lock(); mmap_write_unlock()
sequence needs a comment to explain why it is needed). In that case I
would need to reuse MMF_OOM_VICTIM in process_mrelease to avoid
postponing exit_mmap, like the oom-reaper does. Maybe we could rename
MMF_OOM_VICTIM / MMF_OOM_SKIP to something like MMF_RELEASING /
MMF_RELEASED to make them more generic and allow their use outside of
the oom-killer? Again, this is a fallback plan in case the unconditional
mmap_write_lock indeed regresses the exit path.
Any comments/suggestions?
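
A sketch of that fallback, only to show the shape of the change (hypothetical; the conditional lock would replace today's empty mmap_write_lock()/mmap_write_unlock() pair, and process_mrelease would have to set the same victim flag):

	if (unlikely(mm_is_oom_victim(mm))) {
		/* Reap what we can and mark the mm as handled. */
		(void)__oom_reap_task_mm(mm);
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}

	unmap_vmas(&tlb, vma, 0, -1);

	/*
	 * Only when a reaper (oom-reaper or process_mrelease) may still be
	 * operating on this mm, take the lock for write across everything a
	 * reaper can be looking at: page tables and the vma list.  A reaper
	 * must re-check MMF_OOM_SKIP after taking mmap_read_lock() and bail
	 * out if it is set, as the oom-reaper already does.
	 */
	if (unlikely(mm_is_oom_victim(mm)))
		mmap_write_lock(mm);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);
	while (vma) {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		vma = remove_vma(vma);
	}
	if (unlikely(mm_is_oom_victim(mm)))
		mmap_write_unlock(mm);
	vm_unacct_memory(nr_accounted);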


>
> >
> > [1] http://lkml.kernel.org/r/YQzZqFwDP7eUxwcn@dhcp22.suse.cz
> >
> > That being said
> > Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
>
> >
> > Thanks!
> >
> > > Fixes: 884a7e5964e0 ("mm: introduce process_mrelease system call")
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  mm/oom_kill.c | 23 ++++++++++++-----------
> > >  1 file changed, 12 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 831340e7ad8b..989f35a2bbb1 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1150,7 +1150,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > >       struct task_struct *task;
> > >       struct task_struct *p;
> > >       unsigned int f_flags;
> > > -     bool reap = true;
> > > +     bool reap = false;
> > >       struct pid *pid;
> > >       long ret = 0;
> > >
> > > @@ -1177,15 +1177,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > >               goto put_task;
> > >       }
> > >
> > > -     mm = p->mm;
> > > -     mmgrab(mm);
> > > -
> > > -     /* If the work has been done already, just exit with success */
> > > -     if (test_bit(MMF_OOM_SKIP, &mm->flags))
> > > -             reap = false;
> > > -     else if (!task_will_free_mem(p)) {
> > > -             reap = false;
> > > -             ret = -EINVAL;
> > > +     if (mmget_not_zero(p->mm)) {
> > > +             mm = p->mm;
> > > +             if (task_will_free_mem(p))
> > > +                     reap = true;
> > > +             else {
> > > +                     /* Error only if the work has not been done already */
> > > +                     if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> > > +                             ret = -EINVAL;
> > > +             }
> > >       }
> > >       task_unlock(p);
> > >
> > > @@ -1201,7 +1201,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> > >       mmap_read_unlock(mm);
> > >
> > >  drop_mm:
> > > -     mmdrop(mm);
> > > +     if (mm)
> > > +             mmput(mm);
> > >  put_task:
> > >       put_task_struct(task);
> > >  put_pid:
> > > --
> > > 2.33.0.1079.g6e70778dc9-goog
> >
> > --
> > Michal Hocko
> > SUSE Labs
Matthew Wilcox Oct. 27, 2021, 5:33 p.m. UTC | #8
On Wed, Oct 27, 2021 at 09:08:21AM -0700, Suren Baghdasaryan wrote:
> Unconditional mmap_write_lock around free_pgtables in exit_mmap seems
> to me the most semantically correct way forward and the pushback is on
> the basis of regressing performance of the exit path. I would like to
> measure that regression to confirm this. I don't have access to a big
> machine but will ask someone in another Google team to try the test
> Michal wrote here
> https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/ on
> a server with and without a custom patch.

Sorry to hijack this, but could you ask that team to also test this
patch?  I think there's probably a good-sized win here, but I have no
profiles to share at this point.  I've only done light testing, and
it may have bugs.

NB: I only did the exit() path here.  fork() conversion is left as an
exercise for the reader^W^W Liam.

From 5f9daa14a5e58c86a73eccf59abe23d131004926 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Wed, 27 Oct 2021 13:28:35 -0400
Subject: [PATCH] mm: Add vmavec

The vmavec lets us allocate and free batches of VMAs instead of
one at a time.  Should improve fork() and exit() performance.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/vmavec.h | 38 ++++++++++++++++++++++++++++++++++++++
 kernel/fork.c          | 17 +++++++++++++++++
 mm/mmap.c              | 30 ++++++++++++++++++++++--------
 3 files changed, 77 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/vmavec.h

diff --git a/include/linux/vmavec.h b/include/linux/vmavec.h
new file mode 100644
index 000000000000..8a324e2e1258
--- /dev/null
+++ b/include/linux/vmavec.h
@@ -0,0 +1,38 @@
+/*
+ * A vma vector is an array of vm_area_structs, with a counter.
+ */
+
+struct vm_area_struct;
+
+#define VMAVEC_SIZE	15
+
+struct vmavec {
+	unsigned char nr;
+	void *vmas[VMAVEC_SIZE];
+};
+
+#define VMAVEC(name)	struct vmavec name = { }
+
+static inline bool vmavec_full(struct vmavec *vmavec)
+{
+	return vmavec->nr == VMAVEC_SIZE;
+}
+
+static inline bool vmavec_empty(struct vmavec *vmavec)
+{
+	return vmavec->nr == 0;
+}
+
+static inline
+void vmavec_push(struct vmavec *vmavec, struct vm_area_struct *vma)
+{
+	vmavec->vmas[vmavec->nr++] = vma;
+}
+
+static inline struct vm_area_struct *vmavec_pop(struct vmavec *vmavec)
+{
+	return vmavec->vmas[--vmavec->nr];
+}
+
+void vm_area_free_vec(struct vmavec *vmavec);
+void vm_area_alloc_vec(struct mm_struct *mm, struct vmavec *vmavec);
diff --git a/kernel/fork.c b/kernel/fork.c
index 38681ad44c76..ea7e8bd00be8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/vmavec.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -375,6 +376,22 @@ void vm_area_free(struct vm_area_struct *vma)
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
+void vm_area_alloc_vec(struct mm_struct *mm, struct vmavec *vmavec)
+{
+	int i;
+
+	vmavec->nr = kmem_cache_alloc_bulk(vm_area_cachep, GFP_KERNEL,
+				VMAVEC_SIZE, vmavec->vmas);
+	for (i = 0; i < vmavec->nr; i++)
+		vma_init(vmavec->vmas[i], mm);
+}
+
+void vm_area_free_vec(struct vmavec *vmavec)
+{
+	kmem_cache_free_bulk(vm_area_cachep, vmavec->nr, vmavec->vmas);
+	vmavec->nr = 0;
+}
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
 	void *stack = task_stack_page(tsk);
diff --git a/mm/mmap.c b/mm/mmap.c
index 88dcc5c25225..bff4e94eec8c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
 #include <linux/pkeys.h>
 #include <linux/oom.h>
 #include <linux/sched/mm.h>
+#include <linux/vmavec.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -172,19 +173,24 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
-/*
- * Close a vm structure and free it, returning the next.
- */
-static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+static void __remove_vma(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *next = vma->vm_next;
-
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
+}
+
+/*
+ * Close a vm structure and free it, returning the next.
+ */
+static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+{
+	struct vm_area_struct *next = vma->vm_next;
+
+	__remove_vma(vma);
 	vm_area_free(vma);
 	return next;
 }
@@ -3125,6 +3131,7 @@ void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+	VMAVEC(vmavec);
 	unsigned long nr_accounted = 0;
 
 	/* mm's last user has gone, and its about to be pulled down */
@@ -3179,9 +3186,16 @@ void exit_mmap(struct mm_struct *mm)
 	while (vma) {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
-		vma = remove_vma(vma);
-		cond_resched();
+		__remove_vma(vma);
+		vmavec_push(&vmavec, vma);
+		vma = vma->vm_next;
+		if (vmavec_full(&vmavec)) {
+			vm_area_free_vec(&vmavec);
+			cond_resched();
+		}
 	}
+	if (!vmavec_empty(&vmavec))
+		vm_area_free_vec(&vmavec);
 	vm_unacct_memory(nr_accounted);
 }
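
The patch above only wires up the freeing side of the vmavec; a hypothetical consumer of the allocation side (for the fork() conversion it leaves open) could be as small as this -- vma_get() is an invented name, not part of the patch:

/* Hypothetical helper built on the vmavec API above. */
static struct vm_area_struct *vma_get(struct mm_struct *mm, struct vmavec *vv)
{
	if (vmavec_empty(vv))
		vm_area_alloc_vec(mm, vv);	/* refill with one bulk allocation */
	if (vmavec_empty(vv))
		return NULL;			/* bulk allocation failed */
	return vmavec_pop(vv);
}

A dup_mmap()-style loop would call vma_get() per source VMA and hand any leftovers back with vm_area_free_vec() when it finishes.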
Suren Baghdasaryan Oct. 27, 2021, 5:42 p.m. UTC | #9
On Wed, Oct 27, 2021 at 10:35 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 27, 2021 at 09:08:21AM -0700, Suren Baghdasaryan wrote:
> > Unconditional mmap_write_lock around free_pgtables in exit_mmap seems
> > to me the most semantically correct way forward and the pushback is on
> > the basis of regressing performance of the exit path. I would like to
> > measure that regression to confirm this. I don't have access to a big
> > machine but will ask someone in another Google team to try the test
> > Michal wrote here
> > https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/ on
> > a server with and without a custom patch.
>
> Sorry to hijack this, but could you ask that team to also test this
> patch?  I think there's probably a good-sized win here, but I have no
> profiles to share at this point.  I've only done light testing, and
> it may have bugs.
>
> NB: I only did the exit() path here.  fork() conversion is left as an
> exercise for the reader^W^W Liam.

To clarify, this patch does not change the mmap_write_lock portion of
exit_mmap. Do you want to test it in isolation or with the locking
changes in exit_mmap I mentioned?

>
> From 5f9daa14a5e58c86a73eccf59abe23d131004926 Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Wed, 27 Oct 2021 13:28:35 -0400
> Subject: [PATCH] mm: Add vmavec
>
> The vmavec lets us allocate and free batches of VMAs instead of
> one at a time.  Should improve fork() and exit() performance.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/vmavec.h | 38 ++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c          | 17 +++++++++++++++++
>  mm/mmap.c              | 30 ++++++++++++++++++++++--------
>  3 files changed, 77 insertions(+), 8 deletions(-)
>  create mode 100644 include/linux/vmavec.h
>
> diff --git a/include/linux/vmavec.h b/include/linux/vmavec.h
> new file mode 100644
> index 000000000000..8a324e2e1258
> --- /dev/null
> +++ b/include/linux/vmavec.h
> @@ -0,0 +1,38 @@
> +/*
> + * A vma vector is an array of vm_area_structs, with a counter.
> + */
> +
> +struct vm_area_struct;
> +
> +#define VMAVEC_SIZE    15
> +
> +struct vmavec {
> +       unsigned char nr;
> +       void *vmas[VMAVEC_SIZE];
> +};
> +
> +#define VMAVEC(name)   struct vmavec name = { }
> +
> +static inline bool vmavec_full(struct vmavec *vmavec)
> +{
> +       return vmavec->nr == VMAVEC_SIZE;
> +}
> +
> +static inline bool vmavec_empty(struct vmavec *vmavec)
> +{
> +       return vmavec->nr == 0;
> +}
> +
> +static inline
> +void vmavec_push(struct vmavec *vmavec, struct vm_area_struct *vma)
> +{
> +       vmavec->vmas[vmavec->nr++] = vma;
> +}
> +
> +static inline struct vm_area_struct *vmavec_pop(struct vmavec *vmavec)
> +{
> +       return vmavec->vmas[--vmavec->nr];
> +}
> +
> +void vm_area_free_vec(struct vmavec *vmavec);
> +void vm_area_alloc_vec(struct mm_struct *mm, struct vmavec *vmavec);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 38681ad44c76..ea7e8bd00be8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -97,6 +97,7 @@
>  #include <linux/scs.h>
>  #include <linux/io_uring.h>
>  #include <linux/bpf.h>
> +#include <linux/vmavec.h>
>
>  #include <asm/pgalloc.h>
>  #include <linux/uaccess.h>
> @@ -375,6 +376,22 @@ void vm_area_free(struct vm_area_struct *vma)
>         kmem_cache_free(vm_area_cachep, vma);
>  }
>
> +void vm_area_alloc_vec(struct mm_struct *mm, struct vmavec *vmavec)
> +{
> +       int i;
> +
> +       vmavec->nr = kmem_cache_alloc_bulk(vm_area_cachep, GFP_KERNEL,
> +                               VMAVEC_SIZE, vmavec->vmas);
> +       for (i = 0; i < vmavec->nr; i++)
> +               vma_init(vmavec->vmas[i], mm);
> +}
> +
> +void vm_area_free_vec(struct vmavec *vmavec)
> +{
> +       kmem_cache_free_bulk(vm_area_cachep, vmavec->nr, vmavec->vmas);
> +       vmavec->nr = 0;
> +}
> +
>  static void account_kernel_stack(struct task_struct *tsk, int account)
>  {
>         void *stack = task_stack_page(tsk);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 88dcc5c25225..bff4e94eec8c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -47,6 +47,7 @@
>  #include <linux/pkeys.h>
>  #include <linux/oom.h>
>  #include <linux/sched/mm.h>
> +#include <linux/vmavec.h>
>
>  #include <linux/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -172,19 +173,24 @@ void unlink_file_vma(struct vm_area_struct *vma)
>         }
>  }
>
> -/*
> - * Close a vm structure and free it, returning the next.
> - */
> -static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +static void __remove_vma(struct vm_area_struct *vma)
>  {
> -       struct vm_area_struct *next = vma->vm_next;
> -
>         might_sleep();
>         if (vma->vm_ops && vma->vm_ops->close)
>                 vma->vm_ops->close(vma);
>         if (vma->vm_file)
>                 fput(vma->vm_file);
>         mpol_put(vma_policy(vma));
> +}
> +
> +/*
> + * Close a vm structure and free it, returning the next.
> + */
> +static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +{
> +       struct vm_area_struct *next = vma->vm_next;
> +
> +       __remove_vma(vma);
>         vm_area_free(vma);
>         return next;
>  }
> @@ -3125,6 +3131,7 @@ void exit_mmap(struct mm_struct *mm)
>  {
>         struct mmu_gather tlb;
>         struct vm_area_struct *vma;
> +       VMAVEC(vmavec);
>         unsigned long nr_accounted = 0;
>
>         /* mm's last user has gone, and its about to be pulled down */
> @@ -3179,9 +3186,16 @@ void exit_mmap(struct mm_struct *mm)
>         while (vma) {
>                 if (vma->vm_flags & VM_ACCOUNT)
>                         nr_accounted += vma_pages(vma);
> -               vma = remove_vma(vma);
> -               cond_resched();
> +               __remove_vma(vma);
> +               vmavec_push(&vmavec, vma);
> +               vma = vma->vm_next;
> +               if (vmavec_full(&vmavec)) {
> +                       vm_area_free_vec(&vmavec);
> +                       cond_resched();
> +               }
>         }
> +       if (!vmavec_empty(&vmavec))
> +               vm_area_free_vec(&vmavec);
>         vm_unacct_memory(nr_accounted);
>  }
>
> --
> 2.33.0
>
Matthew Wilcox Oct. 27, 2021, 5:51 p.m. UTC | #10
On Wed, Oct 27, 2021 at 10:42:29AM -0700, Suren Baghdasaryan wrote:
> On Wed, Oct 27, 2021 at 10:35 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Oct 27, 2021 at 09:08:21AM -0700, Suren Baghdasaryan wrote:
> > > Unconditional mmap_write_lock around free_pgtables in exit_mmap seems
> > > to me the most semantically correct way forward and the pushback is on
> > > the basis of regressing performance of the exit path. I would like to
> > > measure that regression to confirm this. I don't have access to a big
> > > machine but will ask someone in another Google team to try the test
> > > Michal wrote here
> > > https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/ on
> > > a server with and without a custom patch.
> >
> > Sorry to hijack this, but could you ask that team to also test this
> > patch?  I think there's probably a good-sized win here, but I have no
> > profiles to share at this point.  I've only done light testing, and
> > it may have bugs.
> >
> > NB: I only did the exit() path here.  fork() conversion is left as an
> > exercise for the reader^W^W Liam.
> 
> To clarify, this patch does not change the mmap_write_lock portion of
> exit_mmap. Do you want to test it in isolation or with the locking
> changes in exit_mmap I mentioned?

Correct, it does not.  I think it's interesting to test it in isolation,
but if you want to test it in combination, that could also be
interesting (see if we regain some of the expected performance loss).
I just don't have a NUMA box of my own to test on, so I'm hoping to
exploit your test infrastructure ;-)

By the way, my vmavec patch should also be helpful on small systems
like phones ... ;-)
Suren Baghdasaryan Oct. 27, 2021, 6 p.m. UTC | #11
On Wed, Oct 27, 2021 at 10:52 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 27, 2021 at 10:42:29AM -0700, Suren Baghdasaryan wrote:
> > On Wed, Oct 27, 2021 at 10:35 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Oct 27, 2021 at 09:08:21AM -0700, Suren Baghdasaryan wrote:
> > > > Unconditional mmap_write_lock around free_pgtables in exit_mmap seems
> > > > to me the most semantically correct way forward and the pushback is on
> > > > the basis of regressing performance of the exit path. I would like to
> > > > measure that regression to confirm this. I don't have access to a big
> > > > machine but will ask someone in another Google team to try the test
> > > > Michal wrote here
> > > > https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/ on
> > > > a server with and without a custom patch.
> > >
> > > Sorry to hijack this, but could you ask that team to also test this
> > > patch?  I think there's probably a good-sized win here, but I have no
> > > profiles to share at this point.  I've only done light testing, and
> > > it may have bugs.
> > >
> > > NB: I only did the exit() path here.  fork() conversion is left as an
> > > exercise for the reader^W^W Liam.
> >
> > To clarify, this patch does not change the mmap_write_lock portion of
> > exit_mmap. Do you want to test it in isolation or with the locking
> > changes in exit_mmap I mentioned?
>
> Correct, it does not.  I think it's interesting to test it in isolation,
> but if you want to test it in in combination, that could also be
> interesting (see if we regain some of the expected performance loss).
> I just don't have a NUMA box of my own to test on, so I'm hoping to
> exploit your test infrastructure ;-)
>
> By the way, my vmavec patch should also be helpful on small systems
> like phones ... ;-)

Sounds good. I'll try to queue up the patches so that it's easy to
test them both in isolation and together.
Michal Hocko Oct. 29, 2021, 1:03 p.m. UTC | #12
On Wed 27-10-21 09:08:21, Suren Baghdasaryan wrote:
> On Fri, Oct 22, 2021 at 10:38 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Fri, Oct 22, 2021 at 1:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote:
> > > > Race between process_mrelease and exit_mmap, where free_pgtables is
> > > > called while __oom_reap_task_mm is in progress, leads to kernel crash
> > > > during pte_offset_map_lock call. oom-reaper avoids this race by setting
> > > > MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> > > > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> > > > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> > > > fix this race, however that would be considered a hack. Fix this race
> > > > by elevating mm->mm_users and preventing exit_mmap from executing until
> > > > process_mrelease is finished. Patch slightly refactors the code to adapt
> > > > for a possible mmget_not_zero failure.
> > > > This fix has considerable negative impact on process_mrelease performance
> > > > and will likely need later optimization.
> > >
> > > I am not sure there is any promise that process_mrelease will run in
> > > parallel with the exiting process. In fact the primary purpose of this
> > > syscall is to provide a reliable way to oom kill from user space. If you
> > > want to optimize process exit resp. its exit_mmap part then you should
> > > be using other means. So I would be careful calling this a regression.
> > >
> > > I do agree that taking the reference count is the right approach here. I
> > > was wrong previously [1] when saying that pinning the mm struct is
> > > sufficient. I have completely forgot about the subtle sync in exit_mmap.
> > > One way we can approach that would be to take exclusive mmap_sem
> > > throughout the exit_mmap unconditionally.
> >
> > I agree, that would probably be the cleanest way.
> >
> > > There was a push back against
> > > that though so arguments would have to be re-evaluated.
> >
> > I'll review that discussion to better understand the reasons for the
> > push back. Thanks for the link.
> 
> Adding Kirill and Andrea.
> 
> I had some time to dig some more. The latency increase is definitely
> coming due to process_mrelease calling the last mmput and exit_aio is
> especially problematic. So, currently process_mrelease not only
> releases memory but does more, including waiting for io to finish.

Well, I still do not see why that is a problem. This syscall is meant to
release the address space not to do it fast.

> Unconditional mmap_write_lock around free_pgtables in exit_mmap seems
> to me the most semantically correct way forward and the pushback is on
> the basis of regressing performance of the exit path. I would like to
> measure that regression to confirm this. I don't have access to a big
> machine but will ask someone in another Google team to try the test
> Michal wrote here
> https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/ on
> a server with and without a custom patch.

Well, I do not remember all the details of the discussion but I believe
a rather large part of that discussion was a bit misled. The exit
path - and the last mmput in particular - shouldn't trigger mmap_sem
contention. There are only rare cases where somebody can race and take a
lock then (e.g. proc interfaces taking the lock before mmget_not_zero).
Certainly not something to optimize for and I believe a correct and
robust code should have a preference. As we can see a lack of proper
synchronization has led to 2 very similar problem nobody revealed during
review because the code is just too tricky.

Btw. the above code will not really tell you much on a larger machine
unless you manage to trigger mmap_sem contention. Otherwise you are
measuring the mmap_sem writelock fast path and that should be really
within a noise comparing to the whole address space destruction time. If
that is not the case then we have a real problem with the locking...
Suren Baghdasaryan Oct. 29, 2021, 4:07 p.m. UTC | #13
On Fri, Oct 29, 2021 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 27-10-21 09:08:21, Suren Baghdasaryan wrote:
> > On Fri, Oct 22, 2021 at 10:38 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Fri, Oct 22, 2021 at 1:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 21-10-21 18:46:58, Suren Baghdasaryan wrote:
> > > > > Race between process_mrelease and exit_mmap, where free_pgtables is
> > > > > called while __oom_reap_task_mm is in progress, leads to kernel crash
> > > > > during pte_offset_map_lock call. oom-reaper avoids this race by setting
> > > > > MMF_OOM_VICTIM flag and causing exit_mmap to take and release
> > > > > mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
> > > > > Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
> > > > > fix this race, however that would be considered a hack. Fix this race
> > > > > by elevating mm->mm_users and preventing exit_mmap from executing until
> > > > > process_mrelease is finished. Patch slightly refactors the code to adapt
> > > > > for a possible mmget_not_zero failure.
> > > > > This fix has considerable negative impact on process_mrelease performance
> > > > > and will likely need later optimization.
> > > >
> > > > I am not sure there is any promise that process_mrelease will run in
> > > > parallel with the exiting process. In fact the primary purpose of this
> > > > syscall is to provide a reliable way to oom kill from user space. If you
> > > > want to optimize process exit resp. its exit_mmap part then you should
> > > > be using other means. So I would be careful calling this a regression.
> > > >
> > > > I do agree that taking the reference count is the right approach here. I
> > > > was wrong previously [1] when saying that pinning the mm struct is
> > > > sufficient. I have completely forgot about the subtle sync in exit_mmap.
> > > > One way we can approach that would be to take exclusive mmap_sem
> > > > throughout the exit_mmap unconditionally.
> > >
> > > I agree, that would probably be the cleanest way.
> > >
> > > > There was a push back against
> > > > that though so arguments would have to be re-evaluated.
> > >
> > > I'll review that discussion to better understand the reasons for the
> > > push back. Thanks for the link.
> >
> > Adding Kirill and Andrea.
> >
> > I had some time to dig some more. The latency increase is definitely
> > coming due to process_mrelease calling the last mmput and exit_aio is
> > especially problematic. So, currently process_mrelease not only
> > releases memory but does more, including waiting for io to finish.
>
> Well, I still do not see why that is a problem. This syscall is meant to
> release the address space not to do it fast.

It's the same problem for a userspace memory reaper as for the
oom-reaper. The goal is to release the memory of the victim and to
quickly move on to the next one if needed.

>
> > Unconditional mmap_write_lock around free_pgtables in exit_mmap seems
> > to me the most semantically correct way forward and the pushback is on
> > the basis of regressing performance of the exit path. I would like to
> > measure that regression to confirm this. I don't have access to a big
> > machine but will ask someone in another Google team to try the test
> > Michal wrote here
> > https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/ on
> > a server with and without a custom patch.
>
> Well, I do not remember all the details of the discussion but I believe
> a rather large part of that discussion was a bit misled. The exist
> path - and the last mmput in particular - shouldn't trigger mmap_sem
> contention. There are only rare cases where somebody can race and take a
> lock then (e.g. proc interfaces taking the lock before mmget_notzero).
> Certainly not something to optimize for and I believe a correct and
> robust code should have a preference. As we can see a lack of proper
> synchronization has led to 2 very similar problem nobody revealed during
> review because the code is just too tricky.

I totally agree that this locking is tricky and mmap_sem contention
should be very rare in the exit_mmap path and not worth optimizing.

>
> Btw. the above code will not really tell you much on a larger machine
> unless you manage to trigger mmap_sem contection. Otherwise you are
> measuring the mmap_sem writelock fast path and that should be really
> within a noise comparing to the whole address space destruction time. If
> that is not the case then we have a real problem with the locking...

My understanding of that discussion is that the concern was that even
taking uncontended mmap_sem writelock would regress the exit path.
That was what I wanted to confirm. Am I misreading it?
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 1, 2021, 8:37 a.m. UTC | #14
On Fri 29-10-21 09:07:39, Suren Baghdasaryan wrote:
> On Fri, Oct 29, 2021 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Well, I still do not see why that is a problem. This syscall is meant to
> > release the address space not to do it fast.
> 
> It's the same problem for a userspace memory reaper as for the
> oom-reaper. The goal is to release the memory of the victim and to
> quickly move on to the next one if needed.

The purpose of the oom_reaper is to _guarantee_ forward progress. It
doesn't have to be quick or optimized for speed.
 
[...]

> > Btw. the above code will not really tell you much on a larger machine
> > unless you manage to trigger mmap_sem contection. Otherwise you are
> > measuring the mmap_sem writelock fast path and that should be really
> > within a noise comparing to the whole address space destruction time. If
> > that is not the case then we have a real problem with the locking...
> 
> My understanding of that discussion is that the concern was that even
> taking uncontended mmap_sem writelock would regress the exit path.
> That was what I wanted to confirm. Am I misreading it?

No, your reading matches my recollection. I just think that code
robustness in exchange for a rw semaphore write lock fast path is a
reasonable price to pay even if that has some effect on micro
benchmarks.
Suren Baghdasaryan Nov. 1, 2021, 3:44 p.m. UTC | #15
On Mon, Nov 1, 2021 at 1:37 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 29-10-21 09:07:39, Suren Baghdasaryan wrote:
> > On Fri, Oct 29, 2021 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Well, I still do not see why that is a problem. This syscall is meant to
> > > release the address space not to do it fast.
> >
> > It's the same problem for a userspace memory reaper as for the
> > oom-reaper. The goal is to release the memory of the victim and to
> > quickly move on to the next one if needed.
>
> The purpose of the oom_reaper is to _guarantee_ a forward progress. It
> doesn't have to be quick or optimized for speed.

Fair enough. Then the same guarantees should apply to userspace memory
reapers. I think you clarified that well in your replies in
https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz:

Because there is no _guarantee_ that the final __mmput will release
the memory in finite time. And we cannot guarantee that longterm.
...
__mmput calls into exit_aio and that can wait for completion and there
is no way to guarantee this will finish in finite time.

>
> [...]
>
> > > Btw. the above code will not really tell you much on a larger machine
> > > unless you manage to trigger mmap_sem contection. Otherwise you are
> > > measuring the mmap_sem writelock fast path and that should be really
> > > within a noise comparing to the whole address space destruction time. If
> > > that is not the case then we have a real problem with the locking...
> >
> > My understanding of that discussion is that the concern was that even
> > taking uncontended mmap_sem writelock would regress the exit path.
> > That was what I wanted to confirm. Am I misreading it?
>
> No, your reading match my recollection. I just think that code
> robustness in exchange of a rw semaphore write lock fast path is a
> reasonable price to pay even if that has some effect on micro
> benchmarks.

I'm with you on this one, that's why I wanted to measure the price we
would pay. Below are the test results:

Test: https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/
Compiled: gcc -O2 -static test.c -o test
Test machine: 128 core / 256 thread 2x AMD EPYC 7B12 64-Core Processor
(family 17h)

baseline (Linus master, f31531e55495ca3746fb895ffdf73586be8259fa)
p50 (median)   87412
p95                  168210
p99                  190058
average           97843.8
stdev               29.85%

unconditional mmap_write_lock in exit_mmap (last column is the change
from the baseline)
p50 (median)   88312     +1.03%
p95                  170797   +1.54%
p99                  191813   +0.92%
average           97659.5  -0.19%
stdev               32.41%

unconditional mmap_write_lock in exit_mmap + Matthew's patch (last
column is the change from the baseline)
p50 (median)   88807      +1.60%
p95                  167783     -0.25%
p99                  187853     -1.16%
average           97491.4    -0.36%
stdev               30.61%

stdev is quite high in all cases, so the test is very noisy.
The impact seems quite low IMHO. WDYT?

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Nov. 1, 2021, 7:59 p.m. UTC | #16
On Mon, Nov 1, 2021 at 8:44 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Nov 1, 2021 at 1:37 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 29-10-21 09:07:39, Suren Baghdasaryan wrote:
> > > On Fri, Oct 29, 2021 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > Well, I still do not see why that is a problem. This syscall is meant to
> > > > release the address space not to do it fast.
> > >
> > > It's the same problem for a userspace memory reaper as for the
> > > oom-reaper. The goal is to release the memory of the victim and to
> > > quickly move on to the next one if needed.
> >
> > The purpose of the oom_reaper is to _guarantee_ a forward progress. It
> > doesn't have to be quick or optimized for speed.
>
> Fair enough. Then the same guarantees should apply to userspace memory
> reapers. I think you clarified that well in your replies in
> https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz:
>
> Because there is no _guarantee_ that the final __mmput will release
> the memory in finite time. And we cannot guarantee that longterm.
> ...
> __mmput calls into exit_aio and that can wait for completion and there
> is no way to guarantee this will finish in finite time.
>
> >
> > [...]
> >
> > > > Btw. the above code will not really tell you much on a larger machine
> > > > unless you manage to trigger mmap_sem contection. Otherwise you are
> > > > measuring the mmap_sem writelock fast path and that should be really
> > > > within a noise comparing to the whole address space destruction time. If
> > > > that is not the case then we have a real problem with the locking...
> > >
> > > My understanding of that discussion is that the concern was that even
> > > taking uncontended mmap_sem writelock would regress the exit path.
> > > That was what I wanted to confirm. Am I misreading it?
> >
> > No, your reading match my recollection. I just think that code
> > robustness in exchange of a rw semaphore write lock fast path is a
> > reasonable price to pay even if that has some effect on micro
> > benchmarks.
>
> I'm with you on this one, that's why I wanted to measure the price we
> would pay. Below are the test results:
>
> Test: https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/
> Compiled: gcc -O2 -static test.c -o test
> Test machine: 128 core / 256 thread 2x AMD EPYC 7B12 64-Core Processor
> (family 17h)
>
> baseline (Linus master, f31531e55495ca3746fb895ffdf73586be8259fa)
> p50 (median)   87412
> p95                  168210
> p99                  190058
> average           97843.8
> stdev               29.85%
>
> unconditional mmap_write_lock in exit_mmap (last column is the change
> from the baseline)
> p50 (median)   88312     +1.03%
> p95                  170797   +1.54%
> p99                  191813   +0.92%
> average           97659.5  -0.19%
> stdev               32.41%
>
> unconditional mmap_write_lock in exit_mmap + Matthew's patch (last
> column is the change from the baseline)
> p50 (median)   88807      +1.60%
> p95                  167783     -0.25%
> p99                  187853     -1.16%
> average           97491.4    -0.36%
> stdev               30.61%
>
> stdev is quite high in all cases, so the test is very noisy.

Need to clarify that what I called here "stdev" is actually stdev /
average in %.

> The impact seems quite low IMHO. WDYT?
>
> > --
> > Michal Hocko
> > SUSE Labs
Michal Hocko Nov. 2, 2021, 7:58 a.m. UTC | #17
On Mon 01-11-21 08:44:58, Suren Baghdasaryan wrote:
[...]
> I'm with you on this one, that's why I wanted to measure the price we
> would pay. Below are the test results:
> 
> Test: https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/
> Compiled: gcc -O2 -static test.c -o test
> Test machine: 128 core / 256 thread 2x AMD EPYC 7B12 64-Core Processor
> (family 17h)
> 
> baseline (Linus master, f31531e55495ca3746fb895ffdf73586be8259fa)
> p50 (median)   87412
> p95                  168210
> p99                  190058
> average           97843.8
> stdev               29.85%
> 
> unconditional mmap_write_lock in exit_mmap (last column is the change
> from the baseline)
> p50 (median)   88312     +1.03%
> p95                  170797   +1.54%
> p99                  191813   +0.92%
> average           97659.5  -0.19%
> stdev               32.41%
> 
> unconditional mmap_write_lock in exit_mmap + Matthew's patch (last
> column is the change from the baseline)
> p50 (median)   88807      +1.60%
> p95                  167783     -0.25%
> p99                  187853     -1.16%
> average           97491.4    -0.36%
> stdev               30.61%
> 
> stdev is quite high in all cases, so the test is very noisy.
> The impact seems quite low IMHO. WDYT?

Results being very noisy is what I recall as well. Thanks!
Suren Baghdasaryan Nov. 2, 2021, 3:14 p.m. UTC | #18
On Tue, Nov 2, 2021 at 12:58 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 01-11-21 08:44:58, Suren Baghdasaryan wrote:
> [...]
> > I'm with you on this one, that's why I wanted to measure the price we
> > would pay. Below are the test results:
> >
> > Test: https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/
> > Compiled: gcc -O2 -static test.c -o test
> > Test machine: 128 core / 256 thread 2x AMD EPYC 7B12 64-Core Processor
> > (family 17h)
> >
> > baseline (Linus master, f31531e55495ca3746fb895ffdf73586be8259fa)
> > p50 (median)   87412
> > p95                  168210
> > p99                  190058
> > average           97843.8
> > stdev               29.85%
> >
> > unconditional mmap_write_lock in exit_mmap (last column is the change
> > from the baseline)
> > p50 (median)   88312     +1.03%
> > p95                  170797   +1.54%
> > p99                  191813   +0.92%
> > average           97659.5  -0.19%
> > stdev               32.41%
> >
> > unconditional mmap_write_lock in exit_mmap + Matthew's patch (last
> > column is the change from the baseline)
> > p50 (median)   88807      +1.60%
> > p95                  167783     -0.25%
> > p99                  187853     -1.16%
> > average           97491.4    -0.36%
> > stdev               30.61%
> >
> > stdev is quite high in all cases, so the test is very noisy.
> > The impact seems quite low IMHO. WDYT?
>
> Results being very noisy is what I recall as well. Thanks!

I believe, despite the noise, the percentiles show that overall we do
not noticeably regress the exit path by taking mmap_lock
unconditionally.
If there are no objections, I would like to post a patchset which
implements unconditional locking in exit_mmap() and has
process_mrelease() call __oom_reap_task_mm() under the protection of
the read mmap_lock.
Thanks!

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Nov. 9, 2021, 7:01 p.m. UTC | #19
On Tue, Nov 2, 2021 at 8:14 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Nov 2, 2021 at 12:58 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 01-11-21 08:44:58, Suren Baghdasaryan wrote:
> > [...]
> > > I'm with you on this one, that's why I wanted to measure the price we
> > > would pay. Below are the test results:
> > >
> > > Test: https://lore.kernel.org/all/20170725142626.GJ26723@dhcp22.suse.cz/
> > > Compiled: gcc -O2 -static test.c -o test
> > > Test machine: 128 core / 256 thread 2x AMD EPYC 7B12 64-Core Processor
> > > (family 17h)
> > >
> > > baseline (Linus master, f31531e55495ca3746fb895ffdf73586be8259fa)
> > > p50 (median)   87412
> > > p95                  168210
> > > p99                  190058
> > > average           97843.8
> > > stdev               29.85%
> > >
> > > unconditional mmap_write_lock in exit_mmap (last column is the change
> > > from the baseline)
> > > p50 (median)   88312     +1.03%
> > > p95                  170797   +1.54%
> > > p99                  191813   +0.92%
> > > average           97659.5  -0.19%
> > > stdev               32.41%
> > >
> > > unconditional mmap_write_lock in exit_mmap + Matthew's patch (last
> > > column is the change from the baseline)
> > > p50 (median)   88807      +1.60%
> > > p95                  167783     -0.25%
> > > p99                  187853     -1.16%
> > > average           97491.4    -0.36%
> > > stdev               30.61%
> > >
> > > stdev is quite high in all cases, so the test is very noisy.
> > > The impact seems quite low IMHO. WDYT?
> >
> > Results being very noisy is what I recall as well. Thanks!
>
> I believe, despite the noise, the percentiles show that overall we do
> not noticeably regress the exit path by taking mmap_lock
> unconditionally.
> If there are no objections, I would like to post a patchset which
> implements unconditional locking in exit_mmap() and has
> process_mrelease() call __oom_reap_task_mm() under the protection of
> the read mmap_lock.
> Thanks!

While discussing how the patch I want to post interacts with the maple
tree patchset that Matthew is working on, I've got a question:

IIUC, according to Michal's post here:
https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
unmap_vmas() can race with other mmap_lock read holders (including
oom_reap_task_mm()) with no issues.
The maple tree patchset requires the RCU read lock or the mmap
semaphore to be held (read or write side) when walking the tree,
including inside unmap_vmas(). When asked, Matthew told me that he is
not sure why it's currently "safe" to walk the vma->vm_next list in
unmap_vmas() while another thread is reaping the mm.
Michal (or maybe someone else), could you please clarify why
unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
understanding was wrong?
Thanks,
Suren.



>
> > --
> > Michal Hocko
> > SUSE Labs
Michal Hocko Nov. 9, 2021, 7:26 p.m. UTC | #20
On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
[...]
> Discussing how the patch I want to post works for maple trees that
> Matthew is working on, I've got a question:
> 
> IIUC, according to Michal's post here:
> https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> unmap_vmas() can race with other mmap_lock read holders (including
> oom_reap_task_mm()) with no issues.
> Maple tree patchset requires rcu read lock or the mmap semaphore be
> held (read or write side) when walking the tree, including inside
> unmap_vmas(). When asked, he told me that he is not sure why it's
> currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> another thread is reaping the mm.
> Michal (or maybe someone else), could you please clarify why
> unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> understanding was wrong?

I cannot really comment on the maple tree part. But the existing
synchronization between oom reaper and exit_mmap is based on
- oom_reaper takes mmap_sem for reading
- exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before 
  unmap_vmas.

The oom_reaper therefore can either unmap the address space if the lock
is taken before exit_mmap or it would bail out on MMF_OOM_SKIP
if it takes the lock afterwards. So the reaper cannot race with
unmap_vmas.
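
Schematically, the ordering argument looks like this (a minimal
userspace model of the above, my sketch rather than kernel code: the
rwlock stands in for mmap_lock, the atomic flag for MMF_OOM_SKIP):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static pthread_rwlock_t mmap_lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_int mmf_oom_skip;

/* models oom_reap_task_mm() calling __oom_reap_task_mm() */
static void *reaper(void *arg)
{
	(void)arg;
	if (pthread_rwlock_tryrdlock(&mmap_lock))
		return NULL;		/* contended; the real reaper retries */
	if (atomic_load(&mmf_oom_skip)) {
		/* exit_mmap is already past the barrier: bail out */
		pthread_rwlock_unlock(&mmap_lock);
		return NULL;
	}
	puts("reaper: unmapping under the read lock");
	pthread_rwlock_unlock(&mmap_lock);
	return NULL;
}

/* models the tail of exit_mmap() for an oom victim */
static void *exiter(void *arg)
{
	(void)arg;
	atomic_store(&mmf_oom_skip, 1);
	/* barrier: wait out any reaper that already took the read lock */
	pthread_rwlock_wrlock(&mmap_lock);
	pthread_rwlock_unlock(&mmap_lock);
	puts("exit_mmap: unmap_vmas/free_pgtables, no reaper can be running");
	return NULL;
}

int main(void)
{
	pthread_t r, e;

	pthread_create(&r, NULL, reaper, NULL);
	pthread_create(&e, NULL, exiter, NULL);
	pthread_join(r, NULL);
	pthread_join(e, NULL);
	return 0;
}

Whatever order the two threads run in, the teardown in exiter() can
only proceed once no reaper holds the lock, and any reaper arriving
later sees the flag and backs off.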
Suren Baghdasaryan Nov. 9, 2021, 7:37 p.m. UTC | #21
On Tue, Nov 9, 2021 at 11:26 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
> [...]
> > Discussing how the patch I want to post works for maple trees that
> > Matthew is working on, I've got a question:
> >
> > IIUC, according to Michal's post here:
> > https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> > unmap_vmas() can race with other mmap_lock read holders (including
> > oom_reap_task_mm()) with no issues.
> > Maple tree patchset requires rcu read lock or the mmap semaphore be
> > held (read or write side) when walking the tree, including inside
> > unmap_vmas(). When asked, he told me that he is not sure why it's
> > currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> > another thread is reaping the mm.
> > Michal (or maybe someone else), could you please clarify why
> > unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> > understanding was wrong?
>
> I cannot really comment on the mapple tree part. But the existing
> synchronization between oom reaper and exit_mmap is based on
> - oom_reaper takes mmap_sem for reading
> - exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before
>   unmap_vmas.
>
> The oom_reaper therefore can either unmap the address space if the lock
> is taken before exit_mmap or it would it would bale out on MMF_OOM_SKIP
> if it takes the lock afterwards. So the reaper cannot race with
> unmap_vmas.

I see. So, it's the combination of MMF_OOM_SKIP and mmap_lock working
as a barrier which prevents them from racing with each other...
I wasn't sure how
https://lore.kernel.org/all/20170724072332.31903-1-mhocko@kernel.org/
was implementing this synchronization, because it would take the
mmap_sem write side after unmap_vmas() and IIUC there was no
"mmap_write_lock(); mmap_write_unlock();" sequence in exit_mmap at
that time. I'll need to check out the old sources to figure this out.
Thanks,
Suren.

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 9, 2021, 7:41 p.m. UTC | #22
On Tue 09-11-21 20:26:56, Michal Hocko wrote:
> On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
> [...]
> > Discussing how the patch I want to post works for maple trees that
> > Matthew is working on, I've got a question:
> > 
> > IIUC, according to Michal's post here:
> > https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> > unmap_vmas() can race with other mmap_lock read holders (including
> > oom_reap_task_mm()) with no issues.
> > Maple tree patchset requires rcu read lock or the mmap semaphore be
> > held (read or write side) when walking the tree, including inside
> > unmap_vmas(). When asked, he told me that he is not sure why it's
> > currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> > another thread is reaping the mm.
> > Michal (or maybe someone else), could you please clarify why
> > unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> > understanding was wrong?
> 
> I cannot really comment on the mapple tree part. But the existing
> synchronization between oom reaper and exit_mmap is based on
> - oom_reaper takes mmap_sem for reading
> - exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before 
>   unmap_vmas.
> 
> The oom_reaper therefore can either unmap the address space if the lock
> is taken before exit_mmap or it would it would bale out on MMF_OOM_SKIP
> if it takes the lock afterwards. So the reaper cannot race with
> unmap_vmas.

Forgot to mention that _if_ we can get rid of the nasty unlock;lock
pattern in exit_mmap and simply take the exclusive mmap_sem there from
unmap_vmas onward, then we could get rid of MMF_OOM_SKIP as well,
because oom_reaper would simply have no vmas to iterate through, so the
whole thing would become much easier to follow.
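
Very roughly, the tail of exit_mmap could then look like this (my
sketch against the current code, not compile-tested, with tlb and vma
as in the current function; munlock handling, accounting and the
empty-mm early return are left out):

	mmap_write_lock(mm);
	vma = mm->mmap;
	tlb_gather_mmu_fullmm(&tlb, mm);
	unmap_vmas(&tlb, vma, 0, -1);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);
	while (vma) {
		vma = remove_vma(vma);
		cond_resched();
	}
	/* a late oom_reaper finds no vmas, so no MMF_OOM_SKIP dance needed */
	mm->mmap = NULL;
	mmap_write_unlock(mm);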
Michal Hocko Nov. 9, 2021, 7:50 p.m. UTC | #23
On Tue 09-11-21 11:37:06, Suren Baghdasaryan wrote:
> On Tue, Nov 9, 2021 at 11:26 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
> > [...]
> > > Discussing how the patch I want to post works for maple trees that
> > > Matthew is working on, I've got a question:
> > >
> > > IIUC, according to Michal's post here:
> > > https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> > > unmap_vmas() can race with other mmap_lock read holders (including
> > > oom_reap_task_mm()) with no issues.
> > > Maple tree patchset requires rcu read lock or the mmap semaphore be
> > > held (read or write side) when walking the tree, including inside
> > > unmap_vmas(). When asked, he told me that he is not sure why it's
> > > currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> > > another thread is reaping the mm.
> > > Michal (or maybe someone else), could you please clarify why
> > > unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> > > understanding was wrong?
> >
> > I cannot really comment on the mapple tree part. But the existing
> > synchronization between oom reaper and exit_mmap is based on
> > - oom_reaper takes mmap_sem for reading
> > - exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before
> >   unmap_vmas.
> >
> > The oom_reaper therefore can either unmap the address space if the lock
> > is taken before exit_mmap or it would it would bale out on MMF_OOM_SKIP
> > if it takes the lock afterwards. So the reaper cannot race with
> > unmap_vmas.
> 
> I see. So, it's the combination of MMF_OOM_SKIP and mmap_lock working
> as a barrier which prevent them from racing with each other...
> I wasn't sure how
> https://lore.kernel.org/all/20170724072332.31903-1-mhocko@kernel.org/
> was implementing this synchronization because it would take mmap_sem
> write side after unmap_vmas() and IIUC there was no
> "mmap_lock_write(); mmap_unlock_write();" sequence in exit_mmap at
> that time. I'll need to checkout the old sources to figure this out.

My memory is rather dim, but AFAIR the main problem was freeing page
tables and freeing vmas, not unmap_vmas. That one was not modifying the
vma list. Essentially it was just a slightly modified MADV_DONTNEED, so
that part was allowed to race with oom_reaper.
Suren Baghdasaryan Nov. 9, 2021, 8:02 p.m. UTC | #24
On Tue, Nov 9, 2021 at 11:50 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 09-11-21 11:37:06, Suren Baghdasaryan wrote:
> > On Tue, Nov 9, 2021 at 11:26 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
> > > [...]
> > > > Discussing how the patch I want to post works for maple trees that
> > > > Matthew is working on, I've got a question:
> > > >
> > > > IIUC, according to Michal's post here:
> > > > https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> > > > unmap_vmas() can race with other mmap_lock read holders (including
> > > > oom_reap_task_mm()) with no issues.
> > > > Maple tree patchset requires rcu read lock or the mmap semaphore be
> > > > held (read or write side) when walking the tree, including inside
> > > > unmap_vmas(). When asked, he told me that he is not sure why it's
> > > > currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> > > > another thread is reaping the mm.
> > > > Michal (or maybe someone else), could you please clarify why
> > > > unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> > > > understanding was wrong?
> > >
> > > I cannot really comment on the mapple tree part. But the existing
> > > synchronization between oom reaper and exit_mmap is based on
> > > - oom_reaper takes mmap_sem for reading
> > > - exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before
> > >   unmap_vmas.
> > >
> > > The oom_reaper therefore can either unmap the address space if the lock
> > > is taken before exit_mmap or it would it would bale out on MMF_OOM_SKIP
> > > if it takes the lock afterwards. So the reaper cannot race with
> > > unmap_vmas.
> >
> > I see. So, it's the combination of MMF_OOM_SKIP and mmap_lock working
> > as a barrier which prevent them from racing with each other...
> > I wasn't sure how
> > https://lore.kernel.org/all/20170724072332.31903-1-mhocko@kernel.org/
> > was implementing this synchronization because it would take mmap_sem
> > write side after unmap_vmas() and IIUC there was no
> > "mmap_lock_write(); mmap_unlock_write();" sequence in exit_mmap at
> > that time. I'll need to checkout the old sources to figure this out.
>
> My memory is rather dimm but AFAIR the main problem was freeing page
> tables and freeing vmas not unmap_vmas. That one was no modifying the
> vma list. Essentially it was just a slightly modified madvise don't
> need. So that part was allowed to race with oom_reaper.

So, both unmap_vmas and __oom_reap_task_mm do not modify the vma list
and therefore can execute concurrently. That makes sense, thanks.

Then I guess, if we want to be semantically correct in exit_mmap(), we
would have to take mmap_read_lock before unmap_vmas, then drop it and
take mmap_write_lock before free_pgtables.
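
For concreteness, a rough sketch of that ordering (again against the
current exit_mmap, not compile-tested, with tlb and vma as in the
existing function; munlock, accounting and the empty-mm early return
are elided):

	mmap_read_lock(mm);
	vma = mm->mmap;
	tlb_gather_mmu_fullmm(&tlb, mm);
	/* __oom_reap_task_mm() read-lock holders may still run here */
	unmap_vmas(&tlb, vma, 0, -1);
	mmap_read_unlock(mm);

	/* exclude any remaining reader before the page tables go away */
	mmap_write_lock(mm);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);
	while (vma) {
		vma = remove_vma(vma);
		cond_resched();
	}
	mm->mmap = NULL;
	mmap_write_unlock(mm);

Compared to holding the write lock for the whole teardown, this keeps
unmap_vmas concurrent with the reaper at the cost of one extra lock
transition.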

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 9, 2021, 8:10 p.m. UTC | #25
On Tue 09-11-21 12:02:37, Suren Baghdasaryan wrote:
> On Tue, Nov 9, 2021 at 11:50 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 09-11-21 11:37:06, Suren Baghdasaryan wrote:
> > > On Tue, Nov 9, 2021 at 11:26 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > Discussing how the patch I want to post works for maple trees that
> > > > > Matthew is working on, I've got a question:
> > > > >
> > > > > IIUC, according to Michal's post here:
> > > > > https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> > > > > unmap_vmas() can race with other mmap_lock read holders (including
> > > > > oom_reap_task_mm()) with no issues.
> > > > > Maple tree patchset requires rcu read lock or the mmap semaphore be
> > > > > held (read or write side) when walking the tree, including inside
> > > > > unmap_vmas(). When asked, he told me that he is not sure why it's
> > > > > currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> > > > > another thread is reaping the mm.
> > > > > Michal (or maybe someone else), could you please clarify why
> > > > > unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> > > > > understanding was wrong?
> > > >
> > > > I cannot really comment on the mapple tree part. But the existing
> > > > synchronization between oom reaper and exit_mmap is based on
> > > > - oom_reaper takes mmap_sem for reading
> > > > - exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before
> > > >   unmap_vmas.
> > > >
> > > > The oom_reaper therefore can either unmap the address space if the lock
> > > > is taken before exit_mmap or it would it would bale out on MMF_OOM_SKIP
> > > > if it takes the lock afterwards. So the reaper cannot race with
> > > > unmap_vmas.
> > >
> > > I see. So, it's the combination of MMF_OOM_SKIP and mmap_lock working
> > > as a barrier which prevent them from racing with each other...
> > > I wasn't sure how
> > > https://lore.kernel.org/all/20170724072332.31903-1-mhocko@kernel.org/
> > > was implementing this synchronization because it would take mmap_sem
> > > write side after unmap_vmas() and IIUC there was no
> > > "mmap_lock_write(); mmap_unlock_write();" sequence in exit_mmap at
> > > that time. I'll need to checkout the old sources to figure this out.
> >
> > My memory is rather dimm but AFAIR the main problem was freeing page
> > tables and freeing vmas not unmap_vmas. That one was no modifying the
> > vma list. Essentially it was just a slightly modified madvise don't
> > need. So that part was allowed to race with oom_reaper.
> 
> So, both unmap_vmas and __oom_reap_task_mm do not modify vma list and
> therefore can execute concurrently. That makes sense, thanks.

Yes, those can run concurrently. One thing I completely forgot about is 
27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
which is about interaction with the munlock.
 
> Then I guess, if we want to be semantically correct in exit_mmap(), we
> would have to take mmap_read_lock before unmap_vmas, then drop it and
> take mmap_write_lock before free_pgtables.

I think it would be just more straightforward to take the exclusive lock
for the whole operation.
Suren Baghdasaryan Nov. 9, 2021, 9:10 p.m. UTC | #26
On Tue, Nov 9, 2021 at 12:10 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 09-11-21 12:02:37, Suren Baghdasaryan wrote:
> > On Tue, Nov 9, 2021 at 11:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 09-11-21 11:37:06, Suren Baghdasaryan wrote:
> > > > On Tue, Nov 9, 2021 at 11:26 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > Discussing how the patch I want to post works for maple trees that
> > > > > > Matthew is working on, I've got a question:
> > > > > >
> > > > > > IIUC, according to Michal's post here:
> > > > > > https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> > > > > > unmap_vmas() can race with other mmap_lock read holders (including
> > > > > > oom_reap_task_mm()) with no issues.
> > > > > > Maple tree patchset requires rcu read lock or the mmap semaphore be
> > > > > > held (read or write side) when walking the tree, including inside
> > > > > > unmap_vmas(). When asked, he told me that he is not sure why it's
> > > > > > currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> > > > > > another thread is reaping the mm.
> > > > > > Michal (or maybe someone else), could you please clarify why
> > > > > > unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> > > > > > understanding was wrong?
> > > > >
> > > > > I cannot really comment on the mapple tree part. But the existing
> > > > > synchronization between oom reaper and exit_mmap is based on
> > > > > - oom_reaper takes mmap_sem for reading
> > > > > - exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before
> > > > >   unmap_vmas.
> > > > >
> > > > > The oom_reaper therefore can either unmap the address space if the lock
> > > > > is taken before exit_mmap or it would it would bale out on MMF_OOM_SKIP
> > > > > if it takes the lock afterwards. So the reaper cannot race with
> > > > > unmap_vmas.
> > > >
> > > > I see. So, it's the combination of MMF_OOM_SKIP and mmap_lock working
> > > > as a barrier which prevent them from racing with each other...
> > > > I wasn't sure how
> > > > https://lore.kernel.org/all/20170724072332.31903-1-mhocko@kernel.org/
> > > > was implementing this synchronization because it would take mmap_sem
> > > > write side after unmap_vmas() and IIUC there was no
> > > > "mmap_lock_write(); mmap_unlock_write();" sequence in exit_mmap at
> > > > that time. I'll need to checkout the old sources to figure this out.
> > >
> > > My memory is rather dimm but AFAIR the main problem was freeing page
> > > tables and freeing vmas not unmap_vmas. That one was no modifying the
> > > vma list. Essentially it was just a slightly modified madvise don't
> > > need. So that part was allowed to race with oom_reaper.
> >
> > So, both unmap_vmas and __oom_reap_task_mm do not modify vma list and
> > therefore can execute concurrently. That makes sense, thanks.
>
> Yes, those can run concurrently. One thing I completely forgot about is
> 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> which is about interaction with the munlock.

Thanks for pointing it out. IIUC, ideally we want to get rid of all
these special cases and replace them with proper locking. If so, I'll
see what I can do here.

>
> > Then I guess, if we want to be semantically correct in exit_mmap(), we
> > would have to take mmap_read_lock before unmap_vmas, then drop it and
> > take mmap_write_lock before free_pgtables.
>
> I think it would be just more straightforward to take the exclusive lock
> for the whole operation.

Ok, but note that this will prevent concurrent memory reaping, so it
will likely affect the speed at which memory is released during
oom-kill. I saw a measurable difference when testing process_mrelease
with mmap_write_lock placed before vs after unmap_vmas. If we take
mmap_read_lock before unmap_vmas and mmap_write_lock after it, then
there won't be such an issue. You indicated that the speed of memory
release should not be the deciding factor here, but I want to make it
clear before proceeding.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Nov. 11, 2021, 1:49 a.m. UTC | #27
On Tue, Nov 9, 2021 at 1:10 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Nov 9, 2021 at 12:10 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 09-11-21 12:02:37, Suren Baghdasaryan wrote:
> > > On Tue, Nov 9, 2021 at 11:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 09-11-21 11:37:06, Suren Baghdasaryan wrote:
> > > > > On Tue, Nov 9, 2021 at 11:26 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Tue 09-11-21 11:01:02, Suren Baghdasaryan wrote:
> > > > > > [...]
> > > > > > > Discussing how the patch I want to post works for maple trees that
> > > > > > > Matthew is working on, I've got a question:
> > > > > > >
> > > > > > > IIUC, according to Michal's post here:
> > > > > > > https://lore.kernel.org/all/20170725154514.GN26723@dhcp22.suse.cz,
> > > > > > > unmap_vmas() can race with other mmap_lock read holders (including
> > > > > > > oom_reap_task_mm()) with no issues.
> > > > > > > Maple tree patchset requires rcu read lock or the mmap semaphore be
> > > > > > > held (read or write side) when walking the tree, including inside
> > > > > > > unmap_vmas(). When asked, he told me that he is not sure why it's
> > > > > > > currently "safe" to walk the vma->vm_next list in unmap_vmas() while
> > > > > > > another thread is reaping the mm.
> > > > > > > Michal (or maybe someone else), could you please clarify why
> > > > > > > unmap_vmas() can safely race with oom_reap_task_mm()? Or maybe my
> > > > > > > understanding was wrong?
> > > > > >
> > > > > > I cannot really comment on the mapple tree part. But the existing
> > > > > > synchronization between oom reaper and exit_mmap is based on
> > > > > > - oom_reaper takes mmap_sem for reading
> > > > > > - exit_mmap sets MMF_OOM_SKIP and takes the exclusive mmap_sem before
> > > > > >   unmap_vmas.
> > > > > >
> > > > > > The oom_reaper therefore can either unmap the address space if the lock
> > > > > > is taken before exit_mmap or it would it would bale out on MMF_OOM_SKIP
> > > > > > if it takes the lock afterwards. So the reaper cannot race with
> > > > > > unmap_vmas.
> > > > >
> > > > > I see. So, it's the combination of MMF_OOM_SKIP and mmap_lock working
> > > > > as a barrier which prevent them from racing with each other...
> > > > > I wasn't sure how
> > > > > https://lore.kernel.org/all/20170724072332.31903-1-mhocko@kernel.org/
> > > > > was implementing this synchronization because it would take mmap_sem
> > > > > write side after unmap_vmas() and IIUC there was no
> > > > > "mmap_lock_write(); mmap_unlock_write();" sequence in exit_mmap at
> > > > > that time. I'll need to checkout the old sources to figure this out.
> > > >
> > > > My memory is rather dimm but AFAIR the main problem was freeing page
> > > > tables and freeing vmas not unmap_vmas. That one was no modifying the
> > > > vma list. Essentially it was just a slightly modified madvise don't
> > > > need. So that part was allowed to race with oom_reaper.
> > >
> > > So, both unmap_vmas and __oom_reap_task_mm do not modify vma list and
> > > therefore can execute concurrently. That makes sense, thanks.
> >
> > Yes, those can run concurrently. One thing I completely forgot about is
> > 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > which is about interaction with the munlock.

Argh! The interaction with munlock that you mentioned requires us to
take mmap_write_lock before munlock_vma_pages_all, and that prevents
__oom_reap_task_mm from running concurrently with unmap_vmas. The
reapers would not be as effective as they are now after such a change
:(

>
> Thanks for pointing it out. IIUC, ideally we want to get rid of all
> these special cases and replace them with proper locking. If so, I'll
> see what I can do here.
>
> >
> > > Then I guess, if we want to be semantically correct in exit_mmap(), we
> > > would have to take mmap_read_lock before unmap_vmas, then drop it and
> > > take mmap_write_lock before free_pgtables.
> >
> > I think it would be just more straightforward to take the exclusive lock
> > for the whole operation.
>
> Ok, but note that this will prevent concurrent memory reaping, so will
> likely affect the speed at which memory is released during oom-kill. I
> saw measurable difference when testing process_mrelease placing
> mmap_write_lock before vs after unmap_vmas. If we take mmap_read_lock
> before unmap_vmas and mmap_write_lock after it, then there won't be
> such issue. You indicated that the speed of memory release should not
> be the deciding factor here but I want to make it clear before
> proceeding.
> Thanks,
> Suren.
>
> > --
> > Michal Hocko
> > SUSE Labs
Michal Hocko Nov. 11, 2021, 9:20 a.m. UTC | #28
On Wed 10-11-21 17:49:37, Suren Baghdasaryan wrote:
> On Tue, Nov 9, 2021 at 1:10 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Nov 9, 2021 at 12:10 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > Yes, those can run concurrently. One thing I completely forgot about is
> > > 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > which is about interaction with the munlock.
> 
> Agrh! This interaction with the munlock you mentioned requires us to
> take mmap_write_lock before munlock_vma_pages_all and that prevents
> __oom_reap_task_mm from running concurrently with unmap_vmas. The
> reapers would not be as effective as they are now after such a change
> :(

__oom_reap_task_mm will not run concurrently with unmap_vmas even
with the current code. The mmap_sem barrier right before the munlock
code prevents that.
Suren Baghdasaryan Nov. 11, 2021, 3:02 p.m. UTC | #29
On Thu, Nov 11, 2021 at 1:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 10-11-21 17:49:37, Suren Baghdasaryan wrote:
> > On Tue, Nov 9, 2021 at 1:10 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Nov 9, 2021 at 12:10 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > Yes, those can run concurrently. One thing I completely forgot about is
> > > > 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > > which is about interaction with the munlock.
> >
> > Agrh! This interaction with the munlock you mentioned requires us to
> > take mmap_write_lock before munlock_vma_pages_all and that prevents
> > __oom_reap_task_mm from running concurrently with unmap_vmas. The
> > reapers would not be as effective as they are now after such a change
> > :(
>
> __oom_reap_task_mm will not run concurrently with unmap_vmas even
> with the current code. The mmap_sem barrier right before munlock code
> prevents that.

You are right, it will run concurrently with the __oom_reap_task_mm
call in exit_mmap. But I thought we wanted to get rid of that call to
__oom_reap_task_mm in exit_mmap, or did I misunderstand?

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Nov. 12, 2021, 8:58 a.m. UTC | #30
On Thu 11-11-21 07:02:42, Suren Baghdasaryan wrote:
> On Thu, Nov 11, 2021 at 1:20 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 10-11-21 17:49:37, Suren Baghdasaryan wrote:
> > > On Tue, Nov 9, 2021 at 1:10 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Tue, Nov 9, 2021 at 12:10 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > Yes, those can run concurrently. One thing I completely forgot about is
> > > > > 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > > > which is about interaction with the munlock.
> > >
> > > Agrh! This interaction with the munlock you mentioned requires us to
> > > take mmap_write_lock before munlock_vma_pages_all and that prevents
> > > __oom_reap_task_mm from running concurrently with unmap_vmas. The
> > > reapers would not be as effective as they are now after such a change
> > > :(
> >
> > __oom_reap_task_mm will not run concurrently with unmap_vmas even
> > with the current code. The mmap_sem barrier right before munlock code
> > prevents that.
> 
> You are right, it will run concurrently with another
> __oom_reap_task_mm in the exit_mmap. But I thought we wanted to get
> rid of that call to __oom_reap_task_mm in exit_mmap or did I
> misunderstand?

I do not remember that being an objective or the motivation. IIRC we
wanted to make the locking more robust, which would help your
process_mrelease use case. That one currently suffers from a much
heavier cost if it turns out to be the last holder of the reference
count on the address space.
Suren Baghdasaryan Nov. 12, 2021, 4 p.m. UTC | #31
On Fri, Nov 12, 2021 at 12:58 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Thu 11-11-21 07:02:42, Suren Baghdasaryan wrote:
> > On Thu, Nov 11, 2021 at 1:20 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 10-11-21 17:49:37, Suren Baghdasaryan wrote:
> > > > On Tue, Nov 9, 2021 at 1:10 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Tue, Nov 9, 2021 at 12:10 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > > Yes, those can run concurrently. One thing I completely forgot about is
> > > > > > 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
> > > > > > which is about interaction with the munlock.
> > > >
> > > > Agrh! This interaction with the munlock you mentioned requires us to
> > > > take mmap_write_lock before munlock_vma_pages_all and that prevents
> > > > __oom_reap_task_mm from running concurrently with unmap_vmas. The
> > > > reapers would not be as effective as they are now after such a change
> > > > :(
> > >
> > > __oom_reap_task_mm will not run concurrently with unmap_vmas even
> > > with the current code. The mmap_sem barrier right before munlock code
> > > prevents that.
> >
> > You are right, it will run concurrently with another
> > __oom_reap_task_mm in the exit_mmap. But I thought we wanted to get
> > rid of that call to __oom_reap_task_mm in exit_mmap or did I
> > misunderstand?
>
> I do not remember this to be objective or the motivation. IIRC we wanted
> to make the locking more robust which would help your process_mrelease
> use case. This one currently suffers from a much heavier cost if it
> turns out to be the last holder of the reference count on the address
> space.

Ok, I wrongly assumed the mmap_lock cleanup should be deeper. Will
keep pounding on it. Thanks!

> --
> Michal Hocko
> SUSE Labs
>
diff mbox series

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..989f35a2bbb1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1150,7 +1150,7 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	struct task_struct *task;
 	struct task_struct *p;
 	unsigned int f_flags;
-	bool reap = true;
+	bool reap = false;
 	struct pid *pid;
 	long ret = 0;
 
@@ -1177,15 +1177,15 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		goto put_task;
 	}
 
-	mm = p->mm;
-	mmgrab(mm);
-
-	/* If the work has been done already, just exit with success */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
-		reap = false;
-	else if (!task_will_free_mem(p)) {
-		reap = false;
-		ret = -EINVAL;
+	if (mmget_not_zero(p->mm)) {
+		mm = p->mm;
+		if (task_will_free_mem(p))
+			reap = true;
+		else {
+			/* Error only if the work has not been done already */
+			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+				ret = -EINVAL;
+		}
 	}
 	task_unlock(p);
 
@@ -1201,7 +1201,8 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	mmap_read_unlock(mm);
 
 drop_mm:
-	mmdrop(mm);
+	if (mm)
+		mmput(mm);
 put_task:
 	put_task_struct(task);
 put_pid: