[-mm] mm, oom: remove oom_lock from exit_mmap

Message ID 20180713142612.GD19960@dhcp22.suse.cz (mailing list archive)
State New, archived

Commit Message

Michal Hocko July 13, 2018, 2:26 p.m. UTC
On Thu 12-07-18 14:34:00, David Rientjes wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0fe4087d5151..e6328cef090f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>  	 * Tell all users of get_user/copy_from_user etc... that the content
>  	 * is no longer stable. No barriers really needed because unmapping
>  	 * should imply barriers already and the reader would hit a page fault
> -	 * if it stumbled over a reaped memory.
> +	 * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
> +	 * reaping has already occurred so there is nothing left to do.
>  	 */
> -	set_bit(MMF_UNSTABLE, &mm->flags);
> +	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
> +		return;

This could lead to premature OOM victim selection:
oom_reaper			exiting victim
oom_reap_task			exit_mmap
  __oom_reap_task_mm		  __oom_reap_task_mm
				    test_and_set_bit(MMF_UNSTABLE) # wins the race
  test_and_set_bit(MMF_UNSTABLE)
set_bit(MMF_OOM_SKIP) # new victim can be selected now.

Besides that, why should we back off in the first place? The two can
race without any problems AFAICS. We already have proper
synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
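
For illustration, the quoted hunk reduces to the following minimal
sketch (based only on the patch quoted above, not on the exact -mm
code): test_and_set_bit() atomically sets MMF_UNSTABLE and returns its
previous value, so the loser of the race returns immediately while the
winner may still be unmapping.

  	/* sketch of the quoted hunk */
  	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
  		return;	/* bit already set: the other thread reaps */
  	/* ... proceed to unmap the address space ... */

The diagram above shows why this is dangerous: the oom_reaper side goes
on to set MMF_OOM_SKIP even though its own __oom_reap_task_mm() did no
work, so a new victim can be selected before the exiting task has
actually freed its memory.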

Comments

Tetsuo Handa July 13, 2018, 9:18 p.m. UTC | #1
On 2018/07/13 23:26, Michal Hocko wrote:
> On Thu 12-07-18 14:34:00, David Rientjes wrote:
> [...]
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 0fe4087d5151..e6328cef090f 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>>  	 * Tell all users of get_user/copy_from_user etc... that the content
>>  	 * is no longer stable. No barriers really needed because unmapping
>>  	 * should imply barriers already and the reader would hit a page fault
>> -	 * if it stumbled over a reaped memory.
>> +	 * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
>> +	 * reaping has already occurred so there is nothing left to do.
>>  	 */
>> -	set_bit(MMF_UNSTABLE, &mm->flags);
>> +	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
>> +		return;
> 
> This could lead to premature OOM victim selection:
> oom_reaper			exiting victim
> oom_reap_task			exit_mmap
>   __oom_reap_task_mm		  __oom_reap_task_mm
> 				    test_and_set_bit(MMF_UNSTABLE) # wins the race
>   test_and_set_bit(MMF_UNSTABLE)
> set_bit(MMF_OOM_SKIP) # new victim can be selected now.
> 
> Besides that, why should we back off in the first place? The two can
> race without any problems AFAICS. We already have proper
> synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index fc41c0543d7f..4642964f7741 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
>  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
>  		 * reliably test it.
>  		 */
> -		mutex_lock(&oom_lock);
>  		__oom_reap_task_mm(mm);
> -		mutex_unlock(&oom_lock);
>  
>  		set_bit(MMF_OOM_SKIP, &mm->flags);

David and Michal are using different versions as a baseline here.
David is making changes on top of the timeout-based backoff (in linux-next.git),
which inappropriately tries to use MMF_UNSTABLE for two purposes.

Michal is making changes on top of the current code (in linux.git), which does not
address David's concern.

My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) makes
changes on top of the current code and also provides oom-badness-based
backoff in order to address David's concern.

>  		down_write(&mm->mmap_sem);

Anyway, I suggest doing

  mutex_lock(&oom_lock);
  set_bit(MMF_OOM_SKIP, &mm->flags);
  mutex_unlock(&oom_lock);

as I mentioned at
http://lkml.kernel.org/r/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp
even if we make changes on top of linux-next's timeout-based backoff.
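
Spelled out against the exit_mmap() tail quoted above, the suggestion
is roughly (a sketch against the Linus-tree baseline, not a tested
patch):

  	if (unlikely(mm_is_oom_victim(mm))) {
  		/* the reaping itself no longer needs oom_lock */
  		__oom_reap_task_mm(mm);

  		/*
  		 * Serialize only the flag update, so that an
  		 * out_of_memory() call which already holds oom_lock
  		 * finishes before MMF_OOM_SKIP becomes visible.
  		 */
  		mutex_lock(&oom_lock);
  		set_bit(MMF_OOM_SKIP, &mm->flags);
  		mutex_unlock(&oom_lock);
  	}
  	down_write(&mm->mmap_sem);
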
Michal Hocko July 16, 2018, 6:13 a.m. UTC | #2
On Sat 14-07-18 06:18:58, Tetsuo Handa wrote:
> On 2018/07/13 23:26, Michal Hocko wrote:
> > On Thu 12-07-18 14:34:00, David Rientjes wrote:
> > [...]
> >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >> index 0fe4087d5151..e6328cef090f 100644
> >> --- a/mm/oom_kill.c
> >> +++ b/mm/oom_kill.c
> >> @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
> >>  	 * Tell all users of get_user/copy_from_user etc... that the content
> >>  	 * is no longer stable. No barriers really needed because unmapping
> >>  	 * should imply barriers already and the reader would hit a page fault
> >> -	 * if it stumbled over a reaped memory.
> >> +	 * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
> >> +	 * reaping has already occurred so there is nothing left to do.
> >>  	 */
> >> -	set_bit(MMF_UNSTABLE, &mm->flags);
> >> +	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
> >> +		return;
> > 
> > This could lead to premature OOM victim selection:
> > oom_reaper			exiting victim
> > oom_reap_task			exit_mmap
> >   __oom_reap_task_mm		  __oom_reap_task_mm
> > 				    test_and_set_bit(MMF_UNSTABLE) # wins the race
> >   test_and_set_bit(MMF_UNSTABLE)
> > set_bit(MMF_OOM_SKIP) # new victim can be selected now.
> > 
> > Besides that, why should we back off in the first place? The two can
> > race without any problems AFAICS. We already have proper
> > synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index fc41c0543d7f..4642964f7741 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
> >  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> >  		 * reliably test it.
> >  		 */
> > -		mutex_lock(&oom_lock);
> >  		__oom_reap_task_mm(mm);
> > -		mutex_unlock(&oom_lock);
> >  
> >  		set_bit(MMF_OOM_SKIP, &mm->flags);
> 
> David and Michal are using different versions as a baseline here.
> David is making changes on top of the timeout-based backoff (in linux-next.git),
> which inappropriately tries to use MMF_UNSTABLE for two purposes.
> 
> Michal is making changes on top of the current code (in linux.git), which does not
> address David's concern.

Yes, I have based it on top of the Linus tree because the point of this
patch is to get rid of the locking which is no longer needed. I do not
see what concern you are talking about.
> 
> My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) makes
> changes on top of the current code and also provides oom-badness-based
> backoff in order to address David's concern.
> 
> >  		down_write(&mm->mmap_sem);
> 
> Anyway, I suggest doing
> 
>   mutex_lock(&oom_lock);
>   set_bit(MMF_OOM_SKIP, &mm->flags);
>   mutex_unlock(&oom_lock);

Why do we need it?

> as I mentioned at
> http://lkml.kernel.org/r/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp
> even if we make changes on top of linux-next's timeout-based backoff.

says:
: (3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
:     which the current thread should wait for.
[...]
: Regarding (A), we can reduce the range oom_lock serializes from
: "__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", because oom_lock is useful for (3).

But why is a lock needed for this? This doesn't make much sense to me.
If we do not have MMF_OOM_SKIP set we should still have mm_is_oom_victim,
so no new task should be selected. If we race with the oom reaper then
OK, we would just not select a new victim and retry later.
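
For context, the check being referred to lives in the victim selection
path; trimmed to the relevant lines (a sketch of oom_evaluate_task()
from mm/oom_kill.c of that era, not a verbatim copy):

  	/*
  	 * An existing victim blocks selection of a new one until its
  	 * mm gets MMF_OOM_SKIP; until then the scan is aborted.
  	 */
  	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
  		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
  			goto next;	/* reaped; a new victim may be picked */
  		goto abort;		/* wait for this victim instead */
  	}
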
Tetsuo Handa July 16, 2018, 7:04 a.m. UTC | #3
On 2018/07/16 15:13, Michal Hocko wrote:
> On Sat 14-07-18 06:18:58, Tetsuo Handa wrote:
>>> @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
>>>  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
>>>  		 * reliably test it.
>>>  		 */
>>> -		mutex_lock(&oom_lock);
>>>  		__oom_reap_task_mm(mm);
>>> -		mutex_unlock(&oom_lock);
>>>  
>>>  		set_bit(MMF_OOM_SKIP, &mm->flags);
>>
>> David and Michal are using different versions as a baseline here.
>> David is making changes on top of the timeout-based backoff (in linux-next.git),
>> which inappropriately tries to use MMF_UNSTABLE for two purposes.
>>
>> Michal is making changes on top of the current code (in linux.git), which does not
>> address David's concern.
> 
> Yes, I have based it on top of the Linus tree because the point of this
> patch is to get rid of the locking which is no longer needed. I do not
> see what concern you are talking about.

I'm saying that your patch does not work on linux-next.git
because David's patch already did s/MMF_OOM_SKIP/MMF_UNSTABLE/ .

>>
>> My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) makes
>> changes on top of the current code and also provides oom-badness-based
>> backoff in order to address David's concern.
>>
>>>  		down_write(&mm->mmap_sem);
>>
>> Anyway, I suggest doing
>>
>>   mutex_lock(&oom_lock);
>>   set_bit(MMF_OOM_SKIP, &mm->flags);
>>   mutex_unlock(&oom_lock);
> 
> Why do we need it?
> 
>> as I mentioned at
>> http://lkml.kernel.org/r/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp
>> even if we make changes on top of linux-next's timeout-based backoff.
> 
> says:
> : (3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
> :     which the current thread should wait for.
> [...]
> : Regarding (A), we can reduce the range oom_lock serializes from
> : "__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", because oom_lock is useful for (3).
> 
> But why is a lock needed for this? This doesn't make much sense to me.
> If we do not have MMF_OOM_SKIP set we should still have mm_is_oom_victim,
> so no new task should be selected. If we race with the oom reaper then
> OK, we would just not select a new victim and retry later.
> 

How does mm_is_oom_victim() help? mm_is_oom_victim() is used by exit_mmap()
to decide whether the current thread should call __oom_reap_task_mm().

I'm talking about the sequence below (i.e. after returning from __oom_reap_task_mm()).

  CPU 0                                   CPU 1
  
  mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
  get_page_from_freelist() fails.
  Enters out_of_memory().

                                          __oom_reap_task_mm() reclaims some memory.
                                          Sets MMF_OOM_SKIP.

  select_bad_process() selects new victim because MMF_OOM_SKIP is already set.
  Kills a new OOM victim without retrying last second allocation attempt.
  Leaves out_of_memory().
  mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.

If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce the
last-second allocation attempt like below.

  CPU 0                                   CPU 1
  
  mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
  get_page_from_freelist() fails.
  Enters out_of_memory().

                                          __oom_reap_task_mm() reclaims some memory.
                                          mutex_lock(&oom_lock);

  select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
  Leaves out_of_memory().
  mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.

                                          Sets MMF_OOM_SKIP.
                                          mutex_unlock(&oom_lock);

  get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
  Saved one OOM victim from being needlessly killed.

That is, guarding the setting of MMF_OOM_SKIP works like synchronize_rcu(): it waits for anybody
who has already acquired (or started waiting for) oom_lock to release oom_lock, in order to
prevent select_bad_process() from needlessly selecting a new OOM victim.
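
The allocator side that this argument relies on looks roughly as
follows (a trimmed sketch of __alloc_pages_may_oom() from
mm/page_alloc.c; parameters and error handling omitted):

  	/* the whole OOM path runs under oom_lock */
  	if (!mutex_trylock(&oom_lock)) {
  		*did_some_progress = 1;
  		schedule_timeout_uninterruptible(1);
  		return NULL;
  	}

  	/* last-second attempt: memory may have been freed meanwhile */
  	page = get_page_from_freelist(gfp_mask, order,
  				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
  	if (!page && out_of_memory(&oc))
  		*did_some_progress = 1;

  	mutex_unlock(&oom_lock);

With set_bit(MMF_OOM_SKIP) also taking oom_lock, the flag can only
change between two complete out_of_memory() calls, never in the middle
of one, which is what gives the synchronize_rcu()-like behavior
described above.
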
Michal Hocko July 16, 2018, 7:44 a.m. UTC | #4
On Mon 16-07-18 16:04:26, Tetsuo Handa wrote:
> On 2018/07/16 15:13, Michal Hocko wrote:
> > On Sat 14-07-18 06:18:58, Tetsuo Handa wrote:
> >>> @@ -3073,9 +3073,7 @@ void exit_mmap(struct mm_struct *mm)
> >>>  		 * which clears VM_LOCKED, otherwise the oom reaper cannot
> >>>  		 * reliably test it.
> >>>  		 */
> >>> -		mutex_lock(&oom_lock);
> >>>  		__oom_reap_task_mm(mm);
> >>> -		mutex_unlock(&oom_lock);
> >>>  
> >>>  		set_bit(MMF_OOM_SKIP, &mm->flags);
> >>
> >> David and Michal are using different versions as a baseline here.
> >> David is making changes on top of the timeout-based backoff (in linux-next.git),
> >> which inappropriately tries to use MMF_UNSTABLE for two purposes.
> >>
> >> Michal is making changes on top of the current code (in linux.git), which does not
> >> address David's concern.
> > 
> > Yes, I have based it on top of the Linus tree because the point of this
> > patch is to get rid of the locking which is no longer needed. I do not
> > see what concern you are talking about.
> 
> I'm saying that your patch does not work on linux-next.git
> because David's patch already did s/MMF_OOM_SKIP/MMF_UNSTABLE/ .

This patch has been nacked by me AFAIR so I assume it should be dropped
from the mmotm tree.
 
> >> My version ( https://marc.info/?l=linux-mm&m=153119509215026 ) makes
> >> changes on top of the current code and also provides oom-badness-based
> >> backoff in order to address David's concern.
> >>
> >>>  		down_write(&mm->mmap_sem);
> >>
> >> Anyway, I suggest doing
> >>
> >>   mutex_lock(&oom_lock);
> >>   set_bit(MMF_OOM_SKIP, &mm->flags);
> >>   mutex_unlock(&oom_lock);
> > 
> > Why do we need it?
> > 
> >> as I mentioned at
> >> http://lkml.kernel.org/r/201807130620.w6D6KiAJ093010@www262.sakura.ne.jp
> >> even if we make changes on top of linux-next's timeout-based backoff.
> > 
> > says:
> > : (3) Prevent selecting a new OOM victim when there is an !MMF_OOM_SKIP mm
> > :     which the current thread should wait for.
> > [...]
> > : Regarding (A), we can reduce the range oom_lock serializes from
> > : "__oom_reap_task_mm()" to "setting MMF_OOM_SKIP", because oom_lock is useful for (3).
> > 
> > But why is a lock needed for this? This doesn't make much sense to me.
> > If we do not have MMF_OOM_SKIP set we should still have mm_is_oom_victim,
> > so no new task should be selected. If we race with the oom reaper then
> > OK, we would just not select a new victim and retry later.
> > 
> 
> How does mm_is_oom_victim() help? mm_is_oom_victim() is used by exit_mmap()
> to decide whether the current thread should call __oom_reap_task_mm().
> 
> I'm talking about the sequence below (i.e. after returning from __oom_reap_task_mm()).
> 
>   CPU 0                                   CPU 1
>   
>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
>   get_page_from_freelist() fails.
>   Enters out_of_memory().
> 
>                                           __oom_reap_task_mm() reclaims some memory.
>                                           Sets MMF_OOM_SKIP.
> 
>   select_bad_process() selects new victim because MMF_OOM_SKIP is already set.
>   Kills a new OOM victim without retrying last second allocation attempt.
>   Leaves out_of_memory().
>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.

OK, that wasn't clear from your wording above, as you explicitly
mentioned an !MMF_OOM_SKIP mm.

> If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce the
> last-second allocation attempt like below.
>
>   CPU 0                                   CPU 1
>   
>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
>   get_page_from_freelist() fails.
>   Enters out_of_memory().
> 
>                                           __oom_reap_task_mm() reclaims some memory.
>                                           mutex_lock(&oom_lock);
> 
>   select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
>   Leaves out_of_memory().
>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.
> 
>                                           Sets MMF_OOM_SKIP.
>                                           mutex_unlock(&oom_lock);
> 
>   get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
>   Saved one OOM victim from being needlessly killed.
> 
> That is, guarding the setting of MMF_OOM_SKIP works like synchronize_rcu(): it waits for anybody
> who has already acquired (or started waiting for) oom_lock to release oom_lock, in order to
> prevent select_bad_process() from needlessly selecting a new OOM victim.

Hmm, is this a practical problem though? Do we really need to have a
broader locking context just to defeat this race? How about making this
a separate patch with some data justifying it?
Tetsuo Handa July 16, 2018, 10:38 a.m. UTC | #5
On 2018/07/16 16:44, Michal Hocko wrote:
>> If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce the
>> last-second allocation attempt like below.
>>
>>   CPU 0                                   CPU 1
>>   
>>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
>>   get_page_from_freelist() fails.
>>   Enters out_of_memory().
>>
>>                                           __oom_reap_task_mm() reclaims some memory.
>>                                           mutex_lock(&oom_lock);
>>
>>   select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
>>   Leaves out_of_memory().
>>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.
>>
>>                                           Sets MMF_OOM_SKIP.
>>                                           mutex_unlock(&oom_lock);
>>
>>   get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
>>   Saved one OOM victim from being needlessly killed.
>>
>> That is, guarding the setting of MMF_OOM_SKIP works like synchronize_rcu(): it waits for anybody
>> who has already acquired (or started waiting for) oom_lock to release oom_lock, in order to
>> prevent select_bad_process() from needlessly selecting a new OOM victim.
> 
> Hmm, is this a practical problem though? Do we really need to have a
> broader locking context just to defeat this race?

Yes, because you think that select_bad_process() might take a long time. It is possible
that MMF_OOM_SKIP is set while the owner of oom_lock is preempted. The window in which
select_bad_process() finds an mm that got MMF_OOM_SKIP immediately before examining
that mm is not that small.

>                                                   How about making this
> a separate patch with some data justifying it?
> 

No. We won't be able to get data until we let people test released
kernels. I don't want to again get reports like
http://lkml.kernel.org/r/1495034780-9520-1-git-send-email-guro@fb.com
caused by not guarding MMF_OOM_SKIP.
Michal Hocko July 16, 2018, 11:15 a.m. UTC | #6
On Mon 16-07-18 19:38:21, Tetsuo Handa wrote:
> On 2018/07/16 16:44, Michal Hocko wrote:
> >> If setting MMF_OOM_SKIP is guarded by oom_lock, we can enforce the
> >> last-second allocation attempt like below.
> >>
> >>   CPU 0                                   CPU 1
> >>   
> >>   mutex_trylock(&oom_lock) in __alloc_pages_may_oom() succeeds.
> >>   get_page_from_freelist() fails.
> >>   Enters out_of_memory().
> >>
> >>                                           __oom_reap_task_mm() reclaims some memory.
> >>                                           mutex_lock(&oom_lock);
> >>
> >>   select_bad_process() does not select new victim because MMF_OOM_SKIP is not yet set.
> >>   Leaves out_of_memory().
> >>   mutex_unlock(&oom_lock) in __alloc_pages_may_oom() is called.
> >>
> >>                                           Sets MMF_OOM_SKIP.
> >>                                           mutex_unlock(&oom_lock);
> >>
> >>   get_page_from_freelist() likely succeeds before reaching __alloc_pages_may_oom() again.
> >>   Saved one OOM victim from being needlessly killed.
> >>
> >> That is, guarding the setting of MMF_OOM_SKIP works like synchronize_rcu(): it waits for anybody
> >> who has already acquired (or started waiting for) oom_lock to release oom_lock, in order to
> >> prevent select_bad_process() from needlessly selecting a new OOM victim.
> > 
> > Hmm, is this a practical problem though? Do we really need to have a
> > broader locking context just to defeat this race?
> 
> Yes, because you think that select_bad_process() might take a long time. It is possible
> that MMF_OOM_SKIP is set while the owner of oom_lock is preempted. The window in which
> select_bad_process() finds an mm that got MMF_OOM_SKIP immediately before examining
> that mm is not that small.

I only care whether the race is practical to hit. And that is why I would
like the simplification first (i.e. drop the oom_lock in the oom_reaper
path) and then follow up with some decent justification on top.
David Rientjes July 17, 2018, 4:14 a.m. UTC | #7
On Fri, 13 Jul 2018, Michal Hocko wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0fe4087d5151..e6328cef090f 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -488,9 +488,11 @@ void __oom_reap_task_mm(struct mm_struct *mm)
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> >  	 * is no longer stable. No barriers really needed because unmapping
> >  	 * should imply barriers already and the reader would hit a page fault
> > -	 * if it stumbled over a reaped memory.
> > +	 * if it stumbled over a reaped memory. If MMF_UNSTABLE is already set,
> > +	 * reaping has already occurred so there is nothing left to do.
> >  	 */
> > -	set_bit(MMF_UNSTABLE, &mm->flags);
> > +	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
> > +		return;
> 
> This could lead to premature OOM victim selection:
> oom_reaper			exiting victim
> oom_reap_task			exit_mmap
>   __oom_reap_task_mm		  __oom_reap_task_mm
> 				    test_and_set_bit(MMF_UNSTABLE) # wins the race
>   test_and_set_bit(MMF_UNSTABLE)
> set_bit(MMF_OOM_SKIP) # new victim can be selected now.
> 

This is not the current state of the code in the -mm tree: MMF_OOM_SKIP 
only gets set by the oom reaper when the timeout has expired and the 
victim has failed to free memory in the exit path.

> Besides that, why should we back off in the first place? The two can
> race without any problems AFAICS. We already have proper
> synchronization between the two due to mmap_sem and MMF_OOM_SKIP.
> 

test_and_set_bit() here is not strictly required; I thought it was better 
since any unmapping done in this context is going to be handled by 
whichever thread set MMF_UNSTABLE.
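
Schematically, the difference is only in who proceeds after the race
(a sketch, not tree-specific code):

  	/* plain set_bit(): both racing threads keep unmapping */
  	set_bit(MMF_UNSTABLE, &mm->flags);

  	/* test_and_set_bit(): only the thread that flipped the bit
  	 * proceeds; the loser leaves the unmapping to the winner */
  	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
  		return;
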
David Rientjes July 17, 2018, 4:22 a.m. UTC | #8
On Sat, 14 Jul 2018, Tetsuo Handa wrote:

> David is making changes on top of the timeout-based backoff (in linux-next.git),
> which inappropriately tries to use MMF_UNSTABLE for two purposes.
> 

If you believe there is a problem with the use of MMF_UNSTABLE as it sits 
in -mm, please follow up directly in the thread that proposed the patch.  
I have seen two replies to that thread from you: one that incorporates it 
into your work, and one that links to a version of my patch in your 
patchset.  I haven't seen a concern raised about the use of MMF_UNSTABLE, 
but perhaps it's somewhere in the 10,000 other emails about the oom 
killer.

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index fc41c0543d7f..4642964f7741 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3073,9 +3073,7 @@  void exit_mmap(struct mm_struct *mm)
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
 		 * reliably test it.
 		 */
-		mutex_lock(&oom_lock);
 		__oom_reap_task_mm(mm);
-		mutex_unlock(&oom_lock);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
 		down_write(&mm->mmap_sem);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 32e6f7becb40..f11108af122d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -529,28 +529,9 @@  void __oom_reap_task_mm(struct mm_struct *mm)
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		return false;
 	}
 
 	/*
@@ -562,7 +543,7 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	if (mm_has_blockable_invalidate_notifiers(mm)) {
 		up_read(&mm->mmap_sem);
 		schedule_timeout_idle(HZ);
-		goto unlock_oom;
+		return true;
 	}
 
 	/*
@@ -589,9 +570,7 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	up_read(&mm->mmap_sem);
 
 	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10