
mm/oom_kill: wake futex waiters before annihilating victim shared mutex

Message ID 20211207214902.772614-1-jsavitz@redhat.com (mailing list archive)
State New
Series: mm/oom_kill: wake futex waiters before annihilating victim shared mutex

Commit Message

Joel Savitz Dec. 7, 2021, 9:49 p.m. UTC
In the case that two or more processes share a futex located within
a shared mmaped region, such as a process that shares a lock between
itself and a number of child processes, we have observed that when
a process holding the lock is oom killed, at least one waiter is never
alerted to this new development and simply continues to wait.

This is visible via pthreads by checking the __owner field of the
pthread_mutex_t structure within a waiting process, perhaps with gdb.

We identify reproduction of this issue by checking a waiting process of
a test program and viewing the contents of the pthread_mutex_t, taking note
of the value in the owner field, and then checking dmesg to see if the
owner has already been killed.

This issue can be tricky to reproduce, but with the modifications of
this small patch, I have found it to be impossible to reproduce. There
may be additional considerations that I have not taken into account in
this patch and I welcome any comments and criticism.

Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)
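
[Editor's illustration] A minimal user-space sketch of the kind of setup the
commit message describes: a robust, process-shared pthread mutex placed in a
MAP_SHARED mapping and contended by a parent and a forked child that gets oom
killed. This is a simplified assumption for illustration only, not the posted
reproducer (linked below in the comments); all APIs used are standard
glibc/POSIX calls.

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *lock;

    /* Put the mutex (and hence its futex word) in shared memory. */
    lock = mmap(NULL, sizeof(*lock), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (lock == MAP_FAILED)
        return 1;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(lock, &attr);

    if (fork() == 0) {
        /* Child: take the lock, then allocate until the OOM killer fires. */
        pthread_mutex_lock(lock);
        for (;;) {
            void *p = malloc(1 << 20);

            if (p)
                memset(p, 1, 1 << 20);
        }
    }

    sleep(1);   /* crude: give the child time to take the lock first */

    /*
     * With a robust mutex, this lock attempt should return EOWNERDEAD
     * once the child has been oom killed; in the failure mode described
     * above, the waiter instead blocks here forever.
     */
    if (pthread_mutex_lock(lock) == EOWNERDEAD)
        pthread_mutex_consistent(lock);

    pthread_mutex_unlock(lock);
    wait(NULL);
    return 0;
}

Build with -pthread. A stuck waiter can then be examined as the commit message
suggests, for example in gdb by printing lock->__data.__owner (glibc's internal
name for the __owner field) and comparing that TID against the oom-kill record
in dmesg.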

Comments

Joel Savitz Dec. 7, 2021, 10:32 p.m. UTC | #1
Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
Andrew Morton Dec. 7, 2021, 11:47 p.m. UTC | #3
(cc's added)

On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:

> In the case that two or more processes share a futex located within
> a shared mmaped region, such as a process that shares a lock between
> itself and a number of child processes, we have observed that when
> a process holding the lock is oom killed, at least one waiter is never
> alerted to this new development and simply continues to wait.

Well dang.  Is there any way of killing off that waiting process, or do
we have a resource leak here?

> This is visible via pthreads by checking the __owner field of the
> pthread_mutex_t structure within a waiting process, perhaps with gdb.
> 
> We identify reproduction of this issue by checking a waiting process of
> a test program and viewing the contents of the pthread_mutex_t, taking note
> of the value in the owner field, and then checking dmesg to see if the
> owner has already been killed.
> 
> This issue can be tricky to reproduce, but with the modifications of
> this small patch, I have found it to be impossible to reproduce. There
> may be additional considerations that I have not taken into account in
> this patch and I welcome any comments and criticism.

> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> ---
>  mm/oom_kill.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ddabefcfb5a..fa58bd10a0df 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -44,6 +44,7 @@
>  #include <linux/kthread.h>
>  #include <linux/init.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/futex.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -890,6 +891,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  	 * in order to prevent the OOM victim from depleting the memory
>  	 * reserves from the user space under its control.
>  	 */
> +	futex_exit_release(victim);
>  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>  	mark_oom_victim(victim);
>  	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> @@ -930,6 +932,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  		 */
>  		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
> +		futex_exit_release(p);
>  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
>  	}
>  	rcu_read_unlock();
> -- 
> 2.33.1
Nico Pache Dec. 8, 2021, 12:46 a.m. UTC | #4
On 12/7/21 18:47, Andrew Morton wrote:
> (cc's added)
> 
> On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> 
>> In the case that two or more processes share a futex located within
>> a shared mmaped region, such as a process that shares a lock between
>> itself and a number of child processes, we have observed that when
>> a process holding the lock is oom killed, at least one waiter is never
>> alerted to this new development and simply continues to wait.
> 
> Well dang.  Is there any way of killing off that waiting process, or do
> we have a resource leak here?

If I understood your question correctly, there is a way to recover the system by
killing the process that is utilizing the futex; however, the purpose of robust
futexes is to avoid having to do this.

From my work with Joel on this it seems like a race is occurring between the
oom_reaper and the exit signal sent to the OOM'd process. By calling
futex_exit_release() before these signals are sent, we avoid this.

> 
>> This is visible via pthreads by checking the __owner field of the
>> pthread_mutex_t structure within a waiting process, perhaps with gdb.
>>
>> We identify reproduction of this issue by checking a waiting process of
>> a test program and viewing the contents of the pthread_mutex_t, taking note
>> of the value in the owner field, and then checking dmesg to see if the
>> owner has already been killed.
>>
>> This issue can be tricky to reproduce, but with the modifications of
>> this small patch, I have found it to be impossible to reproduce. There
>> may be additional considerations that I have not taken into account in
>> this patch and I welcome any comments and criticism.
> 
>> Co-developed-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
>> ---
>>  mm/oom_kill.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 1ddabefcfb5a..fa58bd10a0df 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -44,6 +44,7 @@
>>  #include <linux/kthread.h>
>>  #include <linux/init.h>
>>  #include <linux/mmu_notifier.h>
>> +#include <linux/futex.h>
>>  
>>  #include <asm/tlb.h>
>>  #include "internal.h"
>> @@ -890,6 +891,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>>  	 * in order to prevent the OOM victim from depleting the memory
>>  	 * reserves from the user space under its control.
>>  	 */
>> +	futex_exit_release(victim);
>>  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>>  	mark_oom_victim(victim);
>>  	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
>> @@ -930,6 +932,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>>  		 */
>>  		if (unlikely(p->flags & PF_KTHREAD))
>>  			continue;
>> +		futex_exit_release(p);
>>  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
>>  	}
>>  	rcu_read_unlock();
>> -- 
>> 2.33.1
>
Andrew Morton Dec. 8, 2021, 1:58 a.m. UTC | #5
On Tue, 7 Dec 2021 19:46:57 -0500 Nico Pache <npache@redhat.com> wrote:

> 
> 
> On 12/7/21 18:47, Andrew Morton wrote:
> > (cc's added)
> > 
> > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > 
> >> In the case that two or more processes share a futex located within
> >> a shared mmaped region, such as a process that shares a lock between
> >> itself and a number of child processes, we have observed that when
> >> a process holding the lock is oom killed, at least one waiter is never
> >> alerted to this new development and simply continues to wait.
> > 
> > Well dang.  Is there any way of killing off that waiting process, or do
> > we have a resource leak here?
> 
> If I understood your question correctly, there is a way to recover the system by
> killing the process that is utilizing the futex; however, the purpose of robust
> futexes is to avoid having to do this.

OK.  My concern was whether we have a way in which userspace can
permanently leak memory, which opens a (lame) form of denial-of-service
attack.

> From my work with Joel on this it seems like a race is occurring between the
> oom_reaper and the exit signal sent to the OOM'd process. By calling
> futex_exit_release() before these signals are sent, we avoid this.

OK.  It would be nice if the patch had some comments explaining *why*
we're doing this strange futex thing here.  Although that wouldn't be
necessary if futex_exit_release() was documented...
Joel Savitz Dec. 8, 2021, 3:38 a.m. UTC | #6
On Tue, Dec 7, 2021 at 8:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 7 Dec 2021 19:46:57 -0500 Nico Pache <npache@redhat.com> wrote:
>
> >
> >
> > On 12/7/21 18:47, Andrew Morton wrote:
> > > (cc's added)
> > >
> > > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > >
> > >> In the case that two or more processes share a futex located within
> > >> a shared mmaped region, such as a process that shares a lock between
> > >> itself and a number of child processes, we have observed that when
> > >> a process holding the lock is oom killed, at least one waiter is never
> > >> alerted to this new development and simply continues to wait.
> > >
> > > Well dang.  Is there any way of killing off that waiting process, or do
> > > we have a resource leak here?
> >
> > If I understood your question correctly, there is a way to recover the system by
> > killing the process that is utilizing the futex; however, the purpose of robust
> > futexes is to avoid having to do this.
>
> OK.  My concern was whether we have a way in which userspace can
> permanently leak memory, which opens a (lame) form of denial-of-service
> attack.
I believe the resources are freed when the process is killed so to my
knowledge there is no resource leak in the case we were investigating.

> > From my work with Joel on this it seems like a race is occurring between the
> > oom_reaper and the exit signal sent to the OOM'd process. By calling
> > futex_exit_release() before these signals are sent, we avoid this.
>
> OK.  It would be nice if the patch had some comments explaining *why*
> we're doing this strange futex thing here.  Although that wouldn't be
> necessary if futex_exit_release() was documented...
>
Sounds good, will send a v2 tomorrow

Best,
Joel Savitz
Michal Hocko Dec. 8, 2021, 9:01 a.m. UTC | #7
On Tue 07-12-21 15:47:59, Andrew Morton wrote:
> (cc's added)

Extend CC to have all futex maintainers on board.
 
> On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> 
> > In the case that two or more processes share a futex located within
> > a shared mmaped region, such as a process that shares a lock between
> > itself and a number of child processes, we have observed that when
> > a process holding the lock is oom killed, at least one waiter is never
> > alerted to this new development and simply continues to wait.
> 
> Well dang.  Is there any way of killing off that waiting process, or do
> we have a resource leak here?
> 
> > This is visible via pthreads by checking the __owner field of the
> > pthread_mutex_t structure within a waiting process, perhaps with gdb.
> > 
> > We identify reproduction of this issue by checking a waiting process of
> > a test program and viewing the contents of the pthread_mutex_t, taking note
> > of the value in the owner field, and then checking dmesg to see if the
> > owner has already been killed.
> > 
> > This issue can be tricky to reproduce, but with the modifications of
> > this small patch, I have found it to be impossible to reproduce. There
> > may be additional considerations that I have not taken into account in
> > this patch and I welcome any comments and criticism.

Why does the OOM killer need special handling? All the oom killer does is
to send a fatal signal to the victim. Why is this any different from
sending SIGKILL from userspace?

> > Co-developed-by: Nico Pache <npache@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> > ---
> >  mm/oom_kill.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 1ddabefcfb5a..fa58bd10a0df 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -44,6 +44,7 @@
> >  #include <linux/kthread.h>
> >  #include <linux/init.h>
> >  #include <linux/mmu_notifier.h>
> > +#include <linux/futex.h>
> >  
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -890,6 +891,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  	 * in order to prevent the OOM victim from depleting the memory
> >  	 * reserves from the user space under its control.
> >  	 */
> > +	futex_exit_release(victim);
> >  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
> >  	mark_oom_victim(victim);
> >  	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> > @@ -930,6 +932,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  		 */
> >  		if (unlikely(p->flags & PF_KTHREAD))
> >  			continue;
> > +		futex_exit_release(p);
> >  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
> >  	}
> >  	rcu_read_unlock();
> > -- 
> > 2.33.1
Michal Hocko Dec. 8, 2021, 4:05 p.m. UTC | #8
On Wed 08-12-21 10:01:44, Michal Hocko wrote:
> On Tue 07-12-21 15:47:59, Andrew Morton wrote:
> > (cc's added)
> 
> Extend CC to have all futex maintainers on board.
>  
> > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > 
> > > In the case that two or more processes share a futex located within
> > > a shared mmaped region, such as a process that shares a lock between
> > > itself and a number of child processes, we have observed that when
> > > a process holding the lock is oom killed, at least one waiter is never
> > > alerted to this new development and simply continues to wait.
> > 
> > Well dang.  Is there any way of killing off that waiting process, or do
> > we have a resource leak here?
> > 
> > > This is visible via pthreads by checking the __owner field of the
> > > pthread_mutex_t structure within a waiting process, perhaps with gdb.
> > > 
> > > We identify reproduction of this issue by checking a waiting process of
> > > a test program and viewing the contents of the pthread_mutex_t, taking note
> > > of the value in the owner field, and then checking dmesg to see if the
> > > owner has already been killed.
> > > 
> > > This issue can be tricky to reproduce, but with the modifications of
> > > this small patch, I have found it to be impossible to reproduce. There
> > > may be additional considerations that I have not taken into account in
> > > this patch and I welcome any comments and criticism.
> 
> Why does the OOM killer need special handling? All the oom killer does is
> to send a fatal signal to the victim. Why is this any different from
> sending SIGKILL from userspace?

I have had a closer look and I guess I can see what you are trying to
achieve. futex_exit_release is normally called from exit_mm context. You
are likely seeing a situation when the oom victim is blocked and cannot
exit. That is certainly possible but it shouldn't be a permanent state.
So I would be more interested about your particular issue and how long
the task has been stuck unable to exit.

Whether this is safe to be called from the oom killer context I cannot
really judge. That would be a question to Futex folks.
Joel Savitz Dec. 9, 2021, 2:59 a.m. UTC | #9
On Wed, Dec 8, 2021 at 11:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 08-12-21 10:01:44, Michal Hocko wrote:
> > On Tue 07-12-21 15:47:59, Andrew Morton wrote:
> > > (cc's added)
> >
> > Extend CC to have all futex maintainers on board.
> >
> > > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > >
> > > > In the case that two or more processes share a futex located within
> > > > a shared mmaped region, such as a process that shares a lock between
> > > > itself and a number of child processes, we have observed that when
> > > > a process holding the lock is oom killed, at least one waiter is never
> > > > alerted to this new development and simply continues to wait.
> > >
> > > Well dang.  Is there any way of killing off that waiting process, or do
> > > we have a resource leak here?
> > >
> > > > This is visible via pthreads by checking the __owner field of the
> > > > pthread_mutex_t structure within a waiting process, perhaps with gdb.
> > > >
> > > > We identify reproduction of this issue by checking a waiting process of
> > > > a test program and viewing the contents of the pthread_mutex_t, taking note
> > > > of the value in the owner field, and then checking dmesg to see if the
> > > > owner has already been killed.
> > > >
> > > > This issue can be tricky to reproduce, but with the modifications of
> > > > this small patch, I have found it to be impossible to reproduce. There
> > > > may be additional considerations that I have not taken into account in
> > > > this patch and I welcome any comments and criticism.
> >
> > Why does the OOM killer need special handling? All the oom killer does is
> > to send a fatal signal to the victim. Why is this any different from
> > sending SIGKILL from userspace?
>
> I have had a closer look and I guess I can see what you are trying to
> achieve. futex_exit_release is normally called from exit_mm context. You
> are likely seeing a situation when the oom victim is blocked and cannot
> exit. That is certainly possible but it shouldn't be a permanent state.
> So I would be more interested about your particular issue and how long
> the task has been stuck unable to exit.

Before applying this patch I never saw a task eventually exit during
the reproduction of this system state.
Every task in this waiting-on-a-dead-owner state appeared to be
permanently blocked until it was killed manually.

>
> Whether this is safe to be called from the oom killer context I cannot
> really judge. That would be a question to Futex folks.

I am also very interested in feedback from the Futex folks.
This is the first fix for the bug that I have found, but I am not sure
whether it introduces other issues due to the context.

> --
> Michal Hocko
> SUSE Labs
>

Best,
Joel Savitz
Michal Hocko Dec. 9, 2021, 7:51 a.m. UTC | #10
On Wed 08-12-21 21:59:29, Joel Savitz wrote:
> On Wed, Dec 8, 2021 at 11:05 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > I have had a closer look and I guess I can see what you are trying to
> > achieve. futex_exit_release is normally called from exit_mm context. You
> > are likely seeing a situation when the oom victim is blocked and cannot
> > exit. That is certainly possible but it shouldn't be a permanent state.
> > So I would be more interested about your particular issue and how long
> > the task has been stuck unable to exit.
> 
> Before applying this patch I never saw a task eventually exit during
> the reproduction of this system state.

What has happened to the oom victim and why it has never exited?

Side note. I have noticed that you have sent v2 with minor changes. It
is usually better to resolve review feedback before posting a new
version. Discussion gets rather hard to follow otherwise.

Thanks!
Joel Savitz Jan. 14, 2022, 2:39 p.m. UTC | #11
> What has happened to the oom victim and why it has never exited?

What appears to happen is that the oom victim is sent SIGKILL by the
process that triggers the oom while also being marked as an oom
victim.

As you mention in your patchset introducing the oom reaper in commit
aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
oom reaper is to try to free memory more quickly than it would
otherwise be freed, by assuming anonymous or swapped out pages
won't be needed in the exit path as the owner is already dying.
However, this assumption is violated by the futex_cleanup() path,
which needs access to userspace in fetch_robust_entry() when it is
called in exit_robust_list(). Trace_printk()s in this failure path
reveal an apparent race between the oom reaper thread reaping the
victim's mm and the futex_cleanup() path. There may be other ways that
this race manifests but we have been most consistently able to trace
that one.
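
[Editor's illustration] A rough sketch of the suspected interleaving, based on
the description above; the exact ordering is an assumption reconstructed from
the functions named in this thread, not a verified call sequence:

    oom killer / oom_reaper                 oom victim
    -----------------------                 ----------
    __oom_kill_process()
      do_send_sig_info(SIGKILL, ...)
      wake_oom_reaper(victim)
                                            handles SIGKILL, enters the
                                            exit path
    oom_reaper thread:
      reaps the victim's anonymous
      memory, which futex cleanup
      still needs in order to walk
      the robust list
                                            futex_cleanup()
                                              exit_robust_list()
                                                fetch_robust_entry() can no
                                                longer read the list, so the
                                                waiters sharing the futex are
                                                never woken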

Since in the case of an oom victim using robust futexes the core
assumption of the oom reaper is violated, we propose to solve this
problem by either canceling or delaying the waking of the oom reaper
thread by wake_oom_reaper in the case that tsk->robust_list is
non-NULL.

e.g. the bug does not reproduce with this patch (from npache@redhat.com):

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 989f35a2bbb1..b8c518fdcf4d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -665,6 +665,19 @@ static void wake_oom_reaper(struct task_struct *tsk)
        if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
                return;

+#ifdef CONFIG_FUTEX
+       /*
+        * don't wake the oom_reaper thread if we still have a robust
+        * list to handle
+        * This will then rely on the sigkill to handle the cleanup of memory
+        */
+       if(tsk->robust_list)
+               return;
+#ifdef CONFIG_COMPAT
+       if(tsk->compat_robust_list)
+               return;
+#endif
+#endif
+
        get_task_struct(tsk);

        spin_lock(&oom_reaper_lock);

Best,
Joel Savitz
Waiman Long Jan. 14, 2022, 2:55 p.m. UTC | #12
On 1/14/22 09:39, Joel Savitz wrote:
>> What has happened to the oom victim and why it has never exited?
> What appears to happen is that the oom victim is sent SIGKILL by the
> process that triggers the oom while also being marked as an oom
> victim.
>
> As you mention in your patchset introducing the oom reaper in commit
> aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
> oom reaper is to try to free memory more quickly than it would
> otherwise be freed, by assuming anonymous or swapped out pages
> won't be needed in the exit path as the owner is already dying.
> However, this assumption is violated by the futex_cleanup() path,
> which needs access to userspace in fetch_robust_entry() when it is
> called in exit_robust_list(). Trace_printk()s in this failure path
> reveal an apparent race between the oom reaper thread reaping the
> victim's mm and the futex_cleanup() path. There may be other ways that
> this race manifests but we have been most consistently able to trace
> that one.
>
> Since in the case of an oom victim using robust futexes the core
> assumption of the oom reaper is violated, we propose to solve this
> problem by either canceling or delaying the waking of the oom reaper
> thread by wake_oom_reaper in the case that tsk->robust_list is
> non-NULL.
>
> e.g. the bug does not reproduce with this patch (from npache@redhat.com):
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 989f35a2bbb1..b8c518fdcf4d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -665,6 +665,19 @@ static void wake_oom_reaper(struct task_struct *tsk)
>          if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>                  return;
>
> +#ifdef CONFIG_FUTEX
> +       /*
> +        * don't wake the oom_reaper thread if we still have a robust
> +        * list to handle
> +        * This will then rely on the sigkill to handle the cleanup of memory
> +        */
> +       if(tsk->robust_list)
> +               return;
> +#ifdef CONFIG_COMPAT
> +       if(tsk->compat_robust_list)
> +               return;
> +#endif
> +#endif
> +
>          get_task_struct(tsk);
>
>          spin_lock(&oom_reaper_lock);

OK, that can explain why the robust futex is not properly cleaned up. 
Could you post a more formal v2 patch with a description of the
possible race condition?

Cheers,
Longman
Waiman Long Jan. 14, 2022, 2:58 p.m. UTC | #13
On 1/14/22 09:55, Waiman Long wrote:
> On 1/14/22 09:39, Joel Savitz wrote:
>>> What has happened to the oom victim and why it has never exited?
>> What appears to happen is that the oom victim is sent SIGKILL by the
>> process that triggers the oom while also being marked as an oom
>> victim.
>>
>> As you mention in your patchset introducing the oom reaper in commit
>> aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
>> oom reaper is to try to free memory more quickly than it would
>> otherwise be freed, by assuming anonymous or swapped out pages
>> won't be needed in the exit path as the owner is already dying.
>> However, this assumption is violated by the futex_cleanup() path,
>> which needs access to userspace in fetch_robust_entry() when it is
>> called in exit_robust_list(). Trace_printk()s in this failure path
>> reveal an apparent race between the oom reaper thread reaping the
>> victim's mm and the futex_cleanup() path. There may be other ways that
>> this race manifests but we have been most consistently able to trace
>> that one.
>>
>> Since in the case of an oom victim using robust futexes the core
>> assumption of the oom reaper is violated, we propose to solve this
>> problem by either canceling or delaying the waking of the oom reaper
>> thread by wake_oom_reaper in the case that tsk->robust_list is
>> non-NULL.
>>
>> e.g. the bug does not reproduce with this patch (from npache@redhat.com):
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 989f35a2bbb1..b8c518fdcf4d 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -665,6 +665,19 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>          if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>                  return;
>>
>> +#ifdef CONFIG_FUTEX
>> +       /*
>> +        * don't wake the oom_reaper thread if we still have a robust
>> +        * list to handle
>> +        * This will then rely on the sigkill to handle the cleanup of memory
>> +        */
>> +       if(tsk->robust_list)
>> +               return;
>> +#ifdef CONFIG_COMPAT
>> +       if(tsk->compat_robust_list)
>> +               return;
>> +#endif
>> +#endif
>> +
>>          get_task_struct(tsk);
>>
>>          spin_lock(&oom_reaper_lock);
>
> OK, that can explain why the robust futex is not properly cleaned up. 
> Could you post a more formal v2 patch with a description of the
> possible race condition?
>
It should be v3. Sorry for the mix-up.

Cheers,
Longman
Michal Hocko Jan. 17, 2022, 11:33 a.m. UTC | #14
I have only noticed your email now after replying to v3 so our emails
have crossed.

On Fri 14-01-22 09:39:55, Joel Savitz wrote:
> > What has happened to the oom victim and why it has never exited?
> 
> What appears to happen is that the oom victim is sent SIGKILL by the
> process that triggers the oom while also being marked as an oom
> victim.
> 
> As you mention in your patchset introducing the oom reaper in commit
> aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
> oom reaper is to try to free memory more quickly than it would
> otherwise be freed, by assuming anonymous or swapped out pages
> won't be needed in the exit path as the owner is already dying.
> However, this assumption is violated by the futex_cleanup() path,
> which needs access to userspace in fetch_robust_entry() when it is
> called in exit_robust_list(). Trace_printk()s in this failure path
> reveal an apparent race between the oom reaper thread reaping the
> victim's mm and the futex_cleanup() path. There may be other ways that
> this race manifests but we have been most consistently able to trace
> that one.

Please let's continue the discussion in the v3 email thread:
http://lkml.kernel.org/r/20220114180135.83308-1-npache@redhat.com

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..fa58bd10a0df 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,7 @@ 
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/futex.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -890,6 +891,7 @@  static void __oom_kill_process(struct task_struct *victim, const char *message)
 	 * in order to prevent the OOM victim from depleting the memory
 	 * reserves from the user space under its control.
 	 */
+	futex_exit_release(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
 	mark_oom_victim(victim);
 	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
@@ -930,6 +932,7 @@  static void __oom_kill_process(struct task_struct *victim, const char *message)
 		 */
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
+		futex_exit_release(p);
 		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
 	}
 	rcu_read_unlock();