[v3] mm/oom: do not oom reap task with an unresolved robust futex

Message ID 20220114180135.83308-1-npache@redhat.com (mailing list archive)
State New
Series [v3] mm/oom: do not oom reap task with an unresolved robust futex

Commit Message

Nico Pache Jan. 14, 2022, 6:01 p.m. UTC
In the case that two or more processes share a futex located within
a shared mmaped region, such as a process that shares a lock between
itself and child processes, we have observed that when a process holding
the lock is oom killed, at least one waiter is never alerted to this new
development and simply continues to wait.

This is visible via pthreads by checking the __owner field of the
pthread_mutex_t structure within a waiting process, perhaps with gdb.

We identify reproduction of this issue by checking a waiting process of
the test program, viewing the contents of its pthread_mutex_t and taking
note of the value in the __owner field, and then checking dmesg to see
whether that owner has already been killed.

As mentioned by Michal in his patchset introducing the oom reaper,
commit aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
oom reaper is to try and free memory more quickly; however, in the case
that a robust futex is being used, we want to avoid running the
concurrent oom reaper. This is due to a race that can occur between the
SIGKILL handling of the robust futex and the oom reaper freeing the
memory needed to maintain the robust list.

In the case that the oom victim is utilizing a robust futex, and the
SIGKILL has not yet handled the futex death, the tsk->robust_list should
be non-NULL. This issue can be tricky to reproduce, but with the
modifications of this patch, we have found it to be impossible to
reproduce.

Add a check in wake_oom_reaper() for a non-NULL tsk->robust_list and
return early to prevent waking the oom reaper.

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
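
For illustration only, a minimal sketch of the scenario (this is not the
linked reproducer, just the shape of it; error handling omitted):

	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* robust, process-shared mutex placed in shared memory */
		pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		pthread_mutexattr_t a;

		pthread_mutexattr_init(&a);
		pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
		pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
		pthread_mutex_init(m, &a);

		if (fork() == 0) {
			/* child takes the lock, then allocates until it is
			 * selected and killed by the oom killer */
			pthread_mutex_lock(m);
			for (;;) {
				void *p = malloc(1 << 20);
				if (p)
					memset(p, 1, 1 << 20);
			}
		}

		sleep(1);
		/* expected: returns EOWNERDEAD once the child dies;
		 * observed with the race above: blocks here forever */
		pthread_mutex_lock(m);
		return 0;
	}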

Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/oom_kill.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Michal Hocko Jan. 17, 2022, 8:52 a.m. UTC | #1
On Fri 14-01-22 13:01:35, Nico Pache wrote:
> In the case that two or more processes share a futex located within
> a shared mmaped region, such as a process that shares a lock between
> itself and child processes, we have observed that when a process holding
> the lock is oom killed, at least one waiter is never alerted to this new
> development and simply continues to wait.
> 
> This is visible via pthreads by checking the __owner field of the
> pthread_mutex_t structure within a waiting process, perhaps with gdb.
> 
> We identify reproduction of this issue by checking a waiting process of
> a test program and viewing the contents of the pthread_mutex_t, taking note
> of the value in the owner field, and then checking dmesg to see if the
> owner has already been killed.

I believe we really need to find out why the original holder of the
futex is not woken up to release the lock when exiting.

> As mentioned by Michal in his patchset introducing the oom reaper,
> commit aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
> oom reaper is to try and free memory more quickly; however, In the case
> that a robust futex is being used, we want to avoid utilizing the
> concurrent oom reaper. This is due to a race that can occur between the
> SIGKILL handling the robust futex, and the oom reaper freeing the memory
> needed to maintain the robust list.

The OOM reaper only unmaps private memory. It doesn't touch shared
mappings. So how could it interfere?

> In the case that the oom victim is utilizing a robust futex, and the
> SIGKILL has not yet handled the futex death, the tsk->robust_list should
> be non-NULL. This issue can be tricky to reproduce, but with the
> modifications of this patch, we have found it to be impossible to
> reproduce.

We really need a deeper analysis of the underlying problem because right
now I do not really see why the oom reaper should interfere with a shared
futex.

> Add a check for tsk->robust_list is non-NULL in wake_oom_reaper() to return
> early and prevent waking the oom reaper.
> 
> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
> 
> Co-developed-by: Joel Savitz <jsavitz@redhat.com>
> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/oom_kill.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ddabefcfb5a..3cdaac9c7de5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>  
> +#ifdef CONFIG_FUTEX
> +	/*
> +	 * If the ooming task's SIGKILL has not finished handling the
> +	 * robust futex it is not correct to reap the mm concurrently.
> +	 * Do not wake the oom reaper when the task still contains a
> +	 * robust list.
> +	 */
> +	if (tsk->robust_list)
> +		return;
> +#ifdef CONFIG_COMPAT
> +	if (tsk->compat_robust_list)
> +		return;
> +#endif
> +#endif

If this turns out to be really needed, which I do not really see at the
moment, then this is not the right way to handle this situation. The oom
victim could get stuck and the oom killer wouldn't be able to move
forward. If anything the victim would need to get MMF_OOM_SKIP set.

> +
>  	get_task_struct(tsk);
>  
>  	spin_lock(&oom_reaper_lock);
> -- 
> 2.33.1
Waiman Long Jan. 17, 2022, 4:05 p.m. UTC | #2
On 1/17/22 03:52, Michal Hocko wrote:
> On Fri 14-01-22 13:01:35, Nico Pache wrote:
>> In the case that two or more processes share a futex located within
>> a shared mmaped region, such as a process that shares a lock between
>> itself and child processes, we have observed that when a process holding
>> the lock is oom killed, at least one waiter is never alerted to this new
>> development and simply continues to wait.
>>
>> This is visible via pthreads by checking the __owner field of the
>> pthread_mutex_t structure within a waiting process, perhaps with gdb.
>>
>> We identify reproduction of this issue by checking a waiting process of
>> a test program and viewing the contents of the pthread_mutex_t, taking note
>> of the value in the owner field, and then checking dmesg to see if the
>> owner has already been killed.
> I believe we really need to find out why the original holder of the
> futex is not woken up to release the lock when exiting.

For a robust futex lock holder or waiter that is to be killed, it is not
the responsibility of the task itself to wake up and release the lock.
It is the kernel that recognizes that the task is holding or waiting for
the robust futex and cleans things up.
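
For reference, the exit-time path I'm referring to is roughly this call
chain (paraphrased, not a verbatim copy of the current tree):

	do_exit()
	  exit_mm()
	    exit_mm_release()
	      futex_exit_release()
	        futex_cleanup()
	          exit_robust_list()      /* walks tsk->robust_list in userspace */
	            handle_futex_death()  /* sets FUTEX_OWNER_DIED in the futex
	                                     word and wakes one waiter */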


>> As mentioned by Michal in his patchset introducing the oom reaper,
>> commit aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
>> oom reaper is to try and free memory more quickly; however, In the case
>> that a robust futex is being used, we want to avoid utilizing the
>> concurrent oom reaper. This is due to a race that can occur between the
>> SIGKILL handling the robust futex, and the oom reaper freeing the memory
>> needed to maintain the robust list.
> OOM reaper is only unmapping private memory. It doesn't touch a shared
> mappings. So how could it interfere?
>
The futex itself may be in shared memory; however, the robust list entry
can be in private memory. So when the robust list is scanned in this
case, we can end up in a use-after-free situation.
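
To make the layout concrete, the structures the kernel walks are roughly
the following (from the futex uapi header, annotations dropped). The head
is registered per thread via set_robust_list() and, with glibc, typically
sits inside struct pthread - i.e. in the thread's private stack/TLS
mapping - while the list nodes are embedded next to each locked robust
mutex, which may well be in shared memory:

	struct robust_list {
		struct robust_list *next;
	};

	struct robust_list_head {
		struct robust_list list;	/* head of the per-thread list */
		long futex_offset;		/* offset from list entry to futex word */
		struct robust_list *list_op_pending;
	};
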
>> In the case that the oom victim is utilizing a robust futex, and the
>> SIGKILL has not yet handled the futex death, the tsk->robust_list should
>> be non-NULL. This issue can be tricky to reproduce, but with the
>> modifications of this patch, we have found it to be impossible to
>> reproduce.
> We really need a deeper analysis of the udnerlying problem because right
> now I do not really see why the oom reaper should interfere with shared
> futex.
As I said above, the robust list processing can involve private memory.
>
>> Add a check for tsk->robust_list is non-NULL in wake_oom_reaper() to return
>> early and prevent waking the oom reaper.
>>
>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>>
>> Co-developed-by: Joel Savitz <jsavitz@redhat.com>
>> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/oom_kill.c | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>   	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>   		return;
>>   
>> +#ifdef CONFIG_FUTEX
>> +	/*
>> +	 * If the ooming task's SIGKILL has not finished handling the
>> +	 * robust futex it is not correct to reap the mm concurrently.
>> +	 * Do not wake the oom reaper when the task still contains a
>> +	 * robust list.
>> +	 */
>> +	if (tsk->robust_list)
>> +		return;
>> +#ifdef CONFIG_COMPAT
>> +	if (tsk->compat_robust_list)
>> +		return;
>> +#endif
>> +#endif
> If this turns out to be really needed, which I do not really see at the
> moment, then this is not the right way to handle this situation. The oom
> victim could get stuck and the oom killer wouldn't be able to move
> forward. If anything the victim would need to get MMF_OOM_SKIP set.

There may be other ways to do that, but letting the normal kill signal
processing finish its job and properly invoke futex_cleanup() is
certainly one possible solution.

Cheers,
Longman
Nico Pache Jan. 17, 2022, 10:56 p.m. UTC | #3
On 1/17/22 11:05, Waiman Long wrote:
> On 1/17/22 03:52, Michal Hocko wrote:
>> On Fri 14-01-22 13:01:35, Nico Pache wrote:
>>> In the case that two or more processes share a futex located within
>>> a shared mmaped region, such as a process that shares a lock between
>>> itself and child processes, we have observed that when a process holding
>>> the lock is oom killed, at least one waiter is never alerted to this new
>>> development and simply continues to wait.
>>>
>>> This is visible via pthreads by checking the __owner field of the
>>> pthread_mutex_t structure within a waiting process, perhaps with gdb.
>>>
>>> We identify reproduction of this issue by checking a waiting process of
>>> a test program and viewing the contents of the pthread_mutex_t, taking note
>>> of the value in the owner field, and then checking dmesg to see if the
>>> owner has already been killed.
>> I believe we really need to find out why the original holder of the
>> futex is not woken up to release the lock when exiting.
> 
> For a robust futex lock holder or waiter that is to be killed, it is not the
> responsibility of the task itself to wake up and release the lock. It is the
> kernel that recognizes that the task is holding or waiting for the robust futex
> and clean thing up.
> 
> 
>>> As mentioned by Michal in his patchset introducing the oom reaper,
>>> commit aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
>>> oom reaper is to try and free memory more quickly; however, In the case
>>> that a robust futex is being used, we want to avoid utilizing the
>>> concurrent oom reaper. This is due to a race that can occur between the
>>> SIGKILL handling the robust futex, and the oom reaper freeing the memory
>>> needed to maintain the robust list.
>> OOM reaper is only unmapping private memory. It doesn't touch a shared
>> mappings. So how could it interfere?
>>
> The futex itself may be in shared memory, however the robust list entry can be
> in private memory. So when the robust list is being scanned in this case, we can
> be in a use-after-free situation.

I believe this is true.  The userspace allocation for the pthread occurs as a
private mapping:
https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L368
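
For reference, the allocation in question is roughly the following
(paraphrased from the file linked above; exact prot/guard-page handling
omitted):

	/* the thread stack, including the struct pthread / TLS block that
	 * holds the robust list head, comes from a private anonymous mapping */
	mem = mmap(NULL, size, prot,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);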

>>> In the case that the oom victim is utilizing a robust futex, and the
>>> SIGKILL has not yet handled the futex death, the tsk->robust_list should
>>> be non-NULL. This issue can be tricky to reproduce, but with the
>>> modifications of this patch, we have found it to be impossible to
>>> reproduce.
>> We really need a deeper analysis of the udnerlying problem because right
>> now I do not really see why the oom reaper should interfere with shared
>> futex.
> As I said above, the robust list processing can involve private memory.
>>
>>> Add a check for tsk->robust_list is non-NULL in wake_oom_reaper() to return
>>> early and prevent waking the oom reaper.
>>>
>>> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>>>
>>> Co-developed-by: Joel Savitz <jsavitz@redhat.com>
>>> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>   mm/oom_kill.c | 15 +++++++++++++++
>>>   1 file changed, 15 insertions(+)
>>>
>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>>> --- a/mm/oom_kill.c
>>> +++ b/mm/oom_kill.c
>>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>>       if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>>           return;
>>>   +#ifdef CONFIG_FUTEX
>>> +    /*
>>> +     * If the ooming task's SIGKILL has not finished handling the
>>> +     * robust futex it is not correct to reap the mm concurrently.
>>> +     * Do not wake the oom reaper when the task still contains a
>>> +     * robust list.
>>> +     */
>>> +    if (tsk->robust_list)
>>> +        return;
>>> +#ifdef CONFIG_COMPAT
>>> +    if (tsk->compat_robust_list)
>>> +        return;
>>> +#endif
>>> +#endif
>> If this turns out to be really needed, which I do not really see at the
>> moment, then this is not the right way to handle this situation. The oom
>> victim could get stuck and the oom killer wouldn't be able to move
>> forward. If anything the victim would need to get MMF_OOM_SKIP set.

I will try this, but I don't immediately see any difference between this return
case and setting the bit, passing the task to the oom_reaper_list, and then
skipping it based on the flag. Do you mind explaining how this could lead to the
oom killer getting stuck?

Cheers,
-- Nico
> 
> There can be other way to do that, but letting the normal kill signal processing
> finishing its job and properly invoke futex_cleanup() is certainly one possible
> solution.
> 
> Cheers,
> Longman
>
Michal Hocko Jan. 18, 2022, 8:51 a.m. UTC | #4
On Mon 17-01-22 17:56:28, Nico Pache wrote:
> On 1/17/22 11:05, Waiman Long wrote:
> > On 1/17/22 03:52, Michal Hocko wrote:
[...]
> >>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >>> index 1ddabefcfb5a..3cdaac9c7de5 100644
> >>> --- a/mm/oom_kill.c
> >>> +++ b/mm/oom_kill.c
> >>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
> >>>       if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
> >>>           return;
> >>>   +#ifdef CONFIG_FUTEX
> >>> +    /*
> >>> +     * If the ooming task's SIGKILL has not finished handling the
> >>> +     * robust futex it is not correct to reap the mm concurrently.
> >>> +     * Do not wake the oom reaper when the task still contains a
> >>> +     * robust list.
> >>> +     */
> >>> +    if (tsk->robust_list)
> >>> +        return;
> >>> +#ifdef CONFIG_COMPAT
> >>> +    if (tsk->compat_robust_list)
> >>> +        return;
> >>> +#endif
> >>> +#endif
> >> If this turns out to be really needed, which I do not really see at the
> >> moment, then this is not the right way to handle this situation. The oom
> >> victim could get stuck and the oom killer wouldn't be able to move
> >> forward. If anything the victim would need to get MMF_OOM_SKIP set.
> 
> I will try this, but I don't immediately see any difference between this return
> case and setting the bit, passing the oom_reaper_list, then skipping it based on
> the flag. Do you mind explaining how this could lead to the oom killer getting
> stuck?

The primary purpose of the oom_reaper is to guarantee forward
progress. If a task gets stuck in the kernel - e.g. on locks - then it
won't bail out and won't handle signals (i.e. SIGKILL from
userspace). The oom killer prevents selection of new oom victims in the
presence of an existing oom victim (see oom_evaluate_task). That means
that we not only send a SIGKILL to the victim, we also wake up the oom
reaper, which then asynchronously tears down the private memory of the
task (thus releasing at least some of its memory), and once it is done it
will set the MMF_OOM_SKIP flag, which tells the oom killer
(oom_evaluate_task) that this victim is no longer interesting and a new
victim can be selected.
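
For reference, the relevant part of oom_evaluate_task() is roughly
(condensed, not a verbatim copy):

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;	/* victim already reaped, keep looking */
		goto abort;		/* do not select a new victim yet */
	}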

Makes sense?

Part of the async tear down is also MMF_UNSTABLE handling (see
__oom_reap_task_mm) which tells #PF handling code
(check_stable_address_space) that the underlying memory could have been
tampered with and thus it should return SIGBUS. The underlying
assumption is that the process (and all tasks which share its mm) has
been killed and it will never return to the userspace so the de-facto
memory corruption doesn't matter.
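
The check itself is roughly (from include/linux/oom.h):

	static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
	{
		if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
			return VM_FAULT_SIGBUS;
		return 0;
	}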

One thing that is still unclear to me is why this leads to any locked-up
tasks. Looking at exit_robust_list I can see that it is accessing
userspace memory, but this should return EFAULT in this situation. My
assumption (which might be really wrong) is that a futex shared among
processes which share neither mm nor signal handling will be sitting
in shared memory.

Now to the actual fix. I do not think we want to hide the task from the
oom reaper as you are suggesting. Futexes are very likely to be used by
many processes and that would make the whole async scheme useless. We
need something like the below.

futex_exit_release is not directly usable as it implicitly depends
on memory allocation (#PF) and that is not acceptable. So instead we
need something like futex_exit_try_release which would fail the
operation in case get_user (pagefault_disable) needs to really handle
the #PF or if the futex_exit_mutex is locked. In other words this would
have to be a completely non-blocking operation. The oom reaper would
then bail out.
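
To illustrate the shape only (completely untested sketch;
futex_cleanup_nofault() below does not exist and would have to be a
variant of futex_cleanup() that returns -EFAULT instead of faulting
memory in):

	static int futex_exit_try_release(struct task_struct *tsk)
	{
		int ret;

		/* the exit path may already be running the cleanup */
		if (!mutex_trylock(&tsk->futex_exit_mutex))
			return -EAGAIN;

		pagefault_disable();
		ret = futex_cleanup_nofault(tsk);	/* hypothetical */
		pagefault_enable();

		mutex_unlock(&tsk->futex_exit_mutex);
		return ret;
	}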

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ef5860fc7d22..57660d3d1b79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -613,6 +613,9 @@ static void oom_reap_task(struct task_struct *tsk)
 	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
+	if (futex_exit_try_release(tsk))
+		goto fail;
+
 	/* Retry the mmap_read_trylock(mm) a few times */
 	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
@@ -621,6 +624,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
+fail:
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	sched_show_task(tsk);
@@ -1184,6 +1188,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	if (!reap)
 		goto drop_mm;
 
+	if (futex_exit_try_release(tsk)) {
+		ret = -EAGAIN;
+		goto drop_mm;
+	}
+	
 	if (mmap_read_lock_killable(mm)) {
 		ret = -EINTR;
 		goto drop_mm;
Nico Pache Feb. 14, 2022, 8:39 p.m. UTC | #5
On 1/18/22 03:51, Michal Hocko wrote:
> On Mon 17-01-22 17:56:28, Nico Pache wrote:
>> On 1/17/22 11:05, Waiman Long wrote:
>>> On 1/17/22 03:52, Michal Hocko wrote:
> [...]
>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>>> index 1ddabefcfb5a..3cdaac9c7de5 100644
>>>>> --- a/mm/oom_kill.c
>>>>> +++ b/mm/oom_kill.c
>>>>> @@ -667,6 +667,21 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>>>>       if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>>>>           return;
>>>>>   +#ifdef CONFIG_FUTEX
>>>>> +    /*
>>>>> +     * If the ooming task's SIGKILL has not finished handling the
>>>>> +     * robust futex it is not correct to reap the mm concurrently.
>>>>> +     * Do not wake the oom reaper when the task still contains a
>>>>> +     * robust list.
>>>>> +     */
>>>>> +    if (tsk->robust_list)
>>>>> +        return;
>>>>> +#ifdef CONFIG_COMPAT
>>>>> +    if (tsk->compat_robust_list)
>>>>> +        return;
>>>>> +#endif
>>>>> +#endif
>>>> If this turns out to be really needed, which I do not really see at the
>>>> moment, then this is not the right way to handle this situation. The oom
>>>> victim could get stuck and the oom killer wouldn't be able to move
>>>> forward. If anything the victim would need to get MMF_OOM_SKIP set.
>>
>> I will try this, but I don't immediately see any difference between this return
>> case and setting the bit, passing the oom_reaper_list, then skipping it based on
>> the flag. Do you mind explaining how this could lead to the oom killer getting
>> stuck?
> 
> The primary purpose of the oom_reaper is to guarantee a forward
> progress. If a task gets stuck in the kernel - e.g. on locks then it
> won't bail out and won't handle signals (i.e. SIGKILL from the
> userspace). The oom killer prevents new oom victims selection in a
> presence of an existing oom victim (see oom_evaluate_task). That means
> that we not only send a SIGKILL to the victim, we also wake up the oom
> reaper which then asynchronously tears down the private memory of the
> task (thus release at least some of its memory) and once it is done it
> will set MMF_OOM_SKIP flag which will tell the oom killer
> (oom_evaluate_task) that this victim is no longer interesting and a new
> victim can be selected.
> 
> Makes sense?
That does, thank you for clearing that up!
> 
> Part of the async tear down is also MMF_UNSTABLE handling (see
> __oom_reap_task_mm) which tells #PF handling code
> (check_stable_address_space) that the underlying memory could have been
> tempered with and thus it should return SIGBUS. The underlying
> assumption is that the process (and all tasks which share its mm) has
> been killed and it will never return to the userspace so the de-facto
> memory corruption doesn't matter.
> 
> One thing that is still unclear to me is why this leads to any locked up
> tasks. Looking at exit_robust_list I can see that it is accessing the
> userspace memory but this should return EFAULT in this situation. My
> assumption (which might be really wrong) is that futex shared among
> processes which are not sharing mm nor signal handling will be sitting
> in a shared memory. 
> 
> Now to the actual fix. I do not think we want to hide the task from the
> oom reaper as you are suggesting. Futexes are very likely to be used for
> many processes and that would make the whole async scheme useless. We
> need something like the below.
> 
> futex_exit_release is not directly usable as it implicitly depends
> on memory allocation (#PF) and that is not acceptable. So instead we
> need something futex_exit_try_release or similar which would fail the
> operation in case get_user (pagefault_disable) needs to really handle
> the #PF or if the futex_exit_mutex is locked. In other words this would
> have to be a completely non-blocking operation. The oom reaper would
> then bail out.
Yeah Joel came up with a similar idea once we realized his v2 had the issue with
sleeping.

We've recently been discussing the following if statement in __oom_reap_task_mm:
	if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))

Given the comment above it, and some of the upstream discussion on the original
RFC, we are struggling to see why this should be a `||` and not an `&&`. If we
only want to reap anon memory, and reaping shared memory can be dangerous, is
this statement incorrect?

We have a patch queued up to make this change, but wanted to get your opinion on
why this was originally designed this way in case we are missing something.

-- Nico

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ef5860fc7d22..57660d3d1b79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -613,6 +613,9 @@ static void oom_reap_task(struct task_struct *tsk)
>  	int attempts = 0;
>  	struct mm_struct *mm = tsk->signal->oom_mm;
>  
> +	if (futex_exit_try_release(tsk))
> +		goto fail;
> +
>  	/* Retry the mmap_read_trylock(mm) a few times */
>  	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
>  		schedule_timeout_idle(HZ/10);
> @@ -621,6 +624,7 @@ static void oom_reap_task(struct task_struct *tsk)
>  	    test_bit(MMF_OOM_SKIP, &mm->flags))
>  		goto done;
>  
> +fail:
>  	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
>  		task_pid_nr(tsk), tsk->comm);
>  	sched_show_task(tsk);
> @@ -1184,6 +1188,11 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	if (!reap)
>  		goto drop_mm;
>  
> +	if (futex_exit_try_release(tsk)) {
> +		ret = -EAGAIN;
> +		goto drop_mm;
> +	}
> +	
>  	if (mmap_read_lock_killable(mm)) {
>  		ret = -EINTR;
>  		goto drop_mm;
Michal Hocko March 2, 2022, 2:24 p.m. UTC | #6
Sorry, this has slipped through the cracks.

On Mon 14-02-22 15:39:31, Nico Pache wrote:
[...]
> We've recently been discussing the following if statement in __oom_reap_task_mm:
> 	if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> 
> Given the comment above it, and some of the upstream discussion the original
> RFC, we are struggling to see why this should be a `||` and not an `&&`. If we
> only want to reap anon memory and reaping shared memory can be dangerous is this
> statement incorrect?
> 
> We have a patch queued up to make this change, but wanted to get your opinion on
> why this was originally designed this way in case we are missing something.

I do not really see why this would be wrong. Private file backed
mappings can contain reapable memory as well. I do not see how this
would solve the futex issue.
Nico Pache March 2, 2022, 5:26 p.m. UTC | #7
On 3/2/22 09:24, Michal Hocko wrote:
> Sorry, this has slipped through cracks.
> 
> On Mon 14-02-22 15:39:31, Nico Pache wrote:
> [...]
>> We've recently been discussing the following if statement in __oom_reap_task_mm:
>> 	if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
>>
>> Given the comment above it, and some of the upstream discussion the original
>> RFC, we are struggling to see why this should be a `||` and not an `&&`. If we
>> only want to reap anon memory and reaping shared memory can be dangerous is this
>> statement incorrect?
>>
>> We have a patch queued up to make this change, but wanted to get your opinion on
>> why this was originally designed this way in case we are missing something.
> 
> I do not really see why this would be wrong. Private file backed
> mappings can contain a reapable memory as well. I do not see how this
> would solve the futex issue.
We were basing our discussion around the following comment:
/*
 * Only anonymous pages have a good chance to be dropped
 * without additional steps which we cannot afford as we
 * are OOM already.
 *
 * We do not even care about fs backed pages because all
 * which are reclaimable have already been reclaimed and
 * we do not want to block exit_mmap by keeping mm ref
 * count elevated without a good reason.
 */

So changing to an && would align the functionality with this comment by ignoring
fs backed pages, and it would additionally prevent shared mappings from being
reaped. We have tested this change and found we can no longer reproduce the
issue. In our case we allocate the mutex on a MAP_SHARED|MAP_ANONYMOUS mmap, so
the if statement in question would no longer return true after the && change.

If it is the case that private fs backed pages matter, perhaps we want something
like this:
	if ((vma_is_anonymous(vma) && !(vma->vm_flags & VM_SHARED)) ||
	    (!vma_is_anonymous(vma) && !(vma->vm_flags & VM_SHARED)))

or more simply:
	if (!(vma->vm_flags & VM_SHARED))

to exclude all VM_SHARED mappings.

-- Nico
Michal Hocko March 3, 2022, 7:48 a.m. UTC | #8
On Wed 02-03-22 12:26:45, Nico Pache wrote:
> 
> 
> On 3/2/22 09:24, Michal Hocko wrote:
> > Sorry, this has slipped through cracks.
> > 
> > On Mon 14-02-22 15:39:31, Nico Pache wrote:
> > [...]
> >> We've recently been discussing the following if statement in __oom_reap_task_mm:
> >> 	if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> >>
> >> Given the comment above it, and some of the upstream discussion the original
> >> RFC, we are struggling to see why this should be a `||` and not an `&&`. If we
> >> only want to reap anon memory and reaping shared memory can be dangerous is this
> >> statement incorrect?
> >>
> >> We have a patch queued up to make this change, but wanted to get your opinion on
> >> why this was originally designed this way in case we are missing something.
> > 
> > I do not really see why this would be wrong. Private file backed
> > mappings can contain a reapable memory as well. I do not see how this
> > would solve the futex issue.
> We were basing our discussion around the following comment:
> /*
>  * Only anonymous pages have a good chance to be dropped
>  * without additional steps which we cannot afford as we
>  * are OOM already.
>  *
>  * We do not even care about fs backed pages because all
>  * which are reclaimable have already been reclaimed and
>  * we do not want to block exit_mmap by keeping mm ref
>  * count elevated without a good reason.
>  */
> 
> So changing to an && would align the functionality with this comment by ignoring
> fs backed pages, and additionally it prevents shared mappings from being reaped.
> We have tested this change and found we can no longer reproduce the issue. In
> our case we allocate the mutex on a MAP_SHARED|MAP_ANONYMOUS mmap so the if-
> statement in question would no longer return true after the && change.
> 
> If it is the case that private fs backed pages matter perhaps we want something
> like this:
> 	if ((vma_is_anonymous(vma) && !(vma->vm_flags & VM_SHARED))
> 	||(!vma_is_anonymous(vma) && !(vma->vm_flags & VM_SHARED)))
> 
> or more simply:
> 	if(!(vma->vm_flags & VM_SHARED))
> 
> to exclude all VM_SHARED mappings.

I would have to think about that some more, but I do not really see how
this is related to the futex issue. In other words, what kind of problem
does this solve?
Nico Pache March 9, 2022, 12:24 a.m. UTC | #9
On 3/3/22 00:48, Michal Hocko wrote:
> On Wed 02-03-22 12:26:45, Nico Pache wrote:
>>
>>
>> On 3/2/22 09:24, Michal Hocko wrote:
>>> Sorry, this has slipped through cracks.
>>>
>>> On Mon 14-02-22 15:39:31, Nico Pache wrote:
>>> [...]
>>>> We've recently been discussing the following if statement in __oom_reap_task_mm:
>>>> 	if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
>>>>
>>>> Given the comment above it, and some of the upstream discussion the original
>>>> RFC, we are struggling to see why this should be a `||` and not an `&&`. If we
>>>> only want to reap anon memory and reaping shared memory can be dangerous is this
>>>> statement incorrect?
>>>>
>>>> We have a patch queued up to make this change, but wanted to get your opinion on
>>>> why this was originally designed this way in case we are missing something.
>>>
>>> I do not really see why this would be wrong. Private file backed
>>> mappings can contain a reapable memory as well. I do not see how this
>>> would solve the futex issue
>> We were basing our discussion around the following comment:
>> /*
>>  * Only anonymous pages have a good chance to be dropped
>>  * without additional steps which we cannot afford as we
>>  * are OOM already.
>>  *
>>  * We do not even care about fs backed pages because all
>>  * which are reclaimable have already been reclaimed and
>>  * we do not want to block exit_mmap by keeping mm ref
>>  * count elevated without a good reason.
>>  */
>>
>> So changing to an && would align the functionality with this comment by ignoring
>> fs backed pages, and additionally it prevents shared mappings from being reaped.
>> We have tested this change and found we can no longer reproduce the issue. In
>> our case we allocate the mutex on a MAP_SHARED|MAP_ANONYMOUS mmap so the if-
>> statement in question would no longer return true after the && change.
>>
>> If it is the case that private fs backed pages matter perhaps we want something
>> like this:
>> 	if ((vma_is_anonymous(vma) && !(vma->vm_flags & VM_SHARED))
>> 	||(!vma_is_anonymous(vma) && !(vma->vm_flags & VM_SHARED)))
>>
>> or more simply:
>> 	if(!(vma->vm_flags & VM_SHARED))
>>
>> to exclude all VM_SHARED mappings.
> 
> I would have to think about that some more but I do not really see how
> this is related to the futex issue. In other words what kind of problem
> does this solve?
> 

We had a misunderstanding of what vma_is_anonymous actually checks for... It
returns true only if the VMA is a private anonymous mapping. We may follow up
with a patch to rename this function, or at least add a comment at the top of it
to be more descriptive. Furthermore, we ended up being able to reproduce the
issue even with the && change applied.
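
For reference, the helper is simply (include/linux/mm.h):

	static inline bool vma_is_anonymous(struct vm_area_struct *vma)
	{
		/* false for MAP_SHARED|MAP_ANONYMOUS, which is shmem-backed */
		return !vma->vm_ops;
	}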

We have also found the actual cause of the issue, and we'll post that fix. It's
related to the glibc allocation done for pthreads, as we discussed earlier in
this thread. The mapping that stores the futex robust list is in userspace, and
a race occurs between oom_reap_task_mm() and the exit path that handles the
futex cleanup.

Cheers,
-- Nico

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..3cdaac9c7de5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -667,6 +667,21 @@  static void wake_oom_reaper(struct task_struct *tsk)
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
 		return;
 
+#ifdef CONFIG_FUTEX
+	/*
+	 * If the ooming task's SIGKILL has not finished handling the
+	 * robust futex it is not correct to reap the mm concurrently.
+	 * Do not wake the oom reaper when the task still contains a
+	 * robust list.
+	 */
+	if (tsk->robust_list)
+		return;
+#ifdef CONFIG_COMPAT
+	if (tsk->compat_robust_list)
+		return;
+#endif
+#endif
+
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);