Patchwork bug in memcg oom-killer results in a hung syscall in another process in the same cgroup

login
register
mail settings
Submitter Konstantin Khebnikov
Date July 12, 2016, 3:52 p.m.
Message ID <57851224.2020902@yandex-team.ru>
Download mbox | patch
Permalink /patch/9225661/
State New
Headers show

Comments

Konstantin Khebnikov - July 12, 2016, 3:52 p.m.
On 12.07.2016 18:35, Shayan Pooya wrote:
>>> With strace, when running 500 concurrent mem-hog tasks on the same
>>> kernel, 33 of them failed with:
>>>
>>> strace: ../sysdeps/nptl/fork.c:136: __libc_fork: Assertion
>>> `THREAD_GETMEM (self, tid) != ppid' failed.
>>>
>>> Which is: https://sourceware.org/bugzilla/show_bug.cgi?id=15392
>>> And discussed before at: https://lkml.org/lkml/2015/2/6/470 but that
>>> patch was not accepted.
>>
>> OK, so the problem is that the oom killed task doesn't report the futex
>> release properly? If yes then I fail to see how that is memcg specific.
>> Could you try to clarify what you consider a bug again, please? I am not
>> really sure I understand this report.
>
> It looks like it is just a very easy way to reproduce the problem that
> Konstantin described in that lkml thread. That patch was not accepted
> and I see no other fixes for that issue upstream. Here is a copy of
> his root-cause analysis from said thread:
>
> Whole sequence looks like: task calls fork, glibc calls syscall clone with
> CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
> Child task gets read-only copy of VM including TLS. Child calls put_user()
> to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
> fault and it fails because do_wp_page()  hits memcg limit without invoking
> OOM-killer because this is page-fault from kernel-space.  Put_user returns
> -EFAULT, which is ignored.  Child returns into user-space and catches here
> assert (THREAD_GETMEM (self, tid) != ppid), glibc tries to print something
> but hangs on deadlock on internal locks. Halt and catch fire.
>
>

Yep. Bug still not fixed in upstream. In our kernel I've plugged it with this:
Oleg Nesterov - July 12, 2016, 4:52 p.m.
On 07/12, Konstantin Khlebnikov wrote:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2808,8 +2808,9 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>         balance_callback(rq);
>         preempt_enable();
>
> -       if (current->set_child_tid)
> -               put_user(task_pid_vnr(current), current->set_child_tid);
> +       if (current->set_child_tid &&
> +           put_user(task_pid_vnr(current), current->set_child_tid))
> +               force_sig(SIGSEGV, current);
>  }
>
> Add Oleg into CC. IIRR he had some ideas how to fix this. =)

Heh. OK, OK, thank you Konstantin ;)

I'll try to recall tomorrow, but iirc I only have some ideas of how
we can happily blame the FAULT_FLAG_USER logic.

d, in this particular case, perhaps glibc/set_child_tid too because
(again, iirc) it would nice to simply kill it, it is only used for
some sanity checks...

Oleg.
Shayan Pooya - July 12, 2016, 10:57 p.m.
> Yep. Bug still not fixed in upstream. In our kernel I've plugged it with
> this:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2808,8 +2808,9 @@ asmlinkage __visible void schedule_tail(struct
> task_struct *prev)
>         balance_callback(rq);
>         preempt_enable();
>
> -       if (current->set_child_tid)
> -               put_user(task_pid_vnr(current), current->set_child_tid);
> +       if (current->set_child_tid &&
> +           put_user(task_pid_vnr(current), current->set_child_tid))
> +               force_sig(SIGSEGV, current);
>  }

I just verified that with your patch there is no hung processes and I
see processes getting SIGSEGV as expected.
Thanks!

Patch

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2808,8 +2808,9 @@  asmlinkage __visible void schedule_tail(struct task_struct *prev)
         balance_callback(rq);
         preempt_enable();

-       if (current->set_child_tid)
-               put_user(task_pid_vnr(current), current->set_child_tid);
+       if (current->set_child_tid &&
+           put_user(task_pid_vnr(current), current->set_child_tid))
+               force_sig(SIGSEGV, current);
  }

Add Oleg into CC. IIRR he had some ideas how to fix this. =)