Patchwork [v2] mm: terminate shrink_slab loop if signal is pending

Submitter Suren Baghdasaryan
Date Dec. 8, 2017, 1:23 a.m.
Message ID <20171208012305.83134-1-surenb@google.com>
Permalink /patch/10101449/
State New

Comments

Suren Baghdasaryan - Dec. 8, 2017, 1:23 a.m.
Slab shrinkers can be quite time consuming, and when a signal
is pending they can delay handling of the signal. If a fatal
signal is pending there is no point in shrinking that process
since it will be killed anyway. This change checks for pending
fatal signals inside the shrink_slab loop and, if one is detected,
terminates the loop early.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

---
V2:
Sergey Senozhatsky:
  - Fix missing parentheses
---
 mm/vmscan.c | 7 +++++++
 1 file changed, 7 insertions(+)
Michal Hocko - Dec. 8, 2017, 8:22 a.m.
On Thu 07-12-17 17:23:05, Suren Baghdasaryan wrote:
> Slab shrinkers can be quite time consuming and when signal
> is pending they can delay handling of the signal. If fatal
> signal is pending there is no point in shrinking that process
> since it will be killed anyway.

The thing is that we are _not_ shrinking _that_ process. We are
shrinking globally shared objects and the fact that the memory pressure
is so large that the kswapd doesn't keep pace with it means that we have
to throttle all allocation sites by doing this direct reclaim. I agree
that expediting killed task is a good thing in general because such a
process should free at least some memory.

> This change checks for pending
> fatal signals inside shrink_slab loop and if one is detected
> terminates this loop early.

This changelog doesn't really address my previous review feedback, I am
afraid. You should mention more details about problems you are seeing
and what causes them. If we have a shrinker which takes a considerable
amount of time then we should be addressing that. If that is not
possible then it should be documented at least.

The changelog should also describe how this plays along with the
rest of the allocation path.

The patch is not mergeable in this form I am afraid.

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> 
> ---
> V2:
> Sergey Senozhatsky:
>   - Fix missing parentheses
> ---
>  mm/vmscan.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c02c850ea349..28e4bdc72c16 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -486,6 +486,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  			.memcg = memcg,
>  		};
>  
> +		/*
> +		 * We are about to die and free our memory.
> +		 * Stop shrinking which might delay signal handling.
> +		 */
> +		if (unlikely(fatal_signal_pending(current)))
> +			break;
> +
>  		/*
>  		 * If kernel memory accounting is disabled, we ignore
>  		 * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> -- 
> 2.15.1.424.g9478a66081-goog
>
Tetsuo Handa - Dec. 8, 2017, 11:36 a.m.
On 2017/12/08 17:22, Michal Hocko wrote:
> On Thu 07-12-17 17:23:05, Suren Baghdasaryan wrote:
>> Slab shrinkers can be quite time consuming and when signal
>> is pending they can delay handling of the signal. If fatal
>> signal is pending there is no point in shrinking that process
>> since it will be killed anyway.
> 
> The thing is that we are _not_ shrinking _that_ process. We are
> shrinking globally shared objects and the fact that the memory pressure
> is so large that the kswapd doesn't keep pace with it means that we have
> to throttle all allocation sites by doing this direct reclaim. I agree
> that expediting killed task is a good thing in general because such a
> process should free at least some memory.

But doesn't doing direct reclaim mean that allocation requests of
already fatal_signal_pending() threads will not succeed unless some memory
is reclaimed (or the thread is selected as an OOM victim)? Won't it just
spin the "too small to fail" retry loop at full speed in the worst case?

> 
>> This change checks for pending
>> fatal signals inside shrink_slab loop and if one is detected
>> terminates this loop early.
> 
> This changelog doesn't really address my previous review feedback, I am
> afraid. You should mention more details about problems you are seeing
> and what causes them. If we have a shrinker which takes considerable
> amount of time them we should be addressing that. If that is not
> possible then it should be documented at least.

Unfortunately, it is possible to get blocked inside shrink_slab() for a very long
time, as in this example from http://lkml.kernel.org/r/1512705038.7843.6.camel@gmail.com :

----------
[18432.707027] INFO: task Chrome_IOThread:27225 blocked for more than 120 seconds.
[18432.707034]       Not tainted 4.15.0-rc2-amd-vega+ #10
[18432.707039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[18432.707045] Chrome_IOThread D11304 27225   3654 0x00000000
[18432.707057] Call Trace:
[18432.707070]  ? __schedule+0x2e3/0xb90
[18432.707086]  ? __lock_page+0xa9/0x180
[18432.707095]  schedule+0x2f/0x90
[18432.707102]  io_schedule+0x12/0x40
[18432.707109]  __lock_page+0xe9/0x180
[18432.707121]  ? page_cache_tree_insert+0x130/0x130
[18432.707138]  deferred_split_scan+0x2b6/0x300
[18432.707160]  shrink_slab.part.47+0x1f8/0x590
[18432.707179]  ? percpu_ref_put_many+0x84/0x100
[18432.707197]  shrink_node+0x2f4/0x300
[18432.707219]  do_try_to_free_pages+0xca/0x350
[18432.707236]  try_to_free_pages+0x140/0x350
[18432.707259]  __alloc_pages_slowpath+0x43c/0x1080
[18432.707298]  __alloc_pages_nodemask+0x3ac/0x430
[18432.707316]  alloc_pages_vma+0x7c/0x200
[18432.707331]  __handle_mm_fault+0x8a1/0x1230
[18432.707359]  handle_mm_fault+0x14c/0x310
[18432.707373]  __do_page_fault+0x28c/0x530
[18432.707450]  do_page_fault+0x32/0x270
[18432.707470]  page_fault+0x22/0x30
----------
Michal Hocko - Dec. 8, 2017, 11:48 a.m.
On Fri 08-12-17 20:36:16, Tetsuo Handa wrote:
> On 2017/12/08 17:22, Michal Hocko wrote:
> > On Thu 07-12-17 17:23:05, Suren Baghdasaryan wrote:
> >> Slab shrinkers can be quite time consuming and when signal
> >> is pending they can delay handling of the signal. If fatal
> >> signal is pending there is no point in shrinking that process
> >> since it will be killed anyway.
> > 
> > The thing is that we are _not_ shrinking _that_ process. We are
> > shrinking globally shared objects and the fact that the memory pressure
> > is so large that the kswapd doesn't keep pace with it means that we have
> > to throttle all allocation sites by doing this direct reclaim. I agree
> > that expediting killed task is a good thing in general because such a
> > process should free at least some memory.
> 
> But doesn't doing direct reclaim mean that allocation request of already
> fatal_signal_pending() threads will not succeed unless some memory is
> reclaimed (or selected as an OOM victim)? Won't it just spin the "too
> small to fail" retry loop at full speed in the worst case?

Well, normally kswapd would do the work in the background. But this
would have to be carefully evaluated. That is why I've said "expedite"
rather than skip.
 
> >> This change checks for pending
> >> fatal signals inside shrink_slab loop and if one is detected
> >> terminates this loop early.
> > 
> > This changelog doesn't really address my previous review feedback, I am
> > afraid. You should mention more details about problems you are seeing
> > and what causes them. If we have a shrinker which takes considerable
> > amount of time them we should be addressing that. If that is not
> > possible then it should be documented at least.
> 
> Unfortunately, it is possible to be get blocked inside shrink_slab() for so long
> like an example from http://lkml.kernel.org/r/1512705038.7843.6.camel@gmail.com .

As I've said, any excessive shrinker should definitely be evaluated.
Tetsuo Handa - Dec. 8, 2017, 2:03 p.m.
Michal Hocko wrote:
> On Fri 08-12-17 20:36:16, Tetsuo Handa wrote:
> > On 2017/12/08 17:22, Michal Hocko wrote:
> > > On Thu 07-12-17 17:23:05, Suren Baghdasaryan wrote:
> > >> Slab shrinkers can be quite time consuming and when signal
> > >> is pending they can delay handling of the signal. If fatal
> > >> signal is pending there is no point in shrinking that process
> > >> since it will be killed anyway.
> > > 
> > > The thing is that we are _not_ shrinking _that_ process. We are
> > > shrinking globally shared objects and the fact that the memory pressure
> > > is so large that the kswapd doesn't keep pace with it means that we have
> > > to throttle all allocation sites by doing this direct reclaim. I agree
> > > that expediting killed task is a good thing in general because such a
> > > process should free at least some memory.
> > 
> > But doesn't doing direct reclaim mean that allocation request of already
> > fatal_signal_pending() threads will not succeed unless some memory is
> > reclaimed (or selected as an OOM victim)? Won't it just spin the "too
> > small to fail" retry loop at full speed in the worst case?
> 
> Well, normally kswapd would do the work on the background. But this
> would have to be carefully evaluated. That is why I've said "expedite"
> rather than skip.

Relying on kswapd is a bad assumption, for kswapd can be blocked on e.g. fs
locks waiting for somebody else to reclaim memory.

>  
> > >> This change checks for pending
> > >> fatal signals inside shrink_slab loop and if one is detected
> > >> terminates this loop early.
> > > 
> > > This changelog doesn't really address my previous review feedback, I am
> > > afraid. You should mention more details about problems you are seeing
> > > and what causes them. If we have a shrinker which takes considerable
> > > amount of time them we should be addressing that. If that is not
> > > possible then it should be documented at least.
> > 
> > Unfortunately, it is possible to be get blocked inside shrink_slab() for so long
> > like an example from http://lkml.kernel.org/r/1512705038.7843.6.camel@gmail.com .
> 
> As I've said any excessive shrinker should definitely be evaluated.

The cause of stall inside shrink_slab() can be memory pressure itself.
There would be no problem if kswapd is sufficient (i.e. direct reclaim is
not needed). But there are many problems if direct reclaim is needed.



I agree that making waits/loops killable is generally good. But be sure to be
prepared for the worst case. For example, start __GFP_KILLABLE on a "best effort"
basis (i.e. no guarantee that the allocating thread will leave the page allocator
slowpath immediately) and check for fatal_signal_pending() only if
__GFP_KILLABLE is set. That is,

+		/*
+		 * We are about to die and free our memory.
+		 * Stop shrinking which might delay signal handling.
+		 */
+		if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
+			break;

at shrink_slab() etc. and

+		if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
+			goto nopage;

at __alloc_pages_slowpath().
Suren Baghdasaryan - Dec. 8, 2017, 6:06 p.m.
On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Michal Hocko wrote:
>> On Fri 08-12-17 20:36:16, Tetsuo Handa wrote:
>> > On 2017/12/08 17:22, Michal Hocko wrote:
>> > > On Thu 07-12-17 17:23:05, Suren Baghdasaryan wrote:
>> > >> Slab shrinkers can be quite time consuming and when signal
>> > >> is pending they can delay handling of the signal. If fatal
>> > >> signal is pending there is no point in shrinking that process
>> > >> since it will be killed anyway.
>> > >
>> > > The thing is that we are _not_ shrinking _that_ process. We are
>> > > shrinking globally shared objects and the fact that the memory pressure
>> > > is so large that the kswapd doesn't keep pace with it means that we have
>> > > to throttle all allocation sites by doing this direct reclaim. I agree
>> > > that expediting killed task is a good thing in general because such a
>> > > process should free at least some memory.

Agree, the wording here is inaccurate. My original intent was to have a
safeguard against slow shrinkers, but I understand your concern that
this can mask a real problem in a shrinker. In essence, expediting the
killing is the ultimate goal here, but as you mentioned it's not as
simple as this change.

>> >
>> > But doesn't doing direct reclaim mean that allocation request of already
>> > fatal_signal_pending() threads will not succeed unless some memory is
>> > reclaimed (or selected as an OOM victim)? Won't it just spin the "too
>> > small to fail" retry loop at full speed in the worst case?

That seems to be the case in my test. If we could bail out safely from
the retry loop in __alloc_pages_slowpath that would be a big win in
expediting the signal.

>>
>> Well, normally kswapd would do the work on the background. But this
>> would have to be carefully evaluated. That is why I've said "expedite"
>> rather than skip.
>
> Relying on kswapd is a bad assumption, for kswapd can be blocked on e.g. fs
> locks waiting for somebody else to reclaim memory.
>
>>
>> > >> This change checks for pending
>> > >> fatal signals inside shrink_slab loop and if one is detected
>> > >> terminates this loop early.
>> > >
>> > > This changelog doesn't really address my previous review feedback, I am
>> > > afraid. You should mention more details about problems you are seeing
>> > > and what causes them.

The problem I'm facing is that a SIGKILL sent from user space to kill
the least important process is delayed long enough for the OOM killer to
get a chance to kill something else, possibly a more important process.
Here "important" is from the user's point of view, so the delay in
SIGKILL delivery effectively causes extra kills. Traces indicate that
this delay happens when the process being killed is in direct reclaim,
and shrinkers (before I fixed them) were the biggest cause of the delay.

>> > > If we have a shrinker which takes considerable
>> > > amount of time them we should be addressing that. If that is not
>> > > possible then it should be documented at least.

I already submitted patches for a couple of shrinkers. The problem became
less pronounced and less frequent, but the retry loop Tetsuo mentioned
still visibly delays the delivery. The worst case I've seen after fixing
the shrinkers is 43ms.

>> >
>> > Unfortunately, it is possible to be get blocked inside shrink_slab() for so long
>> > like an example from http://lkml.kernel.org/r/1512705038.7843.6.camel@gmail.com .
>>
>> As I've said any excessive shrinker should definitely be evaluated.
>
> The cause of stall inside shrink_slab() can be memory pressure itself.
> There would be no problem if kswapd is sufficient (i.e. direct reclaim is
> not needed). But there are many problems if direct reclaim is needed.
>
>
>
> I agree that making waits/loops killable is generally good. But be sure to be
> prepared for the worst case. For example, start __GFP_KILLABLE from "best effort"
> basis (i.e. no guarantee that the allocating thread will leave the page allocator
> slowpath immediately) and check for fatal_signal_pending() only if
> __GFP_KILLABLE is set. That is,
>
> +               /*
> +                * We are about to die and free our memory.
> +                * Stop shrinking which might delay signal handling.
> +                */
> +               if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
> +                       break;
>
> at shrink_slab() etc. and
>
> +               if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
> +                       goto nopage;
>
> at __alloc_pages_slowpath().

I was thinking about something similar and will experiment to see if
this solves the problem and whether it has any side effects. Does anyone
see any obvious problems with this approach?
David Rientjes - Dec. 8, 2017, 9:02 p.m.
On Thu, 7 Dec 2017, Suren Baghdasaryan wrote:

> Slab shrinkers can be quite time consuming and when signal
> is pending they can delay handling of the signal. If fatal
> signal is pending there is no point in shrinking that process
> since it will be killed anyway. This change checks for pending
> fatal signals inside shrink_slab loop and if one is detected
> terminates this loop early.
> 

I've proposed a similar patch in the past, but for a check on TIF_MEMDIE, 
which would today be tsk_is_oom_victim(current), since we had observed 
lengthy stalls in reclaim that would have been prevented if the oom victim 
had bailed out, returned to the page allocator, allocated with 
ALLOC_NO_WATERMARKS, and proceeded to quickly exit.

I'm not sure that all fatal_signal_pending() tasks should get the same 
treatment, but I understand the point that the task is killed and should 
free memory when it fully exits.  How much memory is unknown.

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> 
> ---
> V2:
> Sergey Senozhatsky:
>   - Fix missing parentheses
> ---
>  mm/vmscan.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c02c850ea349..28e4bdc72c16 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -486,6 +486,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  			.memcg = memcg,
>  		};
>  
> +		/*
> +		 * We are about to die and free our memory.
> +		 * Stop shrinking which might delay signal handling.
> +		 */
> +		if (unlikely(fatal_signal_pending(current)))
> +			break;
> +
>  		/*
>  		 * If kernel memory accounting is disabled, we ignore
>  		 * SHRINKER_MEMCG_AWARE flag and call all shrinkers
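For comparison, the narrower check David describes (bailing out only for OOM victims rather than for every fatal_signal_pending() task) would amount to something like the following against the same hunk. tsk_is_oom_victim() is the current kernel's replacement for the old TIF_MEMDIE test; the comment wording is a sketch, not David's actual patch:

+		/*
+		 * An OOM victim is about to exit; let it get back to the
+		 * allocator, allocate with ALLOC_NO_WATERMARKS and exit,
+		 * freeing memory, instead of stalling in the shrinkers.
+		 */
+		if (unlikely(tsk_is_oom_victim(current)))
+			break;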
Suren Baghdasaryan - Dec. 9, 2017, 3:16 a.m.
On Fri, Dec 8, 2017 at 1:02 PM, David Rientjes <rientjes@google.com> wrote:
> On Thu, 7 Dec 2017, Suren Baghdasaryan wrote:
>
>> Slab shrinkers can be quite time consuming and when signal
>> is pending they can delay handling of the signal. If fatal
>> signal is pending there is no point in shrinking that process
>> since it will be killed anyway. This change checks for pending
>> fatal signals inside shrink_slab loop and if one is detected
>> terminates this loop early.
>>
>
> I've proposed a similar patch in the past, but for a check on TIF_MEMDIE,
> which would today be a tsk_is_oom_victim(current), since we had observed
> lengthy stalls in reclaim that would have been prevented if the oom victim
> had exited out, returned back to the page allocator, allocated with
> ALLOC_NO_WATERMARKS, and proceeded to quickly exit.
>
> I'm not sure that all fatal_signal_pending() tasks should get the same
> treatment, but I understand the point that the task is killed and should
> free memory when it fully exits.  How much memory is unknown.
>

Thanks for the input. For my particular use case the TIF_MEMDIE check
would not help, because I'm trying to kill a process before the OOM killer
kicks in; however, the approach is interesting and provides food for thought.

>  > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>>
>> ---
>> V2:
>> Sergey Senozhatsky:
>>   - Fix missing parentheses
>> ---
>>  mm/vmscan.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index c02c850ea349..28e4bdc72c16 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -486,6 +486,13 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>                       .memcg = memcg,
>>               };
>>
>> +             /*
>> +              * We are about to die and free our memory.
>> +              * Stop shrinking which might delay signal handling.
>> +              */
>> +             if (unlikely(fatal_signal_pending(current)))
>> +                     break;
>> +
>>               /*
>>                * If kernel memory accounting is disabled, we ignore
>>                * SHRINKER_MEMCG_AWARE flag and call all shrinkers
Tetsuo Handa - Dec. 9, 2017, 8:08 a.m.
Suren Baghdasaryan wrote:
> On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >> > >> This change checks for pending
> >> > >> fatal signals inside shrink_slab loop and if one is detected
> >> > >> terminates this loop early.
> >> > >
> >> > > This changelog doesn't really address my previous review feedback, I am
> >> > > afraid. You should mention more details about problems you are seeing
> >> > > and what causes them.
> 
> The problem I'm facing is that a SIGKILL sent from user space to kill
> the least important process is delayed enough for OOM-killer to get a
> chance to kill something else, possibly a more important process. Here
> "important" is from user's point of view. So the delay in SIGKILL
> delivery effectively causes extra kills. Traces indicate that this
> delay happens when process being killed is in direct reclaim and
> shrinkers (before I fixed them) were the biggest cause for the delay.

Sending SIGKILL from userspace is not releasing memory fast enough to prevent
the OOM killer from being invoked? Yes, under memory pressure, even an attempt
to send SIGKILL from userspace could be delayed due to e.g. a page fault.

Unless it is a memcg OOM, you could try an OOM notifier callback that checks
whether there are SIGKILL-pending processes and waits for a timeout if there
are. This situation resembles the timeout-based OOM killing discussion, where
the OOM killer is enabled again (based on the assumption that the OOM victim
got stuck due to OOM lockup) after some timeout from the previous OOM kill.
And since we did not merge the timeout-based approach, there is no timestamp
field for remembering when the SIGKILL was delivered; therefore, an attempt
to wait for a timeout would become messy.

Regardless of whether you try an OOM notifier callback, I think that adding
__GFP_KILLABLE and allowing allocations to bail out without calling
out_of_memory() and warn_alloc() will help terminate killed processes faster.
I think that the majority of memory allocation requests can give up upon
SIGKILL. Important allocation requests which should not give up upon SIGKILL
(e.g. committing to filesystem metadata and storage) can be offloaded to
kernel threads.

> 
> >> > > If we have a shrinker which takes considerable
> >> > > amount of time them we should be addressing that. If that is not
> >> > > possible then it should be documented at least.
> 
> I already submitted patches for couple shrinkers. Problem became less
> pronounced and less frequent but the retry loop Tetsuo mentioned still
> visibly delays the delivery. The worst case I've seen after fixing
> shrinkers is 43ms.

You meant "delays the termination (of the killed process)" rather than
"delays the delivery (of SIGKILL)", didn't you?

> > I agree that making waits/loops killable is generally good. But be sure to be
> > prepared for the worst case. For example, start __GFP_KILLABLE from "best effort"
> > basis (i.e. no guarantee that the allocating thread will leave the page allocator
> > slowpath immediately) and check for fatal_signal_pending() only if
> > __GFP_KILLABLE is set. That is,
> >
> > +               /*
> > +                * We are about to die and free our memory.
> > +                * Stop shrinking which might delay signal handling.
> > +                */
> > +               if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
> > +                       break;
> >
> > at shrink_slab() etc. and
> >
> > +               if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
> > +                       goto nopage;
> >
> > at __alloc_pages_slowpath().
> 
> I was thinking about something similar and will experiment to see if
> this solves the problem and if it has any side effects. Anyone sees
> any obvious problems with this approach?
> 

It is Michal who thinks that "killability for a particular allocation request sounds
like a hack" ( http://lkml.kernel.org/r/201606112335.HBG09891.OLFJOFtVMOQHSF@I-love.SAKURA.ne.jp ).
I'm willing to give up memory allocations from functions which are called by
system calls if SIGKILL is pending. Thus, it should be time to try __GFP_KILLABLE.
Tetsuo Handa - Dec. 9, 2017, 12:44 p.m.
On 2017/12/09 6:02, David Rientjes wrote:
> On Thu, 7 Dec 2017, Suren Baghdasaryan wrote:
> 
>> Slab shrinkers can be quite time consuming and when signal
>> is pending they can delay handling of the signal. If fatal
>> signal is pending there is no point in shrinking that process
>> since it will be killed anyway. This change checks for pending
>> fatal signals inside shrink_slab loop and if one is detected
>> terminates this loop early.
>>
> 
> I've proposed a similar patch in the past, but for a check on TIF_MEMDIE, 
> which would today be a tsk_is_oom_victim(current), since we had observed 
> lengthy stalls in reclaim that would have been prevented if the oom victim 
> had exited out, returned back to the page allocator, allocated with 
> ALLOC_NO_WATERMARKS, and proceeded to quickly exit.
> 
> I'm not sure that all fatal_signal_pending() tasks should get the same 
> treatment, but I understand the point that the task is killed and should 
> free memory when it fully exits.  How much memory is unknown.
> 
We can use __GFP_KILLABLE. Unless there is a performance impact from checking
fatal_signal_pending(), allowing only fatal_signal_pending() threads with
__GFP_KILLABLE to bail out (without using memory reserves) should be safe.
Michal Hocko - Dec. 10, 2017, 10:13 a.m.
On Fri 08-12-17 10:06:26, Suren Baghdasaryan wrote:
> On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > Michal Hocko wrote:
> >> On Fri 08-12-17 20:36:16, Tetsuo Handa wrote:
> >> > On 2017/12/08 17:22, Michal Hocko wrote:
> >> > > On Thu 07-12-17 17:23:05, Suren Baghdasaryan wrote:
> >> > >> Slab shrinkers can be quite time consuming and when signal
> >> > >> is pending they can delay handling of the signal. If fatal
> >> > >> signal is pending there is no point in shrinking that process
> >> > >> since it will be killed anyway.
> >> > >
> >> > > The thing is that we are _not_ shrinking _that_ process. We are
> >> > > shrinking globally shared objects and the fact that the memory pressure
> >> > > is so large that the kswapd doesn't keep pace with it means that we have
> >> > > to throttle all allocation sites by doing this direct reclaim. I agree
> >> > > that expediting killed task is a good thing in general because such a
> >> > > process should free at least some memory.
> 
> Agree, wording here is inaccurate. My original intent was to have a
> safeguard against slow shrinkers but I understand your concern that
> this can mask a real problem in a shrinker. In essence expediting the
> killing is the ultimate goal here but as you mentioned it's not as
> simple as this change.

Moreover, it doesn't work if the SIGKILL can be delivered asynchronously
(which is your case AFAICU). You can already be running the slow
shrinker at that time...
 
[...]
> > I agree that making waits/loops killable is generally good. But be sure to be
> > prepared for the worst case. For example, start __GFP_KILLABLE from "best effort"
> > basis (i.e. no guarantee that the allocating thread will leave the page allocator
> > slowpath immediately) and check for fatal_signal_pending() only if
> > __GFP_KILLABLE is set. That is,
> >
> > +               /*
> > +                * We are about to die and free our memory.
> > +                * Stop shrinking which might delay signal handling.
> > +                */
> > +               if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
> > +                       break;
> >
> > at shrink_slab() etc. and
> >
> > +               if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
> > +                       goto nopage;
> >
> > at __alloc_pages_slowpath().
> 
> I was thinking about something similar and will experiment to see if
> this solves the problem and if it has any side effects. Anyone sees
> any obvious problems with this approach?

Tetsuo has been proposing this flag in the past and I've had objections.
I do not have any link handy, but the core objection was that the
semantics would be too fuzzy: all the allocations in the same context
would have to be killable for this flag to have any effect, and
spreading it all over the kernel is simply not feasible.
Michal Hocko - Dec. 10, 2017, 10:17 a.m.
On Sat 09-12-17 17:08:42, Tetsuo Handa wrote:
> Suren Baghdasaryan wrote:
> > On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa
> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > >> > >> This change checks for pending
> > >> > >> fatal signals inside shrink_slab loop and if one is detected
> > >> > >> terminates this loop early.
> > >> > >
> > >> > > This changelog doesn't really address my previous review feedback, I am
> > >> > > afraid. You should mention more details about problems you are seeing
> > >> > > and what causes them.
> > 
> > The problem I'm facing is that a SIGKILL sent from user space to kill
> > the least important process is delayed enough for OOM-killer to get a
> > chance to kill something else, possibly a more important process. Here
> > "important" is from user's point of view. So the delay in SIGKILL
> > delivery effectively causes extra kills. Traces indicate that this
> > delay happens when process being killed is in direct reclaim and
> > shrinkers (before I fixed them) were the biggest cause for the delay.
> 
> Sending SIGKILL from userspace is not releasing memory fast enough to prevent
> the OOM killer from invoking? Yes, under memory pressure, even an attempt to
> send SIGKILL from userspace could be delayed due to e.g. page fault.
> 
> Unless it is memcg OOM, you could try OOM notifier callback for checking
> whether there are SIGKILL pending processes and wait for timeout if any.

Hell no! You surely do not want all the OOM livelocks you were pushing
so hard to get fixed, do you?

The whole problem here is that there are two implementations of OOM
handling and they do not use any synchronization. You cannot really be
surprised that they step on each other's toes. That is one of the reasons
why I really hated the LMK in the kernel, btw.

Stalling shrinkers is a real problem and it should be addressed but
let's not screw an already nasty/fragile code all around that.
Tetsuo Handa - Dec. 10, 2017, 11:37 a.m.
Michal Hocko wrote:
> > > I agree that making waits/loops killable is generally good. But be sure to be
> > > prepared for the worst case. For example, start __GFP_KILLABLE from "best effort"
> > > basis (i.e. no guarantee that the allocating thread will leave the page allocator
> > > slowpath immediately) and check for fatal_signal_pending() only if
> > > __GFP_KILLABLE is set. That is,
> > >
> > > +               /*
> > > +                * We are about to die and free our memory.
> > > +                * Stop shrinking which might delay signal handling.
> > > +                */
> > > +               if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
> > > +                       break;
> > >
> > > at shrink_slab() etc. and
> > >
> > > +               if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
> > > +                       goto nopage;
> > >
> > > at __alloc_pages_slowpath().
> > 
> > I was thinking about something similar and will experiment to see if
> > this solves the problem and if it has any side effects. Anyone sees
> > any obvious problems with this approach?
> 
> Tetsuo has been proposing this flag in the past and I've had objections
> why this is not a great idea. I do not have any link handy but the core
> objection was that the semantic would be too fuzzy. All the allocations
> in the same context would have to be killable for this flag to have any
> effect. Spreading it all over the kernel is simply not feasible.
> 

Refusing __GFP_KILLABLE based on "All the allocations in the same context
would have to be killable" does not make sense. Outside of MM, we update
code to use the _killable versions step by step, on a best-effort basis.
People don't call efforts to change code like

  func1() {
    // As of this point it is easy to bail out.
    if (mutex_lock_killable(&lock1) == 0) {
      func2();
      mutex_unlock(&lock1);
    }
  }

  func2() {
    mutex_lock(&lock2);
    // Do something which is not possible to bail out for now.
    mutex_unlock(&lock2);
  }

pointless.

If you insist that all the allocations in the same context must be
killable, then we will have to offload all such activities to some
kernel thread.
Suren Baghdasaryan - Dec. 11, 2017, 9:05 p.m.
On Sat, Dec 9, 2017 at 12:08 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Suren Baghdasaryan wrote:
>> On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> >> > >> This change checks for pending
>> >> > >> fatal signals inside shrink_slab loop and if one is detected
>> >> > >> terminates this loop early.
>> >> > >
>> >> > > This changelog doesn't really address my previous review feedback, I am
>> >> > > afraid. You should mention more details about problems you are seeing
>> >> > > and what causes them.
>>
>> The problem I'm facing is that a SIGKILL sent from user space to kill
>> the least important process is delayed enough for OOM-killer to get a
>> chance to kill something else, possibly a more important process. Here
>> "important" is from user's point of view. So the delay in SIGKILL
>> delivery effectively causes extra kills. Traces indicate that this
>> delay happens when process being killed is in direct reclaim and
>> shrinkers (before I fixed them) were the biggest cause for the delay.
>
> Sending SIGKILL from userspace is not releasing memory fast enough to prevent
> the OOM killer from invoking? Yes, under memory pressure, even an attempt to
> send SIGKILL from userspace could be delayed due to e.g. page fault.
>

I understand that there will be cases when OOM is unavoidable. I'm
trying to minimize the cases when SIGKILL processing is delayed.

> Unless it is memcg OOM, you could try OOM notifier callback for checking
> whether there are SIGKILL pending processes and wait for timeout if any.
> This situation resembles timeout-based OOM killing discussion, where the OOM
> killer is enabled again (based on an assumption that the OOM victim got stuck
> due to OOM lockup) after some timeout from previous OOM-killing. And since
> we did not merge timeout-based approach, there is no timestamp field for
> remembering when the SIGKILL was delivered. Therefore, an attempt to wait for
> timeout would become messy.
>
> Regardless of whether you try OOM notifier callback, I think that adding
> __GFP_KILLABLE and allow bailing out without calling out_of_memory() and
> warn_alloc() will help terminating killed processes faster. I think that
> majority of memory allocation requests can give up upon SIGKILL. Important
> allocation requests which should not give up upon SIGKILL (e.g. committing
> to filesystem metadata and storage) can be offloaded to kernel threads.
>
>>
>> >> > > If we have a shrinker which takes considerable
>> >> > > amount of time them we should be addressing that. If that is not
>> >> > > possible then it should be documented at least.
>>
>> I already submitted patches for couple shrinkers. Problem became less
>> pronounced and less frequent but the retry loop Tetsuo mentioned still
>> visibly delays the delivery. The worst case I've seen after fixing
>> shrinkers is 43ms.
>
> You meant "delays the termination (of the killed process)" rather than
> "delays the delivery (of SIGKILL)", didn't you?
>

To be more precise, I'm talking about the delay between send_signal()
and get_signal(), or in other words the delay between the moment when
we send the SIGKILL and the moment we handle it.

>> > I agree that making waits/loops killable is generally good. But be sure to be
>> > prepared for the worst case. For example, start __GFP_KILLABLE from "best effort"
>> > basis (i.e. no guarantee that the allocating thread will leave the page allocator
>> > slowpath immediately) and check for fatal_signal_pending() only if
>> > __GFP_KILLABLE is set. That is,
>> >
>> > +               /*
>> > +                * We are about to die and free our memory.
>> > +                * Stop shrinking which might delay signal handling.
>> > +                */
>> > +               if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
>> > +                       break;
>> >
>> > at shrink_slab() etc. and
>> >
>> > +               if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
>> > +                       goto nopage;
>> >
>> > at __alloc_pages_slowpath().
>>
>> I was thinking about something similar and will experiment to see if
>> this solves the problem and if it has any side effects. Anyone sees
>> any obvious problems with this approach?
>>
>
> It is Michal who thinks that "killability for a particular allocation request sounds
> like a hack" ( http://lkml.kernel.org/r/201606112335.HBG09891.OLFJOFtVMOQHSF@I-love.SAKURA.ne.jp ).
> I'm willing to give up memory allocations from functions which are called by
> system calls if SIGKILL is pending. Thus, it should be time to try __GFP_KILLABLE.
Suren Baghdasaryan - Dec. 11, 2017, 9:12 p.m.
On Sun, Dec 10, 2017 at 2:13 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 08-12-17 10:06:26, Suren Baghdasaryan wrote:
>> On Fri, Dec 8, 2017 at 6:03 AM, Tetsuo Handa
>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>> > Michal Hocko wrote:
>> >> On Fri 08-12-17 20:36:16, Tetsuo Handa wrote:
>> >> > On 2017/12/08 17:22, Michal Hocko wrote:
>> >> > > On Thu 07-12-17 17:23:05, Suren Baghdasaryan wrote:
>> >> > >> Slab shrinkers can be quite time consuming and when signal
>> >> > >> is pending they can delay handling of the signal. If fatal
>> >> > >> signal is pending there is no point in shrinking that process
>> >> > >> since it will be killed anyway.
>> >> > >
>> >> > > The thing is that we are _not_ shrinking _that_ process. We are
>> >> > > shrinking globally shared objects and the fact that the memory pressure
>> >> > > is so large that the kswapd doesn't keep pace with it means that we have
>> >> > > to throttle all allocation sites by doing this direct reclaim. I agree
>> >> > > that expediting killed task is a good thing in general because such a
>> >> > > process should free at least some memory.
>>
>> Agreed, the wording here is inaccurate. My original intent was to have a
>> safeguard against slow shrinkers, but I understand your concern that
>> this can mask a real problem in a shrinker. In essence, expediting the
>> killing is the ultimate goal here, but as you mentioned it's not as
>> simple as this change.
>
> Moreover it doesn't work if the SIGKILL can be delivered asynchronously
> (which is your case AFAICU).  You can be already running the slow
> shrinker at that time...

That is true, but at least my change would not allow more than one
shrinker to run while SIGKILL is pending. That's why this change is
not a fix for slow shrinkers; they have to be properly fixed with or
without this change.

>
> [...]
>> > I agree that making waits/loops killable is generally good. But be sure to be
>> > prepared for the worst case. For example, start __GFP_KILLABLE from "best effort"
>> > basis (i.e. no guarantee that the allocating thread will leave the page allocator
>> > slowpath immediately) and check for fatal_signal_pending() only if
>> > __GFP_KILLABLE is set. That is,
>> >
>> > +               /*
>> > +                * We are about to die and free our memory.
>> > +                * Stop shrinking which might delay signal handling.
>> > +                */
>> > +               if (unlikely((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current)))
>> > +                       break;
>> >
>> > at shrink_slab() etc. and
>> >
>> > +               if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
>> > +                       goto nopage;
>> >
>> > at __alloc_pages_slowpath().
>>
>> I was thinking about something similar and will experiment to see if
>> this solves the problem and if it has any side effects. Anyone sees
>> any obvious problems with this approach?
>
> Tetsuo has been proposing this flag in the past and I've had objections
> why this is not a great idea. I do not have any link handy but the core
> objection was that the semantic would be too fuzzy. All the allocations
> in the same context would have to be killable for this flag to have any
> effect. Spreading it all over the kernel is simply not feasible.

Sounds like I'll have to do much more reading before touching this code :)

>
> --
> Michal Hocko
> SUSE Labs

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c02c850ea349..28e4bdc72c16 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -486,6 +486,13 @@  static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 			.memcg = memcg,
 		};
 
+		/*
+		 * We are about to die and free our memory.
+		 * Stop shrinking which might delay signal handling.
+		 */
+		if (unlikely(fatal_signal_pending(current)))
+			break;
+
 		/*
 		 * If kernel memory accounting is disabled, we ignore
 		 * SHRINKER_MEMCG_AWARE flag and call all shrinkers