diff mbox series

[RFC] mm, oom: oom ratelimit auto tuning

Message ID 1586597774-6831-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Headers show
Series [RFC] mm, oom: oom ratelimit auto tuning | expand

Commit Message

Yafang Shao April 11, 2020, 9:36 a.m. UTC
Recently we found an issue where the server is almost unresponsive for
several minutes when OOM happens. That is caused by a slow serial
console set up with "console=ttyS1,19200". Because this serial line is
so slow, it takes almost 10 seconds to print a full OOM message to it.
Meanwhile, all tasks allocating pages are blocked, as almost no pages
can be reclaimed, and the memory pressure stays around 90 for a long
time. If we don't print the OOM messages to this serial console, a full
OOM message takes less than 1 ms and the memory pressure stays below 40.

We can avoid printing OOM messages to the slow serial console by
adjusting /proc/sys/kernel/printk, but then messages at KERN_WARNING
level can't be printed to it either, which may lose useful messages
when we want to collect console output for debugging.

So it is better to decrease the ratelimit. We could introduce sysctl
knobs similar to printk_ratelimit and printk_ratelimit_burst, but that
would burden the admin. Letting the kernel automatically adjust the
ratelimit is a better choice.

The OOM ratelimit starts at a conservative rate; it is relaxed slowly
while the console proves to be fast and tightened rapidly once the
console proves to be slow. oom_rs.burst stays within [1, 10] and
oom_rs.interval never drops below 5 * HZ.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)

Comments

Michal Hocko April 14, 2020, 7:39 a.m. UTC | #1
On Sat 11-04-20 05:36:14, Yafang Shao wrote:
> Recently we find an issue that when OOM happens the server is almost
> unresponsive for several minutes. That is caused by a slow serial set
> with "console=ttyS1,19200". As the speed of this serial is too slow, it
> will take almost 10 seconds to print a full OOM message into it. And
> then all tasks allocating pages will be blocked as there is almost no
> pages can be reclaimed. At that time, the memory pressure is around 90
> for a long time. If we don't print the OOM messages into this serial,
> a full OOM message only takes less than 1ms and the memory pressure is
> less than 40.

Which part of the oom report takes the most time? I would expect this to
be the dump_tasks part, which can be pretty large when there are a lot
of eligible tasks to kill.
 
> We can avoid printing OOM messages into slow serial by adjusting
> /proc/sys/kernel/printk to fix this issue, but then all messages with
> KERN_WARNING level can't be printed into it neither, that may loss some
> useful messages when we want to collect messages from the it for
> debugging purpose.

A large part of the oom report is printed with KERN_INFO log level. So
you can reduce a large part of the output while not losing other
potentially important information.

> So it is better to decrease the ratelimit. We can introduce some sysctl
> knobes similar with printk_ratelimit and burst, but it will burden the
> amdin. Let the kernel automatically adjust the ratelimit, that would be
> a better choice.

No new knobs for ratelimiting. Admin shouldn't really care about these
things. Besides that, I strongly suspect that you would be much better
off disabling /proc/sys/vm/oom_dump_tasks, which would reduce the
amount of output a lot. Or do you really require this information when
debugging oom reports?

> The OOM ratelimit starts with a slow rate, and it will increase slowly
> if the speed of the console is rapid and decrease rapidly if the speed
> of the console is slow. oom_rs.burst will be in [1, 10] and
> oom_rs.interval will always greater than 5 * HZ.

I am not against increasing the ratelimit timeout. But this patch seems
to be trying to be too clever. Why can't we simply increase the
parameters of the ratelimit? I am also interested in whether this
actually works. AFAIR the ratelimit doesn't work reliably when the
ratelimited operation takes a long time, because the internals have no
way to see when the operation finished.

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 48 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dfc357614e56..23dba8ccf313 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  {
>  	struct task_struct *victim = oc->chosen;
>  	struct mem_cgroup *oom_group;
> -	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> -					      DEFAULT_RATELIMIT_BURST);
> +	static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
> +	int delta;
> +	unsigned long start;
> +	unsigned long end;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	}
>  	task_unlock(victim);
>  
> -	if (__ratelimit(&oom_rs))
> +	if (__ratelimit(&oom_rs)) {
> +		start = jiffies;
>  		dump_header(oc, victim);
> +		end = jiffies;
> +		delta = end - start;
> +
> +		/*
> +		 * The OOM messages may be printed to a serial with very low
> +		 * speed, e.g. console=ttyS1,19200. It will take long
> +		 * time to print these OOM messages to this serial, and
> +		 * then processes allocating pages will all be blocked due
> +		 * to it can hardly reclaim pages. That will case high
> +		 * memory pressure and the system may be unresponsive for a
> +		 * long time.
> +		 * In this case, we should decrease the OOM ratelimit or
> +		 * avoid printing OOM messages into the slow serial. But if
> +		 * we avoid printing OOM messages into the slow serial, all
> +		 * messages with KERN_WARNING level can't be printed into
> +		 * it neither, that may loss some useful messages when we
> +		 * want to collect messages from the console for debugging
> +		 * purpose. So it is better to decrease the ratelimit. We
> +		 * can introduce some sysctl knobes similar with
> +		 * printk_ratelimit and burst, but it will burden the
> +		 * admin. Let the kernel automatically adjust the ratelimit
> +		 * would be a better chioce.
> +		 * In bellow algorithm, it will decrease the OOM ratelimit
> +		 * rapidly if the console is slow and increase the OOM
> +		 * ratelimit slowly if the console is fast. oom_rs.burst
> +		 * will be in [1, 10] and oom_rs.interval will always
> +		 * greater than 5 * HZ.
> +		 */
> +		if (delta < oom_rs.interval / 10) {
> +			if (oom_rs.interval >= 10 * HZ)
> +				oom_rs.interval /= 2;
> +			else if (oom_rs.interval > 6 * HZ)
> +				oom_rs.interval -= HZ;
> +
> +			if (oom_rs.burst < 10)
> +				oom_rs.burst += 1;
> +		} else if (oom_rs.burst > 1) {
> +			oom_rs.burst = 1;
> +			oom_rs.interval = 4 * delta;
> +		}
> +
> +	}
>  
>  	/*
>  	 * Do we need to kill the entire memory cgroup?
> -- 
> 2.18.2
Yafang Shao April 14, 2020, 12:32 p.m. UTC | #2
On Tue, Apr 14, 2020 at 3:39 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sat 11-04-20 05:36:14, Yafang Shao wrote:
> > Recently we find an issue that when OOM happens the server is almost
> > unresponsive for several minutes. That is caused by a slow serial set
> > with "console=ttyS1,19200". As the speed of this serial is too slow, it
> > will take almost 10 seconds to print a full OOM message into it. And
> > then all tasks allocating pages will be blocked as there is almost no
> > pages can be reclaimed. At that time, the memory pressure is around 90
> > for a long time. If we don't print the OOM messages into this serial,
> > a full OOM message only takes less than 1ms and the memory pressure is
> > less than 40.
>
> Which part of the oom report takes the most time? I would expect this to
> be the dump_tasks part which can be pretty large when there is a lot of
> eligible tasks to kill.
>

Yes, dump_tasks takes around 6 s of the total 10 s, show_mem takes
around 2 s, and dump_stack takes around 0.8 s.

> > We can avoid printing OOM messages into slow serial by adjusting
> > /proc/sys/kernel/printk to fix this issue, but then all messages with
> > KERN_WARNING level can't be printed into it neither, that may loss some
> > useful messages when we want to collect messages from the it for
> > debugging purpose.
>
> A large part of the oom report is printed with KERN_INFO log level. So
> you can reduce a large part of the output while not losing other
> potentially important information.
>

Reducing the KERN_INFO log can save a lot of time, but I'm worried that
sometimes users may need the full log, and if they can't find these
messages they may complain.

> > So it is better to decrease the ratelimit. We can introduce some sysctl
> > knobes similar with printk_ratelimit and burst, but it will burden the
> > amdin. Let the kernel automatically adjust the ratelimit, that would be
> > a better choice.
>
> No new knobs for ratelimiting. Admin shouldn't really care about these
> things.

Agreed.

[snip]
> Besides that I strongly suspect that you would be much better of
> by disabling /proc/sys/vm/oom_dump_tasks which would reduce the amount
> of output a lot. Or do you really require this information when
> debugging oom reports?
>

Yes, disabling /proc/sys/vm/oom_dump_tasks can save a lot of time.
But I'm not sure whether we can disable it entirely, because disabling
it would also prevent the task list from being written to
/var/log/messages.

> > The OOM ratelimit starts with a slow rate, and it will increase slowly
> > if the speed of the console is rapid and decrease rapidly if the speed
> > of the console is slow. oom_rs.burst will be in [1, 10] and
> > oom_rs.interval will always greater than 5 * HZ.
>
> I am not against increasing the ratelimit timeout. But this patch seems
> to be trying to be too clever.  Why cannot we simply increase the
> parameters of the ratelimit?

I was just worried that users may complain if too many
oom_kill_process callbacks are suppressed.
But considering that OOM bursts at the same time always have the same
root cause, I think one snapshot of the OOM may be enough.
Simply setting oom_rs to {20 * HZ, 1} can resolve this issue.

> I am also interested whether this actually
> works. AFAIR ratelimit doesn't really work reliably when the ratelimited
> operation takes a long time because the internals have no way to see
> when the operation finished.
>

I agree with you that __ratelimit() is not so reliable.

> >  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 48 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index dfc357614e56..23dba8ccf313 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  {
> >       struct task_struct *victim = oc->chosen;
> >       struct mem_cgroup *oom_group;
> > -     static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > -                                           DEFAULT_RATELIMIT_BURST);
> > +     static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
> > +     int delta;
> > +     unsigned long start;
> > +     unsigned long end;
> >
> >       /*
> >        * If the task is already exiting, don't alarm the sysadmin or kill
> > @@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >       }
> >       task_unlock(victim);
> >
> > -     if (__ratelimit(&oom_rs))
> > +     if (__ratelimit(&oom_rs)) {
> > +             start = jiffies;
> >               dump_header(oc, victim);
> > +             end = jiffies;
> > +             delta = end - start;
> > +
> > +             /*
> > +              * The OOM messages may be printed to a serial with very low
> > +              * speed, e.g. console=ttyS1,19200. It will take long
> > +              * time to print these OOM messages to this serial, and
> > +              * then processes allocating pages will all be blocked due
> > +              * to it can hardly reclaim pages. That will case high
> > +              * memory pressure and the system may be unresponsive for a
> > +              * long time.
> > +              * In this case, we should decrease the OOM ratelimit or
> > +              * avoid printing OOM messages into the slow serial. But if
> > +              * we avoid printing OOM messages into the slow serial, all
> > +              * messages with KERN_WARNING level can't be printed into
> > +              * it neither, that may loss some useful messages when we
> > +              * want to collect messages from the console for debugging
> > +              * purpose. So it is better to decrease the ratelimit. We
> > +              * can introduce some sysctl knobes similar with
> > +              * printk_ratelimit and burst, but it will burden the
> > +              * admin. Let the kernel automatically adjust the ratelimit
> > +              * would be a better chioce.
> > +              * In bellow algorithm, it will decrease the OOM ratelimit
> > +              * rapidly if the console is slow and increase the OOM
> > +              * ratelimit slowly if the console is fast. oom_rs.burst
> > +              * will be in [1, 10] and oom_rs.interval will always
> > +              * greater than 5 * HZ.
> > +              */
> > +             if (delta < oom_rs.interval / 10) {
> > +                     if (oom_rs.interval >= 10 * HZ)
> > +                             oom_rs.interval /= 2;
> > +                     else if (oom_rs.interval > 6 * HZ)
> > +                             oom_rs.interval -= HZ;
> > +
> > +                     if (oom_rs.burst < 10)
> > +                             oom_rs.burst += 1;
> > +             } else if (oom_rs.burst > 1) {
> > +                     oom_rs.burst = 1;
> > +                     oom_rs.interval = 4 * delta;
> > +             }
> > +
> > +     }
> >
> >       /*
> >        * Do we need to kill the entire memory cgroup?
> > --
> > 2.18.2
>
> --


Thanks
Yafang
Michal Hocko April 14, 2020, 2:32 p.m. UTC | #3
On Tue 14-04-20 20:32:54, Yafang Shao wrote:
> On Tue, Apr 14, 2020 at 3:39 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > Besides that I strongly suspect that you would be much better of
> > by disabling /proc/sys/vm/oom_dump_tasks which would reduce the amount
> > of output a lot. Or do you really require this information when
> > debugging oom reports?
> >
> 
> Yes, disabling /proc/sys/vm/oom_dump_tasks can save lots of time.
> But I'm not sure whehter we can disable it totally, because disabling
> it would prevent the tasks log from being wrote into /var/log/messages
> neither.

Yes, the eligible tasks would really be missing. The real question is
whether you are actually going to miss that information. From my
experience of looking into oom reports for years, I can tell that the
list might be useful, but in the vast majority of cases I simply do not
need it, because the state of memory and the chosen victim are much
more important. The list of tasks is usually interesting only when you
want to double-check whether the victim selection was reasonable, or in
cases where the list of tasks itself can tell whether something went
wild in userspace.

> > > The OOM ratelimit starts with a slow rate, and it will increase slowly
> > > if the speed of the console is rapid and decrease rapidly if the speed
> > > of the console is slow. oom_rs.burst will be in [1, 10] and
> > > oom_rs.interval will always greater than 5 * HZ.
> >
> > I am not against increasing the ratelimit timeout. But this patch seems
> > to be trying to be too clever.  Why cannot we simply increase the
> > parameters of the ratelimit?
> 
> I justed worried that the user may complain it if too many
> oom_kill_process callbacks are suppressed.

This can be a real concern indeed.

> But considering that OOM burst at the same time are always because of
> the same reason,

This is not really the case. Please note that many parallel OOM killers
might happen in memory cgroup setups.

> so I think one snapshot of the OOM may be enough.
> Simply setting oom_rs with {20 * HZ, 1} can resolve this issue.

Does it really though? The ratelimit doesn't stop the long-running
output. It simply cannot, because the work is already done.

That being said, making the ratelimiting more aggressive sounds more
like a workaround than an actual fix. So I would go that route only if
there is no other option. I believe the real problem here is printk
being too synchronous. This is a general problem and something the
printk maintainers are already working on.

For now I would recommend working around this problem by reducing the
log level or disabling dump_tasks.
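Both workarounds recommended here use existing knobs; a sketch of what they might look like (values chosen for illustration, using the documented sysctl semantics):

```shell
# Drop the per-task dump from the OOM report (the largest chunk of output
# in the case reported in this thread):
sysctl -w vm.oom_dump_tasks=0

# Or keep KERN_INFO lines off the console while still printing warnings.
# kernel.printk fields: console_loglevel default_message_loglevel
#                       minimum_console_loglevel default_console_loglevel
# console_loglevel=5 lets KERN_WARNING (4) through but not KERN_INFO (6);
# suppressed lines still reach the kernel log buffer and syslog.
sysctl -w kernel.printk="5 4 1 7"
```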
Yafang Shao April 14, 2020, 2:58 p.m. UTC | #4
On Tue, Apr 14, 2020 at 10:32 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 14-04-20 20:32:54, Yafang Shao wrote:
> > On Tue, Apr 14, 2020 at 3:39 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > Besides that I strongly suspect that you would be much better of
> > > by disabling /proc/sys/vm/oom_dump_tasks which would reduce the amount
> > > of output a lot. Or do you really require this information when
> > > debugging oom reports?
> > >
> >
> > Yes, disabling /proc/sys/vm/oom_dump_tasks can save lots of time.
> > But I'm not sure whehter we can disable it totally, because disabling
> > it would prevent the tasks log from being wrote into /var/log/messages
> > neither.
>
> Yes, eligible tasks would be really missing. The real question is
> whether you are really going to miss that information. From my
> experience of looking into oom reports for years I can tell that the
> list might be useful but in a vast majority of cases I simply do not
> really neeed it because the stat of memory and chosen victims are much
> more important. The list of tasks is usually interesting only when you
> want to double check whether the victim selection was reasonable or
> cases where a list of tasks itself can tell whether something went wild
> in the userspace.
>

Agreed. From my experience, the list of tasks is mainly used to double
check the oom score.

> > > > The OOM ratelimit starts with a slow rate, and it will increase slowly
> > > > if the speed of the console is rapid and decrease rapidly if the speed
> > > > of the console is slow. oom_rs.burst will be in [1, 10] and
> > > > oom_rs.interval will always greater than 5 * HZ.
> > >
> > > I am not against increasing the ratelimit timeout. But this patch seems
> > > to be trying to be too clever.  Why cannot we simply increase the
> > > parameters of the ratelimit?
> >
> > I justed worried that the user may complain it if too many
> > oom_kill_process callbacks are suppressed.
>
> This can be a real concern indeed.
>
> > But considering that OOM burst at the same time are always because of
> > the same reason,
>
> This is not really the case. Please note that many parallel OOM killers
> might happen in memory cgroup setups.
>
> > so I think one snapshot of the OOM may be enough.
> > Simply setting oom_rs with {20 * HZ, 1} can resolve this issue.
>
> Does it really though? The ratelimit doesn't stop the long taking
> output. It simply cannot because the work is already done.
>
> That being said, making the ratelimiting more aggressive sounds more
> like a workaround than an actual fix. So I would go that route only if
> there is no other option. I believe the real problem here is in printk
> being too synchronous here. This is a general problem and something
> printk maintainers are already working on.
>

Yes, printk being too synchronous is the real issue. If printk can be
async, then we don't need to worry about it at all.

> For now I would recommend to workaround this problem by reducing the log
> level or disabling dump_tasks.
>

Reducing the log level is what we have been doing.
Many thanks for your patient explanation.


Thanks
Yafang
Tetsuo Handa April 15, 2020, 5:58 a.m. UTC | #5
On 2020/04/14 23:58, Yafang Shao wrote:
>>>>> The OOM ratelimit starts with a slow rate, and it will increase slowly
>>>>> if the speed of the console is rapid and decrease rapidly if the speed
>>>>> of the console is slow. oom_rs.burst will be in [1, 10] and
>>>>> oom_rs.interval will always greater than 5 * HZ.
>>>>
>>>> I am not against increasing the ratelimit timeout. But this patch seems
>>>> to be trying to be too clever.  Why cannot we simply increase the
>>>> parameters of the ratelimit?
>>>
>>> I justed worried that the user may complain it if too many
>>> oom_kill_process callbacks are suppressed.
>>
>> This can be a real concern indeed.

I'm proposing automated ratelimiting of dump_tasks() at
http://lkml.kernel.org/r/1563360901-8277-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
I believe that automated ratelimiting of dump_tasks() remains necessary
even after printk() becomes asynchronous.

>>
>>> But considering that OOM burst at the same time are always because of
>>> the same reason,
>>
>> This is not really the case. Please note that many parallel OOM killers
>> might happen in memory cgroup setups.
>>
>>> so I think one snapshot of the OOM may be enough.
>>> Simply setting oom_rs with {20 * HZ, 1} can resolve this issue.
>>
>> Does it really though? The ratelimit doesn't stop the long taking
>> output. It simply cannot because the work is already done.
>>
>> That being said, making the ratelimiting more aggressive sounds more
>> like a workaround than an actual fix. So I would go that route only if
>> there is no other option. I believe the real problem here is in printk
>> being too synchronous here. This is a general problem and something
>> printk maintainers are already working on.
>>
> 
> Yes, printk being too sync is the real issue. If the printk an be
> async, then we don't need to worry about it at all.

I strongly disagree. dump_tasks() will needlessly fill the printk() log
buffer (and potentially lose other kernel messages due to the buffer
being full / disk full).

By the way, Petr and Sergey, how is the progress of making printk() asynchronous?
When can we expect that work to be merged?
Yafang Shao April 17, 2020, 11:57 a.m. UTC | #6
On Wed, Apr 15, 2020 at 1:58 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/04/14 23:58, Yafang Shao wrote:
> >>>>> The OOM ratelimit starts with a slow rate, and it will increase slowly
> >>>>> if the speed of the console is rapid and decrease rapidly if the speed
> >>>>> of the console is slow. oom_rs.burst will be in [1, 10] and
> >>>>> oom_rs.interval will always greater than 5 * HZ.
> >>>>
> >>>> I am not against increasing the ratelimit timeout. But this patch seems
> >>>> to be trying to be too clever.  Why cannot we simply increase the
> >>>> parameters of the ratelimit?
> >>>
> >>> I justed worried that the user may complain it if too many
> >>> oom_kill_process callbacks are suppressed.
> >>
> >> This can be a real concern indeed.
>
> I'm proposing automated ratelimiting of dump_tasks() at
> http://lkml.kernel.org/r/1563360901-8277-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> I believe that automated ratelimiting of dump_tasks() remains necessary
> even after printk() became asynchronous.
>

Thanks for the information.
I haven't read your proposal carefully, but at first glance I think it
would be a useful improvement.


> >>
> >>> But considering that OOM burst at the same time are always because of
> >>> the same reason,
> >>
> >> This is not really the case. Please note that many parallel OOM killers
> >> might happen in memory cgroup setups.
> >>
> >>> so I think one snapshot of the OOM may be enough.
> >>> Simply setting oom_rs with {20 * HZ, 1} can resolve this issue.
> >>
> >> Does it really though? The ratelimit doesn't stop the long taking
> >> output. It simply cannot because the work is already done.
> >>
> >> That being said, making the ratelimiting more aggressive sounds more
> >> like a workaround than an actual fix. So I would go that route only if
> >> there is no other option. I believe the real problem here is in printk
> >> being too synchronous here. This is a general problem and something
> >> printk maintainers are already working on.
> >>
> >
> > Yes, printk being too sync is the real issue. If the printk an be
> > async, then we don't need to worry about it at all.
>
> I strongly disagree. dump_tasks() will needlessly fill printk() log buffer
> (and potentially loose other kernel messages due to buffer full / disk full).
>

Yup, the printk() log buffer will be an issue if the console is too
slow. After printk() is implemented as async, I think it is worth doing
some optimization.

> By the way, Petr and Sergey, how is the progress of making printk() asynchronous?
> When can we expect that work to be merged?
>



Thanks
Yafang
Tetsuo Handa April 17, 2020, 1:03 p.m. UTC | #7
On 2020/04/17 20:57, Yafang Shao wrote:
>>>>> I justed worried that the user may complain it if too many
>>>>> oom_kill_process callbacks are suppressed.
>>>>
>>>> This can be a real concern indeed.
>>
>> I'm proposing automated ratelimiting of dump_tasks() at
>> http://lkml.kernel.org/r/1563360901-8277-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
>> I believe that automated ratelimiting of dump_tasks() remains necessary
>> even after printk() became asynchronous.
>>
> 
> Thanks for your information.
> I haven't read your proposal carefully, but take a first glance I
> think it would be a useful improvement.

Thank you. That patch alone only avoids the RCU stall. But
https://lkml.kernel.org/r/7de2310d-afbd-e616-e83a-d75103b986c6@i-love.sakura.ne.jp and
https://lkml.kernel.org/r/57be50b2-a97a-e559-e4bd-10d923895f83@i-love.sakura.ne.jp
referenced from that thread allow deferred printing of OOM victim candidates. And

>>> Yes, printk being too sync is the real issue. If the printk an be
>>> async, then we don't need to worry about it at all.
>>
>> I strongly disagree. dump_tasks() will needlessly fill printk() log buffer
>> (and potentially loose other kernel messages due to buffer full / disk full).
>>
> 
> Yup, printk() log buffer will be a issue if the console is too slow.
> After the printk() is implemented as async, I thinks it is worth to do
> some optimization.

my suggestion is to offload printing of OOM victim candidates to a workqueue context.
Then, even after printk() becomes asynchronous, that workqueue waits for completion of
printing to the consoles for each OOM victim candidate. This way, a dump_tasks() from a
later OOM-killer invocation is suppressed while the dumping from past OOM-killer
invocations has not yet completed, so duplicated OOM victims won't be reported many
times over (and this also saves printk() log buffer / disk space).

I need real world reports (like your report)...
Yafang Shao April 17, 2020, 1:55 p.m. UTC | #8
On Fri, Apr 17, 2020 at 9:04 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/04/17 20:57, Yafang Shao wrote:
> >>>>> I justed worried that the user may complain it if too many
> >>>>> oom_kill_process callbacks are suppressed.
> >>>>
> >>>> This can be a real concern indeed.
> >>
> >> I'm proposing automated ratelimiting of dump_tasks() at
> >> http://lkml.kernel.org/r/1563360901-8277-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
> >> I believe that automated ratelimiting of dump_tasks() remains necessary
> >> even after printk() became asynchronous.
> >>
> >
> > Thanks for your information.
> > I haven't read your proposal carefully, but take a first glance I
> > think it would be a useful improvement.
>
> Thank you. That patch alone avoids just RCU stall. But
> https://lkml.kernel.org/r/7de2310d-afbd-e616-e83a-d75103b986c6@i-love.sakura.ne.jp and
> https://lkml.kernel.org/r/57be50b2-a97a-e559-e4bd-10d923895f83@i-love.sakura.ne.jp
> referenced from that thread allows defer printing of OOM victim candidates. And
>
> >>> Yes, printk being too sync is the real issue. If the printk an be
> >>> async, then we don't need to worry about it at all.
> >>
> >> I strongly disagree. dump_tasks() will needlessly fill printk() log buffer
> >> (and potentially loose other kernel messages due to buffer full / disk full).
> >>
> >
> > Yup, printk() log buffer will be a issue if the console is too slow.
> > After the printk() is implemented as async, I thinks it is worth to do
> > some optimization.
>
> my suggestion is to offload printing of OOM victim candidates to a workqueue context.
> Then, even after printk() became asynchronous, that workqueue waits for completion of
> printing to consoles for each OOM victim candidate. This way, only dump_tasks() where
> dumping of past OOM-killer invocations has not completed will suppress dump_tasks()
>  from later OOM-killer invocations in a way duplicated OOM victims won't be reported
> for many times (and also saves printk() log buffer / disk space).
>

Sounds like a good idea.

> I need real world reports (like your report)...

I'd appreciate it if my report could help you.
Feel free to let me know if you need more information.

Thanks
Yafang

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfc357614e56..23dba8ccf313 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -954,8 +954,10 @@  static void oom_kill_process(struct oom_control *oc, const char *message)
 {
 	struct task_struct *victim = oc->chosen;
 	struct mem_cgroup *oom_group;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
+	static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
+	int delta;
+	unsigned long start;
+	unsigned long end;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -972,8 +974,51 @@  static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(victim);
 
-	if (__ratelimit(&oom_rs))
+	if (__ratelimit(&oom_rs)) {
+		start = jiffies;
 		dump_header(oc, victim);
+		end = jiffies;
+		delta = end - start;
+
+		/*
+		 * The OOM messages may be printed to a serial console
+		 * with a very low speed, e.g. console=ttyS1,19200. It
+		 * takes a long time to print the OOM messages to such
+		 * a console, and meanwhile processes allocating pages
+		 * are all blocked because pages can hardly be
+		 * reclaimed. That causes high memory pressure and the
+		 * system may be unresponsive for a long time.
+		 * In this case, we should decrease the OOM ratelimit or
+		 * avoid printing OOM messages to the slow console. But
+		 * if we avoid printing OOM messages to the slow console,
+		 * messages with KERN_WARNING level can't be printed to
+		 * it either, which may lose useful messages when we
+		 * want to collect messages from the console for
+		 * debugging. So it is better to decrease the ratelimit.
+		 * We could introduce sysctl knobs similar to
+		 * printk_ratelimit and printk_ratelimit_burst, but that
+		 * would burden the admin. Letting the kernel
+		 * automatically adjust the ratelimit is a better choice.
+		 * The algorithm below decreases the OOM ratelimit
+		 * rapidly if the console is slow and increases it
+		 * slowly if the console is fast. oom_rs.burst stays in
+		 * [1, 10] and oom_rs.interval never drops below 5 * HZ.
+		 */
+		if (delta < oom_rs.interval / 10) {
+			if (oom_rs.interval >= 10 * HZ)
+				oom_rs.interval /= 2;
+			else if (oom_rs.interval > 6 * HZ)
+				oom_rs.interval -= HZ;
+
+			if (oom_rs.burst < 10)
+				oom_rs.burst += 1;
+		} else if (oom_rs.burst > 1) {
+			oom_rs.burst = 1;
+			oom_rs.interval = 4 * delta;
+		}
+
+	}
 
 	/*
 	 * Do we need to kill the entire memory cgroup?