[v3] mm: memcontrol: Don't flood OOM messages with no eligible task.

Message ID 1539770782-3343-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
State New

Commit Message

Tetsuo Handa Oct. 17, 2018, 10:06 a.m. UTC
syzbot is hitting an RCU stall at shmem_fault() [1].
This is because memcg OOM events with no eligible task (the current
thread is marked OOM-unkillable) keep calling dump_header() from
out_of_memory(), a path enabled by commit 3100dab2aa09dc6e ("mm:
memcontrol: print proper OOM header when no eligible victim left.").

Michal proposed ratelimiting dump_header() [2]. But I don't think that
patch is appropriate, because it does not ratelimit the

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

messages, which can be printed every few milliseconds (i.e. effectively
a denial of service for console users) until the OOM situation is solved.

Let's make sure that the next dump_header() waits at least 60 seconds
after the previous "Out of memory and no killable processes..." message.
Michal thinks that any interval is meaningless without knowing the
printk() throughput. But since printk() is synchronous unless handed
over to somebody else by commit dbdda842fe96f893 ("printk: Add console
owner and waiter logic to load balance console writes"), it is likely
that all OOM messages from this out_of_memory() request are already
flushed to the consoles by the time pr_warn("Out of memory and no
killable processes...\n") returns. Thus, we will be able to allow
console users to do what they need to do.

To summarize, this patch allows threads in requested memcg to complete
memory allocation requests for doing recovery operation, and also allows
administrators to manually do recovery operation from console if
OOM-unkillable thread is failing to solve the OOM situation automatically.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
[2] https://lkml.kernel.org/r/20181010151135.25766-1-mhocko@kernel.org

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Michal Hocko Oct. 17, 2018, 10:28 a.m. UTC | #1
On Wed 17-10-18 19:06:22, Tetsuo Handa wrote:
> syzbot is hitting RCU stall at shmem_fault() [1].
> This is because memcg-OOM events with no eligible task (current thread
> is marked as OOM-unkillable) continued calling dump_header() from
> out_of_memory() enabled by commit 3100dab2aa09dc6e ("mm: memcontrol:
> print proper OOM header when no eligible victim left.").
> 
> Michal proposed ratelimiting dump_header() [2]. But I don't think that
> that patch is appropriate because that patch does not ratelimit
> 
>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>   "Out of memory and no killable processes...\n"
> 
> messages which can be printed for every few milliseconds (i.e. effectively
> denial of service for console users) until the OOM situation is solved.
> 
> Let's make sure that next dump_header() waits for at least 60 seconds from
> previous "Out of memory and no killable processes..." message. Michal is
> thinking that any interval is meaningless without knowing the printk()
> throughput. But since printk() is synchronous unless handed over to
> somebody else by commit dbdda842fe96f893 ("printk: Add console owner and
> waiter logic to load balance console writes"), it is likely that all OOM
> messages from this out_of_memory() request is already flushed to consoles
> when pr_warn("Out of memory and no killable processes...\n") returned.
> Thus, we will be able to allow console users to do what they need to do.
> 
> To summarize, this patch allows threads in requested memcg to complete
> memory allocation requests for doing recovery operation, and also allows
> administrators to manually do recovery operation from console if
> OOM-unkillable thread is failing to solve the OOM situation automatically.

Could you explain why this is any better than using a well established
ratelimit approach?
Sergey Senozhatsky Oct. 17, 2018, 11:17 a.m. UTC | #2
On (10/17/18 12:28), Michal Hocko wrote:
> > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > that patch is appropriate because that patch does not ratelimit
> > 
> >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> >   "Out of memory and no killable processes...\n"
[..]
> > Let's make sure that next dump_header() waits for at least 60 seconds from
> > previous "Out of memory and no killable processes..." message.
> 
> Could you explain why this is any better than using a well established
> ratelimit approach?

Tetsuo, let's use a well-established rate-limit approach both in
dump_header() and out_of_memory(). I actually was under the impression
that Michal had added rate-limiting to both of these functions.

The appropriate rate-limit value looks like something that printk()
should know and be able to tell to the rest of the kernel. I don't
think that middle ground will ever be found elsewhere.


printk() knows what consoles are registered, and printk() also knows
(sometimes) what console="..." options the kernel was provided with.
If baud rates were not provided as console= options, then serial
consoles usually use some default value. We can probably ask consoles.

So *maybe* we can do something like this

//
// WARNING: this is just a sketch. A silly idea.
//          I don't know if we can make it usable.
//

---

int printk_ratelimit_interval(void)
{
       int ret = DEFAULT_RATELIMIT_INTERVAL;
       struct tty_driver *driver = NULL;
       struct console *c;
       int index;
       speed_t min_baud = INT_MAX;

       console_lock();
       for_each_console(c) {
               speed_t br;

               if (!c->device)
                       continue;
               if (!(c->flags & CON_ENABLED))
                       continue;
               if (!c->write)
                       continue;
               driver = c->device(c, index);
               if (!driver)
                       continue;

               br = tty_get_baud_rate(tty_driver to tty_struct [???]);
               min_baud = min(min_baud, br);
       }
       console_unlock();

       switch (min_baud) {
       case 115200:
               return ret;

       case ...blah blah...:
               return ret * 2;

       case 9600:
               return ret * 4;
       }
       return ret;
}

---

	-ss
Michal Hocko Oct. 17, 2018, 11:29 a.m. UTC | #3
On Wed 17-10-18 20:17:24, Sergey Senozhatsky wrote:
> On (10/17/18 12:28), Michal Hocko wrote:
> > > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > > that patch is appropriate because that patch does not ratelimit
> > > 
> > >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> > >   "Out of memory and no killable processes...\n"
> [..]
> > > Let's make sure that next dump_header() waits for at least 60 seconds from
> > > previous "Out of memory and no killable processes..." message.
> > 
> > Could you explain why this is any better than using a well established
> > ratelimit approach?
> 
> Tetsuo, let's use a well established rate-limit approach both in
> dump_hedaer() and out_of_memory(). I actually was under impression
> that Michal added rate-limiting to both of these functions.

I have http://lkml.kernel.org/r/20181010151135.25766-1-mhocko@kernel.org
Then the discussion took the usual direction of back and forth resulting
in "you do not ratelimit the allocation oom context" and "please do that
as an incremental patch" route and here we are. I do not have time and
energy to argue in an endless loop.

> The appropriate rate-limit value looks like something that printk()
> should know and be able to tell to the rest of the kernel. I don't
> think that middle ground will ever be found elsewhere.

Yes, that makes sense.
Tetsuo Handa Oct. 18, 2018, 2:46 a.m. UTC | #4
Sergey Senozhatsky wrote:
> On (10/17/18 12:28), Michal Hocko wrote:
> > > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > > that patch is appropriate because that patch does not ratelimit
> > > 
> > >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> > >   "Out of memory and no killable processes...\n"
> [..]
> > > Let's make sure that next dump_header() waits for at least 60 seconds from
> > > previous "Out of memory and no killable processes..." message.
> > 
> > Could you explain why this is any better than using a well established
> > ratelimit approach?

This is essentially a ratelimit approach, roughly equivalent with:

  static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
  oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;

  if (__ratelimit(&oom_no_victim_rs)) {
    dump_header(oc, NULL);
    pr_warn("Out of memory and no killable processes...\n");
    oom_no_victim_rs.begin = jiffies;
  }

> 
> Tetsuo, let's use a well established rate-limit approach both in
> dump_hedaer() and out_of_memory(). I actually was under impression
> that Michal added rate-limiting to both of these functions.

Honestly speaking, I wonder whether the rate-limit approach helps.
Using a rate-limit is a workaround at best.

The fundamental problem here is that we are doing

  while (!fatal_signal_pending(current) && !(current->flags & PF_EXITING)) {
      pr_warn("Help me! I can't kill somebody...\n");
      cond_resched();
  }

when we are under memcg OOM situation and we can't OOM-kill some process
(i.e. we can make no progress). No matter how we throttle pr_warn(), this
will keep consuming CPU resources until the memcg OOM situation is solved.

We call panic() if this is global OOM situation. But for memcg OOM situation,
we do nothing and just hope that the memcg OOM situation is solved shortly.

Until commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when
no eligible victim left."), we used WARN(1) to report that we are in a bad
situation. And since syzbot happened to hit this WARN(1) with panic_on_warn == 1,
that commit removed this WARN(1) and instead added dump_header() and pr_warn().
And then since syzbot happened to hit RCU stalls at dump_header() added by
that commit, we are trying to mitigate this problem.

But what happens if the threads hitting this path run at SCHED_RT
priority and deprive threads not hitting this path (e.g. the
administrator's console session) of all CPU resources needed for the
recovery operation? We might succeed in reducing the frequency of
messing up the console screens with printk(), but we might fail to solve
this memcg OOM situation after all...

> 
> The appropriate rate-limit value looks like something that printk()
> should know and be able to tell to the rest of the kernel. I don't
> think that middle ground will ever be found elsewhere.
> 
> 
> printk() knows what consoles are registered, and printk() also knows
> (sometimes) what console="..." options the kernel was provided with.
> If baud rates ware not provided as console= options, then serial
> consoles usually use some default value. We can probably ask consoles.
> 
> So *maybe* we can do something like this
> 
> //
> // WARNING: this is just a sketch. A silly idea.
> //          I don't know if we can make it usable.
> //
> 
> ---
> 
> int printk_ratelimit_interval(void)
> {
>        int ret = DEFAULT_RATELIMIT_INTERVAL;
>        struct tty_driver *driver = NULL;
>        speed_t min_baud = MAX_INT;
> 
>        console_lock();
>        for_each_console(c) {
>                speed_t br;
> 
>                if (!c->device)
>                        continue;
>                if (!(c->flags & CON_ENABLED))
>                        continue;
>                if (!c->write)
>                        continue;
>                driver = c->device(c, index);
>                if (!driver)
>                        continue;
> 
>                br = tty_get_baud_rate(tty_driver to tty_struct [???]);
>                min_baud = min(min_baud, br);
>        }
>        console_unlock();
> 
>        switch (min_baud) {
>        case 115200:
>                return ret;
> 
>        case ...blah blah...:
>                return ret * 2;
> 
>        case 9600:
>                return ret * 4;
>        }
>        return ret;
> }

I don't think that baud rate is relevant. Writing to console messes up
operations by console users. What matters is that we don't mess up consoles
to the level (or frequency) where console users cannot do their operations.
That is, what matters is the interval between the last moment we wrote
to a console and the first moment we will write to a console the next
time. Roughly speaking: remember the time stamp of the last
call_console_drivers() call, and compare against that stamp before
trying to call a sort of ratelimited printk(). My patch is doing that
using per call-site stamp recording.
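
For illustration only, the stamp-based throttle Tetsuo describes can be
modeled in userspace C. The names (oom_stall_print(), last_flushed) and
the fixed 60-second window are illustrative, not the patch's actual
code; the point is that the window is measured from when the previous
burst of messages *finished*, unlike a plain __ratelimit() whose window
begins at the first message:

```c
#include <stdbool.h>

/*
 * Userspace sketch of the per call-site stamp idea described above.
 * All names here are illustrative, not from the actual patch.
 */
#define OOM_WARN_INTERVAL 60	/* seconds; corresponds to 60 * HZ */

static long last_flushed;	/* 0 means "never printed yet" */

static bool oom_stall_print(long now_sec)
{
	if (last_flushed && now_sec - last_flushed < OOM_WARN_INTERVAL)
		return false;	/* console users get quiet time */
	/* ... dump_header() and pr_warn() would run here ... */
	last_flushed = now_sec;	/* refreshed AFTER the messages went out */
	return true;
}
```

Because the stamp is updated after printing returns, the quiet period
for console users is guaranteed regardless of how long the dump itself
took to flush.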
Sergey Senozhatsky Oct. 18, 2018, 4:27 a.m. UTC | #5
On (10/18/18 11:46), Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> > 
> > int printk_ratelimit_interval(void)
> > {
> >        int ret = DEFAULT_RATELIMIT_INTERVAL;
> >        struct tty_driver *driver = NULL;
> >        speed_t min_baud = MAX_INT;
> > 
> >        console_lock();
> >        for_each_console(c) {
> >                speed_t br;
> > 
> >                if (!c->device)
> >                        continue;
> >                if (!(c->flags & CON_ENABLED))
> >                        continue;
> >                if (!c->write)
> >                        continue;
> >                driver = c->device(c, index);
> >                if (!driver)
> >                        continue;
> > 
> >                br = tty_get_baud_rate(tty_driver to tty_struct [???]);
> >                min_baud = min(min_baud, br);
> >        }
> >        console_unlock();
> > 
> >        switch (min_baud) {
> >        case 115200:
> >                return ret;
> > 
> >        case ...blah blah...:
> >                return ret * 2;
> > 
> >        case 9600:
> >                return ret * 4;
> >        }
> >        return ret;
> > }
> 
> I don't think that baud rate is relevant. Writing to console messes up
> operations by console users. What matters is that we don't mess up consoles
> to the level (or frequency) where console users cannot do their operations.
> That is, interval between the last moment we wrote to a console and the
> first moment we will write to a console for the next time matters. Roughly
> speaking, remember the time stamp when we called call_console_drivers() for
> the last time, and compare with that stamp before trying to call a sort of
> ratelimited printk(). My patch is doing it using per call-site stamp recording.

To my personal taste, "baud rate of registered and enabled consoles"
approach is drastically more relevant than hard coded 10 * HZ or
60 * HZ magic numbers... But not in the form of that "min baud rate"
brain fart, which I have posted.

What I'd do:

-- Iterate over all registered and enabled serial consoles
-- Sum up all the baud rates
-- Calculate (*roughly*) how many bytes per second/minute/etc my
   call_console_driver() can push

        -- we actually don't even have to iterate all consoles. Just
	   add a baud rate in register_console() and sub baud rate
	   in unregister_console() of each console individually
	-- and have a static unsigned long in printk.c, which OOM
	   can use in rate-limit interval check

-- Leave all the noise behind: e.g. console_sem can be locked by
   a preempted fbcon, etc. It's out of my control; bad luck, there
   is nothing I can do about it.
-- Then I would, probably, take the most recent, say, 100 OOM
   reports, and calculate the *average* strlen() value
   (including \r and \n at the end of each line)

	(strlen(oom_report1) + ... + strlen(oom_report100)) / 100

   Then I'd try to reach an agreement with MM people that we will
   use this "average" oom_report_strlen() in ratelimit interval
   calculation. Yes, some reports will be longer, some shorter.

	Say, suppose...
	
	I have 2 consoles, and I can write 250 bytes per second.
	And average oom_report is 5000 bytes.
	Then I can print one oom_report every (5000 / 250) seconds
	in the *best* case. That's the optimistic baseline. There
	can be printk()-s from other CPUs, etc. etc. No one can
	predict those things.

	Note, how things change when I have just 1 console enabled.

	I have 1 console, and I can write 500 bytes per second.
	And average oom_report is 5000 bytes.
	Then I can print one oom_report every (5000 / 500) seconds
	in the *best* case.
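
The arithmetic above can be sanity-checked with a small hypothetical
helper (the 250 bytes/s and 5000-byte figures are Sergey's examples,
not measurements; a serial line at N baud moves roughly N/10 bytes per
second, since each byte costs about 10 bits on the wire):

```c
/*
 * Best-case seconds between OOM reports, given the summed baud rate of
 * all registered serial consoles and an average report size in bytes.
 * Purely the back-of-the-envelope model from the mail above; a real
 * implementation would live in printk.c and maintain the sum at
 * register_console()/unregister_console() time.
 */
static unsigned int best_case_report_interval(unsigned int sum_baud,
					      unsigned int avg_report_bytes)
{
	/* ~10 bits on the wire per byte: 8 data + start + stop */
	unsigned int bytes_per_sec = sum_baud / 10;

	return avg_report_bytes / bytes_per_sec;
}
```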

Just my $0.02. Who knows, may be it's dumb and ugly.
I don't have a dog in this fight.

	-ss
Tetsuo Handa Oct. 18, 2018, 5:26 a.m. UTC | #6
Sergey Senozhatsky wrote:
> To my personal taste, "baud rate of registered and enabled consoles"
> approach is drastically more relevant than hard coded 10 * HZ or
> 60 * HZ magic numbers... But not in the form of that "min baud rate"
> brain fart, which I have posted.

I'm saying that my 60 * HZ is "the duration for which the OOM killer
keeps refraining from calling printk()". Such a period is required to
allow console users to do their operations without being disturbed by
the OOM killer.

Even if you succeeded in calculating the average speed at which the OOM
killer messages are flushed to the consoles, printing the OOM killer
messages at that speed would still keep the console users unable to do
their operations.

Please do not print the OOM killer messages when the OOM killer cannot make
progress... It just disturbs console users.
Sergey Senozhatsky Oct. 18, 2018, 6:10 a.m. UTC | #7
On (10/18/18 14:26), Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> > To my personal taste, "baud rate of registered and enabled consoles"
> > approach is drastically more relevant than hard coded 10 * HZ or
> > 60 * HZ magic numbers... But not in the form of that "min baud rate"
> > brain fart, which I have posted.
> 
> I'm saying that my 60 * HZ is "duration which the OOM killer keeps refraining
>  from calling printk()". Such period is required for allowing console users
> to do their operations without being disturbed by the OOM killer.
> 

Got you. I'm probably not paying enough attention to this discussion.
You start your commit message with "RCU stalls" and end with a completely
different problem, "admin interaction". I skipped the last part of the
commit message.

OK. That makes sense if any user intervention/interaction actually happens.
I'm not sure that someone at Facebook or Google logs in to every server
that is under OOM to do something critically important there. Netconsole
logs and postmortem analysis, *perhaps*, would be their choice. I believe
it was Johannes who said that his net console is capable of keeping up
with the traffic and that 60 * HZ is too long for him. So I can see why
people might not be happy with your patch. I don't think that 60 * HZ
enforcement will go anywhere.

Now, if your problem is
     "I'm actually logged in, and want to do something
      sane, how do I stop this OOM report flood because
      it wipes out everything I have on my console?"

then let's formulate it as
     "I'm actually logged in, and want to do something
      sane, how do I stop this OOM report flood because
      it wipes out everything I have on my console?"

and let's hear from MM people what they can suggest.

Michal, Andrew, Johannes, any thoughts?

For instance,
   change /proc/sys/kernel/printk and suppress most of the warnings?

   // not only OOM but possibly other printk()-s that can come from
   // different CPUs

If your problem is "syzbot hits RCU stalls" then let's have a baud rate
based ratelimiting; I think we can get more or less reasonable timeout
values.

	-ss
Michal Hocko Oct. 18, 2018, 6:55 a.m. UTC | #8
On Thu 18-10-18 11:46:50, Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> > On (10/17/18 12:28), Michal Hocko wrote:
> > > > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > > > that patch is appropriate because that patch does not ratelimit
> > > > 
> > > >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> > > >   "Out of memory and no killable processes...\n"
> > [..]
> > > > Let's make sure that next dump_header() waits for at least 60 seconds from
> > > > previous "Out of memory and no killable processes..." message.
> > > 
> > > Could you explain why this is any better than using a well established
> > > ratelimit approach?
> 
> This is essentially a ratelimit approach, roughly equivalent with:
> 
>   static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
>   oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;
> 
>   if (__ratelimit(&oom_no_victim_rs)) {
>     dump_header(oc, NULL);
>     pr_warn("Out of memory and no killable processes...\n");
>     oom_no_victim_rs.begin = jiffies;
>   }

Then there is no reason to reinvent the wheel. So use the standard
ratelimit approach. Or to put it in other words, this place is no more
special than any other that needs some sort of printk throttling. We
surely do not want ad-hoc solutions all over the kernel.

And once you realize that the ratelimit api is the proper one (put aside
any potential improvements in the implementation of this api) then you
quickly learn that we already do throttle oom reports and it would be
nice to unify that and ... we are back to a naked patch. So please stop
being stubborn and try to cooperate finally.
Michal Hocko Oct. 18, 2018, 7:56 a.m. UTC | #9
On Thu 18-10-18 15:10:18, Sergey Senozhatsky wrote:
[...]
> and let's hear from MM people what they can suggest.
> 
> Michal, Andrew, Johannes, any thoughts?

I have already stated my position. Let's not reinvent the wheel and use
the standard printk throttling. If there are cases where oom reports
cause more harm than good I am open to add a knob to allow disabling it
altogether (it can be even fine grained one to control whether to dump
show_mem, task_list etc.).

But please let's stop this dubious one-off approaches.
Sergey Senozhatsky Oct. 18, 2018, 8:13 a.m. UTC | #10
On (10/18/18 09:56), Michal Hocko wrote:
> On Thu 18-10-18 15:10:18, Sergey Senozhatsky wrote:
> [...]
> > and let's hear from MM people what they can suggest.
> > 
> > Michal, Andrew, Johannes, any thoughts?
> 
> I have already stated my position. Let's not reinvent the wheel and use
> the standard printk throttling. If there are cases where oom reports
> cause more harm than good I am open to add a knob to allow disabling it
> altogether (it can be even fine grained one to control whether to dump
> show_mem, task_list etc.).

A knob might do.
As well as /proc/sys/kernel/printk tweaks, probably. One can even add
echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust printk
console levels on login, then roll back to the old values in
.bash_logout. Maybe.

> But please let's stop this dubious one-off approaches.

OK. Well, I'm not proposing anything actually. I didn't even
realize until recently that Tetsuo was talking about "user
interaction" problem; I thought that his problem was stalled
RCU.

	-ss
Tetsuo Handa Oct. 18, 2018, 10:37 a.m. UTC | #11
On 2018/10/18 15:55, Michal Hocko wrote:
> On Thu 18-10-18 11:46:50, Tetsuo Handa wrote:
>> This is essentially a ratelimit approach, roughly equivalent with:
>>
>>   static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
>>   oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;
>>
>>   if (__ratelimit(&oom_no_victim_rs)) {
>>     dump_header(oc, NULL);
>>     pr_warn("Out of memory and no killable processes...\n");
>>     oom_no_victim_rs.begin = jiffies;
>>   }
> 
> Then there is no reason to reinvent the wheel. So use the standard
> ratelimit approach. Or put it in other words, this place is no special
> to any other that needs some sort of printk throttling. We surely do not
> want an ad-hoc solutions all over the kernel.

netdev_wait_allrefs() in net/core/dev.c is doing the same thing. Since
out_of_memory() is serialized by oom_lock mutex, there is no need to use
"struct ratelimit_state"->lock field. Plain "unsigned long" is enough.
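
The open-coded pattern being referenced (netdev_wait_allrefs() does it
with a 10 * HZ window) boils down to roughly the following userspace
model. HZ and time_after() are mocked here; the kernel versions live in
<linux/jiffies.h>. The plain timestamp is safe precisely because the
caller is serialized, by oom_lock in the OOM case:

```c
#include <stdbool.h>

/*
 * Userspace model of the plain-timestamp throttle used by
 * netdev_wait_allrefs(): no ratelimit_state spinlock is needed
 * because the caller is already serialized.
 */
#define HZ 100
#define time_after(a, b) ((long)((b) - (a)) < 0)

static unsigned long warning_time;	/* 0 means "never warned yet" */

static bool should_warn(unsigned long jiffies_now)
{
	if (warning_time && !time_after(jiffies_now, warning_time + 60 * HZ))
		return false;
	warning_time = jiffies_now;	/* plain store; caller holds the lock */
	return true;
}
```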

> 
> And once you realize that the ratelimit api is the proper one (put aside
> any potential improvements in the implementation of this api) then you
> quickly learn that we already do throttle oom reports and it would be
> nice to unify that and ... we are back to a naked patch. So please stop
> being stuborn and try to cooperate finally.

I don't think that ratelimit API is the proper one, for I am touching
"struct ratelimit_state"->begin field which is not exported by ratelimit API.
But if you insist on ratelimit API version, I can tolerate with below one.

 mm/oom_kill.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..7c6118e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,12 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		static DEFINE_RATELIMIT_STATE(no_eligible_rs, 60 * HZ, 1);
+
+		ratelimit_set_flags(&no_eligible_rs, RATELIMIT_MSG_ON_RELEASE);
+		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+		    !__ratelimit(&no_eligible_rs))
+			return false;
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1121,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		no_eligible_rs.begin = jiffies;
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
Michal Hocko Oct. 18, 2018, 11:23 a.m. UTC | #12
On Thu 18-10-18 19:37:18, Tetsuo Handa wrote:
> On 2018/10/18 15:55, Michal Hocko wrote:
> > On Thu 18-10-18 11:46:50, Tetsuo Handa wrote:
> >> This is essentially a ratelimit approach, roughly equivalent with:
> >>
> >>   static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
> >>   oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;
> >>
> >>   if (__ratelimit(&oom_no_victim_rs)) {
> >>     dump_header(oc, NULL);
> >>     pr_warn("Out of memory and no killable processes...\n");
> >>     oom_no_victim_rs.begin = jiffies;
> >>   }
> > 
> > Then there is no reason to reinvent the wheel. So use the standard
> > ratelimit approach. Or put it in other words, this place is no special
> > to any other that needs some sort of printk throttling. We surely do not
> > want an ad-hoc solutions all over the kernel.
> 
> netdev_wait_allrefs() in net/core/dev.c is doing the same thing. Since
> out_of_memory() is serialized by oom_lock mutex, there is no need to use
> "struct ratelimit_state"->lock field. Plain "unsigned long" is enough.

That code probably predates generalized ratelimit api.

> > And once you realize that the ratelimit api is the proper one (put aside
> > any potential improvements in the implementation of this api) then you
> > quickly learn that we already do throttle oom reports and it would be
> > nice to unify that and ... we are back to a naked patch. So please stop
> > being stuborn and try to cooperate finally.
> 
> I don't think that ratelimit API is the proper one, for I am touching
> "struct ratelimit_state"->begin field which is not exported by ratelimit API.
> But if you insist on ratelimit API version, I can tolerate with below one.

I just give up. I do not really see why you always have to make the code
more complex than necessary and squash different things together. This
is a complete kernel code development antipattern.

I am not going to reply to this thread anymore, but let me note that
this is beyond fun in any aspect I can think of (and yeah, I have
considered dark sense of humor as well).

> 
>  mm/oom_kill.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f10aa53..7c6118e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1106,6 +1106,12 @@ bool out_of_memory(struct oom_control *oc)
>  	select_bad_process(oc);
>  	/* Found nothing?!?! */
>  	if (!oc->chosen) {
> +		static DEFINE_RATELIMIT_STATE(no_eligible_rs, 60 * HZ, 1);
> +
> +		ratelimit_set_flags(&no_eligible_rs, RATELIMIT_MSG_ON_RELEASE);
> +		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
> +		    !__ratelimit(&no_eligible_rs))
> +			return false;
>  		dump_header(oc, NULL);
>  		pr_warn("Out of memory and no killable processes...\n");
>  		/*
> @@ -1115,6 +1121,7 @@ bool out_of_memory(struct oom_control *oc)
>  		 */
>  		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
>  			panic("System is deadlocked on memory\n");
> +		no_eligible_rs.begin = jiffies;
>  	}
>  	if (oc->chosen && oc->chosen != (void *)-1UL)
>  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> -- 
> 1.8.3.1
Tetsuo Handa Oct. 18, 2018, 11:58 a.m. UTC | #13
On 2018/10/18 17:13, Sergey Senozhatsky wrote:
> On (10/18/18 09:56), Michal Hocko wrote:
>> On Thu 18-10-18 15:10:18, Sergey Senozhatsky wrote:
>> [...]
>>> and let's hear from MM people what they can suggest.
>>>
>>> Michal, Andrew, Johannes, any thoughts?
>>
>> I have already stated my position. Let's not reinvent the wheel and use
>> the standard printk throttling. If there are cases where oom reports
>> cause more harm than good I am open to add a knob to allow disabling it
>> altogether (it can be even fine grained one to control whether to dump
>> show_mem, task_list etc.).

Moderate OOM reports while making progress are good.
But OOM reports without making progress are bad.

> 
> A knob might do.
> As well as /proc/sys/kernel/printk tweaks, probably. One can even add
> echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust printk
> console levels on login and rollback to old values in .bash_logout
> May be.

That can work only for the single-login-as-root-user case.
Not everyone logs into the console as the root user.

It is a pity that we can't send kernel messages to only selected
consoles (e.g. all messages are sent to netconsole, but only critical
messages are sent to local consoles).

> 
>> But please let's stop this dubious one-off approaches.
> 
> OK. Well, I'm not proposing anything actually. I didn't even
> realize until recently that Tetsuo was talking about "user
> interaction" problem; I thought that his problem was stalled
> RCU.

The "stalled RCU" was the trigger for considering this problem.
Nobody has seriously considered what we should do when the memcg OOM killer
cannot make progress. If the OOM killer cannot make progress, we need to
handle situations where the OOM-unkillable process cannot solve the memcg OOM
situation. Then, the recovery method of last resort is that the root
user enters commands for recovery (it might be to increase the memory
limit, to move the process to a different cgroup, or to gracefully
terminate the OOM-unkillable process) from the "consoles" where the OOM
messages are sent.

If we start from the worst case, it is obvious that we need to make sure that
the OOM messages do not disturb "consoles" so that the root user can enter
commands for recovery. That boils down to a "user interaction" problem.
Not limiting

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

is very annoying.

And I really can't understand why Michal regards "handling this
requirement" as "making the code more complex than necessary and
squashing different things together".
Satisfying the most difficult error path handling is not a simple thing.
Petr Mladek Oct. 18, 2018, 2:30 p.m. UTC | #14
On Thu 2018-10-18 13:27:39, Sergey Senozhatsky wrote:
> On (10/18/18 11:46), Tetsuo Handa wrote:
> > Sergey Senozhatsky wrote:
> > > 
> > > int printk_ratelimit_interval(void)
> > > {
> > >        int ret = DEFAULT_RATELIMIT_INTERVAL;
> > >        struct tty_driver *driver = NULL;
> > >        speed_t min_baud = MAX_INT;
> > > 
> > >        console_lock();
> > >        for_each_console(c) {
> > >                speed_t br;
> > > 
> > >                if (!c->device)
> > >                        continue;
> > >                if (!(c->flags & CON_ENABLED))
> > >                        continue;
> > >                if (!c->write)
> > >                        continue;
> > >                driver = c->device(c, index);
> > >                if (!driver)
> > >                        continue;
> > > 
> > >                br = tty_get_baud_rate(tty_driver to tty_struct [???]);
> > >                min_baud = min(min_baud, br);
> > >        }
> > >        console_unlock();
> > > 
> > >        switch (min_baud) {
> > >        case 115200:
> > >                return ret;
> > > 
> > >        case ...blah blah...:
> > >                return ret * 2;
> > > 
> > >        case 9600:
> > >                return ret * 4;
> > >        }
> > >        return ret;
> > > }
> > 
> > I don't think that baud rate is relevant. Writing to console messes up
> > operations by console users. What matters is that we don't mess up consoles
> > to the level (or frequency) where console users cannot do their operations.
> > That is, interval between the last moment we wrote to a console and the
> > first moment we will write to a console for the next time matters. Roughly
> > speaking, remember the time stamp when we called call_console_drivers() for
> > the last time, and compare with that stamp before trying to call a sort of
> > ratelimited printk(). My patch is doing it using per call-site stamp recording.
> 
> To my personal taste, "baud rate of registered and enabled consoles"
> approach is drastically more relevant than hard coded 10 * HZ or
> 60 * HZ magic numbers... But not in the form of that "min baud rate"
> brain fart, which I have posted.
> 
> What I'd do:
> 
> -- Iterate over all registered and enabled serial consoles
> -- Sum up all the baud rates
> -- Calculate (*roughly*) how many bytes per second/minute/etc my
>    call_console_driver() can push
> 
>         -- we actually don't even have to iterate all consoles. Just
> 	   add a baud rate in register_console() and sub baud rate
> 	   in unregister_console() of each console individually
> 	-- and have a static unsigned long in printk.c, which OOM
> 	   can use in rate-limit interval check
> 
> -- Leave all the noise behind: e.g. console_sem can be locked by
>    a preempted fbcon, etc. It's out of my control; bad luck, there
>    is nothing I can do about it.
> -- Then I would, probably, take the most recent, say, 100 OOM
>    reports, and calculate the *average* strlen() value
>    (including \r and \n at the end of each line)
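For reference, the scheme quoted above (a running sum of console baud rates,
maintained at console (un)registration and turned into a ratelimit interval)
could be sketched in plain userspace C. All names here are invented for
illustration, not existing kernel API:

```c
/* Running sum of baud rates of registered, enabled serial consoles.
 * In the kernel this would be updated from register_console() and
 * unregister_console(); here it is just a global for illustration. */
static unsigned long total_baud;

static void console_registered(unsigned long baud)   { total_baud += baud; }
static void console_unregistered(unsigned long baud) { total_baud -= baud; }

/* With 8N1 framing a UART pushes roughly baud/10 bytes per second.
 * Return how many seconds it takes to flush one average-sized OOM
 * report, with a 5-second floor (DEFAULT_RATELIMIT_INTERVAL is 5*HZ). */
static unsigned long ratelimit_interval_sec(unsigned long avg_report_bytes)
{
    unsigned long bytes_per_sec, sec;

    if (!total_baud)
        return 5;
    bytes_per_sec = total_baud / 10;
    sec = (avg_report_bytes + bytes_per_sec - 1) / bytes_per_sec;
    return sec < 5 ? 5 : sec;
}
```

A single 115200-baud console then yields about 65 seconds for a 730 KB
report, while a 9600-baud console yields roughly 13 minutes, which
illustrates why a hard-coded 10 * HZ or 60 * HZ interval can be
simultaneously too long for one setup and too short for another.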

This looks very complex and I see even more problems:

  + You would need to update the rate limit intervals when
    new console is attached. Note that the ratelimits might
    get initialized early during boot. It might be solvable
    but ...

  + You might need to update the length of the message
    when the text (code) is updated. It might be hard
    to maintain.

  + You would need to take into account also console_level
    and the level of the printed messages

  + This approach does not take into account all other
    messages that might be printed by other subsystems.


I have just talked with Michal in person. He pointed out
that we primarily need to know when and if the last printed
message already reached the console.

A solution might be to handle printk and ratelimit together.
For example:

   + store log_next_idx of the printed message into
     the ratelimit structure

   + eventually store pointer of the ratelimit structure
     into printk_log

   + eventually store the timestamp when the message
     reached the console into the ratelimit structure

Then the ratelimited printk might take into account whether
the previous message already reached the console and even when.
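As a rough sketch (hypothetical structure and field names, not an existing
kernel API), the bookkeeping described above might look like:

```c
#include <stdbool.h>

/* Ratelimit state extended with printk bookkeeping: which logbuf record
 * the previous message became (log_next_idx at the time), and when that
 * record reached the console (0 while it is still queued). */
struct ratelimit_console_state {
    unsigned long printed_log_idx;
    unsigned long console_timestamp;
};

/* Only print again if the previous copy already reached the console and
 * at least min_gap time units have passed since it did. */
static bool should_print_again(const struct ratelimit_console_state *rs,
                               unsigned long now, unsigned long min_gap)
{
    if (!rs->console_timestamp)
        return false;
    return now - rs->console_timestamp >= min_gap;
}
```

The ratelimited printk would consult should_print_again() instead of a fixed
interval, so a message is repeated only once its previous copy has actually
hit the console.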


Well, this is still rather complex. I am not persuaded that
it is worth it.

I suggest taking a breath and stopping the search for a perfect
solution for a bit. The existing ratelimit might be perfectly fine
in practice. You could always create a stress test that would
fail, but such a test might be far from reality. Any complex
solution might bring more problems than it solves.

A console full of repeated messages need not be a catastrophe
when it helps to fix the problem and the system is unusable
and needs a reboot anyway.

Best Regards,
Petr
Sergey Senozhatsky Oct. 18, 2018, 11:54 p.m. UTC | #15
On (10/18/18 20:58), Tetsuo Handa wrote:
> > 
> > A knob might do.
> > As well as /proc/sys/kernel/printk tweaks, probably. One can even add
> > echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust printk
> > console levels on login and rollback to old values in .bash_logout
> > May be.
> 
> That can work for only single login with root user case.
> Not everyone logs into console as root user.

Add sudo ;)

> It is a pity that we can't send kernel messages to only selected consoles
> (e.g. all messages are sent to netconsole, but only critical messages are
> sent to local consoles).

OK, that's a fair point. There was a patch from FB, which would allow us
to set a log_level on per-console basis. So the noise goes to heav^W net
console; only critical stuff goes to the serial console (if I recall it
correctly). I'm not sure what happened to that patch, it was a while ago.
I'll try to find that out.

[..]
> That boils down to a "user interaction" problem.
> Not limiting
> 
>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>   "Out of memory and no killable processes...\n"
> 
> is very annoying.
> 
> And I really can't understand why Michal thinks "handling this requirement" as
> "make the code more complex than necessary and squash different things together".

Michal is trying very hard to address the problem in a reasonable way.
The problem you are talking about is not MM specific. You can have a
faulty SCSI device, a corrupted FS, and so on.

	-ss
Tetsuo Handa Oct. 19, 2018, 12:18 a.m. UTC | #16
Petr Mladek wrote:
> This looks very complex and I see even more problems:
> 
>   + You would need to update the rate limit intervals when
>     new console is attached. Note that the ratelimits might
>     get initialized early during boot. It might be solvable
>     but ...
> 
>   + You might need to update the length of the message
>     when the text (code) is updated. It might be hard
>     to maintain.

I assumed we would calculate the average dynamically, because the amount of
messages printed by an OOM event is highly variable (it depends on hardware
configuration such as the number of nodes, the number of zones, and how many
processes there are as candidates for the OOM victim).

> 
>   + You would need to take into account also console_level
>     and the level of the printed messages

Isn't that accounted for by call_console_drivers()?

> 
>   + This approach does not take into account all other
>     messages that might be printed by other subsystems.

Yes. And I wonder whether unlimited printk() alone is the cause of RCU
stalls. I think that printk() is serving as an amplifier for any CPU users.
That is, the average speed might not be safe enough to guarantee that RCU
stalls won't occur. Then, there is no safe average value we can use.

> 
> 
> I have just talked with Michal in person. He pointed out
> that we primary need to know when and if the last printed
> message already reached the console.

I think we can estimate when call_console_drivers() started/completed for
the last time as when and if the last printed message already reached the
console. Sometimes callers might append to the logbuf without waiting for
completion of call_console_drivers(), but the system is already unusable
if the majority of ratelimited printk() users hit that race window.
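The estimate described above can be modeled in userspace (invented names;
the kernel side would update the stamp from the console-writing path):

```c
#include <stdbool.h>

/* Timestamp (a stand-in for jiffies) of the most recent analogue of
 * call_console_drivers(); the console-writing path would update it. */
static unsigned long last_console_write;

static void console_written(unsigned long now)
{
    last_console_write = now;
}

/* A ratelimited caller prints only if the console has been quiet for
 * at least min_quiet time units since we last wrote to it. */
static bool ratelimited_print_allowed(unsigned long now,
                                      unsigned long min_quiet)
{
    return now - last_console_write >= min_quiet;
}
```

The point is that the interval is measured from the last console write, not
from the last ratelimited call site, so a busy console automatically pushes
the next OOM header further out.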

> 
> A solution might be to handle printk and ratelimit together.
> For example:
> 
>    + store log_next_idx of the printed message into
>      the ratelimit structure
> 
>    + eventually store pointer of the ratelimit structure
>      into printk_log
> 
>    + eventually store the timestamp when the message
>      reached the console into the ratelimit structure
> 
> Then the ratelimited printk might take into acount whether
> the previous message already reached console and even when.

If printk() becomes asynchronous (e.g. a printk() kernel thread), we would
need to use something like synchronize_srcu() so that the caller waits for
completion of only the messages the caller wants to wait for.

> 
> 
> Well, this is still rather complex. I am not persuaded that
> it is worth it.
> 
> I suggest to take a breath, stop looking for a perfect solution
> for a bit. The existing ratelimit might be perfectly fine
> in practice. You might always create stress test that would
> fail but the test might be far from reality. Any complex
> solution might bring more problems than it solves.
> 
> Console full of repeated messages need not be a catastrophe
> when it helps to fix the problem and the system is usable
> and need a reboot anyway.

I wish that memcg OOM events did not use printk(). Since a memcg OOM is not
an out-of-physical-memory condition, we can dynamically allocate physical
memory for holding the memcg OOM messages and let userspace poll them via
some interface.
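A minimal sketch of that wish (entirely hypothetical; no such kernel
interface exists today): each memcg would own a dynamically allocated ring
buffer that collects its OOM messages, and userspace would poll it through
some file instead of the messages going to the console.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-memcg OOM message buffer: a byte ring that old
 * messages silently fall out of, so it can never flood anything. */
struct memcg_oom_log {
    char *buf;
    size_t size;
    size_t head; /* next write position, wraps around */
};

static int oom_log_init(struct memcg_oom_log *log, size_t size)
{
    log->buf = calloc(1, size);
    if (!log->buf)
        return -1;
    log->size = size;
    log->head = 0;
    return 0;
}

static void oom_log_write(struct memcg_oom_log *log, const char *msg)
{
    for (const char *p = msg; *p; p++) {
        log->buf[log->head] = *p;
        log->head = (log->head + 1) % log->size;
    }
}
```

Because the buffer is bounded and per-memcg, a misconfigured cgroup can only
overwrite its own history rather than starving console users.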
Tetsuo Handa Oct. 19, 2018, 10:35 a.m. UTC | #17
On 2018/10/19 8:54, Sergey Senozhatsky wrote:
> On (10/18/18 20:58), Tetsuo Handa wrote:
>>>
>>> A knob might do.
>>> As well as /proc/sys/kernel/printk tweaks, probably. One can even add
>>> echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust printk
>>> console levels on login and rollback to old values in .bash_logout
>>> May be.
>>
>> That can work for only single login with root user case.
>> Not everyone logs into console as root user.
> 
> Add sudo ;)

That will not work. ;-) As long as the console loglevel setting is
system wide, we can't allow multiple login sessions.

> 
>> It is a pity that we can't send kernel messages to only selected consoles
>> (e.g. all messages are sent to netconsole, but only critical messages are
>> sent to local consoles).
> 
> OK, that's a fair point. There was a patch from FB, which would allow us
> to set a log_level on per-console basis. So the noise goes to heav^W net
> console; only critical stuff goes to the serial console (if I recall it
> correctly). I'm not sure what happened to that patch, it was a while ago.
> I'll try to find that out.

A per-console loglevel setting would help in several environments.
But the syzbot environment cannot count on netconsole. We can't expect
that unlimited printk() will become safe.

> 
> [..]
>> That boils down to a "user interaction" problem.
>> Not limiting
>>
>>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>>   "Out of memory and no killable processes...\n"
>>
>> is very annoying.
>>
>> And I really can't understand why Michal thinks "handling this requirement" as
>> "make the code more complex than necessary and squash different things together".
> 
> Michal is trying very hard to address the problem in a reasonable way.

OK. But Michal, do we have a reasonable way which can be applied now instead of
my patch or one of the patches below? Just enumerating words like "hackish" or
"a mess" without YOU ACTUALLY PROPOSING PATCHES will bounce back to YOU.

> The problem you are talking about is not MM specific. You can have a
> faulty SCSI device, a corrupted FS, and so on.

"A faulty SCSI device, corrupted FS, and so on" are reports of problems
where the request will still complete. They can use (and are using)
ratelimit, can't they?

"A memcg OOM with no eligible task" is reporting a problem where the
request cannot complete. But it can use ratelimit as well.

But for the problem that already-OOM-killed threads trigger "a memcg OOM
with no eligible task", we have an immediately applicable mitigation in
one of the patches below.



From 0a533d15949eac25f5ce7ce6e53f5830608f08e7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 19 Oct 2018 15:52:56 +0900
Subject: [PATCH v2] mm, oom: OOM victims do not need to select next OOM victim unless __GFP_NOFAIL.

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed to select next OOM victim as soon as
MMF_OOM_SKIP is set, a memcg OOM event from a user process can generate
220+ times (12400+ lines / 730+ KB) of OOM-killer messages with
"Out of memory and no killable processes..." (i.e. no progress) due to
a race window.

This patch completely closes that race window by making out_of_memory()
a no-op when called by OOM victims, because OOM victims do not retry
forever (unless __GFP_NOFAIL).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..0e8d20b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1058,6 +1058,9 @@ bool out_of_memory(struct oom_control *oc)
 	if (oom_killer_disabled)
 		return false;
 
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (!is_memcg_oom(oc)) {
 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 		if (freed > 0)
Sergey Senozhatsky Oct. 23, 2018, 12:47 a.m. UTC | #18
On (10/19/18 19:35), Tetsuo Handa wrote:
> > 
> > OK, that's a fair point. There was a patch from FB, which would allow us
> > to set a log_level on per-console basis. So the noise goes to heav^W net
> > console; only critical stuff goes to the serial console (if I recall it
> > correctly). I'm not sure what happened to that patch, it was a while ago.
> > I'll try to find that out.
> 
> Per a console loglevel setting would help for several environments.
> But syzbot environment cannot count on netconsole. We can't expect that
> unlimited printk() will become safe.

This target is moving too fast :) RCU stall -> user interaction -> syzbot

I talked to Calvin Owens (who's working on the per-console loglevel
patch set; CC-ed) and Calvin said that "It's in-progress". So we probably
will have this functionality one day. That's all we can do from printk
side wrt user-interaction problem.

> > The problem you are talking about is not MM specific. You can have a
> > faulty SCSI device, a corrupted FS, and so on.
> 
> "a faulty SCSI device, corrupted FS, and so and on" are reporting problems
> which will complete a request. They can use (and are using) ratelimit,
> aren't they?

Looking at scsi_request_fn(), the answer is probably "sure they can;
but no, they aren't". In the majority of cases the reason we replace
printk() with printk_ratelimit() is that someone reports a stall or a
lockup. Otherwise, people use printk(), which is absolutely fine.

	-ss
Petr Mladek Oct. 23, 2018, 8:21 a.m. UTC | #19
On Fri 2018-10-19 09:18:16, Tetsuo Handa wrote:
> Petr Mladek wrote:
> > This looks very complex and I see even more problems:
> > 
> >   + You would need to update the rate limit intervals when
> >     new console is attached. Note that the ratelimits might
> >     get initialized early during boot. It might be solvable
> >     but ...
> > 
> >   + You might need to update the length of the message
> >     when the text (code) is updated. It might be hard
> >     to maintain.
> 
> I assumed we calculate the average dynamically, for the amount of
> messages printed by an OOM event is highly unstable (depends on
> hardware configuration such as number of nodes, number of zones,
> and how many processes are there as a candidate for OOM victim).

Is there any idea how the average length could be computed dynamically?

Note that ____ratelimit() currently does not get any
information from printk/console. It would need to be lockless.
We do not want to complicate printk() with even more locks.


> > 
> >   + This approach does not take into account all other
> >     messages that might be printed by other subsystems.
> 
> Yes. And I wonder whether unlimited printk() alone is the cause of RCU
> stalls. I think that printk() is serving as an amplifier for any CPU users.
> That is, the average speed might not be safe enough to guarantee that RCU
> stalls won't occur. Then, there is no safe average value we can use.

This is why I suggested to avoid counting OOM messages and just check
if and when the last OOM message reached the console.


> > I have just talked with Michal in person. He pointed out
> > that we primary need to know when and if the last printed
> > message already reached the console.
> 
> I think we can estimate when call_console_drivers() started/completed for
> the last time as when and if the last printed message already reached the
> console. Sometimes callers might append to the logbuf without waiting for
> completion of call_console_drivers(), but the system is already unusable
> if majority of ratelimited printk() users hit that race window.

I am confused. We are talking about ratelimiting. We do not want to
wait for anything. The only question is whether it makes sense
to print the "same" message an Xth time when even the 1st message
has not reached the console yet.

This reminds me of another problem. We would need to use the same
decision for all printk() calls that logically belong to each
other. Otherwise we might get mixed lines that might confuse
people. I mean that OOM messages might look like:

  OOM: A
  OOM: B
  OOM: C

If we do not synchronize the ratelimiting, we might see:

  OOM: A
  OOM: B
  OOM: C
  OOM: B
  OOM: B
  OOM: A
  OOM: C
  OOM: C


> > A solution might be to handle printk and ratelimit together.
> > For example:
> > 
> >    + store log_next_idx of the printed message into
> >      the ratelimit structure
> > 
> >    + eventually store pointer of the ratelimit structure
> >      into printk_log
> > 
> >    + eventually store the timestamp when the message
> >      reached the console into the ratelimit structure
> > 
> > Then the ratelimited printk might take into account whether
> > the previous message already reached the console and even when.
> 
> If printk() becomes asynchronous (e.g. printk() kernel thread), we would
> need to use something like srcu_synchronize() so that the caller waits for
> only completion of messages the caller wants to wait.

I do not understand this. printk() must not block OOM progress.


> > Well, this is still rather complex. I am not persuaded that
> > it is worth it.
> > 
> > I suggest to take a breath, stop looking for a perfect solution
> > for a bit. The existing ratelimit might be perfectly fine
> > in practice. You might always create stress test that would
> > fail but the test might be far from reality. Any complex
> > solution might bring more problems than it solves.
> > 
> > Console full of repeated messages need not be a catastrophe
> > when it helps to fix the problem and the system is usable
> > and need a reboot anyway.
> 
> I wish that memcg OOM events do not use printk(). Since memcg OOM is not
> out of physical memory, we can dynamically allocate physical memory for
> holding memcg OOM messages and let the userspace poll it via some interface.

Would the userspace work when the system gets blocked on allocations?

Best Regards,
Petr
Petr Mladek Oct. 23, 2018, 8:37 a.m. UTC | #20
On Fri 2018-10-19 19:35:53, Tetsuo Handa wrote:
> On 2018/10/19 8:54, Sergey Senozhatsky wrote:
> > On (10/18/18 20:58), Tetsuo Handa wrote:
> >> That boils down to a "user interaction" problem.
> >> Not limiting
> >>
> >>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> >>   "Out of memory and no killable processes...\n"
> >>
> >> is very annoying.
> >>
> >> And I really can't understand why Michal thinks "handling this requirement" as
> >> "make the code more complex than necessary and squash different things together".
> > 
> > Michal is trying very hard to address the problem in a reasonable way.
> 
> OK. But Michal, do we have a reasonable way which can be applied now instead of
> my patch or one of below patches? Just enumerating words like "hackish" or "a mess"
> without YOU ACTUALLY PROPOSING PATCHES will bounce back to YOU.

Michal suggested using ratelimit, the standard solution.

My understanding is that this situation happens when the system is
misconfigured and unusable without manual intervention. If
the user is able to see what the problem is then we are good.

You talk about interactivity but who asked for this?
IMHO, if the system ends up in an OOM situation, it would need to be
restarted in most cases anyway. Then people have a chance
to fix the configuration after the reboot.

Best Regards,
Petr
Michal Hocko Oct. 23, 2018, 8:54 a.m. UTC | #21
[I strongly suspect this whole email thread went way beyond the scope
 the issue really deserves]

I didn't want to participate any further but let me clarify one thing
because I can see how the discussion could generate some confusion.

On Tue 23-10-18 10:37:38, Petr Mladek wrote:
[...]
> My understanding is that this situation happens when the system is
> misconfigured and unusable without manual intervention. If
> the user is able to see what the problem is then we are good.

Not really. The flood of _memcg_ OOM reports about no eligible tasks
should indeed happen only when the memcg is misconfigured. The system is,
and should be, still usable at this stage. The ratelimit is aimed at
reducing pointless messages which do not help much to debug the issue
itself. There is a race condition, as explained by Tetsuo, that could
lead to this situation even without a misconfiguration; that is clearly
a bug, something to deal with, and patches have been posted in that
regard [1].

The rest of the discussion is about how to handle printk rate-limiting
properly, whether an ad-hoc solution is more appropriate than the real
API we have in place, and whether the latter needs some enhancements.
That is completely orthogonal to the issue at hand and as such should
really be discussed separately.

[1] http://lkml.kernel.org/r/20181022071323.9550-1-mhocko@kernel.org
Tetsuo Handa Oct. 23, 2018, 10:23 a.m. UTC | #22
On 2018/10/23 17:21, Petr Mladek wrote:
> On Fri 2018-10-19 09:18:16, Tetsuo Handa wrote:
>> I assumed we calculate the average dynamically, for the amount of
>> messages printed by an OOM event is highly unstable (depends on
>> hardware configuration such as number of nodes, number of zones,
>> and how many processes are there as a candidate for OOM victim).
> 
> Is there any idea how the average length can be counted dynamically?

I don't have one. Maybe sum up return values of printk() from OOM context?
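For what it's worth, that could look something like the following userspace
sketch (invented names, not existing kernel code): each printk() issued in
OOM context reports its return value, and a finished report is folded into
a running average.

```c
/* Bytes emitted by the OOM report currently being printed, plus a
 * running average over completed reports (EWMA, newest weighted 1/4). */
static unsigned long avg_report_bytes;
static unsigned long cur_report_bytes;

/* Feed in the return value of each printk() made from OOM context. */
static void oom_printk_accounted(int printed)
{
    if (printed > 0)
        cur_report_bytes += printed;
}

/* Called once the OOM report is complete; fold it into the average. */
static void oom_report_done(void)
{
    if (!avg_report_bytes)
        avg_report_bytes = cur_report_bytes;
    else
        avg_report_bytes = (3 * avg_report_bytes + cur_report_bytes) / 4;
    cur_report_bytes = 0;
}

static unsigned long oom_avg_report_bytes(void)
{
    return avg_report_bytes;
}
```

An EWMA avoids storing past reports and naturally tracks changes in the
number of nodes, zones, and candidate tasks that make the report size
unstable.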



> This reminds me another problem. We would need to use the same
> decision for all printk() calls that logically belongs to each
> other. Otherwise we might get mixed lines that might confuse
> people. I mean that OOM messages might look like:
> 
>   OOM: A
>   OOM: B
>   OOM: C
> 
> If we do not synchronize the ratelimiting, we might see:
> 
>   OOM: A
>   OOM: B
>   OOM: C
>   OOM: B
>   OOM: B
>   OOM: A
>   OOM: C
>   OOM: C

Messages from out_of_memory() are serialized by oom_lock mutex.
Messages from warn_alloc() are not serialized, and thus cause confusion.



>> I wish that memcg OOM events do not use printk(). Since memcg OOM is not
>> out of physical memory, we can dynamically allocate physical memory for
>> holding memcg OOM messages and let the userspace poll it via some interface.
> 
> Would the userspace work when the system gets blocked on allocations?

Yes for memcg OOM events. No for global OOM events.
You can try the reproducers shown below in your environment.

Regarding case 2, we can solve the problem by checking tsk_is_oom_victim(current) == true.
But regarding case 1, Michal's patch is not sufficient for allowing administrators
to enter recovery commands from the console.

---------- Case 1: Flood of memcg OOM events caused by misconfiguration. ----------

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	char *buf = malloc(size);
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size / 2);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	fp = fopen("/proc/self/oom_score_adj", "w");
	fprintf(fp, "-1000\n");
	fclose(fp);
	fp = fopen("/dev/zero", "r");
	fread(buf, 1, size, fp);
	fclose(fp);
	return 0;
}

---------- Case 2: Flood of memcg OOM events caused by MMF_OOM_SKIP race. ----------

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE 4 * 10485760
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}
int main(int argc, char *argv[])
{
	int i;
	char *stack;
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	if (pipe(pipe_fd)) /* the memory_eater() clones block on this pipe until close() below */
		return 1;
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	if (setgid(-2) || setuid(-2))
		return 1;
	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
		     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
			  CLONE_SIGHAND | CLONE_THREAD | CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1)
			break;
	sleep(1);
	close(pipe_fd[1]);
	pause();
	return 0;
}

Patch
diff mbox series

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..9056f9b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,11 @@  bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		static unsigned long last_warned;
+
+		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+		    time_in_range(jiffies, last_warned, last_warned + 60 * HZ))
+			return false;
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1120,7 @@  bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		last_warned = jiffies;
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :