Message ID | 1539770782-3343-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
---|---
State | New, archived
Series | [v3] mm: memcontrol: Don't flood OOM messages with no eligible task.
On Wed 17-10-18 19:06:22, Tetsuo Handa wrote:
> syzbot is hitting RCU stalls at shmem_fault() [1]. This is because memcg-OOM events with no eligible task (the current thread is marked as OOM-unkillable) keep calling dump_header() from out_of_memory(), enabled by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when no eligible victim left.").
>
> Michal proposed ratelimiting dump_header() [2]. But I don't think that patch is appropriate, because it does not ratelimit the
>
>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>   "Out of memory and no killable processes...\n"
>
> messages, which can be printed every few milliseconds (i.e. effectively a denial of service for console users) until the OOM situation is solved.
>
> Let's make sure that the next dump_header() waits for at least 60 seconds from the previous "Out of memory and no killable processes..." message. Michal thinks that any interval is meaningless without knowing the printk() throughput. But since printk() is synchronous unless handed over to somebody else by commit dbdda842fe96f893 ("printk: Add console owner and waiter logic to load balance console writes"), it is likely that all OOM messages from this out_of_memory() request are already flushed to the consoles by the time pr_warn("Out of memory and no killable processes...\n") returns. Thus we will be able to allow console users to do what they need to do.
>
> To summarize, this patch allows threads in the requested memcg to complete memory allocation requests for doing recovery operations, and also allows administrators to manually perform recovery from the console if an OOM-unkillable thread is failing to solve the OOM situation automatically.

Could you explain why this is any better than using a well-established ratelimit approach?
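For reference, the "well established ratelimit approach" referred to throughout this thread is the helper in include/linux/ratelimit.h. A minimal sketch of how that API is typically used follows; the interval and burst values here are illustrative only, not something proposed in the thread:

	#include <linux/kernel.h>
	#include <linux/ratelimit.h>

	static void report_no_eligible_task(void)
	{
		/* Allow at most one message per 10-second window; drop the rest. */
		static DEFINE_RATELIMIT_STATE(oom_rs, 10 * HZ, 1);

		if (__ratelimit(&oom_rs))
			pr_warn("Out of memory and no killable processes...\n");
	}

Callers that only need a single throttled line can use the printk_ratelimited() wrapper instead of open-coding the state.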
On (10/17/18 12:28), Michal Hocko wrote:
> > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > that patch is appropriate because that patch does not ratelimit
> >
> >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> >   "Out of memory and no killable processes...\n"
[..]
> > Let's make sure that next dump_header() waits for at least 60 seconds from
> > previous "Out of memory and no killable processes..." message.
>
> Could you explain why this is any better than using a well established
> ratelimit approach?

Tetsuo, let's use a well-established rate-limit approach both in dump_header() and out_of_memory(). I actually was under the impression that Michal added rate-limiting to both of these functions.

The appropriate rate-limit value looks like something that printk() should know and be able to tell to the rest of the kernel. I don't think that middle ground will ever be found elsewhere.

printk() knows what consoles are registered, and printk() also knows (sometimes) what console="..." options the kernel was provided with. If baud rates were not provided as console= options, then serial consoles usually use some default value. We can probably ask consoles.

So *maybe* we can do something like this

//
// WARNING: this is just a sketch. A silly idea.
// I don't know if we can make it usable.
//

---
int printk_ratelimit_interval(void)
{
	int ret = DEFAULT_RATELIMIT_INTERVAL;
	struct tty_driver *driver = NULL;
	speed_t min_baud = MAX_INT;

	console_lock();
	for_each_console(c) {
		speed_t br;

		if (!c->device)
			continue;
		if (!(c->flags & CON_ENABLED))
			continue;
		if (!c->write)
			continue;
		driver = c->device(c, index);
		if (!driver)
			continue;

		br = tty_get_baud_rate(tty_driver to tty_struct [???]);
		min_baud = min(min_baud, br);
	}
	console_unlock();

	switch (min_baud) {
	case 115200:
		return ret;

	case ...blah blah...:
		return ret * 2;

	case 9600:
		return ret * 4;
	}
	return ret;
}
---

	-ss
On Wed 17-10-18 20:17:24, Sergey Senozhatsky wrote:
> On (10/17/18 12:28), Michal Hocko wrote:
> > > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > > that patch is appropriate because that patch does not ratelimit
> > >
> > >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> > >   "Out of memory and no killable processes...\n"
> [..]
> > > Let's make sure that next dump_header() waits for at least 60 seconds from
> > > previous "Out of memory and no killable processes..." message.
> >
> > Could you explain why this is any better than using a well established
> > ratelimit approach?
>
> Tetsuo, let's use a well-established rate-limit approach both in
> dump_header() and out_of_memory(). I actually was under the impression
> that Michal added rate-limiting to both of these functions.

I have http://lkml.kernel.org/r/20181010151135.25766-1-mhocko@kernel.org

Then the discussion took the usual back-and-forth direction, resulting in the "you do not ratelimit the allocation oom context" and "please do that as an incremental patch" route, and here we are. I do not have the time and energy to argue in an endless loop.

> The appropriate rate-limit value looks like something that printk()
> should know and be able to tell to the rest of the kernel. I don't
> think that middle ground will ever be found elsewhere.

Yes, that makes sense.
Sergey Senozhatsky wrote:
> On (10/17/18 12:28), Michal Hocko wrote:
> > > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > > that patch is appropriate because that patch does not ratelimit
> > >
> > >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> > >   "Out of memory and no killable processes...\n"
> [..]
> > > Let's make sure that next dump_header() waits for at least 60 seconds from
> > > previous "Out of memory and no killable processes..." message.
> >
> > Could you explain why this is any better than using a well established
> > ratelimit approach?

This is essentially a ratelimit approach, roughly equivalent with:

	static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
	oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;

	if (__ratelimit(&oom_no_victim_rs)) {
		dump_header(oc, NULL);
		pr_warn("Out of memory and no killable processes...\n");
		oom_no_victim_rs.begin = jiffies;
	}

> Tetsuo, let's use a well-established rate-limit approach both in
> dump_header() and out_of_memory(). I actually was under the impression
> that Michal added rate-limiting to both of these functions.

Honestly speaking, I wonder whether a rate-limit approach helps at all. Using a rate-limit is a workaround at best. The fundamental problem here is that we are effectively doing

	while (!fatal_signal_pending(current) && !(current->flags & PF_EXITING)) {
		pr_warn("Help me! I can't kill somebody...\n");
		cond_resched();
	}

when we are in a memcg OOM situation and we can't OOM-kill any process (i.e. we can make no progress). No matter how we throttle pr_warn(), this will keep consuming CPU resources until the memcg OOM situation is solved.

We call panic() if this is a global OOM situation. But for a memcg OOM situation, we do nothing and just hope that the memcg OOM situation is solved shortly. Until commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when no eligible victim left."), we used WARN(1) to report that we are in a bad situation. And since syzbot happened to hit this WARN(1) with panic_on_warn == 1, that commit removed the WARN(1) and instead added dump_header() and pr_warn(). And then, since syzbot happened to hit RCU stalls at the dump_header() added by that commit, we are trying to mitigate this problem.

But what happens if the threads hitting this path run at SCHED_RT priority and deprive the threads not hitting this path (e.g. the administrator's console session) of all the CPU resources needed for doing recovery operations? We might succeed in reducing the frequency of messing up the console screens with printk(), but we might fail to solve this memcg OOM situation after all...

> The appropriate rate-limit value looks like something that printk()
> should know and be able to tell to the rest of the kernel. I don't
> think that middle ground will ever be found elsewhere.
>
> printk() knows what consoles are registered, and printk() also knows
> (sometimes) what console="..." options the kernel was provided with.
> If baud rates were not provided as console= options, then serial
> consoles usually use some default value. We can probably ask consoles.
>
> So *maybe* we can do something like this
>
> //
> // WARNING: this is just a sketch. A silly idea.
> // I don't know if we can make it usable.
> //
>
> ---
> int printk_ratelimit_interval(void)
> {
> 	int ret = DEFAULT_RATELIMIT_INTERVAL;
> 	struct tty_driver *driver = NULL;
> 	speed_t min_baud = MAX_INT;
>
> 	console_lock();
> 	for_each_console(c) {
> 		speed_t br;
>
> 		if (!c->device)
> 			continue;
> 		if (!(c->flags & CON_ENABLED))
> 			continue;
> 		if (!c->write)
> 			continue;
> 		driver = c->device(c, index);
> 		if (!driver)
> 			continue;
>
> 		br = tty_get_baud_rate(tty_driver to tty_struct [???]);
> 		min_baud = min(min_baud, br);
> 	}
> 	console_unlock();
>
> 	switch (min_baud) {
> 	case 115200:
> 		return ret;
>
> 	case ...blah blah...:
> 		return ret * 2;
>
> 	case 9600:
> 		return ret * 4;
> 	}
> 	return ret;
> }

I don't think that the baud rate is relevant. Writing to the console messes up operations by console users. What matters is that we don't mess up consoles to the level (or frequency) where console users cannot do their operations. That is, the interval between the last moment we wrote to a console and the first moment we will next write to a console matters. Roughly speaking: remember the time stamp of the last call to call_console_drivers(), and compare against that stamp before trying to call a sort of ratelimited printk(). My patch is doing that using per-call-site stamp recording.
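A minimal sketch of the idea described just above: throttling based on how recently printk() last pushed text to the consoles, rather than on a fixed per-call-site interval. None of these symbols exist in the kernel; they are assumptions for illustration only:

	/* jiffies of the most recent call_console_drivers() (hypothetical hook). */
	static unsigned long last_console_flush;

	static void note_console_flush(void)
	{
		WRITE_ONCE(last_console_flush, jiffies);
	}

	/* Allow a report only if the consoles have been quiet for @interval. */
	static bool consoles_quiet_for(unsigned long interval)
	{
		return time_after(jiffies, READ_ONCE(last_console_flush) + interval);
	}

An OOM-style caller would then test something like consoles_quiet_for(60 * HZ) before emitting its next report.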
On (10/18/18 11:46), Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> >
> > int printk_ratelimit_interval(void)
> > {
> > 	int ret = DEFAULT_RATELIMIT_INTERVAL;
> > 	struct tty_driver *driver = NULL;
> > 	speed_t min_baud = MAX_INT;
> >
> > 	console_lock();
> > 	for_each_console(c) {
> > 		speed_t br;
> >
> > 		if (!c->device)
> > 			continue;
> > 		if (!(c->flags & CON_ENABLED))
> > 			continue;
> > 		if (!c->write)
> > 			continue;
> > 		driver = c->device(c, index);
> > 		if (!driver)
> > 			continue;
> >
> > 		br = tty_get_baud_rate(tty_driver to tty_struct [???]);
> > 		min_baud = min(min_baud, br);
> > 	}
> > 	console_unlock();
> >
> > 	switch (min_baud) {
> > 	case 115200:
> > 		return ret;
> >
> > 	case ...blah blah...:
> > 		return ret * 2;
> >
> > 	case 9600:
> > 		return ret * 4;
> > 	}
> > 	return ret;
> > }
>
> I don't think that the baud rate is relevant. Writing to the console messes up
> operations by console users. What matters is that we don't mess up consoles
> to the level (or frequency) where console users cannot do their operations.
> That is, the interval between the last moment we wrote to a console and the
> first moment we will next write to a console matters. Roughly speaking:
> remember the time stamp of the last call to call_console_drivers(), and
> compare against that stamp before trying to call a sort of ratelimited
> printk(). My patch is doing that using per-call-site stamp recording.

To my personal taste, the "baud rate of registered and enabled consoles" approach is drastically more relevant than hard-coded 10 * HZ or 60 * HZ magic numbers... But not in the form of that "min baud rate" brain fart which I posted.

What I'd do:

-- Iterate over all registered and enabled serial consoles
-- Sum up all the baud rates
-- Calculate (*roughly*) how many bytes per second/minute/etc my call_console_driver() can push
   -- we actually don't even have to iterate over all consoles. Just add a baud rate in register_console() and subtract it in unregister_console() for each console individually
   -- and have a static unsigned long in printk.c, which OOM can use in its rate-limit interval check
-- Leave all the noise behind: e.g. console_sem can be locked by a preempted fbcon, etc. It's out of my control; bad luck, there is nothing I can do about it.
-- Then I would, probably, take the most recent, say, 100 OOM reports, and calculate the *average* strlen() value (including \r and \n at the end of each line)

   (strlen(oom_report1) + ... + strlen(oom_report100)) / 100

Then I'd try to reach an agreement with MM people that we will use this "average" oom_report_strlen() in the ratelimit interval calculation. Yes, some reports will be longer, some shorter.

Say, suppose... I have 2 consoles, and I can write 250 bytes per second. And an average oom_report is 5000 bytes. Then I can print one oom_report every (5000 / 250) seconds in the *best* case. That's the optimistic baseline. There can be printk()-s from other CPUs, etc. etc. No one can predict those things.

Note how things change when I have just 1 console enabled. I have 1 console, and I can write 500 bytes per second. And an average oom_report is 5000 bytes. Then I can print one oom_report every (5000 / 500) seconds in the *best* case.

Just my $0.02. Who knows, maybe it's dumb and ugly. I don't have a dog in this fight.

	-ss
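A rough sketch of the bookkeeping described in the list above: maintaining a running sum of console baud rates and deriving a rate-limit interval from it. console_baud_rate_of() is a hypothetical helper standing in for however the baud rate would really be obtained; none of this is existing kernel code:

	/* Bits per second, summed over registered serial consoles. */
	static unsigned long total_console_baud;

	/* Called from register_console()/unregister_console() in this scheme. */
	static void account_console_baud(struct console *con, bool add)
	{
		unsigned long baud = console_baud_rate_of(con);	/* hypothetical */

		if (add)
			total_console_baud += baud;
		else
			total_console_baud -= baud;
	}

	/* Interval (in jiffies) needed to push @bytes through the consoles. */
	static unsigned long printk_ratelimit_interval(unsigned long bytes)
	{
		if (!total_console_baud)
			return DEFAULT_RATELIMIT_INTERVAL;

		/* Roughly 10 bits per byte on a serial line (start/stop bits). */
		return max_t(unsigned long, DEFAULT_RATELIMIT_INTERVAL,
			     bytes * 10 * HZ / total_console_baud);
	}

With the average report length from the last bullet, @bytes would be the measured oom_report_strlen() value.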
Sergey Senozhatsky wrote:
> To my personal taste, the "baud rate of registered and enabled consoles"
> approach is drastically more relevant than hard-coded 10 * HZ or
> 60 * HZ magic numbers... But not in the form of that "min baud rate"
> brain fart which I posted.

I'm saying that my 60 * HZ is the "duration for which the OOM killer keeps refraining from calling printk()". Such a period is required for allowing console users to do their operations without being disturbed by the OOM killer. Even if you succeeded in calculating the average speed at which the OOM killer messages are flushed to the consoles, printing the OOM killer messages at that speed would still keep console users unable to do their operations. Please do not print the OOM killer messages when the OOM killer cannot make progress... It just disturbs console users.
On (10/18/18 14:26), Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> > To my personal taste, the "baud rate of registered and enabled consoles"
> > approach is drastically more relevant than hard-coded 10 * HZ or
> > 60 * HZ magic numbers... But not in the form of that "min baud rate"
> > brain fart which I posted.
>
> I'm saying that my 60 * HZ is the "duration for which the OOM killer keeps
> refraining from calling printk()". Such a period is required for allowing
> console users to do their operations without being disturbed by the OOM
> killer.

Got you. I'm probably not paying too much attention to this discussion. You start your commit message with "RCU stalls" and end with a completely different problem, "admin interaction". I skipped the last part of the commit message.

OK. That makes sense if any user intervention/interaction actually happens. I'm not sure that someone at Facebook or Google logs in to every server that is under OOM to do something critically important there. Netconsole logs and postmortem analysis, *perhaps*, would be their choice. I believe it was Johannes who said that his netconsole is capable of keeping up with the traffic and that 60 * HZ is too long for him. So I can see why people might not be happy with your patch. I don't think that a 60 * HZ enforcement will go anywhere.

Now, if your problem is "I'm actually logged in and want to do something sane; how do I stop this OOM report flood, because it wipes out everything I have on my console?" then let's formulate it exactly that way and let's hear from MM people what they can suggest.

Michal, Andrew, Johannes, any thoughts?

For instance, change /proc/sys/kernel/printk and suppress most of the warnings?
// not only OOM but possibly other printk()-s that can come from
// different CPUs

If your problem is "syzbot hits RCU stalls", then let's have baud-rate-based ratelimiting; I think we can get more or less reasonable timeout values.

	-ss
On Thu 18-10-18 11:46:50, Tetsuo Handa wrote:
> Sergey Senozhatsky wrote:
> > On (10/17/18 12:28), Michal Hocko wrote:
> > > > Michal proposed ratelimiting dump_header() [2]. But I don't think that
> > > > that patch is appropriate because that patch does not ratelimit
> > > >
> > > >   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> > > >   "Out of memory and no killable processes...\n"
> > [..]
> > > > Let's make sure that next dump_header() waits for at least 60 seconds from
> > > > previous "Out of memory and no killable processes..." message.
> > >
> > > Could you explain why this is any better than using a well established
> > > ratelimit approach?
>
> This is essentially a ratelimit approach, roughly equivalent with:
>
> 	static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
> 	oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;
>
> 	if (__ratelimit(&oom_no_victim_rs)) {
> 		dump_header(oc, NULL);
> 		pr_warn("Out of memory and no killable processes...\n");
> 		oom_no_victim_rs.begin = jiffies;
> 	}

Then there is no reason to reinvent the wheel, so use the standard ratelimit approach. Or, to put it in other words, this place is no more special than any other that needs some sort of printk throttling. We surely do not want ad-hoc solutions all over the kernel.

And once you realize that the ratelimit API is the proper one (putting aside any potential improvements in the implementation of this API), then you quickly learn that we already throttle oom reports, and it would be nice to unify that and ... we are back to a naked patch. So please stop being stubborn and finally try to cooperate.
On Thu 18-10-18 15:10:18, Sergey Senozhatsky wrote:
[...]
> and let's hear from MM people what they can suggest.
>
> Michal, Andrew, Johannes, any thoughts?

I have already stated my position. Let's not reinvent the wheel and use the standard printk throttling. If there are cases where oom reports cause more harm than good, I am open to adding a knob that allows disabling them altogether (it can even be a fine-grained one that controls whether to dump show_mem, the task list, etc.). But please let's stop these dubious one-off approaches.
On (10/18/18 09:56), Michal Hocko wrote:
> On Thu 18-10-18 15:10:18, Sergey Senozhatsky wrote:
> [...]
> > and let's hear from MM people what they can suggest.
> >
> > Michal, Andrew, Johannes, any thoughts?
>
> I have already stated my position. Let's not reinvent the wheel and use
> the standard printk throttling. If there are cases where oom reports
> cause more harm than good, I am open to adding a knob that allows disabling
> them altogether (it can even be a fine-grained one that controls whether to
> dump show_mem, the task list, etc.).

A knob might do. As well as /proc/sys/kernel/printk tweaks, probably. One can even add echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust the printk console levels on login and roll back to the old values in .bash_logout. Maybe.

> But please let's stop these dubious one-off approaches.

OK. Well, I'm not actually proposing anything. I didn't even realize until recently that Tetsuo was talking about a "user interaction" problem; I thought that his problem was the stalled RCU.

	-ss
On 2018/10/18 15:55, Michal Hocko wrote:
> On Thu 18-10-18 11:46:50, Tetsuo Handa wrote:
>> This is essentially a ratelimit approach, roughly equivalent with:
>>
>> 	static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
>> 	oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;
>>
>> 	if (__ratelimit(&oom_no_victim_rs)) {
>> 		dump_header(oc, NULL);
>> 		pr_warn("Out of memory and no killable processes...\n");
>> 		oom_no_victim_rs.begin = jiffies;
>> 	}
>
> Then there is no reason to reinvent the wheel, so use the standard
> ratelimit approach. Or, to put it in other words, this place is no more
> special than any other that needs some sort of printk throttling. We surely
> do not want ad-hoc solutions all over the kernel.

netdev_wait_allrefs() in net/core/dev.c is doing the same thing. Since out_of_memory() is serialized by the oom_lock mutex, there is no need to use the "struct ratelimit_state"->lock field. A plain "unsigned long" is enough.

> And once you realize that the ratelimit API is the proper one (putting aside
> any potential improvements in the implementation of this API), then you
> quickly learn that we already throttle oom reports, and it would be
> nice to unify that and ... we are back to a naked patch. So please stop
> being stubborn and finally try to cooperate.

I don't think that the ratelimit API is the proper one, for I am touching the "struct ratelimit_state"->begin field, which is not exported by the ratelimit API. But if you insist on a ratelimit API version, I can tolerate the one below.

 mm/oom_kill.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..7c6118e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,12 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		static DEFINE_RATELIMIT_STATE(no_eligible_rs, 60 * HZ, 1);
+
+		ratelimit_set_flags(&no_eligible_rs, RATELIMIT_MSG_ON_RELEASE);
+		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+		    !__ratelimit(&no_eligible_rs))
+			return false;
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1121,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		no_eligible_rs.begin = jiffies;
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
On Thu 18-10-18 19:37:18, Tetsuo Handa wrote:
> On 2018/10/18 15:55, Michal Hocko wrote:
> > On Thu 18-10-18 11:46:50, Tetsuo Handa wrote:
> >> This is essentially a ratelimit approach, roughly equivalent with:
> >>
> >> 	static DEFINE_RATELIMIT_STATE(oom_no_victim_rs, 60 * HZ, 1);
> >> 	oom_no_victim_rs.flags |= RATELIMIT_MSG_ON_RELEASE;
> >>
> >> 	if (__ratelimit(&oom_no_victim_rs)) {
> >> 		dump_header(oc, NULL);
> >> 		pr_warn("Out of memory and no killable processes...\n");
> >> 		oom_no_victim_rs.begin = jiffies;
> >> 	}
> >
> > Then there is no reason to reinvent the wheel, so use the standard
> > ratelimit approach. Or, to put it in other words, this place is no more
> > special than any other that needs some sort of printk throttling. We surely
> > do not want ad-hoc solutions all over the kernel.
>
> netdev_wait_allrefs() in net/core/dev.c is doing the same thing. Since
> out_of_memory() is serialized by the oom_lock mutex, there is no need to use
> the "struct ratelimit_state"->lock field. A plain "unsigned long" is enough.

That code probably predates the generalized ratelimit API.

> > And once you realize that the ratelimit API is the proper one (putting aside
> > any potential improvements in the implementation of this API), then you
> > quickly learn that we already throttle oom reports, and it would be
> > nice to unify that and ... we are back to a naked patch. So please stop
> > being stubborn and finally try to cooperate.
>
> I don't think that the ratelimit API is the proper one, for I am touching the
> "struct ratelimit_state"->begin field, which is not exported by the ratelimit
> API. But if you insist on a ratelimit API version, I can tolerate the one below.

I just give up. I do not really see why you always have to make the code more complex than necessary and squash different things together. This is a complete kernel-code-development antipattern. I am not going to reply to this thread any more, but let me note that this is beyond fun in any aspect I can think of (and yeah, I have considered a dark sense of humor as well).

> mm/oom_kill.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f10aa53..7c6118e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1106,6 +1106,12 @@ bool out_of_memory(struct oom_control *oc)
> 	select_bad_process(oc);
> 	/* Found nothing?!?! */
> 	if (!oc->chosen) {
> +		static DEFINE_RATELIMIT_STATE(no_eligible_rs, 60 * HZ, 1);
> +
> +		ratelimit_set_flags(&no_eligible_rs, RATELIMIT_MSG_ON_RELEASE);
> +		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
> +		    !__ratelimit(&no_eligible_rs))
> +			return false;
> 		dump_header(oc, NULL);
> 		pr_warn("Out of memory and no killable processes...\n");
> 		/*
> @@ -1115,6 +1121,7 @@ bool out_of_memory(struct oom_control *oc)
> 		 */
> 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> 			panic("System is deadlocked on memory\n");
> +		no_eligible_rs.begin = jiffies;
> 	}
> 	if (oc->chosen && oc->chosen != (void *)-1UL)
> 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> --
> 1.8.3.1
On 2018/10/18 17:13, Sergey Senozhatsky wrote:
> On (10/18/18 09:56), Michal Hocko wrote:
>> On Thu 18-10-18 15:10:18, Sergey Senozhatsky wrote:
>> [...]
>>> and let's hear from MM people what they can suggest.
>>>
>>> Michal, Andrew, Johannes, any thoughts?
>>
>> I have already stated my position. Let's not reinvent the wheel and use
>> the standard printk throttling. If there are cases where oom reports
>> cause more harm than good, I am open to adding a knob that allows disabling
>> them altogether (it can even be a fine-grained one that controls whether to
>> dump show_mem, the task list, etc.).

Moderate OOM reports while making progress are good. But OOM reports without making progress are bad.

> A knob might do.
> As well as /proc/sys/kernel/printk tweaks, probably. One can even add
> echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust the printk
> console levels on login and roll back to the old values in .bash_logout.
> Maybe.

That can only work for the single-login-as-root case. Not everyone logs into the console as the root user. It is a pity that we can't send kernel messages to only selected consoles (e.g. all messages are sent to netconsole, but only critical messages are sent to local consoles).

>> But please let's stop these dubious one-off approaches.
>
> OK. Well, I'm not actually proposing anything. I didn't even
> realize until recently that Tetsuo was talking about a "user
> interaction" problem; I thought that his problem was the stalled RCU.

The "stalled RCU" was the trigger for considering this problem. Nobody has seriously considered what we should do when the memcg OOM killer cannot make progress. If the OOM killer cannot make progress, we need to handle situations where the OOM-unkillable process cannot solve the memcg OOM situation by itself. Then the recovery method of last resort is that the root user enters commands for recovery (it might be to increase the memory limit, it might be to move tasks to a different cgroup, it might be to gracefully terminate the OOM-unkillable process) from the "consoles" where the OOM messages are sent. If we start from that worst case, it is obvious that we need to make sure that the OOM messages do not disturb the "consoles", so that the root user can enter commands for recovery.

That boils down to a "user interaction" problem. Not limiting the

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

messages is very annoying. And I really can't understand why Michal regards "handling this requirement" as "making the code more complex than necessary and squashing different things together". Satisfying the most difficult error-path handling is not a simple thing.
On Thu 2018-10-18 13:27:39, Sergey Senozhatsky wrote:
> On (10/18/18 11:46), Tetsuo Handa wrote:
> > Sergey Senozhatsky wrote:
> > >
> > > int printk_ratelimit_interval(void)
> > > {
> > > 	int ret = DEFAULT_RATELIMIT_INTERVAL;
> > > 	struct tty_driver *driver = NULL;
> > > 	speed_t min_baud = MAX_INT;
> > >
> > > 	console_lock();
> > > 	for_each_console(c) {
> > > 		speed_t br;
> > >
> > > 		if (!c->device)
> > > 			continue;
> > > 		if (!(c->flags & CON_ENABLED))
> > > 			continue;
> > > 		if (!c->write)
> > > 			continue;
> > > 		driver = c->device(c, index);
> > > 		if (!driver)
> > > 			continue;
> > >
> > > 		br = tty_get_baud_rate(tty_driver to tty_struct [???]);
> > > 		min_baud = min(min_baud, br);
> > > 	}
> > > 	console_unlock();
> > >
> > > 	switch (min_baud) {
> > > 	case 115200:
> > > 		return ret;
> > >
> > > 	case ...blah blah...:
> > > 		return ret * 2;
> > >
> > > 	case 9600:
> > > 		return ret * 4;
> > > 	}
> > > 	return ret;
> > > }
> >
> > I don't think that the baud rate is relevant. Writing to the console messes up
> > operations by console users. What matters is that we don't mess up consoles
> > to the level (or frequency) where console users cannot do their operations.
> > That is, the interval between the last moment we wrote to a console and the
> > first moment we will next write to a console matters. Roughly speaking:
> > remember the time stamp of the last call to call_console_drivers(), and
> > compare against that stamp before trying to call a sort of ratelimited
> > printk(). My patch is doing that using per-call-site stamp recording.
>
> To my personal taste, the "baud rate of registered and enabled consoles"
> approach is drastically more relevant than hard-coded 10 * HZ or
> 60 * HZ magic numbers... But not in the form of that "min baud rate"
> brain fart which I posted.
>
> What I'd do:
>
> -- Iterate over all registered and enabled serial consoles
> -- Sum up all the baud rates
> -- Calculate (*roughly*) how many bytes per second/minute/etc my
>    call_console_driver() can push
>    -- we actually don't even have to iterate over all consoles. Just
>       add a baud rate in register_console() and subtract it in
>       unregister_console() for each console individually
>    -- and have a static unsigned long in printk.c, which OOM
>       can use in its rate-limit interval check
> -- Leave all the noise behind: e.g. console_sem can be locked by
>    a preempted fbcon, etc. It's out of my control; bad luck, there
>    is nothing I can do about it.
> -- Then I would, probably, take the most recent, say, 100 OOM
>    reports, and calculate the *average* strlen() value
>    (including \r and \n at the end of each line)

This looks very complex, and I see even more problems:

+ You would need to update the rate-limit intervals when a new console is attached. Note that the ratelimits might get initialized early during boot. It might be solvable but ...

+ You might need to update the length of the message when the text (code) is updated. It might be hard to maintain.

+ You would need to take into account also the console loglevel and the level of the printed messages.

+ This approach does not take into account all the other messages that might be printed by other subsystems.

I have just talked with Michal in person. He pointed out that we primarily need to know when and whether the last printed message already reached the console.

A solution might be to handle printk and ratelimit together. For example:

+ store the log_next_idx of the printed message into the ratelimit structure

+ eventually store a pointer to the ratelimit structure into printk_log

+ eventually store the timestamp when the message reached the console into the ratelimit structure

Then the ratelimited printk might take into account whether the previous message already reached the console, and even when.

Well, this is still rather complex. I am not persuaded that it is worth it.

I suggest taking a breath and stopping the search for a perfect solution for a bit. The existing ratelimit might be perfectly fine in practice. You can always create a stress test that would fail, but the test might be far from reality. Any complex solution might bring more problems than it solves.

A console full of repeated messages need not be a catastrophe when it helps to fix the problem and the system is usable and needs a reboot anyway.

Best Regards,
Petr
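A small sketch of the "handle printk and ratelimit together" idea above, under the assumption that printk could expose both the sequence of the record a caller just stored and the sequence of the newest record already flushed to the consoles. printk_console_seq() and printk_last_stored_seq() are hypothetical names, not existing API:

	struct ratelimit_state_console {
		struct ratelimit_state rs;
		u64 last_seq;	/* record this call site printed last time */
	};

	static bool ratelimit_console_ok(struct ratelimit_state_console *rsc)
	{
		/* Stay quiet until the previous message has reached the console. */
		if (rsc->last_seq && printk_console_seq() < rsc->last_seq)
			return false;
		return __ratelimit(&rsc->rs);
	}

	/* Call right after printing; remembers which record to wait for next time. */
	static void ratelimit_console_printed(struct ratelimit_state_console *rsc)
	{
		rsc->last_seq = printk_last_stored_seq();
	}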
On (10/18/18 20:58), Tetsuo Handa wrote:
> >
> > A knob might do.
> > As well as /proc/sys/kernel/printk tweaks, probably. One can even add
> > echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust the printk
> > console levels on login and roll back to the old values in .bash_logout.
> > Maybe.
>
> That can only work for the single-login-as-root case.
> Not everyone logs into the console as the root user.

Add sudo ;)

> It is a pity that we can't send kernel messages to only selected consoles
> (e.g. all messages are sent to netconsole, but only critical messages are
> sent to local consoles).

OK, that's a fair point. There was a patch from FB which would allow us to set a loglevel on a per-console basis. So the noise goes to the heav^W net console; only critical stuff goes to the serial console (if I recall it correctly). I'm not sure what happened to that patch, it was a while ago. I'll try to find out.

[..]
> That boils down to a "user interaction" problem.
> Not limiting the
>
>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>   "Out of memory and no killable processes...\n"
>
> messages is very annoying.
>
> And I really can't understand why Michal regards "handling this requirement" as
> "making the code more complex than necessary and squashing different things together".

Michal is trying very hard to address the problem in a reasonable way. The problem you are talking about is not MM specific. You can have a faulty SCSI device, a corrupted FS, and so on.

	-ss
Petr Mladek wrote:
> This looks very complex, and I see even more problems:
>
> + You would need to update the rate-limit intervals when
>   a new console is attached. Note that the ratelimits might
>   get initialized early during boot. It might be solvable
>   but ...
>
> + You might need to update the length of the message
>   when the text (code) is updated. It might be hard
>   to maintain.

I assumed we would calculate the average dynamically, for the amount of messages printed by an OOM event is highly unstable (it depends on the hardware configuration, such as the number of nodes and the number of zones, and on how many processes there are as candidates for the OOM victim).

> + You would need to take into account also the console loglevel
>   and the level of the printed messages.

Isn't that accounted for by call_console_drivers()?

> + This approach does not take into account all the other
>   messages that might be printed by other subsystems.

Yes. And I wonder whether unlimited printk() alone is the cause of the RCU stalls. I think that printk() is serving as an amplifier for any CPU users. That is, the average speed might not be safe enough to guarantee that RCU stalls won't occur. Then there is no safe average value we can use.

> I have just talked with Michal in person. He pointed out
> that we primarily need to know when and whether the last printed
> message already reached the console.

I think we can use an estimate of when call_console_drivers() last started/completed as "when and whether the last printed message already reached the console". Sometimes callers might append to the logbuf without waiting for the completion of call_console_drivers(), but the system is already unusable if the majority of ratelimited printk() users hit that race window.

> A solution might be to handle printk and ratelimit together.
> For example:
>
> + store the log_next_idx of the printed message into
>   the ratelimit structure
>
> + eventually store a pointer to the ratelimit structure
>   into printk_log
>
> + eventually store the timestamp when the message
>   reached the console into the ratelimit structure
>
> Then the ratelimited printk might take into account whether
> the previous message already reached the console, and even when.

If printk() becomes asynchronous (e.g. a printk() kernel thread), we would need to use something like synchronize_srcu() so that the caller waits only for the completion of the messages the caller wants to wait for.

> Well, this is still rather complex. I am not persuaded that
> it is worth it.
>
> I suggest taking a breath and stopping the search for a perfect solution
> for a bit. The existing ratelimit might be perfectly fine
> in practice. You can always create a stress test that would
> fail, but the test might be far from reality. Any complex
> solution might bring more problems than it solves.
>
> A console full of repeated messages need not be a catastrophe
> when it helps to fix the problem and the system is usable
> and needs a reboot anyway.

I wish that memcg OOM events did not use printk(). Since a memcg OOM is not "out of physical memory", we can dynamically allocate physical memory for holding the memcg OOM messages and let userspace poll them via some interface.
On 2018/10/19 8:54, Sergey Senozhatsky wrote:
> On (10/18/18 20:58), Tetsuo Handa wrote:
>>>
>>> A knob might do.
>>> As well as /proc/sys/kernel/printk tweaks, probably. One can even add
>>> echo "a b c d" > /proc/sys/kernel/printk to .bashrc and adjust the printk
>>> console levels on login and roll back to the old values in .bash_logout.
>>> Maybe.
>>
>> That can only work for the single-login-as-root case.
>> Not everyone logs into the console as the root user.
>
> Add sudo ;)

That will not work. ;-) As long as the console loglevel setting is system-wide, we can't allow multiple login sessions.

>> It is a pity that we can't send kernel messages to only selected consoles
>> (e.g. all messages are sent to netconsole, but only critical messages are
>> sent to local consoles).
>
> OK, that's a fair point. There was a patch from FB which would allow us
> to set a loglevel on a per-console basis. So the noise goes to the heav^W net
> console; only critical stuff goes to the serial console (if I recall it
> correctly). I'm not sure what happened to that patch, it was a while ago.
> I'll try to find out.

A per-console loglevel setting would help in several environments. But the syzbot environment cannot count on netconsole. We can't expect that unlimited printk() will become safe.

> [..]
>> That boils down to a "user interaction" problem.
>> Not limiting the
>>
>>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>>   "Out of memory and no killable processes...\n"
>>
>> messages is very annoying.
>>
>> And I really can't understand why Michal regards "handling this requirement" as
>> "making the code more complex than necessary and squashing different things together".
>
> Michal is trying very hard to address the problem in a reasonable way.

OK. But Michal, do we have a reasonable way which can be applied now, instead of my patch or one of the patches below? Just enumerating words like "hackish" or "a mess" without YOU ACTUALLY PROPOSING PATCHES will bounce back to YOU.

> The problem you are talking about is not MM specific. You can have a
> faulty SCSI device, a corrupted FS, and so on.

"A faulty SCSI device, a corrupted FS, and so on" are reporting problems for requests which will complete. They can use (and are using) ratelimiting, aren't they? "A memcg OOM with no eligible task" is reporting a problem for a request which cannot complete. But it can use ratelimiting as well. And we do have an immediately applicable mitigation, using one of the patches below, for the problem that already-OOM-killed threads are triggering "a memcg OOM with no eligible task".

From 0a533d15949eac25f5ce7ce6e53f5830608f08e7 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 19 Oct 2018 15:52:56 +0900
Subject: [PATCH v2] mm, oom: OOM victims do not need to select next OOM victim unless __GFP_NOFAIL.

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks") changed to select the next OOM victim as soon as MMF_OOM_SKIP is set, a memcg OOM event from a user process can generate 220+ repetitions (12400+ lines / 730+ KB) of OOM-killer messages with "Out of memory and no killable processes..." (i.e. no progress) due to a race window. This patch completely eliminates that race window by making out_of_memory() from OOM victims a no-op, for OOM victims do not retry forever (unless __GFP_NOFAIL).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..0e8d20b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1058,6 +1058,9 @@ bool out_of_memory(struct oom_control *oc)
 	if (oom_killer_disabled)
 		return false;
 
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (!is_memcg_oom(oc)) {
 		blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
 		if (freed > 0)
On (10/19/18 19:35), Tetsuo Handa wrote:
> >
> > OK, that's a fair point. There was a patch from FB which would allow us
> > to set a loglevel on a per-console basis. So the noise goes to the heav^W net
> > console; only critical stuff goes to the serial console (if I recall it
> > correctly). I'm not sure what happened to that patch, it was a while ago.
> > I'll try to find out.
>
> A per-console loglevel setting would help in several environments.
> But the syzbot environment cannot count on netconsole. We can't expect that
> unlimited printk() will become safe.

This target is moving too fast :)

	RCU stall -> user interaction -> syzbot

I talked to Calvin Owens (who's working on the per-console loglevel patch set; CC-ed), and Calvin said that it is in progress. So we will probably have this functionality one day. That's all we can do from the printk side wrt the user-interaction problem.

> > The problem you are talking about is not MM specific. You can have a
> > faulty SCSI device, a corrupted FS, and so on.
>
> "A faulty SCSI device, a corrupted FS, and so on" are reporting problems
> for requests which will complete. They can use (and are using) ratelimiting,
> aren't they?

Looking at scsi_request_fn(), the answer is probably "sure they can; but no, they aren't". In the majority of cases the reason we replace printk with printk_ratelimit is that someone reports a stall or a lockup. Otherwise, people use printk(), which is absolutely fine.

	-ss
On Fri 2018-10-19 09:18:16, Tetsuo Handa wrote:
> Petr Mladek wrote:
> > This looks very complex, and I see even more problems:
> >
> > + You would need to update the rate-limit intervals when
> >   a new console is attached. Note that the ratelimits might
> >   get initialized early during boot. It might be solvable
> >   but ...
> >
> > + You might need to update the length of the message
> >   when the text (code) is updated. It might be hard
> >   to maintain.
>
> I assumed we would calculate the average dynamically, for the amount of
> messages printed by an OOM event is highly unstable (it depends on the
> hardware configuration, such as the number of nodes and the number of zones,
> and on how many processes there are as candidates for the OOM victim).

Is there any idea how the average length can be counted dynamically? Note that ___ratelimit() currently does not get any information from printk/console. It would need to be lockless. We do not want to complicate printk() with even more locks.

> > + This approach does not take into account all the other
> >   messages that might be printed by other subsystems.
>
> Yes. And I wonder whether unlimited printk() alone is the cause of the RCU
> stalls. I think that printk() is serving as an amplifier for any CPU users.
> That is, the average speed might not be safe enough to guarantee that RCU
> stalls won't occur. Then there is no safe average value we can use.

This is why I suggested avoiding counting OOM messages and instead just checking if and when the last OOM message reached the console.

> > I have just talked with Michal in person. He pointed out
> > that we primarily need to know when and whether the last printed
> > message already reached the console.
>
> I think we can use an estimate of when call_console_drivers() last
> started/completed as "when and whether the last printed message already
> reached the console". Sometimes callers might append to the logbuf without
> waiting for the completion of call_console_drivers(), but the system is
> already unusable if the majority of ratelimited printk() users hit that
> race window.

I am confused. We are talking about ratelimiting. We do not want to wait for anything. The only question is whether it makes sense to print the "same" message an Xth time when even the 1st message has not reached the console yet.

This reminds me of another problem. We would need to use the same decision for all printk() calls that logically belong to each other. Otherwise we might get mixed lines that might confuse people. I mean that OOM messages might look like:

    OOM: A
    OOM: B
    OOM: C

If we do not synchronize the ratelimiting, we might see:

    OOM: A
    OOM: B
    OOM: C
    OOM: B
    OOM: B
    OOM: A
    OOM: C
    OOM: C

> > A solution might be to handle printk and ratelimit together.
> > For example:
> >
> > + store the log_next_idx of the printed message into
> >   the ratelimit structure
> >
> > + eventually store a pointer to the ratelimit structure
> >   into printk_log
> >
> > + eventually store the timestamp when the message
> >   reached the console into the ratelimit structure
> >
> > Then the ratelimited printk might take into account whether
> > the previous message already reached the console, and even when.
>
> If printk() becomes asynchronous (e.g. a printk() kernel thread), we would
> need to use something like synchronize_srcu() so that the caller waits only
> for the completion of the messages the caller wants to wait for.

I do not understand this. printk() must not block OOM progress.

> > Well, this is still rather complex. I am not persuaded that
> > it is worth it.
> >
> > I suggest taking a breath and stopping the search for a perfect solution
> > for a bit. The existing ratelimit might be perfectly fine
> > in practice. You can always create a stress test that would
> > fail, but the test might be far from reality. Any complex
> > solution might bring more problems than it solves.
> >
> > A console full of repeated messages need not be a catastrophe
> > when it helps to fix the problem and the system is usable
> > and needs a reboot anyway.
>
> I wish that memcg OOM events did not use printk(). Since a memcg OOM is not
> "out of physical memory", we can dynamically allocate physical memory for
> holding the memcg OOM messages and let userspace poll them via some interface.

Would the userspace work when the system gets blocked on allocations?

Best Regards,
Petr
On Fri 2018-10-19 19:35:53, Tetsuo Handa wrote:
> On 2018/10/19 8:54, Sergey Senozhatsky wrote:
> > On (10/18/18 20:58), Tetsuo Handa wrote:
> >> That boils down to a "user interaction" problem.
> >> Not limiting the
> >>
> >>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
> >>   "Out of memory and no killable processes...\n"
> >>
> >> messages is very annoying.
> >>
> >> And I really can't understand why Michal regards "handling this requirement" as
> >> "making the code more complex than necessary and squashing different things together".
> >
> > Michal is trying very hard to address the problem in a reasonable way.
>
> OK. But Michal, do we have a reasonable way which can be applied now, instead of
> my patch or one of the patches below? Just enumerating words like "hackish" or
> "a mess" without YOU ACTUALLY PROPOSING PATCHES will bounce back to YOU.

Michal suggested using ratelimit, the standard solution.

My understanding is that this situation happens when the system is misconfigured and unusable without manual intervention. If the user is able to see what the problem is, then we are good.

You talk about interactivity, but who asked for this? IMHO, if a system ends up in an OOM situation, it would need to get restarted in most cases anyway. Then people have a chance to fix the configuration after the reboot.

Best Regards,
Petr
[I strongly suspect this whole email thread went way beyond the scope the issue really deserves]

I didn't want to participate any further, but let me clarify one thing, because I can see how the discussion could generate some confusion.

On Tue 23-10-18 10:37:38, Petr Mladek wrote:
[...]
> My understanding is that this situation happens when the system is
> misconfigured and unusable without manual intervention. If
> the user is able to see what the problem is then we are good.

Not really. The flood of _memcg_ oom reports about no eligible tasks should indeed happen only when the memcg is misconfigured. The system is, and should be, still usable at this stage. The ratelimit is aimed at reducing pointless messages which do not help much to debug the issue itself.

There is a race condition, as explained by Tetsuo, that could lead to this situation even without a misconfiguration; that is clearly a bug and something to deal with, and patches have been posted in that regard [1].

The rest of the discussion is about how to handle printk rate-limiting properly, and whether an ad-hoc solution is more appropriate than the real API we have in place, and whether the latter needs some enhancements. That is completely orthogonal to the issue at hand, and as such it should really be discussed separately.

[1] http://lkml.kernel.org/r/20181022071323.9550-1-mhocko@kernel.org
On 2018/10/23 17:21, Petr Mladek wrote:
> On Fri 2018-10-19 09:18:16, Tetsuo Handa wrote:
>> I assumed we would calculate the average dynamically, for the amount of
>> messages printed by an OOM event is highly unstable (it depends on the
>> hardware configuration, such as the number of nodes and the number of zones,
>> and on how many processes there are as candidates for the OOM victim).
>
> Is there any idea how the average length can be counted dynamically?

I don't have one. Maybe sum up the return values of printk() from the OOM context?

> This reminds me of another problem. We would need to use the same
> decision for all printk() calls that logically belong to each
> other. Otherwise we might get mixed lines that might confuse
> people. I mean that OOM messages might look like:
>
>     OOM: A
>     OOM: B
>     OOM: C
>
> If we do not synchronize the ratelimiting, we might see:
>
>     OOM: A
>     OOM: B
>     OOM: C
>     OOM: B
>     OOM: B
>     OOM: A
>     OOM: C
>     OOM: C

Messages from out_of_memory() are serialized by the oom_lock mutex. Messages from warn_alloc() are not serialized, and thus cause confusion.

>> I wish that memcg OOM events did not use printk(). Since a memcg OOM is not
>> "out of physical memory", we can dynamically allocate physical memory for
>> holding the memcg OOM messages and let userspace poll them via some interface.
>
> Would the userspace work when the system gets blocked on allocations?

Yes for memcg OOM events. No for global OOM events. You can try the reproducers shown below in your environment. Regarding case 2, we can solve the problem by checking tsk_is_oom_victim(current) == true. But regarding case 1, Michal's patch is not sufficient for allowing administrators to enter commands for recovery from the console.

---------- Case 1: Flood of memcg OOM events caused by misconfiguration. ----------

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	char *buf = malloc(size);

	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size / 2);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	fp = fopen("/proc/self/oom_score_adj", "w");
	fprintf(fp, "-1000\n");
	fclose(fp);
	fp = fopen("/dev/zero", "r");
	fread(buf, 1, size, fp);
	fclose(fp);
	return 0;
}

---------- Case 2: Flood of memcg OOM events caused by MMF_OOM_SKIP race. ----------

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE 4 * 10485760
#define STACKSIZE 4096

static int pipe_fd[2] = { EOF, EOF };

static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);

	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	char *stack;
	FILE *fp;
	const unsigned long size = 1048576 * 200;

	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	if (setgid(-2) || setuid(-2))
		return 1;
	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
		     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
			  CLONE_SIGHAND | CLONE_THREAD | CLONE_VM |
			  CLONE_FS | CLONE_FILES, NULL) == -1)
			break;
	sleep(1);
	close(pipe_fd[1]);
	pause();
	return 0;
}
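Relating to the "Maybe sum up the return values of printk() from the OOM context?" suggestion at the top of this reply: a minimal sketch of what that could look like. printk() returns the number of characters in the formatted message, so the OOM path could keep a running average of report length; all names here are illustrative only, nothing like this exists in the kernel:

	static unsigned long oom_report_chars;	/* characters over all reports  */
	static unsigned long oom_report_count;	/* number of reports accounted  */

	/* Each printk() in dump_header() would add its return value to @chars. */
	static void oom_account_report(unsigned long chars)
	{
		oom_report_chars += chars;
		oom_report_count++;
	}

	static unsigned long avg_oom_report_len(void)
	{
		return oom_report_count ? oom_report_chars / oom_report_count : 0;
	}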
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..9056f9b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,11 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		static unsigned long last_warned;
+
+		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+		    time_in_range(jiffies, last_warned, last_warned + 60 * HZ))
+			return false;
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1120,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		last_warned = jiffies;
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
syzbot is hitting RCU stalls at shmem_fault() [1]. This is because memcg-OOM events with no eligible task (the current thread is marked as OOM-unkillable) keep calling dump_header() from out_of_memory(), enabled by commit 3100dab2aa09dc6e ("mm: memcontrol: print proper OOM header when no eligible victim left.").

Michal proposed ratelimiting dump_header() [2]. But I don't think that patch is appropriate, because it does not ratelimit the

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

messages, which can be printed every few milliseconds (i.e. effectively a denial of service for console users) until the OOM situation is solved.

Let's make sure that the next dump_header() waits for at least 60 seconds from the previous "Out of memory and no killable processes..." message. Michal thinks that any interval is meaningless without knowing the printk() throughput. But since printk() is synchronous unless handed over to somebody else by commit dbdda842fe96f893 ("printk: Add console owner and waiter logic to load balance console writes"), it is likely that all OOM messages from this out_of_memory() request are already flushed to the consoles by the time pr_warn("Out of memory and no killable processes...\n") returns. Thus we will be able to allow console users to do what they need to do.

To summarize, this patch allows threads in the requested memcg to complete memory allocation requests for doing recovery operations, and also allows administrators to manually perform recovery from the console if an OOM-unkillable thread is failing to solve the OOM situation automatically.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64
[2] https://lkml.kernel.org/r/20181010151135.25766-1-mhocko@kernel.org

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 6 ++++++
 1 file changed, 6 insertions(+)