diff mbox series

[RFC] memcg, oom: throttle dump_header for memcg ooms without eligible tasks

Message ID 20181010151135.25766-1-mhocko@kernel.org (mailing list archive)
State New, archived
Headers show
Series [RFC] memcg, oom: throttle dump_header for memcg ooms without eligible tasks | expand

Commit Message

Michal Hocko Oct. 10, 2018, 3:11 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

syzbot has noticed that it can trigger RCU stalls from the memcg oom
path:
RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
  dump_header+0x27b/0xf72 mm/oom_kill.c:441
  out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
  mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
  mem_cgroup_oom mm/memcontrol.c:1701 [inline]
  try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
  mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
  mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
  shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
  shmem_fault+0x25f/0x960 mm/shmem.c:1982
  __do_fault+0x100/0x6b0 mm/memory.c:2996
  do_read_fault mm/memory.c:3408 [inline]
  do_fault mm/memory.c:3531 [inline]

The primary reason of the stall lies in an expensive printk handling
of oom report flood because a misconfiguration on the syzbot side
caused that there is simply no eligible task because they have
OOM_SCORE_ADJ_MIN set. This generates the oom report for each allocation
from the memcg context.

While normal workloads should be much more careful about potential heavy
memory consumers that are OOM disabled it makes some sense to rate limit
a potentially expensive oom reports for cases when there is no eligible
victim found. Do that by moving the rate limit logic inside dump_header.
We no longer rely on the caller to do that. It was only oom_kill_process
which has been throttling. Other two call sites simply didn't have to
care because one just paniced on the OOM when configured that way and
no eligible task would panic for the global case as well. Memcg changed
the picture because we do not panic and we might have multiple sources
of the same event.

Once we are here, make sure that the reason to trigger the OOM is
printed without ratelimiting because this is really valuable to
debug what happened.

Reported-by: syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com
Cc: guro@fb.com
Cc: hannes@cmpxchg.org
Cc: kirill.shutemov@linux.intel.com
Cc: linux-kernel@vger.kernel.org
Cc: penguin-kernel@i-love.sakura.ne.jp
Cc: rientjes@google.com
Cc: yang.s@alibaba-inc.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Tetsuo Handa Oct. 11, 2018, 6:37 a.m. UTC | #1
Michal Hocko wrote:
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting because this is really valuable to
> debug what happened.

Here is my version.

From 0c9ab34fd01837d4c85794042ecb9e922c9eed5a Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 11 Oct 2018 15:27:42 +0900
Subject: [PATCH] mm: memcontrol: Don't flood OOM messages with no eligible task.

syzbot is hitting RCU stall at shmem_fault() [1].
This is because memcg-OOM events with no eligible task (current thread
is marked as OOM-unkillable) continued calling dump_header() from
out_of_memory() enabled by commit 3100dab2aa09dc6e ("mm: memcontrol:
print proper OOM header when no eligible victim left.").

Let's make sure that next dump_header() waits for at least 60 seconds from
previous "Out of memory and no killable processes..." message, if current
thread already reported it. This change allows current thread to complete
memory allocation requests for doing recovery operation, and also allows
us to know whether the current thread is doing recovery operation.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  1 +
 mm/oom_kill.c         | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57..701cf3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -723,6 +723,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_MEMCG
 	unsigned			in_user_fault:1;
+	unsigned			memcg_oom_no_eligible_warned:1;
 #ifdef CONFIG_MEMCG_KMEM
 	unsigned			memcg_kmem_skip_account:1;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..f954d99 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+		/*
+		 * Don't flood with dump_header() if already reported, in case
+		 * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+		 * this variable should be per "struct mem_cgroup". But since
+		 * we can't prove that multiple concurrent memcg OOM without
+		 * eligible task won't cause flooding, choose global variable
+		 * for safety.
+		 */
+		static u64 last_warned;
+
+		if (is_memcg_oom(oc) && current->memcg_oom_no_eligible_warned
+		    && time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+			return false;
+#endif
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+		if (is_memcg_oom(oc)) {
+			last_warned = get_jiffies_64();
+			current->memcg_oom_no_eligible_warned = 1;
+		}
+	} else if (is_memcg_oom(oc)) {
+		current->memcg_oom_no_eligible_warned = 0;
+#endif
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
Tetsuo Handa Oct. 12, 2018, 10:47 a.m. UTC | #2
On 2018/10/11 15:37, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> Once we are here, make sure that the reason to trigger the OOM is
>> printed without ratelimiting because this is really valuable to
>> debug what happened.
> 
> Here is my version.
> 

Hmm, per mem_cgroup flag would be better than per task_struct flag.

From 5e80805e4988f23372a3b352fd4185e6bad53f58 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 12 Oct 2018 16:24:49 +0900
Subject: [PATCH v2] mm: memcontrol: Don't flood OOM messages with no eligible task.

syzbot is hitting RCU stall at shmem_fault() [1].
This is because memcg-OOM events with no eligible task (current thread
is marked as OOM-unkillable) continued calling dump_header() from
out_of_memory() enabled by commit 3100dab2aa09dc6e ("mm: memcontrol:
print proper OOM header when no eligible victim left.").

Let's make sure that next dump_header() waits for at least 60 seconds from
previous "Out of memory and no killable processes..." message, if requested
memcg already reported it. This change allows threads in requested memcg to
complete memory allocation requests for doing recovery operation, and also
allows us to know whether threads in requested memcg is doing recovery
operation.

[1] https://syzkaller.appspot.com/bug?id=4ae3fff7fcf4c33a47c1192d2d62d2e03efffa64

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  3 +++
 mm/oom_kill.c              | 23 +++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 652f602..8da2340 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -230,6 +230,9 @@ struct mem_cgroup {
 	 */
 	bool oom_group;
 
+	/* Previous OOM-kill request failed due to no eligible task? */
+	bool oom_no_eligible_warned;
+
 	/* protected by memcg_oom_lock */
 	bool		oom_lock;
 	int		under_oom;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..52a8319 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,21 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+#ifdef CONFIG_MEMCG
+		/*
+		 * Don't flood with dump_header() if already reported, in case
+		 * current->signal->oom_score_adj == OOM_SCORE_ADJ_MIN. Maybe
+		 * this variable should be per "struct mem_cgroup". But since
+		 * we can't prove that multiple concurrent memcg OOM without
+		 * eligible task won't cause flooding, choose global variable
+		 * for safety.
+		 */
+		static u64 last_warned;
+
+		if (is_memcg_oom(oc) && oc->memcg->oom_no_eligible_warned &&
+		    time_before64(get_jiffies_64(), last_warned + 60 * HZ))
+			return false;
+#endif
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1130,14 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+#ifdef CONFIG_MEMCG
+		if (is_memcg_oom(oc)) {
+			oc->memcg->oom_no_eligible_warned = true;
+			last_warned = get_jiffies_64();
+		}
+	} else if (is_memcg_oom(oc)) {
+		oc->memcg->oom_no_eligible_warned = false;
+#endif
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
Johannes Weiner Oct. 12, 2018, 11:20 a.m. UTC | #3
On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> syzbot has noticed that it can trigger RCU stalls from the memcg oom
> path:
> RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
> Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
> 84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
> e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
> RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
> RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
> RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
> R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
> R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
>   dump_header+0x27b/0xf72 mm/oom_kill.c:441
>   out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
>   mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
>   mem_cgroup_oom mm/memcontrol.c:1701 [inline]
>   try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
>   mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
>   mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
>   shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
>   shmem_fault+0x25f/0x960 mm/shmem.c:1982
>   __do_fault+0x100/0x6b0 mm/memory.c:2996
>   do_read_fault mm/memory.c:3408 [inline]
>   do_fault mm/memory.c:3531 [inline]
> 
> The primary reason of the stall lies in an expensive printk handling
> of oom report flood because a misconfiguration on the syzbot side
> caused that there is simply no eligible task because they have
> OOM_SCORE_ADJ_MIN set. This generates the oom report for each allocation
> from the memcg context.
> 
> While normal workloads should be much more careful about potential heavy
> memory consumers that are OOM disabled it makes some sense to rate limit
> a potentially expensive oom reports for cases when there is no eligible
> victim found. Do that by moving the rate limit logic inside dump_header.
> We no longer rely on the caller to do that. It was only oom_kill_process
> which has been throttling. Other two call sites simply didn't have to
> care because one just paniced on the OOM when configured that way and
> no eligible task would panic for the global case as well. Memcg changed
> the picture because we do not panic and we might have multiple sources
> of the same event.
> 
> Once we are here, make sure that the reason to trigger the OOM is
> printed without ratelimiting because this is really valuable to
> debug what happened.
> 
> Reported-by: syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com
> Cc: guro@fb.com
> Cc: hannes@cmpxchg.org
> Cc: kirill.shutemov@linux.intel.com
> Cc: linux-kernel@vger.kernel.org
> Cc: penguin-kernel@i-love.sakura.ne.jp
> Cc: rientjes@google.com
> Cc: yang.s@alibaba-inc.com
> Signed-off-by: Michal Hocko <mhocko@suse.com>

So not more than 10 dumps in each 5s interval. That looks reasonable
to me. By the time it starts dropping data you have more than enough
information to go on already.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Michal Hocko Oct. 12, 2018, 12:08 p.m. UTC | #4
On Fri 12-10-18 07:20:08, Johannes Weiner wrote:
> On Wed, Oct 10, 2018 at 05:11:35PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > syzbot has noticed that it can trigger RCU stalls from the memcg oom
> > path:
> > RIP: 0010:dump_stack+0x358/0x3ab lib/dump_stack.c:118
> > Code: 74 0c 48 c7 c7 f0 f5 31 89 e8 9f 0e 0e fa 48 83 3d 07 15 7d 01 00 0f
> > 84 63 fe ff ff e8 1c 89 c9 f9 48 8b bd 70 ff ff ff 57 9d <0f> 1f 44 00 00
> > e8 09 89 c9 f9 48 8b 8d 68 ff ff ff b8 ff ff 37 00
> > RSP: 0018:ffff88017d3a5c70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> > RAX: 0000000000040000 RBX: 1ffffffff1263ebe RCX: ffffc90001e5a000
> > RDX: 0000000000040000 RSI: ffffffff87b4e0f4 RDI: 0000000000000246
> > RBP: ffff88017d3a5d18 R08: ffff8801d7e02480 R09: fffffbfff13da030
> > R10: fffffbfff13da030 R11: 0000000000000003 R12: 1ffff1002fa74b96
> > R13: 00000000ffffffff R14: 0000000000000200 R15: 0000000000000000
> >   dump_header+0x27b/0xf72 mm/oom_kill.c:441
> >   out_of_memory.cold.30+0xf/0x184 mm/oom_kill.c:1109
> >   mem_cgroup_out_of_memory+0x15e/0x210 mm/memcontrol.c:1386
> >   mem_cgroup_oom mm/memcontrol.c:1701 [inline]
> >   try_charge+0xb7c/0x1710 mm/memcontrol.c:2260
> >   mem_cgroup_try_charge+0x627/0xe20 mm/memcontrol.c:5892
> >   mem_cgroup_try_charge_delay+0x1d/0xa0 mm/memcontrol.c:5907
> >   shmem_getpage_gfp+0x186b/0x4840 mm/shmem.c:1784
> >   shmem_fault+0x25f/0x960 mm/shmem.c:1982
> >   __do_fault+0x100/0x6b0 mm/memory.c:2996
> >   do_read_fault mm/memory.c:3408 [inline]
> >   do_fault mm/memory.c:3531 [inline]
> > 
> > The primary reason of the stall lies in an expensive printk handling
> > of oom report flood because a misconfiguration on the syzbot side
> > caused that there is simply no eligible task because they have
> > OOM_SCORE_ADJ_MIN set. This generates the oom report for each allocation
> > from the memcg context.
> > 
> > While normal workloads should be much more careful about potential heavy
> > memory consumers that are OOM disabled it makes some sense to rate limit
> > a potentially expensive oom reports for cases when there is no eligible
> > victim found. Do that by moving the rate limit logic inside dump_header.
> > We no longer rely on the caller to do that. It was only oom_kill_process
> > which has been throttling. Other two call sites simply didn't have to
> > care because one just paniced on the OOM when configured that way and
> > no eligible task would panic for the global case as well. Memcg changed
> > the picture because we do not panic and we might have multiple sources
> > of the same event.
> > 
> > Once we are here, make sure that the reason to trigger the OOM is
> > printed without ratelimiting because this is really valuable to
> > debug what happened.
> > 
> > Reported-by: syzbot+77e6b28a7a7106ad0def@syzkaller.appspotmail.com
> > Cc: guro@fb.com
> > Cc: hannes@cmpxchg.org
> > Cc: kirill.shutemov@linux.intel.com
> > Cc: linux-kernel@vger.kernel.org
> > Cc: penguin-kernel@i-love.sakura.ne.jp
> > Cc: rientjes@google.com
> > Cc: yang.s@alibaba-inc.com
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> So not more than 10 dumps in each 5s interval. That looks reasonable
> to me. By the time it starts dropping data you have more than enough
> information to go on already.

Yeah. Unless we have a storm coming from many different cgroups in
parallel. But even then we have the allocation context for each OOM so
we are not losing everything. Should we ever tune this, it can be done
later with some explicit examples.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks! I will post the patch to Andrew early next week.
Tetsuo Handa Oct. 12, 2018, 12:10 p.m. UTC | #5
On 2018/10/12 21:08, Michal Hocko wrote:
>> So not more than 10 dumps in each 5s interval. That looks reasonable
>> to me. By the time it starts dropping data you have more than enough
>> information to go on already.
> 
> Yeah. Unless we have a storm coming from many different cgroups in
> parallel. But even then we have the allocation context for each OOM so
> we are not losing everything. Should we ever tune this, it can be done
> later with some explicit examples.
> 
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Thanks! I will post the patch to Andrew early next week.
> 

How do you handle environments where one dump takes e.g. 3 seconds?
Counting delay between first message in previous dump and first message
in next dump is not safe. Unless we count delay between last message
in previous dump and first message in next dump, we cannot guarantee
that the system won't lockup due to printk() flooding.
Johannes Weiner Oct. 12, 2018, 12:41 p.m. UTC | #6
On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
> On 2018/10/12 21:08, Michal Hocko wrote:
> >> So not more than 10 dumps in each 5s interval. That looks reasonable
> >> to me. By the time it starts dropping data you have more than enough
> >> information to go on already.
> > 
> > Yeah. Unless we have a storm coming from many different cgroups in
> > parallel. But even then we have the allocation context for each OOM so
> > we are not losing everything. Should we ever tune this, it can be done
> > later with some explicit examples.
> > 
> >> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Thanks! I will post the patch to Andrew early next week.
> > 
> 
> How do you handle environments where one dump takes e.g. 3 seconds?
> Counting delay between first message in previous dump and first message
> in next dump is not safe. Unless we count delay between last message
> in previous dump and first message in next dump, we cannot guarantee
> that the system won't lockup due to printk() flooding.

How is that different from any other printk ratelimiting? If a dump
takes 3 seconds you need to fix your console. It doesn't make sense to
design KERN_INFO messages for the slowest serial consoles out there.

That's what we did, btw. We used to patch out the OOM header because
our serial console was so bad, but obviously that's not a generic
upstream solution. We've since changed the loglevel on the serial and
use netconsole[1] for the chattier loglevels.

[1] https://github.com/facebook/fbkutils/tree/master/netconsd
Tetsuo Handa Oct. 12, 2018, 12:58 p.m. UTC | #7
Calling printk() people. ;-)

On 2018/10/12 21:41, Johannes Weiner wrote:
> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
>> On 2018/10/12 21:08, Michal Hocko wrote:
>>>> So not more than 10 dumps in each 5s interval. That looks reasonable
>>>> to me. By the time it starts dropping data you have more than enough
>>>> information to go on already.
>>>
>>> Yeah. Unless we have a storm coming from many different cgroups in
>>> parallel. But even then we have the allocation context for each OOM so
>>> we are not losing everything. Should we ever tune this, it can be done
>>> later with some explicit examples.
>>>
>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>
>>> Thanks! I will post the patch to Andrew early next week.
>>>
>>
>> How do you handle environments where one dump takes e.g. 3 seconds?
>> Counting delay between first message in previous dump and first message
>> in next dump is not safe. Unless we count delay between last message
>> in previous dump and first message in next dump, we cannot guarantee
>> that the system won't lockup due to printk() flooding.
> 
> How is that different from any other printk ratelimiting? If a dump
> takes 3 seconds you need to fix your console. It doesn't make sense to
> design KERN_INFO messages for the slowest serial consoles out there.

You can't fix the console. It is a hardware limitation.

> 
> That's what we did, btw. We used to patch out the OOM header because
> our serial console was so bad, but obviously that's not a generic
> upstream solution. We've since changed the loglevel on the serial and
> use netconsole[1] for the chattier loglevels.
> 
> [1] https://github.com/facebook/fbkutils/tree/master/netconsd
>
Tetsuo Handa Oct. 13, 2018, 11:09 a.m. UTC | #8
On 2018/10/12 21:58, Tetsuo Handa wrote:
> On 2018/10/12 21:41, Johannes Weiner wrote:
>> On Fri, Oct 12, 2018 at 09:10:40PM +0900, Tetsuo Handa wrote:
>>> On 2018/10/12 21:08, Michal Hocko wrote:
>>>>> So not more than 10 dumps in each 5s interval. That looks reasonable
>>>>> to me. By the time it starts dropping data you have more than enough
>>>>> information to go on already.

Not reasonable at all.

>>>>
>>>> Yeah. Unless we have a storm coming from many different cgroups in
>>>> parallel. But even then we have the allocation context for each OOM so
>>>> we are not losing everything. Should we ever tune this, it can be done
>>>> later with some explicit examples.
>>>>
>>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>>
>>>> Thanks! I will post the patch to Andrew early next week.
>>>>

One thread from one cgroup is sufficient. I don't think that Michal's patch
is an appropriate mitigation. It still needlessly floods kernel log buffer
and significantly defers recovery operation.

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

---------- Testcase ----------

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	char *buf = malloc(size);
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size / 2);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	fp = fopen("/proc/self/oom_score_adj", "w");
	fprintf(fp, "-1000\n");
	fclose(fp);
	fp = fopen("/dev/zero", "r");
	fread(buf, 1, size, fp);
	fclose(fp);
	return 0;
}

---------- Michal's patch ----------

73133 lines (5.79MB) of kernel messages per one run

[root@ccsecurity ~]# time ./a.out

real    3m44.389s
user    0m0.000s
sys     3m42.334s

[root@ccsecurity ~]# time ./a.out

real    3m41.767s
user    0m0.004s
sys     3m39.779s

---------- My v2 patch ----------

50 lines (3.40 KB) of kernel messages per one run

[root@ccsecurity ~]# time ./a.out

real    0m5.227s
user    0m0.000s
sys     0m4.950s

[root@ccsecurity ~]# time ./a.out

real    0m5.249s
user    0m0.000s
sys     0m4.956s
Johannes Weiner Oct. 13, 2018, 11:22 a.m. UTC | #9
On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
> ---------- Michal's patch ----------
> 
> 73133 lines (5.79MB) of kernel messages per one run
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    3m44.389s
> user    0m0.000s
> sys     3m42.334s
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    3m41.767s
> user    0m0.004s
> sys     3m39.779s
> 
> ---------- My v2 patch ----------
> 
> 50 lines (3.40 KB) of kernel messages per one run
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    0m5.227s
> user    0m0.000s
> sys     0m4.950s
> 
> [root@ccsecurity ~]# time ./a.out
> 
> real    0m5.249s
> user    0m0.000s
> sys     0m4.956s

Your patch is suppressing information that I want to have and my
console can handle, just because your console is slow, even though
there is no need to use that console at that log level.

NAK to your patch. I think you're looking at this from the wrong
angle. A console that takes almost 4 minutes to print 70k lines
shouldn't be the baseline for how verbose KERN_INFO is.
Tetsuo Handa Oct. 13, 2018, 11:28 a.m. UTC | #10
On 2018/10/13 20:22, Johannes Weiner wrote:
> On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
>> ---------- Michal's patch ----------
>>
>> 73133 lines (5.79MB) of kernel messages per one run
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real    3m44.389s
>> user    0m0.000s
>> sys     3m42.334s
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real    3m41.767s
>> user    0m0.004s
>> sys     3m39.779s
>>
>> ---------- My v2 patch ----------
>>
>> 50 lines (3.40 KB) of kernel messages per one run
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real    0m5.227s
>> user    0m0.000s
>> sys     0m4.950s
>>
>> [root@ccsecurity ~]# time ./a.out
>>
>> real    0m5.249s
>> user    0m0.000s
>> sys     0m4.956s
> 
> Your patch is suppressing information that I want to have and my
> console can handle, just because your console is slow, even though
> there is no need to use that console at that log level.

My patch is not suppressing information you want to have.
My patch is mainly suppressing

[   52.393146] Out of memory and no killable processes...
[   52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.398623] Out of memory and no killable processes...
[   52.401195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.404356] Out of memory and no killable processes...
[   52.406492] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.409595] Out of memory and no killable processes...
[   52.411745] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.415588] Out of memory and no killable processes...
[   52.418484] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   52.421904] Out of memory and no killable processes...
[   52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000

lines which Michal's patch cannot suppress.

Also, my console is console=ttyS0,115200n8 . Not slow at all.

> 
> NAK to your patch. I think you're looking at this from the wrong
> angle. A console that takes almost 4 minutes to print 70k lines
> shouldn't be the baseline for how verbose KERN_INFO is.
> 

Run the testcase in your environment.
Michal Hocko Oct. 15, 2018, 8:19 a.m. UTC | #11
On Sat 13-10-18 20:28:38, Tetsuo Handa wrote:
> On 2018/10/13 20:22, Johannes Weiner wrote:
> > On Sat, Oct 13, 2018 at 08:09:30PM +0900, Tetsuo Handa wrote:
> >> ---------- Michal's patch ----------
> >>
> >> 73133 lines (5.79MB) of kernel messages per one run
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real    3m44.389s
> >> user    0m0.000s
> >> sys     3m42.334s
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real    3m41.767s
> >> user    0m0.004s
> >> sys     3m39.779s
> >>
> >> ---------- My v2 patch ----------
> >>
> >> 50 lines (3.40 KB) of kernel messages per one run
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real    0m5.227s
> >> user    0m0.000s
> >> sys     0m4.950s
> >>
> >> [root@ccsecurity ~]# time ./a.out
> >>
> >> real    0m5.249s
> >> user    0m0.000s
> >> sys     0m4.956s
> > 
> > Your patch is suppressing information that I want to have and my
> > console can handle, just because your console is slow, even though
> > there is no need to use that console at that log level.
> 
> My patch is not suppressing information you want to have.
> My patch is mainly suppressing
> 
> [   52.393146] Out of memory and no killable processes...
> [   52.395195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.398623] Out of memory and no killable processes...
> [   52.401195] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.404356] Out of memory and no killable processes...
> [   52.406492] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.409595] Out of memory and no killable processes...
> [   52.411745] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.415588] Out of memory and no killable processes...
> [   52.418484] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> [   52.421904] Out of memory and no killable processes...
> [   52.424273] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
> 
> lines which Michal's patch cannot suppress.

This was a deliberate decision because the allocation failure context is
usually a useful information to get. If this is killing a reasonably
configured machine then we can move the ratelimit up and suppress that
information. This will always be cost vs. benefit decision. And as such
it should be argued in the changelog.

As so many dozens of times before, I will point you to an incremental
nature of changes we really prefer in the mm land. We are also after a
simplicity which your proposal lacks in many aspects. You seem to ignore
that general approach and I have hard time to consider your NAK as a
relevant feedback. Going to an extreme and basing a complex solution on
it is not going to fly. No killable process should be a rare event which
requires a seriously misconfigured memcg to happen so wildly. If you can
trigger it with a normal user privileges then it would be a clear bug to
address rather than work around with printk throttling.
Tetsuo Handa Oct. 15, 2018, 10:57 a.m. UTC | #12
On 2018/10/15 17:19, Michal Hocko wrote:
> As so many dozens of times before, I will point you to an incremental
> nature of changes we really prefer in the mm land. We are also after a
> simplicity which your proposal lacks in many aspects. You seem to ignore
> that general approach and I have hard time to consider your NAK as a
> relevant feedback. Going to an extreme and basing a complex solution on
> it is not going to fly. No killable process should be a rare event which
> requires a seriously misconfigured memcg to happen so wildly. If you can
> trigger it with a normal user privileges then it would be a clear bug to
> address rather than work around with printk throttling.
> 

I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
with a normal user privileges. This is a lot of needless noise/delay.
No killable process is not a rare event, even without root privileges.

[root@ccsecurity kumaneko]# time ./a.out
Killed

real    0m2.396s
user    0m0.000s
sys     0m2.970s
[root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
202
[root@ccsecurity ~]# dmesg | wc
    942    7335   70716
[root@ccsecurity ~]#

----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE 4 * 10485760
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
static int memory_eater(void *unused)
{
        int fd = open("/dev/zero", O_RDONLY);
        char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
                         MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
        read(pipe_fd[0], buf, 1);
        read(fd, buf, MMAPSIZE);
        pause();
        return 0;
}
int main(int argc, char *argv[])
{
        int i;
        char *stack;
        FILE *fp;
        const unsigned long size = 1048576 * 200;
        mkdir("/sys/fs/cgroup/memory/test1", 0755);
        fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
        fprintf(fp, "%lu\n", size);
        fclose(fp);
        fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
        fprintf(fp, "%u\n", getpid());
        fclose(fp);
        if (setgid(-2) || setuid(-2))
                return 1;
        stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
                     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
        for (i = 0; i < NUMTHREADS; i++)
                if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
                          CLONE_SIGHAND | CLONE_THREAD | CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1)
                        break;
        sleep(1);
        close(pipe_fd[1]);
        pause();
        return 0;
}
----------------------------------------
Michal Hocko Oct. 15, 2018, 11:24 a.m. UTC | #13
On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> On 2018/10/15 17:19, Michal Hocko wrote:
> > As so many dozens of times before, I will point you to an incremental
> > nature of changes we really prefer in the mm land. We are also after a
> > simplicity which your proposal lacks in many aspects. You seem to ignore
> > that general approach and I have hard time to consider your NAK as a
> > relevant feedback. Going to an extreme and basing a complex solution on
> > it is not going to fly. No killable process should be a rare event which
> > requires a seriously misconfigured memcg to happen so wildly. If you can
> > trigger it with a normal user privileges then it would be a clear bug to
> > address rather than work around with printk throttling.
> > 
> 
> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> with a normal user privileges. This is a lot of needless noise/delay.

I am pretty sure you have understood the part of my message you have
chosen to not quote where I have said that the specific rate limitting
decisions can be changed based on reasonable configurations. There is
absolutely zero reason to NAK a natural decision to unify the throttling
and cook a per-memcg way for a very specific path instead.

> No killable process is not a rare event, even without root privileges.
>
> [root@ccsecurity kumaneko]# time ./a.out
> Killed
> 
> real    0m2.396s
> user    0m0.000s
> sys     0m2.970s
> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> 202
> [root@ccsecurity ~]# dmesg | wc
>     942    7335   70716

OK, so this is 70kB worth of data pushed throug the console. Is this
really killing any machine?
Tetsuo Handa Oct. 15, 2018, 12:47 p.m. UTC | #14
On 2018/10/15 20:24, Michal Hocko wrote:
> On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
>> On 2018/10/15 17:19, Michal Hocko wrote:
>>> As so many dozens of times before, I will point you to an incremental
>>> nature of changes we really prefer in the mm land. We are also after a
>>> simplicity which your proposal lacks in many aspects. You seem to ignore
>>> that general approach and I have hard time to consider your NAK as a
>>> relevant feedback. Going to an extreme and basing a complex solution on
>>> it is not going to fly. No killable process should be a rare event which
>>> requires a seriously misconfigured memcg to happen so wildly. If you can
>>> trigger it with a normal user privileges then it would be a clear bug to
>>> address rather than work around with printk throttling.
>>>
>>
>> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
>> with a normal user privileges. This is a lot of needless noise/delay.
> 
> I am pretty sure you have understood the part of my message you have
> chosen to not quote where I have said that the specific rate limitting
> decisions can be changed based on reasonable configurations. There is
> absolutely zero reason to NAK a natural decision to unify the throttling
> and cook a per-memcg way for a very specific path instead.
> 
>> No killable process is not a rare event, even without root privileges.
>>
>> [root@ccsecurity kumaneko]# time ./a.out
>> Killed
>>
>> real    0m2.396s
>> user    0m0.000s
>> sys     0m2.970s
>> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
>> 202
>> [root@ccsecurity ~]# dmesg | wc
>>     942    7335   70716
> 
> OK, so this is 70kB worth of data pushed throug the console. Is this
> really killing any machine?
> 

Nobody can prove that it never kills some machine. This is just one example result of
one example stress tried in my environment. Since I am secure programming man from security
subsystem, I really hate your "Can you trigger it?" resistance. Since this is OOM path
where nobody tests, starting from being prepared for the worst case keeps things simple.
Michal Hocko Oct. 15, 2018, 1:35 p.m. UTC | #15
On Mon 15-10-18 21:47:08, Tetsuo Handa wrote:
> On 2018/10/15 20:24, Michal Hocko wrote:
> > On Mon 15-10-18 19:57:35, Tetsuo Handa wrote:
> >> On 2018/10/15 17:19, Michal Hocko wrote:
> >>> As so many dozens of times before, I will point you to an incremental
> >>> nature of changes we really prefer in the mm land. We are also after a
> >>> simplicity which your proposal lacks in many aspects. You seem to ignore
> >>> that general approach and I have hard time to consider your NAK as a
> >>> relevant feedback. Going to an extreme and basing a complex solution on
> >>> it is not going to fly. No killable process should be a rare event which
> >>> requires a seriously misconfigured memcg to happen so wildly. If you can
> >>> trigger it with a normal user privileges then it would be a clear bug to
> >>> address rather than work around with printk throttling.
> >>>
> >>
> >> I can trigger 200+ times / 900+ lines / 69KB+ of needless OOM messages
> >> with a normal user privileges. This is a lot of needless noise/delay.
> > 
> > I am pretty sure you have understood the part of my message you have
> > chosen to not quote where I have said that the specific rate limitting
> > decisions can be changed based on reasonable configurations. There is
> > absolutely zero reason to NAK a natural decision to unify the throttling
> > and cook a per-memcg way for a very specific path instead.
> > 
> >> No killable process is not a rare event, even without root privileges.
> >>
> >> [root@ccsecurity kumaneko]# time ./a.out
> >> Killed
> >>
> >> real    0m2.396s
> >> user    0m0.000s
> >> sys     0m2.970s
> >> [root@ccsecurity ~]# dmesg | grep 'no killable' | wc -l
> >> 202
> >> [root@ccsecurity ~]# dmesg | wc
> >>     942    7335   70716
> > 
> > OK, so this is 70kB worth of data pushed throug the console. Is this
> > really killing any machine?
> > 
> 
> Nobody can prove that it never kills some machine. This is just one example result of
> one example stress tried in my environment. Since I am secure programming man from security
> subsystem, I really hate your "Can you trigger it?" resistance. Since this is OOM path
> where nobody tests, starting from being prepared for the worst case keeps things simple.

There is simply no way to be generally safe this kind of situation. As
soon as your console is so slow that you cannot push the oom report
through there is only one single option left and that is to disable the
oom report altogether. And that might be a viable option. But fiddling
with per memcg limit is not going to fly. Just realize what will happen
if you have hundreds of different memcgs triggering this path around the
same time.

So can you start being reasonable and try to look at a wider picture
finally please?
Tetsuo Handa Oct. 16, 2018, 12:55 a.m. UTC | #16
On 2018/10/15 22:35, Michal Hocko wrote:
>> Nobody can prove that it never kills some machine. This is just one example result of
>> one example stress tried in my environment. Since I am secure programming man from security
>> subsystem, I really hate your "Can you trigger it?" resistance. Since this is OOM path
>> where nobody tests, starting from being prepared for the worst case keeps things simple.
> 
> There is simply no way to be generally safe this kind of situation. As
> soon as your console is so slow that you cannot push the oom report
> through there is only one single option left and that is to disable the
> oom report altogether. And that might be a viable option.

There is a way to be safe this kind of situation. The way is to make sure that printk()
is called with enough interval. That is, count the interval between the end of previous
printk() messages and the beginning of next printk() messages.

And you are misunderstanding my patch. Although my patch does not ratelimit the first
occurrence of memcg OOM in each memcg domain (because the first

 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");

output is usually a useful information to get) which is serialized by oom_lock mutex,
my patch cannot cause lockup because my patch ratelimits subsequent outputs in any
memcg domain. That is, my patch might cause

  "** %u printk messages dropped **\n"

when we have hundreds of different memcgs triggering this path around the same time,
my patch refrains from "keep disturbing administrator's manual recovery operation from
console by printing

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

on each page fault event from hundreds of different memcgs triggering this path".

There is no need to print

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

lines on evey page fault event. A kernel which consumes multiple milliseconds on each page
fault event (due to printk() messages from the defunctional OOM killer) is stupid.

>                                                           But fiddling
> with per memcg limit is not going to fly. Just realize what will happen
> if you have hundreds of different memcgs triggering this path around the
> same time.

You have just said that "No killable process should be a rare event which
requires a seriously misconfigured memcg to happen so wildly." and now you
refer to a very bad case "Just realize what will happen if you have hundreds
of different memcgs triggering this path around the same time." which makes
your former comment suspicious.

> 
> So can you start being reasonable and try to look at a wider picture
> finally please?
> 

Honestly, I can't look at a wider picture because I have never been shown a picture from you.
What we are doing is endless loop of "let's do ... because ..." and "hmm, our assumption
was wrong because ..."; that is, making changes without firstly considering the worst case.
For example, OOM victims which David Rientjes is complaining is that our assumption that
"__oom_reap_task_mm() can reclaim majority of memory" was wrong. (And your proposal to
hand over is getting no response.) For another example, __set_oom_adj() which Yong-Taek Lee
is trying to optimize is that our assumption that "we already succeeded enforcing same
oom_score_adj among multiple thread groups" was wrong. For yet another example,
CVE-2018-1000200 and CVE-2016-10723 are caused by ignoring my concern. And funny thing
is that you negated the rationale of "mm, oom: remove sleep from under oom_lock" by
"mm, oom: remove oom_lock from oom_reaper" after only 4 days...

Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with simplified
one shown below (because you don't like per memcg limit).

---
 mm/oom_kill.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..9056f9b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1106,6 +1106,11 @@ bool out_of_memory(struct oom_control *oc)
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
+		static unsigned long last_warned;
+
+		if ((is_sysrq_oom(oc) || is_memcg_oom(oc)) &&
+		    time_in_range(jiffies, last_warned, last_warned + 60 * HZ))
+			return false;
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*
@@ -1115,6 +1120,7 @@ bool out_of_memory(struct oom_control *oc)
 		 */
 		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
 			panic("System is deadlocked on memory\n");
+		last_warned = jiffies;
 	}
 	if (oc->chosen && oc->chosen != (void *)-1UL)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
Michal Hocko Oct. 16, 2018, 9:20 a.m. UTC | #17
On Tue 16-10-18 09:55:06, Tetsuo Handa wrote:
> On 2018/10/15 22:35, Michal Hocko wrote:
> >> Nobody can prove that it never kills some machine. This is just one example result of
> >> one example stress tried in my environment. Since I am secure programming man from security
> >> subsystem, I really hate your "Can you trigger it?" resistance. Since this is OOM path
> >> where nobody tests, starting from being prepared for the worst case keeps things simple.
> > 
> > There is simply no way to be generally safe this kind of situation. As
> > soon as your console is so slow that you cannot push the oom report
> > through there is only one single option left and that is to disable the
> > oom report altogether. And that might be a viable option.
> 
> There is a way to be safe this kind of situation. The way is to make sure that printk()
> is called with enough interval. That is, count the interval between the end of previous
> printk() messages and the beginning of next printk() messages.

You are simply wrong. Because any interval is meaningless without
knowing the printk throughput.

[...]

> lines on evey page fault event. A kernel which consumes multiple milliseconds on each page
> fault event (due to printk() messages from the defunctional OOM killer) is stupid.

Not if it represent an unusual situation where there is no eligible
task available. Because this is an exceptional case where the cost of
the printk is simply not relevant.

[...]

I am sorry to skip large part of your message but this discussion, like
many others, doesn't lead anywhere. You simply refuse to understand
some of the core assumptions in this area.

> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with simplified
> one shown below (because you don't like per memcg limit).

My patch is adding a rate-limit! I really fail to see why we need yet
another one on top of it. This is just ridiculous. I can see reasons to
tune that rate limit but adding 2 different mechanisms is just wrong.

If your NAK to unify the ratelimit for dump_header for all paths
still holds then I do not care too much to push it forward. But I find
thiis way of the review feedback counter productive.
Tetsuo Handa Oct. 16, 2018, 11:05 a.m. UTC | #18
On 2018/10/16 18:20, Michal Hocko wrote:
>> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with simplified
>> one shown below (because you don't like per memcg limit).
> 
> My patch is adding a rate-limit! I really fail to see why we need yet
> another one on top of it. This is just ridiculous. I can see reasons to
> tune that rate limit but adding 2 different mechanisms is just wrong.
> 
> If your NAK to unify the ratelimit for dump_header for all paths
> still holds then I do not care too much to push it forward. But I find
> thiis way of the review feedback counter productive.
> 

Your patch is _NOT_ adding a rate-limit for

  "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
  "Out of memory and no killable processes...\n"

lines!

[   97.519229] Out of memory and no killable processes...
[   97.522060] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.525507] Out of memory and no killable processes...
[   97.528817] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.532345] Out of memory and no killable processes...
[   97.534813] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.538270] Out of memory and no killable processes...
[   97.541449] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.546268] Out of memory and no killable processes...
[   97.548823] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.552399] Out of memory and no killable processes...
[   97.554939] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.558616] Out of memory and no killable processes...
[   97.562257] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.565998] Out of memory and no killable processes...
[   97.568642] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.572169] Out of memory and no killable processes...
[   97.575200] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.579357] Out of memory and no killable processes...
[   97.581912] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.585414] Out of memory and no killable processes...
[   97.589191] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.593586] Out of memory and no killable processes...
[   97.596527] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.600118] Out of memory and no killable processes...
[   97.603237] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.606837] Out of memory and no killable processes...
[   97.611550] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.615244] Out of memory and no killable processes...
[   97.617859] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.621634] Out of memory and no killable processes...
[   97.624884] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.629256] Out of memory and no killable processes...
[   97.631885] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.635367] Out of memory and no killable processes...
[   97.638033] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.641827] Out of memory and no killable processes...
[   97.641993] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.648453] Out of memory and no killable processes...
[   97.651481] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.655082] Out of memory and no killable processes...
[   97.657941] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.663036] Out of memory and no killable processes...
[   97.665890] a.out invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
[   97.669473] Out of memory and no killable processes...

We don't need to print these lines every few milliseconds. Even if an exceptional case,
this is a DoS for console users. Printing once (or a few times) per a minute will be
enough. Otherwise, users cannot see what they are typing and what are printed.
Michal Hocko Oct. 16, 2018, 11:17 a.m. UTC | #19
On Tue 16-10-18 20:05:47, Tetsuo Handa wrote:
> On 2018/10/16 18:20, Michal Hocko wrote:
> >> Anyway, I'm OK if we apply _BOTH_ your patch and my patch. Or I'm OK with simplified
> >> one shown below (because you don't like per memcg limit).
> > 
> > My patch is adding a rate-limit! I really fail to see why we need yet
> > another one on top of it. This is just ridiculous. I can see reasons to
> > tune that rate limit but adding 2 different mechanisms is just wrong.
> > 
> > If your NAK to unify the ratelimit for dump_header for all paths
> > still holds then I do not care too much to push it forward. But I find
> > thiis way of the review feedback counter productive.
> > 
> 
> Your patch is _NOT_ adding a rate-limit for
> 
>   "%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n"
>   "Out of memory and no killable processes...\n"
> 
> lines!

And I've said I do not have objections to have an _incremental_ patch to
move the ratelimit up with a clear cost/benefit evaluation.
diff mbox series

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..4ee393c85e27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -430,6 +430,9 @@  static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 
 static void dump_header(struct oom_control *oc, struct task_struct *p)
 {
+	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
+					      DEFAULT_RATELIMIT_BURST);
+
 	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask,
 		nodemask_pr_args(oc->nodemask), oc->order,
@@ -437,6 +440,9 @@  static void dump_header(struct oom_control *oc, struct task_struct *p)
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
+	if (!__ratelimit(&oom_rs))
+		return;
+
 	cpuset_print_current_mems_allowed();
 	dump_stack();
 	if (is_memcg_oom(oc))
@@ -931,8 +937,6 @@  static void oom_kill_process(struct oom_control *oc, const char *message)
 	struct task_struct *t;
 	struct mem_cgroup *oom_group;
 	unsigned int victim_points = 0;
-	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
-					      DEFAULT_RATELIMIT_BURST);
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -949,8 +953,7 @@  static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(p);
 
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, p);
+	dump_header(oc, p);
 
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);