
Add the memcg print oom info for system oom

Message ID 1526540428-12178-1-git-send-email-ufo19890607@gmail.com (mailing list archive)
State New, archived

Commit Message

禹舟键 May 17, 2018, 7 a.m. UTC
From: yuzhoujian <yuzhoujian@didichuxing.com>

The dump_header does not print the memcg's name when a system-wide
oom happens. Some users want to locate the container which contains
the task that has been killed by the oom killer. So add a
mem_cgroup_print_oom_info call when a system-wide oom event happens.

Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
---
 mm/oom_kill.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Michal Hocko May 17, 2018, 7:11 a.m. UTC | #1
On Thu 17-05-18 08:00:28, ufo19890607 wrote:
> From: yuzhoujian <yuzhoujian@didichuxing.com>
> 
> The dump_header does not print the memcg's name when a system-wide
> oom happens. Some users want to locate the container which contains
> the task that has been killed by the oom killer. So add a
> mem_cgroup_print_oom_info call when a system-wide oom event happens.

The oom report is quite heavy today. Do we really need the full memcg
oom report here? Wouldn't it be sufficient to print the memcg the task
belongs to?

> Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
> ---
>  mm/oom_kill.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8ba6cb88cf58..244416c9834a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -433,6 +433,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
>  	if (is_memcg_oom(oc))
>  		mem_cgroup_print_oom_info(oc->memcg, p);
>  	else {
> +		mem_cgroup_print_oom_info(mem_cgroup_from_task(p), p);
>  		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
>  		if (is_dump_unreclaim_slabs())
>  			dump_unreclaimable_slab();
> -- 
> 2.14.1
>
禹舟键 May 17, 2018, 9:44 a.m. UTC | #2
Hi Michal
I think the current OOM report is incomplete. I can get the task which
invoked the oom-killer, the task which has been killed by the oom-killer,
and the memory info when the oom happened. But I cannot infer the memcg
to which the killed task belonged, because that task has already been
killed, and dump_tasks will print all of the tasks in the system.

mem_cgroup_print_oom_info will print five lines of content, including the
memcg's name, usage and limit. I don't think five lines of content will
cause a big problem. Or it could at least print the memcg's name.

Thanks
Wind

Michal Hocko <mhocko@kernel.org> wrote on Thursday, May 17, 2018 at 3:11 PM:

> On Thu 17-05-18 08:00:28, ufo19890607 wrote:
> > From: yuzhoujian <yuzhoujian@didichuxing.com>
> >
> > The dump_header does not print the memcg's name when a system-wide
> > oom happens. Some users want to locate the container which contains
> > the task that has been killed by the oom killer. So add a
> > mem_cgroup_print_oom_info call when a system-wide oom event happens.
>
> The oom report is quite heavy today. Do we really need the full memcg
> oom report here? Wouldn't it be sufficient to print the memcg the task
> belongs to?
>
> > Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
> > ---
> >  mm/oom_kill.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 8ba6cb88cf58..244416c9834a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -433,6 +433,7 @@ static void dump_header(struct oom_control *oc,
> struct task_struct *p)
> >       if (is_memcg_oom(oc))
> >               mem_cgroup_print_oom_info(oc->memcg, p);
> >       else {
> > +             mem_cgroup_print_oom_info(mem_cgroup_from_task(p), p);
> >               show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
> >               if (is_dump_unreclaim_slabs())
> >                       dump_unreclaimable_slab();
> > --
> > 2.14.1
> >
>
> --
> Michal Hocko
> SUSE Labs
>
Michal Hocko May 17, 2018, 10:23 a.m. UTC | #3
On Thu 17-05-18 17:44:43, 禹舟键 wrote:
> Hi Michal
> I think the current OOM report is incomplete. I can get the task which
> invoked the oom-killer, the task which has been killed by the oom-killer,
> and the memory info when the oom happened. But I cannot infer the memcg
> to which the killed task belonged, because that task has already been
> killed, and dump_tasks will print all of the tasks in the system.

I can see how the origin memcg might be useful, but ...
> 
> mem_cgroup_print_oom_info will print five lines of content, including the
> memcg's name, usage and limit. I don't think five lines of content will
> cause a big problem. Or it could at least print the memcg's name.

this is not 5 lines at all. We dump memcg stats for the whole oom memcg
subtree. For your patch it would be the whole subtree of the memcg of
the oom victim. With cgroup v1 this can be quite deep as tasks can
belong to inter-nodes as well. Would be

		pr_info("Task in ");
		pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
		pr_cont(" killed as a result of limit of ");

part of that output sufficient for your usecase? You will not get memory
consumption of the group but is that really so relevant when we are
killing individual tasks? Please note that there are proposals to make
the global oom killer memcg aware and select by the memcg size rather
than pick on random tasks
(http://lkml.kernel.org/r/20171130152824.1591-1-guro@fb.com). Maybe that
will be more interesting for your container usecase.
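
A minimal sketch (illustrative only, not the posted patch) of how that
one-line alternative could look in dump_header()'s system-wide branch,
reusing only helpers that already appear elsewhere in this thread:

	if (is_memcg_oom(oc))
		mem_cgroup_print_oom_info(oc->memcg, p);
	else {
		/* system-wide oom: name the victim's memcg on one line
		 * instead of dumping stats for a whole memcg subtree */
		pr_info("Task in ");
		pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
		pr_cont(" killed as a result of a system-wide OOM\n");
		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
		if (is_dump_unreclaim_slabs())
			dump_unreclaimable_slab();
	}
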
Roman Gushchin May 17, 2018, 10:42 a.m. UTC | #4
On Thu, May 17, 2018 at 12:23:30PM +0200, Michal Hocko wrote:
> On Thu 17-05-18 17:44:43, 禹舟键 wrote:
> > Hi Michal
> > I think the current OOM report is incomplete. I can get the task which
> > invoked the oom-killer, the task which has been killed by the oom-killer,
> > and the memory info when the oom happened. But I cannot infer the memcg
> > to which the killed task belonged, because that task has already been
> > killed, and dump_tasks will print all of the tasks in the system.
> 
> I can see how the origin memcg might be useful, but ...
> > 
> > mem_cgroup_print_oom_info will print five lines of content, including the
> > memcg's name, usage and limit. I don't think five lines of content will
> > cause a big problem. Or it could at least print the memcg's name.

I only want to add here that if a system-wide OOM is a rare event, you can
look at the per-cgroup oom counters to find the cgroup which contained the
killed task. Not super handy, but it might work for debugging purposes.

> this is not 5 lines at all. We dump memcg stats for the whole oom memcg
> subtree. For your patch it would be the whole subtree of the memcg of
> the oom victim. With cgroup v1 this can be quite deep as tasks can
> belong to inter-nodes as well. Would be
> 
> 		pr_info("Task in ");
> 		pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
> 		pr_cont(" killed as a result of limit of ");
> 
> part of that output sufficient for your usecase? You will not get memory
> consumption of the group but is that really so relevant when we are
> killing individual tasks? Please note that there are proposals to make
> the global oom killer memcg aware and select by the memcg size rather
> than pick on random tasks
> (http://lkml.kernel.org/r/20171130152824.1591-1-guro@fb.com). Maybe that
> will be more interesting for your container usecase.

Speaking about memcg OOM reports more broadly, IMO rather than spamming
dmesg with memcg-local OOM dumps, it would be better to add a new
interface for reading memcg-specific OOM reports.

The current dmesg OOM report contains a lot of low-level stuff,
which is handy for debugging system-wide OOM issues,
and memcg-aware stuff too; that makes it bulky.

Anyway, Michal's 1-line proposal looks quite acceptable to me.

Thanks!
kernel test robot May 19, 2018, 9:18 p.m. UTC | #5
Hi yuzhoujian,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.17-rc5 next-20180517]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/ufo19890607/Add-the-memcg-print-oom-info-for-system-oom/20180520-041924
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-x003-201820 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   mm/oom_kill.c: In function 'dump_header':
>> mm/oom_kill.c:436:29: error: implicit declaration of function 'mem_cgroup_from_task'; did you mean 'mem_cgroup_from_id'? [-Werror=implicit-function-declaration]
      mem_cgroup_print_oom_info(mem_cgroup_from_task(p), p);
                                ^~~~~~~~~~~~~~~~~~~~
                                mem_cgroup_from_id
>> mm/oom_kill.c:436:29: warning: passing argument 1 of 'mem_cgroup_print_oom_info' makes pointer from integer without a cast [-Wint-conversion]
   In file included from include/linux/swap.h:9:0,
                    from mm/oom_kill.c:28:
   include/linux/memcontrol.h:932:1: note: expected 'struct mem_cgroup *' but argument is of type 'int'
    mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
    ^~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +436 mm/oom_kill.c

   421	
   422	static void dump_header(struct oom_control *oc, struct task_struct *p)
   423	{
   424		pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n",
   425			current->comm, oc->gfp_mask, &oc->gfp_mask,
   426			nodemask_pr_args(oc->nodemask), oc->order,
   427				current->signal->oom_score_adj);
   428		if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
   429			pr_warn("COMPACTION is disabled!!!\n");
   430	
   431		cpuset_print_current_mems_allowed();
   432		dump_stack();
   433		if (is_memcg_oom(oc))
   434			mem_cgroup_print_oom_info(oc->memcg, p);
   435		else {
 > 436			mem_cgroup_print_oom_info(mem_cgroup_from_task(p), p);
   437			show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
   438			if (is_dump_unreclaim_slabs())
   439				dump_unreclaimable_slab();
   440		}
   441		if (sysctl_oom_dump_tasks)
   442			dump_tasks(oc->memcg, oc->nodemask);
   443	}
   444	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
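
The build error above suggests that mem_cgroup_from_task() has no
declaration visible to mm/oom_kill.c in this configuration. One possible
follow-up (a sketch only; the exact placement and CONFIG guards are
assumptions, not part of this series) would be to expose a declaration
plus a no-op stub in include/linux/memcontrol.h, mirroring how other memcg
helpers in that header degrade when CONFIG_MEMCG is off:

	/* sketch: hypothetical hunk for include/linux/memcontrol.h */
	#ifdef CONFIG_MEMCG
	struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
	#else
	static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
	{
		return NULL;
	}
	#endif

This only addresses the build failure reported here; whether the call
itself is wanted is the subject of the rest of the thread.
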
David Rientjes May 21, 2018, 9:11 p.m. UTC | #6
On Thu, 17 May 2018, Michal Hocko wrote:

> this is not 5 lines at all. We dump memcg stats for the whole oom memcg
> subtree. For your patch it would be the whole subtree of the memcg of
> the oom victim. With cgroup v1 this can be quite deep as tasks can
> belong to inter-nodes as well. Would be
> 
> 		pr_info("Task in ");
> 		pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
> 		pr_cont(" killed as a result of limit of ");
> 
> part of that output sufficient for your usecase?

There's no memcg to print as the limit in the above, but it does seem like 
the single line output is all that is needed in this case.

It might be useful to discuss a single line output that specifies relevant 
information about the context of the oom kill, the killed thread, and the 
memcg of that thread, in a way that will be backwards compatible.  The 
messages in the oom killer have been restructured over time; I don't 
believe there is a backwards compatible way to search for an oom event in 
the kernel log.

I've had success with defining a single line output that includes the 
CONSTRAINT_* of the oom kill, the origin and kill memcgs, the thread name, 
pid, and uid.  On system oom kills, origin and kill memcgs are left empty.

oom-kill constraint=CONSTRAINT_* origin_memcg=<memcg> kill_memcg=<memcg> task=<comm> pid=<pid> uid=<uid>

Perhaps we should introduce a single line output that will be backwards 
compatible that includes this information?
Michal Hocko May 22, 2018, 6:37 a.m. UTC | #7
On Mon 21-05-18 14:11:21, David Rientjes wrote:
> On Thu, 17 May 2018, Michal Hocko wrote:
> 
> > this is not 5 lines at all. We dump memcg stats for the whole oom memcg
> > subtree. For your patch it would be the whole subtree of the memcg of
> > the oom victim. With cgroup v1 this can be quite deep as tasks can
> > belong to inter-nodes as well. Would be
> > 
> > 		pr_info("Task in ");
> > 		pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
> > 		pr_cont(" killed as a result of limit of ");
> > 
> > part of that output sufficient for your usecase?
> 
> There's no memcg to print as the limit in the above, but it does seem like 
> the single line output is all that is needed in this case.

Yeah, that is exactly what I was proposing. I just copy&pasted the whole
part to make it clear which part of mem_cgroup_print_oom_info I meant.
Referring to "killed as a result of limit of" was misleading. Sorry about
that.

> It might be useful to discuss a single line output that specifies relevant 
> information about the context of the oom kill, the killed thread, and the 
> memcg of that thread, in a way that will be backwards compatible.  The 
> messages in the oom killer have been restructured over time, I don't 
> believe there is a backwards compatible way to search for an oom event in 
> the kernel log.

Agreed
 
> I've had success with defining a single line output that includes the 
> CONSTRAINT_* of the oom kill, the origin and kill memcgs, the thread name, 
> pid, and uid.  On system oom kills, origin and kill memcgs are left empty.
> 
> oom-kill constraint=CONSTRAINT_* origin_memcg=<memcg> kill_memcg=<memcg> task=<comm> pid=<pid> uid=<uid>
> 
> Perhaps we should introduce a single line output that will be backwards 
> compatible that includes this information?

I do not have a strong preference here. We already print cpuset on its
own line and we can do the same for the memcg.
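
A tiny sketch of that idea (illustrative only): a dedicated memcg line in
dump_header(), next to the existing cpuset_print_current_mems_allowed()
call, where the "oom_memcg=" label is just an arbitrary name chosen for
this example:

	/* sketch: name the victim's memcg on its own line, like the cpuset line */
	pr_info("oom_memcg=");
	pr_cont_cgroup_path(task_cgroup(p, memory_cgrp_id));
	pr_cont("\n");
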
David Rientjes May 22, 2018, 10:54 p.m. UTC | #8
On Tue, 22 May 2018, Michal Hocko wrote:

> > I've had success with defining a single line output that includes the 
> > CONSTRAINT_* of the oom kill, the origin and kill memcgs, the thread name, 
> > pid, and uid.  On system oom kills, origin and kill memcgs are left empty.
> > 
> > oom-kill constraint=CONSTRAINT_* origin_memcg=<memcg> kill_memcg=<memcg> task=<comm> pid=<pid> uid=<uid>
> > 
> > Perhaps we should introduce a single line output that will be backwards 
> > compatible that includes this information?
> 
> I do not have a strong preference here. We already print cpuset on its
> own line and we can do the same for the memcg.
> 

Yes, for both the memcg that has reached its limit (origin_memcg) and the 
memcg the killed process is attached to (kill_memcg).

It's beneficial to have a single-line output that includes the constraint,
comm, and at least the pid, in order to avoid any printk interleaving or
ratelimiting.  (We include uid simply to find oom kills of root processes.)

We already have all this information, including cpuset, cpuset nodemask, 
and allocation nodemask for mempolicy ooms.  The only exception appears to 
be the kill_memcg for CONSTRAINT_NONE and for it to be emitted in a way 
that can't be interleaved or suppressed.

Perhaps we can have this?

oom-kill constraint=CONSTRAINT_* nodemask=<cpuset/mempolicy nodemask> origin_memcg=<memcg> kill_memcg=<memcg> task=<comm> pid=<pid> uid=<uid>

For CONSTRAINT_NONE, nodemask and origin_memcg are empty.  For 
CONSTRAINT_CPUSET and CONSTRAINT_MEMORY_POLICY, origin_memcg is empty.
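
To make the proposed format concrete, here is a hedged sketch of such a
single-line summary. The dump_oom_summary() name, the oc->constraint field
and the oom_constraint_text[] table are assumptions for illustration, not
existing code in this tree:

	/* sketch only: oc->constraint and oom_constraint_text[] do not exist yet */
	static const char * const oom_constraint_text[] = {
		[CONSTRAINT_NONE]		= "CONSTRAINT_NONE",
		[CONSTRAINT_CPUSET]		= "CONSTRAINT_CPUSET",
		[CONSTRAINT_MEMORY_POLICY]	= "CONSTRAINT_MEMORY_POLICY",
		[CONSTRAINT_MEMCG]		= "CONSTRAINT_MEMCG",
	};

	static void dump_oom_summary(struct oom_control *oc, struct task_struct *victim)
	{
		pr_info("oom-kill constraint=%s nodemask=%*pbl",
			oom_constraint_text[oc->constraint],
			nodemask_pr_args(oc->nodemask));
		pr_cont(" origin_memcg=");
		if (is_memcg_oom(oc))		/* left empty for system-wide kills */
			pr_cont_cgroup_path(oc->memcg->css.cgroup);
		pr_cont(" kill_memcg=");
		pr_cont_cgroup_path(task_cgroup(victim, memory_cgrp_id));
		pr_cont(" task=%s pid=%d uid=%d\n",
			victim->comm, victim->pid,
			from_kuid(&init_user_ns, task_uid(victim)));
	}

Because this is built from pr_cont() pieces it is still exposed to the
interleaving David mentions; a real implementation would probably format
the cgroup paths into a local buffer (e.g. with cgroup_path()) so the whole
record can be emitted as one printk.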

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb88cf58..244416c9834a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -433,6 +433,7 @@  static void dump_header(struct oom_control *oc, struct task_struct *p)
 	if (is_memcg_oom(oc))
 		mem_cgroup_print_oom_info(oc->memcg, p);
 	else {
+		mem_cgroup_print_oom_info(mem_cgroup_from_task(p), p);
 		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
 		if (is_dump_unreclaim_slabs())
 			dump_unreclaimable_slab();