
memcg: oom: ignore oom warnings from memory.max

Message ID 20200430182712.237526-1-shakeelb@google.com (mailing list archive)
State New, archived
Series memcg: oom: ignore oom warnings from memory.max

Commit Message

Shakeel Butt April 30, 2020, 6:27 p.m. UTC
Lowering memory.max can trigger an oom-kill if the reclaim does not
succeed. However, if the oom-killer does not find a process to kill, it
dumps a lot of warnings.

Deleting a memcg does not reclaim memory from it and the memory can
linger till there is memory pressure. One normal way to proactively
reclaim such memory is to set memory.max to 0 just before deleting the
memcg. However, if some of the memcg's memory is pinned by others, this
operation can trigger an oom-kill without any process and thus can log a
lot of un-needed warnings. So, ignore all such warnings from memory.max.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/oom.h | 3 +++
 mm/memcontrol.c     | 9 +++++----
 mm/oom_kill.c       | 2 +-
 3 files changed, 9 insertions(+), 5 deletions(-)
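
For context, the "sweep before tear down" workflow the commit message
refers to amounts to writing 0 into memory.max and then removing the
cgroup. A minimal userspace sketch is below; the cgroup path, file names
of helpers, and error handling are illustrative assumptions, not part of
the patch.

/* sweep_and_remove.c: illustrative only; /sys/fs/cgroup/job is a
 * hypothetical cgroup path. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Force reclaim of any remaining charges before removal. */
	int fd = open("/sys/fs/cgroup/job/memory.max", O_WRONLY);

	if (fd < 0 || write(fd, "0", 1) < 0)
		perror("memory.max");
	if (fd >= 0)
		close(fd);

	/* Remove the cgroup; memory pinned by others may remain charged,
	 * and that is what triggers the no-victim OOM dumps discussed in
	 * the thread below. */
	if (rmdir("/sys/fs/cgroup/job"))
		perror("rmdir");
	return 0;
}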

Comments

Roman Gushchin April 30, 2020, 7:06 p.m. UTC | #1
Hello, Shakeel!

On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> Lowering memory.max can trigger an oom-kill if the reclaim does not
> succeed. However if oom-killer does not find a process for killing, it
> dumps a lot of warnings.

Makes total sense to me.

> 
> Deleting a memcg does not reclaim memory from it and the memory can
> linger till there is a memory pressure. One normal way to proactively
> reclaim such memory is to set memory.max to 0 just before deleting the
> memcg. However if some of the memcg's memory is pinned by others, this
> operation can trigger an oom-kill without any process and thus can log a
> lot un-needed warnings. So, ignore all such warnings from memory.max.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  include/linux/oom.h | 3 +++
>  mm/memcontrol.c     | 9 +++++----
>  mm/oom_kill.c       | 2 +-
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index c696c265f019..6345dc55df64 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -52,6 +52,9 @@ struct oom_control {
>  
>  	/* Used to print the constraint info. */
>  	enum oom_constraint constraint;
> +
> +	/* Do not warn even if there is no process to be killed. */
> +	bool no_warn;

I'd invert it to warn. Or maybe even warn_on_no_proc?

>  };
>  
>  extern struct mutex oom_lock;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 317dbbaac603..a1f00d9b9bb0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
>  }
>  
>  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -				     int order)
> +				     int order, bool no_warn)
>  {
>  	struct oom_control oc = {
>  		.zonelist = NULL,
> @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.memcg = memcg,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
> +		.no_warn = no_warn,
>  	};
>  	bool ret;
>  
> @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		mem_cgroup_oom_notify(memcg);
>  
>  	mem_cgroup_unmark_under_oom(memcg);
> -	if (mem_cgroup_out_of_memory(memcg, mask, order))
> +	if (mem_cgroup_out_of_memory(memcg, mask, order, false))
>  		ret = OOM_SUCCESS;
>  	else
>  		ret = OOM_FAILED;
> @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> -					 current->memcg_oom_order);
> +					 current->memcg_oom_order, false);
>  	} else {
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
> @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  		}
>  
>  		memcg_memory_event(memcg, MEMCG_OOM);
> -		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> +		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))

I wonder if we can handle it automatically from the oom_killer side?
We can suppress warnings if oc->memcg is set and the cgroup scanning
showed that there are no belonging processes?

Thanks!
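
One way the suggestion above might look in out_of_memory() is sketched
here; this is an untested illustration, and using cgroup_is_populated()
as the emptiness check is an assumption, not part of the posted patch.

	select_bad_process(oc);
	/* Found nothing?!?! */
	if (!oc->chosen) {
		/*
		 * Sketch: stay silent when the OOM is memcg-constrained
		 * and the cgroup has no tasks at all.
		 */
		if (is_memcg_oom(oc) &&
		    !cgroup_is_populated(oc->memcg->css.cgroup))
			return false;
		dump_header(oc, NULL);
		pr_warn("Out of memory and no killable processes...\n");
	}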
Johannes Weiner April 30, 2020, 7:29 p.m. UTC | #2
On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> Lowering memory.max can trigger an oom-kill if the reclaim does not
> succeed. However if oom-killer does not find a process for killing, it
> dumps a lot of warnings.
> 
> Deleting a memcg does not reclaim memory from it and the memory can
> linger till there is a memory pressure. One normal way to proactively
> reclaim such memory is to set memory.max to 0 just before deleting the
> memcg. However if some of the memcg's memory is pinned by others, this
> operation can trigger an oom-kill without any process and thus can log a
> lot un-needed warnings. So, ignore all such warnings from memory.max.

Can't you set memory.high=0 instead? It does the reclaim portion of
memory.max, without the actual OOM killing that causes you problems.
Johannes Weiner April 30, 2020, 7:30 p.m. UTC | #3
On Thu, Apr 30, 2020 at 12:06:10PM -0700, Roman Gushchin wrote:
> On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> > @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> >  		}
> >  
> >  		memcg_memory_event(memcg, MEMCG_OOM);
> > -		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > +		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
> 
> I wonder if we can handle it automatically from the oom_killer side?
> We can suppress warnings if oc->memcg is set and the cgroup scanning
> showed that there are no belonging processes?

Note that we do remote charging for certain consumers, where memory
gets charged from the outside of the cgroup.

We would want to know if these cause OOMs on an empty cgroup, rather
than force-charge the allocations quietly.
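
For reference, when the memcg OOM killer cannot make progress, the charge
path ends up force-charging over the limit. Roughly (an abbreviated
paraphrase of the force path in try_charge() in mm/memcontrol.c of that
time):

force:
	/*
	 * The allocation either can't fail or will lead to more memory
	 * being freed very soon. Allow memory usage to go over the limit
	 * temporarily by force charging it.
	 */
	page_counter_charge(&memcg->memory, nr_pages);
	if (do_memsw_account())
		page_counter_charge(&memcg->memsw, nr_pages);

	return 0;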
Shakeel Butt April 30, 2020, 7:31 p.m. UTC | #4
On Thu, Apr 30, 2020 at 12:06 PM Roman Gushchin <guro@fb.com> wrote:
>
> Hello, Shakeel!
>
> On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > succeed. However if oom-killer does not find a process for killing, it
> > dumps a lot of warnings.
>
> Makes total sense to me.
>
> >
> > Deleting a memcg does not reclaim memory from it and the memory can
> > linger till there is a memory pressure. One normal way to proactively
> > reclaim such memory is to set memory.max to 0 just before deleting the
> > memcg. However if some of the memcg's memory is pinned by others, this
> > operation can trigger an oom-kill without any process and thus can log a
> > lot un-needed warnings. So, ignore all such warnings from memory.max.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > ---
> >  include/linux/oom.h | 3 +++
> >  mm/memcontrol.c     | 9 +++++----
> >  mm/oom_kill.c       | 2 +-
> >  3 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index c696c265f019..6345dc55df64 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -52,6 +52,9 @@ struct oom_control {
> >
> >       /* Used to print the constraint info. */
> >       enum oom_constraint constraint;
> > +
> > +     /* Do not warn even if there is no process to be killed. */
> > +     bool no_warn;
>
> I'd invert it to warn. Or maybe even warn_on_no_proc?
>

Sure.

> >  };
> >
> >  extern struct mutex oom_lock;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 317dbbaac603..a1f00d9b9bb0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
> >  }
> >
> >  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > -                                  int order)
> > +                                  int order, bool no_warn)
> >  {
> >       struct oom_control oc = {
> >               .zonelist = NULL,
> > @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >               .memcg = memcg,
> >               .gfp_mask = gfp_mask,
> >               .order = order,
> > +             .no_warn = no_warn,
> >       };
> >       bool ret;
> >
> > @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> >               mem_cgroup_oom_notify(memcg);
> >
> >       mem_cgroup_unmark_under_oom(memcg);
> > -     if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +     if (mem_cgroup_out_of_memory(memcg, mask, order, false))
> >               ret = OOM_SUCCESS;
> >       else
> >               ret = OOM_FAILED;
> > @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
> >               mem_cgroup_unmark_under_oom(memcg);
> >               finish_wait(&memcg_oom_waitq, &owait.wait);
> >               mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> > -                                      current->memcg_oom_order);
> > +                                      current->memcg_oom_order, false);
> >       } else {
> >               schedule();
> >               mem_cgroup_unmark_under_oom(memcg);
> > @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> >               }
> >
> >               memcg_memory_event(memcg, MEMCG_OOM);
> > -             if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > +             if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
>
> I wonder if we can handle it automatically from the oom_killer side?
> We can suppress warnings if oc->memcg is set and the cgroup scanning
> showed that there are no belonging processes?
>

What about the charging path? Do we not want such warnings from
charging paths? It might be due to some misconfiguration.
Shakeel Butt April 30, 2020, 8:20 p.m. UTC | #5
On Thu, Apr 30, 2020 at 12:29 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > succeed. However if oom-killer does not find a process for killing, it
> > dumps a lot of warnings.
> >
> > Deleting a memcg does not reclaim memory from it and the memory can
> > linger till there is a memory pressure. One normal way to proactively
> > reclaim such memory is to set memory.max to 0 just before deleting the
> > memcg. However if some of the memcg's memory is pinned by others, this
> > operation can trigger an oom-kill without any process and thus can log a
> > lot un-needed warnings. So, ignore all such warnings from memory.max.
>
> Can't you set memory.high=0 instead? It does the reclaim portion of
> memory.max, without the actual OOM killing that causes you problems.

Yes that would work but remote charging concerns me. Remote charging
can still happen after the memcg is offlined and at the moment, high
reclaim does not work for remote memcg and the usage can go till max
or global pressure. This is most probably a misconfiguration and we
might not receive the warnings in the log ever. Setting memory.max to
0 will definitely give such warnings.
Roman Gushchin April 30, 2020, 8:23 p.m. UTC | #6
On Thu, Apr 30, 2020 at 03:30:49PM -0400, Johannes Weiner wrote:
> On Thu, Apr 30, 2020 at 12:06:10PM -0700, Roman Gushchin wrote:
> > On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> > > @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> > >  		}
> > >  
> > >  		memcg_memory_event(memcg, MEMCG_OOM);
> > > -		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > > +		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
> > 
> > I wonder if we can handle it automatically from the oom_killer side?
> > We can suppress warnings if oc->memcg is set and the cgroup scanning
> > showed that there are no belonging processes?
> 
> Note that we do remote charging for certain consumers, where memory
> gets charged from the outside of the cgroup.
> 
> We would want to know if these cause OOMs on an empty cgroup, rather
> than force-charge the allocations quietly.
> 

Yeah, good point.
Yafang Shao May 1, 2020, 1:39 a.m. UTC | #7
On Fri, May 1, 2020 at 2:27 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Lowering memory.max can trigger an oom-kill if the reclaim does not
> succeed. However if oom-killer does not find a process for killing, it
> dumps a lot of warnings.
>

I have been confused by this behavior for several months and I think
it will confuse more memcg users.
We should keep the memcg oom behavior consistent with system oom - no
oom kill if no process.

What about the below change?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e28098e13f1c..25fbc37a747f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6086,6 +6086,9 @@ static ssize_t memory_max_write(struct
kernfs_open_file *of,
                        continue;
                }

+               if (!cgroup_is_populated(memcg->css.cgroup))
+                       break;
+
                memcg_memory_event(memcg, MEMCG_OOM);
                if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
                        break;

> Deleting a memcg does not reclaim memory from it and the memory can
> linger till there is a memory pressure. One normal way to proactively
> reclaim such memory is to set memory.max to 0 just before deleting the
> memcg. However if some of the memcg's memory is pinned by others, this
> operation can trigger an oom-kill without any process and thus can log a
> lot un-needed warnings. So, ignore all such warnings from memory.max.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  include/linux/oom.h | 3 +++
>  mm/memcontrol.c     | 9 +++++----
>  mm/oom_kill.c       | 2 +-
>  3 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index c696c265f019..6345dc55df64 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -52,6 +52,9 @@ struct oom_control {
>
>         /* Used to print the constraint info. */
>         enum oom_constraint constraint;
> +
> +       /* Do not warn even if there is no process to be killed. */
> +       bool no_warn;
>  };
>
>  extern struct mutex oom_lock;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 317dbbaac603..a1f00d9b9bb0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
>  }
>
>  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -                                    int order)
> +                                    int order, bool no_warn)
>  {
>         struct oom_control oc = {
>                 .zonelist = NULL,
> @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>                 .memcg = memcg,
>                 .gfp_mask = gfp_mask,
>                 .order = order,
> +               .no_warn = no_warn,
>         };
>         bool ret;
>
> @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>                 mem_cgroup_oom_notify(memcg);
>
>         mem_cgroup_unmark_under_oom(memcg);
> -       if (mem_cgroup_out_of_memory(memcg, mask, order))
> +       if (mem_cgroup_out_of_memory(memcg, mask, order, false))
>                 ret = OOM_SUCCESS;
>         else
>                 ret = OOM_FAILED;
> @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>                 mem_cgroup_unmark_under_oom(memcg);
>                 finish_wait(&memcg_oom_waitq, &owait.wait);
>                 mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> -                                        current->memcg_oom_order);
> +                                        current->memcg_oom_order, false);
>         } else {
>                 schedule();
>                 mem_cgroup_unmark_under_oom(memcg);
> @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>                 }
>
>                 memcg_memory_event(memcg, MEMCG_OOM);
> -               if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> +               if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
>                         break;
>         }
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 463b3d74a64a..5ace39f6fe1e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1098,7 +1098,7 @@ bool out_of_memory(struct oom_control *oc)
>
>         select_bad_process(oc);
>         /* Found nothing?!?! */
> -       if (!oc->chosen) {
> +       if (!oc->chosen && !oc->no_warn) {
>                 dump_header(oc, NULL);
>                 pr_warn("Out of memory and no killable processes...\n");
>                 /*
> --
> 2.26.2.526.g744177e7f7-goog
>
>
Shakeel Butt May 1, 2020, 2:04 a.m. UTC | #8
On Thu, Apr 30, 2020 at 6:39 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, May 1, 2020 at 2:27 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > succeed. However if oom-killer does not find a process for killing, it
> > dumps a lot of warnings.
> >
>
> I have been confused by this behavior for several months and I think
> it will confuse more memcg users.
> We should keep the memcg oom behavior consistent with system oom - no
> oom kill if no process.
>
> What about bellow change ?
>

Seems fine to me. If others are ok with this, please do send a signed-off patch.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e28098e13f1c..25fbc37a747f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6086,6 +6086,9 @@ static ssize_t memory_max_write(struct
> kernfs_open_file *of,
>                         continue;
>                 }
>
> +               if (!cgroup_is_populated(memcg->css.cgroup))
> +                       break;
> +
>                 memcg_memory_event(memcg, MEMCG_OOM);
>                 if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
>                         break;
>
> > Deleting a memcg does not reclaim memory from it and the memory can
> > linger till there is a memory pressure. One normal way to proactively
> > reclaim such memory is to set memory.max to 0 just before deleting the
> > memcg. However if some of the memcg's memory is pinned by others, this
> > operation can trigger an oom-kill without any process and thus can log a
> > lot un-needed warnings. So, ignore all such warnings from memory.max.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > ---
> >  include/linux/oom.h | 3 +++
> >  mm/memcontrol.c     | 9 +++++----
> >  mm/oom_kill.c       | 2 +-
> >  3 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index c696c265f019..6345dc55df64 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -52,6 +52,9 @@ struct oom_control {
> >
> >         /* Used to print the constraint info. */
> >         enum oom_constraint constraint;
> > +
> > +       /* Do not warn even if there is no process to be killed. */
> > +       bool no_warn;
> >  };
> >
> >  extern struct mutex oom_lock;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 317dbbaac603..a1f00d9b9bb0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
> >  }
> >
> >  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > -                                    int order)
> > +                                    int order, bool no_warn)
> >  {
> >         struct oom_control oc = {
> >                 .zonelist = NULL,
> > @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >                 .memcg = memcg,
> >                 .gfp_mask = gfp_mask,
> >                 .order = order,
> > +               .no_warn = no_warn,
> >         };
> >         bool ret;
> >
> > @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> >                 mem_cgroup_oom_notify(memcg);
> >
> >         mem_cgroup_unmark_under_oom(memcg);
> > -       if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +       if (mem_cgroup_out_of_memory(memcg, mask, order, false))
> >                 ret = OOM_SUCCESS;
> >         else
> >                 ret = OOM_FAILED;
> > @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
> >                 mem_cgroup_unmark_under_oom(memcg);
> >                 finish_wait(&memcg_oom_waitq, &owait.wait);
> >                 mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> > -                                        current->memcg_oom_order);
> > +                                        current->memcg_oom_order, false);
> >         } else {
> >                 schedule();
> >                 mem_cgroup_unmark_under_oom(memcg);
> > @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> >                 }
> >
> >                 memcg_memory_event(memcg, MEMCG_OOM);
> > -               if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > +               if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
> >                         break;
> >         }
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 463b3d74a64a..5ace39f6fe1e 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1098,7 +1098,7 @@ bool out_of_memory(struct oom_control *oc)
> >
> >         select_bad_process(oc);
> >         /* Found nothing?!?! */
> > -       if (!oc->chosen) {
> > +       if (!oc->chosen && !oc->no_warn) {
> >                 dump_header(oc, NULL);
> >                 pr_warn("Out of memory and no killable processes...\n");
> >                 /*
> > --
> > 2.26.2.526.g744177e7f7-goog
> >
> >
>
>
> --
> Thanks
> Yafang
Yafang Shao May 1, 2020, 2:12 a.m. UTC | #9
On Fri, May 1, 2020 at 10:04 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Apr 30, 2020 at 6:39 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Fri, May 1, 2020 at 2:27 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > succeed. However if oom-killer does not find a process for killing, it
> > > dumps a lot of warnings.
> > >
> >
> > I have been confused by this behavior for several months and I think
> > it will confuse more memcg users.
> > We should keep the memcg oom behavior consistent with system oom - no
> > oom kill if no process.
> >
> > What about bellow change ?
> >
>
> Seems fine to me. If others are ok with this, please do send a signed-off patch.
>

Sure, I will.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e28098e13f1c..25fbc37a747f 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6086,6 +6086,9 @@ static ssize_t memory_max_write(struct
> > kernfs_open_file *of,
> >                         continue;
> >                 }
> >
> > +               if (!cgroup_is_populated(memcg->css.cgroup))
> > +                       break;
> > +
> >                 memcg_memory_event(memcg, MEMCG_OOM);
> >                 if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> >                         break;
> >
> > > Deleting a memcg does not reclaim memory from it and the memory can
> > > linger till there is a memory pressure. One normal way to proactively
> > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > memcg. However if some of the memcg's memory is pinned by others, this
> > > operation can trigger an oom-kill without any process and thus can log a
> > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> > >
> > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > ---
> > >  include/linux/oom.h | 3 +++
> > >  mm/memcontrol.c     | 9 +++++----
> > >  mm/oom_kill.c       | 2 +-
> > >  3 files changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > > index c696c265f019..6345dc55df64 100644
> > > --- a/include/linux/oom.h
> > > +++ b/include/linux/oom.h
> > > @@ -52,6 +52,9 @@ struct oom_control {
> > >
> > >         /* Used to print the constraint info. */
> > >         enum oom_constraint constraint;
> > > +
> > > +       /* Do not warn even if there is no process to be killed. */
> > > +       bool no_warn;
> > >  };
> > >
> > >  extern struct mutex oom_lock;
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 317dbbaac603..a1f00d9b9bb0 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
> > >  }
> > >
> > >  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > -                                    int order)
> > > +                                    int order, bool no_warn)
> > >  {
> > >         struct oom_control oc = {
> > >                 .zonelist = NULL,
> > > @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >                 .memcg = memcg,
> > >                 .gfp_mask = gfp_mask,
> > >                 .order = order,
> > > +               .no_warn = no_warn,
> > >         };
> > >         bool ret;
> > >
> > > @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> > >                 mem_cgroup_oom_notify(memcg);
> > >
> > >         mem_cgroup_unmark_under_oom(memcg);
> > > -       if (mem_cgroup_out_of_memory(memcg, mask, order))
> > > +       if (mem_cgroup_out_of_memory(memcg, mask, order, false))
> > >                 ret = OOM_SUCCESS;
> > >         else
> > >                 ret = OOM_FAILED;
> > > @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
> > >                 mem_cgroup_unmark_under_oom(memcg);
> > >                 finish_wait(&memcg_oom_waitq, &owait.wait);
> > >                 mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> > > -                                        current->memcg_oom_order);
> > > +                                        current->memcg_oom_order, false);
> > >         } else {
> > >                 schedule();
> > >                 mem_cgroup_unmark_under_oom(memcg);
> > > @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> > >                 }
> > >
> > >                 memcg_memory_event(memcg, MEMCG_OOM);
> > > -               if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > > +               if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
> > >                         break;
> > >         }
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 463b3d74a64a..5ace39f6fe1e 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1098,7 +1098,7 @@ bool out_of_memory(struct oom_control *oc)
> > >
> > >         select_bad_process(oc);
> > >         /* Found nothing?!?! */
> > > -       if (!oc->chosen) {
> > > +       if (!oc->chosen && !oc->no_warn) {
> > >                 dump_header(oc, NULL);
> > >                 pr_warn("Out of memory and no killable processes...\n");
> > >                 /*
> > > --
> > > 2.26.2.526.g744177e7f7-goog
> > >
> > >
> >
> >
> > --
> > Thanks
> > Yafang
Michal Hocko May 4, 2020, 6:56 a.m. UTC | #10
On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> Lowering memory.max can trigger an oom-kill if the reclaim does not
> succeed. However if oom-killer does not find a process for killing, it
> dumps a lot of warnings.

It shouldn't dump much more than the regular OOM report AFAICS. Sure,
there is the "Out of memory and no killable processes..." message printed
as well, but is that a real problem?

> Deleting a memcg does not reclaim memory from it and the memory can
> linger till there is a memory pressure. One normal way to proactively
> reclaim such memory is to set memory.max to 0 just before deleting the
> memcg. However if some of the memcg's memory is pinned by others, this
> operation can trigger an oom-kill without any process and thus can log a
> lot un-needed warnings. So, ignore all such warnings from memory.max.

OK, I can see why you might want to use memory.max for that purpose but
I do not really understand why the oom report is a problem here.
memory.max can trigger the oom kill and user should be expecting the oom
report under that condition. Why is "no eligible task" so special? Is it
because you know that there won't be any tasks for your particular case?
What about other use cases where memory.max is not used as a "sweep
before tear down"?

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  include/linux/oom.h | 3 +++
>  mm/memcontrol.c     | 9 +++++----
>  mm/oom_kill.c       | 2 +-
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index c696c265f019..6345dc55df64 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -52,6 +52,9 @@ struct oom_control {
>  
>  	/* Used to print the constraint info. */
>  	enum oom_constraint constraint;
> +
> +	/* Do not warn even if there is no process to be killed. */
> +	bool no_warn;
>  };
>  
>  extern struct mutex oom_lock;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 317dbbaac603..a1f00d9b9bb0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
>  }
>  
>  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -				     int order)
> +				     int order, bool no_warn)
>  {
>  	struct oom_control oc = {
>  		.zonelist = NULL,
> @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.memcg = memcg,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
> +		.no_warn = no_warn,
>  	};
>  	bool ret;
>  
> @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		mem_cgroup_oom_notify(memcg);
>  
>  	mem_cgroup_unmark_under_oom(memcg);
> -	if (mem_cgroup_out_of_memory(memcg, mask, order))
> +	if (mem_cgroup_out_of_memory(memcg, mask, order, false))
>  		ret = OOM_SUCCESS;
>  	else
>  		ret = OOM_FAILED;
> @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
>  		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> -					 current->memcg_oom_order);
> +					 current->memcg_oom_order, false);
>  	} else {
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
> @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  		}
>  
>  		memcg_memory_event(memcg, MEMCG_OOM);
> -		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> +		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
>  			break;
>  	}
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 463b3d74a64a..5ace39f6fe1e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1098,7 +1098,7 @@ bool out_of_memory(struct oom_control *oc)
>  
>  	select_bad_process(oc);
>  	/* Found nothing?!?! */
> -	if (!oc->chosen) {
> +	if (!oc->chosen && !oc->no_warn) {
>  		dump_header(oc, NULL);
>  		pr_warn("Out of memory and no killable processes...\n");
>  		/*
> -- 
> 2.26.2.526.g744177e7f7-goog
Michal Hocko May 4, 2020, 6:57 a.m. UTC | #11
On Thu 30-04-20 13:20:10, Shakeel Butt wrote:
> On Thu, Apr 30, 2020 at 12:29 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > succeed. However if oom-killer does not find a process for killing, it
> > > dumps a lot of warnings.
> > >
> > > Deleting a memcg does not reclaim memory from it and the memory can
> > > linger till there is a memory pressure. One normal way to proactively
> > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > memcg. However if some of the memcg's memory is pinned by others, this
> > > operation can trigger an oom-kill without any process and thus can log a
> > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> >
> > Can't you set memory.high=0 instead? It does the reclaim portion of
> > memory.max, without the actual OOM killing that causes you problems.
> 
> Yes that would work but remote charging concerns me. Remote charging
> can still happen after the memcg is offlined and at the moment, high
> reclaim does not work for remote memcg and the usage can go till max
> or global pressure. This is most probably a misconfiguration and we
> might not receive the warnings in the log ever. Setting memory.max to
> 0 will definitely give such warnings.

Can we add a warning for the remote charging on dead memcgs?
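
For illustration, such a warning might look roughly like the below on the
charge path; the placement in try_charge() and the exact check are
assumptions, not a concrete proposal.

	/*
	 * Illustrative only: flag charges that land on an offline (dead)
	 * memcg, which is what remote charging after offlining would hit.
	 */
	if (unlikely(!mem_cgroup_online(memcg)))
		pr_warn_once("charging to an offline memcg\n");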
Michal Hocko May 4, 2020, 7:03 a.m. UTC | #12
On Fri 01-05-20 09:39:24, Yafang Shao wrote:
> On Fri, May 1, 2020 at 2:27 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > succeed. However if oom-killer does not find a process for killing, it
> > dumps a lot of warnings.
> >
> 
> I have been confused by this behavior for several months and I think
> it will confuse more memcg users.

Could you be more specific what has caused the confusion?

> We should keep the memcg oom behavior consistent with system oom - no
> oom kill if no process.

This is not the global oom behavior. We do complain loudly on no
eligible tasks and actually panic the system. Memcg cannot simply
do the same by default for obvious reasons.

> What about bellow change ?
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e28098e13f1c..25fbc37a747f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6086,6 +6086,9 @@ static ssize_t memory_max_write(struct
> kernfs_open_file *of,
>                         continue;
>                 }
> 
> +               if (!cgroup_is_populated(memcg->css.cgroup))
> +                       break;
> +
>                 memcg_memory_event(memcg, MEMCG_OOM);
>                 if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
>                         break;

I am not a great fan, to be honest. The warning might be useful for other
use cases where it is not clear that the memcg is empty.
Yafang Shao May 4, 2020, 7:26 a.m. UTC | #13
On Mon, May 4, 2020 at 3:03 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 01-05-20 09:39:24, Yafang Shao wrote:
> > On Fri, May 1, 2020 at 2:27 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > succeed. However if oom-killer does not find a process for killing, it
> > > dumps a lot of warnings.
> > >
> >
> > I have been confused by this behavior for several months and I think
> > it will confuse more memcg users.
>
> Could you be more specific what has caused the confusion?
>

No task is different from no eligible task.
No eligible task means there are some candidates but no one is eligible.
While no task means there is no candidate.

> > We should keep the memcg oom behavior consistent with system oom - no
> > oom kill if no process.
>
> This is not the global mmemcg behavior. We do complain loud on no
> eligible tasks and actually panic the system. Memcg cannot simply
> do the same by default for obvious reasons.
>

As explained above, no eligible task is different from no task.
If there are some candidates but no one is eligible, the system will panic.
While if there's no task, it is definitely no OOM, because that's an
impossible thing for the system.

> > What about bellow change ?
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e28098e13f1c..25fbc37a747f 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6086,6 +6086,9 @@ static ssize_t memory_max_write(struct
> > kernfs_open_file *of,
> >                         continue;
> >                 }
> >
> > +               if (!cgroup_is_populated(memcg->css.cgroup))
> > +                       break;
> > +
> >                 memcg_memory_event(memcg, MEMCG_OOM);
> >                 if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> >                         break;
>
> I am not a great fan to be honest. The warning might be useful for other
> usecases when it is not clear that the memcg is empty.
>

The other use case can still get the oom status from the MEMCG_OOM
event, see below:

                 memcg_memory_event(memcg, MEMCG_OOM);
+               if (!cgroup_is_populated(memcg->css.cgroup))
+                       break;

See also https://lore.kernel.org/linux-mm/20200504042621.10334-3-laoar.shao@gmail.com/T/#u
Michal Hocko May 4, 2020, 7:35 a.m. UTC | #14
On Mon 04-05-20 15:26:52, Yafang Shao wrote:
> On Mon, May 4, 2020 at 3:03 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 01-05-20 09:39:24, Yafang Shao wrote:
> > > On Fri, May 1, 2020 at 2:27 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > succeed. However if oom-killer does not find a process for killing, it
> > > > dumps a lot of warnings.
> > > >
> > >
> > > I have been confused by this behavior for several months and I think
> > > it will confuse more memcg users.
> >
> > Could you be more specific what has caused the confusion?
> >
> 
> No task is different from no eligible task.
> No eligible task means there are some candidates but no one is eligible.
> Whille no task means there is no candidate.

I really fail to see a difference. It is clear that one is a subset of the
other, but in practical life tasks might come and go at any time, and if
you try to reduce the hard limit and there are no tasks that could be
reclaimed, then I believe we should complain whether it is only oom
disabled tasks or no tasks at all. It is certainly an unexpected situation
in some cases because there are resources which are bound to the memcg
without any task we can act on.

> > > We should keep the memcg oom behavior consistent with system oom - no
> > > oom kill if no process.
> >
> > This is not the global mmemcg behavior. We do complain loud on no
> > eligible tasks and actually panic the system. Memcg cannot simply
> > do the same by default for obvious reasons.
> >
> 
> As explianed above, no eligible task is different from no task.
> If there are some candidates but no one is eligible, the system will panic.
> While if there's no task, it is definitely no OOM, because that's an
> improssible thing for the system.

This is very much a possible situation when all eligible tasks have been
already killed but didn't really help to resolve the oom situation
- e.g. a kernel memory leak or unbounded shmem consumption, etc.
Yafang Shao May 4, 2020, 7:40 a.m. UTC | #15
On Mon, May 4, 2020 at 3:35 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 04-05-20 15:26:52, Yafang Shao wrote:
> > On Mon, May 4, 2020 at 3:03 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Fri 01-05-20 09:39:24, Yafang Shao wrote:
> > > > On Fri, May 1, 2020 at 2:27 AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > > succeed. However if oom-killer does not find a process for killing, it
> > > > > dumps a lot of warnings.
> > > > >
> > > >
> > > > I have been confused by this behavior for several months and I think
> > > > it will confuse more memcg users.
> > >
> > > Could you be more specific what has caused the confusion?
> > >
> >
> > No task is different from no eligible task.
> > No eligible task means there are some candidates but no one is eligible.
> > Whille no task means there is no candidate.
>
> I really fail to see a difference. It is clear the one is subset of the
> other but in practical life tasks might come and go at any time and if
> you try to reduce the hard limit and there are no tasks that could be
> reclaimed then I believe we should complain whether it is only oom
> disabled tasks or no tasks at all. It is certainly unexpected situation
> in some cases because there are resources which are bound to the memcg
> without any task we can act on.
>
> > > > We should keep the memcg oom behavior consistent with system oom - no
> > > > oom kill if no process.
> > >
> > > This is not the global mmemcg behavior. We do complain loud on no
> > > eligible tasks and actually panic the system. Memcg cannot simply
> > > do the same by default for obvious reasons.
> > >
> >
> > As explianed above, no eligible task is different from no task.
> > If there are some candidates but no one is eligible, the system will panic.
> > While if there's no task, it is definitely no OOM, because that's an
> > improssible thing for the system.
>
> This is very much possible situation when all eligible tasks have been
> already killed but they didn't really help to resolve the oom situation
> - e.g. in kernel memory leak or unbounded shmem consumption etc...
>

That's still an impossible thing, because many tasks are invisible to
the oom killer.
See oom_unkillable_task().
Michal Hocko May 4, 2020, 8:03 a.m. UTC | #16
On Mon 04-05-20 15:40:18, Yafang Shao wrote:
> On Mon, May 4, 2020 at 3:35 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 04-05-20 15:26:52, Yafang Shao wrote:
[...]
> > > As explianed above, no eligible task is different from no task.
> > > If there are some candidates but no one is eligible, the system will panic.
> > > While if there's no task, it is definitely no OOM, because that's an
> > > improssible thing for the system.
> >
> > This is very much possible situation when all eligible tasks have been
> > already killed but they didn't really help to resolve the oom situation
> > - e.g. in kernel memory leak or unbounded shmem consumption etc...
> >
> 
> That's still an impossible thing, because many tasks are invisible to
> the oom killer.
> See oom_unkillable_task().

I do not follow, really. oom_unkillable_task only says that global init
cannot be killed and that it doesn't make any sense to kill kernel
threads as they do not own any mm normally.
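
(For reference, the check being discussed is roughly the following,
paraphrased from mm/oom_kill.c around that time:)

static bool oom_unkillable_task(struct task_struct *p)
{
	/* Global init is never a candidate. */
	if (is_global_init(p))
		return true;
	/* Kernel threads own no mm worth reclaiming. */
	if (p->flags & PF_KTHREAD)
		return true;
	return false;
}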
Shakeel Butt May 4, 2020, 1:54 p.m. UTC | #17
On Sun, May 3, 2020 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > succeed. However if oom-killer does not find a process for killing, it
> > dumps a lot of warnings.
>
> It shouldn't dump much more than the regular OOM report AFAICS. Sure
> there is "Out of memory and no killable processes..." message printed as
> well but is that a real problem?
>
> > Deleting a memcg does not reclaim memory from it and the memory can
> > linger till there is a memory pressure. One normal way to proactively
> > reclaim such memory is to set memory.max to 0 just before deleting the
> > memcg. However if some of the memcg's memory is pinned by others, this
> > operation can trigger an oom-kill without any process and thus can log a
> > lot un-needed warnings. So, ignore all such warnings from memory.max.
>
> OK, I can see why you might want to use memory.max for that purpose but
> I do not really understand why the oom report is a problem here.

It may not be a problem for an individual or small-scale deployment,
but when "sweep before tear down" is part of the workflow for
thousands of machines cycling through hundreds of thousands of cgroups,
then we can potentially flood the logs with useless dumps and may
hide (or overflow) any useful information in the logs.

> memory.max can trigger the oom kill and user should be expecting the oom
> report under that condition. Why is "no eligible task" so special? Is it
> because you know that there won't be any tasks for your particular case?
> What about other use cases where memory.max is not used as a "sweep
> before tear down"?

What would other such use-cases be? The only use-case I can envision
for dynamically adjusting the limits of a live cgroup is resource
managers. However, for cgroup v2, memory.high is the recommended way to
limit the usage, so, why would resource managers be changing
memory.max instead of memory.high? I am not sure. What do you think?
FB is moving away from limits setting, so, not sure if they have
thought of these cases.

BTW for such use-cases, shouldn't we be taking the memcg's oom_lock?

>
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > ---
> >  include/linux/oom.h | 3 +++
> >  mm/memcontrol.c     | 9 +++++----
> >  mm/oom_kill.c       | 2 +-
> >  3 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index c696c265f019..6345dc55df64 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -52,6 +52,9 @@ struct oom_control {
> >
> >       /* Used to print the constraint info. */
> >       enum oom_constraint constraint;
> > +
> > +     /* Do not warn even if there is no process to be killed. */
> > +     bool no_warn;
> >  };
> >
> >  extern struct mutex oom_lock;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 317dbbaac603..a1f00d9b9bb0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1571,7 +1571,7 @@ unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
> >  }
> >
> >  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > -                                  int order)
> > +                                  int order, bool no_warn)
> >  {
> >       struct oom_control oc = {
> >               .zonelist = NULL,
> > @@ -1579,6 +1579,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >               .memcg = memcg,
> >               .gfp_mask = gfp_mask,
> >               .order = order,
> > +             .no_warn = no_warn,
> >       };
> >       bool ret;
> >
> > @@ -1821,7 +1822,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> >               mem_cgroup_oom_notify(memcg);
> >
> >       mem_cgroup_unmark_under_oom(memcg);
> > -     if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +     if (mem_cgroup_out_of_memory(memcg, mask, order, false))
> >               ret = OOM_SUCCESS;
> >       else
> >               ret = OOM_FAILED;
> > @@ -1880,7 +1881,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
> >               mem_cgroup_unmark_under_oom(memcg);
> >               finish_wait(&memcg_oom_waitq, &owait.wait);
> >               mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> > -                                      current->memcg_oom_order);
> > +                                      current->memcg_oom_order, false);
> >       } else {
> >               schedule();
> >               mem_cgroup_unmark_under_oom(memcg);
> > @@ -6106,7 +6107,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> >               }
> >
> >               memcg_memory_event(memcg, MEMCG_OOM);
> > -             if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> > +             if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
> >                       break;
> >       }
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 463b3d74a64a..5ace39f6fe1e 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1098,7 +1098,7 @@ bool out_of_memory(struct oom_control *oc)
> >
> >       select_bad_process(oc);
> >       /* Found nothing?!?! */
> > -     if (!oc->chosen) {
> > +     if (!oc->chosen && !oc->no_warn) {
> >               dump_header(oc, NULL);
> >               pr_warn("Out of memory and no killable processes...\n");
> >               /*
> > --
> > 2.26.2.526.g744177e7f7-goog
>
> --
> Michal Hocko
> SUSE Labs
Shakeel Butt May 4, 2020, 1:54 p.m. UTC | #18
On Sun, May 3, 2020 at 11:57 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 30-04-20 13:20:10, Shakeel Butt wrote:
> > On Thu, Apr 30, 2020 at 12:29 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Apr 30, 2020 at 11:27:12AM -0700, Shakeel Butt wrote:
> > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > succeed. However if oom-killer does not find a process for killing, it
> > > > dumps a lot of warnings.
> > > >
> > > > Deleting a memcg does not reclaim memory from it and the memory can
> > > > linger till there is a memory pressure. One normal way to proactively
> > > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > > memcg. However if some of the memcg's memory is pinned by others, this
> > > > operation can trigger an oom-kill without any process and thus can log a
> > > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> > >
> > > Can't you set memory.high=0 instead? It does the reclaim portion of
> > > memory.max, without the actual OOM killing that causes you problems.
> >
> > Yes that would work but remote charging concerns me. Remote charging
> > can still happen after the memcg is offlined and at the moment, high
> > reclaim does not work for remote memcg and the usage can go till max
> > or global pressure. This is most probably a misconfiguration and we
> > might not receive the warnings in the log ever. Setting memory.max to
> > 0 will definitely give such warnings.
>
> Can we add a warning for the remote charging on dead memcgs?
>

I don't think we should warn for all remote charging on dead memcgs.
One particular example is buffer_heads, which can be allocated
within the reclaim context, and most probably the pages they are attached
to will be freed soon.

Shakeel
Michal Hocko May 4, 2020, 2:11 p.m. UTC | #19
On Mon 04-05-20 06:54:40, Shakeel Butt wrote:
> On Sun, May 3, 2020 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > succeed. However if oom-killer does not find a process for killing, it
> > > dumps a lot of warnings.
> >
> > It shouldn't dump much more than the regular OOM report AFAICS. Sure
> > there is "Out of memory and no killable processes..." message printed as
> > well but is that a real problem?
> >
> > > Deleting a memcg does not reclaim memory from it and the memory can
> > > linger till there is a memory pressure. One normal way to proactively
> > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > memcg. However if some of the memcg's memory is pinned by others, this
> > > operation can trigger an oom-kill without any process and thus can log a
> > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> >
> > OK, I can see why you might want to use memory.max for that purpose but
> > I do not really understand why the oom report is a problem here.
> 
> It may not be a problem for an individual or small scale deployment
> but when "sweep before tear down" is the part of the workflow for
> thousands of machines cycling through hundreds of thousands of cgroups
> then we can potentially flood the logs with not useful dumps and may
> hide (or overflow) any useful information in the logs.

If you are doing this at a large scale and the oom report is really a
problem, then you shouldn't be resetting the hard limit to 0 in the first
place.

> > memory.max can trigger the oom kill and user should be expecting the oom
> > report under that condition. Why is "no eligible task" so special? Is it
> > because you know that there won't be any tasks for your particular case?
> > What about other use cases where memory.max is not used as a "sweep
> > before tear down"?
> 
> What other such use-cases would be? The only use-case I can envision
> of adjusting limits dynamically of a live cgroup are resource
> managers. However for cgroup v2, memory.high is the recommended way to
> limit the usage, so, why would resource managers be changing
> memory.max instead of memory.high? I am not sure. What do you think?

There are different reasons to use the hard limit, mostly to contain
potential runaways. While the high limit might be a sufficient measure to
achieve that as well, the hard limit is the last resort. And it clearly
has the oom killer semantics, so I am not really sure why you are
comparing the two.

> FB is moving away from limits setting, so, not sure if they have
> thought of these cases.
> 
> BTW for such use-cases, shouldn't we be taking the memcg's oom_lock?

This is a good question. I would have to go and double check the code
but I suspect that this is an omission.
Tetsuo Handa May 4, 2020, 2:20 p.m. UTC | #20
On 2020/05/04 22:54, Shakeel Butt wrote:
> It may not be a problem for an individual or small scale deployment
> but when "sweep before tear down" is the part of the workflow for
> thousands of machines cycling through hundreds of thousands of cgroups
> then we can potentially flood the logs with not useful dumps and may
> hide (or overflow) any useful information in the logs.

I'm proposing a patch which allows configuring which OOM-related messages
should be sent to consoles at
https://lkml.kernel.org/r/20200424024239.63607-1-penguin-kernel@I-love.SAKURA.ne.jp .
Will that approach help you?
Shakeel Butt May 4, 2020, 2:53 p.m. UTC | #21
On Mon, May 4, 2020 at 7:11 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 04-05-20 06:54:40, Shakeel Butt wrote:
> > On Sun, May 3, 2020 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > succeed. However if oom-killer does not find a process for killing, it
> > > > dumps a lot of warnings.
> > >
> > > It shouldn't dump much more than the regular OOM report AFAICS. Sure
> > > there is "Out of memory and no killable processes..." message printed as
> > > well but is that a real problem?
> > >
> > > > Deleting a memcg does not reclaim memory from it and the memory can
> > > > linger till there is a memory pressure. One normal way to proactively
> > > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > > memcg. However if some of the memcg's memory is pinned by others, this
> > > > operation can trigger an oom-kill without any process and thus can log a
> > > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> > >
> > > OK, I can see why you might want to use memory.max for that purpose but
> > > I do not really understand why the oom report is a problem here.
> >
> > It may not be a problem for an individual or small scale deployment
> > but when "sweep before tear down" is the part of the workflow for
> > thousands of machines cycling through hundreds of thousands of cgroups
> > then we can potentially flood the logs with not useful dumps and may
> > hide (or overflow) any useful information in the logs.
>
> If you are doing this in a large scale and the oom report is really a
> problem then you shouldn't be resetting hard limit to 0 in the first
> place.
>

I think I have pretty clearly described why we want to reset the hard
limit to 0, so, unless there is an alternative I don't see why we
should not be doing this.

> > > memory.max can trigger the oom kill and user should be expecting the oom
> > > report under that condition. Why is "no eligible task" so special? Is it
> > > because you know that there won't be any tasks for your particular case?
> > > What about other use cases where memory.max is not used as a "sweep
> > > before tear down"?
> >
> > What other such use-cases would be? The only use-case I can envision
> > of adjusting limits dynamically of a live cgroup are resource
> > managers. However for cgroup v2, memory.high is the recommended way to
> > limit the usage, so, why would resource managers be changing
> > memory.max instead of memory.high? I am not sure. What do you think?
>
> There are different reasons to use the hard limit. Mostly to contain
> potential runaways. While high limit might be a sufficient measure to
> achieve that as well the hard limit is the last resort. And it clearly
> has the oom killer semantic so I am not really sure why you are
> comparing the two.
>

I am trying to see if "no eligible task" is really an issue that should
be warned about for the "other use cases". The only real use-case I can
think of is resource managers adjusting the limit dynamically. I don't
see "no eligible task" as a concerning reason for such a use-case. If
you have some other use-case, please do tell.

Shakeel
Shakeel Butt May 4, 2020, 2:57 p.m. UTC | #22
On Mon, May 4, 2020 at 7:20 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2020/05/04 22:54, Shakeel Butt wrote:
> > It may not be a problem for an individual or small scale deployment
> > but when "sweep before tear down" is the part of the workflow for
> > thousands of machines cycling through hundreds of thousands of cgroups
> > then we can potentially flood the logs with not useful dumps and may
> > hide (or overflow) any useful information in the logs.
>
> I'm proposing a patch which allows configuring which OOM-related messages
> should be sent to consoles at
> https://lkml.kernel.org/r/20200424024239.63607-1-penguin-kernel@I-love.SAKURA.ne.jp .
> Will that approach help you?

From what I understand, that patch is specifically for controlling
messages to consoles. The messages will still be in logs, right?
Michal Hocko May 4, 2020, 3 p.m. UTC | #23
On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
> On Mon, May 4, 2020 at 7:11 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 04-05-20 06:54:40, Shakeel Butt wrote:
> > > On Sun, May 3, 2020 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > > succeed. However if oom-killer does not find a process for killing, it
> > > > > dumps a lot of warnings.
> > > >
> > > > It shouldn't dump much more than the regular OOM report AFAICS. Sure
> > > > there is "Out of memory and no killable processes..." message printed as
> > > > well but is that a real problem?
> > > >
> > > > > Deleting a memcg does not reclaim memory from it and the memory can
> > > > > linger till there is a memory pressure. One normal way to proactively
> > > > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > > > memcg. However if some of the memcg's memory is pinned by others, this
> > > > > operation can trigger an oom-kill without any process and thus can log a
> > > > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> > > >
> > > > OK, I can see why you might want to use memory.max for that purpose but
> > > > I do not really understand why the oom report is a problem here.
> > >
> > > It may not be a problem for an individual or small scale deployment
> > > but when "sweep before tear down" is the part of the workflow for
> > > thousands of machines cycling through hundreds of thousands of cgroups
> > > then we can potentially flood the logs with not useful dumps and may
> > > hide (or overflow) any useful information in the logs.
> >
> > If you are doing this in a large scale and the oom report is really a
> > problem then you shouldn't be resetting hard limit to 0 in the first
> > place.
> >
> 
> I think I have pretty clearly described why we want to reset the hard
> limit to 0, so, unless there is an alternative I don't see why we
> should not be doing this.

I am not saying you shouldn't be doing that. I am just saying that if
you do then you have to live with oom reports.

> > > > memory.max can trigger the oom kill and user should be expecting the oom
> > > > report under that condition. Why is "no eligible task" so special? Is it
> > > > because you know that there won't be any tasks for your particular case?
> > > > What about other use cases where memory.max is not used as a "sweep
> > > > before tear down"?
> > >
> > > What other such use-cases would be? The only use-case I can envision
> > > of adjusting limits dynamically of a live cgroup are resource
> > > managers. However for cgroup v2, memory.high is the recommended way to
> > > limit the usage, so, why would resource managers be changing
> > > memory.max instead of memory.high? I am not sure. What do you think?
> >
> > There are different reasons to use the hard limit. Mostly to contain
> > potential runaways. While high limit might be a sufficient measure to
> > achieve that as well the hard limit is the last resort. And it clearly
> > has the oom killer semantic so I am not really sure why you are
> > comparing the two.
> >
> 
> I am trying to see if "no eligible task" is really an issue and should
> be warned for the "other use cases". The only real use-case I can
> think of are resource managers adjusting the limit dynamically. I
> don't see "no eligible task" a concerning reason for such use-case.

It is very much a concerning reason to notify about like any other OOM
situation due to hard limit breach. In this case it is worse in some
sense because the limit cannot be trimmed down because there is no
directly reclaimable memory at all. Such an oom situation is
effectively conserved.
Shakeel Butt May 4, 2020, 3:35 p.m. UTC | #24
On Mon, May 4, 2020 at 8:00 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
> > On Mon, May 4, 2020 at 7:11 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 04-05-20 06:54:40, Shakeel Butt wrote:
> > > > On Sun, May 3, 2020 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > > > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > > > succeed. However if oom-killer does not find a process for killing, it
> > > > > > dumps a lot of warnings.
> > > > >
> > > > > It shouldn't dump much more than the regular OOM report AFAICS. Sure
> > > > > there is "Out of memory and no killable processes..." message printed as
> > > > > well but is that a real problem?
> > > > >
> > > > > > Deleting a memcg does not reclaim memory from it and the memory can
> > > > > > linger till there is a memory pressure. One normal way to proactively
> > > > > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > > > > memcg. However if some of the memcg's memory is pinned by others, this
> > > > > > operation can trigger an oom-kill without any process and thus can log a
> > > > > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> > > > >
> > > > > OK, I can see why you might want to use memory.max for that purpose but
> > > > > I do not really understand why the oom report is a problem here.
> > > >
> > > > It may not be a problem for an individual or small scale deployment
> > > > but when "sweep before tear down" is the part of the workflow for
> > > > thousands of machines cycling through hundreds of thousands of cgroups
> > > > then we can potentially flood the logs with not useful dumps and may
> > > > hide (or overflow) any useful information in the logs.
> > >
> > > If you are doing this in a large scale and the oom report is really a
> > > problem then you shouldn't be resetting hard limit to 0 in the first
> > > place.
> > >
> >
> > I think I have pretty clearly described why we want to reset the hard
> > limit to 0, so, unless there is an alternative I don't see why we
> > should not be doing this.
>
> I am not saying you shouldn't be doing that. I am just saying that if
> you do then you have to live with oom reports.
>
> > > > > memory.max can trigger the oom kill and user should be expecting the oom
> > > > > report under that condition. Why is "no eligible task" so special? Is it
> > > > > because you know that there won't be any tasks for your particular case?
> > > > > What about other use cases where memory.max is not used as a "sweep
> > > > > before tear down"?
> > > >
> > > > What other such use-cases would be? The only use-case I can envision
> > > > of adjusting limits dynamically of a live cgroup are resource
> > > > managers. However for cgroup v2, memory.high is the recommended way to
> > > > limit the usage, so, why would resource managers be changing
> > > > memory.max instead of memory.high? I am not sure. What do you think?
> > >
> > > There are different reasons to use the hard limit. Mostly to contain
> > > potential runaways. While high limit might be a sufficient measure to
> > > achieve that as well the hard limit is the last resort. And it clearly
> > > has the oom killer semantic so I am not really sure why you are
> > > comparing the two.
> > >
> >
> > I am trying to see if "no eligible task" is really an issue and should
> > be warned for the "other use cases". The only real use-case I can
> > think of are resource managers adjusting the limit dynamically. I
> > don't see "no eligible task" a concerning reason for such use-case.
>
> It is very much a concerning reason to notify about like any other OOM
> situation due to hard limit breach. In this case it is worse in some
> sense because the limit cannot be trimmed down because there is no
> directly reclaimable memory at all. Such an oom situation is
> effectively conserved.
> --

Let me make a more precise statement and tell me if you agree. The "no
eligible task" is concerning for the charging path but not for the
writer of memory.max. The writer can read the usage and
cgroup.[procs|events] to figure out the situation if needed. Usually
such writers (i.e. resource managers) use memory.high in addition to
memory.max. First set memory.high and once the usage is below the high
then set max to not induce the oom-kills.

Shakeel
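
As an illustration of the "high first, then max" ordering described above,
here is a minimal userspace sketch (not from the thread; the cgroup path,
the 100M limit and the polling interval are assumptions):

	/* sweep a cgroup without inducing oom-kills: lower memory.high,
	 * wait for the usage to drop, and only then clamp memory.max */
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static void write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f || fprintf(f, "%s", val) < 0) {
			perror(path);
			exit(1);
		}
		fclose(f);
	}

	static unsigned long long read_ull(const char *path)
	{
		unsigned long long v = 0;
		FILE *f = fopen(path, "r");

		if (!f || fscanf(f, "%llu", &v) != 1) {
			perror(path);
			exit(1);
		}
		fclose(f);
		return v;
	}

	int main(void)
	{
		const char *grp = "/sys/fs/cgroup/example";    /* hypothetical group */
		const unsigned long long limit = 100ULL << 20; /* 100M, arbitrary */
		char high[256], max[256], cur[256];

		snprintf(high, sizeof(high), "%s/memory.high", grp);
		snprintf(max,  sizeof(max),  "%s/memory.max",  grp);
		snprintf(cur,  sizeof(cur),  "%s/memory.current", grp);

		write_str(high, "104857600");	/* 1. reclaim via memory.high */
		while (read_ull(cur) > limit)
			usleep(100 * 1000);	/* 2. wait for usage to drop */

		write_str(max, "104857600");	/* 3. only now set memory.max */
		return 0;
	}
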
Yafang Shao May 4, 2020, 3:39 p.m. UTC | #25
On Mon, May 4, 2020 at 11:36 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, May 4, 2020 at 8:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
> > > On Mon, May 4, 2020 at 7:11 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Mon 04-05-20 06:54:40, Shakeel Butt wrote:
> > > > > On Sun, May 3, 2020 at 11:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Thu 30-04-20 11:27:12, Shakeel Butt wrote:
> > > > > > > Lowering memory.max can trigger an oom-kill if the reclaim does not
> > > > > > > succeed. However if oom-killer does not find a process for killing, it
> > > > > > > dumps a lot of warnings.
> > > > > >
> > > > > > It shouldn't dump much more than the regular OOM report AFAICS. Sure
> > > > > > there is "Out of memory and no killable processes..." message printed as
> > > > > > well but is that a real problem?
> > > > > >
> > > > > > > Deleting a memcg does not reclaim memory from it and the memory can
> > > > > > > linger till there is a memory pressure. One normal way to proactively
> > > > > > > reclaim such memory is to set memory.max to 0 just before deleting the
> > > > > > > memcg. However if some of the memcg's memory is pinned by others, this
> > > > > > > operation can trigger an oom-kill without any process and thus can log a
> > > > > > > lot un-needed warnings. So, ignore all such warnings from memory.max.
> > > > > >
> > > > > > OK, I can see why you might want to use memory.max for that purpose but
> > > > > > I do not really understand why the oom report is a problem here.
> > > > >
> > > > > It may not be a problem for an individual or small scale deployment
> > > > > but when "sweep before tear down" is the part of the workflow for
> > > > > thousands of machines cycling through hundreds of thousands of cgroups
> > > > > then we can potentially flood the logs with not useful dumps and may
> > > > > hide (or overflow) any useful information in the logs.
> > > >
> > > > If you are doing this in a large scale and the oom report is really a
> > > > problem then you shouldn't be resetting hard limit to 0 in the first
> > > > place.
> > > >
> > >
> > > I think I have pretty clearly described why we want to reset the hard
> > > limit to 0, so, unless there is an alternative I don't see why we
> > > should not be doing this.
> >
> > I am not saying you shouldn't be doing that. I am just saying that if
> > you do then you have to live with oom reports.
> >
> > > > > > memory.max can trigger the oom kill and user should be expecting the oom
> > > > > > report under that condition. Why is "no eligible task" so special? Is it
> > > > > > because you know that there won't be any tasks for your particular case?
> > > > > > What about other use cases where memory.max is not used as a "sweep
> > > > > > before tear down"?
> > > > >
> > > > > What other such use-cases would be? The only use-case I can envision
> > > > > of adjusting limits dynamically of a live cgroup are resource
> > > > > managers. However for cgroup v2, memory.high is the recommended way to
> > > > > limit the usage, so, why would resource managers be changing
> > > > > memory.max instead of memory.high? I am not sure. What do you think?
> > > >
> > > > There are different reasons to use the hard limit. Mostly to contain
> > > > potential runaways. While high limit might be a sufficient measure to
> > > > achieve that as well the hard limit is the last resort. And it clearly
> > > > has the oom killer semantic so I am not really sure why you are
> > > > comparing the two.
> > > >
> > >
> > > I am trying to see if "no eligible task" is really an issue and should
> > > be warned for the "other use cases". The only real use-case I can
> > > think of are resource managers adjusting the limit dynamically. I
> > > don't see "no eligible task" a concerning reason for such use-case.
> >
> > It is very much a concerning reason to notify about like any other OOM
> > situation due to hard limit breach. In this case it is worse in some
> > sense because the limit cannot be trimmed down because there is no
> > directly reclaimable memory at all. Such an oom situation is
> > effectively conserved.
> > --
>
> Let me make a more precise statement and tell me if you agree. The "no
> eligible task" is concerning for the charging path but not for the
> writer of memory.max. The writer can read the usage and
> cgroup.[procs|events] to figure out the situation if needed.

Agreed.
cgroup.[procs|events] can give the admin all they want in this situation.
The oom report is redundant information, really.

> Usually
> such writers (i.e. resource managers) use memory.high in addition to
> memory.max. First set memory.high and once the usage is below the high
> then set max to not induce the oom-kills.
>
Tetsuo Handa May 4, 2020, 3:44 p.m. UTC | #26
On 2020/05/04 23:57, Shakeel Butt wrote:
> On Mon, May 4, 2020 at 7:20 AM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>
>> On 2020/05/04 22:54, Shakeel Butt wrote:
>>> It may not be a problem for an individual or small scale deployment
>>> but when "sweep before tear down" is the part of the workflow for
>>> thousands of machines cycling through hundreds of thousands of cgroups
>>> then we can potentially flood the logs with not useful dumps and may
>>> hide (or overflow) any useful information in the logs.
>>
>> I'm proposing a patch which allows configuring which OOM-related messages
>> should be sent to consoles at
>> https://lkml.kernel.org/r/20200424024239.63607-1-penguin-kernel@I-love.SAKURA.ne.jp .
>> Will that approach help you?
> 
> From what I understand, that patch is specifically for controlling
> messages to consoles. The messages will still be in logs, right?
> 

Right.

If you want to control which OOM-related messages should be sent to syslog,
we could use similar approach.
Michal Hocko May 4, 2020, 4:06 p.m. UTC | #27
On Mon 04-05-20 08:35:57, Shakeel Butt wrote:
> On Mon, May 4, 2020 at 8:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
[...]
> > > I am trying to see if "no eligible task" is really an issue and should
> > > be warned for the "other use cases". The only real use-case I can
> > > think of are resource managers adjusting the limit dynamically. I
> > > don't see "no eligible task" a concerning reason for such use-case.
> >
> > It is very much a concerning reason to notify about like any other OOM
> > situation due to hard limit breach. In this case it is worse in some
> > sense because the limit cannot be trimmed down because there is no
> > directly reclaimable memory at all. Such an oom situation is
> > effectively conserved.
> > --
> 
> Let me make a more precise statement and tell me if you agree. The "no
> eligible task" is concerning for the charging path but not for the
> writer of memory.max. The writer can read the usage and
> cgroup.[procs|events] to figure out the situation if needed.

I really hate to repeat myself but this is no different from a regular
oom situation. Admin sets the hard limit and the kernel tries to act
upon that.

You cannot make any assumption about what admin wanted or didn't want
to see. We simply trigger the oom killer on memory.max and this is a
documented behavior. No eligible task or no task at all is simply a
corner case when the kernel cannot act and mentions that along with the
oom report so that whoever consumes that information can debug or act on
that fact.

Silencing the oom report is simply removing a potentially useful
aid to debug further a potential problem. But let me repeat this is not
really any different from a regular oom situation when the oom killer
is able to act.
Shakeel Butt May 4, 2020, 7:23 p.m. UTC | #28
On Mon, May 4, 2020 at 9:06 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 04-05-20 08:35:57, Shakeel Butt wrote:
> > On Mon, May 4, 2020 at 8:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Mon 04-05-20 07:53:01, Shakeel Butt wrote:
> [...]
> > > > I am trying to see if "no eligible task" is really an issue and should
> > > > be warned for the "other use cases". The only real use-case I can
> > > > think of are resource managers adjusting the limit dynamically. I
> > > > don't see "no eligible task" a concerning reason for such use-case.
> > >
> > > It is very much a concerning reason to notify about like any other OOM
> > > situation due to hard limit breach. In this case it is worse in some
> > > sense because the limit cannot be trimmed down because there is no
> > > directly reclaimable memory at all. Such an oom situation is
> > > effectively conserved.
> > > --
> >
> > Let me make a more precise statement and tell me if you agree. The "no
> > eligible task" is concerning for the charging path but not for the
> > writer of memory.max. The writer can read the usage and
> > cgroup.[procs|events] to figure out the situation if needed.
>
> I really hate to repeat myself but this is no different from a regular
> oom situation.

Conceptually yes there is no difference but there is no *divine
restriction* to not make a difference if there is a real world
use-case which would benefit from it.

> Admin sets the hard limit and the kernel tries to act
> upon that.
>
> You cannot make any assumption about what admin wanted or didn't want
> to see.

Actually we always make assumptions on how the feature we implement
will be used and as new use-cases come the assumptions evolve.

> We simply trigger the oom killer on memory.max and this is a
> documented behavior. No eligible task or no task at all is simply a
> corner case

For "sweep before tear down" use-case this is not a corner case.

> when the kernel cannot act and mentions that along with the
> oom report so that whoever consumes that information can debug or act on
> that fact.
>
> Silencing the oom report is simply removing a potentially useful
> aid to debug further a potential problem.

*Potentially* useful for debugging versus actually beneficial for
"sweep before tear down" use-case. Also I am not saying to make "no
dumps for memory.max when no eligible tasks" a set in stone rule. We
can always reevaluate when such information will actually be useful.

Johannes/Andrew, what's your opinion?
Michal Hocko May 5, 2020, 7:13 a.m. UTC | #29
On Mon 04-05-20 12:23:51, Shakeel Butt wrote:
[...]
> *Potentially* useful for debugging versus actually beneficial for
> "sweep before tear down" use-case.

I definitely do not want to prevent you from achieving what you
want/need. Let's get back to your argument on why you cannot use
memory.high for this purpose and what is the actual difference from
memory.max on the "sweep before removal". You've said

: Yes that would work but remote charging concerns me. Remote charging
: can still happen after the memcg is offlined and at the moment, high
: reclaim does not work for remote memcg and the usage can go till max
: or global pressure. This is most probably a misconfiguration and we
: might not receive the warnings in the log ever. Setting memory.max to
: 0 will definitely give such warnings.

So essentially the only reason you are not using memory.high which would
effectively achieve the same level of reclaim for your usecase is that
potential future remote charges could get unnoticed. I have proposed to
warn when charging to an offline memcg because that looks like a sign of
a bug to me. Having the hard limit clamped to 0 (or some other small
value) would complain loud by the oom report and no eligible tasks
message but it is unlikely to help stop such a usage because, well,
there is nothing reclaimable and we force the charge in that case. So
you are effectively in the memory.high like situation.

So instead of potentially removing a useful information can we focus on
the remote charging side of the problem and deal with it in a sensible
way? That would make memory.high usable for your usecase and I still
believe that this is what you should be using in the first place.
Shakeel Butt May 5, 2020, 3:03 p.m. UTC | #30
On Tue, May 5, 2020 at 12:13 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 04-05-20 12:23:51, Shakeel Butt wrote:
> [...]
> > *Potentially* useful for debugging versus actually beneficial for
> > "sweep before tear down" use-case.
>
> I definitely do not want to prevent you from achieving what you
> want/need. Let's get back to your argument on why you cannot use
> memory.high for this purpose and what is the actual difference from
> memory.max on the "sweep before removal". You've said
>
> : Yes that would work but remote charging concerns me. Remote charging
> : can still happen after the memcg is offlined and at the moment, high
> : reclaim does not work for remote memcg and the usage can go till max
> : or global pressure. This is most probably a misconfiguration and we
> : might not receive the warnings in the log ever. Setting memory.max to
> : 0 will definitely give such warnings.
>
> So essentially the only reason you are not using memory.high which would
> effectively achieve the same level of reclaim for your usecase is that
> potential future remote charges could get unnoticed.

Yes.

> I have proposed to
> warn when charging to an offline memcg because that looks like a sign of
> a bug to me.

Instead of a bug, I would say misconfiguration but there is at least a
genuine user i.e. buffer_head. It can be allocated in reclaim and
trigger remote charging but it should be fine as the page it is
attached to will possibly get freed soon. So, I don't think we want to
warn for all remote charges to an offlined memcg.

> Having the hard limit clamped to 0 (or some other small
> value) would complain loud by the oom report and no eligible tasks
> message but it is unlikely to help stop such a usage because, well,
> there is nothing reclaimable and we force the charge in that case. So
> you are effectively in the memory.high like situation.

Yes, effectively it will be similar to memory.high but at least we
will get early warnings.

Now, rethinking the remote charging of buffer_head to an offlined
memcg with memory.max=0: it seems special in the sense that it uses
__GFP_NOFAIL and will therefore skip the oom-killer and thus the
warnings. Maybe the right approach is, as you suggested, to always warn
when charging an offline memcg unless the charge is
(__GFP_NOFAIL|__GFP_RETRY_MAYFAIL). Though I am not sure if this is
doable without code duplication.

>
> So instead of potentially removing a useful information can we focus on
> the remote charging side of the problem and deal with it in a sensible
> way? That would make memory.high usable for your usecase and I still
> believe that this is what you should be using in the first place.

We talked about this at LSFMM'19 and I think the decision was to not
fix high reclaim for remote memcg until it became an actual issue. I
suppose now we can treat it as an actual issue.

There are a couple of open questions:
1) Should the remote chargers be throttled and do the high reclaim?
2) There can be multiple remote charges to multiple memcgs in a single
kernel entry. Should we handle such scenarios?

Shakeel
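
For illustration, the warn-on-offline-charge idea discussed above might look
roughly like the sketch below (not part of the posted patch; the helper name
and its call site in the charge path are assumptions):

	static inline void memcg_warn_offline_charge(struct mem_cgroup *memcg,
						     gfp_t gfp_mask)
	{
		/*
		 * Callers that cannot fail (e.g. buffer_head allocations from
		 * reclaim) or that will back off on their own are expected to
		 * hit offline memcgs legitimately, so stay quiet for them.
		 */
		if (gfp_mask & (__GFP_NOFAIL | __GFP_RETRY_MAYFAIL))
			return;

		/* mem_cgroup_online() turns false once the css is offlined */
		WARN_ON_ONCE(!mem_cgroup_online(memcg));
	}

Such a helper would presumably be called from try_charge() before the charge
is committed.
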
Johannes Weiner May 5, 2020, 3:27 p.m. UTC | #31
On Mon, May 04, 2020 at 12:23:51PM -0700, Shakeel Butt wrote:
> On Mon, May 4, 2020 at 9:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > I really hate to repeat myself but this is no different from a regular
> > oom situation.
> 
> Conceptually yes there is no difference but there is no *divine
> restriction* to not make a difference if there is a real world
> use-case which would benefit from it.

I would wholeheartedly agree with this in general.

However, we're talking about the very semantics that set memory.max
apart from memory.high: triggering OOM kills to enforce the limit.

> > when the kernel cannot act and mentions that along with the
> > oom report so that whoever consumes that information can debug or act on
> > that fact.
> >
> > Silencing the oom report is simply removing a potentially useful
> > aid to debug further a potential problem.
> 
> *Potentially* useful for debugging versus actually beneficial for
> "sweep before tear down" use-case. Also I am not saying to make "no
> dumps for memory.max when no eligible tasks" a set in stone rule. We
> can always reevaluate when such information will actually be useful.
> 
> Johannes/Andrew, what's your opinion?

I still think that if you want to sweep without triggering OOMs,
memory.high has the matching semantics.

As you pointed out, it doesn't work well for foreign charges, but that
is more of a limitation in the implementation than in the semantics:

	/*
	 * If the hierarchy is above the normal consumption range, schedule
	 * reclaim on returning to userland.  We can perform reclaim here
	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
	 * GFP_KERNEL can consistently be used during reclaim.  @memcg is
	 * not recorded as it most likely matches current's and won't
	 * change in the meantime.  As high limit is checked again before
	 * reclaim, the cost of mismatch is negligible.
	 */

Wouldn't it be more useful to fix that instead? It shouldn't be much
of a code change to do sync reclaim in try_charge().

Then you could express all things that you asked for without changing
any user-visible semantics: sweep an empty cgroup as well as possible,
do not oom on remaining charges that continue to be used by processes
outside the cgroup, do trigger oom on new foreign charges appearing
due to a misconfiguration.

	echo 0 > memory.high
	cat memory.current > memory.max

Would this work for you?
Shakeel Butt May 5, 2020, 3:35 p.m. UTC | #32
On Tue, May 5, 2020 at 8:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, May 04, 2020 at 12:23:51PM -0700, Shakeel Butt wrote:
> > On Mon, May 4, 2020 at 9:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > I really hate to repeat myself but this is no different from a regular
> > > oom situation.
> >
> > Conceptually yes there is no difference but there is no *divine
> > restriction* to not make a difference if there is a real world
> > use-case which would benefit from it.
>
> I would wholeheartedly agree with this in general.
>
> However, we're talking about the very semantics that set memory.max
> apart from memory.high: triggering OOM kills to enforce the limit.
>
> > > when the kernel cannot act and mentions that along with the
> > > oom report so that whoever consumes that information can debug or act on
> > > that fact.
> > >
> > > Silencing the oom report is simply removing a potentially useful
> > > aid to debug further a potential problem.
> >
> > *Potentially* useful for debugging versus actually beneficial for
> > "sweep before tear down" use-case. Also I am not saying to make "no
> > dumps for memory.max when no eligible tasks" a set in stone rule. We
> > can always reevaluate when such information will actually be useful.
> >
> > Johannes/Andrew, what's your opinion?
>
> I still think that if you want to sweep without triggering OOMs,
> memory.high has the matching semantics.
>
> As you pointed out, it doesn't work well for foreign charges, but that
> is more of a limitation in the implementation than in the semantics:
>
>         /*
>          * If the hierarchy is above the normal consumption range, schedule
>          * reclaim on returning to userland.  We can perform reclaim here
>          * if __GFP_RECLAIM but let's always punt for simplicity and so that
>          * GFP_KERNEL can consistently be used during reclaim.  @memcg is
>          * not recorded as it most likely matches current's and won't
>          * change in the meantime.  As high limit is checked again before
>          * reclaim, the cost of mismatch is negligible.
>          */
>
> Wouldn't it be more useful to fix that instead? It shouldn't be much
> of a code change to do sync reclaim in try_charge().

Sync reclaim would really simplify the remote charging case. Though
should sync reclaim only be done for remote charging or for all?

>
> Then you could express all things that you asked for without changing
> any user-visible semantics: sweep an empty cgroup as well as possible,
> do not oom on remaining charges that continue to be used by processes
> outside the cgroup, do trigger oom on new foreign charges appearing
> due to a misconfiguration.
>
>         echo 0 > memory.high
>         cat memory.current > memory.max
>
> Would this work for you?

Yes that would work. I will work on a patch.
Michal Hocko May 5, 2020, 3:49 p.m. UTC | #33
On Tue 05-05-20 08:35:45, Shakeel Butt wrote:
> On Tue, May 5, 2020 at 8:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, May 04, 2020 at 12:23:51PM -0700, Shakeel Butt wrote:
> > > On Mon, May 4, 2020 at 9:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > I really hate to repeat myself but this is no different from a regular
> > > > oom situation.
> > >
> > > Conceptually yes there is no difference but there is no *divine
> > > restriction* to not make a difference if there is a real world
> > > use-case which would benefit from it.
> >
> > I would wholeheartedly agree with this in general.
> >
> > However, we're talking about the very semantics that set memory.max
> > apart from memory.high: triggering OOM kills to enforce the limit.
> >
> > > > when the kernel cannot act and mentions that along with the
> > > > oom report so that whoever consumes that information can debug or act on
> > > > that fact.
> > > >
> > > > Silencing the oom report is simply removing a potentially useful
> > > > aid to debug further a potential problem.
> > >
> > > *Potentially* useful for debugging versus actually beneficial for
> > > "sweep before tear down" use-case. Also I am not saying to make "no
> > > dumps for memory.max when no eligible tasks" a set in stone rule. We
> > > can always reevaluate when such information will actually be useful.
> > >
> > > Johannes/Andrew, what's your opinion?
> >
> > I still think that if you want to sweep without triggering OOMs,
> > memory.high has the matching semantics.
> >
> > As you pointed out, it doesn't work well for foreign charges, but that
> > is more of a limitation in the implementation than in the semantics:
> >
> >         /*
> >          * If the hierarchy is above the normal consumption range, schedule
> >          * reclaim on returning to userland.  We can perform reclaim here
> >          * if __GFP_RECLAIM but let's always punt for simplicity and so that
> >          * GFP_KERNEL can consistently be used during reclaim.  @memcg is
> >          * not recorded as it most likely matches current's and won't
> >          * change in the meantime.  As high limit is checked again before
> >          * reclaim, the cost of mismatch is negligible.
> >          */
> >
> > Wouldn't it be more useful to fix that instead? It shouldn't be much
> > of a code change to do sync reclaim in try_charge().
> 
> Sync reclaim would really simplify the remote charging case. Though
> should sync reclaim only be done for remote charging or for all?

The simplest way around that would be to always do sync reclaim for
unpopulated memcgs because those do not have any user context to reclaim
from and for restricted allocation contexts like atomic allocations
maybe also for GFP_NO{IO,FS}.
Johannes Weiner May 5, 2020, 4:40 p.m. UTC | #34
On Tue, May 05, 2020 at 08:35:45AM -0700, Shakeel Butt wrote:
> On Tue, May 5, 2020 at 8:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, May 04, 2020 at 12:23:51PM -0700, Shakeel Butt wrote:
> > > On Mon, May 4, 2020 at 9:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > I really hate to repeat myself but this is no different from a regular
> > > > oom situation.
> > >
> > > Conceptually yes there is no difference but there is no *divine
> > > restriction* to not make a difference if there is a real world
> > > use-case which would benefit from it.
> >
> > I would wholeheartedly agree with this in general.
> >
> > However, we're talking about the very semantics that set memory.max
> > apart from memory.high: triggering OOM kills to enforce the limit.
> >
> > > > when the kernel cannot act and mentions that along with the
> > > > oom report so that whoever consumes that information can debug or act on
> > > > that fact.
> > > >
> > > > Silencing the oom report is simply removing a potentially useful
> > > > aid to debug further a potential problem.
> > >
> > > *Potentially* useful for debugging versus actually beneficial for
> > > "sweep before tear down" use-case. Also I am not saying to make "no
> > > dumps for memory.max when no eligible tasks" a set in stone rule. We
> > > can always reevaluate when such information will actually be useful.
> > >
> > > Johannes/Andrew, what's your opinion?
> >
> > I still think that if you want to sweep without triggering OOMs,
> > memory.high has the matching semantics.
> >
> > As you pointed out, it doesn't work well for foreign charges, but that
> > is more of a limitation in the implementation than in the semantics:
> >
> >         /*
> >          * If the hierarchy is above the normal consumption range, schedule
> >          * reclaim on returning to userland.  We can perform reclaim here
> >          * if __GFP_RECLAIM but let's always punt for simplicity and so that
> >          * GFP_KERNEL can consistently be used during reclaim.  @memcg is
> >          * not recorded as it most likely matches current's and won't
> >          * change in the meantime.  As high limit is checked again before
> >          * reclaim, the cost of mismatch is negligible.
> >          */
> >
> > Wouldn't it be more useful to fix that instead? It shouldn't be much
> > of a code change to do sync reclaim in try_charge().
> 
> Sync reclaim would really simplify the remote charging case. Though
> should sync reclaim only be done for remote charging or for all?

I would do it for all __GFP_RECLAIM callers, no need to special case
remote charging unnecessarily IMO.

We can do both the reclaim as well as the penalty throttling
synchronously, i.e. all of mem_cgroup_handle_over_high(). And then
punt to the userspace resume when we either didn't reclaim or are
still over the threshold after reclaim.

Btw we should probably kick off high_work rather than set userspace
resume on foreign charges, right? Otherwise we may not reclaim at all
when we push a group over its limit from outside (and in a
!__GFP_RECLAIM context).
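
A rough sketch of that direction, using 5.7-era names (the helper itself and
its call site in try_charge() are assumptions, not a posted patch): reclaim
synchronously when the context allows it, kick high_work for !__GFP_RECLAIM
foreign charges, and keep the return-to-userspace path only as a fallback.

	static void charge_high_excess(struct mem_cgroup *memcg,
				       unsigned int nr_pages, gfp_t gfp_mask)
	{
		if (!gfpflags_allow_blocking(gfp_mask)) {
			/* e.g. a remote charge from an atomic context */
			schedule_work(&memcg->high_work);
			return;
		}

		/* reclaim synchronously from the memcg that was actually charged */
		reclaim_high(memcg, nr_pages, gfp_mask);

		/* still over high? leave penalty throttling to userspace return */
		if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high)) {
			current->memcg_nr_pages_over_high += nr_pages;
			set_notify_resume(current);
		}
	}
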
Johannes Weiner May 5, 2020, 4:57 p.m. UTC | #35
On Tue, May 05, 2020 at 08:03:17AM -0700, Shakeel Butt wrote:
> On Tue, May 5, 2020 at 12:13 AM Michal Hocko <mhocko@kernel.org> wrote:
> > So instead of potentially removing a useful information can we focus on
> > the remote charging side of the problem and deal with it in a sensible
> > way? That would make memory.high usable for your usecase and I still
> > believe that this is what you should be using in the first place.
> 
> We talked about this at LSFMM'19 and I think the decision was to not
> fix high reclaim for remote memcg until it became an actual issue. I
> suppose now we can treat it as an actual issue.
> 
> There are a couple of open questions:
> 1) Should the remote chargers be throttled and do the high reclaim?

I would say it depends on the caller. We may need both: a centralized
charger like a kthread that does work on behalf of many cgroups should
not be throttled, or we risk priority inversions - maybe not even held
up for reclaim, although that would need some thinking about
memory.max as well - whether remote charges should be allowed to
overrun memory.max temporarily, and we punt all reclaim from such
contexts to a worker. (Preferably, in the future, with all the cpu
cycles of that worker accounted to the cgroup. We want that for
kswapd-style background reclaim as well).

We do similar things with IO for example, when offloading IO to a
central kthread. The kthread is allowed to proceed past a cgroup's
allowance in order to make forward progress, and the cgroup will be
forced to pay up retroactively when it comes back for more of the
resource.
diff mbox series

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c265f019..6345dc55df64 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -52,6 +52,9 @@  struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+	/* Do not warn even if there is no process to be killed. */
+	bool no_warn;
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 317dbbaac603..a1f00d9b9bb0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1571,7 +1571,7 @@  unsigned long mem_cgroup_size(struct mem_cgroup *memcg)
 }
 
 static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				     int order)
+				     int order, bool no_warn)
 {
 	struct oom_control oc = {
 		.zonelist = NULL,
@@ -1579,6 +1579,7 @@  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.memcg = memcg,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.no_warn = no_warn,
 	};
 	bool ret;
 
@@ -1821,7 +1822,7 @@  static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		mem_cgroup_oom_notify(memcg);
 
 	mem_cgroup_unmark_under_oom(memcg);
-	if (mem_cgroup_out_of_memory(memcg, mask, order))
+	if (mem_cgroup_out_of_memory(memcg, mask, order, false))
 		ret = OOM_SUCCESS;
 	else
 		ret = OOM_FAILED;
@@ -1880,7 +1881,7 @@  bool mem_cgroup_oom_synchronize(bool handle)
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
-					 current->memcg_oom_order);
+					 current->memcg_oom_order, false);
 	} else {
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
@@ -6106,7 +6107,7 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 		}
 
 		memcg_memory_event(memcg, MEMCG_OOM);
-		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
+		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0, true))
 			break;
 	}
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 463b3d74a64a..5ace39f6fe1e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1098,7 +1098,7 @@  bool out_of_memory(struct oom_control *oc)
 
 	select_bad_process(oc);
 	/* Found nothing?!?! */
-	if (!oc->chosen) {
+	if (!oc->chosen && !oc->no_warn) {
 		dump_header(oc, NULL);
 		pr_warn("Out of memory and no killable processes...\n");
 		/*