mm, oom: make a last minute check to prevent unnecessary memcg oom kills

Message ID alpine.DEB.2.21.2003101454580.142656@chino.kir.corp.google.com (mailing list archive)
State New, archived
Series: mm, oom: make a last minute check to prevent unnecessary memcg oom kills

Commit Message

David Rientjes March 10, 2020, 9:55 p.m. UTC
Killing a user process as a result of hitting memcg limits is a serious
decision that is unfortunately needed only when no forward progress in
reclaiming memory can be made.

Deciding the appropriate oom victim can take a sufficient amount of time
that allows another process that is exiting to actually uncharge to the
same memcg hierarchy and prevent unnecessarily killing user processes.

An example is to prevent *multiple* unnecessary oom kills on a system
with two cores where the oom kill occurs when there is an abundance of
free memory available:

Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
<immediately after>
repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
Call Trace:
 dump_stack+0x78/0xb6
 dump_header+0x55/0x240
 oom_kill_process+0xc5/0x170
 out_of_memory+0x305/0x4a0
 try_charge+0x77b/0xac0
 mem_cgroup_try_charge+0x10a/0x220
 mem_cgroup_try_charge_delay+0x1e/0x40
 handle_mm_fault+0xdf2/0x15f0
 do_user_addr_fault+0x21f/0x420
 async_page_fault+0x2f/0x40
memory: usage 61336kB, limit 102400kB, failcnt 74

Notice the second memcg oom kill shows usage is >40MB below its limit of
100MB but a process is still unnecessarily killed because the decision has
already been made to oom kill by calling out_of_memory() before the
initial victim had a chance to uncharge its memory.

Make a last minute check to determine if an oom kill is really needed to
prevent unnecessary oom killing.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/memcontrol.h |  7 +++++++
 mm/memcontrol.c            |  2 +-
 mm/oom_kill.c              | 16 +++++++++++++---
 3 files changed, 21 insertions(+), 4 deletions(-)
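
For reference, the "repro" process named in the log above is not included
with this patch. A minimal reproducer of that kind, run as two copies
inside a memcg limited to 100MB on a two-core machine, might look like the
hypothetical sketch below; the allocation size and setup are assumptions
inferred from the log, not the reporter's actual program.

/* repro.c - hypothetical sketch: each copy faults in ~60MB of anonymous
 * memory so that two copies together exceed a 100MB memcg limit and both
 * can enter the memcg OOM path at roughly the same time.
 */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALLOC_SIZE	(60UL << 20)	/* ~60MB per copy */

int main(void)
{
	char *buf = malloc(ALLOC_SIZE);

	if (!buf)
		return 1;
	/* Touch every byte so the memcg charge actually happens. */
	memset(buf, 0xff, ALLOC_SIZE);
	pause();	/* hold the charge until killed */
	return 0;
}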

Comments

Michal Hocko March 10, 2020, 10:19 p.m. UTC | #1
On Tue 10-03-20 14:55:50, David Rientjes wrote:
> Killing a user process as a result of hitting memcg limits is a serious
> decision that is unfortunately needed only when no forward progress in
> reclaiming memory can be made.
> 
> Deciding the appropriate oom victim can take a sufficient amount of time
> that allows another process that is exiting to actually uncharge to the
> same memcg hierarchy and prevent unnecessarily killing user processes.
> 
> An example is to prevent *multiple* unnecessary oom kills on a system
> with two cores where the oom kill occurs when there is an abundance of
> free memory available:
> 
> Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
> <immediately after>
> repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
> Call Trace:
>  dump_stack+0x78/0xb6
>  dump_header+0x55/0x240
>  oom_kill_process+0xc5/0x170
>  out_of_memory+0x305/0x4a0
>  try_charge+0x77b/0xac0
>  mem_cgroup_try_charge+0x10a/0x220
>  mem_cgroup_try_charge_delay+0x1e/0x40
>  handle_mm_fault+0xdf2/0x15f0
>  do_user_addr_fault+0x21f/0x420
>  async_page_fault+0x2f/0x40
> memory: usage 61336kB, limit 102400kB, failcnt 74
> 
> Notice the second memcg oom kill shows usage is >40MB below its limit of
> 100MB but a process is still unnecessarily killed because the decision has
> already been made to oom kill by calling out_of_memory() before the
> initial victim had a chance to uncharge its memory.

Could you be more specific about the specific workload please?

> Make a last minute check to determine if an oom kill is really needed to
> prevent unnecessary oom killing.

I really see no reason why the memcg oom should behave differently from
the global case. In both cases there will be a point of no return.
Wherever it is done it will be racy and the oom victim selection will
play the race window role. There is simply no way around that without
making the whole thing completely synchronous. This all looks like a
micro optimization and I would really like to see a relevant real world
usecase presented before new special casing is added.

> 
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/memcontrol.h |  7 +++++++
>  mm/memcontrol.c            |  2 +-
>  mm/oom_kill.c              | 16 +++++++++++++---
>  3 files changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -445,6 +445,8 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
>  int mem_cgroup_scan_tasks(struct mem_cgroup *,
>  			  int (*)(struct task_struct *, void *), void *);
>  
> +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> +
>  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
>  {
>  	if (mem_cgroup_disabled())
> @@ -945,6 +947,11 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  	return 0;
>  }
>  
> +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> +{
> +	return 0;
> +}
> +
>  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
>  {
>  	return 0;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1286,7 +1286,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
>   * Returns the maximum amount of memory @mem can be charged with, in
>   * pages.
>   */
> -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  {
>  	unsigned long margin = 0;
>  	unsigned long count;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -972,9 +972,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	}
>  	task_unlock(victim);
>  
> -	if (__ratelimit(&oom_rs))
> -		dump_header(oc, victim);
> -
>  	/*
>  	 * Do we need to kill the entire memory cgroup?
>  	 * Or even one of the ancestor memory cgroups?
> @@ -982,6 +979,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	 */
>  	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
>  
> +	if (is_memcg_oom(oc)) {
> +		cond_resched();
> +
> +		/* One last check: do we *really* need to kill? */
> +		if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
> +			put_task_struct(victim);
> +			return;
> +		}
> +	}
> +
> +	if (__ratelimit(&oom_rs))
> +		dump_header(oc, victim);
> +
>  	__oom_kill_process(victim, message);
>  
>  	/*
David Rientjes March 10, 2020, 10:54 p.m. UTC | #2
On Tue, 10 Mar 2020, Michal Hocko wrote:

> On Tue 10-03-20 14:55:50, David Rientjes wrote:
> > Killing a user process as a result of hitting memcg limits is a serious
> > decision that is unfortunately needed only when no forward progress in
> > reclaiming memory can be made.
> > 
> > Deciding the appropriate oom victim can take a sufficient amount of time
> > that allows another process that is exiting to actually uncharge to the
> > same memcg hierarchy and prevent unnecessarily killing user processes.
> > 
> > An example is to prevent *multiple* unnecessary oom kills on a system
> > with two cores where the oom kill occurs when there is an abundance of
> > free memory available:
> > 
> > Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
> > <immediately after>
> > repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> > CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
> > Call Trace:
> >  dump_stack+0x78/0xb6
> >  dump_header+0x55/0x240
> >  oom_kill_process+0xc5/0x170
> >  out_of_memory+0x305/0x4a0
> >  try_charge+0x77b/0xac0
> >  mem_cgroup_try_charge+0x10a/0x220
> >  mem_cgroup_try_charge_delay+0x1e/0x40
> >  handle_mm_fault+0xdf2/0x15f0
> >  do_user_addr_fault+0x21f/0x420
> >  async_page_fault+0x2f/0x40
> > memory: usage 61336kB, limit 102400kB, failcnt 74
> > 
> > Notice the second memcg oom kill shows usage is >40MB below its limit of
> > 100MB but a process is still unnecessarily killed because the decision has
> > already been made to oom kill by calling out_of_memory() before the
> > initial victim had a chance to uncharge its memory.
> 
> Could you be more specific about the specific workload please?
> 

Robert, could you elaborate on the user-visible effects of this issue that 
caused it to initially get reported?

> > Make a last minute check to determine if an oom kill is really needed to
> > prevent unnecessary oom killing.
> 
> I really see no reason why the memcg oom should behave differently from
> the global case. In both cases there will be a point of no return.
> Wherever it is done it will be racy and the oom victim selection will
> play the race window role. There is simply no way around that without
> making the whole thing completely synchronous. This all looks like a
> micro optimization and I would really like to see a relevant real world
> usecase presented before new special casing is added.
> 

The patch certainly prevents unnecessary oom kills when there is a pending 
victim that uncharges its memory between invoking the oom killer and 
finding MMF_OOM_SKIP in the list of eligible tasks and it's much more 
common on systems with limited cpu cores.

Adding support for the global case is more difficult because we rely on 
multiple heuristics, not only watermarks, to determine if we can allocate.  
It would likely require using a lot of the logic from the page allocator 
(alloc flags, watermark check, mempolicy awareness, cpusets) to make it 
work reliably and not miss a corner case where we actually don't end up 
oom killing anything at all.

Memcg charging, on the other hand, is much simpler, as exhibited by this 
patch, since it's only about the number of pages to charge, and avoiding 
unnecessarily oom killing a user process is of paramount importance to 
that user.
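
(For reference, the mem_cgroup_margin() helper the patch exports is the
existing one in mm/memcontrol.c; in 5.6-era kernels it looks roughly like
the sketch below. This is reproduced from memory, so treat the details,
especially the swap-accounting branch, as approximate.)

static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
{
	unsigned long margin = 0;
	unsigned long count;
	unsigned long limit;

	/* Pages that can still be charged against the memory limit. */
	count = page_counter_read(&memcg->memory);
	limit = READ_ONCE(memcg->memory.max);
	if (count < limit)
		margin = limit - count;

	/* With memory+swap accounting, the tighter of the two limits wins. */
	if (do_memsw_account()) {
		count = page_counter_read(&memcg->memsw);
		limit = READ_ONCE(memcg->memsw.max);
		if (count <= limit)
			margin = min(margin, limit - count);
		else
			margin = 0;
	}

	return margin;
}

The last-minute check in the patch then just compares this margin, in
pages, against 1 << oc->order, the size of the charge that failed.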

Basically: it becomes very difficult for a cloud provider to say "look, 
your process was oom killed and it shows usage is 60MB in a memcg limited 
to 100MB" :)

> > 
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  include/linux/memcontrol.h |  7 +++++++
> >  mm/memcontrol.c            |  2 +-
> >  mm/oom_kill.c              | 16 +++++++++++++---
> >  3 files changed, 21 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -445,6 +445,8 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
> >  int mem_cgroup_scan_tasks(struct mem_cgroup *,
> >  			  int (*)(struct task_struct *, void *), void *);
> >  
> > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > +
> >  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
> >  {
> >  	if (mem_cgroup_disabled())
> > @@ -945,6 +947,11 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
> >  	return 0;
> >  }
> >  
> > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > +{
> > +	return 0;
> > +}
> > +
> >  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
> >  {
> >  	return 0;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1286,7 +1286,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> >   * Returns the maximum amount of memory @mem can be charged with, in
> >   * pages.
> >   */
> > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >  {
> >  	unsigned long margin = 0;
> >  	unsigned long count;
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -972,9 +972,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	}
> >  	task_unlock(victim);
> >  
> > -	if (__ratelimit(&oom_rs))
> > -		dump_header(oc, victim);
> > -
> >  	/*
> >  	 * Do we need to kill the entire memory cgroup?
> >  	 * Or even one of the ancestor memory cgroups?
> > @@ -982,6 +979,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	 */
> >  	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> >  
> > +	if (is_memcg_oom(oc)) {
> > +		cond_resched();
> > +
> > +		/* One last check: do we *really* need to kill? */
> > +		if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
> > +			put_task_struct(victim);
> > +			return;
> > +		}
> > +	}
> > +
> > +	if (__ratelimit(&oom_rs))
> > +		dump_header(oc, victim);
> > +
> >  	__oom_kill_process(victim, message);
> >  
> >  	/*
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko March 11, 2020, 8:39 a.m. UTC | #3
On Tue 10-03-20 15:54:44, David Rientjes wrote:
> On Tue, 10 Mar 2020, Michal Hocko wrote:
> 
> > On Tue 10-03-20 14:55:50, David Rientjes wrote:
> > > Killing a user process as a result of hitting memcg limits is a serious
> > > decision that is unfortunately needed only when no forward progress in
> > > reclaiming memory can be made.
> > > 
> > > Deciding the appropriate oom victim can take a sufficient amount of time
> > > that allows another process that is exiting to actually uncharge to the
> > > same memcg hierarchy and prevent unnecessarily killing user processes.
> > > 
> > > An example is to prevent *multiple* unnecessary oom kills on a system
> > > with two cores where the oom kill occurs when there is an abundance of
> > > free memory available:
> > > 
> > > Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
> > > <immediately after>
> > > repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> > > CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
> > > Call Trace:
> > >  dump_stack+0x78/0xb6
> > >  dump_header+0x55/0x240
> > >  oom_kill_process+0xc5/0x170
> > >  out_of_memory+0x305/0x4a0
> > >  try_charge+0x77b/0xac0
> > >  mem_cgroup_try_charge+0x10a/0x220
> > >  mem_cgroup_try_charge_delay+0x1e/0x40
> > >  handle_mm_fault+0xdf2/0x15f0
> > >  do_user_addr_fault+0x21f/0x420
> > >  async_page_fault+0x2f/0x40
> > > memory: usage 61336kB, limit 102400kB, failcnt 74
> > > 
> > > Notice the second memcg oom kill shows usage is >40MB below its limit of
> > > 100MB but a process is still unnecessarily killed because the decision has
> > > already been made to oom kill by calling out_of_memory() before the
> > > initial victim had a chance to uncharge its memory.
> > 
> > Could you be more specific about the specific workload please?
> > 
> 
> Robert, could you elaborate on the user-visible effects of this issue that 
> caused it to initially get reported?

Yes please, real life usecases are important when adding hacks like this
one and we should have clear data to support that the check actually helps
(in how many instances etc...)
Tetsuo Handa March 11, 2020, 11:41 a.m. UTC | #4
On 2020/03/11 7:54, David Rientjes wrote:
> The patch certainly prevents unnecessary oom kills when there is a pending 
> victim that uncharges its memory between invoking the oom killer and 
> finding MMF_OOM_SKIP in the list of eligible tasks and it's much more 
> common on systems with limited cpu cores.

I think that it is dump_header() which currently spends enough time (due to
synchronous printing) to make "the second memcg oom kill shows usage
is >40MB below its limit of 100MB" happen. Shouldn't we call dump_header()
and then do the last check and end with a "but did not kill anybody" message?
David Rientjes March 11, 2020, 7:51 p.m. UTC | #5
On Wed, 11 Mar 2020, Tetsuo Handa wrote:

> > The patch certainly prevents unnecessary oom kills when there is a pending 
> > victim that uncharges its memory between invoking the oom killer and 
> > finding MMF_OOM_SKIP in the list of eligible tasks and it's much more 
> > common on systems with limited cpu cores.
> 
> I think that it is dump_header() which currently spends enough time (due to
> synchronous printing) to make "the second memcg oom kill shows usage
> is >40MB below its limit of 100MB" happen. Shouldn't we call dump_header()
> and then do the last check and end with a "but did not kill anybody" message?
> 

Lol, I actually did that for internal testing as well :)  I didn't like 
how it spammed the kernel log and then basically said "just kidding, 
nothing was oom killed."

But if you think this would be helpful I can propose it as v2.
---
 include/linux/memcontrol.h |  7 +++++++
 mm/memcontrol.c            |  2 +-
 mm/oom_kill.c              | 14 +++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -445,6 +445,8 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
@@ -945,6 +947,11 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1286,7 +1286,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
 	unsigned long margin = 0;
 	unsigned long count;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -934,7 +934,6 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-#undef K
 
 /*
  * Kill provided task unless it's secured by setting
@@ -982,6 +981,18 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
 
+	/* One last check: do we *really* need to kill? */
+	if (is_memcg_oom(oc)) {
+		unsigned long margin = mem_cgroup_margin(oc->memcg);
+
+		if (unlikely(margin >= (1 << oc->order))) {
+			put_task_struct(victim);
+			pr_info("Suppressed oom kill, %lukB of memory can be charged\n",
+				K(margin));
+			return;
+		}
+	}
+
 	__oom_kill_process(victim, message);
 
 	/*
@@ -994,6 +1005,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		mem_cgroup_put(oom_group);
 	}
 }
+#undef K
 
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
Michal Hocko March 17, 2020, 7:59 a.m. UTC | #6
On Wed 11-03-20 09:39:01, Michal Hocko wrote:
> On Tue 10-03-20 15:54:44, David Rientjes wrote:
> > On Tue, 10 Mar 2020, Michal Hocko wrote:
> > 
> > > On Tue 10-03-20 14:55:50, David Rientjes wrote:
> > > > Killing a user process as a result of hitting memcg limits is a serious
> > > > decision that is unfortunately needed only when no forward progress in
> > > > reclaiming memory can be made.
> > > > 
> > > > Deciding the appropriate oom victim can take a sufficient amount of time
> > > > that allows another process that is exiting to actually uncharge to the
> > > > same memcg hierarchy and prevent unnecessarily killing user processes.
> > > > 
> > > > An example is to prevent *multiple* unnecessary oom kills on a system
> > > > with two cores where the oom kill occurs when there is an abundance of
> > > > free memory available:
> > > > 
> > > > Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
> > > > <immediately after>
> > > > repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> > > > CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
> > > > Call Trace:
> > > >  dump_stack+0x78/0xb6
> > > >  dump_header+0x55/0x240
> > > >  oom_kill_process+0xc5/0x170
> > > >  out_of_memory+0x305/0x4a0
> > > >  try_charge+0x77b/0xac0
> > > >  mem_cgroup_try_charge+0x10a/0x220
> > > >  mem_cgroup_try_charge_delay+0x1e/0x40
> > > >  handle_mm_fault+0xdf2/0x15f0
> > > >  do_user_addr_fault+0x21f/0x420
> > > >  async_page_fault+0x2f/0x40
> > > > memory: usage 61336kB, limit 102400kB, failcnt 74
> > > > 
> > > > Notice the second memcg oom kill shows usage is >40MB below its limit of
> > > > 100MB but a process is still unnecessarily killed because the decision has
> > > > already been made to oom kill by calling out_of_memory() before the
> > > > initial victim had a chance to uncharge its memory.
> > > 
> > > Could you be more specific about the specific workload please?
> > > 
> > 
> > Robert, could you elaborate on the user-visible effects of this issue that 
> > caused it to initially get reported?
> 
> Yes please, real life usecases are important when adding hacks like this
> one and we should have clear data to support that the check actually helps
> (in how many instances etc...)

Friendly ping.
Robert Kolchmeyer March 17, 2020, 6:25 p.m. UTC | #7
On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@google.com> wrote:
>
> Robert, could you elaborate on the user-visible effects of this issue that
> caused it to initially get reported?
>

Ami (now cc'ed) knows more, but here is my understanding. The use case
involves a Docker container running multiple processes. The container
has a memory limit set. The container contains two long-lived,
important processes p1 and p2, and some arbitrary, dynamic number of
usually ephemeral processes p3,...,pn. These processes are structured
in a hierarchy that looks like p1->p2->[p3,...,pn]; p1 is a parent of
p2, and p2 is the parent for all of the ephemeral processes p3,...,pn.

Since p1 and p2 are long-lived and important, the user does not want
p1 and p2 to be oom-killed. However, p3,...,pn are expected to use a
lot of memory, and it's ok for those processes to be oom-killed.

If the user sets oom_score_adj on p1 and p2 to make them very unlikely
to be oom-killed, p3,...,pn will inherit the oom_score_adj value,
which is bad. Additionally, setting oom_score_adj on p3,...,pn is
tricky, since processes in the Docker container (specifically p1 and
p2) don't have permissions to set oom_score_adj on p3,...,pn. The
ephemeral nature of p3,...,pn also makes setting oom_score_adj on them
tricky after they launch.
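
(A tiny hypothetical illustration of the inheritance and permission
problem described above: oom_score_adj is copied across fork(), and
lowering it below its current value requires CAP_SYS_RESOURCE, which
touches on the permission issue mentioned. This sketch is not from the
reporter's system.)

/* oom_adj_inherit.c - hypothetical sketch: a protected parent's
 * oom_score_adj is inherited by every child it forks.
 */
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* Protect "p2"; lowering the value needs CAP_SYS_RESOURCE. */
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);

	if (fd >= 0) {
		write(fd, "-1000", 5);
		close(fd);
	}

	if (fork() == 0) {
		/* One of the ephemeral p3,...,pn: it starts out with the
		 * same -1000 and is now just as protected as its parent. */
		execlp("cat", "cat", "/proc/self/oom_score_adj", NULL);
		_exit(1);
	}
	wait(NULL);
	return 0;
}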

So, the user hopes that when one of p3,...,pn triggers an oom
condition in the Docker container, the oom killer will almost always
kill processes from p3,...,pn (and not kill p1 or p2, which are both
important and unlikely to trigger an oom condition). The issue of more
processes being killed than are strictly necessary is resulting in p1
or p2 being killed much more frequently when one of p3,...,pn triggers
an oom condition, and p1 or p2 being killed is very disruptive for the
user (my understanding is that p1 or p2 going down with high frequency
results in significant unhealthiness in the user's service).

The change proposed here has not been run in a production system, and
so I don't think anyone has data that conclusively demonstrates that
this change will solve the user's problem. But, from observations made
in their production system, the user is confident that addressing this
aggressive oom killing will solve their problem, and we have data that
shows this change does considerably reduce the frequency of aggressive
oom killing (from 61/100 oom killing events down to 0/100 with this
change).

Hope this gives a bit more context.

Thanks,
-Robert
Ami Fischman March 17, 2020, 7 p.m. UTC | #8
On Tue, Mar 17, 2020 at 11:26 AM Robert Kolchmeyer
<rkolchmeyer@google.com> wrote:
>
> On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@google.com> wrote:
> >
> > Robert, could you elaborate on the user-visible effects of this issue that
> > caused it to initially get reported?
>
> Ami (now cc'ed) knows more, but here is my understanding.

Robert's description of the mechanics we observed is accurate.

We discovered this regression in the oom-killer's behavior when
attempting to upgrade our system. The fraction of the system that
went unhealthy due to this issue was approximately equal to the
_sum_ of all other causes of unhealth, which are many and varied,
but each of which contribute only a small amount of
unhealth. This issue forced a rollback to the previous kernel
where we ~never see this behavior, returning our unhealth levels
to the previous background levels.

Cheers,
-a
Michal Hocko March 18, 2020, 9:55 a.m. UTC | #9
On Tue 17-03-20 11:25:52, Robert Kolchmeyer wrote:
> On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@google.com> wrote:
> >
> > Robert, could you elaborate on the user-visible effects of this issue that
> > caused it to initially get reported?
> >
> 
> Ami (now cc'ed) knows more, but here is my understanding. The use case
> involves a Docker container running multiple processes. The container
> has a memory limit set. The container contains two long-lived,
> important processes p1 and p2, and some arbitrary, dynamic number of
> usually ephemeral processes p3,...,pn. These processes are structured
> in a hierarchy that looks like p1->p2->[p3,...,pn]; p1 is a parent of
> p2, and p2 is the parent for all of the ephemeral processes p3,...,pn.
> 
> Since p1 and p2 are long-lived and important, the user does not want
> p1 and p2 to be oom-killed. However, p3,...,pn are expected to use a
> lot of memory, and it's ok for those processes to be oom-killed.
> 
> If the user sets oom_score_adj on p1 and p2 to make them very unlikely
> to be oom-killed, p3,...,pn will inherit the oom_score_adj value,
> which is bad. Additionally, setting oom_score_adj on p3,...,pn is
> tricky, since processes in the Docker container (specifically p1 and
> p2) don't have permissions to set oom_score_adj on p3,...,pn. The
> ephemeral nature of p3,...,pn also makes setting oom_score_adj on them
> tricky after they launch.

Thanks for the clarification.

> So, the user hopes that when one of p3,...,pn triggers an oom
> condition in the Docker container, the oom killer will almost always
> kill processes from p3,...,pn (and not kill p1 or p2, which are both
> important and unlikely to trigger an oom condition). The issue of more
> processes being killed than are strictly necessary is resulting in p1
> or p2 being killed much more frequently when one of p3,...,pn triggers
> an oom condition, and p1 or p2 being killed is very disruptive for the
> user (my understanding is that p1 or p2 going down with high frequency
> results in significant unhealthiness in the user's service).

Do you have any logs showing this condition? I am interested because
from your description it seems like p1/p2 shouldn't usually be the ones
that trigger the oom, right? That suggests that it should mostly be p3,
..., pn in the kernel triggering the oom, and therefore they
shouldn't vanish.
Michal Hocko March 18, 2020, 9:57 a.m. UTC | #10
On Tue 17-03-20 12:00:45, Ami Fischman wrote:
> On Tue, Mar 17, 2020 at 11:26 AM Robert Kolchmeyer
> <rkolchmeyer@google.com> wrote:
> >
> > On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@google.com> wrote:
> > >
> > > Robert, could you elaborate on the user-visible effects of this issue that
> > > caused it to initially get reported?
> >
> > Ami (now cc'ed) knows more, but here is my understanding.
> 
> Robert's description of the mechanics we observed is accurate.
> 
> We discovered this regression in the oom-killer's behavior when
> attempting to upgrade our system. The fraction of the system that
> went unhealthy due to this issue was approximately equal to the
> _sum_ of all other causes of unhealth, which are many and varied,
> but each of which contribute only a small amount of
> unhealth. This issue forced a rollback to the previous kernel
> where we ~never see this behavior, returning our unhealth levels
> to the previous background levels.

Could you be more specific on the good vs. bad kernel versions? Because
I do not remember any oom changes that would affect the
time-to-check-time-to-kill race. The timing might be slightly different
in each kernel version of course.
Ami Fischman March 18, 2020, 3:20 p.m. UTC | #11
On Wed, Mar 18, 2020 at 2:57 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 17-03-20 12:00:45, Ami Fischman wrote:
> > On Tue, Mar 17, 2020 at 11:26 AM Robert Kolchmeyer
> > <rkolchmeyer@google.com> wrote:
> > >
> > > On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@google.com> wrote:
> > > >
> > > > Robert, could you elaborate on the user-visible effects of this issue that
> > > > caused it to initially get reported?
> > >
> > > Ami (now cc'ed) knows more, but here is my understanding.
> >
> > Robert's description of the mechanics we observed is accurate.
> >
> > We discovered this regression in the oom-killer's behavior when
> > attempting to upgrade our system. The fraction of the system that
> > went unhealthy due to this issue was approximately equal to the
> > _sum_ of all other causes of unhealth, which are many and varied,
> > but each of which contribute only a small amount of
> > unhealth. This issue forced a rollback to the previous kernel
> > where we ~never see this behavior, returning our unhealth levels
> > to the previous background levels.
>
> Could you be more specific on the good vs. bad kernel versions? Because
> I do not remember any oom changes that would affect the
> time-to-check-time-to-kill race. The timing might be slightly different
> in each kernel version of course.

The original upgrade attempt included a large window of kernel
versions: 4.14.137 to 4.19.91.  In attempting to narrow down the
failure we found that in tests of 10 runs we went from 0/10
failures to 1/10 failure with the update from
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/74fab24be8994bb5bb8d1aa8828f50e16bb38346
(based on 4.19.60) to
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6e0fef1b46bb91c196be56365d9af72e52bb4675
(also based on 4.19.60)
and then we went from 1/10 failures to 9/10 failures with the
upgrade to
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a33dffa8e5c47b877e4daece938a81e3cc810b90
(which I believe is based on 4.19.72).

(this was all before we had the minimal repro yielding Robert's
61/100->0/100 stat in his previous email)

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -445,6 +445,8 @@  void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
@@ -945,6 +947,11 @@  static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1286,7 +1286,7 @@  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
 	unsigned long margin = 0;
 	unsigned long count;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -972,9 +972,6 @@  static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(victim);
 
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, victim);
-
 	/*
 	 * Do we need to kill the entire memory cgroup?
 	 * Or even one of the ancestor memory cgroups?
@@ -982,6 +979,19 @@  static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
 
+	if (is_memcg_oom(oc)) {
+		cond_resched();
+
+		/* One last check: do we *really* need to kill? */
+		if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
+			put_task_struct(victim);
+			return;
+		}
+	}
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, victim);
+
 	__oom_kill_process(victim, message);
 
 	/*