[3/3] mm, oom: introduce memory.oom.group

Message ID	20180730180100.25079-4-guro@fb.com (mailing list archive)
State	New, archived
Headers	From: Roman Gushchin <guro@fb.com>
	To: <linux-mm@kvack.org>
	CC: Michal Hocko <mhocko@suse.com>, Johannes Weiner <hannes@cmpxchg.org>, David Rientjes <rientjes@google.com>, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>, Tejun Heo <tj@kernel.org>, <kernel-team@fb.com>, <linux-kernel@vger.kernel.org>, Roman Gushchin <guro@fb.com>
	Subject: [PATCH 3/3] mm, oom: introduce memory.oom.group
	Date: Mon, 30 Jul 2018 11:01:00 -0700
	Message-ID: <20180730180100.25079-4-guro@fb.com>
	In-Reply-To: <20180730180100.25079-1-guro@fb.com>
	References: <20180730180100.25079-1-guro@fb.com>
Series	introduce memory.oom.group
	[0/3] introduce memory.oom.group
	[1/3] mm: introduce mem_cgroup_put() helper
	[2/3] mm, oom: refactor oom_kill_process()
	[3/3] mm, oom: introduce memory.oom.group

Roman Gushchin July 30, 2018, 6:01 p.m. UTC

For some workloads an intervention from the OOM killer
can be painful. Killing a random task can bring
the workload into an inconsistent state.

Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
   all outstanding processes.

Both approaches have their downsides:
rebooting on each OOM is an obvious waste of capacity,
and handling everything in userspace is tricky and
requires a userspace agent that monitors all cgroups
for OOMs.

In most cases an in-kernel post-OOM cleanup
mechanism can eliminate the need to enable
panic_on_oom. It can also simplify cgroup
management for userspace applications.

This commit introduces a new knob for the cgroup v2 memory
controller: memory.oom.group. The knob determines
whether the cgroup should be treated as a single
unit by the OOM killer. If set, the cgroup and its
descendants are killed together or not at all.
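
From userspace, turning the knob on amounts to writing to the cgroup's
memory.oom.group file. A minimal sketch, assuming the knob is exposed as a
file that accepts "0" or "1" in the cgroup's directory. The helper name
set_oom_group() and the example path in the usage note are hypothetical and
not part of this patch:

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "1" or "0" to <cgroup_dir>/memory.oom.group.
 * Returns 0 on success, -1 on failure.
 */
static int set_oom_group(const char *cgroup_dir, int enable)
{
	char path[4096];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/memory.oom.group", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit into the buffer */

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such cgroup, or no permission */

	ok = fputs(enable ? "1" : "0", f) >= 0;
	if (fclose(f) != 0)	/* the write may only fail at flush time */
		ok = 0;

	return ok ? 0 : -1;
}
```

With cgroup v2 mounted at /sys/fs/cgroup (an assumption), a call might look
like set_oom_group("/sys/fs/cgroup/workload", 1).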

To determine which cgroup has to be killed, we
traverse the cgroup hierarchy from the victim task's
cgroup up to the OOMing cgroup (or root), looking
for the highest-level cgroup with memory.oom.group set.
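
The walk can be modelled in userspace as follows. This is only a sketch of
the lookup, not the kernel code: struct cg, its fields and find_oom_group()
are hypothetical stand-ins for struct mem_cgroup, parent_mem_cgroup() and
mem_cgroup_get_oom_group(), and RCU locking and reference counting are left
out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mem_cgroup. */
struct cg {
	struct cg *parent;	/* NULL for the root */
	bool oom_group;		/* models memory.oom.group */
};

/*
 * Walk from the victim's cgroup up to oom_domain (inclusive) and
 * return the highest-level cgroup with oom_group set, or NULL if
 * there is none. A NULL oom_domain models a system-wide OOM, in
 * which case the walk continues up to the root.
 */
static struct cg *find_oom_group(struct cg *victim_cg, struct cg *oom_domain)
{
	struct cg *found = NULL;
	struct cg *c;

	for (c = victim_cg; c; c = c->parent) {
		if (c->oom_group)
			found = c;	/* later hits are higher in the tree */
		if (c == oom_domain)
			break;		/* never look above the OOMing cgroup */
	}
	return found;
}
```

Because the last match wins, a cgroup with the knob set that sits above the
OOMing cgroup is never returned: the walk stops at oom_domain.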

Tasks with OOM protection (oom_score_adj set to -1000)
are treated as an exception and are never killed.
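
The exemption can be illustrated with a small userspace model. This is a
sketch under stated assumptions: struct task and kill_group() are
hypothetical, "killing" only sets a flag, and the only detail taken from the
kernel is the value of OOM_SCORE_ADJ_MIN.

```c
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* same value as the kernel constant */

/* Hypothetical stand-in for a task: only the fields this sketch needs. */
struct task {
	int pid;
	int oom_score_adj;
	int killed;		/* set to 1 when the task is "killed" */
};

/* Mark every unprotected task as killed; return how many were killed. */
static size_t kill_group(struct task *tasks, size_t n)
{
	size_t i, killed = 0;

	for (i = 0; i < n; i++) {
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;	/* protected task: skipped */
		tasks[i].killed = 1;
		killed++;
	}
	return killed;
}
```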

This patch doesn't change the OOM victim selection algorithm.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 16 +++++++
 include/linux/memcontrol.h              | 13 +++++
 mm/memcontrol.c                         | 84 +++++++++++++++++++++++++++++++++
 mm/oom_kill.c                           | 29 ++++++++++++
 4 files changed, 142 insertions(+)

Michal Hocko July 31, 2018, 9:07 a.m. UTC | #1

On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
> 
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
>    all outstanding processes.
> 
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.
> 
> In most cases an in-kernel after-OOM cleaning-up
> mechanism can eliminate the necessity of enabling
> panic_on_oom. Also, it can simplify the cgroup
> management for userspace applications.
> 
> This commit introduces a new knob for cgroup v2 memory
> controller: memory.oom.group. The knob determines
> whether the cgroup should be treated as a single
> unit by the OOM killer. If set, the cgroup and its
> descendants are killed together or not at all.

I do not want to nitpick on wording, but "unit" is not really a good
description. I would expect that to mean that the oom killer also
considers the unit when selecting the task, and that is not the case.
I would be more explicit about this being a single killable entity
because it forms an indivisible workload.

You can reuse http://lkml.kernel.org/r/20180730080357.GA24267@dhcp22.suse.cz
if you want.

[...]
> +/**
> + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> + * @victim: task to be killed by the OOM killer
> + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> + *
> + * Returns a pointer to a memory cgroup, which has to be cleaned up
> + * by killing all belonging OOM-killable tasks.

Caller has to call mem_cgroup_put on the returned non-null memcg.

> + */
> +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> +					    struct mem_cgroup *oom_domain)
> +{
> +	struct mem_cgroup *oom_group = NULL;
> +	struct mem_cgroup *memcg;
> +
> +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		return NULL;
> +
> +	if (!oom_domain)
> +		oom_domain = root_mem_cgroup;
> +
> +	rcu_read_lock();
> +
> +	memcg = mem_cgroup_from_task(victim);
> +	if (!memcg || memcg == root_mem_cgroup)
> +		goto out;

When can we have memcg == NULL? victim should always be non-NULL.
Also, why do you need to special-case root_mem_cgroup here? The loop
below should handle that just fine, no?

> +
> +	/*
> +	 * Traverse the memory cgroup hierarchy from the victim task's
> +	 * cgroup up to the OOMing cgroup (or root) to find the
> +	 * highest-level memory cgroup with oom.group set.
> +	 */
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		if (memcg->oom_group)
> +			oom_group = memcg;
> +
> +		if (memcg == oom_domain)
> +			break;
> +	}
> +
> +	if (oom_group)
> +		css_get(&oom_group->css);
> +out:
> +	rcu_read_unlock();
> +
> +	return oom_group;
> +}
> +
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8bded6b3205b..08f30ed5abed 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -914,6 +914,19 @@ static void __oom_kill_process(struct task_struct *victim)
>  }
>  #undef K
>  
> +/*
> + * Kill provided task unless it's secured by setting
> + * oom_score_adj to OOM_SCORE_ADJ_MIN.
> + */
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> +	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> +		get_task_struct(task);
> +		__oom_kill_process(task);
> +	}
> +	return 0;
> +}
> +
>  static void oom_kill_process(struct oom_control *oc, const char *message)
>  {
>  	struct task_struct *p = oc->chosen;
> @@ -921,6 +934,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	struct task_struct *victim = p;
>  	struct task_struct *child;
>  	struct task_struct *t;
> +	struct mem_cgroup *oom_group;
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> @@ -974,7 +988,22 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	}
>  	read_unlock(&tasklist_lock);
>  
> +	/*
> +	 * Do we need to kill the entire memory cgroup?
> +	 * Or even one of the ancestor memory cgroups?
> +	 * Check this out before killing the victim task.
> +	 */
> +	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> +
>  	__oom_kill_process(victim);
> +
> +	/*
> +	 * If necessary, kill all tasks in the selected memory cgroup.
> +	 */
> +	if (oom_group) {

We want a printk here explaining that we are going to tear down the
whole oom_group.

> +		mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> +		mem_cgroup_put(oom_group);
> +	}
>  }

Other than that, this looks good to me. My concern that the previous
implementation was more consistent, because we were comparing memcgs,
still holds, but if there is no way forward in that direction this
should be acceptable as well.

After the small things above are addressed you can add
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

Roman Gushchin Aug. 1, 2018, 1:14 a.m. UTC | #2

On Tue, Jul 31, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> > For some workloads an intervention from the OOM killer
> > can be painful. Killing a random task can bring
> > the workload into an inconsistent state.
> > 
> > Historically, there are two common solutions for this
> > problem:
> > 1) enabling panic_on_oom,
> > 2) using a userspace daemon to monitor OOMs and kill
> >    all outstanding processes.
> > 
> > Both approaches have their downsides:
> > rebooting on each OOM is an obvious waste of capacity,
> > and handling all in userspace is tricky and requires
> > a userspace agent, which will monitor all cgroups
> > for OOMs.
> > 
> > In most cases an in-kernel after-OOM cleaning-up
> > mechanism can eliminate the necessity of enabling
> > panic_on_oom. Also, it can simplify the cgroup
> > management for userspace applications.
> > 
> > This commit introduces a new knob for cgroup v2 memory
> > controller: memory.oom.group. The knob determines
> > whether the cgroup should be treated as a single
> > unit by the OOM killer. If set, the cgroup and its
> > descendants are killed together or not at all.
> 
> I do not want to nit pick on wording but unit is not really a good
> description. I would expect that to mean that the oom killer will
> consider the unit also when selecting the task and that is not the case.
> I would be more explicit about this being a single killable entity
> because it forms an indivisible workload.
> 
> You can reuse http://lkml.kernel.org/r/20180730080357.GA24267@dhcp22.suse.cz
> if you want.

Ok, I'll do my best to make it clearer.

> 
> [...]
> > +/**
> > + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> > + * @victim: task to be killed by the OOM killer
> > + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> > + *
> > + * Returns a pointer to a memory cgroup, which has to be cleaned up
> > + * by killing all belonging OOM-killable tasks.
> 
> Caller has to call mem_cgroup_put on the returned non-null memcg.

Added.

> 
> > + */
> > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > +					    struct mem_cgroup *oom_domain)
> > +{
> > +	struct mem_cgroup *oom_group = NULL;
> > +	struct mem_cgroup *memcg;
> > +
> > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +		return NULL;
> > +
> > +	if (!oom_domain)
> > +		oom_domain = root_mem_cgroup;
> > +
> > +	rcu_read_lock();
> > +
> > +	memcg = mem_cgroup_from_task(victim);
> > +	if (!memcg || memcg == root_mem_cgroup)
> > +		goto out;
> 
> When can we have memcg == NULL? victim should be always non-NULL.
> Also why do you need to special case the root_mem_cgroup here. The loop
> below should handle that just fine no?

Idk, I prefer to keep an explicit root_mem_cgroup check,
rather than traversing the tree and relying on an inability
to set oom_group on the root.

!memcg check removed, you're right.

> 
> > +
> > +	/*
> > +	 * Traverse the memory cgroup hierarchy from the victim task's
> > +	 * cgroup up to the OOMing cgroup (or root) to find the
> > +	 * highest-level memory cgroup with oom.group set.
> > +	 */
> > +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > +		if (memcg->oom_group)
> > +			oom_group = memcg;
> > +
> > +		if (memcg == oom_domain)
> > +			break;
> > +	}
> > +
> > +	if (oom_group)
> > +		css_get(&oom_group->css);
> > +out:
> > +	rcu_read_unlock();
> > +
> > +	return oom_group;
> > +}
> > +
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 8bded6b3205b..08f30ed5abed 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -914,6 +914,19 @@ static void __oom_kill_process(struct task_struct *victim)
> >  }
> >  #undef K
> >  
> > +/*
> > + * Kill provided task unless it's secured by setting
> > + * oom_score_adj to OOM_SCORE_ADJ_MIN.
> > + */
> > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > +{
> > +	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> > +		get_task_struct(task);
> > +		__oom_kill_process(task);
> > +	}
> > +	return 0;
> > +}
> > +
> >  static void oom_kill_process(struct oom_control *oc, const char *message)
> >  {
> >  	struct task_struct *p = oc->chosen;
> > @@ -921,6 +934,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	struct task_struct *victim = p;
> >  	struct task_struct *child;
> >  	struct task_struct *t;
> > +	struct mem_cgroup *oom_group;
> >  	unsigned int victim_points = 0;
> >  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> >  					      DEFAULT_RATELIMIT_BURST);
> > @@ -974,7 +988,22 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	}
> >  	read_unlock(&tasklist_lock);
> >  
> > +	/*
> > +	 * Do we need to kill the entire memory cgroup?
> > +	 * Or even one of the ancestor memory cgroups?
> > +	 * Check this out before killing the victim task.
> > +	 */
> > +	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> > +
> >  	__oom_kill_process(victim);
> > +
> > +	/*
> > +	 * If necessary, kill all tasks in the selected memory cgroup.
> > +	 */
> > +	if (oom_group) {
> 
> we want a printk explaining that we are going to tear down the whole
> oom_group here.

Does this look good?
Or is it better to remove the "memory." prefix?

[   52.835327] Out of memory: Kill process 1221 (allocate) score 241 or sacrifice child
[   52.836625] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
[   52.841431] Tasks in /A1 are going to be killed due to memory.oom.group set
[   52.869439] Killed process 1217 (allocate) total-vm:2052344kB, anon-rss:1704036kB, file-rss:0kB, shmem-rss:0kB
[   52.875601] Killed process 1218 (allocate) total-vm:106668kB, anon-rss:24668kB, file-rss:0kB, shmem-rss:0kB
[   52.882914] Killed process 1219 (allocate) total-vm:106668kB, anon-rss:21528kB, file-rss:0kB, shmem-rss:0kB
[   52.891806] Killed process 1220 (allocate) total-vm:2257144kB, anon-rss:1984120kB, file-rss:4kB, shmem-rss:0kB
[   52.903770] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
[   52.905574] Killed process 1222 (allocate) total-vm:2257144kB, anon-rss:2063640kB, file-rss:0kB, shmem-rss:0kB
[   53.202153] oom_reaper: reaped process 1222 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

> 
> > +		mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> > +		mem_cgroup_put(oom_group);
> > +	}
> >  }
> 
> Other than that looks good to me. My concern that the previous
> implementation was more consistent because we were comparing memcgs
> still holds but if there is no way forward that direction this should be
> acceptable as well.
> 
> After above small things are addressed you can add
> Acked-by: Michal Hocko <mhocko@suse.com> 

Thank you!

Michal Hocko Aug. 1, 2018, 5:55 a.m. UTC | #3

On Tue 31-07-18 18:14:48, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> > > For some workloads an intervention from the OOM killer
> > > can be painful. Killing a random task can bring
> > > the workload into an inconsistent state.
> > > 
> > > Historically, there are two common solutions for this
> > > problem:
> > > 1) enabling panic_on_oom,
> > > 2) using a userspace daemon to monitor OOMs and kill
> > >    all outstanding processes.
> > > 
> > > Both approaches have their downsides:
> > > rebooting on each OOM is an obvious waste of capacity,
> > > and handling all in userspace is tricky and requires
> > > a userspace agent, which will monitor all cgroups
> > > for OOMs.
> > > 
> > > In most cases an in-kernel after-OOM cleaning-up
> > > mechanism can eliminate the necessity of enabling
> > > panic_on_oom. Also, it can simplify the cgroup
> > > management for userspace applications.
> > > 
> > > This commit introduces a new knob for cgroup v2 memory
> > > controller: memory.oom.group. The knob determines
> > > whether the cgroup should be treated as a single
> > > unit by the OOM killer. If set, the cgroup and its
> > > descendants are killed together or not at all.
> > 
> > I do not want to nit pick on wording but unit is not really a good
> > description. I would expect that to mean that the oom killer will
> > consider the unit also when selecting the task and that is not the case.
> > I would be more explicit about this being a single killable entity
> > because it forms an indivisible workload.
> > 
> > You can reuse http://lkml.kernel.org/r/20180730080357.GA24267@dhcp22.suse.cz
> > if you want.
> 
> Ok, I'll do my best to make it clearer.
> 
> > 
> > [...]
> > > +/**
> > > + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> > > + * @victim: task to be killed by the OOM killer
> > > + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> > > + *
> > > + * Returns a pointer to a memory cgroup, which has to be cleaned up
> > > + * by killing all belonging OOM-killable tasks.
> > 
> > Caller has to call mem_cgroup_put on the returned non-null memcg.
> 
> Added.
> 
> > 
> > > + */
> > > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > > +					    struct mem_cgroup *oom_domain)
> > > +{
> > > +	struct mem_cgroup *oom_group = NULL;
> > > +	struct mem_cgroup *memcg;
> > > +
> > > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > +		return NULL;
> > > +
> > > +	if (!oom_domain)
> > > +		oom_domain = root_mem_cgroup;
> > > +
> > > +	rcu_read_lock();
> > > +
> > > +	memcg = mem_cgroup_from_task(victim);
> > > +	if (!memcg || memcg == root_mem_cgroup)
> > > +		goto out;
> > 
> > When can we have memcg == NULL? victim should be always non-NULL.
> > Also why do you need to special case the root_mem_cgroup here. The loop
> > below should handle that just fine no?
> 
> Idk, I prefer to keep an explicit root_mem_cgroup check,
> rather than traversing the tree and relying on an inability
> to set oom_group on the root.

I will not insist but this just makes the code harder to read.

[...]
> > > +	if (oom_group) {
> > 
> > we want a printk explaining that we are going to tear down the whole
> > oom_group here.
> 
> Does this looks good?
> Or it's better to remove "memory." prefix?
> 
> [   52.835327] Out of memory: Kill process 1221 (allocate) score 241 or sacrifice child
> [   52.836625] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
> [   52.841431] Tasks in /A1 are going to be killed due to memory.oom.group set

Yes, looks good to me.

> [   52.869439] Killed process 1217 (allocate) total-vm:2052344kB, anon-rss:1704036kB, file-rss:0kB, shmem-rss:0kB
> [   52.875601] Killed process 1218 (allocate) total-vm:106668kB, anon-rss:24668kB, file-rss:0kB, shmem-rss:0kB
> [   52.882914] Killed process 1219 (allocate) total-vm:106668kB, anon-rss:21528kB, file-rss:0kB, shmem-rss:0kB
> [   52.891806] Killed process 1220 (allocate) total-vm:2257144kB, anon-rss:1984120kB, file-rss:4kB, shmem-rss:0kB
> [   52.903770] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
> [   52.905574] Killed process 1222 (allocate) total-vm:2257144kB, anon-rss:2063640kB, file-rss:0kB, shmem-rss:0kB
> [   53.202153] oom_reaper: reaped process 1222 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> > 
> > > +		mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> > > +		mem_cgroup_put(oom_group);
> > > +	}
> > >  }
> > 
> > Other than that looks good to me. My concern that the previous
> > implementation was more consistent because we were comparing memcgs
> > still holds but if there is no way forward that direction this should be
> > acceptable as well.
> > 
> > After above small things are addressed you can add
> > Acked-by: Michal Hocko <mhocko@suse.com> 
> 
> Thank you!

Johannes Weiner Aug. 1, 2018, 5:48 p.m. UTC | #4

On Wed, Aug 01, 2018 at 07:55:03AM +0200, Michal Hocko wrote:
> On Tue 31-07-18 18:14:48, Roman Gushchin wrote:
> > On Tue, Jul 31, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > > On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> > > > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > > > +					    struct mem_cgroup *oom_domain)
> > > > +{
> > > > +	struct mem_cgroup *oom_group = NULL;
> > > > +	struct mem_cgroup *memcg;
> > > > +
> > > > +	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > +		return NULL;
> > > > +
> > > > +	if (!oom_domain)
> > > > +		oom_domain = root_mem_cgroup;
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	memcg = mem_cgroup_from_task(victim);
> > > > +	if (!memcg || memcg == root_mem_cgroup)
> > > > +		goto out;
> > > 
> > > When can we have memcg == NULL? victim should be always non-NULL.
> > > Also why do you need to special case the root_mem_cgroup here. The loop
> > > below should handle that just fine no?
> > 
> > Idk, I prefer to keep an explicit root_mem_cgroup check,
> > rather than traversing the tree and relying on an inability
> > to set oom_group on the root.
> 
> I will not insist but this just makes the code harder to read.

Just FYI, I'd prefer the explicit check. The loop would do the right
thing, but it's a little too implicit and subtle for my taste...
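
The trade-off can be made concrete with a userspace model. This is a sketch
only: struct cg, lookup_explicit() and lookup_implicit() are hypothetical,
the oom_domain bound is left out for brevity, and neither function is the
kernel implementation. The two variants agree as long as oom_group can never
be set on the root, and the explicit check is what keeps the result stable
if that invariant is ever broken.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mem_cgroup. */
struct cg {
	struct cg *parent;	/* NULL for the root */
	bool oom_group;
};

/* Variant with an explicit early exit for a victim in the root cgroup. */
static struct cg *lookup_explicit(struct cg *c, struct cg *root)
{
	struct cg *found = NULL;

	if (c == root)
		return NULL;
	for (; c; c = c->parent)
		if (c->oom_group)
			found = c;
	return found;
}

/* Variant that relies on root->oom_group never being set. */
static struct cg *lookup_implicit(struct cg *c)
{
	struct cg *found = NULL;

	for (; c; c = c->parent)
		if (c->oom_group)
			found = c;
	return found;
}
```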

Johannes Weiner Aug. 1, 2018, 5:50 p.m. UTC | #5

On Mon, Jul 30, 2018 at 11:01:00AM -0700, Roman Gushchin wrote:
> For some workloads an intervention from the OOM killer
> can be painful. Killing a random task can bring
> the workload into an inconsistent state.
> 
> Historically, there are two common solutions for this
> problem:
> 1) enabling panic_on_oom,
> 2) using a userspace daemon to monitor OOMs and kill
>    all outstanding processes.
> 
> Both approaches have their downsides:
> rebooting on each OOM is an obvious waste of capacity,
> and handling all in userspace is tricky and requires
> a userspace agent, which will monitor all cgroups
> for OOMs.
> 
> In most cases an in-kernel after-OOM cleaning-up
> mechanism can eliminate the necessity of enabling
> panic_on_oom. Also, it can simplify the cgroup
> management for userspace applications.
> 
> This commit introduces a new knob for cgroup v2 memory
> controller: memory.oom.group. The knob determines
> whether the cgroup should be treated as a single
> unit by the OOM killer. If set, the cgroup and its
> descendants are killed together or not at all.
> 
> To determine which cgroup has to be killed, we do
> traverse the cgroup hierarchy from the victim task's
> cgroup up to the OOMing cgroup (or root) and looking
> for the highest-level cgroup  with memory.oom.group set.
> 
> Tasks with the OOM protection (oom_score_adj set to -1000)
> are treated as an exception and are never killed.
> 
> This patch doesn't change the OOM victim selection algorithm.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Tejun Heo <tj@kernel.org>

The semantics make sense to me and the code is straightforward. With
Michal's other feedback incorporated, please feel free to add:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
