[v3] mm: oom: introduce cpuset oom

Message ID 20230410025056.22103-1-ligang.bdlg@bytedance.com (mailing list archive)
State New
Series [v3] mm: oom: introduce cpuset oom

Commit Message

Gang Li April 10, 2023, 2:50 a.m. UTC
Cpusets constrain the CPU and memory placement of tasks. The
`CONSTRAINT_CPUSET` type in oom has existed for a long time, but
has never been utilized.

When a process in a cpuset that constrains memory placement triggers
an oom, the oom killer may kill a completely unrelated process on
another NUMA node, which does not release any memory for this cpuset.

We can easily achieve node-aware oom by using `CONSTRAINT_CPUSET` and
selecting the victim from all cpusets with the same mems_allowed as
the current cpuset.
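
The selection rule described above can be modelled in user space. This is only a sketch, not the kernel implementation: `struct model_task`, `model_select_victim()`, the 64-bit mask standing in for `nodemask_t`, and the `badness` field are all hypothetical names, and each task is assumed to carry the mems_allowed mask of its cpuset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in; this type does not exist in the kernel. */
struct model_task {
	int pid;
	uint64_t mems_allowed;	/* bit n set: node n allowed (stand-in for nodemask_t) */
	long badness;		/* stand-in for the oom badness score */
};

/*
 * Return the pid of the highest-badness task whose mems_allowed is equal
 * to the mask of the cpuset that triggered the oom, or -1 if none matches.
 */
static int model_select_victim(const struct model_task *tasks, size_t n,
			       uint64_t oom_mems_allowed)
{
	int victim = -1;
	long worst = -1;

	for (size_t i = 0; i < n; i++) {
		/* Only exact mask matches are candidates; overlap is ignored. */
		if (tasks[i].mems_allowed != oom_mems_allowed)
			continue;
		if (tasks[i].badness > worst) {
			worst = tasks[i].badness;
			victim = tasks[i].pid;
		}
	}
	return victim;
}
```

In this model a task bound to node 1 is never chosen for an oom raised on node 0, no matter how large its score is.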

Example:

Create two processes named mem_on_node0 and mem_on_node1, each
constrained by its own cpuset. Each process allocates memory on its
own node. When node0 runs out of memory, the OOM is invoked by
mem_on_node0.
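
The source of the test programs is not part of this posting. A hypothetical memory hog of the kind described might look like the sketch below; `hog_memory()` and `CHUNK_BYTES` are made-up names, and the 4096-byte page stride is an assumption.

```c
#include <stdlib.h>

#define CHUNK_BYTES (1UL << 20)	/* 1 MiB per allocation; arbitrary choice */
#define PAGE_STRIDE 4096UL	/* assumed page size */

/*
 * Allocate and touch memory in CHUNK_BYTES steps, up to max_bytes.
 * Returns the number of bytes touched. The memory is intentionally
 * never freed, so the pages stay resident on the allowed node(s).
 */
static size_t hog_memory(size_t max_bytes)
{
	size_t done = 0;

	while (done + CHUNK_BYTES <= max_bytes) {
		volatile char *p = malloc(CHUNK_BYTES);

		if (!p)
			break;
		/* Write one byte per page so every page is really faulted in. */
		for (size_t off = 0; off < CHUNK_BYTES; off += PAGE_STRIDE)
			p[off] = 1;
		done += CHUNK_BYTES;
	}
	return done;
}
```

Run with a large limit from a task placed in a cpuset whose mems are restricted to node 0, a program like this would eventually exhaust that node and invoke the oom killer.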

Before this patch:

Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected from
the entire system. Therefore, the OOM is highly likely to kill
mem_on_node1, which will not free any memory for mem_on_node0. This
is a useless kill.

```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [  13432]     0 13432   787016   786745  6344704        0             0 mem_on_node1
[ 2787.189115] [  13457]     0 13457   787002   785504  6340608        0             0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```

After this patch:

The victim is selected only from cpusets that have the same
mems_allowed as the cpuset that invoked the oom. This prevents
useless kills and protects innocent victims.

```
[  395.922444] mem_on_node0 invoked oom-killer
[  396.239777] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  396.246128] [   2614]     0  2614  1311294  1144192  9224192        0             0 mem_on_node0
[  396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[  396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```
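
The patch's cpuset_scan_tasks() hands each task to a callback and stops calling it once the callback returns nonzero (`while (!ret && ...)`). A minimal user-space sketch of that contract follows; `struct scan_task`, `scan_tasks()`, `struct scan_state` and `count_until()` are hypothetical names and a flat array stands in for the cpuset hierarchy.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct. */
struct scan_task {
	int pid;
};

typedef int (*scan_fn)(const struct scan_task *task, void *arg);

/*
 * Call fn on each task in order. Stop as soon as fn returns nonzero
 * and hand that value back to the caller; return 0 if fn never did.
 */
static int scan_tasks(const struct scan_task *tasks, size_t n,
		      scan_fn fn, void *arg)
{
	int ret = 0;

	for (size_t i = 0; i < n && !ret; i++)
		ret = fn(&tasks[i], arg);
	return ret;
}

/* Example callback state: counts visits and aborts at stop_pid. */
struct scan_state {
	int visited;
	int stop_pid;
};

static int count_until(const struct scan_task *task, void *arg)
{
	struct scan_state *state = arg;

	state->visited++;
	return task->pid == state->stop_pid;
}
```

The opaque `arg` pointer plays the role that `struct oom_control` plays in the patch: the callback keeps its running result there while the scanner only sees the return code.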

Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: <cgroups@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: <rientjes@google.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
---
Changes in v3:
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
  a doctext comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.

Changes in v2:
- https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
- Select victim from all cpusets with the same mems_allowed as the current cpuset.
  (David Rientjes <rientjes@google.com>)

v1:
- https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst         | 14 +++++-
 include/linux/cpuset.h                        |  6 +++
 kernel/cgroup/cpuset.c                        | 44 +++++++++++++++++++
 mm/oom_kill.c                                 |  4 ++
 4 files changed, 66 insertions(+), 2 deletions(-)

Comments

Waiman Long April 10, 2023, 4:26 p.m. UTC | #1
On 4/9/23 22:50, Gang Li wrote:
> Cpusets constrain the CPU and Memory placement of tasks.
> `CONSTRAINT_CPUSET` type in oom  has existed for a long time, but
> has never been utilized.
>
> When a process in cpuset which constrain memory placement triggers
> oom, it may kill a completely irrelevant process on other numa nodes,
> which will not release any memory for this cpuset.
>
> We can easily achieve node aware oom by using `CONSTRAINT_CPUSET` and
> selecting victim from all cpusets with the same mems_allowed as the
> current cpuset.
>
> Example:
>
> Create two processes named mem_on_node0 and mem_on_node1 constrained
> by cpusets respectively. These two processes alloc memory on their
> own node. Now node0 has run out of memory, OOM will be invoked by
> mem_on_node0.
>
> Before this patch:
>
> Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected from
> the entire system. Therefore, the OOM is highly likely to kill
> mem_on_node1, which will not free any memory for mem_on_node0. This
> is a useless kill.
>
> ```
> [ 2786.519080] mem_on_node0 invoked oom-killer
> [ 2786.885738] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> [ 2787.181724] [  13432]     0 13432   787016   786745  6344704        0             0 mem_on_node1
> [ 2787.189115] [  13457]     0 13457   787002   785504  6340608        0             0 mem_on_node0
> [ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
> [ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
> ```
>
> After this patch:
>
> The victim will be selected only in all cpusets that have the same
> mems_allowed as the cpuset that invoked oom. This will prevent
> useless kill and protect innocent victims.
>
> ```
> [  395.922444] mem_on_node0 invoked oom-killer
> [  396.239777] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> [  396.246128] [   2614]     0  2614  1311294  1144192  9224192        0             0 mem_on_node0
> [  396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
> [  396.264068] Out of memory: Killed process 2614 (mem_on_node0)
> ```
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Cc: <cgroups@vger.kernel.org>
> Cc: <linux-mm@kvack.org>
> Cc: <rientjes@google.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Zefan Li <lizefan.x@bytedance.com>
> Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Thanks for the update.
> ---
> Changes in v3:
> - Provide more details about the use case, testing, implementation.
> - Document the userspace visible change in Documentation.
> - Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
>    a doctext comment about its purpose and how it should be used.
> - Take cpuset_rwsem to ensure that cpusets are stable.
>
> Changes in v2:
> - https://lore.kernel.org/all/20230404115509.14299-1-ligang.bdlg@bytedance.com/
> - Select victim from all cpusets with the same mems_allowed as the current cpuset.
>    (David Rientjes <rientjes@google.com>)
>
> v1:
> - https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
> - Introduce cpuset oom.
> ---
>   .../admin-guide/cgroup-v1/cpusets.rst         | 14 +++++-
>   include/linux/cpuset.h                        |  6 +++
>   kernel/cgroup/cpuset.c                        | 44 +++++++++++++++++++
>   mm/oom_kill.c                                 |  4 ++
>   4 files changed, 66 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
> index 5d844ed4df69..d686cd47e91d 100644
> --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
> +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
> @@ -25,7 +25,8 @@ Written by Simon.Derr@bull.net
>        1.6 What is memory spread ?
>        1.7 What is sched_load_balance ?
>        1.8 What is sched_relax_domain_level ?
> -     1.9 How do I use cpusets ?
> +     1.9 What is cpuset oom ?
> +     1.10 How do I use cpusets ?
>      2. Usage Examples and Syntax
>        2.1 Basic Usage
>        2.2 Adding/removing cpus
> @@ -607,8 +608,17 @@ If your situation is:
>    - The latency is required even it sacrifices cache hit rate etc.
>      then increasing 'sched_relax_domain_level' would benefit you.
>   
> +1.9 What is cpuset oom ?
> +--------------------------
> +If there is no available memory to allocate on the nodes specified by
> +cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
> +
> +Since the victim selection is a heuristic algorithm, we cannot select
> +the "perfect" victim. Therefore, currently, the victim will be selected
> +from all the cpusets that have the same mems_allowed as the cpuset
> +which invoked OOM.
Nit: That feature is not specific to cgroup v1, as it applies to v2 as 
well. Maybe you can be more specific about that.
>   
> -1.9 How do I use cpusets ?
> +1.10 How do I use cpusets ?
>   --------------------------
>   
>   In order to minimize the impact of cpusets on critical kernel
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 980b76a1237e..75465bf58f74 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>   	task_unlock(current);
>   }
>   
> +int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
> +
>   #else /* !CONFIG_CPUSETS */
>   
>   static inline bool cpusets_enabled(void) { return false; }
> @@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>   	return false;
>   }
>   
> +static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	return 0;
> +}
>   #endif /* !CONFIG_CPUSETS */
>   
>   #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index bc4dcfd7bee5..4c51225568aa 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -4013,6 +4013,50 @@ void cpuset_print_current_mems_allowed(void)
>   	rcu_read_unlock();
>   }
>   
> +/**
> + * cpuset_scan_tasks - specify the oom scan range
> + * @fn: callback function to select oom victim
> + * @arg: argument for callback function, usually a pointer to struct oom_control
> + *
> + * Description: This function is used to specify the oom scan range. Return 0 if
> + * no task is selected, otherwise return 1. The selected task will be stored in
> + * arg->chosen. This function can only be called in select_bad_process()
> + * while oc->constraint == CONSTRAINT_CPUSET.

Nit: That is not strictly correct as dump_tasks() will call this as well.

Cheers,
Longman
Patch

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..d686cd47e91d 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@  Written by Simon.Derr@bull.net
      1.6 What is memory spread ?
      1.7 What is sched_load_balance ?
      1.8 What is sched_relax_domain_level ?
-     1.9 How do I use cpusets ?
+     1.9 What is cpuset oom ?
+     1.10 How do I use cpusets ?
    2. Usage Examples and Syntax
      2.1 Basic Usage
      2.2 Adding/removing cpus
@@ -607,8 +608,17 @@  If your situation is:
  - The latency is required even it sacrifices cache hit rate etc.
    then increasing 'sched_relax_domain_level' would benefit you.
 
+1.9 What is cpuset oom ?
+--------------------------
+If there is no available memory to allocate on the nodes specified by
+cpuset.mems, then an OOM (Out-Of-Memory) will be invoked.
+
+Since the victim selection is a heuristic algorithm, we cannot select
+the "perfect" victim. Therefore, currently, the victim will be selected
+from all the cpusets that have the same mems_allowed as the cpuset
+which invoked OOM.
 
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@  static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }
 
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@  static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }
 
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..4c51225568aa 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,50 @@  void cpuset_print_current_mems_allowed(void)
 	rcu_read_unlock();
 }
 
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function is used to specify the oom scan range. Return 0 if
+ * no task is selected, otherwise return 1. The selected task will be stored in
+ * arg->chosen. This function can only be called in select_bad_process()
+ * while oc->constraint == CONSTRAINT_CPUSET.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	cpuset_read_lock();
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	cpuset_read_unlock();
+	return ret;
+}
+
 /*
  * Collection of memory_pressure is suppressed unless
  * this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@  static void select_bad_process(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
@@ -427,6 +429,8 @@  static void dump_tasks(struct oom_control *oc)
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;